Building RAG Systems (Part 1): Document Indexing and Vector Storage
2026-02-01 · 4 min read, 9 min code
Alright, time to get our hands dirty. In the previous article, we talked about what RAG is and why it's useful. Now let's actually build something.
The indexing phase prepares documents for search: loading, splitting into chunks, converting to vectors, and storing in a database. By the end of this article, you'll have a working indexing system that can load documents, convert them to vectors, and prepare them for retrieval.
What We're Building
Let's keep it simple to start. We're building a system that can:
- Load documents and break them into smaller chunks that make sense
- Convert those chunks into vectors (numbers that represent meaning)
- Store everything in a way that makes it easy to search later
The code handles Markdown, plain text, and PDFs. If you only have Markdown files, that's fine. PDF support means you can index research papers or vendor documents without rewriting the loader.
What We'll Use
I'm using LangChain for document loading and text splitting. You could build this from scratch, but why?
For embeddings, I'm using OpenAI's models. They're good and easy to use. I'll mention open-source alternatives if you want to keep things local or save money.
For the vector database, I'm using Chroma. It's simple, runs locally, and perfect for prototyping. For production, consider Pinecone, Qdrant, or Milvus.
Getting Set Up
First things first: let's get the dependencies installed. Create a requirements.txt file with these packages:
langchain==0.1.0
langchain-openai==0.0.2
langchain-community==0.0.10
chromadb==0.4.22
python-dotenv==1.0.0
pypdf==3.17.0
Run pip install -r requirements.txt. LangChain's version numbers can be chaotic, so adjust if needed.
You'll also need an OpenAI API key. Create a .env file in your project root and add:
OPENAI_API_KEY=your_openai_api_key_here
We'll load this using python-dotenv. Get an API key from OpenAI's website. Embeddings are cheap but not free. Experimenting typically costs a few dollars.
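If you're curious what load_dotenv actually does, it's just parsing KEY=VALUE lines into environment variables. A stdlib-only sketch of the idea (use the real python-dotenv in practice; the function name load_env_file is mine):

```python
import os

def load_env_file(path=".env"):
    """Minimal stand-in for python-dotenv: parse KEY=VALUE lines into os.environ."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blanks and comments; keep existing env vars untouched
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
```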
Complete Code Implementation
Project Structure
rag_project/
├── requirements.txt
├── .env
├── documents/ # Documents to index
│ ├── sample1.md
│ └── sample2.txt
├── index.py # Indexing script
└── vectorstore/ # Chroma database directory (auto-created)
Indexing Script Implementation
Create index.py:
import os
from pathlib import Path
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import (
    TextLoader,
    PyPDFLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

# Load environment variables
load_dotenv()


class RAGIndexer:
    """RAG System Indexer"""

    def __init__(self, persist_directory="./vectorstore"):
        """
        Initialize indexer

        Args:
            persist_directory: Vector database persistence directory
        """
        # Initialize embedding model
        self.embeddings = OpenAIEmbeddings(
            model="text-embedding-3-small"  # Or use text-embedding-ada-002
        )
        # Initialize text splitter
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,    # Maximum characters per chunk
            chunk_overlap=200,  # Overlap characters between chunks
            length_function=len,
        )
        # Vector database directory
        self.persist_directory = persist_directory
        # Vector database (lazy initialization)
        self.vectorstore = None

    def load_documents(self, directory_path):
        """
        Load all documents from directory

        Args:
            directory_path: Document directory path

        Returns:
            List[Document]: List of loaded documents
        """
        print(f"Loading documents from: {directory_path}")
        # Supported document types
        loaders = {
            '.txt': TextLoader,
            '.md': TextLoader,
            '.pdf': PyPDFLoader,
        }
        all_documents = []
        # Traverse directory
        directory = Path(directory_path)
        for file_path in directory.rglob('*'):
            if file_path.is_file():
                file_ext = file_path.suffix.lower()
                if file_ext in loaders:
                    try:
                        loader = loaders[file_ext](str(file_path))
                        documents = loader.load()
                        # Add metadata
                        for doc in documents:
                            doc.metadata['source'] = str(file_path)
                            doc.metadata['file_name'] = file_path.name
                            doc.metadata['file_type'] = file_ext
                        all_documents.extend(documents)
                        print(f"  [OK] Loaded: {file_path.name} ({len(documents)} pages)")
                    except Exception as e:
                        print(f"  [ERROR] Failed to load: {file_path.name} - {e}")
        print(f"Total loaded {len(all_documents)} documents\n")
        return all_documents

    def split_documents(self, documents):
        """
        Split documents into smaller chunks

        Args:
            documents: Original document list

        Returns:
            List[Document]: List of document chunks after splitting
        """
        print("Splitting documents...")
        chunks = self.text_splitter.split_documents(documents)
        print(f"  Original documents: {len(documents)}")
        print(f"  Chunks after splitting: {len(chunks)}")
        print(f"  Average chunk size: {sum(len(c.page_content) for c in chunks) / len(chunks):.0f} characters\n")
        return chunks

    def create_vectorstore(self, chunks, recreate=False):
        """
        Create vector database

        Args:
            chunks: Document chunk list
            recreate: Whether to recreate (delete old one)
        """
        print("Generating vectors and storing...")
        if recreate and os.path.exists(self.persist_directory):
            import shutil
            shutil.rmtree(self.persist_directory)
            print("  Deleted old vector database")
        # Create vector database (Chroma persists automatically)
        self.vectorstore = Chroma.from_documents(
            documents=chunks,
            embedding=self.embeddings,
            persist_directory=self.persist_directory
        )
        print(f"  Vector database created: {self.persist_directory}")
        print(f"  Stored {len(chunks)} document chunks\n")

    def load_existing_vectorstore(self):
        """
        Load existing vector database

        Returns:
            bool: Whether loading was successful
        """
        if not os.path.exists(self.persist_directory):
            return False
        try:
            self.vectorstore = Chroma(
                persist_directory=self.persist_directory,
                embedding_function=self.embeddings
            )
            print(f"Loaded existing vector database: {self.persist_directory}\n")
            return True
        except Exception as e:
            print(f"Failed to load vector database: {e}\n")
            return False

    def get_vectorstore(self):
        """Get vector database instance"""
        return self.vectorstore

    def index(self, documents_directory, recreate=False):
        """
        Execute complete indexing workflow

        Args:
            documents_directory: Document directory
            recreate: Whether to recreate vector database
        """
        print("=" * 60)
        print("Starting indexing workflow")
        print("=" * 60 + "\n")
        # 1. Load documents
        documents = self.load_documents(documents_directory)
        if not documents:
            print("No loadable documents found")
            return
        # 2. Split documents
        chunks = self.split_documents(documents)
        # 3. Create vector database
        self.create_vectorstore(chunks, recreate=recreate)
        print("=" * 60)
        print("Indexing complete!")
        print("=" * 60)


def main():
    """Main function"""
    # Create indexer
    indexer = RAGIndexer(persist_directory="./vectorstore")

    # Execute indexing
    documents_dir = "./documents"  # Document directory

    # If directory doesn't exist, create sample document
    if not os.path.exists(documents_dir):
        os.makedirs(documents_dir, exist_ok=True)
        # Create sample document
        sample_doc = """# RAG System Introduction
Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with generative models.
## Core Components
1. **Document Indexing**: Convert documents to vectors and store them
2. **Similarity Retrieval**: Find relevant documents based on questions
3. **Context Augmentation**: Use retrieval results as context
4. **Answer Generation**: Generate final answers based on context
## Advantages
- Can access latest information
- Reduces hallucination problems
- Provides traceable sources
"""
        with open(os.path.join(documents_dir, "rag_intro.md"), "w", encoding="utf-8") as f:
            f.write(sample_doc)
        print(f"Created sample document: {documents_dir}/rag_intro.md\n")

    # Execute indexing (recreate=True will delete old vector database)
    indexer.index(documents_dir, recreate=True)

    # Verify indexing
    vectorstore = indexer.get_vectorstore()
    if vectorstore:
        # Test retrieval
        print("\nTesting retrieval functionality...")
        results = vectorstore.similarity_search("What are the advantages of RAG systems?", k=2)
        print(f"Found {len(results)} relevant document chunks:\n")
        for i, doc in enumerate(results, 1):
            print(f"{i}. {doc.page_content[:200]}...")
            print(f"   Source: {doc.metadata.get('source', 'unknown')}\n")


if __name__ == "__main__":
    main()
Walking Through the Code
1. Document Loading
def load_documents(self, directory_path):
    # Support multiple file formats
    loaders = {
        '.txt': TextLoader,
        '.md': TextLoader,
        '.pdf': PyPDFLoader,
    }
The code supports Markdown, PDFs, and plain text. The loader dictionary makes it easy to add or remove formats. PDFs can be tricky: some are text-based, others are scanned images requiring OCR. If you run into issues, try pdfplumber or OCR tools.
I add metadata to each document (source, filename) for traceability. Error handling ensures one corrupted file doesn't crash the entire indexing process.
2. Text Chunking
Documents are too long for the model, so we split them into chunks:
self.text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # 1000 characters per chunk
    chunk_overlap=200,  # 200 characters overlap
)
The RecursiveCharacterTextSplitter breaks on paragraph boundaries first, then sentences, then words. This keeps related information together better than fixed-size splits.
Overlap prevents concepts from being cut in half. I use 200 characters (about 20% overlap), but adjust based on your documents. There's no perfect chunk size: too small loses context, too large includes irrelevant information. Start with 1000 characters and tune based on retrieval results.
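To see what the overlap arithmetic looks like, here is a toy fixed-size splitter (the real RecursiveCharacterTextSplitter is smarter about paragraph and sentence boundaries; the function name chunk_with_overlap is mine):

```python
def chunk_with_overlap(text, chunk_size=1000, overlap=200):
    """Naive character chunking: each chunk repeats the last `overlap` chars of the previous one."""
    step = chunk_size - overlap  # advance 800 characters per chunk at the defaults
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 2,500-character document yields 4 overlapping chunks at the defaults;
# the 200-char tail of each chunk reappears at the head of the next.
```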
3. Vectorization
self.embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"
)
I'm using text-embedding-3-small for a good cost-quality balance. The -large version is more accurate but costs more. For local deployment or to avoid API costs, use open-source models:
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
I've used this in production. It's not quite as good as OpenAI's models, but close enough for most cases.
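Whichever model you pick, retrieval ultimately compares embedding vectors, usually by cosine similarity. A stdlib-only sketch of that comparison (toy 3-dimensional vectors; real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0 (identical)
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 (orthogonal)
```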
4. Vector Storage
self.vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=self.embeddings,
    persist_directory=self.persist_directory
)
Chroma is simple, runs locally, and persists to disk. When you restart, it loads the existing database instead of re-indexing. For production, consider Pinecone, Weaviate, Qdrant, Milvus, or PostgreSQL with pgvector.
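Conceptually, a vector store is just a list of (vector, text) pairs plus a "return the k most similar" query. A brute-force sketch of that core idea (the class name ToyVectorStore is mine; Chroma adds persistence and fast approximate-nearest-neighbor indexing on top):

```python
import math

class ToyVectorStore:
    """Brute-force nearest-neighbor store: what a vector DB does, minus indexing and persistence."""
    def __init__(self):
        self.entries = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self.entries.append((vector, text))

    def similarity_search(self, query_vector, k=2):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
        # Rank every stored entry against the query; fine for thousands, not millions
        ranked = sorted(self.entries, key=lambda e: cosine(e[0], query_vector), reverse=True)
        return [text for _, text in ranked[:k]]

store = ToyVectorStore()
store.add([1.0, 0.0], "doc about cats")
store.add([0.0, 1.0], "doc about finance")
print(store.similarity_search([0.9, 0.1], k=1))  # ['doc about cats']
```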
Running Example
1. Prepare Documents
Create documents/ directory and add documents:
mkdir documents
# Add your document files
2. Run Indexing Script
python index.py
Output Example:
============================================================
Starting indexing workflow
============================================================
Loading documents from: ./documents
[OK] Loaded: rag_intro.md (1 pages)
[OK] Loaded: api_docs.txt (1 pages)
Total loaded 2 documents
Splitting documents...
Original documents: 2
Chunks after splitting: 5
Average chunk size: 856 characters
Generating vectors and storing...
Vector database created: ./vectorstore
Stored 5 document chunks
============================================================
Indexing complete!
============================================================
Testing retrieval functionality...
Found 2 relevant document chunks:
1. # RAG System Introduction Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with generative models. ## Core Components...
Source: ./documents/rag_intro.md
2. Core components include document indexing, similarity retrieval, context augmentation, and answer generation...
Source: ./documents/rag_intro.md
Common Issues and Debugging
1. API Key Error
Error: Invalid API key
Solution:
- Check that OPENAI_API_KEY is set in the .env file
- Ensure the API key is valid and has sufficient credits
2. Document Loading Failure
[ERROR] Failed to load: document.pdf - ...
Solution:
- Check if file format is supported
- PDF files may need pypdf or pdfplumber
- Try parsing manually and converting to text
3. Vector Database Already Exists
If vector database already exists, running again will append data. To recreate:
indexer.index(documents_dir, recreate=True)
4. Out of Memory
When processing large volumes of documents, you may run out of memory.
Optimization:
- Process documents in batches
- Use smaller embedding models
- Increase chunk_size to reduce the number of chunks
5. Irrelevant Retrieval Results
Possible Causes:
- Inappropriate chunking strategy
- Poor embedding model selection
- Poor document quality
Debugging Method:
# View retrieved documents along with their scores
results = vectorstore.similarity_search_with_score("your question", k=5)
for doc, score in results:
    print(f"Distance: {score:.4f}")  # lower = more similar with Chroma's default metric
    print(f"Content: {doc.page_content[:200]}")

Note that plain similarity_search doesn't return scores; use similarity_search_with_score when debugging relevance.
Performance Optimization Tips
1. Batch Processing
Chroma.from_documents embeds everything in one call; for large corpora, feed it in batches so a single failure doesn't lose all progress:

# Build the store incrementally, 100 chunks at a time
batch_size = 100
vectorstore = None
for i in range(0, len(chunks), batch_size):
    batch = chunks[i:i + batch_size]
    if vectorstore is None:
        vectorstore = Chroma.from_documents(
            documents=batch,
            embedding=self.embeddings,
            persist_directory=self.persist_directory,
        )
    else:
        vectorstore.add_documents(batch)
2. Async Processing
When your source documents are web pages, async loaders avoid waiting on each request serially (AsyncChromiumLoader fetches pages with a headless browser and requires playwright):

import asyncio
from langchain_community.document_loaders import AsyncChromiumLoader

async def load_documents_async(urls):
    loader = AsyncChromiumLoader(urls)
    return await loader.aload()
3. Incremental Updates
Only index new or updated documents:
def index_incremental(self, chunks, existing_ids):
    # Keep only chunks whose doc_id hasn't been indexed yet
    new_chunks = [chunk for chunk in chunks
                  if chunk.metadata['doc_id'] not in existing_ids]
    # Only add new documents
    self.vectorstore.add_documents(new_chunks)
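For the doc_id itself, a content hash works well: unchanged files hash to the same id, so re-runs skip them, while any edit produces a new id. A sketch using hashlib (the function name content_id is mine):

```python
import hashlib

def content_id(text):
    """Stable id derived from content: same text -> same id, any edit -> new id."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
```

Store content_id(doc.page_content) in each chunk's metadata at index time, and compare against the ids already in the store on the next run.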
What We've Built
We've built a system that loads documents, splits them into chunks, converts them to vectors, and stores them in a searchable database. The code is ready to run. You'll likely need to tweak chunk sizes and handle different file formats based on your documents.
In the next article, we'll use this indexed data to answer questions: taking a user's question, finding relevant documents, and generating an answer.
This is the second article in the RAG System series. Previous: RAG System Fundamentals | Next: Building RAG Systems (Part 2)