
Building RAG Systems (Part 1): Document Indexing and Vector Storage

2026-02-01 · 4 min read, 9 min code

Alright, time to get our hands dirty. In the previous article, we talked about what RAG is and why it's useful. Now let's actually build something.

The indexing phase prepares documents for search: loading, splitting into chunks, converting to vectors, and storing in a database. By the end of this article, you'll have a working indexing system that can load documents, convert them to vectors, and prepare them for retrieval.

What We're Building

Let's keep it simple to start. We're building a system that can:

  • Load documents and break them into smaller chunks that make sense
  • Convert those chunks into vectors (numbers that represent meaning)
  • Store everything in a way that makes it easy to search later

The code handles Markdown, plain text, and PDFs. If you only have Markdown files, that's fine. PDF support means you can index research papers or vendor documents without rewriting the loader.

What We'll Use

I'm using LangChain for document loading and text splitting. You could build this from scratch, but why?

For embeddings, I'm using OpenAI's models. They're good and easy to use. I'll mention open-source alternatives if you want to keep things local or save money.

For the vector database, I'm using Chroma. It's simple, runs locally, and perfect for prototyping. For production, consider Pinecone, Qdrant, or Milvus.

Getting Set Up

First things first: let's get the dependencies installed. Create a requirements.txt file with these packages:

langchain==0.1.0
langchain-openai==0.0.2
langchain-community==0.0.10
chromadb==0.4.22
python-dotenv==1.0.0
pypdf==3.17.0

Run pip install -r requirements.txt. LangChain's version numbers can be chaotic, so adjust if needed.

You'll also need an OpenAI API key. Create a .env file in your project root and add:

OPENAI_API_KEY=your_openai_api_key_here

We'll load this using python-dotenv. Get an API key from OpenAI's website. Embeddings are cheap but not free. Experimenting typically costs a few dollars.
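If you're curious what python-dotenv actually does, it's little more than parsing KEY=VALUE lines into the environment. Here's a simplified, stdlib-only sketch of that behavior (not the library's real implementation, and it skips features like quoting and variable expansion):

```python
import os

def load_env_file(path=".env"):
    """Minimal .env parser: read KEY=VALUE lines into os.environ."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and comments
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            # Don't overwrite variables that are already set
            os.environ.setdefault(key.strip(), value.strip())

load_env_file()  # after this, os.environ.get("OPENAI_API_KEY") works
```

In practice, just call `load_dotenv()` from the library; the sketch is only to demystify the magic.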

Complete Code Implementation

Project Structure

rag_project/
├── requirements.txt
├── .env
├── documents/          # Documents to index
│   ├── sample1.md
│   └── sample2.txt
├── index.py           # Indexing script
└── vectorstore/       # Chroma database directory (auto-created)

Indexing Script Implementation

Create index.py:

import os
from pathlib import Path
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import (
    TextLoader,
    PyPDFLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

# Load environment variables
load_dotenv()


class RAGIndexer:
    """RAG System Indexer"""

    def __init__(self, persist_directory="./vectorstore"):
        """
        Initialize indexer

        Args:
            persist_directory: Vector database persistence directory
        """
        # Initialize embedding model
        self.embeddings = OpenAIEmbeddings(
            model="text-embedding-3-small"  # Or use text-embedding-ada-002
        )

        # Initialize text splitter
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,      # Maximum characters per chunk
            chunk_overlap=200,    # Overlap characters between chunks
            length_function=len,
        )

        # Vector database directory
        self.persist_directory = persist_directory

        # Vector database (lazy initialization)
        self.vectorstore = None

    def load_documents(self, directory_path):
        """
        Load all documents from directory

        Args:
            directory_path: Document directory path

        Returns:
            List[Document]: List of loaded documents
        """
        print(f"Loading documents from: {directory_path}")

        # Supported document types
        loaders = {
            '.txt': TextLoader,
            '.md': TextLoader,
            '.pdf': PyPDFLoader,
        }

        all_documents = []

        # Traverse directory
        directory = Path(directory_path)
        for file_path in directory.rglob('*'):
            if file_path.is_file():
                file_ext = file_path.suffix.lower()
                if file_ext in loaders:
                    try:
                        loader = loaders[file_ext](str(file_path))
                        documents = loader.load()

                        # Add metadata
                        for doc in documents:
                            doc.metadata['source'] = str(file_path)
                            doc.metadata['file_name'] = file_path.name
                            doc.metadata['file_type'] = file_ext

                        all_documents.extend(documents)
                        print(f"  [OK] Loaded: {file_path.name} ({len(documents)} pages)")
                    except Exception as e:
                        print(f"  [ERROR] Failed to load: {file_path.name} - {e}")

        print(f"Total loaded {len(all_documents)} documents\n")
        return all_documents

    def split_documents(self, documents):
        """
        Split documents into smaller chunks

        Args:
            documents: Original document list

        Returns:
            List[Document]: List of document chunks after splitting
        """
        print("Splitting documents...")
        chunks = self.text_splitter.split_documents(documents)
        print(f"  Original documents: {len(documents)}")
        print(f"  Chunks after splitting: {len(chunks)}")
        print(f"  Average chunk size: {sum(len(c.page_content) for c in chunks) / len(chunks):.0f} characters\n")
        return chunks

    def create_vectorstore(self, chunks, recreate=False):
        """
        Create vector database

        Args:
            chunks: Document chunk list
            recreate: Whether to recreate (delete old one)
        """
        print("Generating vectors and storing...")

        if recreate and os.path.exists(self.persist_directory):
            import shutil
            shutil.rmtree(self.persist_directory)
            print("  Deleted old vector database")

        # Create vector database (Chroma persists automatically)
        self.vectorstore = Chroma.from_documents(
            documents=chunks,
            embedding=self.embeddings,
            persist_directory=self.persist_directory
        )
        print(f"  Vector database created: {self.persist_directory}")
        print(f"  Stored {len(chunks)} document chunks\n")

    def load_existing_vectorstore(self):
        """
        Load existing vector database

        Returns:
            bool: Whether loading was successful
        """
        if not os.path.exists(self.persist_directory):
            return False

        try:
            self.vectorstore = Chroma(
                persist_directory=self.persist_directory,
                embedding_function=self.embeddings
            )
            print(f"Loaded existing vector database: {self.persist_directory}\n")
            return True
        except Exception as e:
            print(f"Failed to load vector database: {e}\n")
            return False

    def get_vectorstore(self):
        """Get vector database instance"""
        return self.vectorstore

    def index(self, documents_directory, recreate=False):
        """
        Execute complete indexing workflow

        Args:
            documents_directory: Document directory
            recreate: Whether to recreate vector database
        """
        print("=" * 60)
        print("Starting indexing workflow")
        print("=" * 60 + "\n")

        # 1. Load documents
        documents = self.load_documents(documents_directory)
        if not documents:
            print("No loadable documents found")
            return

        # 2. Split documents
        chunks = self.split_documents(documents)

        # 3. Create vector database
        self.create_vectorstore(chunks, recreate=recreate)

        print("=" * 60)
        print("Indexing complete!")
        print("=" * 60)


def main():
    """Main function"""
    # Create indexer
    indexer = RAGIndexer(persist_directory="./vectorstore")

    # Execute indexing
    documents_dir = "./documents"  # Document directory

    # If directory doesn't exist, create sample document
    if not os.path.exists(documents_dir):
        os.makedirs(documents_dir, exist_ok=True)

        # Create sample document
        sample_doc = """# RAG System Introduction

Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with generative models.

## Core Components

1. **Document Indexing**: Convert documents to vectors and store them
2. **Similarity Retrieval**: Find relevant documents based on questions
3. **Context Augmentation**: Use retrieval results as context
4. **Answer Generation**: Generate final answers based on context

## Advantages

- Can access latest information
- Reduces hallucination problems
- Provides traceable sources
"""
        with open(os.path.join(documents_dir, "rag_intro.md"), "w", encoding="utf-8") as f:
            f.write(sample_doc)
        print(f"Created sample document: {documents_dir}/rag_intro.md\n")

    # Execute indexing (recreate=True will delete old vector database)
    indexer.index(documents_dir, recreate=True)

    # Verify indexing
    vectorstore = indexer.get_vectorstore()
    if vectorstore:
        # Test retrieval
        print("\nTesting retrieval functionality...")
        results = vectorstore.similarity_search("What are the advantages of RAG systems?", k=2)
        print(f"Found {len(results)} relevant document chunks:\n")
        for i, doc in enumerate(results, 1):
            print(f"{i}. {doc.page_content[:200]}...")
            print(f"   Source: {doc.metadata.get('source', 'unknown')}\n")


if __name__ == "__main__":
    main()

Walking Through the Code

1. Document Loading

def load_documents(self, directory_path):
    # Support multiple file formats
    loaders = {
        '.txt': TextLoader,
        '.md': TextLoader,
        '.pdf': PyPDFLoader,
    }

The code supports Markdown, PDFs, and plain text. The loader dictionary makes it easy to add or remove formats. PDFs can be tricky: some are text-based, others are scanned images requiring OCR. If you run into issues, try pdfplumber or OCR tools.

I add metadata to each document (source, filename) for traceability. Error handling ensures one corrupted file doesn't crash the entire indexing process.

2. Text Chunking

Whole documents are too long to embed as a single vector and too coarse to retrieve precisely, so we split them into chunks:

self.text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # 1000 characters per chunk
    chunk_overlap=200,    # 200 characters overlap
)

The RecursiveCharacterTextSplitter breaks on paragraph boundaries first, then sentences, then words. This keeps related information together better than fixed-size splits.

Overlap prevents concepts from being cut in half. I use 200 characters (about 20% overlap), but adjust based on your documents. There's no perfect chunk size: too small loses context, too large includes irrelevant information. Start with 1000 characters and tune based on retrieval results.
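To see concretely what overlap buys you, here's a deliberately simplified fixed-window splitter. This is not LangChain's recursive algorithm (which respects paragraph and sentence boundaries); it only illustrates how neighboring chunks share a boundary region:

```python
def split_with_overlap(text, chunk_size, overlap):
    """Toy fixed-window splitter: each chunk starts (chunk_size - overlap)
    characters after the previous one, so neighbors share `overlap` characters."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "RAG systems retrieve relevant chunks before generating an answer."
for chunk in split_with_overlap(text, chunk_size=30, overlap=10):
    print(repr(chunk))
# Adjacent chunks repeat the 10-character boundary region, so text cut
# at a chunk edge still appears whole in at least one of the two chunks.
```

With real documents, the recursive splitter produces cleaner boundaries, but the overlap mechanism is the same idea.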

3. Vectorization

self.embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"
)

I'm using text-embedding-3-small for a good cost-quality balance. The -large version is more accurate but costs more. For local deployment or to avoid API costs, use open-source models:

from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

I've used this in production. It's not quite as good as OpenAI's models, but close enough for most cases.
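Whichever model you pick, the output is just a vector, and "similar meaning" becomes "small angle between vectors," usually measured with cosine similarity. A toy illustration with hand-made 3-dimensional vectors (real embeddings from text-embedding-3-small have 1,536 dimensions, and the values below are invented for the example):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Pretend "embeddings" for three sentences (purely illustrative values)
cat    = [0.90, 0.10, 0.00]  # "cats are pets"
kitten = [0.85, 0.20, 0.10]  # "kittens are young cats"
stock  = [0.00, 0.10, 0.95]  # "stock prices fell today"

print(cosine_similarity(cat, kitten))  # high: related meanings
print(cosine_similarity(cat, stock))   # low: unrelated
```

The vector database does exactly this comparison at scale: it finds the stored chunks whose vectors sit closest to the query vector.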

4. Vector Storage

self.vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=self.embeddings,
    persist_directory=self.persist_directory
)

Chroma is simple, runs locally, and persists to disk. When you restart, it loads the existing database instead of re-indexing. For production, consider Pinecone, Weaviate, Qdrant, Milvus, or PostgreSQL with pgvector.

Running Example

1. Prepare Documents

Create documents/ directory and add documents:

mkdir documents
# Add your document files

2. Run Indexing Script

python index.py

Output Example:

============================================================
Starting indexing workflow
============================================================

Loading documents from: ./documents
  [OK] Loaded: rag_intro.md (1 pages)
  [OK] Loaded: api_docs.txt (1 pages)
Total loaded 2 documents

Splitting documents...
  Original documents: 2
  Chunks after splitting: 5
  Average chunk size: 856 characters

Generating vectors and storing...
  Vector database created: ./vectorstore
  Stored 5 document chunks

============================================================
Indexing complete!
============================================================

Testing retrieval functionality...
Found 2 relevant document chunks:

1. What are the advantages of RAG systems? Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with generative models...
   Source: ./documents/rag_intro.md

2. Core components include document indexing, similarity retrieval, context augmentation, and answer generation...
   Source: ./documents/rag_intro.md

Common Issues and Debugging

1. API Key Error

Error: Invalid API key

Solution:

  • Check OPENAI_API_KEY in .env file
  • Ensure API key is valid and has sufficient credits

2. Document Loading Failure

[ERROR] Failed to load: document.pdf - ...

Solution:

  • Check that the file format is supported
  • PDF files may need pypdf or pdfplumber
  • As a fallback, extract the text manually and save it as plain text

3. Vector Database Already Exists

If the vector database already exists, running the script again will append data to it. To rebuild it from scratch:

indexer.index(documents_dir, recreate=True)

4. Out of Memory

When processing large volumes of documents, the script may run out of memory.

Optimization:

  • Process documents in batches
  • Use smaller embedding models
  • Increase chunk_size to reduce number of chunks

5. Irrelevant Retrieval Results

Possible Causes:

  • Inappropriate chunking strategy
  • Poor embedding model selection
  • Poor document quality

Debugging Method:

# Inspect retrieved chunks together with their distance scores
results = vectorstore.similarity_search_with_score("your question", k=5)
for doc, score in results:
    print(f"Distance: {score:.4f} (lower = more similar)")
    print(f"Content: {doc.page_content[:200]}")

Performance Optimization Tips

1. Batch Processing

# Chroma.from_documents embeds everything in one call; for large corpora,
# create the store first, then add chunks in batches
vectorstore = Chroma(
    persist_directory=self.persist_directory,
    embedding_function=self.embeddings
)
for i in range(0, len(chunks), 100):
    vectorstore.add_documents(chunks[i:i + 100])

2. Async Processing

If your sources are web pages rather than local files, they can be fetched asynchronously. AsyncChromiumLoader drives a headless browser, so it also requires the playwright package:

import asyncio
from langchain_community.document_loaders import AsyncChromiumLoader

async def load_documents_async(urls):
    loader = AsyncChromiumLoader(urls)
    return await loader.aload()

documents = asyncio.run(load_documents_async(["https://example.com"]))

3. Incremental Updates

Only index new or updated documents:

def index_incremental(self, chunks, existing_sources):
    # Keep only chunks whose source file hasn't been indexed yet
    new_chunks = [chunk for chunk in chunks
                  if chunk.metadata['source'] not in existing_sources]
    # Append just the new chunks to the existing store
    if new_chunks:
        self.vectorstore.add_documents(new_chunks)

What We've Built

We've built a system that loads documents, splits them into chunks, converts them to vectors, and stores them in a searchable database. The code is ready to run. You'll likely need to tweak chunk sizes and handle different file formats based on your documents.

In the next article, we'll use this indexed data to answer questions: taking a user's question, finding relevant documents, and generating an answer.


This is the second article in the RAG System series. Previous: RAG System Fundamentals | Next: Building RAG Systems (Part 2)