Building RAG Systems (Part 1): Document Indexing and Vector Storage
2026-02-01 · 4 min read, 9 min code
Alright, time to get our hands dirty. In the previous article, we talked about what RAG is and why it's useful. Now let's actually build something.
The indexing phase prepares documents for search: loading, splitting into chunks, converting to vectors, and storing in a database. By the end of this article, you'll have a working indexing system that can load documents, convert them to vectors, and prepare them for retrieval.
What We're Building
Let's keep it simple to start. We're building a system that can:
- Load documents and break them into smaller chunks that make sense
- Convert those chunks into vectors (numbers that represent meaning)
- Store everything in a way that makes it easy to search later
The code handles Markdown, plain text, and PDFs. If you only have Markdown files, that's fine. PDF support means you can index research papers or vendor documents without rewriting the loader.
What We'll Use
I'm using LangChain for document loading and text splitting. You could build this from scratch, but why?
For embeddings, I'm using OpenAI's models. They're good and easy to use. I'll mention open-source alternatives if you want to keep things local or save money.
For the vector database, I'm using Chroma. It's simple, runs locally, and perfect for prototyping. For production, consider Pinecone, Qdrant, or Milvus.
Getting Set Up
First things first: let's get the dependencies installed. Create a requirements.txt file with these packages:
langchain==0.1.0
langchain-openai==0.0.2
langchain-community==0.0.10
chromadb==0.4.22
python-dotenv==1.0.0
pypdf==3.17.0
Run pip install -r requirements.txt. LangChain's version numbers can be chaotic, so adjust if needed.
You'll also need an OpenAI API key. Create a .env file in your project root and add:
OPENAI_API_KEY=your_openai_api_key_here
We'll load this using python-dotenv. Get an API key from OpenAI's website. Embeddings are cheap but not free. Experimenting typically costs a few dollars.
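If you're curious what load_dotenv actually does, it's just parsing KEY=VALUE lines into environment variables. A stdlib-only sketch of the idea (use the real python-dotenv in practice; the function name load_env_file is mine):

```python
import os

def load_env_file(path=".env"):
    """Minimal stand-in for python-dotenv: parse KEY=VALUE lines into os.environ."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blanks and comments; keep existing env vars untouched
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
```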
Complete Code Implementation
Project Structure
rag_project/
├── requirements.txt
├── .env
├── documents/ # Documents to index
│ ├── sample1.md
│ └── sample2.txt
├── index.py # Indexing script
└── vectorstore/ # Chroma database directory (auto-created)
Indexing Script Implementation
Create index.py:
import os
from pathlib import Path
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import (
    TextLoader,
    PyPDFLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

# Load environment variables
load_dotenv()


class RAGIndexer:
    """RAG System Indexer"""

    def __init__(self, persist_directory="./vectorstore"):
        """
        Initialize indexer

        Args:
            persist_directory: Vector database persistence directory
        """
        # Initialize embedding model
        self.embeddings = OpenAIEmbeddings(
            model="text-embedding-3-small"  # Or use text-embedding-ada-002
        )
        # Initialize text splitter
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,    # Maximum characters per chunk
            chunk_overlap=200,  # Overlap characters between chunks
            length_function=len,
        )
        # Vector database directory
        self.persist_directory = persist_directory
        # Vector database (lazy initialization)
        self.vectorstore = None

    def load_documents(self, directory_path):
        """
        Load all documents from directory

        Args:
            directory_path: Document directory path

        Returns:
            List[Document]: List of loaded documents
        """
        print(f"Loading documents from: {directory_path}")
        # Supported document types
        loaders = {
            '.txt': TextLoader,
            '.md': TextLoader,
            '.pdf': PyPDFLoader,
        }
        all_documents = []
        # Traverse directory
        directory = Path(directory_path)
        for file_path in directory.rglob('*'):
            if file_path.is_file():
                file_ext = file_path.suffix.lower()
                if file_ext in loaders:
                    try:
                        loader = loaders[file_ext](str(file_path))
                        documents = loader.load()
                        # Add metadata
                        for doc in documents:
                            doc.metadata['source'] = str(file_path)
                            doc.metadata['file_name'] = file_path.name
                            doc.metadata['file_type'] = file_ext
                        all_documents.extend(documents)
                        print(f"  [OK] Loaded: {file_path.name} ({len(documents)} pages)")
                    except Exception as e:
                        print(f"  [ERROR] Failed to load: {file_path.name} - {e}")
        print(f"Total loaded {len(all_documents)} documents\n")
        return all_documents

    def split_documents(self, documents):
        """
        Split documents into smaller chunks

        Args:
            documents: Original document list

        Returns:
            List[Document]: List of document chunks after splitting
        """
        print("Splitting documents...")
        chunks = self.text_splitter.split_documents(documents)
        print(f"  Original documents: {len(documents)}")
        print(f"  Chunks after splitting: {len(chunks)}")
        print(f"  Average chunk size: {sum(len(c.page_content) for c in chunks) / len(chunks):.0f} characters\n")
        return chunks

    def create_vectorstore(self, chunks, recreate=False):
        """
        Create vector database

        Args:
            chunks: Document chunk list
            recreate: Whether to recreate (delete old one)
        """
        print("Generating vectors and storing...")
        if recreate and os.path.exists(self.persist_directory):
            import shutil
            shutil.rmtree(self.persist_directory)
            print("  Deleted old vector database")
        # Create vector database (Chroma persists automatically)
        self.vectorstore = Chroma.from_documents(
            documents=chunks,
            embedding=self.embeddings,
            persist_directory=self.persist_directory
        )
        print(f"  Vector database created: {self.persist_directory}")
        print(f"  Stored {len(chunks)} document chunks\n")

    def load_existing_vectorstore(self):
        """
        Load existing vector database

        Returns:
            bool: Whether loading was successful
        """
        if not os.path.exists(self.persist_directory):
            return False
        try:
            self.vectorstore = Chroma(
                persist_directory=self.persist_directory,
                embedding_function=self.embeddings
            )
            print(f"Loaded existing vector database: {self.persist_directory}\n")
            return True
        except Exception as e:
            print(f"Failed to load vector database: {e}\n")
            return False

    def get_vectorstore(self):
        """Get vector database instance"""
        return self.vectorstore

    def index(self, documents_directory, recreate=False):
        """
        Execute complete indexing workflow

        Args:
            documents_directory: Document directory
            recreate: Whether to recreate vector database
        """
        print("=" * 60)
        print("Starting indexing workflow")
        print("=" * 60 + "\n")
        # 1. Load documents
        documents = self.load_documents(documents_directory)
        if not documents:
            print("No loadable documents found")
            return
        # 2. Split documents
        chunks = self.split_documents(documents)
        # 3. Create vector database
        self.create_vectorstore(chunks, recreate=recreate)
        print("=" * 60)
        print("Indexing complete!")
        print("=" * 60)


def main():
    """Main function"""
    # Create indexer
    indexer = RAGIndexer(persist_directory="./vectorstore")

    # Execute indexing
    documents_dir = "./documents"  # Document directory

    # If directory doesn't exist, create sample document
    if not os.path.exists(documents_dir):
        os.makedirs(documents_dir, exist_ok=True)
        # Create sample document
        sample_doc = """# RAG System Introduction
Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with generative models.
## Core Components
1. **Document Indexing**: Convert documents to vectors and store them
2. **Similarity Retrieval**: Find relevant documents based on questions
3. **Context Augmentation**: Use retrieval results as context
4. **Answer Generation**: Generate final answers based on context
## Advantages
- Can access latest information
- Reduces hallucination problems
- Provides traceable sources
"""
        with open(os.path.join(documents_dir, "rag_intro.md"), "w", encoding="utf-8") as f:
            f.write(sample_doc)
        print(f"Created sample document: {documents_dir}/rag_intro.md\n")

    # Execute indexing (recreate=True will delete old vector database)
    indexer.index(documents_dir, recreate=True)

    # Verify indexing
    vectorstore = indexer.get_vectorstore()
    if vectorstore:
        # Test retrieval
        print("\nTesting retrieval functionality...")
        results = vectorstore.similarity_search("What are the advantages of RAG systems?", k=2)
        print(f"Found {len(results)} relevant document chunks:\n")
        for i, doc in enumerate(results, 1):
            print(f"{i}. {doc.page_content[:200]}...")
            print(f"   Source: {doc.metadata.get('source', 'unknown')}\n")


if __name__ == "__main__":
    main()
Walking Through the Code
1. Document Loading
def load_documents(self, directory_path):
    # Support multiple file formats
    loaders = {
        '.txt': TextLoader,
        '.md': TextLoader,
        '.pdf': PyPDFLoader,
    }
The code supports Markdown, PDFs, and plain text. The loader dictionary makes it easy to add or remove formats. PDFs can be tricky: some are text-based, others are scanned images requiring OCR. If you run into issues, try pdfplumber or OCR tools.
I add metadata to each document (source, filename) for traceability. Error handling ensures one corrupted file doesn't crash the entire indexing process.
2. Text Chunking
Documents are too long for the model, so we split them into chunks:
self.text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # 1000 characters per chunk
    chunk_overlap=200,  # 200 characters overlap
)
The RecursiveCharacterTextSplitter breaks on paragraph boundaries first, then sentences, then words. This keeps related information together better than fixed-size splits.
Overlap prevents concepts from being cut in half. I use 200 characters (about 20% overlap), but adjust based on your documents. There's no perfect chunk size: too small loses context, too large includes irrelevant information. Start with 1000 characters and tune based on retrieval results.
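To see what the overlap arithmetic looks like, here is a toy fixed-size splitter (the real RecursiveCharacterTextSplitter is smarter about paragraph and sentence boundaries; the function name chunk_with_overlap is mine):

```python
def chunk_with_overlap(text, chunk_size=1000, overlap=200):
    """Naive character chunking: each chunk repeats the last `overlap` chars of the previous one."""
    step = chunk_size - overlap  # advance 800 characters per chunk at the defaults
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 2,500-character document yields 4 overlapping chunks at the defaults;
# the 200-char tail of each chunk reappears at the head of the next.
```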
3. Vectorization
self.embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"
)
I'm using text-embedding-3-small for a good cost-quality balance. The -large version is more accurate but costs more. For local deployment or to avoid API costs, use open-source models:
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
I've used this in production. It's not quite as good as OpenAI's models, but close enough for most cases.
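Whichever model you pick, retrieval ultimately compares embedding vectors, usually by cosine similarity. A stdlib-only sketch of that comparison (toy 3-dimensional vectors; real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0 (identical)
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 (orthogonal)
```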
4. Vector Storage
self.vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=self.embeddings,
    persist_directory=self.persist_directory
)
Chroma is simple, runs locally, and persists to disk. When you restart, it loads the existing database instead of re-indexing. For production, consider Pinecone, Weaviate, Qdrant, Milvus, or PostgreSQL with pgvector.
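Conceptually, a vector store is just a list of (vector, text) pairs plus a "return the k most similar" query. A brute-force sketch of that core idea (the class name ToyVectorStore is mine; Chroma adds persistence and fast approximate-nearest-neighbor indexing on top):

```python
import math

class ToyVectorStore:
    """Brute-force nearest-neighbor store: what a vector DB does, minus indexing and persistence."""
    def __init__(self):
        self.entries = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self.entries.append((vector, text))

    def similarity_search(self, query_vector, k=2):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
        # Rank every stored entry against the query; fine for thousands, not millions
        ranked = sorted(self.entries, key=lambda e: cosine(e[0], query_vector), reverse=True)
        return [text for _, text in ranked[:k]]

store = ToyVectorStore()
store.add([1.0, 0.0], "doc about cats")
store.add([0.0, 1.0], "doc about finance")
print(store.similarity_search([0.9, 0.1], k=1))  # ['doc about cats']
```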
Running Example
1. Prepare Documents
Create documents/ directory and add documents:
mkdir documents
# Add your document files
2. Run Indexing Script
python index.py
Output Example:
============================================================
Starting indexing workflow
============================================================
Loading documents from: ./documents
[OK] Loaded: rag_intro.md (1 pages)
[OK] Loaded: api_docs.txt (1 pages)
Total loaded 2 documents
Splitting documents...
Original documents: 2
Chunks after splitting: 5
Average chunk size: 856 characters
Generating vectors and storing...
Vector database created: ./vectorstore
Stored 5 document chunks
============================================================
Indexing complete!
============================================================
Testing retrieval functionality...
Found 2 relevant document chunks:
1. # RAG System Introduction Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with generative models. ## Core Components...
Source: ./documents/rag_intro.md
2. Core components include document indexing, similarity retrieval, context augmentation, and answer generation...
Source: ./documents/rag_intro.md
Common Issues and Debugging
1. API Key Error
Error: Invalid API key
Solution:
- Check that OPENAI_API_KEY is set in the .env file
- Ensure the API key is valid and has sufficient credits
2. Document Loading Failure
[ERROR] Failed to load: document.pdf - ...
Solution:
- Check if file format is supported
- PDF files may need pypdf or pdfplumber
- Try parsing manually and converting to text
3. Vector Database Already Exists
If vector database already exists, running again will append data. To recreate:
indexer.index(documents_dir, recreate=True)
4. Out of Memory
When processing large volumes of documents, you may run out of memory.
Optimization:
- Process documents in batches
- Use smaller embedding models
- Increase chunk_size to reduce the number of chunks
5. Irrelevant Retrieval Results
Possible Causes:
- Inappropriate chunking strategy
- Poor embedding model selection
- Poor document quality
Debugging Method:
# View retrieved documents along with their scores
results = vectorstore.similarity_search_with_score("your question", k=5)
for doc, score in results:
    print(f"Distance: {score:.4f}")  # lower = more similar with Chroma's default metric
    print(f"Content: {doc.page_content[:200]}")

Note that plain similarity_search doesn't return scores; use similarity_search_with_score when debugging relevance.
Performance Optimization Tips
1. Batch Processing
Chroma.from_documents embeds everything in one call; for large corpora, feed it in batches so a single failure doesn't lose all progress:

# Build the store incrementally, 100 chunks at a time
batch_size = 100
vectorstore = None
for i in range(0, len(chunks), batch_size):
    batch = chunks[i:i + batch_size]
    if vectorstore is None:
        vectorstore = Chroma.from_documents(
            documents=batch,
            embedding=self.embeddings,
            persist_directory=self.persist_directory,
        )
    else:
        vectorstore.add_documents(batch)
2. Async Processing
When your source documents are web pages, async loaders avoid waiting on each request serially (AsyncChromiumLoader fetches pages with a headless browser and requires playwright):

import asyncio
from langchain_community.document_loaders import AsyncChromiumLoader

async def load_documents_async(urls):
    loader = AsyncChromiumLoader(urls)
    return await loader.aload()
3. Incremental Updates
Only index new or updated documents:
def index_incremental(self, chunks, existing_ids):
    # Keep only chunks whose doc_id hasn't been indexed yet
    new_chunks = [chunk for chunk in chunks
                  if chunk.metadata['doc_id'] not in existing_ids]
    # Only add new documents
    self.vectorstore.add_documents(new_chunks)
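For the doc_id itself, a content hash works well: unchanged files hash to the same id, so re-runs skip them, while any edit produces a new id. A sketch using hashlib (the function name content_id is mine):

```python
import hashlib

def content_id(text):
    """Stable id derived from content: same text -> same id, any edit -> new id."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
```

Store content_id(doc.page_content) in each chunk's metadata at index time, and compare against the ids already in the store on the next run.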
What We've Built
We've built a system that loads documents, splits them into chunks, converts them to vectors, and stores them in a searchable database. The code is ready to run. You'll likely need to tweak chunk sizes and handle different file formats based on your documents.
In the next article, we'll use this indexed data to answer questions: taking a user's question, finding relevant documents, and generating an answer.
This is the second article in the RAG System series. Previous: RAG System Fundamentals | Next: Building RAG Systems (Part 2)