
Building RAG Systems (Part 2): Retrieval and Answer Generation

2026-02-07 · 5 min read, 12 min code

In the previous article, we set up the indexing system. We loaded documents, split them into chunks, converted them to vectors, and stored everything in Chroma. Now we need to use that indexed data to answer questions.

Here's what we're building: when someone asks a question, we'll convert it to a vector, search our database for similar documents, pull those documents together as context, and then ask an LLM to generate an answer based on that context.

What We're Starting With

From the indexing phase, we have:

  • A Chroma database full of vectorized document chunks
  • An embedding model that can convert text to vectors
  • A searchable database that can find similar documents

What we need to build:

  • A way to convert questions into vectors (using the same model)
  • Similarity search to find relevant documents
  • Context assembly to combine those documents
  • Prompt engineering to get good answers from the LLM
  • The whole thing wired together into a working system

Query Phase Architecture

Question → vectorize (same embedding model as indexing) → similarity search → assemble context → LLM generation → answer with sources

Complete Code Implementation

Query Engine Implementation

Create query.py:

import os

from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

load_dotenv()


class RAGQueryEngine:
    """RAG System Query Engine"""

    def __init__(self, vectorstore_path="./vectorstore"):
        """
        Initialize query engine

        Args:
            vectorstore_path: Vector database path
        """
        # Embedding model (must match the indexing phase)
        self.embeddings = OpenAIEmbeddings(
            model="text-embedding-3-small"
        )

        # Load vector database
        self.vectorstore = Chroma(
            persist_directory=vectorstore_path,
            embedding_function=self.embeddings
        )

        # LLM
        self.llm = ChatOpenAI(
            model="gpt-4o-mini",  # or gpt-4, gpt-3.5-turbo
            temperature=0,        # reduce randomness, improve accuracy
        )

        self.retriever = None
        self.qa_chain = None
        self._setup_retriever()
        self._setup_qa_chain()

    def _setup_retriever(self):
        """Set up the retriever"""
        # Return the top-k most relevant document chunks
        self.retriever = self.vectorstore.as_retriever(
            search_type="similarity",  # similarity search
            search_kwargs={
                "k": 4  # return the 4 most relevant document chunks
            }
        )

    def _setup_qa_chain(self):
        """Set up the Q&A chain"""
        # Custom prompt template
        prompt_template = """Answer the user's question based on the following context information.
If you don't know the answer, say you don't know, don't make up an answer.

Context information:
{context}

Question: {question}

Please provide an accurate and detailed answer, and cite specific information from the context as much as possible."""

        PROMPT = PromptTemplate(
            template=prompt_template,
            input_variables=["context", "question"]
        )

        # Create the Q&A chain
        self.qa_chain = RetrievalQA.from_chain_type(
            llm=self.llm,
            chain_type="stuff",  # put all document chunks into one context
            retriever=self.retriever,
            chain_type_kwargs={"prompt": PROMPT},
            return_source_documents=True,  # return source documents
        )

    def query(self, question: str, return_sources: bool = True):
        """
        Query and generate an answer

        Args:
            question: User question
            return_sources: Whether to print source documents

        Returns:
            dict: Answer and source documents
        """
        print(f"Question: {question}\n")

        # Execute the query
        result = self.qa_chain.invoke({"query": question})

        answer = result["result"]
        source_documents = result.get("source_documents", [])

        print(f"Answer:\n{answer}\n")

        if return_sources and source_documents:
            print("Reference Sources:")
            for i, doc in enumerate(source_documents, 1):
                source = doc.metadata.get("source", "unknown")
                print(f"  {i}. {source}")
                print(f"     Content preview: {doc.page_content[:100]}...\n")

        return {
            "answer": answer,
            "sources": source_documents
        }

    def query_with_similarity_search(self, question: str, k: int = 4):
        """
        Similarity search only, no answer generation

        Args:
            question: User question
            k: Number of documents to return

        Returns:
            list: Relevant documents
        """
        # Shown for illustration: similarity_search embeds the query
        # internally with the same embedding model
        question_embedding = self.embeddings.embed_query(question)

        # Similarity search
        docs = self.vectorstore.similarity_search(question, k=k)
        return docs

    def query_with_scores(self, question: str, k: int = 4):
        """
        Retrieve documents along with similarity scores

        Args:
            question: User question
            k: Number of documents to return

        Returns:
            list[tuple]: (document, similarity score) pairs
        """
        docs_with_scores = self.vectorstore.similarity_search_with_score(
            question, k=k
        )

        print("Retrieval Results (Similarity Scores):\n")
        for i, (doc, score) in enumerate(docs_with_scores, 1):
            print(f"{i}. Score: {score:.4f}")
            print(f"   Source: {doc.metadata.get('source', 'unknown')}")
            print(f"   Content: {doc.page_content[:150]}...\n")

        return docs_with_scores


class SimpleRAGSystem:
    """Simplified RAG system (without a LangChain chain)"""

    def __init__(self, vectorstore_path="./vectorstore"):
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.vectorstore = Chroma(
            persist_directory=vectorstore_path,
            embedding_function=self.embeddings
        )
        self.llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    def query(self, question: str, k: int = 4):
        """
        Manually implemented RAG workflow

        Args:
            question: User question
            k: Number of documents to retrieve
        """
        print(f"Question: {question}\n")

        # 1. Retrieve relevant documents
        print("Step 1: Retrieving relevant documents...")
        relevant_docs = self.vectorstore.similarity_search(question, k=k)
        print(f"   Found {len(relevant_docs)} relevant document chunks\n")

        # 2. Build context
        print("Step 2: Building context...")
        context = "\n\n".join([
            f"[Document {i+1}]\n{doc.page_content}"
            for i, doc in enumerate(relevant_docs)
        ])
        print(f"   Context length: {len(context)} characters\n")

        # 3. Build prompt
        print("Step 3: Generating answer...")
        prompt = f"""Answer the user's question based on the following context information.
If you don't know the answer, say you don't know.

Context information:
{context}

Question: {question}

Please provide an accurate and detailed answer:"""

        # 4. Call the LLM
        response = self.llm.invoke(prompt)
        answer = response.content
        print(f"Answer:\n{answer}\n")

        # 5. Show sources
        print("Reference Sources:")
        for i, doc in enumerate(relevant_docs, 1):
            source = doc.metadata.get("source", "unknown")
            print(f"  {i}. {source}\n")

        return {
            "answer": answer,
            "sources": relevant_docs,
            "context": context
        }


def main():
    """Main function"""
    print("=" * 60)
    print("RAG Query System")
    print("=" * 60 + "\n")

    # Check that the vector database exists
    if not os.path.exists("./vectorstore"):
        print("Vector database does not exist! Please run index.py first to create the index.")
        return

    # Method 1: LangChain RetrievalQA chain (recommended)
    print("Method 1: Using LangChain RetrievalQA Chain\n")
    query_engine = RAGQueryEngine(vectorstore_path="./vectorstore")

    # Example questions
    questions = [
        "What are the advantages of RAG systems?",
        "What steps are included in document indexing?",
    ]
    for question in questions:
        query_engine.query(question)
        print("-" * 60 + "\n")

    # Method 2: Manual implementation (more flexible)
    print("\n" + "=" * 60)
    print("Method 2: Manual RAG Workflow Implementation\n")
    simple_rag = SimpleRAGSystem(vectorstore_path="./vectorstore")
    simple_rag.query("How to optimize RAG system retrieval quality?")

    # Method 3: Inspect retrieval results and scores
    print("\n" + "=" * 60)
    print("Method 3: View Retrieval Similarity Scores\n")
    query_engine.query_with_scores("What is the role of vector databases?", k=3)


if __name__ == "__main__":
    main()

Breaking Down the Code

1. Question Vectorization and Retrieval

You must use the same embedding model for questions as you used for indexing. If you don't, the vectors won't be in the same space, and your similarity search will fail.

# Use the same embedding model as indexing phase
question_embedding = self.embeddings.embed_query(question)

# Similarity search
docs = self.vectorstore.similarity_search(question, k=4)
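One cheap safeguard is to record which model built the index and verify it before querying. This is a sketch under an assumption: the JSON sidecar file (`MODEL_FILE`) is invented for illustration, not something Chroma provides.

```python
import json
import os

# Hypothetical sidecar file recording which embedding model built the index
MODEL_FILE = "./vectorstore_model.json"

def save_model_name(name: str):
    """Call this from the indexing script after building the index."""
    with open(MODEL_FILE, "w") as f:
        json.dump({"embedding_model": name}, f)

def check_model_name(name: str):
    """Call this at query time; raises if the models don't match."""
    if not os.path.exists(MODEL_FILE):
        return  # nothing recorded; can't verify
    with open(MODEL_FILE) as f:
        recorded = json.load(f)["embedding_model"]
    if recorded != name:
        raise ValueError(
            f"Index built with {recorded!r} but querying with {name!r}; "
            "the vectors will not live in the same space."
        )

save_model_name("text-embedding-3-small")
check_model_name("text-embedding-3-small")  # silent when consistent
```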

We're using vector search, which uses embeddings and semantic distance to find chunks that are conceptually similar to the user's question. This is semantic search: it understands meaning, not just keywords.

There's also BM25, a keyword-based algorithm that ranks chunks based on term frequency. BM25 is great for exact keyword matches, but it won't recognize that "furry feline companion" means "cat." Hybrid search combines both approaches for better results.
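One common way to combine the two rankings is Reciprocal Rank Fusion (RRF). Here's a minimal pure-Python sketch; the document IDs and rankings are made up for illustration:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc IDs; higher fused score = better.

    Each doc contributes 1/(k + rank) per list it appears in, so docs
    ranked well by BOTH retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_cats", "doc_dogs", "doc_birds"]      # keyword hits
vector_ranking = ["doc_felines", "doc_cats", "doc_dogs"]  # semantic hits

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
# "doc_cats" wins: it appears near the top of both lists
```

The constant `k=60` is the conventional default from the original RRF paper; it damps the advantage of the very top positions.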

The k=4 means we're getting the top 4 most similar documents. This is a starting point. For simple questions, 2-3 documents might be enough. For complex questions, you might want 5-8. But more isn't always better: too many documents can confuse the model or hit context limits.

2. Context Building

Once you've retrieved the relevant documents, combine them into a context:

context = "\n\n".join([
    f"[Document {i+1}]\n{doc.page_content}"
    for i, doc in enumerate(relevant_docs)
])

I'm using what LangChain calls the "stuff" strategy: putting all the documents together in one context. This works fine when you have a small number of documents, but if you're retrieving a lot of chunks, you might hit context limits.

There are other strategies. Map-Reduce processes each chunk separately and then combines the results. Refine iteratively improves the answer by going through chunks one by one. For most cases, the simple "stuff" approach works fine.
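A small refinement to "stuff" is to enforce a budget while assembling the context, dropping the lowest-ranked chunks instead of overflowing the window. A sketch, using a character budget as a crude stand-in for token counting (the budget and chunk contents are hypothetical):

```python
def build_context(chunks, max_chars=2000):
    """Join retrieved chunks (best first) until the budget is hit.

    Since chunks arrive ranked by similarity, stopping early drops
    the least relevant ones.
    """
    parts, used = [], 0
    for i, text in enumerate(chunks):
        block = f"[Document {i+1}]\n{text}"
        if used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block) + 2  # account for the "\n\n" separator

    return "\n\n".join(parts)

# Three 800-character chunks against a 2000-character budget:
# the third chunk no longer fits and is dropped.
chunks = ["A" * 800, "B" * 800, "C" * 800]
context = build_context(chunks, max_chars=2000)
```

In production you'd count tokens with the model's tokenizer rather than characters, but the shape of the logic is the same.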

3. RAG Prompt Engineering

The prompt makes a huge difference:

prompt_template = """Answer the user's question based on the following context information.
If you don't know the answer, say you don't know, don't make up an answer.

Context information:
{context}

Question: {question}

Please provide an accurate and detailed answer, and cite specific information from the context as much as possible."""

Key points: First, explicitly tell the model to base its answer on the context. Without this, models sometimes ignore the context and answer from training data. Second, tell it to say "I don't know" if the context doesn't contain the answer. This reduces hallucinations. Third, ask it to cite specific information for traceability.
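Wrapping the template in a small helper makes those constraints explicit and guards against a failure mode worth catching early: sending the LLM an empty context (which practically invites an answer from training data). The helper name is my own, not part of the code above:

```python
RAG_TEMPLATE = """Answer the user's question based on the following context information.
If you don't know the answer, say you don't know, don't make up an answer.

Context information:
{context}

Question: {question}

Please provide an accurate and detailed answer, and cite specific information from the context as much as possible."""

def build_rag_prompt(context: str, question: str) -> str:
    """Fill the template; an empty context should never reach the LLM."""
    if not context.strip():
        raise ValueError("No context retrieved; skip the LLM call.")
    return RAG_TEMPLATE.format(context=context, question=question)

prompt = build_rag_prompt(
    "RAG combines retrieval with generation.",
    "What is RAG?",
)
```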

4. LLM Invocation

For the LLM, I'm using GPT-4o-mini from OpenAI:

self.llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,  # Reduce randomness
)

I set temperature to 0 for consistent, factual answers. For Q&A systems, you usually want consistency.

gpt-4o-mini is a good balance. It's cheaper than GPT-4 but still gives good results. GPT-4 is more accurate if you need the best quality. GPT-3.5-turbo is the cheapest and fastest, but quality isn't quite as good. For most use cases, start with gpt-4o-mini and upgrade if needed.

Complete System Integration

Create rag_system.py to integrate indexing and querying:

import os

from index import RAGIndexer
from query import RAGQueryEngine


class CompleteRAGSystem:
    """Complete RAG System"""

    def __init__(self, vectorstore_path="./vectorstore"):
        self.vectorstore_path = vectorstore_path
        self.indexer = None
        self.query_engine = None

    def index_documents(self, documents_directory, recreate=False):
        """Index documents"""
        self.indexer = RAGIndexer(
            persist_directory=self.vectorstore_path
        )
        self.indexer.index(documents_directory, recreate=recreate)

    def initialize_query_engine(self):
        """Initialize the query engine"""
        if not os.path.exists(self.vectorstore_path):
            raise ValueError("Vector database does not exist, please index first!")
        self.query_engine = RAGQueryEngine(
            vectorstore_path=self.vectorstore_path
        )

    def ask(self, question: str):
        """Ask a question"""
        if not self.query_engine:
            self.initialize_query_engine()
        return self.query_engine.query(question)


def main():
    """Complete example"""
    rag = CompleteRAGSystem()

    # 1. Index documents (if not already done)
    if not os.path.exists("./vectorstore"):
        print("Starting document indexing...\n")
        rag.index_documents("./documents", recreate=True)

    # 2. Initialize the query engine
    print("\nInitializing query engine...\n")
    rag.initialize_query_engine()

    # 3. Interactive Q&A
    print("=" * 60)
    print("Start Q&A (type 'quit' to exit)")
    print("=" * 60 + "\n")

    while True:
        question = input("Your question: ").strip()
        if question.lower() in ['quit', 'exit', 'q']:
            print("Goodbye!")
            break
        if not question:
            continue
        print()
        rag.ask(question)
        print("-" * 60 + "\n")


if __name__ == "__main__":
    main()

Running Example

1. Ensure Indexing is Complete

# If not yet indexed, run first
python index.py

2. Run Query System

python query.py

Output Example:

============================================================
RAG Query System
============================================================

Method 1: Using LangChain RetrievalQA Chain

Question: What are the advantages of RAG systems?

Answer:
The main advantages of RAG systems include:

1. **Up-to-date Knowledge**: Knowledge bases can be updated quickly without retraining models
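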
2. **Explainability**: Answers can be traced back to specific document sources
3. **Cost-effectiveness**: Lower cost and faster implementation compared to fine-tuning
4. **Flexibility**: Can switch knowledge bases for different domains
5. **Accuracy**: Based on real documents, reducing hallucination problems

Reference Sources:
  1. ./documents/rag_intro.md
     Content preview: What are the advantages of RAG systems? Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with generative models...

------------------------------------------------------------

3. Interactive Q&A

python rag_system.py
Start Q&A (type 'quit' to exit)
============================================================

Your question: What is RAG?

Answer:
RAG (Retrieval-Augmented Generation) is a technique that combines external knowledge bases with large language models...

Reference Sources:
  1. ./documents/rag_intro.md
  2. ./documents/rag_architecture.md

------------------------------------------------------------

Your question: quit
Goodbye!

Optimization Tips

1. Adjust Retrieval Count

# Dynamically adjust based on question complexity
def adaptive_retrieval(self, question, base_k=4):
    # Complex questions need more context
    if len(question.split()) > 10:
        k = base_k * 2
    else:
        k = base_k
    return self.vectorstore.similarity_search(question, k=k)

2. Re-ranking

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Use LLM to extract most relevant parts
compressor = LLMChainExtractor.from_llm(self.llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=self.retriever
)

3. Streaming Output

def query_stream(self, question: str):
    """Stream the answer token by token from the LLM"""
    # RetrievalQA's stream() yields whole result dicts, not tokens,
    # so retrieve manually and stream the LLM call directly
    docs = self.vectorstore.similarity_search(question, k=4)
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = f"Answer based on the context.\n\nContext:\n{context}\n\nQuestion: {question}"
    for chunk in self.llm.stream(prompt):
        print(chunk.content, end="", flush=True)

4. Multi-turn Conversation

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

# RetrievalQA is single-turn; ConversationalRetrievalChain threads
# chat history through the retrieval step
qa_chain = ConversationalRetrievalChain.from_llm(
    llm=self.llm,
    retriever=self.retriever,
    memory=memory
)

Common Issues

1. Inaccurate Answers

Possible Causes:

  • Retrieved documents are irrelevant
  • Poor prompt design
  • Context too long causing information loss

Solutions:

  • Adjust retrieval count k
  • Optimize prompt, clarify requirements
  • Use re-ranking to improve relevance

2. Answers Contain Hallucinations

Solutions:

  • Emphasize "based on context" in prompt
  • Add "if you don't know, say you don't know"
  • Use temperature=0 to reduce randomness

3. Slow Response Time

Optimization:

  • Use faster models (gpt-3.5-turbo)
  • Reduce retrieval count
  • Use async calls
  • Cache common questions
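Caching can be as simple as a dictionary keyed on a normalized question, so trivially different phrasings hit the same entry. A sketch, with `answer_fn` standing in for the real RAG call:

```python
def make_cached_query(answer_fn):
    """Wrap an answer function so repeated questions skip the LLM."""
    cache = {}

    def query(question: str) -> str:
        # Normalize case and whitespace so near-duplicates share a key
        key = " ".join(question.lower().split())
        if key not in cache:
            cache[key] = answer_fn(question)  # only call the LLM on a miss
        return cache[key]

    return query

calls = []
def fake_answer(q):  # stand-in for the real RAG pipeline
    calls.append(q)
    return f"answer to: {q}"

cached = make_cached_query(fake_answer)
cached("What is RAG?")
cached("what is  rag?")  # normalizes to the same key: no second call
```

For higher hit rates, some systems cache on the question's embedding ("semantic caching") so paraphrases also match, at the cost of an extra similarity check per query.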

4. High Cost

Optimization:

  • Use cheaper models (gpt-4o-mini)
  • Reduce context length
  • Limit retrieval count
  • Use local open-source models

Evaluating RAG Systems

1. Retrieval Quality Evaluation

def evaluate_retrieval(self, questions, expected_sources):
    """Hit rate: fraction of questions whose top-k results
    include at least one expected source"""
    correct = 0
    for question, expected in zip(questions, expected_sources):
        docs = self.vectorstore.similarity_search(question, k=3)
        retrieved_sources = [doc.metadata.get('source', '') for doc in docs]
        if any(src in retrieved_sources for src in expected):
            correct += 1
    return correct / len(questions)

2. Answer Quality Evaluation

  • Relevance: Does the answer address the question?
  • Accuracy: Is the answer based on document content?
  • Completeness: Does the answer contain sufficient information?
  • Traceability: Can sources be found?
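The accuracy criterion can be roughly automated with a word-overlap "groundedness" score: what fraction of the answer's words also appear in the retrieved context? This is a crude baseline I'm sketching here, not a standard metric; LLM-as-judge evaluations are far more robust.

```python
import re

def groundedness(answer: str, context: str) -> float:
    """Fraction of distinct answer words that also appear in the context."""
    tokenize = lambda s: set(re.findall(r"[a-z0-9]+", s.lower()))
    answer_words = tokenize(answer)
    if not answer_words:
        return 0.0
    return len(answer_words & tokenize(context)) / len(answer_words)

score = groundedness(
    "RAG combines retrieval with generation.",
    "RAG is a technique that combines information retrieval with generative models.",
)
# 4 of the 5 answer words ("rag", "combines", "retrieval", "with")
# appear in the context, so the score is 0.8
```

A low score flags answers that may be drawing on training data rather than the documents, which is exactly the hallucination case the prompt tries to prevent.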

What We've Built

We've built a complete RAG system with both the indexing phase and the query phase. The system can take raw documents, index them, and then answer questions based on that knowledge.

Here's what we've covered:

  • Converting questions to vectors and searching for similar documents
  • Building context from retrieved documents
  • Crafting prompts that get good answers from the LLM
  • Putting it all together into a working system

You now have something you can use. You can index your own documents, ask questions, and get answers with source citations.

This is a starting point, not a finished product. You'll need to tune things: chunk size, number of retrieved documents, prompt refinement. That's normal. Building RAG systems is iterative: you build something, test it, see what breaks, fix it, and repeat.


This is the third article in the RAG System series. Previous: Building RAG Systems (Part 1) | Next: RAG Applications, Challenges, and Advanced Patterns | Series Index: RAG System Fundamentals