
Building RAG Systems (Part 2): Retrieval and Answer Generation

2026-02-07 · 5 min read, 12 min code

In the previous article, we set up the indexing system. We loaded documents, split them into chunks, converted them to vectors, and stored everything in Chroma. Now we need to use that indexed data to answer questions.

Here's what we're building: when someone asks a question, we'll convert it to a vector, search our database for similar documents, pull those documents together as context, and then ask an LLM to generate an answer based on that context.

What We're Starting With

From the indexing phase, we have:

  • A Chroma database full of vectorized document chunks
  • An embedding model that can convert text to vectors
  • A searchable database that can find similar documents

What we need to build:

  • A way to convert questions into vectors (using the same model)
  • Similarity search to find relevant documents
  • Context assembly to combine those documents
  • Prompt engineering to get good answers from the LLM
  • The whole thing wired together into a working system

Query Phase Architecture

Question → vectorize (same embedding model as indexing) → similarity search → assemble context → LLM generation → answer with sources

Complete Code Implementation

Query Engine Implementation

Create query.py:

import os

from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

load_dotenv()


class RAGQueryEngine:
    """RAG System Query Engine"""

    def __init__(self, vectorstore_path="./vectorstore"):
        """
        Initialize query engine

        Args:
            vectorstore_path: Vector database path
        """
        # Embedding model (must match the indexing phase)
        self.embeddings = OpenAIEmbeddings(
            model="text-embedding-3-small"
        )

        # Load vector database
        self.vectorstore = Chroma(
            persist_directory=vectorstore_path,
            embedding_function=self.embeddings
        )

        # LLM
        self.llm = ChatOpenAI(
            model="gpt-4o-mini",  # or gpt-4, gpt-3.5-turbo
            temperature=0,        # reduce randomness, improve accuracy
        )

        self.retriever = None
        self.qa_chain = None
        self._setup_retriever()
        self._setup_qa_chain()

    def _setup_retriever(self):
        """Set up the retriever"""
        # Return the top-k most relevant document chunks
        self.retriever = self.vectorstore.as_retriever(
            search_type="similarity",  # similarity search
            search_kwargs={
                "k": 4  # return the 4 most relevant document chunks
            }
        )

    def _setup_qa_chain(self):
        """Set up the Q&A chain"""
        # Custom prompt template
        prompt_template = """Answer the user's question based on the following context information.
If you don't know the answer, say you don't know, don't make up an answer.

Context information:
{context}

Question: {question}

Please provide an accurate and detailed answer, and cite specific information from the context as much as possible."""

        PROMPT = PromptTemplate(
            template=prompt_template,
            input_variables=["context", "question"]
        )

        # Create the Q&A chain
        self.qa_chain = RetrievalQA.from_chain_type(
            llm=self.llm,
            chain_type="stuff",  # put all document chunks into one context
            retriever=self.retriever,
            chain_type_kwargs={"prompt": PROMPT},
            return_source_documents=True,  # return source documents
        )

    def query(self, question: str, return_sources: bool = True):
        """
        Query and generate an answer

        Args:
            question: User question
            return_sources: Whether to print source documents

        Returns:
            dict: Answer and source documents
        """
        print(f"Question: {question}\n")

        # Execute the query
        result = self.qa_chain.invoke({"query": question})

        answer = result["result"]
        source_documents = result.get("source_documents", [])

        print(f"Answer:\n{answer}\n")

        if return_sources and source_documents:
            print("Reference Sources:")
            for i, doc in enumerate(source_documents, 1):
                source = doc.metadata.get("source", "unknown")
                print(f"  {i}. {source}")
                print(f"     Content preview: {doc.page_content[:100]}...\n")

        return {
            "answer": answer,
            "sources": source_documents
        }

    def query_with_similarity_search(self, question: str, k: int = 4):
        """
        Similarity search only, no answer generation

        Args:
            question: User question
            k: Number of documents to return

        Returns:
            list: Relevant documents
        """
        # Shown for illustration: similarity_search embeds the query
        # internally with the same embedding model
        question_embedding = self.embeddings.embed_query(question)

        # Similarity search
        docs = self.vectorstore.similarity_search(question, k=k)
        return docs

    def query_with_scores(self, question: str, k: int = 4):
        """
        Retrieve documents along with similarity scores

        Args:
            question: User question
            k: Number of documents to return

        Returns:
            list[tuple]: (document, similarity score) pairs
        """
        docs_with_scores = self.vectorstore.similarity_search_with_score(
            question, k=k
        )

        print("Retrieval Results (Similarity Scores):\n")
        for i, (doc, score) in enumerate(docs_with_scores, 1):
            print(f"{i}. Score: {score:.4f}")
            print(f"   Source: {doc.metadata.get('source', 'unknown')}")
            print(f"   Content: {doc.page_content[:150]}...\n")

        return docs_with_scores


class SimpleRAGSystem:
    """Simplified RAG system (without a LangChain chain)"""

    def __init__(self, vectorstore_path="./vectorstore"):
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.vectorstore = Chroma(
            persist_directory=vectorstore_path,
            embedding_function=self.embeddings
        )
        self.llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    def query(self, question: str, k: int = 4):
        """
        Manually implemented RAG workflow

        Args:
            question: User question
            k: Number of documents to retrieve
        """
        print(f"Question: {question}\n")

        # 1. Retrieve relevant documents
        print("Step 1: Retrieving relevant documents...")
        relevant_docs = self.vectorstore.similarity_search(question, k=k)
        print(f"   Found {len(relevant_docs)} relevant document chunks\n")

        # 2. Build context
        print("Step 2: Building context...")
        context = "\n\n".join([
            f"[Document {i+1}]\n{doc.page_content}"
            for i, doc in enumerate(relevant_docs)
        ])
        print(f"   Context length: {len(context)} characters\n")

        # 3. Build prompt
        print("Step 3: Generating answer...")
        prompt = f"""Answer the user's question based on the following context information.
If you don't know the answer, say you don't know.

Context information:
{context}

Question: {question}

Please provide an accurate and detailed answer:"""

        # 4. Call the LLM
        response = self.llm.invoke(prompt)
        answer = response.content
        print(f"Answer:\n{answer}\n")

        # 5. Show sources
        print("Reference Sources:")
        for i, doc in enumerate(relevant_docs, 1):
            source = doc.metadata.get("source", "unknown")
            print(f"  {i}. {source}\n")

        return {
            "answer": answer,
            "sources": relevant_docs,
            "context": context
        }


def main():
    """Main function"""
    print("=" * 60)
    print("RAG Query System")
    print("=" * 60 + "\n")

    # Check that the vector database exists
    if not os.path.exists("./vectorstore"):
        print("Vector database does not exist! Please run index.py first to create the index.")
        return

    # Method 1: LangChain RetrievalQA chain (recommended)
    print("Method 1: Using LangChain RetrievalQA Chain\n")
    query_engine = RAGQueryEngine(vectorstore_path="./vectorstore")

    # Example questions
    questions = [
        "What are the advantages of RAG systems?",
        "What steps are included in document indexing?",
    ]
    for question in questions:
        query_engine.query(question)
        print("-" * 60 + "\n")

    # Method 2: Manual implementation (more flexible)
    print("\n" + "=" * 60)
    print("Method 2: Manual RAG Workflow Implementation\n")
    simple_rag = SimpleRAGSystem(vectorstore_path="./vectorstore")
    simple_rag.query("How to optimize RAG system retrieval quality?")

    # Method 3: Inspect retrieval results and scores
    print("\n" + "=" * 60)
    print("Method 3: View Retrieval Similarity Scores\n")
    query_engine.query_with_scores("What is the role of vector databases?", k=3)


if __name__ == "__main__":
    main()

Breaking Down the Code

1. Question Vectorization and Retrieval

You must use the same embedding model for questions as you used for indexing. If you don't, the vectors won't be in the same space, and your similarity search will fail.

# Use the same embedding model as indexing phase
question_embedding = self.embeddings.embed_query(question)

# Similarity search
docs = self.vectorstore.similarity_search(question, k=4)
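One cheap safeguard is to record which model built the index and verify it before querying. This is a sketch under an assumption: the JSON sidecar file (`MODEL_FILE`) is invented for illustration, not something Chroma provides.

```python
import json
import os

# Hypothetical sidecar file recording which embedding model built the index
MODEL_FILE = "./vectorstore_model.json"

def save_model_name(name: str):
    """Call this from the indexing script after building the index."""
    with open(MODEL_FILE, "w") as f:
        json.dump({"embedding_model": name}, f)

def check_model_name(name: str):
    """Call this at query time; raises if the models don't match."""
    if not os.path.exists(MODEL_FILE):
        return  # nothing recorded; can't verify
    with open(MODEL_FILE) as f:
        recorded = json.load(f)["embedding_model"]
    if recorded != name:
        raise ValueError(
            f"Index built with {recorded!r} but querying with {name!r}; "
            "the vectors will not live in the same space."
        )

save_model_name("text-embedding-3-small")
check_model_name("text-embedding-3-small")  # silent when consistent
```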

We're using vector search, which uses embeddings and semantic distance to find chunks that are conceptually similar to the user's question. This is semantic search: it understands meaning, not just keywords.

There's also BM25, a keyword-based algorithm that ranks chunks based on term frequency. BM25 is great for exact keyword matches, but it won't recognize that "furry feline companion" means "cat." Hybrid search combines both approaches for better results.
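One common way to combine the two rankings is Reciprocal Rank Fusion (RRF). Here's a minimal pure-Python sketch; the document IDs and rankings are made up for illustration:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc IDs; higher fused score = better.

    Each doc contributes 1/(k + rank) per list it appears in, so docs
    ranked well by BOTH retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_cats", "doc_dogs", "doc_birds"]      # keyword hits
vector_ranking = ["doc_felines", "doc_cats", "doc_dogs"]  # semantic hits

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
# "doc_cats" wins: it appears near the top of both lists
```

The constant `k=60` is the conventional default from the original RRF paper; it damps the advantage of the very top positions.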

The k=4 means we're getting the top 4 most similar documents. This is a starting point. For simple questions, 2-3 documents might be enough. For complex questions, you might want 5-8. But more isn't always better: too many documents can confuse the model or hit context limits.

2. Context Building

Once you've retrieved the relevant documents, combine them into a context:

context = "\n\n".join([
    f"[Document {i+1}]\n{doc.page_content}"
    for i, doc in enumerate(relevant_docs)
])

I'm using what LangChain calls the "stuff" strategy: putting all the documents together in one context. This works fine when you have a small number of documents, but if you're retrieving a lot of chunks, you might hit context limits.

There are other strategies. Map-Reduce processes each chunk separately and then combines the results. Refine iteratively improves the answer by going through chunks one by one. For most cases, the simple "stuff" approach works fine.
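A small refinement to "stuff" is to enforce a budget while assembling the context, dropping the lowest-ranked chunks instead of overflowing the window. A sketch, using a character budget as a crude stand-in for token counting (the budget and chunk contents are hypothetical):

```python
def build_context(chunks, max_chars=2000):
    """Join retrieved chunks (best first) until the budget is hit.

    Since chunks arrive ranked by similarity, stopping early drops
    the least relevant ones.
    """
    parts, used = [], 0
    for i, text in enumerate(chunks):
        block = f"[Document {i+1}]\n{text}"
        if used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block) + 2  # account for the "\n\n" separator

    return "\n\n".join(parts)

# Three 800-character chunks against a 2000-character budget:
# the third chunk no longer fits and is dropped.
chunks = ["A" * 800, "B" * 800, "C" * 800]
context = build_context(chunks, max_chars=2000)
```

In production you'd count tokens with the model's tokenizer rather than characters, but the shape of the logic is the same.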

3. RAG Prompt Engineering

The prompt makes a huge difference:

prompt_template = """Answer the user's question based on the following context information.
If you don't know the answer, say you don't know, don't make up an answer.

Context information:
{context}

Question: {question}

Please provide an accurate and detailed answer, and cite specific information from the context as much as possible."""

Key points: First, explicitly tell the model to base its answer on the context. Without this, models sometimes ignore the context and answer from training data. Second, tell it to say "I don't know" if the context doesn't contain the answer. This reduces hallucinations. Third, ask it to cite specific information for traceability.
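Wrapping the template in a small helper makes those constraints explicit and guards against a failure mode worth catching early: sending the LLM an empty context (which practically invites an answer from training data). The helper name is my own, not part of the code above:

```python
RAG_TEMPLATE = """Answer the user's question based on the following context information.
If you don't know the answer, say you don't know, don't make up an answer.

Context information:
{context}

Question: {question}

Please provide an accurate and detailed answer, and cite specific information from the context as much as possible."""

def build_rag_prompt(context: str, question: str) -> str:
    """Fill the template; an empty context should never reach the LLM."""
    if not context.strip():
        raise ValueError("No context retrieved; skip the LLM call.")
    return RAG_TEMPLATE.format(context=context, question=question)

prompt = build_rag_prompt(
    "RAG combines retrieval with generation.",
    "What is RAG?",
)
```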

4. LLM Invocation

For the LLM, I'm using GPT-4o-mini from OpenAI:

self.llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,  # Reduce randomness
)

I set temperature to 0 for consistent, factual answers. For Q&A systems, you usually want consistency.

gpt-4o-mini is a good balance. It's cheaper than GPT-4 but still gives good results. GPT-4 is more accurate if you need the best quality. GPT-3.5-turbo is the cheapest and fastest, but quality isn't quite as good. For most use cases, start with gpt-4o-mini and upgrade if needed.

Complete System Integration

Create rag_system.py to integrate indexing and querying:

import os

from index import RAGIndexer
from query import RAGQueryEngine


class CompleteRAGSystem:
    """Complete RAG System"""

    def __init__(self, vectorstore_path="./vectorstore"):
        self.vectorstore_path = vectorstore_path
        self.indexer = None
        self.query_engine = None

    def index_documents(self, documents_directory, recreate=False):
        """Index documents"""
        self.indexer = RAGIndexer(
            persist_directory=self.vectorstore_path
        )
        self.indexer.index(documents_directory, recreate=recreate)

    def initialize_query_engine(self):
        """Initialize the query engine"""
        if not os.path.exists(self.vectorstore_path):
            raise ValueError("Vector database does not exist, please index first!")
        self.query_engine = RAGQueryEngine(
            vectorstore_path=self.vectorstore_path
        )

    def ask(self, question: str):
        """Ask a question"""
        if not self.query_engine:
            self.initialize_query_engine()
        return self.query_engine.query(question)


def main():
    """Complete example"""
    rag = CompleteRAGSystem()

    # 1. Index documents (if not already done)
    if not os.path.exists("./vectorstore"):
        print("Starting document indexing...\n")
        rag.index_documents("./documents", recreate=True)

    # 2. Initialize the query engine
    print("\nInitializing query engine...\n")
    rag.initialize_query_engine()

    # 3. Interactive Q&A
    print("=" * 60)
    print("Start Q&A (type 'quit' to exit)")
    print("=" * 60 + "\n")

    while True:
        question = input("Your question: ").strip()
        if question.lower() in ['quit', 'exit', 'q']:
            print("Goodbye!")
            break
        if not question:
            continue
        print()
        rag.ask(question)
        print("-" * 60 + "\n")


if __name__ == "__main__":
    main()

Running Example

1. Ensure Indexing is Complete

# If not yet indexed, run first
python index.py

2. Run Query System

python query.py

Output Example:

============================================================
RAG Query System
============================================================

Method 1: Using LangChain RetrievalQA Chain

Question: What are the advantages of RAG systems?

Answer:
The main advantages of RAG systems include:

1. **Up-to-date Knowledge**: Knowledge bases can be updated quickly without retraining models
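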
2. **Explainability**: Answers can be traced back to specific document sources
3. **Cost-effectiveness**: Lower cost and faster implementation compared to fine-tuning
4. **Flexibility**: Can switch knowledge bases for different domains
5. **Accuracy**: Based on real documents, reducing hallucination problems

Reference Sources:
  1. ./documents/rag_intro.md
     Content preview: What are the advantages of RAG systems? Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with generative models...

------------------------------------------------------------

3. Interactive Q&A

python rag_system.py
Start Q&A (type 'quit' to exit)
============================================================

Your question: What is RAG?

Answer:
RAG (Retrieval-Augmented Generation) is a technique that combines external knowledge bases with large language models...

Reference Sources:
  1. ./documents/rag_intro.md
  2. ./documents/rag_architecture.md

------------------------------------------------------------

Your question: quit
Goodbye!

Optimization Tips

1. Adjust Retrieval Count

# Dynamically adjust based on question complexity
def adaptive_retrieval(self, question, base_k=4):
    # Complex questions need more context
    if len(question.split()) > 10:
        k = base_k * 2
    else:
        k = base_k
    return self.vectorstore.similarity_search(question, k=k)

2. Re-ranking

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Use LLM to extract most relevant parts
compressor = LLMChainExtractor.from_llm(self.llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=self.retriever
)

3. Streaming Output

def query_stream(self, question: str):
    """Stream the answer token by token from the LLM"""
    # RetrievalQA's stream() yields whole result dicts, not tokens,
    # so retrieve manually and stream the LLM call directly
    docs = self.vectorstore.similarity_search(question, k=4)
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = f"Answer based on the context.\n\nContext:\n{context}\n\nQuestion: {question}"
    for chunk in self.llm.stream(prompt):
        print(chunk.content, end="", flush=True)

4. Multi-turn Conversation

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

# RetrievalQA is single-turn; ConversationalRetrievalChain threads
# chat history through the retrieval step
qa_chain = ConversationalRetrievalChain.from_llm(
    llm=self.llm,
    retriever=self.retriever,
    memory=memory
)

Common Issues

1. Inaccurate Answers

Possible Causes:

  • Retrieved documents are irrelevant
  • Poor prompt design
  • Context too long causing information loss

Solutions:

  • Adjust retrieval count k
  • Optimize prompt, clarify requirements
  • Use re-ranking to improve relevance

2. Answers Contain Hallucinations

Solutions:

  • Emphasize "based on context" in prompt
  • Add "if you don't know, say you don't know"
  • Use temperature=0 to reduce randomness

3. Slow Response Time

Optimization:

  • Use faster models (gpt-3.5-turbo)
  • Reduce retrieval count
  • Use async calls
  • Cache common questions
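Caching can be as simple as a dictionary keyed on a normalized question, so trivially different phrasings hit the same entry. A sketch, with `answer_fn` standing in for the real RAG call:

```python
def make_cached_query(answer_fn):
    """Wrap an answer function so repeated questions skip the LLM."""
    cache = {}

    def query(question: str) -> str:
        # Normalize case and whitespace so near-duplicates share a key
        key = " ".join(question.lower().split())
        if key not in cache:
            cache[key] = answer_fn(question)  # only call the LLM on a miss
        return cache[key]

    return query

calls = []
def fake_answer(q):  # stand-in for the real RAG pipeline
    calls.append(q)
    return f"answer to: {q}"

cached = make_cached_query(fake_answer)
cached("What is RAG?")
cached("what is  rag?")  # normalizes to the same key: no second call
```

For higher hit rates, some systems cache on the question's embedding ("semantic caching") so paraphrases also match, at the cost of an extra similarity check per query.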

4. High Cost

Optimization:

  • Use cheaper models (gpt-4o-mini)
  • Reduce context length
  • Limit retrieval count
  • Use local open-source models

Evaluating RAG Systems

1. Retrieval Quality Evaluation

def evaluate_retrieval(self, questions, expected_sources):
    """Hit rate: fraction of questions whose top-k results
    include at least one expected source"""
    correct = 0
    for question, expected in zip(questions, expected_sources):
        docs = self.vectorstore.similarity_search(question, k=3)
        retrieved_sources = [doc.metadata.get('source', '') for doc in docs]
        if any(src in retrieved_sources for src in expected):
            correct += 1
    return correct / len(questions)

2. Answer Quality Evaluation

  • Relevance: Does the answer address the question?
  • Accuracy: Is the answer based on document content?
  • Completeness: Does the answer contain sufficient information?
  • Traceability: Can sources be found?
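The accuracy criterion can be roughly automated with a word-overlap "groundedness" score: what fraction of the answer's words also appear in the retrieved context? This is a crude baseline I'm sketching here, not a standard metric; LLM-as-judge evaluations are far more robust.

```python
import re

def groundedness(answer: str, context: str) -> float:
    """Fraction of distinct answer words that also appear in the context."""
    tokenize = lambda s: set(re.findall(r"[a-z0-9]+", s.lower()))
    answer_words = tokenize(answer)
    if not answer_words:
        return 0.0
    return len(answer_words & tokenize(context)) / len(answer_words)

score = groundedness(
    "RAG combines retrieval with generation.",
    "RAG is a technique that combines information retrieval with generative models.",
)
# 4 of the 5 answer words ("rag", "combines", "retrieval", "with")
# appear in the context, so the score is 0.8
```

A low score flags answers that may be drawing on training data rather than the documents, which is exactly the hallucination case the prompt tries to prevent.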

What We've Built

We've built a complete RAG system with both the indexing phase and the query phase. The system can take raw documents, index them, and then answer questions based on that knowledge.

Here's what we've covered:

  • Converting questions to vectors and searching for similar documents
  • Building context from retrieved documents
  • Crafting prompts that get good answers from the LLM
  • Putting it all together into a working system

You now have something you can use. You can index your own documents, ask questions, and get answers with source citations.

This is a starting point, not a finished product. You'll need to tune things: chunk size, number of retrieved documents, prompt refinement. That's normal. Building RAG systems is iterative: you build something, test it, see what breaks, fix it, and repeat.


This is the third article in the RAG System series. Previous: Building RAG Systems (Part 1) | Next: RAG Applications, Challenges, and Advanced Patterns | Series Index: RAG System Fundamentals