
RAG — Retrieval-Augmented Generation

1. Why this matters

LLMs are stuck with their training cut-off and have no access to your private data. Three options to fix that:

| Option | Cost | When |
| --- | --- | --- |
| Prompt engineering | Low | Tiny knowledge addition (a few hundred tokens) |
| RAG | Medium | Your data is large, changes frequently, needs citations |
| Fine-tuning | High | You need to teach style/behavior, not facts |

For most "I want a chatbot over my docs", "Q&A over our wiki", and "search this codebase semantically" requests, RAG is the right tool.

2. Mental model

RAG = open-book exam for an LLM.

  1. You ask a question.
  2. The system looks up the most relevant pages from your "book" (vector store).
  3. The LLM is given those pages + the question, and produces an answer grounded in the retrieved text.

flowchart LR
    subgraph SG1 [Indexing offline, once]
      L[Load docs] --> S[Split into chunks]
      S --> E1[Embed chunks]
      E1 --> VS[Vector Store]
    end
    subgraph SG2 [Querying online, per request]
      Q[Question] --> E2[Embed query]
      E2 --> R[Retrieve top-k]
      VS -.-> R
      R --> P[Stuff into prompt]
      P --> M[LLM]
      M --> A[Answer + citations]
    end
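
The two stages in the diagram can be sketched without any libraries. This is only a toy illustration: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector store, and the sample chunks are made-up handbook text.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Indexing (offline): embed every chunk once and keep the vectors.
chunks = [
    "New hires receive 15 days of paid time off per year.",   # made-up sample text
    "Expense reports must be filed within 30 days.",
    "The office is closed on public holidays.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Querying (online): embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question: str, k: int = 2) -> str:
    """Stuff the retrieved chunks into a grounded prompt for the LLM."""
    context = "\n\n".join(retrieve(question, k))
    return f"Answer using ONLY the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

top = retrieve("How many days of paid time off do new hires get?", k=1)
print(top)
```

A real system swaps `embed` for a learned embedding model, which matches on meaning rather than shared words, but the index-then-query shape stays the same.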

3. Architecture / Flow

End-to-end LCEL chain shape:

flowchart LR
    Q[Question str] --> PAR[Parallel]
    PAR -->|retriever| C[context: top-k docs]
    PAR -->|passthrough| QQ[question: same str]
    C --> FMT[format_docs join with newlines]
    FMT --> PT[ChatPromptTemplate]
    QQ --> PT
    PT --> M[ChatModel]
    M --> OP[StrOutputParser]
    OP --> ANS[Final answer]

4. Core concepts

  • Indexing pipeline — runs offline. Loader → splitter → embedding → vector store. Re-runs when docs change.
  • Query pipeline — runs per request. Embed query → retrieve → format → prompt → generate → parse.
  • Grounding — telling the LLM "answer only from the context, say 'I don't know' otherwise" — biggest single thing reducing hallucinations.
  • Citations — return the source metadata of retrieved chunks alongside the answer. This builds user trust and makes answers easier to debug.
  • Chunking strategy matters more than model choice for retrieval quality.
  • Eval set — 20–100 question/expected-answer pairs. Run them every time you change anything (chunk size, retriever, model). Without this you're flying blind.
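
As a rough sketch of the eval-set idea, the loop below scores any question-answering function against expected substrings. The eval pairs, the substring check, and `fake_chain` are all hypothetical stand-ins; in practice you would pass something like `chain.invoke` and probably use a stricter grader.

```python
from typing import Callable

# Hypothetical eval set: (question, text the answer is expected to contain).
EVAL_SET = [
    ("How many PTO days do new hires get?", "15"),
    ("When are expense reports due?", "30 days"),
]

def run_eval(answer_fn: Callable[[str], str], eval_set=EVAL_SET) -> float:
    """Return the fraction of questions whose answer contains the expected text."""
    hits = 0
    for question, expected in eval_set:
        answer = answer_fn(question)
        ok = expected.lower() in answer.lower()
        hits += ok
        print(f"{'PASS' if ok else 'FAIL'}  {question}")
    return hits / len(eval_set)

# Stand-in for a real chain so the sketch runs without a model.
def fake_chain(question: str) -> str:
    return "New hires get 15 days." if "PTO" in question else "I don't know."

score = run_eval(fake_chain)
print(f"score: {score:.2f}")
```

Re-running this after each change to chunk size, retriever, or model gives a single number to compare, which is the point of keeping the set small and stable.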

5. Code — minimal working example

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# ----- INDEXING (run once / when docs change) -----
docs = PyPDFLoader("./handbook.pdf").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
).split_documents(docs)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(chunks, embeddings)
vectorstore.save_local("./index")

# ----- QUERYING -----
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

prompt = ChatPromptTemplate.from_template(
    """Answer using ONLY the context below.
    If the answer isn't there, say "I don't know."

    Context:
    {context}

    Question: {question}
    """
)

def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-4o-mini", temperature=0)
    | StrOutputParser()
)

print(chain.invoke("What's our PTO policy?"))
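
To make `chunk_size=1000` and `chunk_overlap=200` concrete, here is a simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers paragraph and sentence boundaries); `sliding_chunks` is a hypothetical helper that only shows how size and overlap interact.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap          # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):    # this window already reached the end
            break
    return chunks

text = "".join(str(i % 10) for i in range(2500))   # 2500 characters of dummy text
parts = sliding_chunks(text)
print([len(p) for p in parts])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of embedding some text twice.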

6. Code — real-world pattern

Production RAG: returns the answer and its citations, retrieves with MMR and a tenant filter, and streams the answer.

from langchain_core.runnables import RunnableParallel, RunnableLambda
from operator import itemgetter

# 1. Better retriever — MMR + filter by tenant
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20, "filter": {"tenant": "acme"}},
)

# 2. Prompt that asks for grounded answers + cites
prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a helpful assistant. Answer using only the provided context. "
     "Quote relevant sentences. If the answer isn't in the context, say so."),
    ("human",
     "Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"),
])

def format_docs(docs):
    parts = []
    for i, d in enumerate(docs, 1):
        src = d.metadata.get("source", "?")
        page = d.metadata.get("page", "?")
        parts.append(f"[{i}] {d.page_content}\n(source: {src}, page {page})")
    return "\n\n".join(parts)

# 3. Build the chain — keep raw docs alongside the answer for citation
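answer_chain = (
    RunnableParallel({
        # The chain input is a dict, so pull out the question string before retrieving.
        "context":  itemgetter("question") | retriever | RunnableLambda(format_docs),
        "question": itemgetter("question"),
    })
    | prompt
    | ChatOpenAI(model="gpt-4o-mini", temperature=0)
    | StrOutputParser()
)

# Note: "sources" runs the retriever a second time with the same question.
rag_chain = RunnableParallel({
    "answer":  answer_chain,
    "sources": itemgetter("question") | retriever,   # raw Documents for UI
})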

# 4. Use it
result = rag_chain.invoke({"question": "How many PTO days do new hires get?"})
print(result["answer"])
for d in result["sources"]:
    print("→", d.metadata.get("source"), d.metadata.get("page"))

# 5. Stream the answer for chat UIs
for chunk in answer_chain.stream({"question": "..."}):
    print(chunk, end="", flush=True)
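
For intuition about what `search_type="mmr"` with `k` and `fetch_k` is doing, here is a small standalone sketch of maximal marginal relevance. It is not LangChain's implementation; the function `mmr`, the toy 2-D vectors, and the default `lambda_mult=0.5` are assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr(query_vec, doc_vecs, k=4, fetch_k=20, lambda_mult=0.5):
    """Return indices of up to k docs, trading relevance against redundancy."""
    # Step 1: shortlist the fetch_k docs most similar to the query.
    candidates = sorted(range(len(doc_vecs)),
                        key=lambda i: cosine(query_vec, doc_vecs[i]),
                        reverse=True)[:fetch_k]
    selected = []
    # Step 2: greedily add the candidate with the best relevance-minus-redundancy score.
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [1.0, 0.0]
docs = [[0.9, 0.1], [0.9, 0.12], [0.8, -0.6]]   # docs 0 and 1 are near-duplicates
picked = mmr(query, docs, k=2)
print(picked)
```

With `lambda_mult=1.0` the redundancy term vanishes and the result is plain top-k by similarity, so the two near-duplicates would both be returned.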

Indexing API for incremental updates (don't re-embed unchanged docs):

from langchain.indexes import SQLRecordManager, index

record_manager = SQLRecordManager("acme/docs", db_url="sqlite:///records.db")
record_manager.create_schema()

index(
    docs_source=chunks,           # output of splitter
    record_manager=record_manager,
    vector_store=vectorstore,
    # "incremental" cleans up outdated chunks of sources that changed;
    # use "full" to also delete chunks whose source disappeared.
    cleanup="incremental",
    source_id_key="source",
)
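
Conceptually, skipping unchanged docs comes down to giving every chunk a stable ID derived from its content and remembering which IDs were already written. The sketch below shows that idea only; `chunk_id` and `plan_update` are hypothetical helpers, not part of the indexing API, and the record manager's actual bookkeeping is more involved.

```python
import hashlib

def chunk_id(source: str, text: str) -> str:
    """Stable ID: the same source and text always hash to the same value."""
    return hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()

def plan_update(seen_ids: set[str], chunks: list[tuple[str, str]]):
    """Split incoming (source, text) chunks into ones to embed and ones to skip."""
    to_embed, skipped = [], []
    for source, text in chunks:
        cid = chunk_id(source, text)
        (skipped if cid in seen_ids else to_embed).append((cid, text))
    return to_embed, skipped

# IDs recorded by a previous indexing run (made-up sample data).
seen = {chunk_id("handbook.pdf", "PTO policy text")}
incoming = [
    ("handbook.pdf", "PTO policy text"),           # unchanged, should be skipped
    ("handbook.pdf", "New remote-work section"),   # new, needs embedding
]
to_embed, skipped = plan_update(seen, incoming)
print(len(to_embed), "to embed,", len(skipped), "skipped")
```

Only the chunks in `to_embed` would be sent to the embedding model, which is where the cost saving comes from.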

7. Common pitfalls

  • Skipping grounding instructions. Without "answer only from context", the model will happily blend retrieved info with its own (often wrong) priors.
  • No eval set. Without one, every tuning decision is vibes. Build one early — even 20 question/answer pairs help.
  • One global RAG for all questions. Some questions don't need retrieval ("hi", "thanks"). Route via a classifier or skip retrieval for very short queries.
  • Stuffing too many chunks into context. Beyond roughly 6–8 chunks, answers tend to get less focused while latency and cost grow. Compress or rerank.
  • Forgetting tenant / permission filters. Multi-tenant RAG must filter on tenant in retrieval — never rely on the LLM to "not mention" other tenants' data.
  • Re-indexing entire corpus on every update. Use the indexing API or upsert with stable IDs.
  • No telemetry. RAG fails subtly — wrong chunk retrieved, weird truncation. Without LangSmith / equivalent tracing, you'll never spot it.
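
The "skip retrieval for very short queries" pitfall can be handled with a cheap pre-check before the chain runs. This is a deliberately naive heuristic: the `PLEASANTRIES` set, the `min_words` cutoff, and the function name `needs_retrieval` are assumptions, and a trained classifier would be more robust.

```python
import re

# Hypothetical list of messages that never need a document lookup.
PLEASANTRIES = {"hi", "hello", "hey", "thanks", "thank you", "bye", "ok"}

def needs_retrieval(query: str, min_words: int = 2) -> bool:
    """Return False for greetings and very short messages, True otherwise."""
    normalized = re.sub(r"[^a-z0-9\s]", "", query.lower()).strip()
    if not normalized or normalized in PLEASANTRIES:
        return False
    return len(normalized.split()) >= min_words

for q in ["hi", "Thanks!", "How many PTO days do new hires get?"]:
    print(repr(q), "->", needs_retrieval(q))
```

When this returns False the application can answer directly with the chat model and skip the embedding call and vector search entirely.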

8. When to use vs not use

| Use RAG when | Don't use RAG when |
| --- | --- |
| Knowledge is private / large / changes | The answer is in the model's training data |
| You need citations | You need creative writing |
| Frequent updates make fine-tuning impractical | You need to teach style/persona — fine-tune instead |
| You want to control what the model can know | Latency is critical (RAG adds ~100–500ms) |

For multi-hop reasoning ("compare X to Y across sources") or agentic loops with retrieval, consider LangGraph over a plain LCEL chain.

9. Cheatsheet

End-to-end import set:

# Loading
from langchain_community.document_loaders import (
    PyPDFLoader, WebBaseLoader, DirectoryLoader,
)

# Splitting
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Embedding + storage
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_chroma import Chroma
from langchain_pinecone import PineconeVectorStore

# Retrieval
from langchain.retrievers import EnsembleRetriever, ContextualCompressionRetriever
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.retrievers import BM25Retriever

# Generation
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

A working starter prompt:

You are a helpful assistant. Answer the user's question using ONLY the provided context.
- If the context doesn't contain the answer, say "I don't know based on the documents."
- Cite source filenames and page numbers when possible.
- Be concise.

Context:
{context}

Question: {question}

Answer:

Defaults that work well in 2025:

| Knob | Default |
| --- | --- |
| Chunk size | 1000 chars |
| Chunk overlap | 200 chars |
| Embedding model | text-embedding-3-small (1536 dims) |
| Vector store | Chroma (local) / Pinecone (prod) |
| Retriever | MMR, k=4, fetch_k=20 |
| Generation model | gpt-4o-mini, temperature=0 |

10. Q&A — recall test

  • Q: What are the seven steps of a RAG pipeline? A: Load → Split → Embed → Store → Retrieve → Prompt → Generate. The first four run offline (indexing); the last three run per query.

  • Q: Why RAG over fine-tuning? A: Cheaper, faster to update, gives citations, doesn't overwrite the model's general capabilities. Fine-tuning is for teaching style or behavior, not facts.

  • Q: Single biggest cause of bad RAG answers? A: Bad retrieval. The LLM only sees what's retrieved — if the right chunk isn't in top-k, the answer is wrong/missing. Improve chunking, try MMR, add re-ranking.

  • Q: How do you handle hallucination in RAG? A: (a) Strong grounding instruction in the prompt ("answer only from context"). (b) Force the model to quote sources. (c) Run an eval set regularly. (d) For high-stakes use cases, add a verifier step.

  • Q: When should retrieval be skipped? A: Short pleasantries ("hi", "thanks"), questions about the model itself, or queries already classified as out-of-scope. Either route via a classifier or set a similarity-score threshold below which you bypass retrieval.
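
The similarity-threshold idea from the last answer might look like the sketch below. It assumes scores are similarities where higher is better (some stores return distances, where lower is better), and the `0.35` cutoff, sample scores, and function names are hypothetical values to tune against your own eval set.

```python
def select_context(scored_chunks, min_score=0.35):
    """Keep chunks scoring at or above min_score, best first."""
    kept = [(text, score) for text, score in scored_chunks if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in kept]

def route(scored_chunks, min_score=0.35):
    """Decide whether to answer with retrieved context or without it."""
    context = select_context(scored_chunks, min_score)
    return ("rag", context) if context else ("no_context", [])

# Made-up (chunk, similarity) pairs as a retriever might return them.
weak = [("Office hours are 9 to 5.", 0.12), ("Parking is free.", 0.08)]
strong = [("Parking is free.", 0.08), ("New hires receive 15 days of PTO.", 0.81)]

print(route(weak))
print(route(strong))
```

On the `"no_context"` branch the application can reply directly or return the "I don't know" message instead of stuffing irrelevant chunks into the prompt.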

Practice

What does this print?

Expected: 5

# RAG pipeline conceptually: query → retrieve → augment → generate
steps = ["embed query", "retrieve", "rerank", "augment prompt", "generate"]
print(len(steps))

Tell the LLM to USE the context (otherwise it may ignore it)

Expected: True

prompt_template = "Question: {question}"      # bug: no instruction to use the retrieved context
good_template = "Use the context to answer.\nContext: {context}\nQuestion: {question}"
print("context" in good_template.lower())

Quiz — Quick check

What you remember

Q1. What problem does RAG primarily solve?

  • Grounds the LLM in up-to-date, citable knowledge — reduces hallucinations
  • Makes LLMs faster
  • Reduces token usage
  • Improves reasoning

Why: LLMs only know what they were trained on. RAG retrieves fresh, specific content (your docs, current data) and adds it to the prompt. Now the LLM answers from real evidence, not vague training memories.

Q2. What's the typical RAG prompt template structure?

  • "Answer the question using the provided context. Context: {context}. Question: {question}."
  • Just the question
  • Just the context
  • The LLM figures it out

Why: The instruction matters. Without "use the context", the LLM may ignore the retrieved chunks and answer from its own training data. Be explicit.

Q3. How do you reduce hallucinations in a RAG system?

  • A combination: better retrieval, explicit instructions to cite, low temperature, and a second (possibly smaller) model checking the answer
  • Use a smaller LLM
  • Increase context length
  • Lower the embedding dimensions

Why: Hallucinations come from multiple sources. Multi-pronged fix: improve retrieval (better embeddings, reranking), prompt for citations, temperature=0, and consider a "fact-check" step that validates the answer against the retrieved context.
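
One very simple form of the "fact-check" step is to verify that anything the answer puts in quotation marks really occurs in the retrieved context. This sketch assumes the prompt asked the model to quote its sources; `unsupported_quotes` is a hypothetical helper, and an exact-match check like this catches only fabricated quotes, not paraphrased errors.

```python
import re

def unsupported_quotes(answer: str, context: str) -> list[str]:
    """Return quoted spans from the answer that do not appear verbatim in the context."""
    def normalize(s: str) -> str:
        # Lowercase and collapse whitespace so line breaks do not cause false alarms.
        return " ".join(s.lower().split())

    ctx = normalize(context)
    quotes = re.findall(r'"([^"]+)"', answer)
    return [q for q in quotes if normalize(q) not in ctx]

context = "New hires receive 15 days of paid time off per year."   # made-up sample text
good = 'The handbook says "15 days of paid time off".'
bad = 'The handbook says "25 days of paid time off".'

print(unsupported_quotes(good, context))
print(unsupported_quotes(bad, context))
```

An empty list means every quote was found; a non-empty list is a signal to retry, fall back to "I don't know", or flag the answer for review.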

Common doubts

RAG vs fine-tuning — which should I do?

RAG first. Cheaper, faster to iterate, lets you update knowledge without retraining. Fine-tuning is for style/format adjustments, very specialized terminology, or when retrieval isn't enough. Most production LLM apps are RAG + light prompting; few need fine-tuning.

How long should the retrieved context be?

Just enough: typically 4–8 chunks of ~500 tokens each, or 2–4K tokens in total. More context can dilute focus and slow generation. If you find yourself needing huge contexts, your retrieval is probably not precise enough.
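
The arithmetic behind that range is easy to check, and the same few lines help when budgeting a prompt. The helpers below are hypothetical, and the 4-characters-per-token ratio is only a rough rule of thumb for English text; real counts depend on the tokenizer.

```python
def approx_tokens(num_chars: int, chars_per_token: int = 4) -> int:
    """Rough token estimate from a character count (heuristic, not exact)."""
    return num_chars // chars_per_token

def context_tokens(k: int, tokens_per_chunk: int) -> int:
    """Total tokens taken up by k retrieved chunks."""
    return k * tokens_per_chunk

def max_chunks(budget_tokens: int, tokens_per_chunk: int, reserved_tokens: int = 0) -> int:
    """How many chunks fit after reserving room for instructions and the question."""
    return max(0, (budget_tokens - reserved_tokens) // tokens_per_chunk)

low = context_tokens(4, 500)    # 4 chunks of ~500 tokens
high = context_tokens(8, 500)   # 8 chunks of ~500 tokens
print(low, high)
print(approx_tokens(1000))      # a 1000-character chunk, estimated
print(max_chunks(4000, 500, reserved_tokens=500))
```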

How do I cite sources in the LLM response?

Tell the LLM to. Include source metadata in the chunks (e.g., [Source: filename, page 12]) and instruct: "Cite the source for each claim using [Source: ...] format." Post-process to convert these into clickable links in your UI.
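
The post-processing step could be a small regex pass over the model's answer. This sketch assumes citations follow the `[Source: filename, page N]` shape described above; `linkify` and the `base_url` value are hypothetical, so adapt both to wherever your files are actually served.

```python
import html
import re
from urllib.parse import quote

# Matches e.g. "[Source: handbook.pdf, page 12]" and captures filename and page.
CITATION = re.compile(r"\[Source:\s*([^,\]]+?)\s*,\s*page\s+(\d+)\]")

def linkify(answer: str, base_url: str = "https://docs.example.com/files/") -> str:
    """Replace each citation marker with an HTML link to the cited page."""
    def to_link(match: re.Match) -> str:
        filename, page = match.group(1), match.group(2)
        url = f"{base_url}{quote(filename)}#page={page}"
        return f'<a href="{url}">{html.escape(filename)}, p. {page}</a>'
    return CITATION.sub(to_link, answer)

linked = linkify("New hires get 15 days [Source: handbook.pdf, page 12].")
print(linked)
```

Text that does not match the pattern is left untouched, so a malformed citation stays visible instead of silently disappearing.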