SECTION 10 LAB CORE

Retrieval-Augmented Generation

An LLM's knowledge is frozen at training time. RAG breaks that constraint: instead of relying on weights alone, the agent retrieves relevant documents at query time and injects them into the context before generating. This section covers the full pipeline — chunking, embedding, vector search, and reranking — and the lab builds a working RAG system from scratch using ChromaDB and the Anthropic API.

01 · RAG ARCHITECTURE

Retrieve, Augment, Generate

Retrieval-Augmented Generation (RAG) was formalized by Lewis et al. (2020) as a method for grounding LLM outputs in a document corpus. The core idea: at inference time, retrieve the most relevant documents for the user's query, inject them into the prompt as context, and then let the LLM generate a response that is grounded in that retrieved evidence — rather than relying purely on parametric knowledge baked into the weights during training.

PHASE 1
🔍
Retrieval
Convert the query to an embedding, search the vector store, return the top-k most similar chunks.
PHASE 2
📎
Augmentation
Assemble retrieved chunks into a context block, inject into the prompt alongside the original query.
PHASE 3
Generation
The LLM generates a response grounded in the retrieved context. The response can cite specific chunks.
RAG vs fine-tuning: Fine-tuning bakes knowledge into model weights — expensive, requires retraining when knowledge changes, and the model can still confabulate. RAG keeps knowledge in an external store — cheaper to update, auditable (you can log what was retrieved), and grounded (the model can quote its sources). Most production knowledge-retrieval systems use RAG; fine-tuning is reserved for changing behavior, not updating facts.
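The three phases map directly onto three functions. As an illustrative sketch (not the lab script built later in this section), with a word-overlap stub standing in for a real vector store and a placeholder in place of the LLM call:

```python
def retrieve(query: str, store: dict[str, str], top_k: int = 2) -> list[str]:
    """Phase 1 (stub): rank chunks by naive word overlap with the query.
    A real retriever would use embedding similarity instead."""
    def overlap(chunk: str) -> int:
        return len(set(query.lower().split()) & set(chunk.lower().split()))
    return sorted(store.values(), key=overlap, reverse=True)[:top_k]

def augment(query: str, chunks: list[str]) -> str:
    """Phase 2: assemble retrieved chunks into a context block."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"

def generate(prompt: str) -> str:
    """Phase 3 (stub): a real system would call an LLM API here."""
    return f"[LLM response grounded in]\n{prompt}"

store = {
    "c1": "RAG retrieves documents at query time.",
    "c2": "Fine-tuning bakes knowledge into weights.",
}
query = "How does RAG retrieve knowledge?"
answer = generate(augment(query, retrieve(query, store)))
```

The lab in this section fills in the two stubs with ChromaDB and the Anthropic API.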
VARIANT 01
Naive RAG
Index documents → embed query → top-k similarity search → stuff chunks into prompt → generate. Simple but brittle: no reranking, no query reformulation, sensitive to chunk boundaries.
BASELINE (2020–2022)
VARIANT 02
Advanced RAG
Adds query rewriting, hybrid search (dense + BM25), reranking, and context compression. Significantly improves precision and faithfulness at the cost of pipeline complexity.
WIDELY USED (2023–2026)
VARIANT 03
Modular RAG
Decomposes the pipeline into swappable modules: retriever, reranker, filter, generator. Enables A/B testing individual components and mixing retrieval strategies per query type.
EMERGING PATTERN (2024–2026)
VARIANT 04
Agentic RAG
The LLM decides when and what to retrieve, formulates multi-step retrieval plans, and iteratively refines queries based on retrieved context. RAG becomes a tool call in an agent loop.
EMERGING PATTERN (2024–2026)
02 · CHUNKING STRATEGIES

How You Split Documents Determines What You Retrieve

Before indexing, documents must be split into chunks — the units that will be embedded and stored. Chunk size and splitting strategy have an outsized effect on retrieval quality. Too large and chunks are noisy (irrelevant text dilutes the signal). Too small and chunks lose surrounding context needed to interpret them correctly.

Strategy | How It Works | Best For | Trade-off
Fixed-size | Split every N tokens, with optional overlap (e.g., 512 tokens, 50 overlap) | Simple uniform corpora, quick prototype | May cut sentences mid-thought; overlap creates duplication
Recursive character | Split on paragraphs → sentences → words until chunks are under max size | General-purpose; default in LangChain | Still size-based; does not respect semantic structure
Sentence-window | Embed each sentence individually; retrieve with ±N surrounding sentences for context | Precise retrieval + full surrounding context | More chunks; context assembly logic required
Semantic chunking | Embed sentences; split when embedding cosine similarity drops below a threshold | Corpora with topic shifts mid-document | Slower ingestion; variable chunk sizes
Document structure | Split on HTML/Markdown headers, PDF sections, code blocks | Structured docs (wikis, API refs, PDFs) | Requires parser per format; sections vary wildly in size
Chunk size rule of thumb: For question-answering, 256–512 tokens with 10–15% overlap is a reliable starting point. For long-form synthesis (summarization, report generation), 1,024–2,048 tokens per chunk gives the model more complete "units of thought" to work with. Always evaluate on your actual retrieval task — the right chunk size is domain-specific.
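A minimal fixed-size chunker with overlap can be sketched in a few lines, using whitespace-split words as a stand-in for tokens (a real pipeline would count with the embedding model's tokenizer):

```python
def chunk_fixed(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks of `size` words,
    each overlapping the previous chunk by `overlap` words.
    Assumes size > overlap."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # last chunk reached the end
            break
    return chunks

doc = " ".join(f"tok{i}" for i in range(1200))
print(len(chunk_fixed(doc, size=512, overlap=50)))  # 3
```

Note how the overlap duplicates the boundary words into adjacent chunks, which is exactly the duplication trade-off the table above mentions.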
03 · EMBEDDINGS & VECTOR SEARCH

Turning Text Into Searchable Geometry

An embedding model converts a piece of text into a dense vector of floating-point numbers — a point in a high-dimensional space (typically 768–3,072 dimensions) where semantically similar texts are geometrically close. This is the foundation of semantic search: instead of matching keywords, you match meaning.

Retrieval works by embedding the query, then finding the stored chunk vectors nearest to it. The two dominant similarity metrics are cosine similarity (angle between vectors — scale-invariant) and dot product (angle × magnitude — better when embedding norms carry signal). Cosine similarity is the default in most vector stores.
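Both metrics are a few lines of arithmetic. A plain-Python sketch (production systems use optimized vector math via NumPy or the vector store itself):

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    """Dot product: angle x magnitude, sensitive to vector scale."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product normalized by magnitudes,
    so only the angle between the vectors matters."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction, twice the magnitude
print(round(cosine(a, b), 6))  # 1.0 -- identical direction
print(dot(a, b))               # 28.0 -- magnitude changes the score
```

Scaling a vector leaves its cosine similarity unchanged but doubles its dot product, which is why cosine is the safer default unless your embedding norms carry signal.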

SEARCH TYPE 01
Dense Retrieval (ANN)
Approximate Nearest Neighbor (ANN) search over embedding vectors. Fast, semantic, handles synonyms and paraphrases. Requires an embedding model at index and query time. HNSW is the standard ANN index algorithm.
WIDELY USED (2021–2026)
SEARCH TYPE 02
Sparse Retrieval (BM25)
A lexical ranking function that scores exact keyword matches by term frequency and inverse document frequency (BM25 refines classic TF-IDF with term saturation and document-length normalization). Fast, no embedding model needed. Excels at matching rare terms, IDs, product codes, and proper nouns that embedding models may not encode well.
WIDELY USED (classic–2026)
SEARCH TYPE 03
Hybrid Search
Run dense and sparse retrieval in parallel; merge results using a fusion algorithm (e.g., Reciprocal Rank Fusion). Combines semantic matching with exact-term precision. The standard in production RAG systems as of 2024–2026.
EMERGING STANDARD (2023–2026)
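Reciprocal Rank Fusion itself is simple: each result list contributes 1/(k + rank) to a document's fused score. A sketch, using k=60 as in the original RRF formulation:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: sum 1/(k + rank) across every
    ranking a document appears in (rank is 1-based), then sort
    by fused score. k=60 is the constant from the RRF paper."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["c3", "c1", "c7"]  # ANN (semantic) results
sparse = ["c1", "c9", "c3"]  # BM25 (keyword) results
print(rrf_fuse([dense, sparse]))  # ['c1', 'c3', 'c9', 'c7']
```

c1 wins because it ranks highly in both lists; documents found by only one retriever fall behind documents both retrievers agree on.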
TOOL 01
ChromaDB
Open-source, embedded vector database. Runs in-process (no server) or as a server. Default embedding: sentence-transformers. Supports metadata filtering, persistent storage, and multiple collections. Ideal for prototyping and small-to-medium corpora.
WIDELY USED (2023–2026)
Embedding model matters: The embedding model is the biggest lever on retrieval quality. For English-only corpora, all-MiniLM-L6-v2 is fast and compact (22M params). For multilingual or higher-stakes use, larger models (Voyage AI, OpenAI text-embedding-3, Cohere embed-v3) outperform significantly. Anthropic recommends Voyage AI for embeddings used with Claude.
04 · RERANKING & CONTEXT ASSEMBLY

Precision After Recall

Vector search is optimized for recall — returning everything that might be relevant. But recall ≠ precision. Top-k results often include chunks that are tangentially related, duplicative, or poorly ordered. Reranking is a second-pass step that takes the candidate set from the retriever and scores each chunk against the query with higher precision.

APPROACH A — CROSS-ENCODER RERANKER

A small transformer encodes the (query, chunk) pair jointly — unlike bi-encoders that embed query and chunk independently. Cross-encoders produce more accurate relevance scores but are too slow to run over the entire corpus, so they are applied only to the top-k candidates from the first-stage retriever. Cohere Rerank, Voyage AI reranking, and local cross-encoders (sentence-transformers) are common options.

APPROACH B — LLM-AS-RERANKER

For small candidate sets, ask the LLM itself to rank passages by relevance. "Given these 10 passages and the query, rank them from most to least relevant." Slower and more expensive than a dedicated reranker, but useful when no reranking model is available and the candidate set is small (3–10 chunks).

Once reranking is done, the top chunks are assembled into a context block for the prompt. Key assembly decisions:

  • Order: Place the most relevant chunk first (primacy effect) or last (recency effect). Avoid burying high-signal chunks in the middle (Lost in the Middle, Section 04).
  • Deduplication: Chunks from the same document may overlap. Hash or fuzzy-deduplicate before assembly.
  • Citation markers: Tag each chunk with a source ID so the LLM can cite it: [Source: doc_id, chunk_3].
  • Budget: Reserve enough context for the query, the system prompt, and the generated response. Truncate lower-ranked chunks first when context is tight.
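The four assembly decisions above can be sketched as one function. Character counts stand in for token counts here, and the function name is illustrative, not from a library:

```python
def assemble_context(chunks: list[dict], budget_chars: int = 2000) -> str:
    """Assemble ranked chunks into a context block:
    dedupe by text hash, tag each chunk with a citation marker,
    and stop adding lower-ranked chunks once the budget is spent.
    Expects chunks ordered most-relevant first."""
    seen: set[int] = set()
    blocks: list[str] = []
    used = 0
    for c in chunks:
        h = hash(c["text"])
        if h in seen:  # deduplication: skip repeated text
            continue
        seen.add(h)
        block = f"[Source: {c['id']}]\n{c['text']}"
        if used + len(block) > budget_chars:  # budget: truncate lowest-ranked first
            break
        blocks.append(block)
        used += len(block)
    return "\n\n".join(blocks)
```

A production version would also handle fuzzy near-duplicates and reorder for primacy/recency, but the skeleton is the same.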
05 · RAG FAILURE MODES & EVALUATION

What Goes Wrong — and How to Measure It

RAG introduces two compounding failure modes: retrieval failures (wrong chunks retrieved) and generation failures (wrong answer despite correct retrieval). A robust RAG pipeline must measure both independently.

RETRIEVAL FAILURE 01
Retrieval Miss
The answer exists in the corpus but was not retrieved. Causes: poor chunking that split the answer across boundaries, embedding model doesn't capture the query-document semantic relationship, or the query is too ambiguous.
ACTIVE RISK
RETRIEVAL FAILURE 02
Noisy Retrieval
Irrelevant chunks retrieved alongside relevant ones. The LLM must filter signal from noise in the context. High noise increases hallucination risk: the model may blend irrelevant retrieved content with parametric knowledge.
ACTIVE RISK
GENERATION FAILURE 01
Hallucination Despite Retrieval
The correct chunks are retrieved, but the LLM ignores them and generates from parametric memory anyway. Mitigation: strong system prompt instruction to answer only from provided context; "I don't know" as an explicit valid response.
ACTIVE RISK
GENERATION FAILURE 02
Context Stuffing
Injecting too many chunks degrades generation quality. The LLM must locate the relevant sentence within a large noisy context block — triggering the Lost-in-the-Middle effect. Cap context chunks; prefer precision over recall at assembly time.
ACTIVE RISK

The RAG evaluation framework from Gao et al. (2023) measures four dimensions independently:

Metric | What It Measures | How to Compute
Context Recall | Did retrieval surface the answer-bearing chunks? | Check if ground-truth answer source appears in retrieved set
Context Precision | What fraction of retrieved chunks are relevant? | Label retrieved chunks as relevant/irrelevant; compute precision
Answer Faithfulness | Is the generated answer grounded in the retrieved context? | LLM-as-judge: does every claim in the answer appear in the retrieved chunks?
Answer Relevance | Does the answer address the original question? | LLM-as-judge or human eval on generated output vs. query
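The two retrieval metrics reduce to set arithmetic once chunks are labeled. A sketch (the function names are illustrative, not from an evaluation library):

```python
def context_recall(retrieved: list[str], answer_chunks: set[str]) -> float:
    """Fraction of answer-bearing chunks that appear in the retrieved set."""
    if not answer_chunks:
        return 1.0
    return len(answer_chunks & set(retrieved)) / len(answer_chunks)

def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks labeled relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

retrieved = ["chunk_01", "chunk_03", "chunk_02"]
print(context_recall(retrieved, {"chunk_01"}))                 # 1.0
print(context_precision(retrieved, {"chunk_01", "chunk_02"}))  # 0.666...
```

The generation-side metrics (faithfulness, relevance) need an LLM-as-judge or human labels and cannot be reduced to set arithmetic the same way.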
SOURCES USED IN THIS SECTION

Verified References

Every claim in this section is grounded in one of these sources. No content is generated from model training data alone.

Source | Type | Covers | Recency
Lewis et al. — Retrieval-Augmented Generation | Academic paper (Meta) | Original RAG formulation — retrieval, augmentation, generation pipeline | 2020 (foundational)
Gao et al. — RAG Survey | Academic survey | Naive, Advanced, and Modular RAG; chunking; evaluation framework | 2023
Karpukhin et al. — Dense Passage Retrieval (DPR) | Academic paper (Meta) | Dense retrieval, bi-encoder architecture, ANN search | 2020 (foundational)
Liu et al. — Lost in the Middle | Academic paper | Context position bias, chunk ordering in RAG context assembly | 2023
ChromaDB Documentation | Official docs | Vector store setup, collections, embedding functions, metadata filtering | Maintained 2023–2026
Anthropic — Embeddings (Voyage AI) | Official docs | Embedding model recommendations for use with Claude | Maintained 2024–2026
HANDS-ON LAB

Build a RAG Pipeline from Scratch

You will build a working RAG system in Python: chunk a small document corpus, index it in ChromaDB, retrieve with semantic search, and generate grounded answers using the Anthropic API. The complete script is rag_agent.py.

🔬
Section 10 Lab — RAG Pipeline
6 STEPS · PYTHON · ~45 MIN
1
Install dependencies and create the file

ChromaDB ships its own embedding model (sentence-transformers), so no separate embedding API key is required for this lab. All you need is your Anthropic API key for the generation step.

BASH
pip install chromadb anthropic
BASH
touch rag_agent.py
ChromaDB's default embedding function uses all-MiniLM-L6-v2 — a compact (22M parameter) sentence-transformer that runs locally. No internet access or embedding API key required during retrieval.
2
Set up the corpus and chunk it

We use a small hardcoded corpus of paragraphs about agentic AI topics. Each paragraph is one chunk. In a real system you would load documents from files, PDFs, or a database and split them using a proper chunking strategy.

PYTHON — rag_agent.py
import os
import chromadb
import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Our corpus: each entry is one "document chunk"
CORPUS = [
    {
        "id": "chunk_01",
        "text": (
            "ReAct agents interleave reasoning and action in a loop. "
            "The model produces a Thought, selects an Action (tool call), "
            "receives an Observation, and repeats until it emits a final answer. "
            "This was introduced by Yao et al. (2022) in arXiv:2210.03629."
        ),
        "source": "section-05-agents",
    },
    {
        "id": "chunk_02",
        "text": (
            "Memory in agents is classified into four types: in-context (the active prompt window), "
            "external semantic (vector stores), episodic (structured logs of past sessions), "
            "and procedural (persistent rules embedded in the system prompt). "
            "This taxonomy comes from Lilian Weng's 2023 survey of LLM-powered agents."
        ),
        "source": "section-06-memory",
    },
    {
        "id": "chunk_03",
        "text": (
            "Plan-and-execute agents separate planning from execution. A planner LLM generates "
            "a full step list upfront. An executor LLM works through each step using tools. "
            "When a step fails, the planner is called again with a context summary to replan. "
            "This pattern is documented in Wang et al. (2023), arXiv:2305.04091."
        ),
        "source": "section-07-planning",
    },
    {
        "id": "chunk_04",
        "text": (
            "Prompt injection is the top LLM-specific vulnerability in the OWASP LLM Top 10. "
            "An attacker embeds instructions in user-controlled content (e.g., a webpage or file) "
            "that override the agent's system prompt and redirect its behavior. "
            "Defences include input delimiting, privilege separation, and instruction anchoring."
        ),
        "source": "section-08-prompting",
    },
    {
        "id": "chunk_05",
        "text": (
            "RLHF has three stages: SFT (supervised fine-tuning on demonstration data), "
            "reward model training (on human preference rankings of response pairs), "
            "and PPO fine-tuning (updating the policy to maximize the reward model's score). "
            "A KL-divergence penalty prevents the model from drifting too far from the SFT baseline."
        ),
        "source": "section-09-rl",
    },
    {
        "id": "chunk_06",
        "text": (
            "Constitutional AI (Bai et al., 2022) replaces the human preference labeling step in RLHF "
            "with AI-generated feedback guided by a written set of principles. "
            "The AI critiques and revises its own responses according to the constitution, "
            "and AI-generated preference rankings replace human annotators in the RLAIF stage."
        ),
        "source": "section-09-rl",
    },
    {
        "id": "chunk_07",
        "text": (
            "DPO (Direct Preference Optimization, Rafailov et al. 2023) eliminates the separate "
            "reward model and PPO training loop from RLHF. It reparametrizes the reward in terms "
            "of the policy itself, training directly on preference pairs via binary cross-entropy. "
            "DPO is now the dominant alignment fine-tuning method for open-weight models."
        ),
        "source": "section-09-rl",
    },
    {
        "id": "chunk_08",
        "text": (
            "Transformers use self-attention to process all tokens in a sequence in parallel. "
            "Every token attends to every other token simultaneously, computing a weighted sum "
            "of value vectors based on query-key dot products. This replaced recurrent (RNN) processing "
            "and is the architectural foundation of all modern large language models."
        ),
        "source": "section-04-llms",
    },
]
3
Build the vector index

Create an in-memory ChromaDB collection and add all corpus chunks. ChromaDB automatically embeds each chunk's text using the default sentence-transformer model.

PYTHON — rag_agent.py (continued)
def build_index() -> chromadb.Collection:
    """Build an in-memory ChromaDB collection from the corpus."""
    db = chromadb.Client()  # ephemeral in-memory client
    collection = db.create_collection(
        name="course_knowledge",
        metadata={"hnsw:space": "cosine"}  # cosine similarity
    )

    collection.add(
        ids=[chunk["id"] for chunk in CORPUS],
        documents=[chunk["text"] for chunk in CORPUS],
        metadatas=[{"source": chunk["source"]} for chunk in CORPUS],
    )

    print(f"Index built: {collection.count()} chunks indexed.")
    return collection
Persistence: This uses an ephemeral in-memory client — data is lost when the script exits. For production, use chromadb.PersistentClient(path="./chroma_db") to save the index to disk. Only run build_index() when new documents are added; load the existing collection on subsequent runs.
4
Implement the retriever and the RAG generator

The retriever embeds the query and returns the top-k most similar chunks. The generator assembles them into a context block and calls the Anthropic API, instructed to answer only from the provided context.

PYTHON — rag_agent.py (continued)
TOP_K = 3

RAG_SYSTEM = """You are a precise question-answering assistant.
Answer the user's question using ONLY the information in the provided context blocks.
Each context block is labeled with a source ID.
If the answer is not in the context, say: "I don't have that information in the provided context."
Do not use outside knowledge. Cite the source ID when you use it."""


def retrieve(collection: chromadb.Collection, query: str, top_k: int = TOP_K) -> list[dict]:
    """Retrieve the top-k most relevant chunks for a query."""
    results = collection.query(
        query_texts=[query],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )
    chunks = []
    for i in range(len(results["ids"][0])):
        chunks.append({
            "id":       results["ids"][0][i],
            "text":     results["documents"][0][i],
            "source":   results["metadatas"][0][i]["source"],
            "distance": results["distances"][0][i],
        })
    return chunks


def rag_answer(collection: chromadb.Collection, query: str) -> str:
    """Full RAG pipeline: retrieve → augment → generate."""
    # 1. Retrieve
    chunks = retrieve(collection, query)

    print(f"\n  Retrieved {len(chunks)} chunks:")
    for c in chunks:
        print(f"    [{c['id']}] distance={c['distance']:.3f} source={c['source']}")
        print(f"    {c['text'][:80]}...")

    # 2. Augment — assemble context block
    context_block = "\n\n".join(
        f"[Source: {c['id']} | {c['source']}]\n{c['text']}"
        for c in chunks
    )
    augmented_query = f"Context:\n{context_block}\n\nQuestion: {query}"

    # 3. Generate
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        system=RAG_SYSTEM,
        messages=[{"role": "user", "content": augmented_query}],
    )
    return response.content[0].text
Why Haiku? This lab uses the smallest Claude model to keep costs low during experimentation. The generation quality depends far more on retrieval quality than model size in a RAG setting. Switch to a larger model only when the context is complex and synthesis quality matters.
5
Run the full pipeline and test with queries
PYTHON — rag_agent.py (continued)
TEST_QUERIES = [
    "How does a ReAct agent loop work?",
    "What are the four types of memory in agents?",
    "How does DPO differ from RLHF?",
    "What is prompt injection and how is it defended against?",
    "What year was the iPhone 15 released?",  # out-of-corpus query
]


if __name__ == "__main__":
    collection = build_index()

    for query in TEST_QUERIES:
        print(f"\n{'='*60}")
        print(f"QUERY: {query}")
        answer = rag_answer(collection, query)
        print(f"\nANSWER:\n{answer}")
BASH
python rag_agent.py
EXPECTED OUTPUT (abridged)
Index built: 8 chunks indexed.

============================================================
QUERY: How does a ReAct agent loop work?

  Retrieved 3 chunks:
    [chunk_01] distance=0.081 source=section-05-agents
    ReAct agents interleave reasoning and action in a loop. The model produces...
    [chunk_03] distance=0.412 source=section-07-planning
    Plan-and-execute agents separate planning from execution...
    [chunk_02] distance=0.451 source=section-06-memory
    Memory in agents is classified into four types...

ANSWER:
According to [chunk_01 | section-05-agents], a ReAct agent loop works
by interleaving reasoning and action: the model produces a Thought,
selects an Action (a tool call), receives an Observation, and repeats
this cycle until it emits a final answer. This approach was introduced
by Yao et al. (2022).

============================================================
QUERY: What year was the iPhone 15 released?

  Retrieved 3 chunks:
    [chunk_04] distance=0.682 ...
    [chunk_01] distance=0.711 ...
    [chunk_07] distance=0.731 ...

ANSWER:
I don't have that information in the provided context.
What to observe: (1) The first query retrieves chunk_01 with very low distance (0.08), a high-confidence match. (2) The out-of-corpus query retrieves chunks with distances above 0.6, semantically far away, and the model correctly refuses to answer rather than hallucinating. This is the core RAG safety property: under strict grounding instructions, the model refuses instead of falling back on parametric memory.
6
Extension: add an LLM reranker for multi-chunk queries

Retrieve more candidates than needed (top-6), then ask the LLM to rank them by relevance before assembly. This is the "LLM-as-reranker" pattern — useful when you don't have a dedicated cross-encoder model and the candidate set is small.

PYTHON — add to rag_agent.py
RERANKER_SYSTEM = """You are a relevance ranker.
Given a query and a list of numbered passages, return ONLY a JSON array
of passage numbers in order of relevance to the query, most relevant first.
Example: [2, 5, 1] means passage 2 is most relevant, then 5, then 1.
Return ONLY the JSON array, no other text."""


def llm_rerank(query: str, chunks: list[dict], keep: int = 3) -> list[dict]:
    """Use the LLM to rerank a candidate chunk list."""
    passages = "\n\n".join(
        f"[{i+1}] {c['text'][:200]}" for i, c in enumerate(chunks)
    )
    prompt = f"Query: {query}\n\nPassages:\n{passages}"

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=64,
        system=RERANKER_SYSTEM,
        messages=[{"role": "user", "content": prompt}],
    )
    import json
    ranked_indices = json.loads(response.content[0].text.strip())
    # Convert 1-indexed to 0-indexed, cap at available chunks
    reranked = [chunks[i - 1] for i in ranked_indices if 1 <= i <= len(chunks)]
    return reranked[:keep]


def rag_answer_reranked(collection: chromadb.Collection, query: str) -> str:
    """RAG with LLM reranking: retrieve 6 → rerank → take top 3 → generate."""
    candidates = retrieve(collection, query, top_k=6)
    top_chunks = llm_rerank(query, candidates, keep=3)

    print(f"  After reranking, using chunks: {[c['id'] for c in top_chunks]}")

    context_block = "\n\n".join(
        f"[Source: {c['id']} | {c['source']}]\n{c['text']}"
        for c in top_chunks
    )
    augmented_query = f"Context:\n{context_block}\n\nQuestion: {query}"

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        system=RAG_SYSTEM,
        messages=[{"role": "user", "content": augmented_query}],
    )
    return response.content[0].text
When reranking helps most: Queries where the initial embedding distance scores are clustered (all between 0.3–0.5) — the retriever found several plausible candidates but couldn't confidently rank them. The LLM's joint understanding of query + passage pairs resolves the tie more accurately than cosine similarity alone.
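One way to operationalize this (a heuristic of this sketch, not part of the lab script): trigger the reranking pass only when the retriever's distance scores are tightly clustered:

```python
def should_rerank(distances: list[float], spread_threshold: float = 0.15) -> bool:
    """Heuristic: if the retriever's distance scores are clustered
    (small spread between best and worst candidate), the retriever
    could not confidently rank them, so a reranking pass is worthwhile.
    The 0.15 threshold is an assumption to tune per corpus."""
    if len(distances) < 2:
        return False
    return (max(distances) - min(distances)) < spread_threshold

print(should_rerank([0.32, 0.38, 0.41]))  # True -- clustered, rerank
print(should_rerank([0.08, 0.45, 0.52]))  # False -- clear winner, skip
```

Gating the reranker this way saves an LLM call on queries like the first lab query, where chunk_01's distance (0.08) is already far below the runners-up.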
