Retrieval-Augmented Generation
An LLM's knowledge is frozen at training time. RAG breaks that constraint: instead of relying on weights alone, the agent retrieves relevant documents at query time and injects them into the context before generating. This section covers the full pipeline — chunking, embedding, vector search, and reranking — and the lab builds a working RAG system from scratch using ChromaDB and the Anthropic API.
Retrieve, Augment, Generate
Retrieval-Augmented Generation (RAG) was formalized by Lewis et al. (2020) as a method for grounding LLM outputs in a document corpus. The core idea: at inference time, retrieve the most relevant documents for the user's query, inject them into the prompt as context, and then let the LLM generate a response that is grounded in that retrieved evidence — rather than relying purely on parametric knowledge baked into the weights during training.
How You Split Documents Determines What You Retrieve
Before indexing, documents must be split into chunks — the units that will be embedded and stored. Chunk size and splitting strategy have an outsized effect on retrieval quality. Too large and chunks are noisy (irrelevant text dilutes the signal). Too small and chunks lose surrounding context needed to interpret them correctly.
| Strategy | How It Works | Best For | Trade-off |
|---|---|---|---|
| Fixed-size | Split every N tokens, with optional overlap (e.g., 512 tokens, 50 overlap) | Simple uniform corpora, quick prototype | May cut sentences mid-thought; overlap creates duplication |
| Recursive character | Split on paragraphs → sentences → words until chunks are under max size | General-purpose; default in LangChain | Still size-based; does not respect semantic structure |
| Sentence-window | Embed each sentence individually; retrieve with ±N surrounding sentences for context | Precise retrieval + full surrounding context | More chunks; context assembly logic required |
| Semantic chunking | Embed sentences; split when embedding cosine similarity drops below a threshold | Corpora with topic shifts mid-document | Slower ingestion; variable chunk sizes |
| Document structure | Split on HTML/Markdown headers, PDF sections, code blocks | Structured docs (wikis, API refs, PDFs) | Requires parser per format; sections vary wildly in size |
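The fixed-size strategy from the table can be sketched in a few lines. This is a minimal illustration, not a production splitter: `chunk_fixed` is a hypothetical helper, and it assumes the text has already been tokenized into a list (by whatever tokenizer your embedding model uses).

```python
def chunk_fixed(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split a token list into fixed-size chunks, each overlapping the previous by `overlap` tokens."""
    step = size - overlap  # advance by less than `size` so consecutive chunks share context
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

# A 1,200-token document yields 3 chunks; the last 50 tokens of chunk 0
# are repeated as the first 50 tokens of chunk 1.
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_fixed(tokens, size=512, overlap=50)
```

The overlap is what the table's trade-off column refers to: it prevents a sentence cut at a boundary from being unrecoverable, at the cost of storing some text twice.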
Turning Text Into Searchable Geometry
An embedding model converts a piece of text into a dense vector of floating-point numbers — a point in a high-dimensional space (typically 768–3,072 dimensions) where semantically similar texts are geometrically close. This is the foundation of semantic search: instead of matching keywords, you match meaning.
Retrieval works by embedding the query, then finding the stored chunk vectors nearest to it. The two dominant similarity metrics are cosine similarity (angle between vectors — scale-invariant) and dot product (angle × magnitude — better when embedding norms carry signal). Cosine similarity is the default in most vector stores.
The default local model, all-MiniLM-L6-v2, is fast and compact (22M parameters). For multilingual or higher-stakes use, larger hosted models (Voyage AI, OpenAI text-embedding-3, Cohere embed-v3) perform significantly better. Anthropic recommends Voyage AI for embeddings used with Claude.
Precision After Recall
Vector search is optimized for recall — returning everything that might be relevant. But recall ≠ precision. Top-k results often include chunks that are tangentially related, duplicative, or poorly ordered. Reranking is a second-pass step that takes the candidate set from the retriever and scores each chunk against the query with higher precision.
The standard approach is a cross-encoder: a small transformer encodes the (query, chunk) pair jointly — unlike bi-encoders, which embed query and chunk independently. Cross-encoders produce more accurate relevance scores but are too slow to run over the entire corpus, so they are applied only to the top-k candidates from the first-stage retriever. Cohere Rerank, Voyage AI reranking, and local cross-encoders (sentence-transformers) are common options.
An alternative for small candidate sets is to ask the LLM itself to rank passages by relevance: "Given these 10 passages and the query, rank them from most to least relevant." This is slower and more expensive than a dedicated reranker, but useful when no reranking model is available and the candidate set is small (3–10 chunks).
Once reranking is done, the top chunks are assembled into a context block for the prompt. Key assembly decisions:
- Order: Place the most relevant chunk first (primacy effect) or last (recency effect). Avoid burying high-signal chunks in the middle (Lost in the Middle, Section 04).
- Deduplication: Chunks from the same document may overlap. Hash or fuzzy-deduplicate before assembly.
- Citation markers: Tag each chunk with a source ID so the LLM can cite it: [Source: doc_id, chunk_3].
- Budget: Reserve enough context for the query, the system prompt, and the generated response. Truncate lower-ranked chunks first when context is tight.
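The deduplication, ordering, and budget decisions above can be combined into one assembly pass. A minimal sketch, assuming the chunks arrive pre-sorted by rank and using an exact hash to catch duplicates (a real pipeline might use fuzzy matching instead; `assemble_context` and the character budget are illustrative):

```python
import hashlib

def assemble_context(chunks: list[dict], max_chars: int = 4000) -> str:
    """Deduplicate by text hash, keep rank order, drop lowest-ranked chunks when over budget."""
    seen: set[str] = set()
    parts: list[str] = []
    used = 0
    for c in chunks:  # assumed pre-sorted, most relevant first
        h = hashlib.sha256(c["text"].encode()).hexdigest()
        if h in seen:
            continue  # exact duplicate, e.g. from an overlapping split
        seen.add(h)
        block = f"[Source: {c['id']}]\n{c['text']}"
        if used + len(block) > max_chars:
            break  # budget exhausted: remaining chunks are lower-ranked, so drop them
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)
```

Because the loop walks chunks in rank order, the most relevant chunk always lands first in the block (the primacy-effect placement described above) and truncation only ever removes the tail.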
What Goes Wrong — and How to Measure It
RAG introduces two compounding failure modes: retrieval failures (wrong chunks retrieved) and generation failures (wrong answer despite correct retrieval). A robust RAG pipeline must measure both independently.
The RAG eval framework from Gao et al. (2023) measures three dimensions independently:
| Metric | What It Measures | How to Compute |
|---|---|---|
| Context Recall | Did retrieval surface the answer-bearing chunks? | Check if ground-truth answer source appears in retrieved set |
| Context Precision | What fraction of retrieved chunks are relevant? | Label retrieved chunks as relevant/irrelevant; compute precision |
| Answer Faithfulness | Is the generated answer grounded in the retrieved context? | LLM-as-judge: does every claim in the answer appear in the retrieved chunks? |
| Answer Relevance | Does the answer address the original question? | LLM-as-judge or human eval on generated output vs. query |
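The two retrieval-side metrics in the table reduce to simple set arithmetic once you have ground-truth labels. A minimal sketch (the function names are illustrative; the faithfulness and relevance rows require an LLM judge and are not shown):

```python
def context_recall(retrieved_ids: list[str], answer_ids: set[str]) -> float:
    """Fraction of answer-bearing chunks that appear in the retrieved set."""
    if not answer_ids:
        return 1.0  # nothing required, so nothing was missed
    return len(answer_ids & set(retrieved_ids)) / len(answer_ids)

def context_precision(relevance_labels: list[bool]) -> float:
    """Fraction of retrieved chunks labeled relevant (one label per retrieved chunk)."""
    if not relevance_labels:
        return 0.0
    return sum(relevance_labels) / len(relevance_labels)

# Retrieval surfaced c2 but missed c9: recall 0.5.
# Two of three retrieved chunks were relevant: precision ~0.67.
recall = context_recall(["c1", "c2", "c3"], {"c2", "c9"})
precision = context_precision([True, True, False])
```

Measuring these separately from answer quality is the point: a low recall score tells you to fix chunking or embeddings, not the prompt.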
Verified References
Every claim in this section is grounded in one of these sources. No content is generated from model training data alone.
| Source | Type | Covers | Recency |
|---|---|---|---|
| Lewis et al. — Retrieval-Augmented Generation | Academic paper (Meta) | Original RAG formulation — retrieval, augmentation, generation pipeline | 2020 (foundational) |
| Gao et al. — RAG Survey | Academic survey | Naive, Advanced, and Modular RAG; chunking; evaluation framework | 2023 |
| Karpukhin et al. — Dense Passage Retrieval (DPR) | Academic paper (Meta) | Dense retrieval, bi-encoder architecture, ANN search | 2020 (foundational) |
| Liu et al. — Lost in the Middle | Academic paper | Context position bias, chunk ordering in RAG context assembly | 2023 |
| ChromaDB Documentation | Official docs | Vector store setup, collections, embedding functions, metadata filtering | Maintained 2023–2026 |
| Anthropic — Embeddings (Voyage AI) | Official docs | Embedding model recommendations for use with Claude | Maintained 2024–2026 |
Build a RAG Pipeline from Scratch
You will build a working RAG system in Python: chunk a small document corpus, index it in ChromaDB, retrieve with semantic search, and generate grounded answers using the Anthropic API. The complete script is rag_agent.py.
ChromaDB ships its own embedding model (sentence-transformers), so no separate embedding API key is required for this lab. All you need is your Anthropic API key for the generation step.
```shell
pip install chromadb anthropic
touch rag_agent.py
```
ChromaDB's default embedding model is all-MiniLM-L6-v2 — a compact (22M-parameter) sentence-transformer that runs locally. No internet access or embedding API key is required during retrieval.
We use a small hardcoded corpus of paragraphs about agentic AI topics. Each paragraph is one chunk. In a real system you would load documents from files, PDFs, or a database and split them using a proper chunking strategy.
```python
import os

import anthropic
import chromadb

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Our corpus: each entry is one "document chunk"
CORPUS = [
    {
        "id": "chunk_01",
        "text": (
            "ReAct agents interleave reasoning and action in a loop. "
            "The model produces a Thought, selects an Action (tool call), "
            "receives an Observation, and repeats until it emits a final answer. "
            "This was introduced by Yao et al. (2022) in arXiv:2210.03629."
        ),
        "source": "section-05-agents",
    },
    {
        "id": "chunk_02",
        "text": (
            "Memory in agents is classified into four types: in-context (the active prompt window), "
            "external semantic (vector stores), episodic (structured logs of past sessions), "
            "and procedural (persistent rules embedded in the system prompt). "
            "This taxonomy comes from Lilian Weng's 2023 survey of LLM-powered agents."
        ),
        "source": "section-06-memory",
    },
    {
        "id": "chunk_03",
        "text": (
            "Plan-and-execute agents separate planning from execution. A planner LLM generates "
            "a full step list upfront. An executor LLM works through each step using tools. "
            "When a step fails, the planner is called again with a context summary to replan. "
            "This pattern is documented in Wang et al. (2023), arXiv:2305.04091."
        ),
        "source": "section-07-planning",
    },
    {
        "id": "chunk_04",
        "text": (
            "Prompt injection is the top LLM-specific vulnerability in the OWASP LLM Top 10. "
            "An attacker embeds instructions in user-controlled content (e.g., a webpage or file) "
            "that override the agent's system prompt and redirect its behavior. "
            "Defences include input delimiting, privilege separation, and instruction anchoring."
        ),
        "source": "section-08-prompting",
    },
    {
        "id": "chunk_05",
        "text": (
            "RLHF has three stages: SFT (supervised fine-tuning on demonstration data), "
            "reward model training (on human preference rankings of response pairs), "
            "and PPO fine-tuning (updating the policy to maximize the reward model's score). "
            "A KL-divergence penalty prevents the model from drifting too far from the SFT baseline."
        ),
        "source": "section-09-rl",
    },
    {
        "id": "chunk_06",
        "text": (
            "Constitutional AI (Bai et al., 2022) replaces the human preference labeling step in RLHF "
            "with AI-generated feedback guided by a written set of principles. "
            "The AI critiques and revises its own responses according to the constitution, "
            "and AI-generated preference rankings replace human annotators in the RLAIF stage."
        ),
        "source": "section-09-rl",
    },
    {
        "id": "chunk_07",
        "text": (
            "DPO (Direct Preference Optimization, Rafailov et al. 2023) eliminates the separate "
            "reward model and PPO training loop from RLHF. It reparametrizes the reward in terms "
            "of the policy itself, training directly on preference pairs via binary cross-entropy. "
            "DPO is now the dominant alignment fine-tuning method for open-weight models."
        ),
        "source": "section-09-rl",
    },
    {
        "id": "chunk_08",
        "text": (
            "Transformers use self-attention to process all tokens in a sequence in parallel. "
            "Every token attends to every other token simultaneously, computing a weighted sum "
            "of value vectors based on query-key dot products. This replaced recurrent (RNN) processing "
            "and is the architectural foundation of all modern large language models."
        ),
        "source": "section-04-llms",
    },
]
```
Create an in-memory ChromaDB collection and add all corpus chunks. ChromaDB automatically embeds each chunk's text using the default sentence-transformer model.
```python
def build_index() -> chromadb.Collection:
    """Build an in-memory ChromaDB collection from the corpus."""
    db = chromadb.Client()  # ephemeral in-memory client
    collection = db.create_collection(
        name="course_knowledge",
        metadata={"hnsw:space": "cosine"},  # cosine similarity
    )
    collection.add(
        ids=[chunk["id"] for chunk in CORPUS],
        documents=[chunk["text"] for chunk in CORPUS],
        metadatas=[{"source": chunk["source"]} for chunk in CORPUS],
    )
    print(f"Index built: {collection.count()} chunks indexed.")
    return collection
```
For a persistent index, use chromadb.PersistentClient(path="./chroma_db") to save it to disk. Run build_index() only when new documents are added; load the existing collection on subsequent runs.
The retriever embeds the query and returns the top-k most similar chunks. The generator assembles them into a context block and calls the Anthropic API, instructed to answer only from the provided context.
```python
TOP_K = 3

RAG_SYSTEM = """You are a precise question-answering assistant.
Answer the user's question using ONLY the information in the provided context blocks.
Each context block is labeled with a source ID.
If the answer is not in the context, say:
"I don't have that information in the provided context."
Do not use outside knowledge. Cite the source ID when you use it."""


def retrieve(collection: chromadb.Collection, query: str, top_k: int = TOP_K) -> list[dict]:
    """Retrieve the top-k most relevant chunks for a query."""
    results = collection.query(
        query_texts=[query],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )
    chunks = []
    for i in range(len(results["ids"][0])):
        chunks.append({
            "id": results["ids"][0][i],
            "text": results["documents"][0][i],
            "source": results["metadatas"][0][i]["source"],
            "distance": results["distances"][0][i],
        })
    return chunks


def rag_answer(collection: chromadb.Collection, query: str) -> str:
    """Full RAG pipeline: retrieve → augment → generate."""
    # 1. Retrieve
    chunks = retrieve(collection, query)
    print(f"\n Retrieved {len(chunks)} chunks:")
    for c in chunks:
        print(f"  [{c['id']}] distance={c['distance']:.3f} source={c['source']}")
        print(f"    {c['text'][:80]}...")

    # 2. Augment — assemble context block
    context_block = "\n\n".join(
        f"[Source: {c['id']} | {c['source']}]\n{c['text']}"
        for c in chunks
    )
    augmented_query = f"Context:\n{context_block}\n\nQuestion: {query}"

    # 3. Generate
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        system=RAG_SYSTEM,
        messages=[{"role": "user", "content": augmented_query}],
    )
    return response.content[0].text
```
```python
TEST_QUERIES = [
    "How does a ReAct agent loop work?",
    "What are the four types of memory in agents?",
    "How does DPO differ from RLHF?",
    "What is prompt injection and how is it defended against?",
    "What year was the iPhone 15 released?",  # out-of-corpus query
]

if __name__ == "__main__":
    collection = build_index()
    for query in TEST_QUERIES:
        print(f"\n{'='*60}")
        print(f"QUERY: {query}")
        answer = rag_answer(collection, query)
        print(f"\nANSWER:\n{answer}")
```
```shell
python rag_agent.py
```

```
Index built: 8 chunks indexed.

============================================================
QUERY: How does a ReAct agent loop work?

 Retrieved 3 chunks:
  [chunk_01] distance=0.081 source=section-05-agents
    ReAct agents interleave reasoning and action in a loop...
  [chunk_03] distance=0.412 source=section-07-planning
    Plan-and-execute agents separate planning from execution...
  [chunk_02] distance=0.451 source=section-06-memory
    Memory in agents is classified into four types...

ANSWER:
According to [chunk_01 | section-05-agents], a ReAct agent loop works
by interleaving reasoning and action: the model produces a Thought,
selects an Action (a tool call), receives an Observation, and repeats
this cycle until it emits a final answer. This approach was introduced
by Yao et al. (2022).

============================================================
QUERY: What year was the iPhone 15 released?

 Retrieved 3 chunks:
  [chunk_04] distance=0.682 ...
  [chunk_01] distance=0.711 ...
  [chunk_07] distance=0.731 ...

ANSWER:
I don't have that information in the provided context.
```
Retrieve more candidates than needed (top-6), then ask the LLM to rank them by relevance before assembly. This is the "LLM-as-reranker" pattern — useful when you don't have a dedicated cross-encoder model and the candidate set is small.
```python
import json

RERANKER_SYSTEM = """You are a relevance ranker.
Given a query and a list of numbered passages, return ONLY a JSON array
of passage numbers in order of relevance to the query, most relevant first.
Example: [2, 5, 1] means passage 2 is most relevant, then 5, then 1.
Return ONLY the JSON array, no other text."""


def llm_rerank(query: str, chunks: list[dict], keep: int = 3) -> list[dict]:
    """Use the LLM to rerank a candidate chunk list."""
    passages = "\n\n".join(
        f"[{i+1}] {c['text'][:200]}" for i, c in enumerate(chunks)
    )
    prompt = f"Query: {query}\n\nPassages:\n{passages}"
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=64,
        system=RERANKER_SYSTEM,
        messages=[{"role": "user", "content": prompt}],
    )
    ranked_indices = json.loads(response.content[0].text.strip())
    # Convert 1-indexed to 0-indexed; drop any out-of-range indices
    reranked = [chunks[i - 1] for i in ranked_indices if 1 <= i <= len(chunks)]
    return reranked[:keep]


def rag_answer_reranked(collection: chromadb.Collection, query: str) -> str:
    """RAG with LLM reranking: retrieve 6 → rerank → take top 3 → generate."""
    candidates = retrieve(collection, query, top_k=6)
    top_chunks = llm_rerank(query, candidates, keep=3)
    print(f"  After reranking, using chunks: {[c['id'] for c in top_chunks]}")

    context_block = "\n\n".join(
        f"[Source: {c['id']} | {c['source']}]\n{c['text']}"
        for c in top_chunks
    )
    augmented_query = f"Context:\n{context_block}\n\nQuestion: {query}"

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        system=RAG_SYSTEM,
        messages=[{"role": "user", "content": augmented_query}],
    )
    return response.content[0].text
```