Retrieval-Augmented Generation
An LLM's knowledge is frozen at training time. RAG breaks that constraint: instead of relying on weights alone, the agent retrieves relevant documents at query time and injects them into the context before generating. This section covers the full pipeline — chunking, embedding, vector search, and reranking — and the lab builds a working RAG system from scratch using ChromaDB and the Anthropic API.
Retrieve, Augment, Generate
Retrieval-Augmented Generation (RAG) was formalized by Lewis et al. (2020) as a method for grounding LLM outputs in a document corpus. The core idea: at inference time, retrieve the most relevant documents for the user's query, inject them into the prompt as context, and then let the LLM generate a response that is grounded in that retrieved evidence — rather than relying purely on parametric knowledge baked into the weights during training.
How You Split Documents Determines What You Retrieve
Before indexing, documents must be split into chunks — the units that will be embedded and stored. Chunk size and splitting strategy have an outsized effect on retrieval quality. Too large and chunks are noisy (irrelevant text dilutes the signal). Too small and chunks lose surrounding context needed to interpret them correctly.
| Strategy | How It Works | Best For | Trade-off |
|---|---|---|---|
| Fixed-size | Split every N tokens, with optional overlap (e.g., 512 tokens, 50 overlap) | Simple uniform corpora, quick prototype | May cut sentences mid-thought; overlap creates duplication |
| Recursive character | Split on paragraphs → sentences → words until chunks are under max size | General-purpose; default in LangChain | Still size-based; does not respect semantic structure |
| Sentence-window | Embed each sentence individually; retrieve with ±N surrounding sentences for context | Precise retrieval + full surrounding context | More chunks; context assembly logic required |
| Semantic chunking | Embed sentences; split when embedding cosine similarity drops below a threshold | Corpora with topic shifts mid-document | Slower ingestion; variable chunk sizes |
| Document structure | Split on HTML/Markdown headers, PDF sections, code blocks | Structured docs (wikis, API refs, PDFs) | Requires parser per format; sections vary wildly in size |
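The fixed-size strategy from the table can be sketched in a few lines. This is a minimal illustration, not a production splitter: `chunk_fixed` is a hypothetical helper, and it assumes the text has already been tokenized into a list (by whatever tokenizer your embedding model uses).

```python
def chunk_fixed(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split a token list into fixed-size chunks, each overlapping the previous by `overlap` tokens."""
    step = size - overlap  # advance by less than `size` so consecutive chunks share context
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

# A 1,200-token document yields 3 chunks; the last 50 tokens of chunk 0
# are repeated as the first 50 tokens of chunk 1.
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_fixed(tokens, size=512, overlap=50)
```

The overlap is what the table's trade-off column refers to: it prevents a sentence cut at a boundary from being unrecoverable, at the cost of storing some text twice.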
Turning Text Into Searchable Geometry
An embedding model converts a piece of text into a dense vector of floating-point numbers — a point in a high-dimensional space (typically 768–3,072 dimensions) where semantically similar texts are geometrically close. This is the foundation of semantic search: instead of matching keywords, you match meaning.
Retrieval works by embedding the query, then finding the stored chunk vectors nearest to it. The two dominant similarity metrics are cosine similarity (angle between vectors — scale-invariant) and dot product (angle × magnitude — better when embedding norms carry signal). Cosine similarity is the default in most vector stores.
The default local model, all-MiniLM-L6-v2, is fast and compact (22M parameters). For multilingual or higher-stakes use, larger hosted models (Voyage AI, OpenAI text-embedding-3, Cohere embed-v3) perform significantly better. Anthropic recommends Voyage AI for embeddings used with Claude.
Precision After Recall
Vector search is optimized for recall — returning everything that might be relevant. But recall ≠ precision. Top-k results often include chunks that are tangentially related, duplicative, or poorly ordered. Reranking is a second-pass step that takes the candidate set from the retriever and scores each chunk against the query with higher precision.
The standard approach is a cross-encoder: a small transformer encodes the (query, chunk) pair jointly — unlike bi-encoders, which embed query and chunk independently. Cross-encoders produce more accurate relevance scores but are too slow to run over the entire corpus, so they are applied only to the top-k candidates from the first-stage retriever. Cohere Rerank, Voyage AI reranking, and local cross-encoders (sentence-transformers) are common options.
An alternative for small candidate sets is to ask the LLM itself to rank passages by relevance: "Given these 10 passages and the query, rank them from most to least relevant." This is slower and more expensive than a dedicated reranker, but useful when no reranking model is available and the candidate set is small (3–10 chunks).
Once reranking is done, the top chunks are assembled into a context block for the prompt. Key assembly decisions:
- Order: Place the most relevant chunk first (primacy effect) or last (recency effect). Avoid burying high-signal chunks in the middle (Lost in the Middle, Section 04).
- Deduplication: Chunks from the same document may overlap. Hash or fuzzy-deduplicate before assembly.
- Citation markers: Tag each chunk with a source ID so the LLM can cite it: [Source: doc_id, chunk_3].
- Budget: Reserve enough context for the query, the system prompt, and the generated response. Truncate lower-ranked chunks first when context is tight.
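The deduplication, ordering, and budget decisions above can be combined into one assembly pass. A minimal sketch, assuming the chunks arrive pre-sorted by rank and using an exact hash to catch duplicates (a real pipeline might use fuzzy matching instead; `assemble_context` and the character budget are illustrative):

```python
import hashlib

def assemble_context(chunks: list[dict], max_chars: int = 4000) -> str:
    """Deduplicate by text hash, keep rank order, drop lowest-ranked chunks when over budget."""
    seen: set[str] = set()
    parts: list[str] = []
    used = 0
    for c in chunks:  # assumed pre-sorted, most relevant first
        h = hashlib.sha256(c["text"].encode()).hexdigest()
        if h in seen:
            continue  # exact duplicate, e.g. from an overlapping split
        seen.add(h)
        block = f"[Source: {c['id']}]\n{c['text']}"
        if used + len(block) > max_chars:
            break  # budget exhausted: remaining chunks are lower-ranked, so drop them
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)
```

Because the loop walks chunks in rank order, the most relevant chunk always lands first in the block (the primacy-effect placement described above) and truncation only ever removes the tail.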
What Goes Wrong — and How to Measure It
RAG introduces two compounding failure modes: retrieval failures (wrong chunks retrieved) and generation failures (wrong answer despite correct retrieval). A robust RAG pipeline must measure both independently.
The RAG eval framework from Gao et al. (2023) measures three dimensions independently:
| Metric | What It Measures | How to Compute |
|---|---|---|
| Context Recall | Did retrieval surface the answer-bearing chunks? | Check if ground-truth answer source appears in retrieved set |
| Context Precision | What fraction of retrieved chunks are relevant? | Label retrieved chunks as relevant/irrelevant; compute precision |
| Answer Faithfulness | Is the generated answer grounded in the retrieved context? | LLM-as-judge: does every claim in the answer appear in the retrieved chunks? |
| Answer Relevance | Does the answer address the original question? | LLM-as-judge or human eval on generated output vs. query |
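The two retrieval-side metrics in the table reduce to simple set arithmetic once you have ground-truth labels. A minimal sketch (the function names are illustrative; the faithfulness and relevance rows require an LLM judge and are not shown):

```python
def context_recall(retrieved_ids: list[str], answer_ids: set[str]) -> float:
    """Fraction of answer-bearing chunks that appear in the retrieved set."""
    if not answer_ids:
        return 1.0  # nothing required, so nothing was missed
    return len(answer_ids & set(retrieved_ids)) / len(answer_ids)

def context_precision(relevance_labels: list[bool]) -> float:
    """Fraction of retrieved chunks labeled relevant (one label per retrieved chunk)."""
    if not relevance_labels:
        return 0.0
    return sum(relevance_labels) / len(relevance_labels)

# Retrieval surfaced c2 but missed c9: recall 0.5.
# Two of three retrieved chunks were relevant: precision ~0.67.
recall = context_recall(["c1", "c2", "c3"], {"c2", "c9"})
precision = context_precision([True, True, False])
```

Measuring these separately from answer quality is the point: a low recall score tells you to fix chunking or embeddings, not the prompt.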
Verified References
Every claim in this section is grounded in one of these sources. No content is generated from model training data alone.
| Source | Type | Covers | Recency |
|---|---|---|---|
| Lewis et al. — Retrieval-Augmented Generation | Academic paper (Meta) | Original RAG formulation — retrieval, augmentation, generation pipeline | 2020 (foundational) |
| Gao et al. — RAG Survey | Academic survey | Naive, Advanced, and Modular RAG; chunking; evaluation framework | 2023 |
| Karpukhin et al. — Dense Passage Retrieval (DPR) | Academic paper (Meta) | Dense retrieval, bi-encoder architecture, ANN search | 2020 (foundational) |
| Liu et al. — Lost in the Middle | Academic paper | Context position bias, chunk ordering in RAG context assembly | 2023 |
| ChromaDB Documentation | Official docs | Vector store setup, collections, embedding functions, metadata filtering | Maintained 2023–2026 |
| Anthropic — Embeddings (Voyage AI) | Official docs | Embedding model recommendations for use with Claude | Maintained 2024–2026 |
Build a RAG Pipeline from Scratch
You will build a working RAG system in Python: chunk a small document corpus, index it in ChromaDB, retrieve with semantic search, and generate grounded answers using the Anthropic API. The complete script is rag_agent.py.
ChromaDB ships its own embedding model (sentence-transformers), so no separate embedding API key is required for this lab. All you need is your Anthropic API key for the generation step.
```shell
pip install chromadb anthropic
touch rag_agent.py
```
ChromaDB's default embedding model is all-MiniLM-L6-v2 — a compact (22M-parameter) sentence-transformer that runs locally. No internet access or embedding API key is required during retrieval.
We use a small hardcoded corpus of paragraphs about agentic AI topics. Each paragraph is one chunk. In a real system you would load documents from files, PDFs, or a database and split them using a proper chunking strategy.
```python
import os

import anthropic
import chromadb

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Our corpus: each entry is one "document chunk"
CORPUS = [
    {
        "id": "chunk_01",
        "text": (
            "ReAct agents interleave reasoning and action in a loop. "
            "The model produces a Thought, selects an Action (tool call), "
            "receives an Observation, and repeats until it emits a final answer. "
            "This was introduced by Yao et al. (2022) in arXiv:2210.03629."
        ),
        "source": "section-05-agents",
    },
    {
        "id": "chunk_02",
        "text": (
            "Memory in agents is classified into four types: in-context (the active prompt window), "
            "external semantic (vector stores), episodic (structured logs of past sessions), "
            "and procedural (persistent rules embedded in the system prompt). "
            "This taxonomy comes from Lilian Weng's 2023 survey of LLM-powered agents."
        ),
        "source": "section-06-memory",
    },
    {
        "id": "chunk_03",
        "text": (
            "Plan-and-execute agents separate planning from execution. A planner LLM generates "
            "a full step list upfront. An executor LLM works through each step using tools. "
            "When a step fails, the planner is called again with a context summary to replan. "
            "This pattern is documented in Wang et al. (2023), arXiv:2305.04091."
        ),
        "source": "section-07-planning",
    },
    {
        "id": "chunk_04",
        "text": (
            "Prompt injection is the top LLM-specific vulnerability in the OWASP LLM Top 10. "
            "An attacker embeds instructions in user-controlled content (e.g., a webpage or file) "
            "that override the agent's system prompt and redirect its behavior. "
            "Defences include input delimiting, privilege separation, and instruction anchoring."
        ),
        "source": "section-08-prompting",
    },
    {
        "id": "chunk_05",
        "text": (
            "RLHF has three stages: SFT (supervised fine-tuning on demonstration data), "
            "reward model training (on human preference rankings of response pairs), "
            "and PPO fine-tuning (updating the policy to maximize the reward model's score). "
            "A KL-divergence penalty prevents the model from drifting too far from the SFT baseline."
        ),
        "source": "section-09-rl",
    },
    {
        "id": "chunk_06",
        "text": (
            "Constitutional AI (Bai et al., 2022) replaces the human preference labeling step in RLHF "
            "with AI-generated feedback guided by a written set of principles. "
            "The AI critiques and revises its own responses according to the constitution, "
            "and AI-generated preference rankings replace human annotators in the RLAIF stage."
        ),
        "source": "section-09-rl",
    },
    {
        "id": "chunk_07",
        "text": (
            "DPO (Direct Preference Optimization, Rafailov et al. 2023) eliminates the separate "
            "reward model and PPO training loop from RLHF. It reparametrizes the reward in terms "
            "of the policy itself, training directly on preference pairs via binary cross-entropy. "
            "DPO is now the dominant alignment fine-tuning method for open-weight models."
        ),
        "source": "section-09-rl",
    },
    {
        "id": "chunk_08",
        "text": (
            "Transformers use self-attention to process all tokens in a sequence in parallel. "
            "Every token attends to every other token simultaneously, computing a weighted sum "
            "of value vectors based on query-key dot products. This replaced recurrent (RNN) processing "
            "and is the architectural foundation of all modern large language models."
        ),
        "source": "section-04-llms",
    },
]
```
Create an in-memory ChromaDB collection and add all corpus chunks. ChromaDB automatically embeds each chunk's text using the default sentence-transformer model.
```python
def build_index() -> chromadb.Collection:
    """Build an in-memory ChromaDB collection from the corpus."""
    db = chromadb.Client()  # ephemeral in-memory client
    collection = db.create_collection(
        name="course_knowledge",
        metadata={"hnsw:space": "cosine"},  # cosine similarity
    )
    collection.add(
        ids=[chunk["id"] for chunk in CORPUS],
        documents=[chunk["text"] for chunk in CORPUS],
        metadatas=[{"source": chunk["source"]} for chunk in CORPUS],
    )
    print(f"Index built: {collection.count()} chunks indexed.")
    return collection
```
For a persistent index, use chromadb.PersistentClient(path="./chroma_db") to save it to disk. Run build_index() only when new documents are added; load the existing collection on subsequent runs.
The retriever embeds the query and returns the top-k most similar chunks. The generator assembles them into a context block and calls the Anthropic API, instructed to answer only from the provided context.
```python
TOP_K = 3

RAG_SYSTEM = """You are a precise question-answering assistant.
Answer the user's question using ONLY the information in the provided context blocks.
Each context block is labeled with a source ID.
If the answer is not in the context, say:
"I don't have that information in the provided context."
Do not use outside knowledge. Cite the source ID when you use it."""


def retrieve(collection: chromadb.Collection, query: str, top_k: int = TOP_K) -> list[dict]:
    """Retrieve the top-k most relevant chunks for a query."""
    results = collection.query(
        query_texts=[query],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )
    chunks = []
    for i in range(len(results["ids"][0])):
        chunks.append({
            "id": results["ids"][0][i],
            "text": results["documents"][0][i],
            "source": results["metadatas"][0][i]["source"],
            "distance": results["distances"][0][i],
        })
    return chunks


def rag_answer(collection: chromadb.Collection, query: str) -> str:
    """Full RAG pipeline: retrieve → augment → generate."""
    # 1. Retrieve
    chunks = retrieve(collection, query)
    print(f"\n Retrieved {len(chunks)} chunks:")
    for c in chunks:
        print(f"  [{c['id']}] distance={c['distance']:.3f} source={c['source']}")
        print(f"    {c['text'][:80]}...")

    # 2. Augment — assemble context block
    context_block = "\n\n".join(
        f"[Source: {c['id']} | {c['source']}]\n{c['text']}"
        for c in chunks
    )
    augmented_query = f"Context:\n{context_block}\n\nQuestion: {query}"

    # 3. Generate
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        system=RAG_SYSTEM,
        messages=[{"role": "user", "content": augmented_query}],
    )
    return response.content[0].text
```
```python
TEST_QUERIES = [
    "How does a ReAct agent loop work?",
    "What are the four types of memory in agents?",
    "How does DPO differ from RLHF?",
    "What is prompt injection and how is it defended against?",
    "What year was the iPhone 15 released?",  # out-of-corpus query
]

if __name__ == "__main__":
    collection = build_index()
    for query in TEST_QUERIES:
        print(f"\n{'='*60}")
        print(f"QUERY: {query}")
        answer = rag_answer(collection, query)
        print(f"\nANSWER:\n{answer}")
```
```shell
python rag_agent.py
```

```
Index built: 8 chunks indexed.

============================================================
QUERY: How does a ReAct agent loop work?

 Retrieved 3 chunks:
  [chunk_01] distance=0.081 source=section-05-agents
    ReAct agents interleave reasoning and action in a loop...
  [chunk_03] distance=0.412 source=section-07-planning
    Plan-and-execute agents separate planning from execution...
  [chunk_02] distance=0.451 source=section-06-memory
    Memory in agents is classified into four types...

ANSWER:
According to [chunk_01 | section-05-agents], a ReAct agent loop works
by interleaving reasoning and action: the model produces a Thought,
selects an Action (a tool call), receives an Observation, and repeats
this cycle until it emits a final answer. This approach was introduced
by Yao et al. (2022).

============================================================
QUERY: What year was the iPhone 15 released?

 Retrieved 3 chunks:
  [chunk_04] distance=0.682 ...
  [chunk_01] distance=0.711 ...
  [chunk_07] distance=0.731 ...

ANSWER:
I don't have that information in the provided context.
```
Retrieve more candidates than needed (top-6), then ask the LLM to rank them by relevance before assembly. This is the "LLM-as-reranker" pattern — useful when you don't have a dedicated cross-encoder model and the candidate set is small.
```python
import json

RERANKER_SYSTEM = """You are a relevance ranker.
Given a query and a list of numbered passages, return ONLY a JSON array
of passage numbers in order of relevance to the query, most relevant first.
Example: [2, 5, 1] means passage 2 is most relevant, then 5, then 1.
Return ONLY the JSON array, no other text."""


def llm_rerank(query: str, chunks: list[dict], keep: int = 3) -> list[dict]:
    """Use the LLM to rerank a candidate chunk list."""
    passages = "\n\n".join(
        f"[{i+1}] {c['text'][:200]}" for i, c in enumerate(chunks)
    )
    prompt = f"Query: {query}\n\nPassages:\n{passages}"
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=64,
        system=RERANKER_SYSTEM,
        messages=[{"role": "user", "content": prompt}],
    )
    ranked_indices = json.loads(response.content[0].text.strip())
    # Convert 1-indexed to 0-indexed; drop any out-of-range indices
    reranked = [chunks[i - 1] for i in ranked_indices if 1 <= i <= len(chunks)]
    return reranked[:keep]


def rag_answer_reranked(collection: chromadb.Collection, query: str) -> str:
    """RAG with LLM reranking: retrieve 6 → rerank → take top 3 → generate."""
    candidates = retrieve(collection, query, top_k=6)
    top_chunks = llm_rerank(query, candidates, keep=3)
    print(f"  After reranking, using chunks: {[c['id'] for c in top_chunks]}")

    context_block = "\n\n".join(
        f"[Source: {c['id']} | {c['source']}]\n{c['text']}"
        for c in top_chunks
    )
    augmented_query = f"Context:\n{context_block}\n\nQuestion: {query}"

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        system=RAG_SYSTEM,
        messages=[{"role": "user", "content": augmented_query}],
    )
    return response.content[0].text
```