Course  /  04 · Large Language Models
SECTION 04 CORE FOUNDATION

Large Language
Models

An agent is only as capable as its reasoning engine. Before you can build agents that plan, decide, and act reliably, you need a precise mental model of what an LLM actually is — how it processes text, what it can and cannot remember, how it is trained to follow instructions, and what controls you have at inference time. This section gives you that foundation without the math you don't need in practice.

01 · THE TRANSFORMER ARCHITECTURE

Attention Is All You Need — and Why It Matters for Agents

Every major LLM in use today — from all frontier providers — is built on the Transformer architecture, introduced by Vaswani et al. in 2017. The core innovation was replacing recurrence (processing tokens one at a time, left to right) with self-attention: a mechanism that lets every token in a sequence directly attend to every other token simultaneously.

For agents, the practical implication is immediate: the model does not "read" your prompt sequentially the way a human does. It computes relationships across the entire input in parallel. This is why long, dense context — hundreds of messages, large tool outputs, multi-document inputs — can be processed in a single forward pass, and why the context window is the primary constraint on what an agent can "see" at once.

// TRANSFORMER FORWARD PASS (simplified)
Raw Text → Tokenizer → Token IDs → Embeddings → Self-Attention Layers → Next-Token Probabilities → Sampled Token
Key insight: An LLM is fundamentally a next-token predictor. It takes a sequence of tokens as input and outputs a probability distribution over all tokens in its vocabulary. The model then samples from that distribution to produce the next token. The entire agent loop — reasoning, tool calls, final answers — is this process repeated thousands of times.
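The loop described above can be sketched in a few lines of plain Python. This is a toy illustration, not a real model: the hypothetical `model` function stands in for the Transformer and just returns fixed scores (logits) over a four-token vocabulary. What matters is the shape of the loop — score, convert to probabilities, sample, append, repeat.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution.
    Lower temperature sharpens the distribution; as T -> 0 it
    approaches argmax (always pick the most likely token)."""
    scaled = [x / max(temperature, 1e-8) for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng):
    """Draw one token index from the distribution."""
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1

def model(token_ids):
    """Toy stand-in for the Transformer forward pass: returns
    logits over a 4-token vocabulary, always preferring token 2."""
    return [0.1, 0.5, 3.0, 0.2]

rng = random.Random(0)
tokens = [0]                     # the prompt, as token IDs
for _ in range(5):               # the generation loop
    probs = softmax(model(tokens), temperature=0.7)
    tokens.append(sample(probs, rng))
```

An entire agent trajectory — plans, tool calls, final answer — is this loop run thousands of times, with the model's own outputs fed back in as input.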
🔗
Self-Attention
Each token learns to "pay attention" to the most relevant other tokens in the sequence. This is how the model relates "it" back to "the bank" ten sentences earlier.
FOUNDATIONAL (2017–2026)
📐
Embedding Space
Tokens are mapped to high-dimensional vectors where semantic relationships become geometric ones — "king" minus "man" plus "woman" ≈ "queen".
FOUNDATIONAL (2017–2026)
Parallel Processing
Unlike RNNs, Transformers process all tokens of a training sequence simultaneously (teacher forcing), which makes large-scale training on GPU clusters feasible. Generation itself is still autoregressive, one token at a time.
FOUNDATIONAL (2017–2026)
02 · TOKENIZATION

How Text Becomes Numbers — and Why It Matters

LLMs do not process characters or words directly. They process tokens — subword units produced by a learned tokenizer. The most common tokenization algorithm used by frontier models is Byte Pair Encoding (BPE), which iteratively merges the most frequent character pairs until it reaches a fixed vocabulary size (typically 32k–100k tokens).

A practical rule of thumb: 1 token ≈ 4 characters of English text, or roughly ¾ of a word. A 1,000-word document is approximately 1,333 tokens. Code and non-English text tokenize less efficiently — a single Unicode character may map to multiple tokens, which has real cost implications when agents process large outputs from code interpreters or foreign-language documents.

// HOW "agentic AI" TOKENIZES (approximate, BPE)
agent | ic | AI
→ 3 tokens  |  "agentic" splits because it is a less common compound, while "agent" and "AI" are high-frequency tokens and stay whole.
Agent implication: When your agent receives a large JSON payload from a tool, every byte counts against the context window. A 50KB API response that looks manageable in your text editor might consume a disproportionate number of tokens if it contains dense code, URLs, or non-ASCII content. Always inspect token counts, not byte sizes, when designing tool output schemas.
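The char-vs-token gap above can be made concrete with a crude estimator. This is a rough heuristic only (the ~4 chars/token rule of thumb for ASCII, plus a pessimistic one-token-per-character assumption for non-ASCII); for real budgeting, use your provider's tokenizer or token-counting API.

```python
def rough_token_estimate(text):
    """Crude token estimate: ~4 chars/token for ASCII English,
    and a pessimistic 1 token per non-ASCII character. Use your
    provider's tokenizer for anything cost-sensitive."""
    ascii_chars = sum(1 for c in text if ord(c) < 128)
    non_ascii = len(text) - ascii_chars
    return ascii_chars // 4 + non_ascii

payload = '{"status": "ok", "items": [1, 2, 3]}'
print(rough_token_estimate(payload))
```

Even a rough estimator like this, run against real tool outputs, will catch the 50KB-payload surprise before it blows your context budget in production.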
🔤
Byte Pair Encoding (BPE)
Starts from individual characters and iteratively merges the most frequent adjacent pairs. Produces a vocabulary that handles any text — including unseen words — by decomposing them into known subwords.
WIDELY USED (2019–2026)
🌐
Multilingual Tokenization
Non-English text is tokenized less efficiently. A sentence in Chinese, Arabic, or code-dense Python may use 2–5× more tokens per semantic unit than English prose. Relevant for agent cost estimation.
COST RISK
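The BPE merge step described above can be demonstrated with a toy loop — count adjacent symbol pairs across a tiny corpus, merge the most frequent pair, repeat. This is an illustration of the algorithm's core idea, not a production tokenizer (real BPE operates on bytes and trains on billions of words).

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of
    tokenized words (each word is a list of symbols)."""
    pairs = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

# Tiny corpus: frequent substrings merge first, so "agent"
# becomes a single token while the rarer suffixes stay split.
corpus = [list("agent"), list("agent"), list("agentic"), list("agency")]
for _ in range(4):  # four merge rounds
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
```

After four merges, "agent" is one token while "agentic" remains `agent + i + c` — exactly the splitting behavior shown in the example earlier.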
03 · CONTEXT WINDOWS

The Agent's Working Memory

The context window is the maximum number of tokens a model can process in a single forward pass — both the input (prompt + tool results + history) and the output (generated tokens) combined. It is the hard boundary of what the LLM can "see" and reason over at any moment.

For agents, the context window is their working memory. Everything the agent knows during a given turn — the system prompt, the conversation history, tool results, injected documents, and the current reasoning trace — must fit within it. When the context fills up, information must be compressed, summarized, or evicted, which introduces loss and error.
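Eviction, the simplest of the strategies above, can be sketched as a history trimmer: always keep the system prompt, then drop the oldest turns until what remains fits the budget. A minimal sketch — the chars/4 estimate is a stand-in assumption; swap in your provider's tokenizer for real accounting.

```python
def trim_history(messages, budget_tokens,
                 estimate=lambda m: len(m["content"]) // 4):
    """Keep the system prompt, then walk the history newest-first,
    dropping the oldest turns that no longer fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(estimate(m) for m in system)
    kept = []
    for m in reversed(rest):          # newest turns first
        cost = estimate(m)
        if used + cost > budget_tokens:
            break                     # everything older is evicted
        kept.append(m)
        used += cost
    return system + list(reversed(kept))
```

Production agents usually summarize evicted turns rather than drop them outright, but the budget-accounting shape is the same.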

📄
Short Context
Up to ~32K tokens. Suitable for single-document Q&A, short conversations, and lightweight tool-calling loops. Most cost-efficient for simple agents.
SEE PROVIDER DOCS FOR CURRENT LIMITS
📚
Mid Context
32K–200K tokens. Fits multiple long documents, extended agent trajectories, or code repositories. Enables agents that reason across a full codebase or research paper set.
WIDELY USED (2024–2026)
🏛️
Long Context
200K+ tokens. Designed for agents processing entire books, large codebases, or lengthy audit trails in a single pass. Performance on content near the middle of very long contexts varies by model.
EMERGING (2024–2026)
"Lost in the Middle" problem: Research (Liu et al., 2023) showed that models perform worse on information placed in the middle of very long contexts compared to the beginning or end. When designing agent prompts, put the most critical instructions at the start and key retrieved context near the end. Do not bury important information in the middle of a long tool output.
Prompt caching (2024–2026): Anthropic and other providers now offer prompt caching — a feature where a stable prefix of a long prompt is cached server-side and not recomputed on every call. For agents with a large, static system prompt (tool schemas, instructions, background documents), caching can reduce token costs by 80–90% on repeated calls. See your provider's current docs for availability and pricing.
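A cache-friendly request puts the large, stable content first and the changing content last. The sketch below follows the shape of Anthropic's prompt caching docs at the time of writing (`cache_control` blocks on the `system` content); the model name is a placeholder, and field names should be verified against your provider's current documentation.

```python
# Sketch of a cache-friendly request body (Anthropic-style).
# Field names follow Anthropic's prompt caching docs at time of
# writing; verify against current provider documentation.
request = {
    "model": "claude-sonnet-4-5",   # placeholder model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "LARGE_STATIC_SYSTEM_PROMPT",  # tool schemas, instructions
            # Mark the stable prefix for server-side caching:
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        # Only this part changes call to call, so only it is
        # recomputed; the cached prefix above is reused.
        {"role": "user", "content": "latest user turn"},
    ],
}
```

The key design rule: anything that varies per call (user input, tool results) must come after the cached prefix, or the cache is invalidated on every request.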
04 · THE TRAINING PIPELINE

From Raw Text to Instruction-Following Agent Brain

Modern LLMs go through a multi-stage training pipeline. Understanding these stages tells you why the model behaves the way it does — and what you can and cannot change at inference time.

// THREE-STAGE TRAINING PIPELINE
STAGE 1
Pretraining
Predict the next token on trillions of tokens of web text, books, and code. The model learns language, facts, and reasoning patterns.
STAGE 2
Supervised Fine-Tuning (SFT)
Train on curated human-written demonstrations of helpful, honest responses. The model learns the instruction-following format.
STAGE 3
RLHF / RLAIF / Constitutional AI
Train a reward model from human (RLHF) or AI (RLAIF/CAI) preferences, then use RL to maximize reward. Aligns the model toward helpfulness, honesty, and safety.

Constitutional AI (CAI), introduced by Anthropic in 2022, extends RLHF by replacing the human preference labeling step with a set of written principles — a "constitution" — that the model uses to critique and revise its own outputs. This makes the alignment process more scalable and transparent. Anthropic applies CAI to train Claude models.

What training fixes vs. what prompting fixes: Training determines the model's values, world knowledge, and base capabilities — you cannot change these at runtime. Prompting (including system prompts) determines the model's behavior in context: its role, output format, what tools it should use, what constraints to apply. Agent builders work entirely in the prompting layer. Training is the model provider's domain.
🧠
RLHF
Reinforcement Learning from Human Feedback. Humans compare model outputs and label which is better. A reward model is trained on those preferences, then PPO is used to tune the LLM toward higher reward.
WIDELY USED (2022–2026)
📜
Constitutional AI (CAI)
Anthropic's approach: a written set of principles guides the model to self-critique and self-revise. AI-generated feedback replaces some human labeling, improving scalability and consistency.
WIDELY USED (2023–2026)
05 · INFERENCE CONTROLS

What You Control at Runtime

When you call an LLM API, you have a set of parameters that control how the model generates its output. These are the levers available to agent builders. Understanding each one prevents common mistakes, such as running structured tool calls at high temperature or leaving max_tokens uncapped.

PARAMETER | WHAT IT DOES | RECOMMENDED FOR AGENTS
temperature | Controls randomness. 0 = always pick the most likely token; 1 = sample proportionally. Higher values increase creativity but reduce reliability. | 0–0.3 for tool-calling and structured output; 0.7–1 for creative generation tasks
max_tokens | Hard cap on the number of tokens the model will generate in its response. Does not affect input length. | Always set explicitly. Prevents runaway generation and controls cost.
top_p | Nucleus sampling: restricts sampling to the smallest set of tokens whose cumulative probability ≥ p. Lower values = more conservative output. | Usually leave at default (0.95–1.0). Adjust only if tuning creative diversity.
stop sequences | One or more strings that, when generated, cause the model to stop immediately. The stop string itself is not included in the output. | Use to enforce boundaries between reasoning and output blocks in custom agent formats.
tool / function schemas | Structured JSON definitions that constrain the model to produce valid tool calls. The model cannot call tools not defined in the schema. | Always use strict schemas. They prevent tool hallucination and enable reliable agent loops.

Structured outputs (also called "constrained generation" or "JSON mode") force the model to produce output that conforms to a schema at the token level — invalid tokens are masked out before sampling. This is how agent frameworks guarantee that tool call arguments are valid JSON even when temperature is non-zero. Anthropic's tool use API enforces this automatically when tools are defined.
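A strict tool schema looks like the sketch below. The `input_schema` key follows Anthropic's tool-use format; other providers use a similar JSON Schema shape under different key names. The `get_weather` tool itself is a made-up example.

```python
# A strict tool schema (Anthropic-style "input_schema"; the
# get_weather tool is a hypothetical example).
get_weather = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
        "additionalProperties": False,  # reject invented arguments
    },
}
```

The tighter the schema (required fields, enums, `additionalProperties: false`), the less room the model has to hallucinate argument names or values.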

Temperature = 0 is not truly deterministic. Floating-point non-determinism in GPU inference, plus server-side batching effects, can produce slight variation even at temperature 0. For reproducible outputs in tests, use a fixed random seed if the API supports it, and validate outputs programmatically rather than assuming exact string matches.
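"Validate programmatically" means checking structure, not comparing exact strings. A minimal sketch, assuming tool calls arrive as a JSON string with hypothetical `name`/`arguments` keys (adapt the keys to your provider's actual response shape):

```python
import json

def validate_tool_call(raw, expected_tool, required_args):
    """Check a model's tool-call output structurally instead of
    comparing exact strings, which breaks under sampling variation."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (
        call.get("name") == expected_tool
        and all(k in call.get("arguments", {}) for k in required_args)
    )

raw = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
```

Structural checks like this pass whether the model says "Paris" or "Paris, France", while still catching malformed JSON and wrong tool names.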
06 · WHAT LLMs CANNOT DO (AND WHY IT MATTERS FOR AGENTS)

Hard Limits You Must Design Around

LLMs are powerful reasoning engines but have well-documented limitations. Agent builders who ignore these ship systems that fail in predictable, embarrassing ways. Design your agents assuming these limitations are permanent — because as of 2026, none have been fully solved.

LIMIT 01
No Persistent State
An LLM has no memory between API calls. Every call is stateless. The illusion of memory in agents is entirely created by appending prior messages to the context. This is your responsibility as the builder.
STRUCTURAL LIMIT
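The "illusion of memory" described above is just the builder resending history. A minimal sketch, with a hypothetical `call_llm` stub standing in for a real API call:

```python
def call_llm(messages):
    """Stub standing in for a real API call. The real model sees
    ONLY what is in `messages`; nothing persists between calls."""
    return f"(reply to {len(messages)} messages)"

history = [{"role": "system", "content": "You are a helpful agent."}]

def agent_turn(history, user_text):
    history.append({"role": "user", "content": user_text})
    reply = call_llm(history)          # full history sent every call
    history.append({"role": "assistant", "content": reply})
    return reply

agent_turn(history, "Hello")
agent_turn(history, "What did I just say?")  # answerable only because
                                             # the first turn is in history
```

Drop the `history.append` lines and the "memory" vanishes: each call would see only the system prompt and the latest user message.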
LIMIT 02
Knowledge Cutoff
The model's factual knowledge is frozen at its training cutoff. It cannot know about events after that date unless you provide them in context via tools (web search, retrieval).
STRUCTURAL LIMIT
LIMIT 03
Hallucination
The model generates plausible-sounding but incorrect content — fabricated citations, invented function names, made-up statistics. In an agent loop, one hallucinated tool call can corrupt every downstream step.
ACTIVE RISK (2025–2026)
LIMIT 04
Arithmetic & Symbolic Reasoning
LLMs are unreliable at multi-step arithmetic, precise counting, and formal symbolic reasoning. Mitigate by routing these tasks to a code execution tool rather than relying on the model's reasoning alone.
DESIGN CONSTRAINT
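"Route to a code execution tool" means the agent emits code like the sketch below and lets an interpreter do the arithmetic, rather than generating 360 multiplications token by token:

```python
def compound_balance(principal, annual_rate, months):
    """Exact month-by-month compounding: the kind of multi-step
    arithmetic an agent should delegate to a code tool rather
    than attempt in its own reasoning."""
    monthly = annual_rate / 12
    balance = principal
    for _ in range(months):
        balance *= 1 + monthly
    return balance

# e.g. $100,000 at 6% nominal annual rate, compounded monthly for 30 years
print(round(compound_balance(100_000, 0.06, 360), 2))
```

The model's job is to write and invoke this correctly; the interpreter's job is to get the number right.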
LIMIT 05
No Real-World Awareness
The model cannot browse the internet, check the time, call APIs, or read files unless you explicitly provide those tools. It operates entirely within the context window you construct.
STRUCTURAL LIMIT
LIMIT 06
Context Window is Finite
No matter how large the context window, it is bounded. Long-running agents accumulate history, tool outputs, and reasoning traces that eventually require compression or truncation — with information loss.
DESIGN CONSTRAINT
SOURCES USED IN THIS SECTION

Verified References

Every claim in this section is grounded in one of these sources. No content is generated from model training data alone.

Source Type Covers Recency
Vaswani et al. — Attention Is All You Need Academic paper Transformer architecture, self-attention, positional encoding 2017 (foundational)
Sennrich et al. — BPE for NMT Academic paper Byte Pair Encoding tokenization algorithm 2016 (foundational)
Ouyang et al. — InstructGPT / RLHF Academic paper Supervised fine-tuning, RLHF pipeline, reward model training 2022
Bai et al. — Constitutional AI Academic paper (Anthropic) Constitutional AI, RLAIF, self-critique, harmlessness training 2022
Liu et al. — Lost in the Middle Academic paper Long-context performance, positional bias, context window design 2023
Anthropic — Tool Use & Agents Docs Official docs Structured outputs, tool schemas, inference controls, hallucination risks Maintained 2024–2026
Anthropic — Prompt Caching Docs Official docs Prompt caching, cost reduction for long static prompts Maintained 2024–2026
Hugging Face — Tokenizers Docs Official docs BPE, WordPiece, tokenizer implementation details Maintained 2024–2026
KNOWLEDGE CHECK

Section 04 Quiz

8 questions covering all theory blocks. Select one answer per question, then submit.

📝
Section 04 — Large Language Models
8 QUESTIONS · MULTIPLE CHOICE · UNLIMITED RETRIES
Question 1 of 8
What was the core architectural innovation introduced in the "Attention Is All You Need" paper (Vaswani et al., 2017)?
Question 2 of 8
Using a rough rule of thumb, how many tokens does a 1,000-word English document contain?
Question 3 of 8
Your agent's system prompt is 80,000 tokens and rarely changes between calls. Which feature should you enable to reduce token costs significantly?
Question 4 of 8
The "Lost in the Middle" paper (Liu et al., 2023) found that model performance degrades for information placed where in a long context?
Question 5 of 8
Which training stage is responsible for teaching an LLM to follow instructions and produce helpful responses (as opposed to just predicting text)?
Question 6 of 8
Constitutional AI (CAI), introduced by Anthropic, differs from standard RLHF primarily because:
Question 7 of 8
You are building a tool-calling agent that needs to produce a valid JSON tool call every time, even with slight temperature. What mechanism prevents the model from generating malformed JSON?
Question 8 of 8
An agent needs to compute the compound interest on a loan across 360 months. What is the right approach given LLM limitations?

Finished the theory and passed the quiz? Mark this section complete to track your progress.
