Course  /  04 · Large Language Models
SECTION 04 CORE FOUNDATION

Large Language
Models

An agent is only as capable as its reasoning engine. Before you can build agents that plan, decide, and act reliably, you need a precise mental model of what an LLM actually is — how it processes text, what it can and cannot remember, how it is trained to follow instructions, and what controls you have at inference time. This section gives you that foundation without the math you don't need in practice.

01 · THE TRANSFORMER ARCHITECTURE

Attention Is All You Need — and Why It Matters for Agents

Every major LLM in use today — from all frontier providers — is built on the Transformer architecture, introduced by Vaswani et al. in 2017. The core innovation was replacing recurrence (processing tokens one at a time, left to right) with self-attention: a mechanism that lets every token in a sequence directly attend to every other token simultaneously.

For agents, the practical implication is immediate: the model does not "read" your prompt sequentially the way a human does. It computes relationships across the entire input in parallel. This is why long, dense context — hundreds of messages, large tool outputs, multi-document inputs — can be processed in a single forward pass, and why the context window is the primary constraint on what an agent can "see" at once.

// TRANSFORMER FORWARD PASS (simplified)
Raw Text → Tokenizer → Token IDs → Embeddings → Self-Attention Layers → Next-Token Probabilities → Sampled Token
Key insight: An LLM is fundamentally a next-token predictor. It takes a sequence of tokens as input and outputs a probability distribution over all tokens in its vocabulary. The model then samples from that distribution to produce the next token. The entire agent loop — reasoning, tool calls, final answers — is this process repeated thousands of times.
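The loop described above can be sketched in a few lines of plain Python. This is a toy illustration, not a real model: the hypothetical `model` function stands in for the Transformer and just returns fixed scores (logits) over a four-token vocabulary. What matters is the shape of the loop — score, convert to probabilities, sample, append, repeat.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution.
    Lower temperature sharpens the distribution; as T -> 0 it
    approaches argmax (always pick the most likely token)."""
    scaled = [x / max(temperature, 1e-8) for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng):
    """Draw one token index from the distribution."""
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1

def model(token_ids):
    """Toy stand-in for the Transformer forward pass: returns
    logits over a 4-token vocabulary, always preferring token 2."""
    return [0.1, 0.5, 3.0, 0.2]

rng = random.Random(0)
tokens = [0]                     # the prompt, as token IDs
for _ in range(5):               # the generation loop
    probs = softmax(model(tokens), temperature=0.7)
    tokens.append(sample(probs, rng))
```

An entire agent trajectory — plans, tool calls, final answer — is this loop run thousands of times, with the model's own outputs fed back in as input.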
🔗
Self-Attention
Each token learns to "pay attention" to the most relevant other tokens in the sequence. This is how the model relates "it" back to "the bank" ten sentences earlier.
FOUNDATIONAL (2017–2026)
📐
Embedding Space
Tokens are mapped to high-dimensional vectors where semantic relationships become geometric ones — "king" minus "man" plus "woman" ≈ "queen".
FOUNDATIONAL (2017–2026)
Parallel Processing
Unlike RNNs, Transformers process all tokens of a training sequence simultaneously (teacher forcing), which makes large-scale training on GPU clusters feasible. Generation itself is still autoregressive, one token at a time.
FOUNDATIONAL (2017–2026)
02 · TOKENIZATION

How Text Becomes Numbers — and Why It Matters

LLMs do not process characters or words directly. They process tokens — subword units produced by a learned tokenizer. The most common tokenization algorithm used by frontier models is Byte Pair Encoding (BPE), which iteratively merges the most frequent character pairs until it reaches a fixed vocabulary size (typically 32k–100k tokens).

A practical rule of thumb: 1 token ≈ 4 characters of English text, or roughly ¾ of a word. A 1,000-word document is approximately 1,333 tokens. Code and non-English text tokenize less efficiently — a single Unicode character may map to multiple tokens, which has real cost implications when agents process large outputs from code interpreters or foreign-language documents.

// HOW "agentic AI" TOKENIZES (approximate, BPE)
agent | ic | AI
→ 3 tokens  |  "agentic" splits because it is a less common compound, while "agent" and "AI" are high-frequency tokens and stay whole.
Agent implication: When your agent receives a large JSON payload from a tool, every byte counts against the context window. A 50KB API response that looks manageable in your text editor might consume a disproportionate number of tokens if it contains dense code, URLs, or non-ASCII content. Always inspect token counts, not byte sizes, when designing tool output schemas.
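The char-vs-token gap above can be made concrete with a crude estimator. This is a rough heuristic only (the ~4 chars/token rule of thumb for ASCII, plus a pessimistic one-token-per-character assumption for non-ASCII); for real budgeting, use your provider's tokenizer or token-counting API.

```python
def rough_token_estimate(text):
    """Crude token estimate: ~4 chars/token for ASCII English,
    and a pessimistic 1 token per non-ASCII character. Use your
    provider's tokenizer for anything cost-sensitive."""
    ascii_chars = sum(1 for c in text if ord(c) < 128)
    non_ascii = len(text) - ascii_chars
    return ascii_chars // 4 + non_ascii

payload = '{"status": "ok", "items": [1, 2, 3]}'
print(rough_token_estimate(payload))
```

Even a rough estimator like this, run against real tool outputs, will catch the 50KB-payload surprise before it blows your context budget in production.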
🔤
Byte Pair Encoding (BPE)
Starts from individual characters and iteratively merges the most frequent adjacent pairs. Produces a vocabulary that handles any text — including unseen words — by decomposing them into known subwords.
WIDELY USED (2019–2026)
🌐
Multilingual Tokenization
Non-English text is tokenized less efficiently. A sentence in Chinese, Arabic, or code-dense Python may use 2–5× more tokens per semantic unit than English prose. Relevant for agent cost estimation.
COST RISK
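The BPE merge step described above can be demonstrated with a toy loop — count adjacent symbol pairs across a tiny corpus, merge the most frequent pair, repeat. This is an illustration of the algorithm's core idea, not a production tokenizer (real BPE operates on bytes and trains on billions of words).

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of
    tokenized words (each word is a list of symbols)."""
    pairs = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

# Tiny corpus: frequent substrings merge first, so "agent"
# becomes a single token while the rarer suffixes stay split.
corpus = [list("agent"), list("agent"), list("agentic"), list("agency")]
for _ in range(4):  # four merge rounds
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
```

After four merges, "agent" is one token while "agentic" remains `agent + i + c` — exactly the splitting behavior shown in the example earlier.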
03 · CONTEXT WINDOWS

The Agent's Working Memory

The context window is the maximum number of tokens a model can process in a single forward pass — both the input (prompt + tool results + history) and the output (generated tokens) combined. It is the hard boundary of what the LLM can "see" and reason over at any moment.

For agents, the context window is their working memory. Everything the agent knows during a given turn — the system prompt, the conversation history, tool results, injected documents, and the current reasoning trace — must fit within it. When the context fills up, information must be compressed, summarized, or evicted, which introduces loss and error.
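Eviction, the simplest of the strategies above, can be sketched as a history trimmer: always keep the system prompt, then drop the oldest turns until what remains fits the budget. A minimal sketch — the chars/4 estimate is a stand-in assumption; swap in your provider's tokenizer for real accounting.

```python
def trim_history(messages, budget_tokens,
                 estimate=lambda m: len(m["content"]) // 4):
    """Keep the system prompt, then walk the history newest-first,
    dropping the oldest turns that no longer fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(estimate(m) for m in system)
    kept = []
    for m in reversed(rest):          # newest turns first
        cost = estimate(m)
        if used + cost > budget_tokens:
            break                     # everything older is evicted
        kept.append(m)
        used += cost
    return system + list(reversed(kept))
```

Production agents usually summarize evicted turns rather than drop them outright, but the budget-accounting shape is the same.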

📄
Short Context
Up to ~32K tokens. Suitable for single-document Q&A, short conversations, and lightweight tool-calling loops. Most cost-efficient for simple agents.
SEE PROVIDER DOCS FOR CURRENT LIMITS
📚
Mid Context
32K–200K tokens. Fits multiple long documents, extended agent trajectories, or code repositories. Enables agents that reason across a full codebase or research paper set.
WIDELY USED (2024–2026)
🏛️
Long Context
200K+ tokens. Designed for agents processing entire books, large codebases, or lengthy audit trails in a single pass. Performance on content near the middle of very long contexts varies by model.
EMERGING (2024–2026)
"Lost in the Middle" problem: Research (Liu et al., 2023) showed that models perform worse on information placed in the middle of very long contexts compared to the beginning or end. When designing agent prompts, put the most critical instructions at the start and key retrieved context near the end. Do not bury important information in the middle of a long tool output.
Prompt caching (2024–2026): Anthropic and other providers now offer prompt caching — a feature where a stable prefix of a long prompt is cached server-side and not recomputed on every call. For agents with a large, static system prompt (tool schemas, instructions, background documents), caching can reduce token costs by 80–90% on repeated calls. See your provider's current docs for availability and pricing.
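A cache-friendly request puts the large, stable content first and the changing content last. The sketch below follows the shape of Anthropic's prompt caching docs at the time of writing (`cache_control` blocks on the `system` content); the model name is a placeholder, and field names should be verified against your provider's current documentation.

```python
# Sketch of a cache-friendly request body (Anthropic-style).
# Field names follow Anthropic's prompt caching docs at time of
# writing; verify against current provider documentation.
request = {
    "model": "claude-sonnet-4-5",   # placeholder model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "LARGE_STATIC_SYSTEM_PROMPT",  # tool schemas, instructions
            # Mark the stable prefix for server-side caching:
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        # Only this part changes call to call, so only it is
        # recomputed; the cached prefix above is reused.
        {"role": "user", "content": "latest user turn"},
    ],
}
```

The key design rule: anything that varies per call (user input, tool results) must come after the cached prefix, or the cache is invalidated on every request.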
04 · THE TRAINING PIPELINE

From Raw Text to Instruction-Following Agent Brain

Modern LLMs go through a multi-stage training pipeline. Understanding these stages tells you why the model behaves the way it does — and what you can and cannot change at inference time.

// THREE-STAGE TRAINING PIPELINE
STAGE 1
Pretraining
Predict the next token on trillions of tokens of web text, books, and code. The model learns language, facts, and reasoning patterns.
STAGE 2
Supervised Fine-Tuning (SFT)
Train on curated human-written demonstrations of helpful, honest responses. The model learns the instruction-following format.
STAGE 3
RLHF / RLAIF / Constitutional AI
Train a reward model from human (RLHF) or AI (RLAIF/CAI) preferences, then use RL to maximize reward. Aligns the model toward helpfulness, honesty, and safety.

Constitutional AI (CAI), introduced by Anthropic in 2022, extends RLHF by replacing the human preference labeling step with a set of written principles — a "constitution" — that the model uses to critique and revise its own outputs. This makes the alignment process more scalable and transparent. Anthropic applies CAI to train Claude models.

What training fixes vs. what prompting fixes: Training determines the model's values, world knowledge, and base capabilities — you cannot change these at runtime. Prompting (including system prompts) determines the model's behavior in context: its role, output format, what tools it should use, what constraints to apply. Agent builders work entirely in the prompting layer. Training is the model provider's domain.
🧠
RLHF
Reinforcement Learning from Human Feedback. Humans compare model outputs and label which is better. A reward model is trained on those preferences, then PPO is used to tune the LLM toward higher reward.
WIDELY USED (2022–2026)
📜
Constitutional AI (CAI)
Anthropic's approach: a written set of principles guides the model to self-critique and self-revise. AI-generated feedback replaces some human labeling, improving scalability and consistency.
WIDELY USED (2023–2026)
05 · INFERENCE CONTROLS

What You Control at Runtime

When you call an LLM API, you have a set of parameters that control how the model generates its output. These are the levers available to agent builders. Understanding each one prevents common mistakes, such as running structured tool calls at high temperature or leaving max_tokens uncapped.

PARAMETER | WHAT IT DOES | RECOMMENDED FOR AGENTS
temperature | Controls randomness. 0 = always pick the most likely token; 1 = sample proportionally. Higher values increase creativity but reduce reliability. | 0–0.3 for tool-calling and structured output; 0.7–1 for creative generation tasks
max_tokens | Hard cap on the number of tokens the model will generate in its response. Does not affect input length. | Always set explicitly. Prevents runaway generation and controls cost.
top_p | Nucleus sampling: restricts sampling to the smallest set of tokens whose cumulative probability ≥ p. Lower values = more conservative output. | Usually leave at default (0.95–1.0). Adjust only if tuning creative diversity.
stop sequences | One or more strings that, when generated, cause the model to stop immediately. The stop string itself is not included in the output. | Use to enforce boundaries between reasoning and output blocks in custom agent formats.
tool / function schemas | Structured JSON definitions that constrain the model to produce valid tool calls. The model cannot call tools not defined in the schema. | Always use strict schemas. They prevent tool hallucination and enable reliable agent loops.

Structured outputs (also called "constrained generation" or "JSON mode") force the model to produce output that conforms to a schema at the token level — invalid tokens are masked out before sampling. This is how agent frameworks guarantee that tool call arguments are valid JSON even when temperature is non-zero. Anthropic's tool use API enforces this automatically when tools are defined.
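A strict tool schema looks like the sketch below. The `input_schema` key follows Anthropic's tool-use format; other providers use a similar JSON Schema shape under different key names. The `get_weather` tool itself is a made-up example.

```python
# A strict tool schema (Anthropic-style "input_schema"; the
# get_weather tool is a hypothetical example).
get_weather = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
        "additionalProperties": False,  # reject invented arguments
    },
}
```

The tighter the schema (required fields, enums, `additionalProperties: false`), the less room the model has to hallucinate argument names or values.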

Temperature = 0 is not truly deterministic. Floating-point non-determinism in GPU inference, plus server-side batching effects, can produce slight variation even at temperature 0. For reproducible outputs in tests, use a fixed random seed if the API supports it, and validate outputs programmatically rather than assuming exact string matches.
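"Validate programmatically" means checking structure, not comparing exact strings. A minimal sketch, assuming tool calls arrive as a JSON string with hypothetical `name`/`arguments` keys (adapt the keys to your provider's actual response shape):

```python
import json

def validate_tool_call(raw, expected_tool, required_args):
    """Check a model's tool-call output structurally instead of
    comparing exact strings, which breaks under sampling variation."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (
        call.get("name") == expected_tool
        and all(k in call.get("arguments", {}) for k in required_args)
    )

raw = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
```

Structural checks like this pass whether the model says "Paris" or "Paris, France", while still catching malformed JSON and wrong tool names.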
06 · WHAT LLMs CANNOT DO (AND WHY IT MATTERS FOR AGENTS)

Hard Limits You Must Design Around

LLMs are powerful reasoning engines but have well-documented limitations. Agent builders who ignore these ship systems that fail in predictable, embarrassing ways. Design your agents assuming these limitations are permanent — because as of 2026, none have been fully solved.

LIMIT 01
No Persistent State
An LLM has no memory between API calls. Every call is stateless. The illusion of memory in agents is entirely created by appending prior messages to the context. This is your responsibility as the builder.
STRUCTURAL LIMIT
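The "illusion of memory" described above is just the builder resending history. A minimal sketch, with a hypothetical `call_llm` stub standing in for a real API call:

```python
def call_llm(messages):
    """Stub standing in for a real API call. The real model sees
    ONLY what is in `messages`; nothing persists between calls."""
    return f"(reply to {len(messages)} messages)"

history = [{"role": "system", "content": "You are a helpful agent."}]

def agent_turn(history, user_text):
    history.append({"role": "user", "content": user_text})
    reply = call_llm(history)          # full history sent every call
    history.append({"role": "assistant", "content": reply})
    return reply

agent_turn(history, "Hello")
agent_turn(history, "What did I just say?")  # answerable only because
                                             # the first turn is in history
```

Drop the `history.append` lines and the "memory" vanishes: each call would see only the system prompt and the latest user message.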
LIMIT 02
Knowledge Cutoff
The model's factual knowledge is frozen at its training cutoff. It cannot know about events after that date unless you provide them in context via tools (web search, retrieval).
STRUCTURAL LIMIT
LIMIT 03
Hallucination
The model generates plausible-sounding but incorrect content — fabricated citations, invented function names, made-up statistics. In an agent loop, one hallucinated tool call can corrupt every downstream step.
ACTIVE RISK (2025–2026)
LIMIT 04
Arithmetic & Symbolic Reasoning
LLMs are unreliable at multi-step arithmetic, precise counting, and formal symbolic reasoning. Mitigate by routing these tasks to a code execution tool rather than relying on the model's reasoning alone.
DESIGN CONSTRAINT
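"Route to a code execution tool" means the agent emits code like the sketch below and lets an interpreter do the arithmetic, rather than generating 360 multiplications token by token:

```python
def compound_balance(principal, annual_rate, months):
    """Exact month-by-month compounding: the kind of multi-step
    arithmetic an agent should delegate to a code tool rather
    than attempt in its own reasoning."""
    monthly = annual_rate / 12
    balance = principal
    for _ in range(months):
        balance *= 1 + monthly
    return balance

# e.g. $100,000 at 6% nominal annual rate, compounded monthly for 30 years
print(round(compound_balance(100_000, 0.06, 360), 2))
```

The model's job is to write and invoke this correctly; the interpreter's job is to get the number right.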
LIMIT 05
No Real-World Awareness
The model cannot browse the internet, check the time, call APIs, or read files unless you explicitly provide those tools. It operates entirely within the context window you construct.
STRUCTURAL LIMIT
LIMIT 06
Context Window is Finite
No matter how large the context window, it is bounded. Long-running agents accumulate history, tool outputs, and reasoning traces that eventually require compression or truncation — with information loss.
DESIGN CONSTRAINT
SOURCES USED IN THIS SECTION

Verified References

Every claim in this section is grounded in one of these sources. No content is generated from model training data alone.

Source Type Covers Recency
Vaswani et al. — Attention Is All You Need Academic paper Transformer architecture, self-attention, positional encoding 2017 (foundational)
Sennrich et al. — BPE for NMT Academic paper Byte Pair Encoding tokenization algorithm 2016 (foundational)
Ouyang et al. — InstructGPT / RLHF Academic paper Supervised fine-tuning, RLHF pipeline, reward model training 2022
Bai et al. — Constitutional AI Academic paper (Anthropic) Constitutional AI, RLAIF, self-critique, harmlessness training 2022
Liu et al. — Lost in the Middle Academic paper Long-context performance, positional bias, context window design 2023
Anthropic — Tool Use & Agents Docs Official docs Structured outputs, tool schemas, inference controls, hallucination risks Maintained 2024–2026
Anthropic — Prompt Caching Docs Official docs Prompt caching, cost reduction for long static prompts Maintained 2024–2026
Hugging Face — Tokenizers Docs Official docs BPE, WordPiece, tokenizer implementation details Maintained 2024–2026
KNOWLEDGE CHECK

Section 04 Quiz

8 questions covering all theory blocks. Select one answer per question, then submit.

📝
Section 04 — Large Language Models
8 QUESTIONS · MULTIPLE CHOICE · UNLIMITED RETRIES
Question 1 of 8
What was the core architectural innovation introduced in the "Attention Is All You Need" paper (Vaswani et al., 2017)?
Question 2 of 8
Using a rough rule of thumb, how many tokens does a 1,000-word English document contain?
Question 3 of 8
Your agent's system prompt is 80,000 tokens and rarely changes between calls. Which feature should you enable to reduce token costs significantly?
Question 4 of 8
The "Lost in the Middle" paper (Liu et al., 2023) found that model performance degrades for information placed where in a long context?
Question 5 of 8
Which training stage is responsible for teaching an LLM to follow instructions and produce helpful responses (as opposed to just predicting text)?
Question 6 of 8
Constitutional AI (CAI), introduced by Anthropic, differs from standard RLHF primarily because:
Question 7 of 8
You are building a tool-calling agent that needs to produce a valid JSON tool call every time, even with slight temperature. What mechanism prevents the model from generating malformed JSON?
Question 8 of 8
An agent needs to compute the compound interest on a loan across 360 months. What is the right approach given LLM limitations?

Finished the theory and passed the quiz? Mark this section complete to track your progress.
