01 · WHAT IS MACHINE LEARNING
From Rules to Learning
Traditional programming is explicit: you write rules, and the computer follows them. If X, do Y. If Z, do W. Machine learning inverts this. Instead of writing rules, you give the system examples of inputs and outputs, and it figures out the rules itself.
This matters enormously for agents. An agent powered by an LLM can handle instructions it was never explicitly programmed for — because the model learned statistical patterns from vast amounts of text, not from hand-coded rules. Understanding this distinction is what lets you reason about what an LLM can and cannot do reliably.
One sentence definition: Machine learning is the practice of training a mathematical function on data so that it can make useful predictions or decisions on new data it has never seen before.
// TRADITIONAL PROGRAMMING vs MACHINE LEARNING
TRADITIONAL PROGRAMMING
📥 Data: input
📜 Rules: written by programmer
↓ deterministic execution
📤 Output: answers
✓ Predictable ❌ Brittle ❌ Can't generalize
MACHINE LEARNING
📥 Data: input
📤 Answers: labeled examples
↓ training process
📜 Rules: learned by the model
✓ Flexible ✓ Generalizes ✓ Handles novel input
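The inversion above can be sketched in a few lines. This is a toy illustration, not a real spam filter: the messages, labels, and keyword-weighting scheme are all made up for the example.

```python
# Traditional programming: the rule is written by hand.
def is_spam_rule(msg: str) -> bool:
    return "free money" in msg.lower()

# Machine learning (minimal): learn keyword weights from labeled examples.
def train_keyword_weights(examples):
    """examples: list of (message, is_spam) pairs."""
    weights = {}
    for msg, label in examples:
        for word in msg.lower().split():
            # words seen in spam gain weight; words seen in ham lose it
            weights[word] = weights.get(word, 0) + (1 if label else -1)
    return weights

def is_spam_learned(msg, weights):
    score = sum(weights.get(w, 0) for w in msg.lower().split())
    return score > 0

examples = [
    ("claim your free prize now", True),
    ("free money waiting for you", True),
    ("meeting moved to 3pm", False),
    ("lunch tomorrow?", False),
]
w = train_keyword_weights(examples)
print(is_spam_learned("free prize inside", w))  # → True
```

The hand-written rule breaks on any phrasing it didn't anticipate; the learned weights also fire on messages that merely resemble the training examples.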
02 · THE THREE TYPES OF ML
Supervised, Unsupervised & Reinforcement Learning
Machine learning falls into three broad paradigms, each suited to a different type of problem. As an agent builder, you'll encounter all three: LLMs are pre-trained with supervised techniques and aligned with reinforcement learning, while agent memory systems are built on unsupervised embedding techniques.
🏷️
Supervised Learning
Train on labeled examples (input → correct output). The model learns the mapping. Used for classification, regression, and next-token prediction in LLM pre-training.
WIDELY USED (2024–2026)
🔍
Unsupervised Learning
Train on unlabeled data — find hidden structure. Used for clustering, dimensionality reduction, and generating text embeddings for semantic search.
WIDELY USED (2024–2026)
🎮
Reinforcement Learning
An agent takes actions, receives rewards or penalties, and learns to maximize cumulative reward. Foundation of RLHF — the process that aligns LLM behavior with human preferences.
WIDELY USED (2024–2026)
// WHERE EACH TYPE APPEARS IN THE LLM PIPELINE
SUPERVISED
Pre-training: predict the next token given billions of text sequences. Each prediction is a labeled example (correct next token = the label) — strictly "self-supervised", since the labels come from the data itself rather than from human annotators.
REINFORCEMENT
RLHF alignment: human raters score model outputs; the model is trained via PPO to produce higher-rated responses. How instruction-following is learned.
UNSUPERVISED
Embedding models: learn dense vector representations of text meaning. Powers semantic search, RAG pipelines, and agent long-term memory.
03 · NEURAL NETWORKS
The Engine Inside Every Modern AI
A neural network is a mathematical function composed of layers of simple computations. Each layer takes numbers in, applies learned weights and a non-linear activation, and passes numbers to the next layer. During training, these weights are adjusted — via gradient descent and backpropagation — so the network produces the right output for a given input.
Andrej Karpathy's Neural Networks: Zero to Hero series builds these from scratch in Python, starting from a single neuron and ending with a GPT-like transformer. You don't need to implement one to build agents, but understanding the structure helps you reason about behavior, limitations, and costs.
// NEURAL NETWORK — SIMPLIFIED LAYER VIEW
Each connection has a learned weight. Training adjusts all weights to minimize prediction error.
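A minimal sketch of that layer structure in plain Python. The sizes and weights here are arbitrary: a real network would be trained, not randomly initialized and left as-is.

```python
import random

random.seed(0)

def relu(x):
    # non-linear activation: pass positives through, clamp negatives to zero
    return max(0.0, x)

def layer(inputs, weights, biases):
    # one dense layer: weighted sum per neuron, then the activation
    return [relu(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# 2 inputs -> 3 hidden neurons -> 1 output, with random (untrained) weights
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
b1 = [0.0] * 3
W2 = [[random.uniform(-1, 1) for _ in range(3)]]
b2 = [0.0]

hidden = layer([0.5, -0.2], W1, b1)
output = layer(hidden, W2, b2)
print(output)
```

Training would adjust every number in W1, b1, W2, and b2 until the output matches the labels in the training data.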
Key terms you'll hear constantly
⚖️
Weights & Parameters
The numbers that define what a network "knows". Frontier models have hundreds of billions to trillions of parameters — exact counts are rarely published by labs.
📉
Loss Function
A score measuring how wrong the model's predictions are. Training minimizes this score. Lower loss = better predictions on training data.
🔄
Gradient Descent
The optimization algorithm that nudges weights in the direction that reduces loss. How the model learns from each batch of examples.
🔁
Backpropagation
The algorithm that computes how much each weight contributed to the prediction error, so gradient descent knows what to update. Karpathy's series builds this from scratch.
📦
Batch Size
How many training examples are processed before updating weights. Larger batches = more stable gradients, higher memory cost.
🔂
Epoch
One full pass through the entire training dataset. Models typically train for multiple epochs until loss converges.
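These terms fit together in even the smallest possible training loop. The sketch below fits a single weight w to toy data generated with a true value of 3.0, using a squared-error loss, a hand-derived gradient (the one-weight case of backpropagation), full-batch gradient descent, and multiple epochs:

```python
# toy data: y = 3 * x, so the "right answer" for w is 3.0
data = [(x, 3.0 * x) for x in [-2.0, -1.0, 0.5, 1.0, 2.0]]
w = 0.0     # start from an uninformed weight
lr = 0.05   # learning rate

for epoch in range(200):            # one epoch = one full pass over the data
    grad = 0.0
    loss = 0.0
    for x, y in data:               # here the "batch" is the whole dataset
        pred = w * x
        loss += (pred - y) ** 2     # squared-error loss
        grad += 2 * (pred - y) * x  # d(loss)/dw via the chain rule
    w -= lr * grad / len(data)      # gradient descent: nudge w downhill
print(round(w, 3))  # → 3.0
```

Loss shrinks each epoch as w converges to 3.0; in a real network the same loop runs over billions of weights at once.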
04 · THE TRANSFORMER — THE ARCHITECTURE BEHIND EVERY LLM
Attention Is All You Need
In 2017, Vaswani et al. published Attention Is All You Need, introducing the Transformer architecture. Every LLM you will use — GPT-family models, Claude, Gemini, Llama — is built on this architecture. Understanding its core mechanism, self-attention, is the single most important architectural concept for an agent builder.
Before the Transformer, the dominant architecture for sequence tasks was recurrent neural networks (RNNs). RNNs processed sequences token by token, passing a hidden state forward — but this made them slow to train and poor at long-range dependencies. The Transformer eliminated recurrence entirely and replaced it with attention.
What self-attention does (in plain terms): For each token in a sequence, the model computes how much it should attend to every other token — how relevant is each other word when determining the meaning of this word? This lets the model capture long-range relationships in a single step, in parallel, regardless of sequence length.
// SELF-ATTENTION — HOW TOKENS ATTEND TO EACH OTHER
Sentence: "The agent called the API because it was available."
When the model processes "it", self-attention lets it look back at all tokens and determine that "API" is the referent — not "agent":
"The" · "agent" · "called" · "the" · "API" · "because" · "it" · "was" · "available"
When processing "it", the attention weight (simplified) is highest on "API".
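In the original paper this mechanism is scaled dot-product attention: softmax(QK^T / sqrt(d)) · V. A minimal sketch with made-up 2-dimensional vectors (real models use learned projections and far higher dimensions):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # attention weights sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# toy vectors (made up): the query aligns strongly with the second key,
# so the output is dominated by the second value vector
Q = [[3.0, 0.0]]
K = [[0.1, 1.0], [1.0, 0.0], [0.0, 0.2]]
V = [[1.0, 1.0], [5.0, 5.0], [2.0, 2.0]]
print(attention(Q, K, V))
```

Each output row is a weighted blend of the value vectors, with weights set by query-key similarity — exactly how "it" ends up blended mostly with "API" above.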
The key components of a Transformer
👁️
Multi-Head Attention
Multiple attention mechanisms run in parallel, each learning to focus on different types of relationships. One head might track syntax; another, coreference.
CORE MECHANISM (2017–2026)
📍
Positional Encoding
Since attention has no inherent sense of order, positional encodings inject information about each token's position in the sequence.
CORE MECHANISM (2017–2026)
🔀
Feed-Forward Layers
After attention, each token's representation passes through a feed-forward network applied independently. Adds capacity for complex transformations.
CORE MECHANISM (2017–2026)
🔢
Decoder-Only Architecture
GPT-family models and Claude use decoder-only transformers — the encoder/decoder split from the original paper is dropped. Generates text autoregressively, one token at a time.
DOMINANT LLM PATTERN (2020–2026)
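The sinusoidal positional encoding from the original paper can be sketched directly: each position gets a unique pattern of sine and cosine values at different frequencies, which is added to the token's embedding.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need':
    sin for even dimensions, cos for odd, at geometrically spaced frequencies."""
    pe = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

print(positional_encoding(0, 4))  # → [0.0, 1.0, 0.0, 1.0]
```

Many modern decoder-only models use other schemes (e.g. rotary embeddings), but the job is the same: give attention a sense of token order.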
Why this matters for agents: The context window — the total number of tokens the model can attend to at once — is a direct product of the Transformer architecture. Everything the agent has seen (the goal, tool results, memory, conversation history) must fit in this window. Managing that window efficiently is a core agent design skill.
05 · EMBEDDINGS — THE MOST IMPORTANT CONCEPT FOR AGENTS
How AI Understands Meaning
Computers can't work with words directly — everything must be numbers. An embedding is a dense vector (a list of numbers) that represents the semantic meaning of a piece of text. Words or sentences with similar meaning end up with similar vectors — this is the geometric encoding of semantics. It is the foundation of semantic search, RAG pipelines, and agent long-term memory.
Why this matters for agents: When an agent needs to retrieve relevant memory or documents, it converts the query to an embedding, then finds stored embeddings that are mathematically close to it. This allows the agent to find semantically relevant content, not just exact keyword matches — "budget overrun" and "spending exceeded forecast" will have similar embeddings.
// EMBEDDING SPACE — MEANING AS GEOMETRY
dimension 1 →
dimension 2 ↑
ANIMALS cluster
cat
dog
wolf
ROYALTY cluster
king
queen
prince
PROGRAMMING cluster
python
function
loop
Real embeddings have hundreds to thousands of dimensions — not 2. We visualize in 2D to show the core idea: similar meanings cluster together.
Vector similarity — how agents search memory
To find the most relevant documents, an agent computes the cosine similarity between the query embedding and every stored embedding. Cosine similarity is the cosine of the angle between two vectors: values range from -1 to 1, and the closer to 1.0, the more semantically related the content.
// SEMANTIC SEARCH IN AN AGENT — END TO END
1. Query: "What did we discuss about the Q3 budget?"
↓ embedding model converts text to vector
2. Query vector: [0.23, -0.81, 0.44, 0.12, ...] (768+ numbers)
↓ cosine similarity against all stored document vectors
3. Top match: "Q3 budget review showed a 12% overrun..." — similarity: 0.94
↓ inject matched content into agent context
4. Agent responds with grounded, relevant information from memory
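The similarity computation at the heart of this loop is only a few lines. The vectors below are made-up 3-dimensional stand-ins for real embeddings, which have hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# toy "embeddings" (invented for illustration)
memory = {
    "Q3 budget review showed a 12% overrun": [0.9, 0.1, 0.0],
    "team offsite scheduled for May":        [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of the Q3 budget query

best = max(memory, key=lambda doc: cosine_similarity(query_vec, memory[doc]))
print(best)  # → the budget document
```

Production systems replace this linear scan with a vector database, but the ranking criterion is the same.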
06 · TOKENIZATION
How LLMs See Text
LLMs don't read characters or words — they read tokens. A token is a subword chunk, roughly 3–4 characters on average in English. The model converts all text to a sequence of integer token IDs before processing. This has direct practical implications for agent builders: every prompt, every tool result, every memory injection is measured and billed in tokens.
// TEXT → TOKENS → TOKEN IDs
Input text:
"Build me an AI agent"
Tokenized (BPE subwords):
"Build"
" me"
" an"
" AI"
" agent"
Token IDs (integers the model actually processes):
[8585, 502, 459, 15592, 8479]
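A toy greedy tokenizer makes the text-to-IDs step concrete. The vocabulary and IDs here are invented for illustration: real BPE tokenizers learn their vocabularies from data, and the actual IDs depend entirely on which tokenizer a model uses.

```python
# made-up vocabulary: token string -> integer ID (illustrative only)
VOCAB = {"Build": 1, " me": 2, " an": 3, " AI": 4, " agent": 5,
         " a": 6, "B": 7, "u": 8}

def tokenize(text):
    ids = []
    while text:
        # greedily take the longest vocabulary entry that prefixes the text
        match = max((t for t in VOCAB if text.startswith(t)),
                    key=len, default=None)
        if match is None:
            raise ValueError(f"no token for: {text[:10]!r}")
        ids.append(VOCAB[match])
        text = text[len(match):]
    return ids

print(tokenize("Build me an AI agent"))  # → [1, 2, 3, 4, 5]
```

Note the leading spaces inside the vocabulary entries: real tokenizers also fold whitespace into tokens, which is why " agent" and "agent" are different tokens.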
Why token count matters for agent builders: LLMs have a context window limit measured in tokens. Every message, system prompt, tool result, and injected memory chunk consumes tokens. Even on models with very large context windows, a long agent loop with many tool calls can fill the window — and every token also costs money. Designing token-efficient agents is a fundamental production skill.
Context window sizes vary widely and change frequently. The table below shows approximate tiers — always check the current provider documentation before building, as limits change with new model releases.
| Model tier | Typical context window | Rough text equivalent |
| Small / fast models | 8K – 32K tokens | ~6K–24K words / short document |
| Mid-range models | 128K – 200K tokens | ~96K–150K words / full novel |
| Long-context models | 500K – 1M+ tokens | ~375K–750K words / large codebase |
Check current limits: Anthropic · OpenAI · Google
07 · THE LLM TRAINING PIPELINE
Pre-training → Fine-tuning → RLHF
Modern LLMs are not trained in one step. They go through a staged pipeline, and each stage shapes a different aspect of the model's behavior. Understanding this pipeline helps you reason about why models behave the way they do — and when fine-tuning or prompt engineering is the right lever to pull.
🏋️
Pre-training
Training from scratch on vast text corpora via next-token prediction. Builds broad language understanding and world knowledge. Done by AI labs. Costs millions of dollars.
FOUNDATION STAGE
🎯
Supervised Fine-tuning (SFT)
Continuing training on high-quality instruction-following examples. Teaches the model to respond helpfully to human requests rather than just continuing text.
ALIGNMENT STAGE 1
🎖️
RLHF
Human raters compare model outputs and rank them. A reward model is trained on these preferences, then the LLM is optimized (via PPO) to maximize reward scores. Shapes helpfulness, safety, and instruction-following.
ALIGNMENT STAGE 2
⚡
Inference
Running the frozen, trained model on new input to get output. What happens every time you call a model API. Charged per token. All agent loops operate at inference time.
PRODUCTION STAGE
The AI model landscape
Specific model versions change constantly. These categories and their providers are stable — check each provider's current documentation for specific models and pricing.
| Category | Providers (see docs for current models) | Relevance to agents |
| Foundation models | Anthropic, OpenAI, Google, Meta (Llama) | The reasoning core of most agents |
| Embedding models | OpenAI, Voyage AI, Cohere, open-source (sentence-transformers) | Powers semantic search and memory retrieval |
| Vision / multimodal | Anthropic, OpenAI, Google — all major frontier models are multimodal | Agents that process images, PDFs, screenshots |
| Coding agents | Claude Code (Anthropic), GitHub Copilot (Microsoft), Cursor | Specialized autonomous coding agents |
| Open-source models | Meta (Llama series), Mistral, Microsoft (Phi series), Alibaba (Qwen) | Self-hosted agents for privacy or cost control |
08 · CORE CONCEPTS FOR DEBUGGING AGENTS
What Goes Wrong and Why
A handful of ML concepts come up constantly when an agent misbehaves. Understanding these lets you diagnose root causes instead of guessing.
🌫️
Hallucination
The model generates plausible-sounding but false information. It is optimized for fluent, coherent text — not for verifiable truth. Your #1 enemy as an agent builder.
HIGH IMPACT RISK
🎲
Temperature
Controls sampling randomness. High temperature (0.8–1.0) = creative, varied output. Low temperature (0.0–0.2) = focused, near-deterministic output (0 = always pick the most likely token). Agents usually use low temperature for tool calls.
KEY INFERENCE PARAMETER
📅
Training Cutoff
Models only know about events up to their training data cutoff. Agents that need current information must use tools (web search, live APIs) to retrieve it.
AGENT DESIGN CONSTRAINT
📚
Overfitting
The model memorizes training examples instead of learning generalizable patterns. Performs well on training data, poorly on new inputs. Fine-tuning on small datasets is especially susceptible.
TRAINING CONCERN
🔮
Emergent Abilities
Capabilities that appear unexpectedly as models scale up — including in-context learning, chain-of-thought reasoning, and multi-step planning. Not designed in; they emerge from scale.
WIDELY OBSERVED (2020–2026)
📐
Scaling Laws
Model performance improves predictably with increases in parameters, training data, and compute. Empirically studied by labs — the foundation for investments in frontier model development.
WIDELY USED (2020–2026)
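Temperature is just a divisor applied to the model's raw scores (logits) before sampling. A sketch with made-up logits for three candidate tokens:

```python
import math, random

def sample_with_temperature(logits, temperature, rng):
    """Sample a token index from logits softened/sharpened by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs)[0]

rng = random.Random(42)
logits = [2.0, 1.0, 0.1]  # invented scores for three candidate tokens

# low temperature: the distribution sharpens, almost always the top token
low = [sample_with_temperature(logits, 0.1, rng) for _ in range(100)]
# high temperature: the distribution flattens, picks vary more
high = [sample_with_temperature(logits, 1.0, rng) for _ in range(100)]
print(low.count(0), high.count(0))  # low-T count lands near 100
```

Dividing by a small temperature stretches the gaps between logits so the softmax concentrates on the top token; a large temperature compresses them, flattening the distribution.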
Hallucination is your #1 enemy as an agent builder. Design agents with grounding mechanisms — retrieval-augmented generation (RAG), tool use, citation requirements — to minimize the chance your agent confidently provides false information and then acts on it.
SOURCES USED IN THIS SECTION
Verified References
Every claim in this section is grounded in one of these sources. No content is generated from model training data alone.
| Source | Type | Covers | Recency |
| Andrej Karpathy — Neural Networks: Zero to Hero | Video series | Backprop, gradient descent, embeddings, tokenization (BPE), transformers — all built from scratch | 2022–2023 |
| Jay Alammar — The Illustrated Transformer | Blog / visual | Self-attention, multi-head attention, encoder-decoder, positional encoding | 2018 (architecture is timeless) |
| Vaswani et al. — Attention Is All You Need (arXiv:1706.03762) | Academic paper | Original Transformer architecture, multi-head attention, positional encoding | 2017 (foundation of all LLMs) |
| fast.ai — Practical Deep Learning for Coders | Course | Top-down ML fundamentals: training, overfitting, fine-tuning, practical application | Maintained 2022–2026 |