01 · WHAT IS MACHINE LEARNING
From Rules to Learning
Traditional programming is explicit: you write rules, and the computer follows them. If X, do Y. If Z, do W. Machine learning inverts this. Instead of writing rules, you give the system examples of inputs and outputs, and it figures out the rules itself.
This matters enormously for agents. An agent powered by an LLM can handle instructions it was never explicitly programmed for — because the model learned statistical patterns from vast amounts of text, not from hand-coded rules. Understanding this distinction is what lets you reason about what an LLM can and cannot do reliably.
One sentence definition: Machine learning is the practice of training a mathematical function on data so that it can make useful predictions or decisions on new data it has never seen before.
// TRADITIONAL PROGRAMMING vs MACHINE LEARNING
TRADITIONAL PROGRAMMING
📥 Data: input
📜 Rules: written by programmer
↓ deterministic execution
📤 Output: answers
✓ Predictable ❌ Brittle ❌ Can't generalize
MACHINE LEARNING
📥 Data: input
📤 Answers: labeled examples
↓ training process
📜 Rules: learned by the model
✓ Flexible ✓ Generalizes ✓ Handles novel input
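The inversion above can be sketched in a few lines. This is a toy illustration, not a real spam filter: the messages, labels, and keyword-weighting scheme are all made up for the example.

```python
# Traditional programming: the rule is written by hand.
def is_spam_rule(msg: str) -> bool:
    return "free money" in msg.lower()

# Machine learning (minimal): learn keyword weights from labeled examples.
def train_keyword_weights(examples):
    """examples: list of (message, is_spam) pairs."""
    weights = {}
    for msg, label in examples:
        for word in msg.lower().split():
            # words seen in spam gain weight; words seen in ham lose it
            weights[word] = weights.get(word, 0) + (1 if label else -1)
    return weights

def is_spam_learned(msg, weights):
    score = sum(weights.get(w, 0) for w in msg.lower().split())
    return score > 0

examples = [
    ("claim your free prize now", True),
    ("free money waiting for you", True),
    ("meeting moved to 3pm", False),
    ("lunch tomorrow?", False),
]
w = train_keyword_weights(examples)
print(is_spam_learned("free prize inside", w))  # → True
```

The hand-written rule breaks on any phrasing it didn't anticipate; the learned weights also fire on messages that merely resemble the training examples.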
02 · THE THREE TYPES OF ML
Supervised, Unsupervised & Reinforcement Learning
Machine learning falls into three broad paradigms, each suited to a different type of problem. As an agent builder, you'll encounter all three: LLMs are pre-trained with supervised techniques and aligned with reinforcement learning, while agent memory systems are built on unsupervised embedding techniques.
🏷️
Supervised Learning
Train on labeled examples (input → correct output). The model learns the mapping. Used for classification, regression, and next-token prediction in LLM pre-training.
WIDELY USED (2024–2026)
🔍
Unsupervised Learning
Train on unlabeled data — find hidden structure. Used for clustering, dimensionality reduction, and generating text embeddings for semantic search.
WIDELY USED (2024–2026)
🎮
Reinforcement Learning
An agent takes actions, receives rewards or penalties, and learns to maximize cumulative reward. Foundation of RLHF — the process that aligns LLM behavior with human preferences.
WIDELY USED (2024–2026)
// WHERE EACH TYPE APPEARS IN THE LLM PIPELINE
SUPERVISED
Pre-training: predict the next token given billions of text sequences. Each prediction is a labeled example (correct next token = the label) — strictly "self-supervised", since the labels come from the data itself rather than from human annotators.
REINFORCEMENT
RLHF alignment: human raters score model outputs; the model is trained via PPO to produce higher-rated responses. How instruction-following is learned.
UNSUPERVISED
Embedding models: learn dense vector representations of text meaning. Powers semantic search, RAG pipelines, and agent long-term memory.
03 · NEURAL NETWORKS
The Engine Inside Every Modern AI
A neural network is a mathematical function composed of layers of simple computations. Each layer takes numbers in, applies learned weights and a non-linear activation, and passes numbers to the next layer. During training, these weights are adjusted — via gradient descent and backpropagation — so the network produces the right output for a given input.
Andrej Karpathy's Neural Networks: Zero to Hero series builds these from scratch in Python, starting from a single neuron and ending with a GPT-like transformer. You don't need to implement one to build agents, but understanding the structure helps you reason about behavior, limitations, and costs.
// NEURAL NETWORK — SIMPLIFIED LAYER VIEW
Each connection has a learned weight. Training adjusts all weights to minimize prediction error.
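A minimal sketch of that layer structure in plain Python. The sizes and weights here are arbitrary: a real network would be trained, not randomly initialized and left as-is.

```python
import random

random.seed(0)

def relu(x):
    # non-linear activation: pass positives through, clamp negatives to zero
    return max(0.0, x)

def layer(inputs, weights, biases):
    # one dense layer: weighted sum per neuron, then the activation
    return [relu(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# 2 inputs -> 3 hidden neurons -> 1 output, with random (untrained) weights
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
b1 = [0.0] * 3
W2 = [[random.uniform(-1, 1) for _ in range(3)]]
b2 = [0.0]

hidden = layer([0.5, -0.2], W1, b1)
output = layer(hidden, W2, b2)
print(output)
```

Training would adjust every number in W1, b1, W2, and b2 until the output matches the labels in the training data.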
Key terms you'll hear constantly
⚖️
Weights & Parameters
The numbers that define what a network "knows". Frontier models have hundreds of billions to trillions of parameters — exact counts are rarely published by labs.
📉
Loss Function
A score measuring how wrong the model's predictions are. Training minimizes this score. Lower loss = better predictions on training data.
🔄
Gradient Descent
The optimization algorithm that nudges weights in the direction that reduces loss. How the model learns from each batch of examples.
🔁
Backpropagation
The algorithm that computes how much each weight contributed to the prediction error, so gradient descent knows what to update. Karpathy's series builds this from scratch.
📦
Batch Size
How many training examples are processed before updating weights. Larger batches = more stable gradients, higher memory cost.
🔂
Epoch
One full pass through the entire training dataset. Models typically train for multiple epochs until loss converges.
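These terms fit together in even the smallest possible training loop. The sketch below fits a single weight w to toy data generated with a true value of 3.0, using a squared-error loss, a hand-derived gradient (the one-weight case of backpropagation), full-batch gradient descent, and multiple epochs:

```python
# toy data: y = 3 * x, so the "right answer" for w is 3.0
data = [(x, 3.0 * x) for x in [-2.0, -1.0, 0.5, 1.0, 2.0]]
w = 0.0     # start from an uninformed weight
lr = 0.05   # learning rate

for epoch in range(200):            # one epoch = one full pass over the data
    grad = 0.0
    loss = 0.0
    for x, y in data:               # here the "batch" is the whole dataset
        pred = w * x
        loss += (pred - y) ** 2     # squared-error loss
        grad += 2 * (pred - y) * x  # d(loss)/dw via the chain rule
    w -= lr * grad / len(data)      # gradient descent: nudge w downhill
print(round(w, 3))  # → 3.0
```

Loss shrinks each epoch as w converges to 3.0; in a real network the same loop runs over billions of weights at once.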
04 · THE TRANSFORMER — THE ARCHITECTURE BEHIND EVERY LLM
Attention Is All You Need
In 2017, Vaswani et al. published Attention Is All You Need, introducing the Transformer architecture. Every LLM you will use — GPT-family models, Claude, Gemini, Llama — is built on this architecture. Understanding its core mechanism, self-attention, is the single most important architectural concept for an agent builder.
Before the Transformer, the dominant architecture for sequence tasks was recurrent neural networks (RNNs). RNNs processed sequences token by token, passing a hidden state forward — but this made them slow to train and poor at long-range dependencies. The Transformer eliminated recurrence entirely and replaced it with attention.
What self-attention does (in plain terms): For each token in a sequence, the model computes how much it should attend to every other token — how relevant is each other word when determining the meaning of this word? This lets the model capture long-range relationships in a single step, in parallel, regardless of sequence length.
// SELF-ATTENTION — HOW TOKENS ATTEND TO EACH OTHER
Sentence: "The agent called the API because it was available."
When the model processes "it", self-attention lets it look back at all tokens and determine that "API" is the referent — not "agent":
"The" · "agent" · "called" · "the" · "API" · "because" · "it" · "was" · "available"
When processing "it", the attention weight (simplified) is highest on "API".
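In the original paper this mechanism is scaled dot-product attention: softmax(QK^T / sqrt(d)) · V. A minimal sketch with made-up 2-dimensional vectors (real models use learned projections and far higher dimensions):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # attention weights sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# toy vectors (made up): the query aligns strongly with the second key,
# so the output is dominated by the second value vector
Q = [[3.0, 0.0]]
K = [[0.1, 1.0], [1.0, 0.0], [0.0, 0.2]]
V = [[1.0, 1.0], [5.0, 5.0], [2.0, 2.0]]
print(attention(Q, K, V))
```

Each output row is a weighted blend of the value vectors, with weights set by query-key similarity — exactly how "it" ends up blended mostly with "API" above.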
The key components of a Transformer
👁️
Multi-Head Attention
Multiple attention mechanisms run in parallel, each learning to focus on different types of relationships. One head might track syntax; another, coreference.
CORE MECHANISM (2017–2026)
📍
Positional Encoding
Since attention has no inherent sense of order, positional encodings inject information about each token's position in the sequence.
CORE MECHANISM (2017–2026)
🔀
Feed-Forward Layers
After attention, each token's representation passes through a feed-forward network applied independently. Adds capacity for complex transformations.
CORE MECHANISM (2017–2026)
🔢
Decoder-Only Architecture
GPT-family models and Claude use decoder-only transformers — the encoder/decoder split from the original paper is dropped. Generates text autoregressively, one token at a time.
DOMINANT LLM PATTERN (2020–2026)
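The sinusoidal positional encoding from the original paper can be sketched directly: each position gets a unique pattern of sine and cosine values at different frequencies, which is added to the token's embedding.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need':
    sin for even dimensions, cos for odd, at geometrically spaced frequencies."""
    pe = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

print(positional_encoding(0, 4))  # → [0.0, 1.0, 0.0, 1.0]
```

Many modern decoder-only models use other schemes (e.g. rotary embeddings), but the job is the same: give attention a sense of token order.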
Why this matters for agents: The context window — the total number of tokens the model can attend to at once — is a direct product of the Transformer architecture. Everything the agent has seen (the goal, tool results, memory, conversation history) must fit in this window. Managing that window efficiently is a core agent design skill.
05 · EMBEDDINGS — THE MOST IMPORTANT CONCEPT FOR AGENTS
How AI Understands Meaning
Computers can't work with words directly — everything must be numbers. An embedding is a dense vector (a list of numbers) that represents the semantic meaning of a piece of text. Words or sentences with similar meaning end up with similar vectors — this is the geometric encoding of semantics. It is the foundation of semantic search, RAG pipelines, and agent long-term memory.
Why this matters for agents: When an agent needs to retrieve relevant memory or documents, it converts the query to an embedding, then finds stored embeddings that are mathematically close to it. This allows the agent to find semantically relevant content, not just exact keyword matches — "budget overrun" and "spending exceeded forecast" will have similar embeddings.
// EMBEDDING SPACE — MEANING AS GEOMETRY
dimension 1 →
dimension 2 ↑
ANIMALS cluster
cat
dog
wolf
ROYALTY cluster
king
queen
prince
PROGRAMMING cluster
python
function
loop
Real embeddings have hundreds to thousands of dimensions — not 2. We visualize in 2D to show the core idea: similar meanings cluster together.
Vector similarity — how agents search memory
To find the most relevant documents, an agent computes the cosine similarity between the query embedding and every stored embedding. Cosine similarity is the cosine of the angle between two vectors: values range from -1 to 1, and the closer to 1.0, the more semantically related the content.
// SEMANTIC SEARCH IN AN AGENT — END TO END
1. Query: "What did we discuss about the Q3 budget?"
↓ embedding model converts text to vector
2. Query vector: [0.23, -0.81, 0.44, 0.12, ...] (768+ numbers)
↓ cosine similarity against all stored document vectors
3. Top match: "Q3 budget review showed a 12% overrun..." — similarity: 0.94
↓ inject matched content into agent context
4. Agent responds with grounded, relevant information from memory
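The similarity computation at the heart of this loop is only a few lines. The vectors below are made-up 3-dimensional stand-ins for real embeddings, which have hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# toy "embeddings" (invented for illustration)
memory = {
    "Q3 budget review showed a 12% overrun": [0.9, 0.1, 0.0],
    "team offsite scheduled for May":        [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of the Q3 budget query

best = max(memory, key=lambda doc: cosine_similarity(query_vec, memory[doc]))
print(best)  # → the budget document
```

Production systems replace this linear scan with a vector database, but the ranking criterion is the same.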
06 · TOKENIZATION
How LLMs See Text
LLMs don't read characters or words — they read tokens. A token is a subword chunk, roughly 3–4 characters on average in English. The model converts all text to a sequence of integer token IDs before processing. This has direct practical implications for agent builders: every prompt, every tool result, every memory injection is measured and billed in tokens.
// TEXT → TOKENS → TOKEN IDs
Input text:
"Build me an AI agent"
Tokenized (BPE subwords):
"Build"
" me"
" an"
" AI"
" agent"
Token IDs (integers the model actually processes):
[8585, 502, 459, 15592, 8479]
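A toy greedy tokenizer makes the text-to-IDs step concrete. The vocabulary and IDs here are invented for illustration: real BPE tokenizers learn their vocabularies from data, and the actual IDs depend entirely on which tokenizer a model uses.

```python
# made-up vocabulary: token string -> integer ID (illustrative only)
VOCAB = {"Build": 1, " me": 2, " an": 3, " AI": 4, " agent": 5,
         " a": 6, "B": 7, "u": 8}

def tokenize(text):
    ids = []
    while text:
        # greedily take the longest vocabulary entry that prefixes the text
        match = max((t for t in VOCAB if text.startswith(t)),
                    key=len, default=None)
        if match is None:
            raise ValueError(f"no token for: {text[:10]!r}")
        ids.append(VOCAB[match])
        text = text[len(match):]
    return ids

print(tokenize("Build me an AI agent"))  # → [1, 2, 3, 4, 5]
```

Note the leading spaces inside the vocabulary entries: real tokenizers also fold whitespace into tokens, which is why " agent" and "agent" are different tokens.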
Why token count matters for agent builders: LLMs have a context window limit measured in tokens. Every message, system prompt, tool result, and injected memory chunk consumes tokens. Even on models with very large context windows, a long agent loop with many tool calls can fill the window — and every token also costs money. Designing token-efficient agents is a fundamental production skill.
Context window sizes vary widely and change frequently. The table below shows approximate tiers — always check the current provider documentation before building, as limits change with new model releases.
| Model tier | Typical context window | Rough text equivalent |
| Small / fast models | 8K – 32K tokens | ~6K–24K words / short document |
| Mid-range models | 128K – 200K tokens | ~96K–150K words / full novel |
| Long-context models | 500K – 1M+ tokens | ~375K–750K words / large codebase |
Check current limits: Anthropic · OpenAI · Google
07 · THE LLM TRAINING PIPELINE
Pre-training → Fine-tuning → RLHF
Modern LLMs are not trained in one step. They go through a staged pipeline, and each stage shapes a different aspect of the model's behavior. Understanding this pipeline helps you reason about why models behave the way they do — and when fine-tuning or prompt engineering is the right lever to pull.
🏋️
Pre-training
Training from scratch on vast text corpora via next-token prediction. Builds broad language understanding and world knowledge. Done by AI labs. Costs millions of dollars.
FOUNDATION STAGE
🎯
Supervised Fine-tuning (SFT)
Continuing training on high-quality instruction-following examples. Teaches the model to respond helpfully to human requests rather than just continuing text.
ALIGNMENT STAGE 1
🎖️
RLHF
Human raters compare model outputs and rank them. A reward model is trained on these preferences, then the LLM is optimized (via PPO) to maximize reward scores. Shapes helpfulness, safety, and instruction-following.
ALIGNMENT STAGE 2
⚡
Inference
Running the frozen, trained model on new input to get output. What happens every time you call a model API. Charged per token. All agent loops operate at inference time.
PRODUCTION STAGE
The AI model landscape
Specific model versions change constantly. These categories and their providers are stable — check each provider's current documentation for specific models and pricing.
| Category | Providers (see docs for current models) | Relevance to agents |
| Foundation models | Anthropic, OpenAI, Google, Meta (Llama) | The reasoning core of most agents |
| Embedding models | OpenAI, Voyage AI, Cohere, open-source (sentence-transformers) | Powers semantic search and memory retrieval |
| Vision / multimodal | Anthropic, OpenAI, Google — all major frontier models are multimodal | Agents that process images, PDFs, screenshots |
| Coding agents | Claude Code (Anthropic), GitHub Copilot (Microsoft), Cursor | Specialized autonomous coding agents |
| Open-source models | Meta (Llama series), Mistral, Microsoft (Phi series), Alibaba (Qwen) | Self-hosted agents for privacy or cost control |
08 · CORE CONCEPTS FOR DEBUGGING AGENTS
What Goes Wrong and Why
A handful of ML concepts come up constantly when an agent misbehaves. Understanding these lets you diagnose root causes instead of guessing.
🌫️
Hallucination
The model generates plausible-sounding but false information. It is optimized for fluent, coherent text — not for verifiable truth. Your #1 enemy as an agent builder.
HIGH IMPACT RISK
🎲
Temperature
Controls sampling randomness. High temperature (0.8–1.0) = creative, varied output. Low temperature (0.0–0.2) = focused, near-deterministic output (0 = always pick the most likely token). Agents usually use low temperature for tool calls.
KEY INFERENCE PARAMETER
📅
Training Cutoff
Models only know about events up to their training data cutoff. Agents that need current information must use tools (web search, live APIs) to retrieve it.
AGENT DESIGN CONSTRAINT
📚
Overfitting
The model memorizes training examples instead of learning generalizable patterns. Performs well on training data, poorly on new inputs. Fine-tuning on small datasets is especially susceptible.
TRAINING CONCERN
🔮
Emergent Abilities
Capabilities that appear unexpectedly as models scale up — including in-context learning, chain-of-thought reasoning, and multi-step planning. Not designed in; they emerge from scale.
WIDELY OBSERVED (2020–2026)
📐
Scaling Laws
Model performance improves predictably with increases in parameters, training data, and compute. Empirically studied by labs — the foundation for investments in frontier model development.
WIDELY USED (2020–2026)
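Temperature is just a divisor applied to the model's raw scores (logits) before sampling. A sketch with made-up logits for three candidate tokens:

```python
import math, random

def sample_with_temperature(logits, temperature, rng):
    """Sample a token index from logits softened/sharpened by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs)[0]

rng = random.Random(42)
logits = [2.0, 1.0, 0.1]  # invented scores for three candidate tokens

# low temperature: the distribution sharpens, almost always the top token
low = [sample_with_temperature(logits, 0.1, rng) for _ in range(100)]
# high temperature: the distribution flattens, picks vary more
high = [sample_with_temperature(logits, 1.0, rng) for _ in range(100)]
print(low.count(0), high.count(0))  # low-T count lands near 100
```

Dividing by a small temperature stretches the gaps between logits so the softmax concentrates on the top token; a large temperature compresses them, flattening the distribution.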
Hallucination is your #1 enemy as an agent builder. Design agents with grounding mechanisms — retrieval-augmented generation (RAG), tool use, citation requirements — to minimize the chance your agent confidently provides false information and then acts on it.
SOURCES USED IN THIS SECTION
Verified References
Every claim in this section is grounded in one of these sources. No content is generated from model training data alone.
| Source | Type | Covers | Recency |
| Andrej Karpathy — Neural Networks: Zero to Hero | Video series | Backprop, gradient descent, embeddings, tokenization (BPE), transformers — all built from scratch | 2022–2023 |
| Jay Alammar — The Illustrated Transformer | Blog / visual | Self-attention, multi-head attention, encoder-decoder, positional encoding | 2018 (architecture is timeless) |
| Vaswani et al. — Attention Is All You Need (arXiv:1706.03762) | Academic paper | Original Transformer architecture, multi-head attention, positional encoding | 2017 (foundation of all LLMs) |
| fast.ai — Practical Deep Learning for Coders | Course | Top-down ML fundamentals: training, overfitting, fine-tuning, practical application | Maintained 2022–2026 |