RL & Self-Improvement
LLMs don't just learn from text — they learn from feedback. Reinforcement Learning from Human Feedback (RLHF) is what separates a next-token predictor from a helpful assistant. And self-improvement loops — where agents reflect on failures and revise their behavior without gradient updates — are what separate brittle bots from adaptive agents. This section gives you the mental model behind both.
Agents, Environments, and Reward Signals
Reinforcement Learning (RL) describes a training paradigm where an agent takes actions in an environment, receives a reward signal, and updates its policy to maximize cumulative future reward. Unlike supervised learning — which needs labeled input-output pairs — RL only requires a reward function that evaluates outputs after the fact.
The core loop is: observe state → select action → receive reward → update policy. Formally, this is modeled as a Markov Decision Process (MDP): at each step, the policy maps the current state to a probability distribution over actions, and the environment returns a new state and a scalar reward.
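The loop above can be sketched with a toy MDP (a hypothetical one-dimensional environment invented for illustration; the deterministic policy here is a special case of the stochastic policy the MDP formalism allows):

```python
# Toy MDP: the state is an integer position on a line from 0 to 10.
# Actions move left (-1), right (+1), or stay (0). Reward peaks at state 5.

def step(state, action):
    """Environment: returns (next_state, reward)."""
    next_state = max(0, min(10, state + action))
    reward = -abs(next_state - 5)  # closer to 5 is better
    return next_state, reward

def policy(state):
    """A trivial deterministic policy: move toward state 5."""
    return 1 if state < 5 else (-1 if state > 5 else 0)

state, total_reward = 0, 0
for _ in range(10):                      # observe -> act -> reward loop
    action = policy(state)               # select action given current state
    state, reward = step(state, action)  # environment transition + reward
    total_reward += reward
```

In a real RL setup the policy update step would adjust `policy` toward higher cumulative reward; here the policy is fixed only to keep the loop readable.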
Reinforcement Learning from Human Feedback
Pretraining teaches an LLM to predict text. But predicting text is not the same as being helpful, honest, or harmless. RLHF — introduced at scale in the InstructGPT paper (Ouyang et al., 2022) — is the training pipeline that bridges this gap. It has three stages:
Stage 1: Supervised fine-tuning (SFT). Human labelers write high-quality example responses to a diverse set of prompts. The model is fine-tuned on these demonstration pairs using standard supervised learning. This gives the model the right output format and style before RL begins.
Stage 2: Reward model training. For each prompt, the SFT model generates multiple candidate responses. Human labelers rank these responses from best to worst. A separate neural network, the reward model (RM), is trained to predict these preference rankings, outputting a scalar score for any given (prompt, response) pair.
Stage 3: RL fine-tuning with PPO. The SFT model (now the policy) generates responses to prompts sampled from the training distribution. Each response is scored by the frozen reward model. PPO updates the policy weights to maximize reward, while a KL-divergence penalty prevents the policy from drifting too far from the SFT model and collapsing into reward-hacking behavior.
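The KL-penalized reward signal can be sketched in a simplified per-sequence form (an assumption for clarity; in practice the objective is computed per token, and the KL term here is a single-sample estimate from the log-probability ratio):

```python
# Sketch of the RLHF training signal: reward-model score minus a KL penalty
# that keeps the policy close to the frozen SFT reference model.
# `logprob_policy` and `logprob_ref` are log-probabilities of the sampled
# response under the current policy and the reference model, respectively.

def rlhf_reward(rm_score, logprob_policy, logprob_ref, beta=0.1):
    """RM score minus beta * (single-sample KL estimate)."""
    kl_estimate = logprob_policy - logprob_ref
    return rm_score - beta * kl_estimate
```

When the policy has not drifted from the reference (equal log-probabilities), the penalty vanishes; the more probability mass the policy shifts toward its own sampled outputs relative to the reference, the larger the deduction.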
The reward model is trained on human preferences (ranking A over B), not on absolute quality labels. This is cheaper to collect and more reliable — it is easier for humans to say "this response is better than that one" than to assign a number from 1 to 10. The Elo-style comparisons also scale across annotators with differing absolute standards.
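The pairwise objective behind this is a Bradley-Terry-style loss; a minimal sketch in plain Python, where `r_chosen` and `r_rejected` stand for the RM's scalar scores on the preferred and dispreferred responses:

```python
import math

# Pairwise preference loss for reward model training:
# -log sigmoid(r_chosen - r_rejected). The loss is small when the RM
# scores the human-preferred response higher, and grows as the ranking flips.

def preference_loss(r_chosen, r_rejected):
    """Binary cross-entropy on the score margin between the two responses."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note that only the margin matters, not the absolute scores, which mirrors why relative rankings are easier for annotators than absolute grades.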
Replacing Human Labels with AI Feedback
Human preference labeling is expensive, slow, and inconsistent across annotators. Constitutional AI (CAI), introduced by Anthropic in Bai et al. (2022), proposes a partial solution: replace the human preference labeling step with AI-generated feedback, guided by a written set of principles called a constitution.
The CAI pipeline has two phases:
Phase 1: Supervised (critique and revision). The model generates a response. A second model call critiques it against the constitutional principles ("Is this response harmful according to principle 3?") and then revises it. This revision becomes training data for a new SFT stage.
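The critique-and-revise step can be sketched as follows (a minimal sketch: `call_llm` is a hypothetical single-turn completion function, the two principles are illustrative, and the real pipeline in Bai et al. samples principles from a much larger constitution):

```python
# Sketch of CAI Phase 1: critique a response against a principle, then revise.
# `call_llm` is a hypothetical function: prompt string in, completion out.

CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most honest and accurate.",
]

def critique_and_revise(call_llm, prompt, response, principle):
    """Returns a revised response; the revision becomes SFT training data."""
    critique = call_llm(
        f"Critique this response to '{prompt}' against the principle: "
        f"{principle}\n\nResponse: {response}"
    )
    revision = call_llm(
        f"Rewrite the response to address this critique:\n{critique}"
    )
    return revision
```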
Phase 2: RL from AI feedback (RLAIF). For RLHF's preference ranking step, AI models evaluate pairs of responses according to constitutional principles, producing preference labels at scale without human raters. A reward model is trained on these AI-generated preferences.
| Dimension | Standard RLHF | Constitutional AI (CAI) |
|---|---|---|
| Preference labels | Human raters | AI model guided by written principles |
| Scalability | Bottlenecked by human capacity | Scales with compute, not headcount |
| Transparency | Human preferences are implicit | Principles are explicit and auditable |
| Consistency | Varies across annotators | More consistent — same model, same principles |
| Limitation | Human annotation cost and speed | AI feedback quality depends on the model used as critic |
Dropping the RL in RLHF
RLHF is effective but complex: it requires training a separate reward model, running PPO (which is notoriously unstable), and carefully tuning the KL penalty. Direct Preference Optimization (DPO), introduced by Rafailov et al. (2023), achieves the same alignment objective — training the model to produce preferred responses — using a single supervised learning step, with no RL loop and no explicit reward model.
DPO's key insight: there is a closed-form relationship between the optimal policy and the reward function. You can reparametrize the reward in terms of the policy itself, and then directly optimize the policy on preference pairs using a binary cross-entropy loss — without ever constructing a reward model or running PPO.
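The resulting per-pair loss can be sketched directly (the DPO objective from Rafailov et al., written here for scalar sequence log-probabilities; `beta` controls the strength of the implicit KL constraint):

```python
import math

# Sketch of the DPO loss for one preference pair. Inputs are log-probabilities
# of the chosen/rejected responses under the policy being trained (`pi_*`)
# and under the frozen reference (SFT) model (`ref_*`).

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Binary cross-entropy on the implicit reward margin; no reward model."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The log-ratio terms `pi - ref` play the role of the reward: raising the policy's probability of the chosen response relative to the reference lowers the loss, exactly the behavior PPO would otherwise have to learn through sampled rollouts.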
Reflexion, Verifiable Rewards, and Process Reward Models
Beyond training-time alignment, agents can improve their behavior at inference time through self-reflection and iterative revision — without gradient updates. These self-improvement loops are increasingly central to how high-capability agents operate.
Reflexion is an inference-time framework where an agent: (1) attempts a task, (2) receives feedback (from the environment or a verifier), (3) generates a verbal reflection on what went wrong and why, and (4) stores that reflection in an episodic memory buffer. On the next attempt, the reflection is injected into the context as additional guidance — improving performance without changing any weights.
```python
memory = []
for attempt in range(MAX_ATTEMPTS):
    result = agent.run(task, memory=memory)
    if result.success:
        break
    # Reflect on the failure in natural language
    reflection = llm.generate(
        f"You attempted: {task}\n"
        f"Your output: {result.output}\n"
        f"The outcome: {result.feedback}\n"
        "What went wrong and what will you do differently?"
    )
    memory.append(reflection)  # stored for the next attempt
```
Some tasks have ground-truth verifiable answers: mathematical proofs, code that passes unit tests, formal logic. When rewards can be computed automatically (pass/fail, numeric score), RL training becomes far more scalable because human raters are not needed per sample. This Reinforcement Learning with Verifiable Rewards (RLVR) approach is the foundation of reasoning model training — where the model learns to produce longer, self-correcting reasoning chains that lead to verifiable correct answers.
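For code generation, a verifiable reward can be as simple as running the candidate against unit tests (a minimal sketch; `unit_tests` is assumed to be a list of callables that each take the candidate function and return a boolean):

```python
# Sketch of a verifiable reward for code: 1.0 if the candidate passes every
# unit test, else 0.0. No human rater is needed per sample, which is what
# makes RLVR scale.

def verifiable_reward(candidate_fn, unit_tests):
    """Binary pass/fail reward; exceptions count as failure."""
    try:
        return 1.0 if all(test(candidate_fn) for test in unit_tests) else 0.0
    except Exception:
        return 0.0
```

A graded variant (fraction of tests passed) gives a denser signal, at the cost of rewarding partially-wrong programs.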
Standard reward models evaluate final answers (Outcome Reward Models / ORMs). A Process Reward Model (PRM), introduced in Lightman et al. (2023), instead assigns reward to each intermediate reasoning step. This enables RL to reinforce correct reasoning chains, not just correct final answers — penalizing shortcuts that reach the right answer via flawed logic.
| Model Type | What It Evaluates | Key Trade-off |
|---|---|---|
| ORM (Outcome) | Final answer only | Simple to train; cannot distinguish lucky-correct from correctly-reasoned |
| PRM (Process) | Each intermediate step | More nuanced signal; requires step-level human or AI annotations |
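The difference shows up in how a full solution is scored. A sketch of PRM-style aggregation, assuming the PRM emits a correctness probability per reasoning step and using the product aggregation (one of the schemes evaluated in Lightman et al.):

```python
# Sketch: a PRM scores each intermediate step; the solution score is the
# product of per-step probabilities, so a single flawed step sinks the
# whole chain even if the final answer happens to be right.

def solution_score(step_probs):
    """Product of per-step correctness probabilities."""
    score = 1.0
    for p in step_probs:
        score *= p
    return score
```

An ORM, by contrast, would assign one score to the final answer and could not tell the two chains below apart if both ended correctly.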
Verified References
Every claim in this section is grounded in one of these sources. No content is generated from model training data alone.
| Source | Type | Covers | Recency |
|---|---|---|---|
| Ouyang et al. — InstructGPT | Academic paper (OpenAI) | Full RLHF pipeline: SFT, reward model training, PPO fine-tuning | 2022 |
| Bai et al. — Constitutional AI | Academic paper (Anthropic) | Constitutional AI, RLAIF, AI-generated preference labels | 2022 |
| Rafailov et al. — DPO | Academic paper | Direct Preference Optimization — eliminates RL from alignment fine-tuning | 2023 |
| Shinn et al. — Reflexion | Academic paper | Inference-time self-improvement via verbal reflection and episodic memory | 2023 |
| Lightman et al. — Let's Verify Step by Step | Academic paper (OpenAI) | Process Reward Models (PRMs) vs Outcome Reward Models | 2023 |
| OpenAI Spinning Up — PPO | Official documentation | PPO algorithm, policy gradient methods, KL-divergence constraint | Maintained 2019–2026 |
| Hugging Face TRL — DPO Trainer | Official documentation | Practical DPO implementation, preference data format, training loop | Maintained 2024–2026 |
Section 09 Quiz
8 questions covering all theory blocks. Select one answer per question, then submit.