SECTION 09 CORE THEORY

RL & Self-Improvement

LLMs don't just learn from text — they learn from feedback. Reinforcement Learning from Human Feedback (RLHF) is what separates a next-token predictor from a helpful assistant. And self-improvement loops — where agents reflect on failures and revise their behavior without gradient updates — are what separate brittle bots from adaptive agents. This section gives you the mental model behind both.

01 · THE RL FEEDBACK LOOP

Agents, Environments, and Reward Signals

Reinforcement Learning (RL) describes a training paradigm where an agent takes actions in an environment, receives a reward signal, and updates its policy to maximize cumulative future reward. Unlike supervised learning — which needs labeled input-output pairs — RL only requires a reward function that evaluates outputs after the fact.

The core loop is: observe state → select action → receive reward → update policy. Formally, this is modeled as a Markov Decision Process (MDP): at each step, the policy maps the current state to a probability distribution over actions, and the environment returns a new state and a scalar reward.
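The loop above can be sketched with a two-armed bandit, the smallest RL setting: the agent's running value estimates stand in for the policy, and hidden reward probabilities stand in for the environment (the values and hyperparameters here are toy choices for illustration):

```python
import random

def run_bandit(steps=2000, eps=0.1, seed=0):
    """Minimal RL loop on a 2-armed bandit: act -> reward -> update."""
    rng = random.Random(seed)
    true_means = [0.2, 0.8]   # hidden reward probabilities (the "environment")
    q = [0.0, 0.0]            # the agent's action-value estimates (its "policy" basis)
    counts = [0, 0]
    for _ in range(steps):
        # select action: explore with probability eps, otherwise exploit estimates
        a = rng.randrange(2) if rng.random() < eps else max(range(2), key=lambda i: q[i])
        # the environment returns a scalar reward
        r = 1.0 if rng.random() < true_means[a] else 0.0
        # incremental update of the chosen action's value estimate
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]
    return q

q = run_bandit()
```

After enough steps the estimate for the better arm dominates, so the greedy policy converges toward it: the policy improved purely from scalar rewards, never from labeled examples.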

TERM 01
Policy (π)
The agent's decision function — maps a state to an action (or a distribution over actions). For an LLM, the policy is the model weights that determine what token to output next.
FOUNDATIONAL CONCEPT
TERM 02
Reward Signal
A scalar value provided after each action (or at episode end) that tells the agent how well it did. The reward function is the most critical design decision in any RL system — it determines what the agent optimizes for.
FOUNDATIONAL CONCEPT
TERM 03
Environment
Everything outside the agent. For a game-playing agent it is the game engine. For an LLM agent it is the tools, APIs, file systems, and user turns — everything that returns observations in response to actions.
FOUNDATIONAL CONCEPT
TERM 04
PPO (Proximal Policy Optimization)
The RL algorithm most commonly used to train LLMs on reward signals. PPO updates the policy while preventing it from drifting too far from the previous policy in a single step — the "proximal" constraint stabilizes training.
WIDELY USED (2022–2026)
Analogy: RL is like training a dog. You don't show the dog a labeled dataset of "correct sits." You reward the dog when it sits, withhold treats when it doesn't, and the dog's behavior converges toward more sitting. The reward function is the trainer's intent — if you accidentally reward jumping up instead of sitting, you get a jumpy dog. The same misalignment risk applies to LLM reward models.

02 · RLHF

Reinforcement Learning from Human Feedback

Pretraining teaches an LLM to predict text. But predicting text is not the same as being helpful, honest, or harmless. RLHF — introduced at scale in the InstructGPT paper (Ouyang et al., 2022) — is the training pipeline that bridges this gap. It has three stages:

STAGE 1 — SUPERVISED FINE-TUNING (SFT)

Human labelers write high-quality example responses to a diverse set of prompts. The model is fine-tuned on these demonstration pairs using standard supervised learning. This gives the model the right output format and style before RL begins.

STAGE 2 — REWARD MODEL TRAINING

For each prompt, the SFT model generates multiple candidate responses. Human labelers rank these responses from best to worst. A separate neural network — the reward model (RM) — is trained to predict these preference rankings, outputting a scalar score for any given (prompt, response) pair.
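The ranking data is typically converted into a Bradley-Terry-style pairwise loss: the reward model should score the chosen response above the rejected one. A minimal sketch, where the scalar scores stand in for the RM's outputs:

```python
import math

def rm_pairwise_loss(score_chosen, score_rejected):
    """Bradley-Terry-style reward model loss: -log sigmoid(r_chosen - r_rejected).
    The loss shrinks as the RM ranks the preferred response higher by a wider margin."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the two scores are equal the loss is log 2 (the RM is indifferent); it falls toward zero as the chosen response pulls ahead, and grows when the RM ranks the rejected response higher.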

STAGE 3 — RL FINE-TUNING (PPO)

The SFT model (now the policy) generates responses to prompts sampled from the training distribution. Each response is scored by the frozen reward model. PPO updates the policy weights to maximize reward — while a KL-divergence penalty prevents the policy from drifting too far from the SFT model and collapsing into reward-hacking behavior.

Why KL divergence matters: Without the KL penalty, the policy would quickly learn to generate outputs that score high on the reward model but look nothing like natural language — a phenomenon called reward hacking. The penalty anchors the RL-trained model to the SFT baseline, acting as a regularizer.
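A sketch of how the penalty enters the objective, using a simple sample-based KL estimate (the exact estimator and whether it is applied per token or per sequence vary by implementation; `beta` is the penalty coefficient):

```python
def ppo_total_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """RLHF-style shaped reward: reward model score minus a KL penalty
    that anchors the policy to the frozen SFT reference model."""
    kl = logp_policy - logp_ref   # sample-based estimate of the KL term
    return rm_score - beta * kl
```

If the policy assigns its outputs much higher log-probability than the reference model does (i.e., it has drifted far from the SFT baseline), the penalty eats into the reward, which is exactly the anchoring effect described above.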

The reward model is trained on human preferences (ranking A over B), not on absolute quality labels. This is cheaper to collect and more reliable — it is easier for humans to say "this response is better than that one" than to assign a number from 1 to 10. The Elo-style comparisons also scale across annotators with differing absolute standards.

03 · CONSTITUTIONAL AI & RLAIF

Replacing Human Labels with AI Feedback

Human preference labeling is expensive, slow, and inconsistent across annotators. Constitutional AI (CAI), introduced by Anthropic in Bai et al. (2022), proposes a partial solution: replace the human preference labeling step with AI-generated feedback, guided by a written set of principles called a constitution.

The CAI pipeline has two phases:

PHASE 1 — SUPERVISED CRITIQUE & REVISION (SL-CAI)

The model generates a response. A second model call critiques it against the constitutional principles ("Is this response harmful according to principle 3?") and then revises it. This revision becomes training data for a new SFT stage.

PHASE 2 — RLAIF (RL from AI Feedback)

For RLHF's preference ranking step, AI models evaluate pairs of responses according to constitutional principles — producing preference labels at scale, without human raters. A reward model is trained on these AI-generated preferences.

Dimension         | Standard RLHF                   | Constitutional AI (CAI)
Preference labels | Human raters                    | AI model guided by written principles
Scalability       | Bottlenecked by human capacity  | Scales with compute, not headcount
Transparency      | Human preferences are implicit  | Principles are explicit and auditable
Consistency       | Varies across annotators        | More consistent — same model, same principles
Limitation        | Human annotation cost and speed | AI feedback quality depends on the model used as critic
Not a replacement: CAI reduces but does not eliminate the need for human judgment. The constitutional principles themselves must be written and reviewed by humans. The AI evaluator's quality sets the ceiling on feedback quality.

04 · DPO — DIRECT PREFERENCE OPTIMIZATION

Dropping the RL in RLHF

RLHF is effective but complex: it requires training a separate reward model, running PPO (which is notoriously unstable), and carefully tuning the KL penalty. Direct Preference Optimization (DPO), introduced by Rafailov et al. (2023), achieves the same alignment objective — training the model to produce preferred responses — using a single supervised learning step, with no RL loop and no explicit reward model.

DPO's key insight: there is a closed-form relationship between the optimal policy and the reward function. You can reparametrize the reward in terms of the policy itself, and then directly optimize the policy on preference pairs using a binary cross-entropy loss — without ever constructing a reward model or running PPO.
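The resulting per-pair loss can be sketched directly. The log-probabilities here are stand-in scalars; a real implementation sums token log-probs over each full response under the policy and the frozen reference model:

```python
import math

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO objective for one preference pair: binary cross-entropy on the
    policy's implicit reward margin, measured relative to the reference model.
    (pi_* = policy log-probs, ref_* = reference log-probs, w = chosen, l = rejected.)"""
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; shifting probability mass toward the chosen response lowers it. The whole update is one supervised gradient step, with no sampling loop and no reward model.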

ADVANTAGE 01
No Reward Model Required
DPO optimizes directly on preference pairs (winner, loser) without training a separate scorer. This eliminates a large source of complexity and error in the RLHF pipeline.
EMERGING PATTERN (2023–2026)
ADVANTAGE 02
No PPO Instability
PPO requires careful hyperparameter tuning — clip ratio, learning rate, mini-batch size — and can destabilize training. DPO is a supervised update and is generally more stable and reproducible.
EMERGING PATTERN (2023–2026)
LIMITATION 01
Requires Preference Pairs
DPO still needs a dataset of (prompt, chosen response, rejected response) pairs — the same data used to train RLHF reward models. Human or AI labeling is still required; the RL stage is what is eliminated.
DATA DEPENDENCY
LIMITATION 02
Static Preference Data
RLHF with PPO generates new responses on-the-fly (online RL). DPO trains on a fixed dataset (offline). It cannot adapt to the model's own evolving outputs during training — a potential quality ceiling in some settings.
DESIGN TRADE-OFF
Practical impact: DPO and its variants (IPO, KTO, SimPO) have become the dominant approach for alignment fine-tuning of open-weight models as of 2024–2025. Major fine-tuning frameworks (Hugging Face TRL, LLaMA-Factory) ship DPO as a first-class training mode.

05 · SELF-IMPROVEMENT LOOPS

Reflexion, Verifiable Rewards, and Process Reward Models

Beyond training-time alignment, agents can improve their behavior at inference time through self-reflection and iterative revision — without gradient updates. These self-improvement loops are increasingly central to how high-capability agents operate.

PATTERN A — REFLEXION (Shinn et al., 2023)

Reflexion is an inference-time framework where an agent: (1) attempts a task, (2) receives feedback (from the environment or a verifier), (3) generates a verbal reflection on what went wrong and why, and (4) stores that reflection in an episodic memory buffer. On the next attempt, the reflection is injected into the context as additional guidance — improving performance without changing any weights.

Reflexion loop — conceptual pseudocode (agent.run and llm.generate stand in for an agent framework and LLM client)
MAX_ATTEMPTS = 3          # retry budget
memory = []               # episodic buffer of past reflections

for attempt in range(MAX_ATTEMPTS):
    # prior reflections are injected into the agent's context
    result = agent.run(task, memory=memory)

    if result.success:
        break

    # Reflect on the failure in natural language
    reflection = llm.generate(
        f"You attempted: {task}\n"
        f"Your output: {result.output}\n"
        f"The outcome: {result.feedback}\n"
        "What went wrong and what will you do differently?"
    )
    memory.append(reflection)   # stored for the next attempt

PATTERN B — VERIFIABLE REWARDS & RLVR

Some tasks have ground-truth verifiable answers: mathematical proofs, code that passes unit tests, formal logic. When rewards can be computed automatically (pass/fail, numeric score), RL training becomes far more scalable because human raters are not needed per sample. This Reinforcement Learning with Verifiable Rewards (RLVR) approach is the foundation of reasoning model training — where the model learns to produce longer, self-correcting reasoning chains that lead to verifiable correct answers.
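A minimal sketch of a verifiable reward for code: execute the candidate program against ground-truth tests and return pass/fail, with no rater in the loop. The `solve` entry-point name is a convention assumed by this sketch, not a standard:

```python
def verifiable_reward(candidate_src, tests):
    """RLVR-style reward: run candidate code against ground-truth tests.
    Returns 1.0 only on a full pass; any wrong answer or crash yields 0.0."""
    ns = {}
    try:
        exec(candidate_src, ns)   # define the candidate's solve() function
        for args, expected in tests:
            if ns["solve"](*args) != expected:
                return 0.0
        return 1.0
    except Exception:
        return 0.0
```

Because the reward is computed mechanically, it can be evaluated on millions of samples per training run, which is what makes RLVR scale where human preference labeling cannot. (A production version would sandbox the execution.)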

PATTERN C — PROCESS REWARD MODELS (PRMs)

Standard reward models evaluate final answers (Outcome Reward Models / ORMs). A Process Reward Model (PRM), introduced in Lightman et al. (2023), instead assigns reward to each intermediate reasoning step. This enables RL to reinforce correct reasoning chains, not just correct final answers — penalizing shortcuts that reach the right answer via flawed logic.

Model Type    | What It Evaluates      | Key Trade-off
ORM (Outcome) | Final answer only      | Simple to train; cannot distinguish lucky-correct from correctly-reasoned
PRM (Process) | Each intermediate step | More nuanced signal; requires step-level human or AI annotations
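The contrast can be sketched as two scoring functions over the same reasoning chain. Taking the minimum over step scores is one common PRM aggregation (others use the product or mean); the per-step scores here are illustrative stand-ins for a trained PRM's outputs:

```python
def orm_score(final_correct):
    """Outcome Reward Model view: only the final answer matters."""
    return 1.0 if final_correct else 0.0

def prm_score(step_scores):
    """Process Reward Model view: aggregate per-step correctness scores.
    With min-aggregation, a single flawed step sinks the whole chain."""
    return min(step_scores)
```

A chain that reaches the right answer through a flawed middle step gets full credit from the ORM but a low PRM score, which is precisely the "lucky-correct" shortcut the PRM is designed to penalize.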
Reward Hacking: A persistent risk across all reward-based training. If the proxy reward signal imperfectly represents the true objective, the model will find and exploit that gap — generating outputs that score well on the reward model but are unhelpful or incorrect in reality. KL-divergence penalties, diverse evaluation sets, and human audits are the primary mitigations in practice.

SOURCES USED IN THIS SECTION

Verified References

Every claim in this section is grounded in one of these sources. No content is generated from model training data alone.

Source                                      | Type                       | Covers                                                                    | Recency
Ouyang et al. — InstructGPT                 | Academic paper (OpenAI)    | Full RLHF pipeline: SFT, reward model training, PPO fine-tuning           | 2022
Bai et al. — Constitutional AI              | Academic paper (Anthropic) | Constitutional AI, RLAIF, AI-generated preference labels                  | 2022
Rafailov et al. — DPO                       | Academic paper             | Direct Preference Optimization — eliminates RL from alignment fine-tuning | 2023
Shinn et al. — Reflexion                    | Academic paper             | Inference-time self-improvement via verbal reflection and episodic memory | 2023
Lightman et al. — Let's Verify Step by Step | Academic paper (OpenAI)    | Process Reward Models (PRMs) vs Outcome Reward Models                     | 2023
OpenAI Spinning Up — PPO                    | Official documentation     | PPO algorithm, policy gradient methods, KL-divergence constraint          | Maintained 2019–2026
Hugging Face TRL — DPO Trainer              | Official documentation     | Practical DPO implementation, preference data format, training loop       | Maintained 2024–2026
KNOWLEDGE CHECK

Section 09 Quiz

8 questions covering all theory blocks. Select one answer per question, then submit.

Section 09 — RL & Self-Improvement
8 QUESTIONS · MULTIPLE CHOICE · UNLIMITED RETRIES
Question 1 of 8
In the RLHF training pipeline, what role does the reward model play during the PPO fine-tuning stage?
Question 2 of 8
Constitutional AI (Bai et al., 2022) differs from standard RLHF primarily because:
Question 3 of 8
DPO (Direct Preference Optimization, Rafailov et al. 2023) simplifies the alignment training pipeline by eliminating which component present in standard RLHF?
Question 4 of 8
In the Reflexion framework (Shinn et al., 2023), how does an agent improve its performance on subsequent attempts without updating its model weights?
Question 5 of 8
When RL terminology is applied to an LLM agent loop, what component plays the role of the "environment"?
Question 6 of 8
A Process Reward Model (PRM) differs from an Outcome Reward Model (ORM) because a PRM:
Question 7 of 8
Why are verifiable rewards (e.g., for math or code tasks) advantageous over human preference labels for RL training?
Question 8 of 8
Which scenario best describes "reward hacking" in an RLHF-trained model?

