RL & Self-Improvement
LLMs don't just learn from text — they learn from feedback. Reinforcement Learning from Human Feedback (RLHF) is what separates a next-token predictor from a helpful assistant. And self-improvement loops — where agents reflect on failures and revise their behavior without gradient updates — are what separate brittle bots from adaptive agents. This section gives you the mental model behind both.
Agents, Environments, and Reward Signals
Reinforcement Learning (RL) describes a training paradigm where an agent takes actions in an environment, receives a reward signal, and updates its policy to maximize cumulative future reward. Unlike supervised learning — which needs labeled input-output pairs — RL only requires a reward function that evaluates outputs after the fact.
The core loop is: observe state → select action → receive reward → update policy. Formally, this is modeled as a Markov Decision Process (MDP): at each step, the policy maps the current state to a probability distribution over actions, and the environment returns a new state and a scalar reward.
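The loop above can be sketched with a toy MDP (a hypothetical one-dimensional environment invented for illustration; the deterministic policy here is a special case of the stochastic policy the MDP formalism allows):

```python
# Toy MDP: the state is an integer position on a line from 0 to 10.
# Actions move left (-1), right (+1), or stay (0). Reward peaks at state 5.

def step(state, action):
    """Environment: returns (next_state, reward)."""
    next_state = max(0, min(10, state + action))
    reward = -abs(next_state - 5)  # closer to 5 is better
    return next_state, reward

def policy(state):
    """A trivial deterministic policy: move toward state 5."""
    return 1 if state < 5 else (-1 if state > 5 else 0)

state, total_reward = 0, 0
for _ in range(10):                      # observe -> act -> reward loop
    action = policy(state)               # select action given current state
    state, reward = step(state, action)  # environment transition + reward
    total_reward += reward
```

In a real RL setup the policy update step would adjust `policy` toward higher cumulative reward; here the policy is fixed only to keep the loop readable.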
Reinforcement Learning from Human Feedback
Pretraining teaches an LLM to predict text. But predicting text is not the same as being helpful, honest, or harmless. RLHF — introduced at scale in the InstructGPT paper (Ouyang et al., 2022) — is the training pipeline that bridges this gap. It has three stages:
Stage 1: Supervised fine-tuning (SFT). Human labelers write high-quality example responses to a diverse set of prompts. The model is fine-tuned on these demonstration pairs using standard supervised learning. This gives the model the right output format and style before RL begins.
Stage 2: Reward model training. For each prompt, the SFT model generates multiple candidate responses. Human labelers rank these responses from best to worst. A separate neural network, the reward model (RM), is trained to predict these preference rankings, outputting a scalar score for any given (prompt, response) pair.
Stage 3: RL fine-tuning with PPO. The SFT model (now the policy) generates responses to prompts sampled from the training distribution. Each response is scored by the frozen reward model. PPO updates the policy weights to maximize reward, while a KL-divergence penalty prevents the policy from drifting too far from the SFT model and collapsing into reward-hacking behavior.
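The KL-penalized reward signal can be sketched in a simplified per-sequence form (an assumption for clarity; in practice the objective is computed per token, and the KL term here is a single-sample estimate from the log-probability ratio):

```python
# Sketch of the RLHF training signal: reward-model score minus a KL penalty
# that keeps the policy close to the frozen SFT reference model.
# `logprob_policy` and `logprob_ref` are log-probabilities of the sampled
# response under the current policy and the reference model, respectively.

def rlhf_reward(rm_score, logprob_policy, logprob_ref, beta=0.1):
    """RM score minus beta * (single-sample KL estimate)."""
    kl_estimate = logprob_policy - logprob_ref
    return rm_score - beta * kl_estimate
```

When the policy has not drifted from the reference (equal log-probabilities), the penalty vanishes; the more probability mass the policy shifts toward its own sampled outputs relative to the reference, the larger the deduction.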
The reward model is trained on human preferences (ranking A over B), not on absolute quality labels. This is cheaper to collect and more reliable — it is easier for humans to say "this response is better than that one" than to assign a number from 1 to 10. The Elo-style comparisons also scale across annotators with differing absolute standards.
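The pairwise objective behind this is a Bradley-Terry-style loss; a minimal sketch in plain Python, where `r_chosen` and `r_rejected` stand for the RM's scalar scores on the preferred and dispreferred responses:

```python
import math

# Pairwise preference loss for reward model training:
# -log sigmoid(r_chosen - r_rejected). The loss is small when the RM
# scores the human-preferred response higher, and grows as the ranking flips.

def preference_loss(r_chosen, r_rejected):
    """Binary cross-entropy on the score margin between the two responses."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note that only the margin matters, not the absolute scores, which mirrors why relative rankings are easier for annotators than absolute grades.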
Replacing Human Labels with AI Feedback
Human preference labeling is expensive, slow, and inconsistent across annotators. Constitutional AI (CAI), introduced by Anthropic in Bai et al. (2022), proposes a partial solution: replace the human preference labeling step with AI-generated feedback, guided by a written set of principles called a constitution.
The CAI pipeline has two phases:
Phase 1: Supervised (critique and revision). The model generates a response. A second model call critiques it against the constitutional principles ("Is this response harmful according to principle 3?") and then revises it. This revision becomes training data for a new SFT stage.
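The critique-and-revise step can be sketched as follows (a minimal sketch: `call_llm` is a hypothetical single-turn completion function, the two principles are illustrative, and the real pipeline in Bai et al. samples principles from a much larger constitution):

```python
# Sketch of CAI Phase 1: critique a response against a principle, then revise.
# `call_llm` is a hypothetical function: prompt string in, completion out.

CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that is most honest and accurate.",
]

def critique_and_revise(call_llm, prompt, response, principle):
    """Returns a revised response; the revision becomes SFT training data."""
    critique = call_llm(
        f"Critique this response to '{prompt}' against the principle: "
        f"{principle}\n\nResponse: {response}"
    )
    revision = call_llm(
        f"Rewrite the response to address this critique:\n{critique}"
    )
    return revision
```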
Phase 2: RL from AI feedback (RLAIF). For RLHF's preference ranking step, AI models evaluate pairs of responses according to constitutional principles, producing preference labels at scale without human raters. A reward model is trained on these AI-generated preferences.
| Dimension | Standard RLHF | Constitutional AI (CAI) |
|---|---|---|
| Preference labels | Human raters | AI model guided by written principles |
| Scalability | Bottlenecked by human capacity | Scales with compute, not headcount |
| Transparency | Human preferences are implicit | Principles are explicit and auditable |
| Consistency | Varies across annotators | More consistent — same model, same principles |
| Limitation | Human annotation cost and speed | AI feedback quality depends on the model used as critic |
Dropping the RL in RLHF
RLHF is effective but complex: it requires training a separate reward model, running PPO (which is notoriously unstable), and carefully tuning the KL penalty. Direct Preference Optimization (DPO), introduced by Rafailov et al. (2023), achieves the same alignment objective — training the model to produce preferred responses — using a single supervised learning step, with no RL loop and no explicit reward model.
DPO's key insight: there is a closed-form relationship between the optimal policy and the reward function. You can reparametrize the reward in terms of the policy itself, and then directly optimize the policy on preference pairs using a binary cross-entropy loss — without ever constructing a reward model or running PPO.
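The resulting per-pair loss can be sketched directly (the DPO objective from Rafailov et al., written here for scalar sequence log-probabilities; `beta` controls the strength of the implicit KL constraint):

```python
import math

# Sketch of the DPO loss for one preference pair. Inputs are log-probabilities
# of the chosen/rejected responses under the policy being trained (`pi_*`)
# and under the frozen reference (SFT) model (`ref_*`).

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Binary cross-entropy on the implicit reward margin; no reward model."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The log-ratio terms `pi - ref` play the role of the reward: raising the policy's probability of the chosen response relative to the reference lowers the loss, exactly the behavior PPO would otherwise have to learn through sampled rollouts.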
Reflexion, Verifiable Rewards, and Process Reward Models
Beyond training-time alignment, agents can improve their behavior at inference time through self-reflection and iterative revision — without gradient updates. These self-improvement loops are increasingly central to how high-capability agents operate.
Reflexion is an inference-time framework where an agent: (1) attempts a task, (2) receives feedback (from the environment or a verifier), (3) generates a verbal reflection on what went wrong and why, and (4) stores that reflection in an episodic memory buffer. On the next attempt, the reflection is injected into the context as additional guidance — improving performance without changing any weights.
```python
memory = []
for attempt in range(MAX_ATTEMPTS):
    result = agent.run(task, memory=memory)
    if result.success:
        break
    # Reflect on the failure in natural language
    reflection = llm.generate(
        f"You attempted: {task}\n"
        f"Your output: {result.output}\n"
        f"The outcome: {result.feedback}\n"
        "What went wrong and what will you do differently?"
    )
    memory.append(reflection)  # stored for the next attempt
```
Some tasks have ground-truth verifiable answers: mathematical proofs, code that passes unit tests, formal logic. When rewards can be computed automatically (pass/fail, numeric score), RL training becomes far more scalable because human raters are not needed per sample. This Reinforcement Learning with Verifiable Rewards (RLVR) approach is the foundation of reasoning model training — where the model learns to produce longer, self-correcting reasoning chains that lead to verifiable correct answers.
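For code generation, a verifiable reward can be as simple as running the candidate against unit tests (a minimal sketch; `unit_tests` is assumed to be a list of callables that each take the candidate function and return a boolean):

```python
# Sketch of a verifiable reward for code: 1.0 if the candidate passes every
# unit test, else 0.0. No human rater is needed per sample, which is what
# makes RLVR scale.

def verifiable_reward(candidate_fn, unit_tests):
    """Binary pass/fail reward; exceptions count as failure."""
    try:
        return 1.0 if all(test(candidate_fn) for test in unit_tests) else 0.0
    except Exception:
        return 0.0
```

A graded variant (fraction of tests passed) gives a denser signal, at the cost of rewarding partially-wrong programs.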
Standard reward models evaluate final answers (Outcome Reward Models / ORMs). A Process Reward Model (PRM), introduced in Lightman et al. (2023), instead assigns reward to each intermediate reasoning step. This enables RL to reinforce correct reasoning chains, not just correct final answers — penalizing shortcuts that reach the right answer via flawed logic.
| Model Type | What It Evaluates | Key Trade-off |
|---|---|---|
| ORM (Outcome) | Final answer only | Simple to train; cannot distinguish lucky-correct from correctly-reasoned |
| PRM (Process) | Each intermediate step | More nuanced signal; requires step-level human or AI annotations |
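The difference shows up in how a full solution is scored. A sketch of PRM-style aggregation, assuming the PRM emits a correctness probability per reasoning step and using the product aggregation (one of the schemes evaluated in Lightman et al.):

```python
# Sketch: a PRM scores each intermediate step; the solution score is the
# product of per-step probabilities, so a single flawed step sinks the
# whole chain even if the final answer happens to be right.

def solution_score(step_probs):
    """Product of per-step correctness probabilities."""
    score = 1.0
    for p in step_probs:
        score *= p
    return score
```

An ORM, by contrast, would assign one score to the final answer and could not tell the two chains below apart if both ended correctly.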
Verified References
Every claim in this section is grounded in one of these sources. No content is generated from model training data alone.
| Source | Type | Covers | Recency |
|---|---|---|---|
| Ouyang et al. — InstructGPT | Academic paper (OpenAI) | Full RLHF pipeline: SFT, reward model training, PPO fine-tuning | 2022 |
| Bai et al. — Constitutional AI | Academic paper (Anthropic) | Constitutional AI, RLAIF, AI-generated preference labels | 2022 |
| Rafailov et al. — DPO | Academic paper | Direct Preference Optimization — eliminates RL from alignment fine-tuning | 2023 |
| Shinn et al. — Reflexion | Academic paper | Inference-time self-improvement via verbal reflection and episodic memory | 2023 |
| Lightman et al. — Let's Verify Step by Step | Academic paper (OpenAI) | Process Reward Models (PRMs) vs Outcome Reward Models | 2023 |
| OpenAI Spinning Up — PPO | Official documentation | PPO algorithm, policy gradient methods, KL-divergence constraint | Maintained 2019–2026 |
| Hugging Face TRL — DPO Trainer | Official documentation | Practical DPO implementation, preference data format, training loop | Maintained 2024–2026 |
Section 09 Quiz
8 questions covering all theory blocks. Select one answer per question, then submit.