Part II · Training, Fine-Tuning, Alignment
Chapter 17 Core ~20 min read

RLHF, DPO, KTO, Constitutional AI

"SFT teaches the model to follow instructions. Alignment teaches it to follow the instructions you actually want."

After SFT (Chapter 16), the model knows how to behave like an assistant. But it doesn’t necessarily behave the way you want. Maybe it’s too verbose. Maybe it refuses too much. Maybe it makes confident factual errors. Maybe it hedges when it should commit. Maybe it cusses when it shouldn’t. SFT can’t fix these because they’re not about “what to say” — they’re about which of many plausible answers is the better one. You can’t just write training examples that say “say X, not Y” for every distinction you care about.

The answer is preference learning: collect (or generate) pairs of responses where one is judged “better” than the other, and train the model to produce the better one. This family of techniques — RLHF, DPO, KTO, Constitutional AI — is collectively called alignment or post-training, and it’s the third major stage of LLM training after pretraining and SFT.

This chapter is the hardest one in Part II to keep concise because the field is moving fast. I’m going to focus on the durable concepts: the framework, the classical RLHF pipeline, why DPO simplified it, and the current state. If you understand RLHF, DPO, and the KL constraint, the rest of the field is variations.

Outline:

  1. Why preference learning, not SFT-only.
  2. The classical RLHF pipeline: SFT → reward model → PPO.
  3. The reward model.
  4. PPO in the RLHF context.
  5. The KL penalty and the reference model.
  6. RLHF’s failure modes.
  7. DPO: collapsing the pipeline into one stage.
  8. The DPO derivation and what it actually optimizes.
  9. KTO: from preferences to binary signal.
  10. Constitutional AI and the self-critique loop.
  11. The current state of the art.

17.1 Why preference learning

SFT works by showing the model examples of what to do. It can’t show the model what not to do, except by omission. And it can’t capture preferences over multiple valid answers.

Consider the prompt “Explain photosynthesis.” There are dozens of valid responses: a one-sentence summary, a paragraph for a child, a detailed mechanism for a biology student, a poetic metaphor. Each is correct. SFT picks one and trains the model on it; the others are lost.

Worse: there are subtle qualities that distinguish “good” from “great” responses, and these qualities are easier to judge than to specify. Two responses might both be technically correct, but one is more useful, more concise, less hedging, more honest about uncertainty. Asking a human to write the perfect response is hard. Asking a human to rank two existing responses is easy. Preference learning exploits this asymmetry.

The framework: collect data of the form (prompt, response_A, response_B, preferred_one), and train the model to prefer the preferred response. This is simpler to gather, more scalable, and captures dimensions of quality that SFT can’t.
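Concretely, a single preference record might look like this (field names are illustrative; real datasets vary):

```python
# One preference record; the field names here are illustrative, not a fixed schema.
record = {
    "prompt": "Explain photosynthesis.",
    "chosen": "Plants use light energy to convert water and CO2 into "
              "glucose and oxygen, via chlorophyll in the chloroplasts.",
    "rejected": "Photosynthesis is when plants eat sunlight.",
}
# A dataset is a list of such records: the labeler never assigns an
# absolute quality score, only the relative judgment between two responses.
```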

graph TD
  PT[Pretrained base model] --> SFT[SFT — instruction following]
  SFT --> RM[Reward model training<br/>on human preference pairs]
  SFT --> PPO[Policy — PPO training]
  RM -->|score rollouts| PPO
  PPO -->|KL penalty vs| REF[Frozen SFT reference]
  PPO --> ALIGNED[Aligned model]
  style PPO fill:var(--fig-accent-soft),stroke:var(--fig-accent)
  style ALIGNED fill:var(--fig-surface),stroke:var(--fig-border)

The three-stage RLHF pipeline requires four model copies in memory simultaneously (policy, reference, reward model, and value model) — the main reason it was displaced by DPO for open-source work.

17.2 The classical RLHF pipeline

RLHF (Reinforcement Learning from Human Feedback) was introduced for language models in the InstructGPT paper (Ouyang et al., 2022) and was the key to making ChatGPT a usable product. The pipeline has three stages:

Stage 1: SFT

You start with a pretrained base model and do SFT on a small instruction-following dataset (Chapter 16). This produces an “SFT model” that behaves like a chatbot but isn’t yet aligned to preferences.

Stage 2: Reward modeling

You collect a dataset of human preferences. For each prompt, sample two (or more) responses from the SFT model, and have a human label which one they prefer. The dataset has the form:

(prompt, response_chosen, response_rejected)

Then you train a reward model: a separate model (often initialized from the SFT checkpoint) that takes a (prompt, response) pair and outputs a scalar score. The reward model is trained with a pairwise loss:

L = -log σ(r(prompt, chosen) - r(prompt, rejected))

This is the Bradley-Terry preference model. It says: the probability that the chosen response is preferred over the rejected one should be proportional to σ of the score difference. Maximize this likelihood and you get a model whose scores correlate with human preferences.
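A minimal PyTorch sketch of this pairwise loss, assuming the reward model has already produced scalar scores for each response in the batch:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen: torch.Tensor,
                      score_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry pairwise loss: -log sigma(r_chosen - r_rejected).
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# The loss shrinks as the chosen response out-scores the rejected one,
# and grows when the model ranks the pair the wrong way around.
good = reward_model_loss(torch.tensor([4.0]), torch.tensor([-4.0]))
bad = reward_model_loss(torch.tensor([-4.0]), torch.tensor([4.0]))
```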

Reward models are typically trained on tens of thousands to hundreds of thousands of preference pairs. The InstructGPT reward model used about 33k human comparisons. Modern frontier reward models use millions.

Stage 3: PPO

Now use the reward model as a scalar reward signal to train the SFT model further with reinforcement learning. The setup:

  • The SFT model is the policy.
  • For each prompt, the policy samples a response.
  • The reward model scores the response.
  • The policy is updated to produce responses that score higher.

The RL algorithm of choice is PPO (Proximal Policy Optimization), a workhorse from deep RL that keeps repeated policy updates from going off the rails. PPO clips the policy update so each step doesn’t move the policy too far from where it was, which is critical for stability.

Concretely, the PPO update for an LLM policy is:

L_PPO = E[ min(r_t(θ) A_t, clip(r_t(θ), 1-ε, 1+ε) A_t) ]

where r_t(θ) is the ratio of the new policy’s probability of the action vs the old policy’s, and A_t is the “advantage” — how much better the action was than expected. The clip is the PPO trick: it bounds the ratio so the policy can’t move more than ε (typically 0.2) in one step.
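As a sketch (tensor names are illustrative), the clipped objective translates to a few lines of PyTorch; we minimize the negative surrogate:

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantage: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    # Negative clipped surrogate: minimizing this maximizes L_PPO.
    ratio = torch.exp(logp_new - logp_old)                   # r_t(theta)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```

With a positive advantage, a policy that has already moved far (ratio well above 1 + ε) gets no extra gradient from moving further: the clip caps the term at (1 + ε) × A.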

This is standard PPO. The interesting part is what we add for LLMs.

17.3 The reward model in detail

Building a good reward model is the hardest part of the RLHF pipeline. The reward model has to capture human preferences accurately, generalize to new prompts, and not be too easy to “hack” — that is, the policy shouldn’t be able to find responses that score high but are actually terrible.

Reward model architecture:

  • Same base architecture as the SFT model (often initialized from the SFT checkpoint).
  • A linear head on top of the final layer that outputs a single scalar — the reward.
  • Sometimes a log-sigmoid output to bound the reward to negative values; often just the raw scalar.
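A minimal sketch of that head, assuming the backbone's hidden states are already computed (`hidden_size` and the last-token indexing convention are illustrative):

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    # Scalar head on top of a transformer backbone's final hidden states.
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, hidden_states, last_token_idx):
        # hidden_states: (batch, seq, hidden); last_token_idx: (batch,)
        batch = torch.arange(hidden_states.size(0))
        final = hidden_states[batch, last_token_idx]   # (batch, hidden)
        return self.score(final).squeeze(-1)           # (batch,) scalar rewards
```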

Reward model training:

  • Pairwise loss as in §17.2: maximize the likelihood that chosen scores higher than rejected.
  • Standard hyperparameters (low LR, few epochs).
  • Held-out validation set of preference pairs to monitor for overfitting.

Reward model quality is measured by agreement with human labelers on a held-out set. A typical “good” reward model agrees with humans about 70–75% of the time on hard preference pairs (pairs where both responses are reasonable but one is slightly better). Random would be 50%; perfect would be ~85% (the agreement rate between humans themselves on the same pairs).

The reward model is typically the same parameter count as the policy, or smaller. Making it larger yields limited benefit, because its quality is bounded by the preference data it was trained on.

17.4 PPO in the RLHF context

PPO for language models is a specific flavor of RL with a few characteristic choices:

  • The action space is the vocabulary at each generation step. Each token sampled from the policy is an “action.”
  • Episodes are full responses. A single rollout = one prompt → one full sampled response → one reward from the reward model.
  • Rewards are sparse. Only at the end of the response (when EOS is sampled) does the reward model give a score. The intermediate token-level rewards are zero (except for the KL penalty, see §17.5).
  • The advantage is computed via GAE (Generalized Advantage Estimation), which propagates the final reward back to earlier tokens.

The training loop runs a rollout buffer: sample N responses from the current policy on a batch of prompts, score them all with the reward model, compute advantages, do K PPO updates on the buffer, throw it away, and repeat. This is the standard on-policy RL pattern.
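A sketch of GAE for one finished response under this sparse-reward setup, assuming per-token rewards (zero except at the end) and value-model estimates are already available:

```python
import torch

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    # rewards, values: (T,) tensors for one finished response (episode).
    T = rewards.size(0)
    adv = torch.zeros(T)
    next_value = 0.0   # no bootstrap past EOS: the episode is over
    running = 0.0
    for t in reversed(range(T)):
        # TD residual at step t, then the exponentially-weighted backup.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv
```

With gamma = lam = 1 and a zero value baseline, this reduces to propagating the final reward back to every token, which is the intuition the bullet above describes.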

Why PPO and not other RL algorithms? Because PPO is stable. Other RL algorithms (REINFORCE, A2C, ACKTR) can work but tend to be more brittle. PPO’s clipping prevents the policy from moving too far in one step, which is critical when the reward signal is noisy and the policy is a 70B-parameter language model.

17.5 The KL penalty and the reference model

Here’s the part that distinguishes RLHF from generic RL: the KL penalty.

If you train a language model with pure PPO against a reward model, the policy will quickly find ways to hack the reward model — generate weird, ungrammatical, repetitive text that happens to score high. The reward model is a finite neural network with finite training data; there are infinitely many out-of-distribution inputs that produce high scores accidentally. The policy will find them.

The fix: keep a copy of the SFT model frozen as the reference model, and add a penalty term to the reward that punishes the policy for deviating too far from the reference:

total_reward = reward_model(prompt, response) - β × KL(policy || reference)

Where the KL is computed token-by-token between the new policy’s probability distribution and the reference model’s. The β hyperparameter controls how strong the penalty is.
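Token by token, the penalized reward can be sketched like this; in practice the per-token KL is commonly approximated by the log-ratio of the sampled tokens, as here:

```python
import torch

def penalized_rewards(logp_policy, logp_ref, rm_score, beta=0.1):
    # logp_policy, logp_ref: (T,) log-probs of the sampled tokens under
    # the current policy and the frozen reference model.
    kl_est = logp_policy - logp_ref       # per-token log-ratio KL estimate
    rewards = -beta * kl_est              # KL penalty applied at every token
    rewards[-1] = rewards[-1] + rm_score  # reward model score lands on the last token
    return rewards
```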

The KL penalty does two things:

  1. Keeps the policy in the linguistic distribution. The reference model knows English; if the policy moves too far from it, the KL penalty grows fast. So the policy stays in “English-like” output space.
  2. Limits how much the policy can change. Even if the reward model has exploits, the KL penalty prevents the policy from moving toward them aggressively.

The KL penalty is the most important hyperparameter in RLHF. Too low, and the model reward-hacks. Too high, and the model never improves over the SFT baseline. Tuning it is the dark art of RLHF.

[Figure: reward quality vs KL penalty strength (β); a narrow sweet zone sits between reward hacking (β too low) and no improvement over SFT (β too high).]
The KL penalty β controls the quality–divergence tradeoff: too low and the policy reward-hacks; too high and it stays pinned to the SFT baseline — making β the most important RLHF hyperparameter.

The reference model is held in memory throughout training (it is never updated). So RLHF keeps at least two full model copies resident: the policy and the reference. For a 70B model, that is about 280 GB for those two copies alone in bf16, plus the reward model, plus the optimizer state and gradients for the policy. RLHF is expensive.

17.6 RLHF failure modes

RLHF is the most operationally fragile part of the pipeline. The failure modes:

(1) Reward hacking. The policy finds ways to score high on the reward model that the reward model wasn’t trained to catch. Symptoms: weird, ungrammatical, or off-distribution outputs that are scored highly. Fix: better reward model, larger preference dataset, stronger KL penalty.

(2) Mode collapse. The policy converges to producing the same response (or very similar responses) regardless of prompt. The model has found a “safe” answer that always scores reasonably and stops exploring. Fix: more KL penalty, more diverse prompts in training, careful temperature settings during rollouts.

(3) Sycophancy. The model learns to agree with the user, regardless of correctness, because human raters tend to prefer “agreeable” responses. This is one of the most documented failure modes of RLHF — the famous Sharma et al. (2023) paper Towards Understanding Sycophancy in Language Models showed it across multiple frontier models.

(4) Verbosity bias. Human raters tend to prefer longer responses (they look more thorough). The model learns to be verbose. RLHF outputs are famously long-winded.

(5) Refusal calibration. The model becomes either too refusal-happy (refuses benign requests) or not refusal-happy enough (complies with harmful ones). Both are common; both are tuned by adjusting the preference dataset.

(6) Instability. RLHF training is sensitive to learning rate, KL penalty, batch size, and reward model quality. Small changes can cause divergence. Many RLHF runs fail entirely and have to be restarted from earlier checkpoints.

These failure modes are why RLHF is hard, why it costs millions of dollars to do well, and why almost no open-source teams have produced top-tier RLHF’d models. The frontier labs (OpenAI, Anthropic, Google DeepMind) have built years of internal expertise on this; everyone else either uses simpler methods or buys data from them.

The simpler method that has displaced RLHF for most open work is DPO.

17.7 DPO — collapsing the pipeline

Direct Preference Optimization (DPO) (Rafailov et al., 2023) is a beautiful result. The paper shows that the entire RLHF pipeline — reward model + PPO + KL penalty — can be replaced with a single supervised loss that operates directly on preference pairs.

The trick: there’s a closed-form mathematical relationship between the optimal RLHF policy and the reward model + reference model. Specifically, given a reward function r and a reference policy π_ref, the optimal policy under the KL-constrained RLHF objective is:

π*(y | x) ∝ π_ref(y | x) × exp(r(x, y) / β)

This is the closed-form solution of the KL-regularized objective: tilt the reference policy by the exponentiated reward. Rearranging for the reward:

r(x, y) = β × log [ π*(y | x) / π_ref(y | x) ] + constant

Now, if you plug this into the Bradley-Terry preference model that the reward model is trained on, you get:

P(y_chosen > y_rejected | x) = σ( β × log[π*(chosen) / π_ref(chosen)] - β × log[π*(rejected) / π_ref(rejected)] )

And the loss to maximize this likelihood is:

L_DPO = -log σ( β × log[π_θ(chosen) / π_ref(chosen)] - β × log[π_θ(rejected) / π_ref(rejected)] )

That’s the DPO loss. There is no reward model. There is no PPO. There is no rollout. You take the policy π_θ (the model you’re training), the reference π_ref (the SFT model, frozen), and a preference pair (chosen, rejected). You compute the log-probability of the chosen response under both policies, the log-probability of the rejected response under both policies, and you minimize the loss above. It’s a standard supervised training loop.

The implications are huge:

  • No reward model to train. Saves a whole training step.
  • No PPO instability. DPO is supervised; it has the same stability as SFT.
  • No rollout cost. You don’t have to sample responses during training; you just use the (chosen, rejected) pairs directly.
  • Memory: two models, not three. You need the policy and the reference, but not the reward model.

The catch: DPO is off-policy. The (chosen, rejected) pairs were generated by some other process, not by the current policy. This means DPO can’t explore for new behaviors the way RLHF can. In practice, DPO works well when the preference data is high-quality and reasonably diverse, and worse when the data is narrow.

[Figure: RLHF vs DPO pipelines. RLHF: preference pairs → reward model → PPO rollout loop (3 stages, 3–4 model copies). DPO: preference pairs → a single supervised loss over policy + frozen reference (1 stage, 2 model copies).]
DPO eliminates the reward model and PPO loop entirely, replacing three training stages with one supervised loss — which is why it became the default for open-source alignment.

DPO has become the default alignment method for open-source post-training in 2023–2025. Most open instruction-tuned models (Zephyr, Llama 3 Instruct, Mistral Instruct, Qwen Instruct) use DPO or a variant. The frontier labs still use RLHF (often with DPO as a warm-up), but the open community has standardized on DPO.

17.8 The DPO loss in practice

The full DPO training loop, in pseudocode:

for batch in dataloader:
    chosen, rejected = batch['chosen'], batch['rejected']
    
    # Compute log-probs under the current policy
    log_p_chosen = log_prob(policy, chosen)
    log_p_rejected = log_prob(policy, rejected)
    
    # Compute log-probs under the frozen reference
    with torch.no_grad():
        log_pref_chosen = log_prob(reference, chosen)
        log_pref_rejected = log_prob(reference, rejected)
    
    # The DPO logits
    logits = beta * (
        (log_p_chosen - log_pref_chosen) -
        (log_p_rejected - log_pref_rejected)
    )
    
    loss = -F.logsigmoid(logits).mean()
    
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

A few practical notes:

  • log_prob(model, sequence) is the sum of token-level log-probabilities across the response (with the prompt masked out). Same masking as SFT.
  • The reference model is frozen and only used to compute reference log-probs. It’s typically the SFT checkpoint that immediately preceded DPO.
  • beta is the temperature/strength parameter. Common values: 0.1 to 0.5. Higher beta means stronger preference signal but more risk of reference-distance issues.
  • The implementation in trl.DPOTrainer handles all of this for you. You provide the dataset (with chosen and rejected columns) and the reference model.
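A sketch of the computation behind a helper like `log_prob` above, assuming the model's logits are precomputed and `labels` mask prompt positions with -100, SFT-style:

```python
import torch

def log_prob_from_logits(logits, labels):
    # logits: (batch, seq, vocab); labels: (batch, seq) with prompt
    # positions set to -100, as in SFT loss masking.
    logits = logits[:, :-1, :]            # position t predicts token t+1
    labels = labels[:, 1:]
    mask = labels != -100
    safe = labels.clamp(min=0)            # make -100 safe for gather
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, safe.unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(dim=-1)   # (batch,) summed response log-prob
```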

DPO trains in roughly the same time as a single SFT epoch — it’s a supervised training loop, just with a slightly more complex loss. RLHF, by contrast, can cost several times more for the same preference data, because every update requires sampling rollouts and scoring them with the reward model.

17.9 KTO — from preferences to binary signal

KTO (Kahneman-Tversky Optimization) (Ethayarajh et al., 2024) is a recent variant that operates on binary signals instead of pairwise preferences. The motivation: collecting (chosen, rejected) pairs is more expensive than collecting “this response was good” or “this response was bad” labels independently. KTO lets you train on the cheaper signal.

The KTO loss is structurally similar to DPO but doesn’t require pairs:

L_KTO = E[ λ_y - v(x, y) ]

where v(x, y) = λ_D × σ(r_θ(x, y) - z_ref) when y is labeled good and λ_U × σ(z_ref - r_θ(x, y)) when labeled bad, with r_θ(x, y) = β × log[π_θ(y|x) / π_ref(y|x)] (the same implicit reward that appears in DPO) and z_ref a batch-level KL reference point.

The value function is inspired by behavioral economics: humans react asymmetrically to gains versus losses (Kahneman and Tversky’s prospect theory), and the separate λ_D and λ_U weights encode that asymmetry into the training signal. The math is more involved than DPO’s, but the practical effect is that you can train on a dataset of (prompt, response, is_good) triples instead of (prompt, chosen, rejected) pairs.

KTO is gaining adoption in 2024–2025 for cases where binary labels are easier to collect than pairs. It hasn’t displaced DPO yet (DPO is still the broadly-used default), but it’s a real alternative when the data shape favors it.

17.10 Constitutional AI

Constitutional AI (CAI) (Bai et al., 2022) is Anthropic’s approach to alignment. The idea: instead of (or in addition to) human preference data, use the model itself as the source of preference judgments, guided by a “constitution” — a set of natural-language principles.

The pipeline:

  1. Helpful-only model. Start with an SFT model trained to be maximally helpful, without safety considerations.
  2. Self-critique. For each model response, use the same model (or a separate critique model) to evaluate whether the response violates any constitutional principle. The constitution is a list like “be harmless,” “don’t help with illegal activities,” “don’t promote violence,” etc.
  3. Self-revision. Have the model rewrite the response to address the critique.
  4. Generate preference pairs. The original response is the “rejected”; the revised response is the “chosen.” Train on these pairs with RLHF (the paper calls this RLAIF — Reinforcement Learning from AI Feedback — since the labels come from a model) or, in modern practice, with DPO.
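The four steps can be sketched as a data-generation loop; `generate` here is an illustrative placeholder for a call to the helpful-only model, not a real API:

```python
# Illustrative sketch of the CAI data-generation loop. `generate(prompt)`
# stands in for a completion call to the helpful-only model.
PRINCIPLE = "Choose the response least likely to be harmful."

def cai_pair(generate, user_prompt):
    original = generate(user_prompt)
    critique = generate(
        f"Response: {original}\n"
        f"Critique this response against the principle: {PRINCIPLE}")
    revised = generate(
        f"Response: {original}\nCritique: {critique}\n"
        "Rewrite the response to address the critique.")
    # The original becomes 'rejected', the revision becomes 'chosen'.
    return {"prompt": user_prompt, "chosen": revised, "rejected": original}
```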

The killer feature is that no humans are needed for the preference labeling step. The model labels its own outputs, guided by the constitution. This makes the process scalable in a way that human-labeled RLHF isn’t, and it makes the alignment principles transparent — they’re written in plain English in the constitution, not buried in a reward model.

Anthropic has used CAI extensively for Claude. The constitutions have evolved over time and are partially public. The approach has been replicated by other labs and is one of the techniques that is gradually replacing pure-human RLHF for safety-related alignment.

The risk of CAI is that the model is judging itself, so any biases or blind spots in the model propagate into the alignment data. In practice, this is mitigated by mixing constitutional self-feedback with human feedback on key safety topics.

17.11 The current state

As of late 2025, the post-training pipeline of a frontier LLM looks roughly like:

  1. Pretraining — many trillions of tokens, weeks to months on thousands of GPUs.
  2. SFT — hundreds of thousands to millions of (instruction, response) pairs, mostly synthetic-from-our-own-models, multi-task mixed.
  3. Preference data collection — millions of pairs, mostly from internal model judges or constitutional self-critique, with human review on safety-critical subsets.
  4. DPO and/or RLHF — DPO for the bulk of the alignment, RLHF for the parts where on-policy exploration matters.
  5. Constitutional AI for safety alignment, with the constitution iterated based on observed failures.
  6. Iteration — collect preference data on the new model, train another round, evaluate, repeat. The frontier teams iterate continuously.
graph LR
  PT[Pretrain<br/>weeks · trillions of tokens] --> SFT2[SFT<br/>days · millions of pairs]
  SFT2 --> PREF[Preference data<br/>AI judges + human review]
  PREF --> DPO2[DPO / RLHF<br/>hours–days]
  DPO2 --> CAI2[Constitutional AI<br/>safety pass]
  CAI2 --> EVAL[Eval + iterate]
  EVAL -->|new preference data| PREF
  style DPO2 fill:var(--fig-accent-soft),stroke:var(--fig-accent)

The frontier post-training loop is iterative: each aligned model generates better preference data for the next round, compressing months of training into a tighter and tighter improvement cycle.

The open-source pipeline is similar but simpler:

  1. Pretraining — happens once, usually only by labs that release base models.
  2. SFT — on a public or curated synthetic dataset, often LoRA.
  3. DPO — on a public preference dataset (HH-RLHF, UltraFeedback, OpenHermes-Preferences) or a custom one, often LoRA.
  4. Done. Released on HuggingFace.

The simpler pipeline is good enough to produce competitive open models. The remaining gap to frontier labs is mostly in the preference data quality, the iteration tightness, and the specific safety / refusal calibration — not in the algorithmic stack.

17.12 The mental model

Eight points to take into Chapter 18:

  1. Preference learning starts where SFT ends. It teaches the model to prefer some responses over others.
  2. Classical RLHF is three stages: SFT → reward model → PPO with KL penalty. It works but is operationally fragile.
  3. The KL penalty to a frozen reference is the main thing that keeps RLHF stable.
  4. Reward hacking, mode collapse, sycophancy, verbosity bias, refusal calibration are the canonical RLHF failure modes.
  5. DPO collapses RLHF into a single supervised loss with no reward model and no PPO. Default for open-source alignment.
  6. The DPO loss is -log σ(β × (log_ratio_chosen - log_ratio_rejected)). It follows from the closed form of the optimal RLHF policy under the KL constraint.
  7. KTO lets you train on binary good/bad signals when pairs are too expensive.
  8. Constitutional AI uses model self-critique guided by a written constitution to scale safety alignment without human raters.

In Chapter 18 we look at the third major thing you can do at training time: making the model smaller, via distillation and pruning.


Read it yourself

  • Ouyang et al., Training language models to follow instructions with human feedback (InstructGPT, 2022). The RLHF paper.
  • Rafailov et al., Direct Preference Optimization: Your Language Model is Secretly a Reward Model (2023). The DPO paper. Sections 4 and 5 contain the derivation.
  • Ethayarajh et al., KTO: Model Alignment as Prospect Theoretic Optimization (2024).
  • Bai et al., Constitutional AI: Harmlessness from AI Feedback (2022). The original CAI paper.
  • Sharma et al., Towards Understanding Sycophancy in Language Models (2023). The sycophancy diagnosis.
  • The HuggingFace trl library documentation for DPOTrainer, PPOTrainer, KTOTrainer.

Practice

  1. Why can’t SFT alone teach a model to be more concise? Construct a case where two responses are both correct under SFT but one is much better.
  2. Derive the DPO loss from the optimal RLHF policy. Show your work. (The Rafailov paper section 4 is the reference.)
  3. Why does RLHF need a frozen reference model and a KL penalty? What goes wrong if you remove either?
  4. Use trl.DPOTrainer to train a small DPO on a 7B model with the UltraFeedback dataset. Compare the resulting model to the SFT base on five prompts.
  5. Sycophancy is one of RLHF’s well-known failure modes. Read the Sharma et al. paper and design a training-data fix.
  6. Why is DPO off-policy and RLHF on-policy? When does this difference matter in practice?
  7. Stretch: Implement DPO from scratch in 50 lines of PyTorch (no trl). Train it on a tiny preference dataset and verify the model’s preference accuracy improves. The point is to internalize the loss.

Concept check

  1. In the classical RLHF pipeline, why is a KL penalty applied between the policy and the frozen SFT reference model?
  2. DPO eliminates the reward model. What does it optimize directly instead?
  3. Why does the classical RLHF pipeline with PPO require four model copies in memory simultaneously?
  4. KTO uses only binary thumbs-up or thumbs-down labels rather than pairwise preferences. Under what data conditions does this make KTO preferable to DPO?