Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

Direct Preference Optimization (DPO) aligns language models with human preferences using a simple classification loss instead of reinforcement learning.

How can we align language models with human preferences without the complexity and instability of reinforcement learning?

Reinforcement learning from human feedback (RLHF) is the standard for aligning language models, but it requires training a separate reward model and using complex, unstable reinforcement learning algorithms to optimize the policy. Direct Preference Optimization (DPO) bypasses this by identifying a mathematical mapping that allows the optimal policy to be extracted directly from the reward function. This enables training the model using a simple binary cross-entropy loss on preference data, eliminating the need for explicit reward modeling or reinforcement learning loops. DPO matches or exceeds the performance of PPO-based RLHF across sentiment, summarization, and dialogue tasks while being significantly more stable and computationally efficient.

Paper Primer

Existing RLHF pipelines are notoriously difficult to tune, requiring a multi-stage process: fitting a reward model to human preferences, then using reinforcement learning (typically Proximal Policy Optimization) to maximize that reward while keeping the model close to its original state. DPO simplifies this by re-parameterizing the reward function in terms of the optimal policy itself, turning a complex RL problem into a straightforward supervised classification task.

DPO achieves a more efficient reward-versus-KL-divergence frontier than PPO.

In controlled sentiment generation experiments, DPO strictly dominates PPO, even when PPO is provided with ground-truth rewards. DPO consistently reaches higher reward levels while maintaining lower KL-divergence from the reference model.

Why does DPO avoid the instability associated with standard RLHF?

Standard RLHF relies on an actor-critic framework where the policy gradient can have high variance unless the reward is normalized, which in practice means learning a value function or estimating a baseline. DPO’s re-parameterization eliminates the need for a learned value function or Monte-Carlo reward baselines, resulting in a stable, single-stage optimization.

Does DPO require a specific type of preference data?

No; DPO uses the same offline datasets of human preferences (pairs of preferred and dispreferred completions) as standard RLHF methods, making it a drop-in replacement for existing pipelines.

Researchers can now replace complex, unstable RLHF pipelines with a single-stage classification objective, significantly lowering the barrier to training aligned language models.

Introduction to DPO

We expose why the multi‑stage RLHF pipeline is unstable and set up Direct Preference Optimization.

Reinforcement Learning from Human Feedback (RLHF) steers language models by first fitting a reward model to preference data and then using reinforcement learning to tune the policy. This two‑stage pipeline is notoriously complex and often unstable, because errors in the reward model or the RL loop can cause the policy to drift or collapse. Direct Preference Optimization (DPO) sidesteps these issues by optimizing the policy directly with a simple classification loss.

RLHF first learns a reward model that predicts human preference scores, then uses that model to guide a policy‑learning phase via reinforcement learning.

A reward model is a classifier that assigns higher scores to responses humans prefer, serving as a proxy for human judgment during policy training.

Generate $B=32$ responses, each of length $L=64$.

Compute $32$ attention matrices, totaling $512\text{ KB}$ of memory.

Run $K=4$ PPO gradient updates on the batch, each requiring a forward and backward pass through the attention layers.

Because memory and compute scale with both sequence length and the number of PPO updates, RLHF becomes prohibitively expensive for long contexts or large batch sizes.

**Figure 1.** DPO optimizes for human preferences while avoiding reinforcement learning. Existing methods for fine-tuning language models with human feedback first fit a reward model to a dataset of prompts and human preferences over pairs of responses, and then use RL to find a policy that maximizes the learned reward. In contrast, DPO directly optimizes for the policy best satisfying the preferences with a simple classification objective, fitting an implicit reward model whose corresponding optimal policy can be extracted in closed form.

The instability of RLHF stems from the multi‑stage training process.

Background and Baselines

We situate DPO among prior RL‑based and preference‑learning approaches and recap the RLHF pipeline.

Prior work converges on two strands: reinforcement‑learning methods that optimize a language model against a learned reward, and preference‑learning techniques that infer preferences from human judgments.

Instruction tuning fine‑tunes a pre‑trained model on a corpus of instructions and human‑written completions, enabling zero‑shot generalization to unseen prompts.

Fine‑tuning from human preferences uses datasets of human‑ranked completions to fine‑tune language models, improving qualities such as translation fidelity, summarization coherence, and story‑telling style.

The Bradley‑Terry model is a probabilistic model that maps a latent utility score to the probability that one item is preferred over another.

REINFORCE is a Monte‑Carlo policy‑gradient algorithm that updates a policy by scaling log‑probability gradients with observed returns.

A contextual dueling bandit (CDB) is a bandit setting where the learner receives pairwise preference feedback instead of scalar rewards.

A von Neumann winner is a policy whose expected win rate against any other policy is at least 50% in the CDB setting.

Preference‑based RL (PbRL) learns a policy directly from binary preference feedback, typically by first estimating a latent scoring function.

The RLHF pipeline proceeds in three stages: (1) supervised fine‑tuning (SFT) of a pretrained model, (2) reward‑model learning from human‑ranked pairs, and (3) RL fine‑tuning of the policy against the learned reward with a KL‑penalty to stay close to the reference policy.

PPO updates a policy by taking a conservative step: it maximizes the expected advantage while clipping the probability ratio to stay near the old policy.
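As a rough illustration of the clipping idea, the sketch below evaluates the standard PPO clipped surrogate for a single sample. The function name `ppo_clipped_objective` and the clip range `eps = 0.2` are illustrative assumptions, not values taken from this summary.

```python
def ppo_clipped_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Clipped surrogate for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# A large ratio with a positive advantage is capped at (1 + eps) * A.
capped = ppo_clipped_objective(ratio=1.5, advantage=2.0)
# A ratio inside the clip range is left untouched.
unclipped = ppo_clipped_objective(ratio=1.1, advantage=2.0)
```

Taking the minimum means the objective never rewards moving the probability ratio further outside the clip range, which is what keeps each step conservative.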

The DPO Method

Derive a simple, closed‑form loss that lets a language model learn from preferences without RL.

Fine‑tuning a language model with reinforcement learning scales poorly: the reward model must be trained, then a separate RL loop updates the policy, which is both compute‑heavy and fragile.

DPO skips the explicit reward model and directly adjusts the policy by treating the policy itself as an implicit reward estimator.

Take $\beta=1$, $\pi_{\theta}(y_{w})=0.3$, $\pi_{\theta}(y_{l})=0.7$, $\pi_{\text{ref}}(y_{w})=0.4$, $\pi_{\text{ref}}(y_{l})=0.6$, and compute the log‑ratios: $\log\frac{\pi_{\theta}(y_{w})}{\pi_{\text{ref}}(y_{w})}= \log\frac{0.3}{0.4}= -0.288$, $\log\frac{\pi_{\theta}(y_{l})}{\pi_{\text{ref}}(y_{l})}= \log\frac{0.7}{0.6}= 0.154$.

Form the difference: $-0.288 - 0.154 = -0.442$.

Apply sigmoid: $\sigma(-0.442)=0.391$.

Take negative log: $-\log(0.391)=0.94$; this is the contribution of this pair to $L_{\text{DPO}}$.

Gradient step (simplified): increase $\pi_{\theta}(y_{w})$ and decrease $\pi_{\theta}(y_{l})$ in proportion to the weight $\sigma(0.442)=1-0.391=0.609$, which is large because the implicit reward currently ranks the pair incorrectly.

The loss penalizes the policy when it assigns relatively lower probability to the preferred completion than the reference does, and the penalty magnitude is modulated by how far the policy deviates from the reference.
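The arithmetic in the worked example above can be reproduced with a few lines of plain Python. This is a minimal sketch, assuming $\beta = 1$ and whole-sequence probabilities as inputs; the function name `dpo_pair_loss` is hypothetical.

```python
import math


def dpo_pair_loss(pi_w, pi_l, ref_w, ref_l, beta=1.0):
    """DPO loss for one preference pair, given sequence probabilities."""
    # Implicit rewards: beta * log(policy / reference) for each completion.
    reward_w = beta * math.log(pi_w / ref_w)
    reward_l = beta * math.log(pi_l / ref_l)
    margin = reward_w - reward_l
    prob_correct = 1.0 / (1.0 + math.exp(-margin))  # sigmoid(margin)
    return -math.log(prob_correct), prob_correct


loss, prob_correct = dpo_pair_loss(pi_w=0.3, pi_l=0.7, ref_w=0.4, ref_l=0.6)
print(round(loss, 3), round(prob_correct, 3))  # 0.938 0.391
```

In practice the inputs would be summed token log-probabilities rather than raw probabilities, since sequence probabilities underflow quickly.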

How does DPO differ from the usual RLHF pipeline that first learns a reward model?

In RLHF the reward model is a separate neural network trained on preferences; DPO treats the policy’s own log‑ratio to a fixed reference as the reward, eliminating the extra model and the RL loop.

The KL term limits how far the optimized policy can drift from a trusted reference, preventing catastrophic forgetting of the base language modeling ability.

Why not set $\beta$ to a very small value and ignore the KL‑constraint altogether?

Without the constraint the policy can overfit the limited preference set, leading to degenerate language (e.g., repeating the preferred token) and loss of general language modeling quality.

The Bradley‑Terry model converts a pairwise reward difference into a probability that a human prefers one completion over another.

Is the Bradley‑Terry model just a logistic regression on reward differences?

Conceptually yes, but in DPO the “features” are the log‑ratios of policy probabilities to a reference, not raw reward scores. This re‑parameterization is what removes the need for a separate reward network.

Sample two completions $y_{1},y_{2}\sim\pi_{\text{ref}}(\cdot\mid x)$ for each prompt $x$.

Collect human preference labels to form the offline dataset $D=\{(x,y_{w},y_{l})\}$.

Initialize $\pi_{\theta}$ (often from $\pi_{\text{ref}}$) and set a temperature $\beta$.

Minimize $L_{\text{DPO}}$ on $D$ using stochastic gradient descent.

Optionally fine‑tune $\beta$ or the reference policy if a better anchor becomes available.
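The steps above can be sketched on a toy problem. The example below is an illustration under strong assumptions, not the paper's training setup: one prompt, two candidate completions, a two-way softmax policy, a uniform frozen reference, plain gradient descent on a single preference pair, and made-up values for `beta` and `lr`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax2(logit_w, logit_l):
    """Probabilities of the preferred and dispreferred completion."""
    m = max(logit_w, logit_l)
    e_w, e_l = math.exp(logit_w - m), math.exp(logit_l - m)
    return e_w / (e_w + e_l), e_l / (e_w + e_l)


beta, lr = 0.1, 1.0          # illustrative values, not the paper's settings
ref_w, ref_l = 0.5, 0.5      # frozen reference policy
logit_w, logit_l = 0.0, 0.0  # policy initialised to match the reference

losses = []
for step in range(50):
    pi_w, pi_l = softmax2(logit_w, logit_l)
    margin = beta * (math.log(pi_w / ref_w) - math.log(pi_l / ref_l))
    losses.append(-math.log(sigmoid(margin)))
    weight = sigmoid(-margin)  # large when the pair is ranked poorly
    # For a two-way softmax, log pi_w - log pi_l = logit_w - logit_l,
    # so the loss gradient is -beta * weight for logit_w and +beta * weight for logit_l.
    logit_w += lr * beta * weight
    logit_l -= lr * beta * weight

final_pi_w, final_pi_l = softmax2(logit_w, logit_l)
```

The loss starts at $\log 2$ (policy equal to the reference) and decreases as probability mass moves toward the preferred completion; the shrinking `weight` slows the updates as the pair becomes correctly ranked.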

Gradient of the DPO loss for a single preference pair.
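For reference, the per-pair gradient can be written as follows, where $\hat r_{\theta}(x,y)=\beta\log\frac{\pi_{\theta}(y\mid x)}{\pi_{\text{ref}}(y\mid x)}$ is shorthand (introduced here for brevity) for the implicit reward:

$$
\nabla_{\theta} L_{\text{DPO}} = -\beta\,\sigma\big(\hat r_{\theta}(x,y_{l})-\hat r_{\theta}(x,y_{w})\big)\Big[\nabla_{\theta}\log\pi_{\theta}(y_{w}\mid x)-\nabla_{\theta}\log\pi_{\theta}(y_{l}\mid x)\Big]
$$

The bracketed term raises the likelihood of $y_{w}$ and lowers that of $y_{l}$; the sigmoid weight is larger when the implicit reward ranks the pair incorrectly. Over the dataset $D$ the full gradient is the expectation of this expression.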

**Table 3.** Unlikelihood samples from TL;DR prompts sampled at temperature 1.0. In general, we find unlikelihood fails to generate meaningful responses for more complex problems such as summarization and dialogue.

Theoretical Foundations

We derive the optimal policy directly from preferences and expose why actor‑critic methods can be unstable.

Standard RLHF pipelines must train a separate reward model and then run an RL algorithm such as PPO. DPO sidesteps both steps by re‑parameterizing the reward so that the optimal policy becomes analytically tractable.

The optimal policy can be written as a softmax over a reward that is itself a log‑ratio between the policy and a fixed reference policy.

With $\beta=1$, a uniform reference $\pi_{\text{ref}}(y_i|x)=0.5$, and rewards $r(x,y_1)=\log 2$, $r(x,y_2)=0$, compute the unnormalized policy weights: $\exp(r(x,y_1))=2$, $\exp(r(x,y_2))=1$.

Normalize (the uniform reference cancels): $\pi(y_1|x)=\frac{2}{2+1}= \tfrac{2}{3}$, $\pi(y_2|x)=\frac{1}{3}$.

Form the reparameterized reward: $r'(x,y_1)=\log\frac{\tfrac{2}{3}}{0.5}= \log\frac{4}{3}\approx 0.288$, $r'(x,y_2)=\log\frac{\tfrac{1}{3}}{0.5}= \log\frac{2}{3}\approx -0.405$.

Check the equivalence condition: $r'(x,y_i)-r(x,y_i)$ equals the same constant $f(x)=-\log Z(x)=\log\frac{2}{3}\approx -0.405$ for both $i$ (with $Z(x)=0.5\cdot 2+0.5\cdot 1=\tfrac{3}{2}$), confirming they lie in the same class.

The reparameterized reward reproduces the same optimal policy while differing from the original reward only by an $x$‑only offset, illustrating the equivalence class concept.
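The example can be checked numerically. The sketch below assumes the values implied by the numbers above ($\beta = 1$, a uniform reference of 0.5, and rewards $\log 2$ and $0$); the dictionary keys `y1` and `y2` are just labels.

```python
import math

beta = 1.0
ref = {"y1": 0.5, "y2": 0.5}               # uniform reference policy
reward = {"y1": math.log(2.0), "y2": 0.0}  # chosen so exp(r) = 2 and 1

# Optimal KL-constrained policy: pi*(y) is proportional to ref(y) * exp(r(y) / beta).
unnorm = {y: ref[y] * math.exp(reward[y] / beta) for y in ref}
Z = sum(unnorm.values())
policy = {y: unnorm[y] / Z for y in ref}

# Reparameterized reward r'(y) = beta * log(pi*(y) / ref(y)).
reparam = {y: beta * math.log(policy[y] / ref[y]) for y in ref}

# The offset r' - r should be identical for every completion of this prompt.
offsets = {y: reparam[y] - reward[y] for y in ref}
offsets_agree = abs(offsets["y1"] - offsets["y2"]) < 1e-12
```

Because the offset depends only on the prompt, both rewards induce the same preference probabilities and the same optimal policy.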

Under mild assumptions, every reward class consistent with the Plackett‑Luce (Bradley‑Terry) model can be expressed as $r(x, y)=\beta\log\frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)}$ for some policy $\pi$ and reference $\pi_{\text{ref}}$.

Start from any reward $r(x,y)$ and its induced optimal policy $\pi_r(y|x)$ (Eq. 4).

Define the projection operator $f$ that subtracts the log‑partition term scaled by $\beta$.

Replace $r$ with the expression from Eq. 5 (the reward written in terms of its optimal policy) inside $f$.

Thus $f(r; \pi_{\text{ref}}, \beta)$ is a reward in the same equivalence class as $r$ and has the desired log‑ratio form.

Actor‑critic methods such as PPO introduce a learned value baseline to reduce gradient variance, but the baseline itself can be noisy and destabilize training. In contrast, the DPO reparameterization embeds the normalizing term analytically, eliminating the need for an external baseline.

Empirical Results

We compare DPO against PPO‑based RLHF on sentiment generation, summarization, and dialogue.

DPO matches or exceeds PPO performance while completely avoiding a reinforcement‑learning loop.

Across sentiment generation, summarization, and dialogue, DPO’s win rates are 4–10 percentage points higher than PPO’s best‑tuned runs.

**Table 1.** GPT-4 win rates vs. ground truth summaries for out-of-distribution CNN/DailyMail input articles.

**Table 2.** Comparing human and GPT-4 win rates and per-judgment agreement on TL;DR summarization samples. Humans agree with GPT-4 about as much as they agree with each other. Each experiment compares a summary from the stated method with a summary from PPO with temperature 0.


Supplementary Empirical Results

Additional empirical results: best‑of‑N performance, DPO vs PPO judgments, and human study details.

We present three groups of additional results: the strength of a Best‑of‑N sampling baseline, side‑by‑side DPO versus PPO (and GT) judgments, and details of our human preference study.

| Category | Content |
| :--- | :--- |
| **Prompt** | SUBREDDIT: r/AskReddit TITLE: I’ve been ungrateful to parents who have wanted only the best for me for the longest time. I am so lucky to have the life I have but I squandered it, this isn’t a humblebrag about my "wicked life" just tell me, what can I do to make it up to them after this. POST: I’m 17, leaving for University, and I’ve had epiphany after epiphany of how shitty i’ve beent to them, or how all the nice things they do I ruin with moods etc., I know there isn’t a quick fix, but I want to do something nice. I’ve always thanked them and apologized where it is due (in earlier years not so much) but as they say, actions speak louder than words, so I was wondering what I can do now, before I go, to just make them happy. What can I keep doing, I know all the little things, doing chores, keeping my room clean, cleaning up after myself, helping around the house, but what’s a big thing, for say christmas, or their birthdays, that I can do to make them happy? |
| **DPO** | TL;DR: I’ve been a shitty child and I want to make up for it, what can I do for my parents on their birthdays and christmas to show them how much I love them? |
| **PPO** | TL;DR: I’ve been a shitty person to my parents, what can I do to make it up to them, before I go, and what can I do after university to make them happy, and what can I do to make them happy after university (if that is the case) |
| **Judgment** | Summary B [DPO] more effectively captures the main points of the post by focusing on making the parents happy through gestures on special occasions, while Summary A seems repetitive and less clear. |

| Category | Content |
| :--- | :--- |
| **Prompt** | SUBREDDIT: r/relationships TITLE: I [22 M] am about to propose to my girlfriend [22F] of almost 2 years, and I can't calm myself down POST: Alright, so I am preparing to propose to my girlfriend of almost 2 years (we were friends for double that beforehand) in a couple days and I am starting to get EXTREMELY anxious/nervous. We have talked about this for months, and I know she will say yes, but I can't seem to relax. The other day I bought the ring, which I fell in love with myself and know she will too. And then two days ago, I asked her dad's blessing. Everything has gone great so far, but now leading up to it I can't think about anything else. I can't sleep. I am anxious as hell, and all I want to do is pop the question now, but I have to wait since I don't see her until Thursday. I know how I am going to do it. I have thoroughly planned it out and I am more than prepared. But, at the same time, I am freaking out. I asked my best friend for advice since he just got married recently, and all he could say was "It will feel 10x worse right before you do ask." Any suggestions on how to relax leading up to it? I would like to be able to sleep... |
| **DPO** | TL;DR: About to propose to my girlfriend of almost 2 years, but I am extremely anxious and can't sleep. How can I relax? |
| **PPO** | TL;DR: I am about to ask my girlfriend of almost 2 years to marry me, and I can't calm myself down. Any tips on how to relax? I would like to be able to sleep tonight (and the night before). |
| **Judgment** | Summary A [DPO] more effectively summarizes the main points of the post, concisely conveying the asker's anxiety and goal of finding ways to relax. |

The table presents a comparison between two summarization methods, DPO and PPO, based on a provided Reddit prompt. The table includes four rows: Prompt, DPO, PPO, and Judgment. The DPO summary is evaluated as more accurate in capturing the user's intent regarding low-calorie pasta alternatives.

**Table 7.** GPT-4 chooses DPO over GT. Sample responses to a prompt from the Anthropic-HH test set. DPO sample generated with temperature 0.7; GT is the chosen completion in the dataset of preferences. For clarity, post-hoc annotations are included in bold, formatted as [annotation]. These annotations are not part of the model generations.

**Table 8.** GPT-4 chooses DPO over GT. Sample responses to a prompt from the Anthropic-HH test set. DPO sample generated with temperature 1.0; GT is the chosen completion in the dataset of preferences. For clarity, post-hoc annotations are included in bold, formatted as [annotation]. These annotations are not part of the model generations.

**Table 9.** GPT-4 chooses GT over DPO. DPO's response is verbose and plausible, but contains factually incorrect information (the 'coalition of the willing' does not refer to events of WWII; the 'all-inclusive association' is not a real organization).

Limitations and Discussion

We discuss DPO’s practical advantages and outline key limitations and future research directions.

Standard RLHF first learns a reward model and then applies RL, typically PPO, to train a policy. DPO proves the optimal policy can be written directly from the reward, allowing direct training with a cross‑entropy loss.

This eliminates the need for a separate reward model and RL machinery, reducing hyper‑parameter tuning while achieving comparable or better performance than PPO‑based RLHF.

A primary open question is how DPO policies generalize out‑of‑distribution compared with policies trained on an explicit reward function.

Our preliminary experiments suggest DPO generalizes similarly to PPO‑based models, but a systematic study is required.

Another avenue is whether self‑labeling from the DPO policy can effectively exploit unlabeled prompts.

We also observe a modest performance dip in the right panel of Figure 3, raising the question of reward over‑optimization in the DPO setting.

While we evaluate models up to 6 B parameters, extending DPO to state‑of‑the‑art scales remains an exciting direction.

The win‑rate scores derived from GPT‑4 vary with the prompting strategy, motivating research into more reliable automated judgment protocols.

Finally, DPO could be applied to train generative models in other modalities beyond text, such as images or audio.

Mathematical Derivations I

Derives the optimal KL‑constrained policy and the DPO loss for common preference models.

We start from the KL‑constrained reward maximization problem and rewrite it as a minimization over the same expression.

Writing the KL term as an expected log‑ratio and dividing the objective by $-\beta$ turns the maximization into a minimization over a single expectation.

The partition function $Z(x)$ collects the normalizing constant that depends only on the reference policy and the reward.

Using $Z(x)$ we can write the optimal policy in closed form.

Plugging $\pi^{*}$ back into the objective yields a KL term plus a constant $-\log Z(x)$.

Since $Z(x)$ does not involve $\pi$, minimizing the KL term forces $\pi$ to equal $\pi^{*}$, giving the optimal solution.
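The steps described above can be written out as follows. This is a reconstruction in the notation of this section, with $D$ denoting the prompt distribution:

$$
\begin{aligned}
&\max_{\pi}\ \mathbb{E}_{x\sim D,\,y\sim\pi(y\mid x)}\big[r(x,y)\big]-\beta\,\mathbb{D}_{\text{KL}}\big[\pi(y\mid x)\,\|\,\pi_{\text{ref}}(y\mid x)\big] \\
\Longleftrightarrow\ &\min_{\pi}\ \mathbb{E}_{x\sim D}\,\mathbb{E}_{y\sim\pi(y\mid x)}\left[\log\frac{\pi(y\mid x)}{\pi_{\text{ref}}(y\mid x)}-\frac{1}{\beta}r(x,y)\right] \\
=\ &\min_{\pi}\ \mathbb{E}_{x\sim D}\,\mathbb{E}_{y\sim\pi(y\mid x)}\left[\log\frac{\pi(y\mid x)}{\frac{1}{Z(x)}\pi_{\text{ref}}(y\mid x)\exp\!\big(\tfrac{1}{\beta}r(x,y)\big)}-\log Z(x)\right]
\end{aligned}
$$

with the partition function and optimal policy

$$
Z(x)=\sum_{y}\pi_{\text{ref}}(y\mid x)\exp\!\Big(\tfrac{1}{\beta}r(x,y)\Big),\qquad
\pi^{*}(y\mid x)=\frac{1}{Z(x)}\,\pi_{\text{ref}}(y\mid x)\exp\!\Big(\tfrac{1}{\beta}r(x,y)\Big),
$$

so that the objective becomes

$$
\min_{\pi}\ \mathbb{E}_{x\sim D}\Big[\mathbb{D}_{\text{KL}}\big(\pi(y\mid x)\,\|\,\pi^{*}(y\mid x)\big)-\log Z(x)\Big].
$$

The KL divergence is zero only when the two distributions coincide, which gives $\pi=\pi^{*}$.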

Under the Bradley‑Terry model, the probability that response $y_{1}$ is preferred over $y_{2}$ follows a logistic form.

We can express the latent reward $r^{*}$ in terms of the optimal policy derived above.

Substituting $r^{*}$ into the Bradley‑Terry expression and cancelling the constant yields a sigmoid over log‑ratios of policies.
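Written out, the three statements above read:

$$
p^{*}(y_{1}\succ y_{2}\mid x)=\frac{\exp\big(r^{*}(x,y_{1})\big)}{\exp\big(r^{*}(x,y_{1})\big)+\exp\big(r^{*}(x,y_{2})\big)}=\sigma\big(r^{*}(x,y_{1})-r^{*}(x,y_{2})\big)
$$

$$
r^{*}(x,y)=\beta\log\frac{\pi^{*}(y\mid x)}{\pi_{\text{ref}}(y\mid x)}+\beta\log Z(x)
$$

$$
p^{*}(y_{1}\succ y_{2}\mid x)=\sigma\!\left(\beta\log\frac{\pi^{*}(y_{1}\mid x)}{\pi_{\text{ref}}(y_{1}\mid x)}-\beta\log\frac{\pi^{*}(y_{2}\mid x)}{\pi_{\text{ref}}(y_{2}\mid x)}\right)
$$

The $\beta\log Z(x)$ term appears in both rewards and cancels in the difference, which is why the intractable partition function never needs to be computed.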

The Plackett‑Luce model generalizes Bradley‑Terry to full rankings of $K$ candidates.

When $K=2$ the expression collapses to the Bradley‑Terry form; substituting the reward parameterization eliminates $Z(x)$.

Given a dataset of prompts with user rankings, the DPO loss is the negative log‑likelihood of the above model.
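In symbols, with $\tau$ a ranking of the $K$ candidates $y_{1},\dots,y_{K}$ (so that $y_{\tau(k)}$ is the candidate in position $k$), the Plackett‑Luce likelihood and the resulting loss can be sketched as:

$$
p^{*}(\tau\mid y_{1},\dots,y_{K},x)=\prod_{k=1}^{K}\frac{\exp\big(r^{*}(x,y_{\tau(k)})\big)}{\sum_{j=k}^{K}\exp\big(r^{*}(x,y_{\tau(j)})\big)}
$$

$$
L_{\text{DPO}}(\pi_{\theta},\pi_{\text{ref}})=-\mathbb{E}_{(\tau,\,y_{1},\dots,y_{K},\,x)\sim D}\left[\log\prod_{k=1}^{K}\frac{\exp\!\Big(\beta\log\frac{\pi_{\theta}(y_{\tau(k)}\mid x)}{\pi_{\text{ref}}(y_{\tau(k)}\mid x)}\Big)}{\sum_{j=k}^{K}\exp\!\Big(\beta\log\frac{\pi_{\theta}(y_{\tau(j)}\mid x)}{\pi_{\text{ref}}(y_{\tau(j)}\mid x)}\Big)}\right]
$$

As in the pairwise case, the $\beta\log Z(x)$ offset is common to the numerator and every term of the denominator, so it drops out.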

Mathematical Derivations II

Derives the DPO gradient and proves the supporting lemmas and theorem.

We now compute the gradient of the DPO loss and use the result to establish two lemmas and the main theorem that justify the reparameterization of reward functions.

Introducing a reward substitution further streamlines the expression.

Under the Plackett–Luce (Bradley–Terry) framework, two reward functions $r$ and $r'$ that differ only by an $x$‑dependent term $f(x)$ induce identical preference distributions.

Two reward functions from the same equivalence class produce the same optimal policy in the KL‑constrained RL problem.
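Both statements follow from a one-line cancellation. Writing $r'(x,y)=r(x,y)+f(x)$, a sketch of the argument is:

$$
p_{r'}(\tau\mid y_{1},\dots,y_{K},x)=\prod_{k=1}^{K}\frac{\exp\big(f(x)\big)\exp\big(r(x,y_{\tau(k)})\big)}{\exp\big(f(x)\big)\sum_{j=k}^{K}\exp\big(r(x,y_{\tau(j)})\big)}=p_{r}(\tau\mid y_{1},\dots,y_{K},x)
$$

$$
\pi_{r'}(y\mid x)=\frac{\pi_{\text{ref}}(y\mid x)\exp\!\big(\tfrac{1}{\beta}r(x,y)\big)\exp\!\big(\tfrac{1}{\beta}f(x)\big)}{\exp\!\big(\tfrac{1}{\beta}f(x)\big)\sum_{y'}\pi_{\text{ref}}(y'\mid x)\exp\!\big(\tfrac{1}{\beta}r(x,y')\big)}=\pi_{r}(y\mid x)
$$

In each case the factor involving $f(x)$ does not depend on $y$, so it factors out of the sum and cancels.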

Assume $\pi_{\text{ref}}(y|x)>0$ for all $x,y$ and $\beta>0$. Every reward equivalence class can be expressed as $r(x,y)=\beta \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)}$ for some policy $\pi$.

Within each equivalence class, the log‑ratio reward $r(x,y)=\beta \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)}$ is unique.

Implementation Details

Implementation specifics and hyperparameters for DPO and the experimental setups.

We implement DPO directly in PyTorch; the loss follows the formulation from §2 and requires only policy and reference log‑probabilities together with the indices of the preferred and dispreferred completions.

DPO loss implementation.

Unless otherwise noted we use $\beta$ = 0.1, a batch size of 64, and RMSprop with a learning‑rate of 1e‑6. The learning‑rate is linearly warmed from 0 to 1e‑6 over the first 150 optimization steps.
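The warmup schedule described above can be sketched as a small helper. The function name `warmup_lr`, the zero-based step count, and holding the rate constant after warmup are assumptions made for illustration.

```python
def warmup_lr(step: int, peak_lr: float = 1e-6, warmup_steps: int = 150) -> float:
    """Linear warmup from 0 to peak_lr over warmup_steps, then constant."""
    return peak_lr * min(1.0, step / warmup_steps)


schedule = [warmup_lr(s) for s in (0, 75, 150, 1000)]
```

Most optimizer libraries accept a multiplier of this form through a scheduler hook, so the same expression can usually be passed in as a lambda.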

For the TL;DR summarization task we increase the KL temperature to $\beta$ = 0.5 while keeping all other settings identical.

In the IMDb sentiment experiment prompts are short prefixes (2–8 tokens) drawn from the IMDb dataset. We employ the pre‑trained sentiment‑roberta‑large‑english model as a ground‑truth reward model and gpt2‑large as the base language model.

We first fine‑tune the base model on a subset of IMDb for a single epoch. Then we sample four completions for each of 25 000 prefixes and construct six preference pairs per prefix using the reward model scores.
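Four completions yield $\binom{4}{2}=6$ unordered pairs, which is where the six preference pairs per prefix come from. A minimal sketch of that construction follows; the function name, the example strings and scores, and the choice to skip exact ties are all assumptions.

```python
from itertools import combinations


def build_preference_pairs(prefix, completions, scores):
    """Return (prefix, preferred, dispreferred) triples for all completion pairs."""
    pairs = []
    for i, j in combinations(range(len(completions)), 2):
        if scores[i] == scores[j]:
            continue  # assumption: skip exact ties rather than pick arbitrarily
        winner, loser = (i, j) if scores[i] > scores[j] else (j, i)
        pairs.append((prefix, completions[winner], completions[loser]))
    return pairs


pairs = build_preference_pairs(
    "This movie was",
    ["great", "fine", "dull", "awful"],
    [0.9, 0.6, 0.3, 0.1],  # made-up sentiment scores
)
```

Each triple has the same shape as an entry $(x, y_w, y_l)$ of the offline preference dataset.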

The RLHF reward model is initialized from the same gpt2‑large checkpoint and trained for three epochs on the generated preference data; we keep the checkpoint with the highest validation accuracy. The “TRL” run follows the hyper‑parameters of the TRL library, and each PPO step processes 1 024 samples.

Win‑rate judgments for summarization and dialogue are obtained via GPT‑4 (model gpt‑4‑0314). For summarization we use two prompt variants (S and C) that first request a one‑sentence comparison and then a single‑letter choice. The dialogue prompt similarly asks for a one‑sentence helpfulness comparison followed by an “A”/“B” answer.

**Figure 4.** Best of $N$ baseline for $N = \{1, 4, 16, 64, 128\}$. Performance plateaus after roughly 64-128 samples.

**Figure.** A screenshot of a summarization evaluation task interface. The task asks the user to compare two summaries (Summary A and Summary B) of a provided forum post about a relationship conflict involving a partner's collection of toys and video games. The interface includes radio buttons for selecting the better summary or indicating that the summaries are nearly identical.

We also report an unlikelihood baseline that maximizes log p($y_w$ | x) while minimizing log p($y_l$ | x). It is omitted from summarization and dialogue experiments because it yields incoherent outputs, a symptom of unconstrained likelihood minimization.
