Self-Rewarding Language Models

Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston

Self-Rewarding Language Models use iterative LLM-as-a-Judge prompting to generate their own training rewards.

Can a language model improve its own alignment by acting as both the generator and the reward-model judge during iterative training?

Current alignment methods rely on fixed reward models trained from static human preferences, which creates a performance bottleneck and prevents the reward model from improving alongside the language model. Self-Rewarding Language Models solve this by using the model itself to generate and evaluate its own training data via LLM-as-a-Judge prompting, creating a virtuous cycle of improvement. Fine-tuning Llama 2 70B with this iterative approach yields a model that outperforms systems like Claude 2 and GPT-4 0613 on the AlpacaEval 2.0 leaderboard.

Paper Primer

The core mechanism is an iterative training loop where the model acts as both the instruction-follower and the reward-provider. In each iteration, the model generates candidate responses to new prompts, scores them using a specific additive-criteria prompt, and uses these preference pairs to train the next iteration via Direct Preference Optimization (DPO).

Iterative self-rewarding training significantly improves instruction-following performance.

Head-to-head evaluations against the SFT baseline show consistent win-rate gains across three iterations, reaching 62.5% for the final model (M3). The M3 model achieves a 20.44% win rate against GPT-4 Turbo on the AlpacaEval 2.0 leaderboard, surpassing several proprietary models.

The model's ability to provide high-quality rewards improves alongside its instruction-following capability.

Pairwise accuracy against held-out human preference data increases from 78.7% in Iteration 1 to 81.7% in Iteration 3. The model demonstrates a consistent upward trend in reward-modeling metrics (Spearman correlation, Kendall’s $\tau$) without additional human-labeled evaluation data.

Why use this iterative approach instead of just training a better reward model once?

Standard reward models are frozen after training, limiting their performance to the quality of the initial human data. This method allows the reward model to update and improve during training, potentially exceeding the performance ceiling of the original human-authored seed data.

Does this approach work for all types of tasks?

The authors observe that improvements are less pronounced in mathematics and logical reasoning, likely because the seed data from Open Assistant under-emphasizes these domains. The method currently excels at general instruction following rather than specialized reasoning tasks.

The Case for Self-Rewarding Models

We expose the RLHF human‑feedback bottleneck and motivate a self‑rewarding loop that lets LLMs train their own reward model.

Current RLHF pipelines rely on human‑generated preference data to train a reward model, then freeze that model while the LLM learns via reinforcement learning. This creates a two‑fold bottleneck: the limited scale of human feedback and the inability of a frozen reward model to improve as the LLM evolves. To break this, we propose a self‑rewarding loop where the model itself judges its outputs, eliminating the need for additional human input.

The core problem is that a reward model trained on finite human preferences is bottlenecked by the quality of that data, and once frozen it cannot adapt to the LLM’s changing behavior.

Compute $N^2$: $8^2 = 64$ entries.

Multiply by 4 bytes per entry: $64 \times 4 = 256$ bytes total.

If $N$ grows to $16\,384$, the map size becomes $16\,384^2 \approx 2.68\times10^8$ entries, i.e. roughly 1 GB of memory, which quickly exceeds typical GPU limits.

This scaling illustrates why relying on a frozen reward model that must materialize large attention structures becomes a memory bottleneck, motivating a self‑rewarding approach that avoids such costly intermediate representations.

The human‑feedback bottleneck in RLHF limits both the quality and scalability of LLM alignment.

The Self-Rewarding Loop

Self‑rewarding models let the LLM generate and judge its own training data, removing the need for external human feedback.

Self‑rewarding models aim to eliminate the costly human‑feedback loop by letting the model generate and judge its own training data. We first describe the core loop, then detail the steps that enable iterative improvement.

The model generates instruction‑following responses, then uses the same generation mechanism to score those responses, feeding the scores back as preference data for the next training round.

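Consider a toy prompt asking for the French word for “cat”, with two sampled candidates $y^{1}=\text{"chat"}$ and $y^{2}=\text{"chaton"}$.

The model evaluates $y^{1}$, producing a high reward $4.8$ because the translation is exact.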

The model evaluates $y^{2}$, producing a low reward $2.3$ because “chaton” means “kitten”, not “cat”.

These scores form a preference pair $(\text{prompt},\; y^{\text{win}}=\text{"chat"},\; y^{\text{lose}}=\text{"chaton"})$.

The pair is added to the training set for the next DPO fine‑tuning step.

After fine‑tuning, the updated model $M_{t+1}$ is more likely to output “chat” on similar prompts.

This tiny toy illustrates how the loop creates useful supervision without any human‑written label for the new example.
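The pair-selection step in this toy example can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `build_preference_pair`, the prompt wording, and the rule of skipping candidates whose top and bottom scores are equal are assumptions made for the sketch.

```python
from typing import Optional


def build_preference_pair(prompt: str, scored: dict[str, float]) -> Optional[dict]:
    """Turn self-assigned scores into one (prompt, win, lose) record.

    `scored` maps each candidate response to its judge score.
    Returns None when no usable preference can be formed.
    """
    if len(scored) < 2:
        return None
    winner = max(scored, key=scored.get)
    loser = min(scored, key=scored.get)
    # Assumption: equal top and bottom scores carry no preference signal.
    if scored[winner] == scored[loser]:
        return None
    return {"prompt": prompt, "win": winner, "lose": loser}


# Hypothetical prompt wording; the scores are the ones used in the toy example.
pair = build_preference_pair(
    "Translate 'cat' into French.", {"chat": 4.8, "chaton": 2.3}
)
print(pair)
```

The returned record is the kind of item that would be appended to the preference dataset for the next DPO round.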

How does this differ from standard RLHF where a separate reward model is trained on human preferences?

In RLHF the reward model is frozen after supervised training, so its judgments never improve. Here the same model both generates responses and predicts their scores, so the reward function is updated each iteration alongside the policy.

Fine‑tune the base LLM on a small set of human‑written instruction–response pairs so it learns to obey user prompts.

Train the model to output a quality score and reasoning for a given response, using prompts that ask it to evaluate itself.

After the model can judge its own outputs, we create preference pairs from its highest‑scoring and lowest‑scoring responses and fine‑tune the model with DPO.

Generate a new prompt $x_i$ by few‑shot sampling from the seed IFT data.

Sample $N$ diverse candidate responses $y_i^{1},\dots,y_i^{N}$ from the current model $M_t$.

Prompt $M_t$ (LLM‑as‑a‑Judge) to score each candidate, producing rewards $r_i^{j}\in[0,5]$.

Start from the base pretrained model $M_0$.

Fine‑tune on IFT + EFT seed data to obtain $M_1$.

For $t=1$ to $T\!-\!1$:

**Figure 1.** **Self-Rewarding Language Models.** Our self-alignment method consists of two steps: (i) *Self-Instruction creation*: newly created prompts are used to generate candidate responses from model $M_t$, which also predicts its own rewards via LLM-as-a-Judge prompting. (ii) *Instruction following training*: preference pairs are selected from the generated data, which are used for training via DPO, resulting in model $M_{t+1}$. This whole procedure can then be iterated resulting in both improved instruction following and reward modeling ability.
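The two steps in the caption above can be laid out as a short control-flow sketch. Everything here is a stand-in: `generate`, `judge`, and `dpo_train` are hypothetical callables supplied by the caller, the "model" in the demo is just an integer version number, and the tie-skipping rule is an assumption. The point is only to show how generation, self-judging, and the DPO update chain together.

```python
import itertools


def self_rewarding_loop(model, prompts, generate, judge, dpo_train,
                        n_candidates=4, iterations=2):
    """Run `iterations` rounds of generate -> self-judge -> DPO update."""
    pairs_per_round = []
    for _ in range(iterations):
        pairs = []
        for prompt in prompts:
            # The current model produces the candidates ...
            candidates = [generate(model, prompt) for _ in range(n_candidates)]
            # ... and the same model scores them.
            scores = [judge(model, prompt, c) for c in candidates]
            hi = max(range(n_candidates), key=scores.__getitem__)
            lo = min(range(n_candidates), key=scores.__getitem__)
            if scores[hi] > scores[lo]:  # assumption: skip ties
                pairs.append((prompt, candidates[hi], candidates[lo]))
        model = dpo_train(model, pairs)  # yields the next model in the chain
        pairs_per_round.append(len(pairs))
    return model, pairs_per_round


# Toy stand-ins so the sketch runs end to end.
_counter = itertools.count()
toy_generate = lambda model, prompt: f"{prompt}#{next(_counter)}"
toy_judge = lambda model, prompt, response: int(response.rsplit("#", 1)[1]) % 6
toy_train = lambda model, pairs: model + 1

final_model, pairs_per_round = self_rewarding_loop(
    1, ["p1", "p2"], toy_generate, toy_judge, toy_train
)
print(final_model, pairs_per_round)
```

Starting from version 1 and running two rounds mirrors the M1 to M3 chain described in the text.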

Experimental Setup

We detail the data, metrics, and training pipeline that enable self‑rewarding experiments.

Our experiments start from the pretrained Llama 2 70B model. Fine‑tuning it on human‑authored instruction data (IFT) alone gives the SFT baseline; fine‑tuning on IFT together with evaluation data (EFT) gives M1, which acts as both the instruction follower and the judge.

Supervised fine‑tuning on the high‑quality instruction examples gives a model that can follow human prompts without any reward‑model feedback.

Initialize the model parameters from the pretrained checkpoint.

Compute cross‑entropy loss on the target tokens of example 1 (token “chat”).

Update parameters with learning rate $5.5\!\times\!10^{-6}$.

Repeat steps 2‑3 for example 2 (token “5”).

After one epoch, the model predicts “chat” and “5” with > 90 % confidence.

The toy fine‑tuning shows how a few supervised examples teach the model to produce the target outputs, illustrating the SFT baseline’s role as a starting policy.

We assess models along two axes. Instruction following is measured with GPT‑4 as a judge on 256 prompts, plus AlpacaEval 2.0 (805 prompts), MT‑Bench (multi‑turn questions), and nine standard NLP benchmarks. Reward modeling is evaluated by correlating model scores with human rankings on the Open Assistant evaluation set, reporting pairwise accuracy, exact‑match ordering, Spearman, Kendall’s $\tau$, and the frequency of perfect‑5 predictions.
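Two of the reward-modeling metrics named above, pairwise accuracy and Kendall's $\tau$, can be computed as follows. This is a sketch under stated assumptions: pairs that the human ranking ties are skipped, a model tie counts as a miss, and the $\tau$ shown is the simple tau-a variant. The paper's exact evaluation code may handle ties differently, and the scores below are made up.

```python
from itertools import combinations


def pairwise_accuracy(model_scores, human_scores):
    """Share of response pairs that model and human order the same way."""
    agree = total = 0
    for i, j in combinations(range(len(human_scores)), 2):
        h = human_scores[i] - human_scores[j]
        if h == 0:
            continue  # assumption: no human preference, so skip the pair
        m = model_scores[i] - model_scores[j]
        total += 1
        if m != 0 and (m > 0) == (h > 0):
            agree += 1
    return agree / total if total else 0.0


def kendall_tau(a, b):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


# Made-up scores for four responses to one prompt.
model = [4.8, 2.3, 3.0, 1.0]
human = [4, 2, 1, 3]
print(pairwise_accuracy(model, human), kendall_tau(model, human))
```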

SFT training uses a cosine‑decayed learning rate from $5.5\!\times\!10^{-6}$ to $1.1\!\times\!10^{-6}$, batch size 16, dropout 0.1, and loss on target tokens only. DPO training follows a similar schedule with an initial rate $1\!\times\!10^{-6}$ decaying to $1\!\times\!10^{-7}$, the same batch and dropout, and a $\beta$ = 0.1 regularizer; checkpoints are saved every 200 steps and evaluated with Claude 2 on 253 validation examples.
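A cosine decay between the two SFT learning rates quoted above can be written as a small function. The sketch assumes no warmup and a hypothetical `total_steps` of 1,000, neither of which is given in the text.

```python
import math


def cosine_lr(step, total_steps, lr_start, lr_end):
    """Cosine decay from lr_start (step 0) to lr_end (step total_steps)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * progress))


# SFT endpoints from the text; the step count is a placeholder.
for step in (0, 500, 1000):
    print(step, cosine_lr(step, 1000, 5.5e-6, 1.1e-6))
```

Under these assumptions the rate sits halfway between the two endpoints at the midpoint of training.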

Self‑instruction creation uses a fixed Llama 2‑Chat 70B model with 8‑shot prompting: six demonstrations from IFT and two from the model itself, decoded with temperature 0.6 and top‑p 0.9. Generated prompts are filtered by ROUGE‑L similarity, keyword presence, and length. For each prompt we sample $N=4$ candidate responses (temperature 0.7, top‑p 0.9) and evaluate each three times, averaging the scores.
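The ROUGE‑L similarity filter mentioned above can be approximated with a longest-common-subsequence score over whitespace tokens. This is a simplified sketch: the tokenization, the F1 form of the score, and the 0.7 threshold are assumptions, since the text does not give the exact filter settings.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 on lowercase whitespace tokens (simplified)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def is_novel(prompt, pool, threshold=0.7):
    """Keep a generated prompt only if it is not too close to any pooled one."""
    return all(rouge_l_f1(prompt, existing) < threshold for existing in pool)


pool = ["write a poem about cats"]
print(is_novel("write a poem about dogs", pool))     # near-duplicate wording
print(is_novel("explain how vaccines work", pool))   # unrelated wording
```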

We add 3,964 preference pairs to form the AIFT(M1) dataset and 6,942 pairs for AIFT(M2), which are then used to train the next model in the self‑rewarding chain via DPO.
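For a single preference pair, the standard DPO objective reduces to a logistic loss on the difference of policy-versus-reference log-probability ratios. The sketch below evaluates that per-pair loss with the $\beta = 0.1$ quoted earlier. The sequence log-probabilities are invented numbers, and the function is an illustration of the general DPO formula rather than the training code used in the paper.

```python
import math


def dpo_loss(policy_logp_win, policy_logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (win log-ratio - lose log-ratio)))."""
    margin = beta * (
        (policy_logp_win - ref_logp_win) - (policy_logp_lose - ref_logp_lose)
    )
    # -log(sigmoid(m)) = log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Invented sequence log-probabilities for one (win, lose) pair.
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))  # policy equals reference
print(dpo_loss(-4.0, -9.0, -5.0, -5.0))  # policy already prefers the winner
```

When the policy matches the reference the loss equals $\log 2$, and it falls as the policy shifts probability toward the winning response.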

Instruction Following Performance

Self‑Rewarding iterations steadily raise instruction‑following win rates.

Iteration 2 (M2) outperforms Iteration 1 (M1) in head‑to‑head instruction following, winning 55.5% of prompts.

M2 wins 55.5% versus M1’s 11.7% in the pairwise evaluation.
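Head-to-head percentages of this kind come from tallying per-prompt verdicts. The sketch below shows the tally. The individual counts (142 wins, 84 ties, 30 losses over 256 prompts) are back-derived here purely for illustration so that the output matches the reported percentages; they are not figures stated in the source.

```python
from collections import Counter


def win_rates(verdicts):
    """Percentage of 'a' wins, ties, and 'b' wins in a list of verdicts."""
    counts = Counter(verdicts)
    n = len(verdicts)
    return {label: 100.0 * counts[label] / n for label in ("a", "tie", "b")}


# Illustrative verdicts: 'a' = M2 preferred, 'b' = M1 preferred.
verdicts = ["a"] * 142 + ["tie"] * 84 + ["b"] * 30
rates = win_rates(verdicts)
print({label: round(value, 1) for label, value in rates.items()})
```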

**Figure 3.** Instruction following ability improves with Self-Training: We evaluate our models using head-to-head win rates on diverse prompts using GPT-4. The SFT Baseline is on par with Self-Rewarding Iteration 1 ($M_1$). However, Iteration 2 ($M_2$) outperforms both Iteration 1 ($M_1$) and the SFT Baseline. Iteration 3 ($M_3$) gives further gains over Iteration 2 ($M_2$), outperforming $M_1$, $M_2$ and the SFT Baseline by a large margin.

**Table 1.** AlpacaEval 2.0 results (win rate over GPT-4 Turbo evaluated by GPT-4). Self-Rewarding iterations yield improving win rates. Iteration 3 ($M_3$) outperforms many existing models that use proprietary training data or targets distilled from stronger models.

**Figure 5.** **Human evaluation results.** Iterations of Self-Rewarding ($M_1$, $M_2$ and $M_3$) provide progressively better head-to-head win rates compared to the SFT baseline, in agreement with the automatic evaluation results.

Analysis of Self-Alignment

Self‑Rewarding lets the model generate its own training signals, cutting the need for extra human feedback.

The self‑rewarding loop replaces costly human preference collection with model‑generated signals, enabling iterative improvement without extra annotation.

**Figure 8.** EFT data helps the self-rewarding loop: We evaluated the series of models trained using self-reward loops starting from the model trained using only IFT data. We performed head-to-head win rates comparisons on the IFT test set. While $M'_2$ can improve over the SFT baseline and $M'_3$ can improve even more over the SFT baseline, they lag far behind the corresponding models ($M_2$, $M_3$) that started from a base model trained using both IFT and EFT data, see Figure 3.

Removing EFT data (i.e., training without the EFT component) drops win rate by roughly 12 %.

Figure 8’s bottom chart shows the EFT‑augmented model M₃ wins 38.7 % against M′₃, whereas M′₃ alone only achieves 50.4 % against the SFT baseline, a ≈12 % relative decline.

Two further self‑rewarding iterations raise pairwise accuracy by 3 percentage points.

Table 4 reports pairwise accuracy of 81.7 % for M₃ versus 78.7 % for M₁, the model trained on the IFT and EFT seed data only.

Without the EFT component, 5‑best % falls by about 2 percentage points.

Table 4 shows 5‑best % of 41.5 % for M₁ (trained on IFT and EFT) versus 39.6 % for the IFT‑only baseline.

Exact Match % drops by 3 percentage points when EFT is removed.

Table 4 lists Exact Match of 13.1 % for M₁ versus 10.1 % for the IFT‑only model.

Spearman correlation is 0.07 lower without AIFT.

Table 4 records Spearman 0.349 for M₃ versus 0.279 for M₁, which is trained without AIFT data.

Kendall $\tau$ correlation is 0.09 lower when both EFT and AIFT are omitted.

Table 4 gives Kendall $\tau$ of 0.324 for M₃ compared to 0.233 for the IFT‑only model.

**Figure 11.** AlpacaEval win rate breakdown for instruction complexities (left) and expected response lengths (right). Self-Rewarding models give gains across most complexities and all response length ranges.

**Figure 4.** AlpacaEval win rate breakdown for instruction categories (full names given in Appendix). Self-Rewarding models give gains across several topics, but tend to e.g. give less gains on mathematics and reasoning tasks.

Reward Modeling and Prior Work

Self‑Rewarding models achieve higher reward‑model accuracy without extra human data.

After three self‑rewarding iterations the model attains 81.7 % pairwise accuracy, surpassing the 78.7 % of the first iteration without additional EFT data.

Table 4 reports Model M3 (Iteration 3) achieving 81.7 % versus Model M1’s 78.7 %.

Self‑reward training further improves reward modeling. Using the reward model from iteration 1 to train iteration 2 raises pairwise accuracy to 80.4 %, and a third iteration pushes it to 81.7 % despite no new EFT data.

Related work spans RLHF methods that train a fixed reward model from human preferences, DPO approaches that skip reward‑model training, and RLAIF techniques that use an LLM as a judge. Iterative schemes such as PCO and ReST also curate data with a fixed reward, but our self‑rewarding loop avoids the extra computational cost of a separate reward model.

Limitations and Conclusion

We outline the current gaps and open questions for Self‑Rewarding models.

Self‑Rewarding models learn by generating their own preference‑based data and training on it. This section revisits the premise that the model acts as its own judge, then enumerates the remaining open questions.

Our experiments covered only three iterations in a single configuration, so the observed improvements may not generalize. A key unanswered question is how the benefit scales with more iterations or with models of varying capability.

We also noticed that longer generations tend to receive higher estimated quality scores, a correlation that could mask underlying issues. Moreover, the possibility of “reward‑hacking” – the model learning to game its own reward signal – has not been investigated.
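One simple way to quantify the length effect described above is the Pearson correlation between response length and judge score. The sketch below is an illustrative check only, and the lengths and scores are invented rather than taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0  # correlation is undefined for a constant series
    return cov / (std_x * std_y)


# Invented data: response lengths in tokens and the judge's scores.
response_lengths = [40, 120, 260, 310, 500]
judge_scores = [2.0, 3.0, 3.7, 4.0, 4.7]
print(pearson(response_lengths, judge_scores))
```

A high value on real data would flag that score gains may partly reflect longer outputs rather than better ones.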

Safety evaluation is another missing piece; existing reward models are trained explicitly for safety, yet we have only used LLM‑as‑a‑Judge for evaluation. Future work should embed safety‑focused judging into the self‑rewarding loop and examine whether later iterations can mitigate harder safety failures.

Data Distributions

Appendix details data distributions, prompt variants, and additional experimental breakdowns.

Section A.1 visualizes the instruction and response distributions of IFT, EFT, and AIFT data. The two scatter plots reveal that EFT data occupy a distinct dense region, while IFT and AIFT share a broader, overlapping space.

**Figure 1.** (a) Instruction distribution of IFT, EFT and AIFT data. (b) Response distribution of IFT, EFT, and AIFT data.

Section A.2 compares two EFT prompting strategies: the original multiple‑choice prompt from Li et al. [2024] and our additive score‑counting prompt. The latter yields substantially higher pairwise accuracy and correlation metrics.

**Table 5.** We tried various LLM-as-Judge prompts using the model trained with 3,200 IFT data only and found that our additive score-counting prompt worked best, giving significant improvements in EFT performance compared to the prompt used by Li et al. [2024].

Section A.3 reports experiments where models are trained only on IFT data, then iteratively refined with AIFT via DPO. Without EFT data, the models struggle to assign meaningful scores, often converging to a default value of 4.

**Table 10.** MT-Bench Results

Tables 7 and 8 further detail instruction complexity and expected response length. Complexity peaks at level 1 (29.57 % of examples), while most responses are short (1‑3 sentences, 44.84 %).

Section A.4 notes that a simple self‑training pipeline that augments SFT with high‑quality self‑generated instructions, but without preference optimization, underperforms the full self‑rewarding loop.

Additional Training Details

Appendix B details extra experiments, data analyses, and benchmark results for the Self‑Rewarding models.

In this SFT‑only variant, which replaces DPO with further supervised fine‑tuning, we augment the seed set with additional (instruction, response) pairs that the model itself rated as perfect ($r_i = 5$). Despite adding 11,254 such examples and tuning the mixing weight, the head‑to‑head result against the SFT baseline was $29\%$ wins versus $30\%$ for the baseline, indicating no measurable improvement.

For efficiency we pre‑compute a pool of augmented prompts using Llama 2‑Chat 70B; in an interactive setting these could be supplied by real users. We also test whether the newly trained Self‑Rewarding models can generate prompts via in‑context learning, constructing 30 prompts from the original IFT seed data. Manual inspection shows all three models (M1, M2, M3) can produce novel instructions, though M2 and M3 often emit a separator before the responses, requiring post‑processing.

We cluster the AlpacaEval test set using GPT‑4 into three perspectives: instruction category, instruction complexity, and expected response length. Prompts from Figure 9 yield 20 categories, while prompts from Figure 10 assign each example to a complexity or length cluster; the resulting statistics appear in Tables 6–8, with fine‑grained visualisations in Figure 11.

**Table 7.** Breakdown of AlpacaEval test set instructions by instruction complexity. The instructions increase in complexity from 1 to 9, where 10 is a complex question that requires first reasoning or breaking the problem into sub-problems before it can be solved.

**Table 8.** Breakdown of AlpacaEval test set instructions by expected response length.

**Table 10.** MT-Bench Fine-grained Results. We list our models' performance on each problem category. Self-reward is especially effective in improving the model's ability in writing, role-playing, extraction, and STEM tasks.

Table 9 aggregates NLP benchmark results, showing that Self‑Rewarding models generally preserve the performance of the Llama 2 base and the SFT baseline despite being fine‑tuned on markedly different instruction prompts.

For multiple‑choice benchmarks (ARC‑Challenge, HellaSwag, SIQA, PIQA, OBQA) we select the answer with the highest log‑probability, a metric that differs from the reward‑model objective; consequently, these scores may not fully reflect the models’ true capabilities.
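Selecting the highest log‑probability answer amounts to an argmax over the candidate options. The sketch below shows that selection rule plus an accuracy tally. The option labels and log-probabilities are invented, and details such as length normalization, which some evaluation harnesses apply, are left out.

```python
def pick_answer(option_logprobs: dict[str, float]) -> str:
    """Return the option whose continuation has the highest log-probability."""
    return max(option_logprobs, key=option_logprobs.get)


def accuracy(examples):
    """Fraction of (option_logprobs, gold_label) examples answered correctly."""
    correct = sum(pick_answer(logprobs) == gold for logprobs, gold in examples)
    return correct / len(examples)


# Invented log-probabilities for two multiple-choice questions.
examples = [
    ({"A": -3.2, "B": -1.1, "C": -4.0, "D": -2.7}, "B"),
    ({"A": -0.9, "B": -1.4, "C": -2.2, "D": -3.1}, "C"),
]
print(accuracy(examples))
```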
