RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, Sushant Prakash

RLAIF matches RLHF performance across summarization and dialogue tasks, enabling scalable alignment without human labels.

Can we replace expensive human preference labels in RLHF with labels generated by a pre-trained LLM (RLAIF) without sacrificing model performance?

Reinforcement Learning from Human Feedback (RLHF) is the standard for aligning language models, but it is bottlenecked by the high cost and slow speed of human annotation. Reinforcement Learning from AI Feedback (RLAIF) replaces human annotators with an off-the-shelf Large Language Model (LLM) that generates preference labels to train a reward model, or provides rewards directly during reinforcement learning. Across summarization and dialogue tasks, RLAIF achieves performance comparable to RLHF, while also demonstrating the ability to improve models using only AI-generated feedback.

Paper Primer

The core mechanism involves prompting an LLM to compare two candidate responses and output a preference distribution. This distribution is either distilled into a reward model (canonical RLAIF) or used directly as a reward signal during reinforcement learning (direct-RLAIF).

RLAIF achieves performance parity with RLHF.

Human evaluators preferred RLAIF over a supervised fine-tuned baseline at rates statistically indistinguishable from RLHF across summarization and helpful dialogue tasks. RLAIF and RLHF were preferred over the baseline 71% and 73% of the time for summarization, and 63% and 64% for helpful dialogue.

Direct-RLAIF (d-RLAIF) outperforms canonical RLAIF.

By bypassing reward model training and querying the LLM directly during reinforcement learning, d-RLAIF avoids reward model staleness and improves performance. Annotators preferred d-RLAIF over canonical same-size RLAIF 60% of the time.

Why does this approach matter if RLHF is already effective?

RLHF is limited by the scalability and cost of human labeling. RLAIF offers a significantly cheaper (estimated 10x lower cost) and faster alternative that does not rely on human annotators.

What is the scope of this study?

The authors evaluate RLAIF on three specific tasks: summarization, helpful dialogue generation, and harmless dialogue generation. They note that while RLAIF is promising, human experts remain the gold standard for high-stakes domains like medicine or law.

Researchers can now treat AI feedback as a viable, scalable substitute for human labels in alignment pipelines, with direct-RLAIF removing the reward-model training step altogether.

Introduction

We expose the human‑label bottleneck and show how AI‑generated feedback can replace it.

Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone for aligning large language models, yet its reliance on expensive human preference labels creates a severe scalability bottleneck. By substituting these labels with those generated by a pre‑trained LLM, Reinforcement Learning from AI Feedback (RLAIF) promises a path to scalable alignment without sacrificing performance.

RLAIF trains a reward model on preferences that an off‑the‑shelf LLM produces, eliminating the need for costly human annotations.

Collecting $10{,}000$ human labels would need $10{,}000 \times 256\text{ KB} \approx 2.5\text{ GB}$ of storage and thousands of hours of annotator time.

Generating the same number of AI labels costs only the compute to run the LLM, roughly $10{,}000$ forward passes, which on a single GPU finishes in minutes and consumes negligible storage.

By the paper's own estimate, AI labeling is roughly $10\times$ cheaper than human annotation.

This toy calculation shows why human labeling quickly becomes the dominant expense as model size and dataset scale grow.

Our experiments demonstrate three key findings: (1) RLAIF attains performance on par with RLHF across three benchmark tasks; (2) RLAIF can improve over a supervised fine‑tuned baseline even when the labeler is the same model checkpoint; and (3) direct‑RLAIF (d‑RLAIF), which bypasses reward‑model training, matches or exceeds canonical RLAIF.

The scalability bottleneck of human‑in‑the‑loop alignment can be removed by replacing human preference labels with LLM‑generated feedback.

The RLAIF Pipeline

How we let a generic LLM generate reliable preference labels for RL training.

Human preference labeling is a major cost driver for RLHF pipelines. Replacing humans with a generic LLM promises scalability without sacrificing alignment quality.

We ask an off‑the‑shelf LLM to rank two candidate outputs, then turn the LLM’s token‑level log‑probabilities for “1” and “2” into a soft preference distribution.

Suppose the LLM assigns log‑probability $-1.2$ to token “1” and $-0.5$ to token “2”. Exponentiate: $\exp(-1.2)=0.30$, $\exp(-0.5)=0.61$.

Sum the exponentials: $0.30+0.61=0.91$.

Normalize: $P(1|x)=0.30/0.91\approx0.33$, $P(2|x)=0.61/0.91\approx0.67$.

The higher probability (0.67) indicates the LLM prefers candidate 2, while the 0.34 gap shows moderate confidence.

The distribution tells us not only which candidate wins but also how strongly the model leans toward it, which a binary label would hide.
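The worked example above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `preference_distribution` is illustrative, and the two log-probabilities are the example values used above rather than output from a real model.

```python
import math
from typing import Tuple

def preference_distribution(logp_1: float, logp_2: float) -> Tuple[float, float]:
    """Softmax over the log-probabilities of the tokens "1" and "2"."""
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the normalized result.
    m = max(logp_1, logp_2)
    e1 = math.exp(logp_1 - m)
    e2 = math.exp(logp_2 - m)
    total = e1 + e2
    return e1 / total, e2 / total

p1, p2 = preference_distribution(-1.2, -0.5)
print(round(p1, 2), round(p2, 2))  # 0.33 0.67
```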

Why is a soft probability vector preferable to a hard “choose 1 or 2” label?

Because the vector preserves the model’s confidence: a small gap (e.g., $[0.52,0.48]$) signals uncertainty, allowing the RL algorithm to weight the reward less aggressively, whereas a hard label would treat both cases as equally certain.
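One way to see the effect of soft labels is to look at the training signal they produce. The sketch below assumes the soft label is used as a cross-entropy target for a reward model whose predicted preference is a sigmoid of the reward difference; that loss form, and the names `soft_label_loss` and `loss_gradient`, are assumptions made for illustration.

```python
import math

def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def soft_label_loss(r1: float, r2: float, p1: float) -> float:
    """Cross-entropy between the label [p1, 1 - p1] and sigmoid(r1 - r2)."""
    q1 = _sigmoid(r1 - r2)
    eps = 1e-12  # guards against log(0)
    return -(p1 * math.log(q1 + eps) + (1.0 - p1) * math.log(1.0 - q1 + eps))

def loss_gradient(r1: float, r2: float, p1: float) -> float:
    """Derivative of the loss with respect to (r1 - r2): q1 - p1."""
    return _sigmoid(r1 - r2) - p1

# Reward model currently indifferent (r1 == r2):
uncertain = loss_gradient(0.0, 0.0, 0.52)  # soft label, small gap
hard = loss_gradient(0.0, 0.0, 1.0)        # hard label
print(round(uncertain, 2), round(hard, 2))  # -0.02 -0.5
```

Under these assumptions, an uncertain label of $[0.52, 0.48]$ nudges the reward difference 25 times more gently than a hard label would.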

LLMs are sensitive to the order in which candidate responses appear, a phenomenon known as position bias. If left unchecked, the bias can systematically favor the first‑presented candidate.

Run the LLM on the prompt with candidate A first and candidate B second; obtain distribution $P_{AB}$.

Run the LLM again with the order swapped (candidate B first, A second); obtain distribution $P_{BA}$.

Average the two distributions: $P_{\text{final}} = \frac{1}{2}(P_{AB}+P_{BA})$.

Why does simple averaging remove the bias instead of just picking the higher‑scoring run?

Because the bias is deterministic with respect to position: the same LLM will consistently favor the first slot. Averaging neutralizes that deterministic component, whereas picking the higher‑scoring run would retain the bias.
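The two-pass procedure can be sketched as follows. One detail worth making explicit: each run returns probabilities over slots “1” and “2”, so the swapped run has to be mapped back to candidates A and B before averaging. The function name and the example numbers are illustrative.

```python
from typing import Tuple

def average_over_orderings(
    slot_probs_ab: Tuple[float, float],
    slot_probs_ba: Tuple[float, float],
) -> Tuple[float, float]:
    """Return (P(A), P(B)) averaged over both presentation orders.

    slot_probs_ab: (P("1"), P("2")) when A is in slot 1 and B in slot 2.
    slot_probs_ba: (P("1"), P("2")) when B is in slot 1 and A in slot 2.
    """
    p_a = 0.5 * (slot_probs_ab[0] + slot_probs_ba[1])
    p_b = 0.5 * (slot_probs_ab[1] + slot_probs_ba[0])
    return p_a, p_b

# A labeler that always favors slot 1 carries no information about A vs. B:
pure_bias = average_over_orderings((0.7, 0.3), (0.7, 0.3))

# A labeler that favors A in both orders, with some first-slot bias on top:
mixed = average_over_orderings((0.8, 0.2), (0.4, 0.6))
```

In the first case the averaged distribution is uniform, so the positional preference cancels; in the second, A keeps a clear lead of about 0.7 to 0.3.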

To obtain richer judgments we can ask the LLM to explain its reasoning before scoring. This two‑step chain‑of‑thought (CoT) procedure yields more informed preference distributions.

Replace the final “Preferred Response=” token with a request for a rationale (e.g., “Explain which summary is better and why:”).

Generate the CoT response from the LLM.

Concatenate the original prompt, the CoT response, and the original ending token.

Run the scoring step (log‑prob extraction and softmax) on this augmented prompt to obtain $P$.

What advantage does the CoT step give over a plain “which is better?” prompt?

CoT makes the model articulate criteria such as coherence, accuracy, and coverage; those criteria influence the token probabilities, so the final preference distribution encodes a more nuanced assessment than a single‑question prompt.
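The two-step procedure can be sketched as below. The `generate` and `token_logprobs` callables are hypothetical stand-ins for whatever LLM interface is available, and the stubs return fixed values so the example runs without a model; only the prompt assembly and the final softmax are the point here.

```python
import math
from typing import Callable, Dict, Tuple

ENDING = "Preferred Response="
RATIONALE_REQUEST = "Explain which summary is better and why:"

def cot_preference(
    base_prompt: str,
    generate: Callable[[str], str],                       # hypothetical LLM call
    token_logprobs: Callable[[str], Dict[str, float]],    # hypothetical LLM call
) -> Tuple[str, Tuple[float, float]]:
    if not base_prompt.endswith(ENDING):
        raise ValueError("prompt must end with the scoring token")
    # Step 1: swap the ending for a rationale request and generate the CoT.
    first_prompt = base_prompt[: -len(ENDING)] + RATIONALE_REQUEST
    rationale = generate(first_prompt)
    # Step 2: append the rationale and restore the original ending.
    second_prompt = f"{first_prompt}\n{rationale}\n{ENDING}"
    # Step 3: softmax over the log-probabilities of "1" and "2".
    lp = token_logprobs(second_prompt)
    m = max(lp["1"], lp["2"])
    e1, e2 = math.exp(lp["1"] - m), math.exp(lp["2"] - m)
    return second_prompt, (e1 / (e1 + e2), e2 / (e1 + e2))

# Stubs so the sketch runs without a model.
def fake_generate(prompt: str) -> str:
    return "Summary 2 covers more of the key details."

def fake_logprobs(prompt: str) -> Dict[str, float]:
    return {"1": -1.2, "2": -0.5}

base = "Text: <post>\nSummary 1: <candidate A>\nSummary 2: <candidate B>\n" + ENDING
scoring_prompt, probs = cot_preference(base, fake_generate, fake_logprobs)
```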

We assess the quality of AI‑generated preferences with three metrics. AI Labeler Alignment measures agreement with human judgments; Win Rate quantifies how often a policy beats a baseline in human preference tests; Harmless Rate reports the fraction of outputs judged safe.
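As a rough illustration, the three metrics reduce to simple averages. The sketch assumes that a soft AI label is first converted to a hard choice by taking the higher-probability candidate; that binarization rule, the function names, and the toy data are all assumptions for illustration.

```python
from typing import List, Sequence, Tuple

def ai_labeler_alignment(
    ai_soft_labels: Sequence[Tuple[float, float]],
    human_choices: Sequence[int],
) -> float:
    """Fraction of examples where the AI's top candidate matches the human pick."""
    agree = 0
    for (p1, p2), human in zip(ai_soft_labels, human_choices):
        ai_choice = 1 if p1 >= p2 else 2
        agree += int(ai_choice == human)
    return agree / len(human_choices)

def win_rate(preferred: Sequence[str], policy: str) -> float:
    """Fraction of pairwise comparisons in which raters preferred `policy`."""
    return sum(1 for p in preferred if p == policy) / len(preferred)

def harmless_rate(judged_harmless: Sequence[bool]) -> float:
    """Fraction of responses judged harmless."""
    return sum(judged_harmless) / len(judged_harmless)

soft: List[Tuple[float, float]] = [(0.33, 0.67), (0.8, 0.2), (0.52, 0.48), (0.1, 0.9)]
alignment = ai_labeler_alignment(soft, [2, 1, 2, 2])
wins = win_rate(["rlaif", "sft", "rlaif", "rlaif"], "rlaif")
harmless = harmless_rate([True, True, False, True])
print(alignment, wins, harmless)  # 0.75 0.75 0.75
```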

**Figure 2.** A diagram depicting RLAIF (top) vs. RLHF (bottom)

**Figure 3.** An illustration of the process to obtain AI-generated preference labels for summarization. The LLM is first prompted to explain its thoughts on the quality of the two candidates (blue). The response (orange) is then appended to the first prompt, and together they form the second prompt used to generate a preference distribution over “1” vs. “2” (green).

Table 15 juxtaposes example summaries produced by RLAIF and RLHF across three scenarios, illustrating that the AI‑feedback model generates comparable, sometimes more concise, outputs while preserving content fidelity.

Experimental Setup and Direct-RLAIF

We detail the direct‑RLAIF trick, its prompting variants, and the experimental setup.

Canonical RLAIF suffers from reward‑model staleness: as the policy improves, the fixed reward model sees out‑of‑distribution outputs, degrading its guidance.

Instead of training a separate reward model, we ask the off‑the‑shelf LLM itself to score each generated response on a 1‑to‑10 scale, and feed that score straight to the RL update.

Step 1: The policy samples “The cat sleeps.”

Step 2: The LLM receives the prompt and outputs $r = 7$.

Step 3: REINFORCE computes the gradient $\nabla \log \pi_\theta(\text{output}) \cdot (r - b)$, where $b$ is a baseline (e.g., running average of past rewards).

Step 4: The policy parameters $\theta$ are updated in the direction of the gradient, increasing the probability of generating higher‑rated sentences.

Step 5: In the next iteration the policy may produce “The cat is sleeping peacefully.” The LLM now rates it 9, reinforcing the improvement.

Direct‑RLAIF replaces a learned reward model with a live LLM scorer, so the reward signal stays aligned with the policy’s current output distribution.
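The five steps above can be sketched as a toy REINFORCE loop. Everything here is a deliberately tiny stand-in: the policy is a softmax over two fixed sentences, and `llm_score` is a hypothetical lookup table replacing the live LLM scorer, returning the ratings used in the walkthrough.

```python
import math
import random
from typing import List

SENTENCES = ["The cat sleeps.", "The cat is sleeping peacefully."]

def llm_score(text: str) -> float:
    """Hypothetical stand-in for the LLM scorer (1-to-10 scale)."""
    return {SENTENCES[0]: 7.0, SENTENCES[1]: 9.0}[text]

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(steps: int = 200, lr: float = 0.1, seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    logits = [0.0, 0.0]   # policy parameters theta
    baseline = None       # running average of past rewards
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(SENTENCES)), weights=probs)[0]
        reward = llm_score(SENTENCES[action])
        if baseline is None:
            baseline = reward
        advantage = reward - baseline
        # For a softmax policy, d log pi(action) / d logit_i = 1[i == action] - pi_i.
        for i in range(len(logits)):
            grad_log_pi = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad_log_pi
        baseline = 0.9 * baseline + 0.1 * reward
    return softmax(logits)

final_probs = train()
```

After training, `final_probs[1]` exceeds `final_probs[0]`: the policy has shifted probability toward the sentence the scorer rates higher.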

How does Direct‑RLAIF differ from the standard RLHF pipeline that also uses a reward model?

In RLHF the reward model is a separate neural network trained on human‑annotated preferences; it is fixed once trained, so it drifts out of date as the policy changes. Direct‑RLAIF skips this model entirely: the LLM itself provides the reward at every step, so there is no reward model to go stale.

We experiment with several prompt templates that ask the LLM to rate quality, to rank multiple candidates, or to produce a binary “good/bad” signal; each variation changes how the reward is extracted from the LLM.

Modified REINFORCE loop used for Direct‑RLAIF.

**Figure 4.** In direct-RLAIF (d-RLAIF), the off-the-shelf LLM is directly used to provide rewards during RL, circumventing the issue of RM “staleness” and the time consuming process of RM training.

Performance Evaluation

RLAIF matches RLHF performance across summarization and dialogue tasks while cutting labeling cost tenfold.

RLAIF matches RLHF performance across summarization and helpful‑dialogue tasks.

Human win rates: 71 % (RLAIF) vs 73 % (RLHF) on summarization; 63 % vs 64 % on helpful dialogue. Harmless rate: 88 % (RLAIF) vs 76 % (RLHF).

Human judges compare two model outputs and pick the one they prefer, yielding a win‑rate that directly reflects perceived quality.

How does human evaluation differ from automatic metrics like BLEU?

Human judges assess overall usefulness, factual correctness, and safety, while BLEU only measures n‑gram overlap and can miss substantive errors.

**Figure 1.** Human evaluators strongly prefer RLAIF and RLHF over the SFT baseline for summarization and helpful dialogue generation. Furthermore, when compared head-to-head, RLAIF is equally preferred to RLHF. For harmless dialogue generation, RLAIF outperforms RLHF.

**Figure 6.** Example summaries generated by SFT, RLHF, and RLAIF policies for a Reddit post. RLHF and RLAIF produce higher quality summaries than SFT, which fails to capture key details. Salient details are in bold.

Table 1 reports win‑rates for RLAIF, RLHF, and SFT across summarization and helpful dialogue, as well as harmless rates for dialogue. RLAIF attains 71 % vs 73 % (RLHF) on summarization and 63 % vs 64 % on helpful dialogue, while achieving an 88 % harmless rate, substantially above RLHF’s 76 %.

Table 2 evaluates three prompting variations—preamble specificity, chain‑of‑thought (CoT), and in‑context learning. The best prompts improve AI‑labeler alignment by +1.9 % for summarization, +1.3 % for helpfulness, and +1.7 % for harmlessness relative to the base 0‑shot prompt.

Table 3 demonstrates a clear scaling trend: larger labeler models yield higher alignment scores (78 % for PaLM 2 L, 73.8 % for PaLM 2 S, 62.7 % for PaLM 2 XS), confirming that model size directly impacts label quality.

RLHF Preliminaries

Key background on the RLHF pipeline, position bias, and training specifics.

RLHF depends on costly human preference labels; RLAIF swaps those for LLM‑generated labels, keeping performance while scaling alignment.

RLHF builds a model in three stages: first it fine‑tunes a pretrained LLM on labeled data, then it learns a reward model from human‑ranked response pairs, and finally it refines the policy with reinforcement learning guided by that reward model.

Our analysis of LLM labelers reveals a pronounced position bias: smaller models tend to keep the same candidate position even after the order of candidates is swapped.

**Table.** The table shows the percentage of times different sizes of PaLM 2 models prefer the same position when candidates are swapped in a summarization task.

For summarization we use the filtered Reddit TL;DR corpus (123 k posts, 5 % held out as validation). For dialogue we employ Anthropic’s Helpful and Harmless preference sets, each containing roughly 40 k training examples and 2 k test examples.

LLM labeling runs with a 4096‑token context window. Chain‑of‑thought scores are generated by greedy decoding (temperature $0.0$). For self‑consistency experiments we vary temperature from $0.3$ to $1.0$ and use top‑$K$ sampling with $K=40$.

REINFORCE treats text generation as a finite‑horizon MDP, assigning a reward only at the final token and using the policy gradient to update the model parameters.

SFT models for summarization are trained for one epoch on Reddit TL;DR with batch size 128, using the Adafactor optimizer at a learning rate of $10^{-5}$; input and output lengths are capped at 1024 and 128 tokens respectively. For helpful/harmless dialogue we fine‑tune an instruction‑tuned PaLM 2 XS model.

Task-Specific Adjustments

Ablations expose how length control, feedback mixing, and labeling cost affect performance.

Controlling summary length drops win rates by ~12 % for both RLAIF and RLHF when compared to SFT.

Table 8 shows RLAIF vs SFT falling from 71 % to 59 % and RLHF vs SFT from 73 % to 61 % after length correction.

Length correction lowers RLAIF’s head‑to‑head win rate against RLHF by ~3 percentage points.

Table 8: RLAIF vs RLHF 50 %→47 % after correction.

In helpful dialogue, length control cuts RLAIF’s advantage over SFT and over RLHF by ~2 % each.

Table 9 reports RLAIF vs SFT 63 %→61 % and RLAIF vs RLHF 52 %→50 % after correction.

When measuring harmlessness, length correction raises RLAIF’s harmless rate by 3 percentage points.

Table 10 shows RLAIF’s harmless rate rising from 88 % (uncorrected) to 91 % (corrected).

RLHF’s harmlessness also improves modestly after length correction (+2 %).

Table 10 reports RLHF moving from 76 % to 78 % when lengths are aligned.

Direct RLAIF (d‑RLAIF) and same‑size RLAIF both lose ~9 % win rate after length correction versus SFT.

Table 11: Same‑size RLAIF 68 %→59 % and d‑RLAIF 74 %→65 %.

d‑RLAIF’s edge over same‑size RLAIF shrinks by 4 % after length correction.

Table 11: d‑RLAIF vs Same‑size RLAIF 60 %→56 %.

Adding AI‑generated feedback to RLHF does not improve win rates over RLHF alone.

RLHF + RLAIF achieves 71 % vs RLHF’s 74 % over SFT; the gap is not statistically significant.

Qualitative Analysis and Limitations

We compare RLAIF and RLHF across several quality dimensions and expose the remaining gaps.

RLHF depends on costly human preference labels; RLAIF swaps those for labels generated by a pre‑trained LLM, scaling alignment without hurting overall performance.

Even though the two pipelines achieve comparable human win rates, the appendix reveals systematic quality gaps that persist in RLAIF outputs.

**Table 21.** The “Base + CoT 0-shot” prompting template for the helpful dialogue generation task. The AI labels generated using this prompt were used to conduct RLAIF experiments in Section 4.1.

**Figure 5.** A screenshot of the user interface presented to human evaluators, ultimately used to calculate win rates. Raters are shown a context and asked to rank the quality of candidate responses.

Related Work and Conclusion

We place RLAIF in context of prior RL‑based alignment work and recap its benefits.

RLHF relies on costly human preference labels; RLAIF replaces them with labels generated by a pre‑trained LLM, preserving performance while scaling alignment.

Large language models have demonstrated strong results across a wide range of NLP tasks, and reinforcement learning has become a popular tool for further improving them.

Early RL attempts on translation and summarization used automatic metrics as rewards, which often failed to capture human notions of quality.

RLHF addresses this gap by training a reward model on pairwise human comparisons and has been successfully applied to summarization, instruction following, dialogue, and question answering.

To improve stability and efficiency, recent variants such as DPO replace the RL objective with a classification loss, while RaFT leverages the reward model for rejection‑sampling fine‑tuning.

Beyond alignment, LLMs have been employed for data generation, augmentation, and self‑training, and Bai et al. introduced RLAIF, which jointly optimizes helpfulness and harmlessness using both LLM‑generated and human preferences.

Subsequent work has shown that LLMs can directly generate reward signals, further motivating the investigation of LLMs as a scalable alternative to human labelers.

Our experiments reveal that RLAIF matches RLHF’s gains on three generation benchmarks, with human judges preferring the two approaches at comparable rates.

We also demonstrate self‑improvement: RLAIF remains effective when the LLM labeler is the same size as the policy or even the identical checkpoint.

Direct‑RLAIF, which prompts the LLM labeler to produce rewards during RL, outperforms the canonical two‑stage RLAIF pipeline that first distills preferences into a separate reward model.

We further analyze how different AI‑labeling strategies impact alignment to human preferences, highlighting both strengths and trade‑offs.

Future work includes extending RLAIF to model‑based RL where both human and assistant are modeled by LLMs, and exploring AI feedback for fine‑grained credit assignment.

Acknowledgements and Impact

We acknowledge contributors and discuss the ethical implications of AI‑generated feedback.

We would like to thank many people who have helped make this work complete. We thank Chen Zhu for optimizing our LLM inference setup, Le Hou for suggesting prompt improvements and experimenting with self‑consistency, Léonard Hussenot for bringing the problem of position bias in LLMs to our attention, and Bradley Green, Ewa Dominowska, and Blaise Aguera y Arcas for supporting this research.

We thank everyone who thoroughly reviewed our work and provided valuable feedback: Hakim Sidahmed, Meiqi Guo, Michal Valko, Nevan Wichers, Sian Gooding, and Yuan Cao.

We thank Mo Azar, Daniel Guo, Andrea Michi, Nicolas Perez‑Nieves, and Marco Selvi for their contribution to developing a RLAIF training setup that directly prompts an LLM to obtain reward scores.

Finally, we thank the individuals who designed and built the RL training infrastructure used in this paper: Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Alexis Jacq, Sabela Ramos, Piotr Stanczyk, Sertan Girgin, Danila Sinopalikov, Amélie Héliou, Nikola Momchev, and Olivier Bachem.

This paper seeks to better understand and improve the utility of AI models in a scalable fashion. Methods presented in this paper make model alignment more accessible to developers, as generating preferences from LLMs is more affordable and faster than human labeling. However, the use of AI Feedback presents two ethical considerations.

Utilizing AI‑generated feedback as a source for model alignment has the potential risk of transferring biases from off‑the‑shelf LLMs to generated preferences. This in turn may result in RL‑trained policies that further amplify biases, thereby inadvertently misaligning models and potentially causing harm. Extreme caution must be exercised, especially when deploying these models in high‑stakes domains such as medicine, law, and employment, where models have the potential to significantly impact human lives in adverse ways.

Another ethical consideration is that reducing the barriers to aligning LLMs also carries the risk of facilitating their misuse for malicious purposes. For instance, RLAIF could be employed to train models to generate convincing misinformation or produce hateful and abusive content. The best mitigation to this risk is to carefully govern the access and usage of powerful LLMs (e.g., limiting “white‑box” access), to prevent bad actors from abusing them.

Training Configurations

Training, reward‑model accuracy, and evaluation details for the appendix.

Reward models (RMs) are trained until loss and accuracy plateau, typically after 2–3 epochs, using the Adafactor optimizer with learning rate $10^{-5}$. Summarization RMs use batch size 128, other tasks use batch size 32, and all RMs process up to 1152 tokens (1024 context + 128 response).

For summarization, the AI‑feedback RM is seeded from the SFT model, while the human‑feedback RM starts from the base PaLM 2 XS; initializing the human‑feedback RM from SFT lowered accuracy, as shown in Table 6.

Reinforcement‑learning (RL) policies begin from the SFT model for each task, sample with temperature $T = 0.9$, and train for 8 epochs with batch size 128, learning rate $10^{-5}$, and KL weight $\beta = 0.05$. The final checkpoint is selected by scoring four high‑reward candidates with an off‑the‑shelf LLM and confirming the best win‑rate through manual inspection.

Pairwise accuracy measures whether a reward model ranks the preferred response higher than the non‑preferred one on a held‑out human‑preference set; the binary outcome is averaged over examples.
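A minimal sketch of this metric, assuming each held-out example has already been reduced to the reward model's scores for the human-preferred and non-preferred responses (the function name and the scores are illustrative):

```python
from typing import Sequence, Tuple

def pairwise_accuracy(score_pairs: Sequence[Tuple[float, float]]) -> float:
    """score_pairs holds (score_of_preferred, score_of_non_preferred) per example."""
    correct = sum(1 for preferred, other in score_pairs if preferred > other)
    return correct / len(score_pairs)

# The reward model ranks the preferred response higher in 3 of 4 examples.
accuracy = pairwise_accuracy([(2.0, 1.0), (0.3, 0.9), (1.5, -0.2), (0.1, 0.0)])
print(accuracy)  # 0.75
```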

Across all three tasks, RMs trained on human feedback achieve higher pairwise accuracy than those trained on AI feedback, reflecting the alignment of human‑derived labels with the evaluation distribution.

Despite this gap, higher RM accuracy does not guarantee better RL performance; RLAIF matches or exceeds RLHF on several tasks even when its RM accuracy is lower.

Post‑RL response formatting removes trailing periods or spaces from generated summaries to prevent reward‑hacking artifacts and keep human judges focused on content.

Human evaluation involved ~2 k unique rating instances, each paired with multiple model responses, yielding ~6 k response pairs evaluated by three raters (≈18 k ratings total).

Kendall’s coefficient of concordance $W$ ranged from 0.6 to 0.7 across sessions, indicating reasonable inter‑annotator agreement.
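For reference, Kendall's $W$ for $m$ raters ranking $n$ items without ties is $W = 12S / (m^2(n^3 - n))$, where $S$ is the sum of squared deviations of the per-item rank totals from their mean. The sketch below implements that textbook formula on made-up rankings; it is not the paper's evaluation code.

```python
from typing import Sequence

def kendalls_w(rankings: Sequence[Sequence[int]]) -> float:
    """rankings[r][i] is the rank (1..n, no ties) rater r gives to item i."""
    m = len(rankings)       # number of raters
    n = len(rankings[0])    # number of items
    rank_sums = [sum(rater[i] for rater in rankings) for i in range(n)]
    mean_sum = sum(rank_sums) / n
    s = sum((total - mean_sum) ** 2 for total in rank_sums)
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

perfect = kendalls_w([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]])
one_dissenter = kendalls_w([[1, 2, 3, 4], [1, 2, 3, 4], [4, 3, 2, 1]])
print(perfect, round(one_dissenter, 3))  # 1.0 0.111
```

$W = 1$ means all raters agree exactly and $W = 0$ means no agreement, so values of 0.6 to 0.7 sit well toward the agreeing end of the scale.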

To control for response length, we fit a logistic regression on the length‑ratio of two policies’ outputs and query the model at a ratio of 1.0, producing a length‑adjusted win‑rate.
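The idea can be sketched with a one-feature logistic regression fitted by plain gradient descent. The feature is taken as (length ratio − 1), so the intercept directly gives the log-odds of winning at a ratio of 1.0; this parameterization, the optimizer settings, and the toy data are assumptions for illustration, not the paper's setup.

```python
import math
from typing import Sequence

def length_adjusted_win_rate(
    ratios: Sequence[float],
    wins: Sequence[int],
    lr: float = 1.0,
    steps: int = 5000,
) -> float:
    """Fit P(win) = sigmoid(b + w * (ratio - 1)) and return P(win) at ratio 1.0."""
    b, w = 0.0, 0.0
    n = len(ratios)
    for _ in range(steps):
        grad_b = grad_w = 0.0
        for ratio, y in zip(ratios, wins):
            x = ratio - 1.0
            p = 1.0 / (1.0 + math.exp(-(b + w * x)))
            grad_b += (p - y) / n
            grad_w += (p - y) * x / n
        b -= lr * grad_b
        w -= lr * grad_w
    return 1.0 / (1.0 + math.exp(-b))

# Toy data: the policy wins 2 of 4 comparisons at equal length,
# and 5 of 6 when its output is 1.5x longer.
ratios = [1.0, 1.0, 1.0, 1.0, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5]
wins = [1, 0, 1, 0, 1, 1, 1, 1, 1, 0]
raw = sum(wins) / len(wins)
adjusted = length_adjusted_win_rate(ratios, wins)
print(raw, round(adjusted, 2))  # 0.7 0.5
```

In this toy case the raw 70 % win rate is driven entirely by the longer outputs; at equal length the fitted win rate is 50 %.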

**Table.** Comparison of model performance based on initialization and feedback type.

**Table 6.** Results of initializing the summarization RMs on PaLM 2 XS vs. the SFT model.

**Table 7.** Accuracy values for variants of RMs trained on AI labels for the task of summarization.

**Table.** Comparison of model performance based on length-corrected and uncorrected metrics.

**Table 9.** Length-controlled win rate for the helpful dialogue generation task.

**Table 11.** Length-controlled win rate for same-size RLAIF and direct RLAIF.

The table presents a comparison of model performance metrics, specifically focusing on "Length uncorrected" and "Length corrected" percentages for three different model pairings: RLHF + RLAIF vs SFT, RLHF vs SFT, and RLHF + RLAIF vs RLHF.

**Table 13.** Length-controlled win rate for experiments combining human and AI feedback.

Cost Analysis

Provides cost estimates and win‑rate tables examining length control and self‑consistency effects.

We compute the per‑example AI labeling cost for a typical summarization request as $2$ inferences (one per candidate ordering) × ($830$ encoder tokens × $0.03 / 1{,}000$ tokens + $61$ decoder tokens × $0.06 / 1{,}000$ tokens) ≈ $0.06$ USD, rounding to the nearest cent.

Google Cloud charges $90$–$129$ per $1{,}000$ units, where one unit equals $50$ words for classification. Averaging the bounds and converting yields $0.1095$ USD per $50$ words, i.e., roughly $0.0022$ USD per word.

Table 12 reports length‑controlled win rates for AI labeler alignment. Using the “Base 0‑shot” RLAIF variant yields $63\%$ (uncorrected) and $67\%$ (corrected) win rates versus SFT, while the “Detailed CoT 0‑shot” variant improves to $63\%$ (uncorrected) and $45\%$ (corrected) when compared to the base RLAIF.

Table 13 extends the analysis to mixed human‑AI feedback. Before length correction, combining RLHF with RLAIF achieves a $71\%$ win rate over SFT versus $74\%$ for RLHF alone, and the combined policy wins $48\%$ of head‑to‑head comparisons against RLHF; after length correction, the combined policy's win rate over SFT falls to $61\%$. Adding RLAIF to RLHF does not further improve over RLHF alone.

We estimate the human labeling cost per example at $0.67$ USD, based on an average of $304$ words per example and a per‑$50$‑word rate of $0.11$ USD.
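The arithmetic in this section can be checked directly. The sketch uses only the figures quoted above and assumes two LLM inferences per example (one per candidate ordering), which is the count that reproduces the ≈0.06 USD figure; the variable names are illustrative.

```python
# AI labeling cost per example (USD).
inferences = 2                      # assumed: one pass per candidate ordering
encoder_tokens, decoder_tokens = 830, 61
encoder_price = 0.03 / 1000         # USD per encoder token
decoder_price = 0.06 / 1000         # USD per decoder token
ai_cost = inferences * (encoder_tokens * encoder_price
                        + decoder_tokens * decoder_price)

# Human labeling cost per example (USD).
price_per_unit = (90 + 129) / 2 / 1000   # midpoint price per 50-word unit
words_per_example = 304
human_cost = words_per_example / 50 * price_per_unit

ratio = human_cost / ai_cost
print(round(ai_cost, 2), round(human_cost, 2), round(ratio, 1))  # 0.06 0.67 11.7
```

The resulting ratio of roughly 11.7 is consistent with the "10x cheaper" estimate quoted earlier.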
