Training Language Models to Follow Instructions with Human Feedback

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe

InstructGPT aligns large language models with user intent using reinforcement learning from human feedback.

How can we align large language models to follow user instructions more reliably using human feedback?

Large language models are often misaligned with user intent, frequently generating toxic, untruthful, or unhelpful content because their training objective (predicting the next token of internet text) differs from following instructions. The authors align these models by fine-tuning them in three stages: supervised learning on human-written demonstrations, training a reward model on human-ranked outputs, and optimizing the policy against that reward model using reinforcement learning. This approach produces models that human labelers significantly prefer over the original, much larger models, while also reducing hallucination and toxicity.

Paper Primer

The core mechanism is Reinforcement Learning from Human Feedback (RLHF). By training a reward model to predict human preferences and then using that model to guide the language model's policy, the authors shift the model's behavior from simple text completion to instruction following.

InstructGPT models are preferred by human labelers over the original GPT-3 models, even when the InstructGPT models are significantly smaller.

In head-to-head comparisons, the 1.3B parameter InstructGPT model is preferred to the 175B parameter GPT-3 model. The 175B InstructGPT model is preferred to the 175B GPT-3 model 85% of the time.

The alignment process improves truthfulness and reduces toxicity without significantly degrading performance on standard NLP benchmarks.

On the TruthfulQA benchmark, InstructGPT generates truthful and informative answers about twice as often as GPT-3. On closed-domain tasks, its hallucination rate is roughly half that of GPT-3 (21% vs. 41%).

Why use reinforcement learning instead of just supervised fine-tuning on demonstrations?

Supervised fine-tuning is limited by the quality and quantity of human-written demonstrations. Reinforcement learning allows the model to learn from a much larger set of comparisons, enabling it to optimize for human preferences beyond what can be explicitly demonstrated.

Does this alignment process hurt the model's ability to perform general tasks?

The authors observed an "alignment tax" where performance on some public NLP datasets decreased. They mitigated this by mixing the reinforcement learning updates with updates that increase the likelihood of the original pretraining distribution.

Fine-tuning with human feedback is a highly cost-effective way to align language models, often yielding better user-perceived performance than simply scaling model size.

Introduction and Motivation

We expose why next‑token prediction fails to follow instructions and propose RLHF fine‑tuning to close the gap.

Standard language‑model pre‑training optimizes next‑token likelihood on massive web corpora, which does not guarantee that the model will obey a user’s explicit instruction. Consequently, deployed models can hallucinate facts, emit toxic language, or simply ignore the prompt.

We want models to act according to the user’s stated goal—be helpful, truthful, and harmless—rather than merely continue the statistical patterns of internet text.

Compute the query, key, and value projections for each of the 4 tokens (8 × 8 matrix multiplies).

Form the 4 × 4 dot‑product matrix (16 multiplications).

Apply softmax row‑wise, yielding a dense attention distribution.

Multiply the distribution by the value matrix to obtain the attended representations.

Even for a 4‑token context the attention step dominates memory and compute; scaling the context to 2 048 tokens (typical for GPT‑3) would require roughly 200 GB of memory, far beyond a single GPU.

**Figure 43.** Model samples on a prompt cherry-picked to show instruction following behavior in other languages, along with random samples from the GPT-3 175B and InstructGPT 175B models. In this Swedish example, InstructGPT follows the instruction but writes the output mostly in English.

This table compares the responses of GPT-3 175B and InstructGPT 175B to a specific prompt.

**Prompt:** Serendipity means the occurrence and development of events by chance in a happy or beneficial way. Use the word in a sentence.

**Labeler demonstration:** Running into Margaret and being introduced to Tom was a fortunate stroke of serendipity.

**GPT-3 175B completion:**
- Serendipity is the ability to see something good in something bad. Use the word in a sentence.
- Serendipity means to be happy with what you have. Use the word in a sentence.
- Serendipity means finding something good in something that is bad. Use the word in a sentence.

**InstructGPT 175B completion:**
- Serendipity can be defined as the happy chance occurrence of events leading to a beneficial outcome. For example, when Lori first moved to the city and started her job search, it was serendipity that she stumbled upon the perfect opportunity and was able to start her career in marketing with a company she truly loves.

The essential problem is the mismatch between next‑token prediction and genuine instruction following.

Prior Work in Alignment

We situate our RLHF approach among prior alignment, instruction‑following, and safety work.

We build on a broad literature that uses reinforcement learning from human feedback (RLHF) to align language models, and on extensive work addressing instruction following, safety, and bias mitigation.

GPT‑3 is a large, autoregressive language model trained on massive internet text; it serves as a strong zero‑shot performer but often ignores explicit user instructions.

FLAN and T0 are instruction‑tuned language models that have been fine‑tuned on a mixture of many NLP tasks framed as natural‑language instructions.

Early work applying RL with human‑generated reward signals to train agents in simulated environments and Atari games.

Uses human‑annotated turn‑level preferences to train a reward model that guides dialogue generation.

Collects bilingual human judgments to shape a reward model that improves translation quality via RL.

Leverages human preferences over parsed logical forms to train a reward model for semantic parsing.

Collects human ratings of story coherence and uses them to train a reward model for RL fine‑tuning.

Uses human preference data over product reviews to shape a reward model that guides generation.

Trains a reward model on human judgments of extracted evidence relevance for QA tasks.

Augments GPT‑3 prompts with written human feedback, yielding measurable gains on downstream tasks.

Applies reinforcement learning with a normative prior to align agents operating in text‑based environments.

Systematically enumerates harmful and unintended behaviors exhibited by large language models.

Proposes using conversational assistants to study alignment, presenting simple baselines and scaling analyses.

Studies that fine‑tune language models on many NLP datasets framed as natural‑language instructions.

Trains agents to execute natural‑language navigation commands in simulated 3‑D environments.

Comprehensive review of bias, privacy leakage, misinformation, and malicious use risks posed by large language models.

Introduces evaluation suites that quantify harmful language generation across multiple dimensions.

Fine‑tunes language models on a small, curated dataset reflecting desired values, improving adherence on a QA task.

Removes high‑risk documents from the pretraining corpus, yielding less toxic generations at a small performance cost.

Combines data filtering, word‑blocking, safety tokens, and human‑in‑the‑loop collection to improve chatbot safety.

Applies regularization penalties to word embeddings to reduce gender and racial bias.

Generates synthetic toxic examples to train models to recognize and avoid harmful content.

Projects hidden representations onto a subspace that equalizes probabilities of sensitive tokens.

Explores loss formulations that penalize unsafe token probabilities during fine‑tuning.

Analyzes how internal activations mediate harmful generation, offering diagnostic tools for safety research.

Uses a smaller “steering” model to re‑rank or filter the outputs of a larger generator.

The RLHF Pipeline

We detail the three‑stage RLHF pipeline and its KL‑penalty safeguard.

Instruction‑following models routinely ignore user intent; our method stitches together supervised fine‑tuning, a learned reward model, and a KL‑regularized policy update to keep the model on‑task.

1. Collect human demonstrations on the prompt distribution and fine‑tune a pretrained language model (SFT).

2. Gather pairwise comparisons of model outputs and train a reward model to predict human preference (RM).

3. Use PPO to optimize the policy against the reward model while penalizing divergence from the SFT policy (KL penalty).

4. Repeat steps 2–3, continually improving the reward model and policy.

We teach the model the exact behavior we want by showing it input‑output pairs written by humans.

How does SFT differ from simply prompting GPT‑3 with few‑shot examples?

SFT updates the model’s parameters on a large set of demonstrations, so the behavior becomes part of the model itself, whereas few‑shot prompting only conditions the model at inference time without changing its weights.

We learn a scalar function that predicts which of two model outputs a human would prefer.

Why not train the reward model directly on absolute human ratings instead of pairwise comparisons?

Absolute ratings are noisy and scale‑dependent; pairwise preferences provide a consistent ordinal signal that the softmax loss can exploit without needing calibrated scores.
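A minimal sketch of the pairwise objective, assuming the reward model emits one scalar score per completion: with only two candidates, the softmax over scores reduces to a sigmoid of the score gap, so the loss for one comparison is the negative log-sigmoid of that gap. The function name and the example scores are illustrative and are not taken from the authors' code.

```python
import math

def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_preferred - score_rejected)).

    Written in a numerically stable form so large score gaps do not overflow.
    """
    gap = score_preferred - score_rejected
    if gap >= 0:
        # -log(sigmoid(gap)) == log(1 + exp(-gap))
        return math.log1p(math.exp(-gap))
    # For negative gaps, factor out exp(-gap) to keep exp() arguments small.
    return -gap + math.log1p(math.exp(gap))

# Hypothetical scores: the loss is small when the preferred completion
# already scores higher, and large when the ordering is wrong.
loss_correct_order = pairwise_preference_loss(2.0, 0.5)
loss_wrong_order = pairwise_preference_loss(0.5, 2.0)
print(loss_correct_order, loss_wrong_order)
```

Only the difference between the two scores matters, which is why the reward model does not need calibrated absolute ratings.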

PPO updates the policy to increase expected reward while staying close to the current policy, preventing destructive jumps.

How does PPO’s clipping differ from the KL‑penalty we add?

Clipping limits how much the probability ratio can change in a single update, while the KL‑penalty adds a continuous cost for deviating from the SFT distribution across the whole trajectory.
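To make the clipping half of this contrast concrete, here is a hedged sketch of the standard PPO clipped surrogate for a single sample. The `ratio` is the new policy's probability divided by the old policy's probability for the sampled action, and `advantage` is an estimate of how much better than expected that action was; both names, and the per-sample framing, are assumptions made for illustration.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate objective (to be maximized)."""
    # Restrict the probability ratio to [1 - clip_eps, 1 + clip_eps].
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to move the ratio
    # beyond the clip range in the direction that raises the objective.
    return min(ratio * advantage, clipped_ratio * advantage)

# With a positive advantage, pushing the ratio past 1.2 earns nothing extra.
print(clipped_surrogate(1.5, 1.0))
# With a negative advantage, the pessimistic (clipped) value is kept.
print(clipped_surrogate(0.5, -1.0))
```

The clip acts per update step relative to the previous policy, whereas the KL penalty is measured against the fixed SFT policy and accumulates no matter how many small steps were taken.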

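As a worked example, suppose the policy assigns probabilities 0.8 and 0.2 to two candidate completions while the SFT model assigns 0.6 and 0.4. The KL divergence is $0.8\cdot\log(0.8/0.6) + 0.2\cdot\log(0.2/0.4) \approx 0.8\cdot0.2877 + 0.2\cdot(-0.6931) \approx 0.2301 - 0.1386 = 0.0915$.

With an illustrative $\beta = 0.5$ the penalty contribution is $0.5 \times 0.0915 \approx 0.046$.

The total PPO objective subtracts this penalty from the reward term, tempering the incentive to over‑favor the 0.8 probability.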

The KL term prevents the policy from collapsing onto a single high‑probability completion, preserving diversity and the underlying language knowledge.
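The arithmetic in the worked example can be checked with a few lines of Python. This is a sketch under the same assumptions as the example: two candidate completions, natural logarithms, policy probabilities (0.8, 0.2), SFT probabilities (0.6, 0.4), and an illustrative coefficient of 0.5 rather than the value used in training.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

policy_probs = [0.8, 0.2]  # current policy (illustrative)
sft_probs = [0.6, 0.4]     # reference SFT model (illustrative)
beta = 0.5                 # illustrative penalty coefficient

kl = kl_divergence(policy_probs, sft_probs)
penalty = beta * kl
print(f"KL = {kl:.4f} nats, penalty = {penalty:.4f}")
```

The divergence is zero only when the two distributions match, so any drift away from the SFT model costs reward.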

**Figure 2.** A diagram illustrating the three steps of our method: (1) supervised fine-tuning (SFT), (2) reward model (RM) training, and (3) reinforcement learning via proximal policy optimization (PPO) on this reward model. Blue arrows indicate that this data is used to train one of our models. In Step 2, boxes A-D are samples from our models that get ranked by labelers. See Section 3 for more details on our method.

By alternating reward‑model training and KL‑regularized PPO, we obtain models that follow instructions more reliably without sacrificing the broad knowledge encoded during pretraining.

Experimental Results

InstructGPT models markedly outperform GPT‑3 baselines on instruction‑following tasks.

Large language models often ignore user instructions; RLHF fine‑tunes them to prioritize intent without losing general language ability.

Human labelers prefer the 175 B InstructGPT output to the 175 B GPT‑3 output 85 ± 3 % of the time.

Win‑rate measured on a held‑out API prompt set; error bars are 95 % confidence intervals.
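As a rough illustration of how a win rate and a 95% interval relate, the sketch below uses the normal approximation to a binomial proportion. The interval method actually used by the authors is not stated in this summary, and the counts in the usage lines are hypothetical.

```python
import math

def win_rate_with_ci(wins: int, total: int, z: float = 1.96):
    """Return (win_rate, half_width) using the normal approximation.

    z = 1.96 corresponds to a two-sided 95% confidence interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    rate = wins / total
    half_width = z * math.sqrt(rate * (1.0 - rate) / total)
    return rate, half_width

# Hypothetical counts, chosen only to show how the interval narrows
# as the number of comparisons grows.
for n in (100, 500, 2000):
    rate, half_width = win_rate_with_ci(int(0.85 * n), n)
    print(n, round(rate, 3), round(half_width, 3))
```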

Reward models achieve 69.6 ± 0.9 % accuracy predicting held‑out labeler preferences.

Five‑fold cross‑validation across labeler groups; each model trained on four groups and evaluated on the fifth.
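The group-wise cross-validation described here can be sketched as a leave-one-group-out split. The group names are placeholders; how labelers were assigned to groups is not specified in this summary.

```python
def leave_one_group_out(groups):
    """Yield (training_groups, held_out_group) for every group in turn."""
    for index, held_out in enumerate(groups):
        training = groups[:index] + groups[index + 1:]
        yield training, held_out

# Placeholder names for five labeler groups.
labeler_groups = ["group_1", "group_2", "group_3", "group_4", "group_5"]

for training, held_out in leave_one_group_out(labeler_groups):
    # A reward model would be trained on comparisons from `training`
    # and scored on comparisons from `held_out`.
    print(f"train on {training}, evaluate on {held_out}")
```

Because the held-out group's labelers never contribute training comparisons, the resulting accuracy measures generalization to new labelers rather than to new prompts alone.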

In a head‑to‑head comparison, InstructGPT beats the FLAN model 78 ± 4 % of the time.

Human win‑rate on the API prompt distribution; FLAN was fine‑tuned on public NLP tasks.

When RLHF fine‑tuning pushes a model toward instruction following, its performance on many public NLP benchmarks drops—a cost we call the alignment tax.

How does the alignment tax differ from a usual performance trade‑off?

Typical trade‑offs arise from limited model capacity; the alignment tax is a systematic degradation caused by the RLHF objective itself—optimizing for human‑preferred instruction following actively pushes the model away from the distribution of benchmark tasks.

**Figure 1.** Human evaluations of various models on our API prompt distribution, evaluated by how often outputs from each model were preferred to those from the 175B SFT model. Our InstructGPT models (PPO-ptx) as well as its variant trained without pretraining mix (PPO) significantly outperform the GPT-3 baselines (GPT, GPT prompted); outputs from our 1.3B PPO-ptx model are preferred to those from the 175B GPT-3. Error bars throughout the paper are 95% confidence intervals.

**Figure 3.** Preference results of our models, measured by winrate against the 175B SFT model. Left: results on prompts submitted to GPT models on the API; Right: results on prompts submitted to InstructGPT models on the API; Top: results from held-out labelers; Bottom: results from training labelers. We omit GPT (prompted) from the evals on prompts submitted to GPT-3 models (left) as these prompts are already designed to perform well for GPT-3, as opposed to prompts submitted to InstructGPT models (right).

Discussion and Implications

We reflect on the broader implications, limitations, and open challenges of RLHF‑based alignment.

Section 5.1 frames our work as part of an iterative alignment program that improves current models rather than speculating about future superhuman systems.

We find that the compute cost of alignment (≈ 65 petaflop/s‑days in total: about 4.9 for the 175 B SFT model and about 60 for the 175 B PPO‑ptx model) is a tiny fraction of the ≈ 3,640 petaflop/s‑days spent pretraining GPT‑3, yet RLHF yields larger helpfulness gains than a 100× increase in model size.
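As a quick check of the scale involved, using the approximate figures quoted above:

$$\frac{65}{3640} \approx 0.018$$

so the alignment stages cost on the order of 2 % of the pretraining compute.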

InstructGPT also generalizes instruction following to non‑English languages and code‑related tasks, hinting that a single fine‑tuning stage can cover many downstream domains without exhaustive supervision.

Most performance degradations observed after fine‑tuning can be mitigated, so the “alignment tax”—the extra cost of making a model safe—remains low for RLHF.

Section 5.2 asks explicitly who the models are being aligned to, breaking the alignment signal into three stakeholder groups: labelers, researchers, and API customers.

Our labelers—mostly English‑speaking contractors from the United States and Southeast Asia—agree on about 73 % of examples, revealing both useful consensus and substantial disagreement.

Researchers write the labeling instructions, so the data also reflects our own organizational preferences; customers further bias the data by submitting prompts that they deem valuable for their products.

Section 5.3 enumerates methodological limitations: a small, non‑representative contractor pool, single‑label annotations for most comparisons, and the fact that models still produce toxic, biased, or fabricated content.

Even when a user explicitly asks for harmful output, the model usually complies, underscoring that instruction following alone is insufficient for safety.

Section 5.4 lists open research directions: adversarial labeling to surface worst‑case behavior, filtering pretraining data for toxicity, and combining RLHF with truthfulness or refusal mechanisms.

Further work could integrate RLHF with steerability tools such as control codes or inference‑time sampling adjustments, and explore alternative training algorithms like expert iteration, behavior cloning, or constrained optimization.

We also note that incorporating more pretraining data into RLHF does not fully eliminate the alignment tax and may re‑introduce undesirable behaviors present in the raw corpus.

Section 5.5 discusses broader impacts: RLHF makes models more helpful, truthful, and harmless, but also lowers the barrier for malicious use such as large‑scale misinformation.

Deploying these models raises policy trade‑offs: open‑sourcing enables wide access but hampers misuse control, whereas API‑only access centralizes power and limits transparency.

Ultimately, a transparent, accountable alignment pipeline that can represent diverse stakeholder preferences is essential for ensuring the net societal benefit of increasingly capable language models.

Prompt Data Collection

Appendix A details how prompts were gathered and shows representative user requests.

Section A.1 describes how the authors bootstrapped prompts for the first InstructGPT model.

Labelers were asked to write three kinds of prompts: “Plain” (arbitrary diverse tasks), “Few‑shot” (an instruction plus K query‑response pairs), and “User‑based” (prompts mirroring real API use‑cases).

Because no instruction‑following model existed yet, contractors created the initial prompts themselves.

To protect application confidentiality, a separate labeler rewrote the tasks into vague high‑level descriptions, stripping any identifying details.

These anonymized prompts formed the supervised‑learning dataset that trained the first InstructGPT model, which was released in beta in early 2021.

Section A.2 explains that the bulk of later training data came from prompts submitted by users in the OpenAI API Playground.

Each submission was accompanied by an alert informing users that their prompts could be used for future model training, and the authors filtered out any personally identifiable information.

To ensure diversity, prompts were deduplicated by long common prefixes and capped at roughly 200 per organization; splits were created by organization ID so validation data contains distinct use‑cases.
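A minimal sketch of these three steps, under stated assumptions: prompts arrive as `(org_id, text)` pairs, "long common prefix" is approximated by comparing the first `prefix_len` characters, and the split is decided by hashing the organization ID. The prefix length, the validation fraction, and both function names are hypothetical and are not taken from the paper.

```python
import hashlib
from collections import defaultdict

def dedupe_and_cap(prompts, prefix_len=40, cap_per_org=200):
    """Drop prompts sharing a long prefix with an earlier one; cap per organization."""
    seen_prefixes = set()
    kept_per_org = defaultdict(int)
    kept = []
    for org_id, text in prompts:
        prefix = text[:prefix_len]
        if prefix in seen_prefixes:
            continue  # near-duplicate of a prompt that was already kept
        if kept_per_org[org_id] >= cap_per_org:
            continue  # this organization has reached its quota
        seen_prefixes.add(prefix)
        kept_per_org[org_id] += 1
        kept.append((org_id, text))
    return kept

def split_for_org(org_id, valid_fraction=0.1):
    """Assign every prompt from one organization to the same split."""
    digest = hashlib.sha256(org_id.encode("utf-8")).digest()
    return "valid" if digest[0] / 256.0 < valid_fraction else "train"

example = [
    ("org_a", "Write a poem about the sea and the sky, version 1 of many drafts"),
    ("org_a", "Write a poem about the sea and the sky, version 2 of many drafts"),
    ("org_b", "Summarize this article"),
]
for org_id, text in dedupe_and_cap(example):
    print(split_for_org(org_id), org_id, text)
```

Splitting on the organization rather than on individual prompts keeps one customer's use case from appearing on both sides of the train/validation boundary.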

The authors grouped API requests into ten high‑level use‑case categories: generation, open QA, closed QA, brainstorming, chat, rewriting, summarization, classification, extraction, and other.

Section A.2.1 provides fictional but realistic prompt examples for each category, illustrating the breadth of tasks the model must handle.

Section A.2.2 lists analogous illustrative prompts drawn from the GPT‑3 distribution, noting that some of those prompts have ambiguous user intent.

Prompt Examples

Appendix A.3 details the sizes and annotation breakdowns of the SFT, RM, and PPO datasets.

Section A.3 reports how many prompts were collected for each training stage. The numbers illustrate why the RL fine‑tuning step (PPO) can draw on a larger pool of prompts than the earlier supervised stage: PPO needs only prompts, whereas SFT needs a human‑written demonstration for each one.

For SFT we generated many more labeler‑written prompts than customer‑written ones because early in the project labelers supplied a template instruction plus few‑shot examples, which we then recombined to create multiple SFT datapoints.
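The recombination step can be sketched as follows, assuming each labeler submission consists of one template instruction and a pool of query/response pairs. Each datapoint samples a different subset of the pool: k pairs serve as in-context shots and one more pair supplies the final query and its target response. The `Q:`/`A:` layout, the function name, and the toy examples are all hypothetical.

```python
import random

def build_few_shot_datapoints(instruction, examples, k, n_datapoints, seed=0):
    """Return a list of (prompt, target_response) pairs built from one instruction."""
    if k + 1 > len(examples):
        raise ValueError("need at least k + 1 examples")
    rng = random.Random(seed)
    datapoints = []
    for _ in range(n_datapoints):
        chosen = rng.sample(examples, k + 1)
        shots, (query, response) = chosen[:-1], chosen[-1]
        parts = [instruction]
        parts += [f"Q: {q}\nA: {r}" for q, r in shots]
        parts.append(f"Q: {query}\nA:")
        datapoints.append(("\n\n".join(parts), response))
    return datapoints

instruction = "Give the antonym of each word."
pool = [("hot", "cold"), ("up", "down"), ("fast", "slow"), ("open", "closed")]
for prompt, target in build_few_shot_datapoints(instruction, pool, k=2, n_datapoints=3):
    print(prompt, "->", target)
    print("---")
```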

**Table.** Examples of different use cases for AI models.

Data Diversity

Appendix A provides detailed statistics on prompt diversity, length distributions, and language composition of the training data.

The collected dataset spans a broad spectrum of categories and use cases, as illustrated by the categorical breakdown in Table 1 and the analogous distribution observed for the PPO datasets.

**Table A.2.2.** Illustrative user prompts from GPT-3 distribution

The table lists various use cases and corresponding examples of prompts or tasks. The columns are "Use Case" and "Example".

**Table.** Examples of various use cases for AI models.

The table lists various "Use Case" categories and their corresponding "Example" prompts or interactions.

We classified the language of every instruction using the lightweight tool langid.py; roughly 96 % of the 110 k datapoints are identified as English, though the true proportion is likely closer to 99 % because of classifier error.
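A hedged sketch of how such language shares might be tallied. The classifier is passed in as a function so the example runs without extra dependencies; with langid.py installed, one would presumably pass `lambda text: langid.classify(text)[0]`, since `langid.classify` returns a `(language, score)` pair. The stub classifier and sample texts below are purely illustrative.

```python
from collections import Counter

def language_shares(texts, classify):
    """Return {language_code: fraction of texts} using the given classifier."""
    counts = Counter(classify(text) for text in texts)
    total = sum(counts.values())
    return {language: count / total for language, count in counts.items()}

# Illustrative stand-in for a real language identifier.
STUB_LABELS = {
    "Write a haiku about autumn.": "en",
    "Summarize the following article.": "en",
    "Skriv en kort berättelse om en groda.": "sv",
}

def stub_classify(text):
    return STUB_LABELS.get(text, "unknown")

print(language_shares(list(STUB_LABELS), stub_classify))
```

Because short prompts give a classifier little signal, some misclassification is expected, which is consistent with the gap between the measured and estimated English share.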

A small minority of prompts appear in at least twenty other languages—including Spanish, French, German, Chinese, Japanese, and many others—demonstrating the dataset’s modest multilingual coverage.

Human Labeler Details

Details on human data collection, labeler selection, instructions, demographics, and interface.

We recruited labelers through Upwork and Scale AI, then screened them for sensitivity to harmful content. Candidates were evaluated on four criteria: agreement on sensitive‑speech flagging, agreement on quality rankings, performance on a Likert‑scaled sensitive‑demonstration set, and self‑reported comfort identifying sensitive speech across demographic groups. Soft cutoffs of 75 % agreement and a 6/7 demonstration score guided the final selection.

Labeling instructions evolved during the project: early on we asked labelers to prioritize helpfulness, later we shifted to prioritize truthfulness and harmlessness for final evaluations. The change reflects a research focus on refusing unsafe instructions, which introduces a configurable refusal behavior at inference time. Tables 10 and 11 show the exact prompts used for the final‑evaluation and RealToxicityPrompts distributions.

A voluntary, anonymous survey of the 19 labelers revealed a young, gender‑balanced cohort, with 75 % under 35 years old and a mix of US and Southeast‑Asian nationalities. The demographic breakdown (age, gender, ethnicity, education) is summarized in Table 12.

We also asked labelers to rate their experience; the results (Table 13) show high satisfaction, fair compensation, and appreciation for clear communication, though a minority found the task repetitive.

The labeling UI presented the task description, model completions, and rating fields in a single screen, allowing labelers to submit helpfulness, truthfulness, and harmlessness scores efficiently.

**Table A.2.1.** Illustrative user prompts from InstructGPT distribution

This table lists various use cases for language models and provides corresponding examples of prompts or tasks for each. The table includes categories such as brainstorming, classification, extraction, and generation.

**Table 6.** Dataset sizes, in terms of number of prompts.

The table displays data statistics for three categories: SFT Data, RM Data, and PPO Data. Each category is broken down by split (train/valid), source (labeler/customer), and size (number of samples).

Model Architectures

Appendix C details training configurations and data handling for all model stages.

All model architectures follow the GPT‑3 design, and for reward and value models the original unembedding layer is swapped for a scalar projection head.

Training runs in fp16 precision with fp32 master copies of the weights; the same byte‑pair encoding as the original GPT‑3 is used throughout.

Each model processes up to 2 k tokens; prompts longer than 1 k tokens are discarded and generated responses are limited to 1 k tokens.

Optimization uses Adam with $\beta_1=0.9$ and $\beta_2=0.95$ for all stages.

C.1 SFT training runs for 16 epochs with residual dropout 0.2, a cosine learning‑rate schedule that decays to 10 % of the peak, and model‑size‑specific peak learning rates ($9.65\times10^{-6}$ for 1.3 B/6 B, $5.03\times10^{-6}$ for 175 B) and batch sizes (32 for 1.3 B/6 B, 8 for 175 B). Learning rates and epoch counts were chosen by geometric search, and final checkpoints were selected based on the reward‑model (RM) score rather than validation loss.
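A cosine schedule that decays to 10 % of the peak can be sketched as below. This is one common formulation, assumed here for illustration; the step counts are hypothetical and the exact schedule code used by the authors is not given.

```python
import math

def cosine_lr(step, total_steps, peak_lr, final_fraction=0.1):
    """Cosine decay from peak_lr at step 0 to final_fraction * peak_lr at the end."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    floor = final_fraction * peak_lr
    return floor + (peak_lr - floor) * cosine

peak = 9.65e-6        # peak LR quoted for the 1.3 B / 6 B SFT models
total_steps = 1000    # hypothetical step count
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step, total_steps, peak))
```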

C.2 RM training employs a single 6 B reward model for all PPO experiments; it is initialized from a 6 B GPT‑3 fine‑tuned on a suite of public NLP datasets and trained for one epoch at $9\times10^{-6}$ with a cosine schedule to 10 % and batch size 64. The authors observed that performance is stable across a wide LR range but degrades quickly when training for more than one epoch.

C.3 RLHF initialization starts from a pretrained GPT‑3 model, applies two epochs of supervised fine‑tuning with a 10 % mix of original pretraining data, and uses a cosine schedule that decays to 10 % of the peak LR. Batch sizes are 32 for the 1.3 B and 6 B models and 8 for the 175 B model; peak LRs were selected via a log‑linear sweep (5 values for the smaller models, 3 for the largest), resulting in $5\times10^{-6}$, $1.04\times10^{-5}$, and $2.45\times10^{-6}$ respectively.

C.4 RLHF training uses the SFT‑initialized policies, computes a KL‑divergence penalty with $\beta=0.02$, and runs for 256 k episodes covering roughly 31 k unique prompts. Each iteration processes a batch of 512 samples split into eight minibatches of 64; a constant learning rate with a 10‑step warmup (starting at one‑tenth peak) is applied, and exponential moving averages of the weights decay at 0.992. PPO settings include a clip ratio of 0.2 and rollout temperature of 1, while the value‑function learning rate is fixed at $9\times10^{-6}$ (1.3 B/6 B) or $5\times10^{-6}$ (175 B). To counteract regressions on public NLP benchmarks, pretraining gradients are mixed in with coefficient $\gamma=27.8$.
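To show where the two coefficients enter, the combined objective that is maximized during RL can be written as follows. The notation is introduced here for illustration and follows the form of the objective in the main paper: $r_\theta$ is the reward model, $\pi_\phi^{RL}$ the policy being trained, $\pi^{SFT}$ the supervised policy, and $D_{\text{pretrain}}$ the pretraining distribution.

$$\text{objective}(\phi) = \mathbb{E}_{(x,y)\sim D_{\pi_\phi^{RL}}}\left[ r_\theta(x,y) - \beta \log\frac{\pi_\phi^{RL}(y\mid x)}{\pi^{SFT}(y\mid x)} \right] + \gamma\,\mathbb{E}_{x\sim D_{\text{pretrain}}}\left[ \log \pi_\phi^{RL}(x) \right]$$

With $\beta=0.02$ the first expectation is the KL‑penalized reward, and with $\gamma=27.8$ the second term adds the pretraining log‑likelihood; setting $\gamma=0$ would recover the PPO variant trained without the pretraining mix.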

C.5 FLAN and T0 baselines are obtained by fine‑tuning a 175 B GPT‑3 model on the respective datasets; T0 is subsampled to 1 M examples to match FLAN’s data volume. Both use a cosine learning‑rate schedule with peak LRs of 4e‑6 or 6e‑6, batch size 64, and training proceeds until the reward‑model score saturates (after 896 k examples for FLAN and the larger‑batch T0 experiment).
