Let’s Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe

Process supervision significantly outperforms outcome supervision for training reliable multi-step reasoning models.

Does providing feedback on each step of a reasoning process (process supervision) lead to more reliable models than providing feedback only on the final answer (outcome supervision)?

Large language models often produce logical errors during multi-step reasoning, and outcome-based feedback is too coarse to reliably catch these mistakes. The authors train Process-supervised Reward Models (PRMs) that evaluate the correctness of each individual reasoning step, rather than just the final answer. On the challenging MATH dataset, this approach achieves a 78.2% solve rate, significantly outperforming outcome-supervised models and majority voting.

Paper Primer

The core mechanism hinges on shifting the reward signal from the final result to the intermediate steps. By training the model to predict the correctness of each step, the system provides precise feedback that identifies the exact location of logical errors.

Process supervision is more data-efficient and effective than outcome supervision.

Active learning, which prioritizes labeling convincing but incorrect solutions, improves data efficiency by 2.6×. The PRM solves 78.2% of problems in a representative MATH test subset, consistently outperforming outcome-supervised models as the number of sampled solutions increases.

The authors release PRM800K, a dataset containing 800,000 step-level human feedback labels, to facilitate further research into reliable reward modeling.

Why does process supervision outperform outcome supervision for complex reasoning?

Outcome supervision forces the model to perform difficult credit assignment, as it must infer where a solution failed based only on the final result. Process supervision provides a richer signal by confirming correct steps and pinpointing the specific location of the first error.

Does this method introduce an "alignment tax" that reduces model performance?

No; the authors observe a negative alignment tax. Process supervision improves both the reliability of the reasoning and the final performance, suggesting it is a highly effective alignment strategy for reasoning tasks.

Researchers should prioritize process-level feedback over outcome-based metrics when training models for multi-step reasoning, as it directly mitigates hallucinations and improves credit assignment.

Introduction

We frame why supervising reasoning steps, not just final answers, is crucial for reliable LLMs.

Large language models can now generate multi‑step chain‑of‑thought solutions, yet they frequently hallucinate or follow faulty logic, turning a single mistake into a completely wrong answer.

We need a training signal that rewards *how* a model reasons, not just *what* answer it eventually produces.

Step 1 label: “5 × 4 = 20” – correct.

Step 2 label: “20 + 7 = 27” – correct.

Step 3 label: “27 ÷ 2 = 14” – incorrect (the correct value is 13.5).

Process supervision triples the annotation cost for this example, but it also reveals exactly where the model went wrong, enabling more efficient learning than a single “wrong answer” label.

Earlier work (Uesato et al., 2022) found outcome and process supervision to be comparable on elementary math, but that comparison used a weaker base model and far fewer human labels.

Our study scales up both the model (a state‑of‑the‑art transformer) and the feedback budget (800 K step‑level labels), and evaluates on the far more demanding MATH benchmark, revealing a clear advantage for process supervision.

Let $x^8 + 3x^4 - 4 = p_1(x)p_2(x) \cdots p_k(x)$, where each non-constant polynomial $p_i(x)$ is monic with integer coefficients, and cannot be factored further over the integers. Compute $p_1(1) + p_2(1) + \cdots + p_k(1)$.

1. I notice that the given polynomial has even degree and only even powers of $x$, so I can try to make a substitution to simplify it.
2. Let $y = x^4$, then the polynomial becomes $y^2 + 3y - 4$, which is a quadratic equation.
3. I can factor this quadratic equation as $(y + 4)(y - 1)$, so the original polynomial is $(x^4 + 4)(x^4 - 1)$.
4. Now I need to factor each of these factors further, if possible, over the integers.
5. For the first factor, $x^4 + 4$, I recall the Sophie Germain identity, which says that $a^4 + 4b^4 = (a^2 + 2b^2 + 2ab)(a^2 + 2b^2 - 2ab)$.
6. Applying this identity with $a = x$ and $b = 1$, I get $x^4 + 4 = (x^2 + 2x + 2)(x^2 - 2x + 2)$.
7. For the second factor, $x^4 - 1$, I can use the difference of squares formula, which says that $a^2 - b^2 = (a + b)(a - b)$.
8. Applying this formula with $a = x^2$ and $b = 1$, I get $x^4 - 1 = (x^2 + 1)(x^2 - 1)$.
9. Then I can apply the difference of squares formula again with $a = x$ and $b = 1$, to get $x^2 - 1 = (x + 1)(x - 1)$.
10. So the original polynomial is $(x^4 + 4)(x^4 - 1) = (x^2 + 2x + 2)(x^2 - 2x + 2)(x^2 + 1)(x + 1)(x - 1)$.
11. These are all the irreducible factors over the integers, so $p_1(x) = x^2 + 2x + 2$, $p_2(x) = x^2 - 2x + 2$, $p_3(x) = x^2 + 1$, $p_4(x) = x + 1$, and $p_5(x) = x - 1$.
12. To find the sum of their values at $x = 1$, I just need to plug in $x = 1$ into each factor and add them up.
13. I get $p_1(1) = 1^2 + 2 \cdot 1 + 2 = 5$, $p_2(1) = 1^2 - 2 \cdot 1 + 2 = 1$, $p_3(1) = 1^2 + 1 = 2$, $p_4(1) = 1 + 1 = 2$, and $p_5(1) = 1 - 1 = 0$.
14. So the final answer is $5 + 1 + 2 + 2 + 0 = 10$.

**Answer:** 10

A class of 30 students recently took a test. If 20 students scored 80, 8 students scored 90, and 2 students scored 100, then what was the class average (the mean) on this test? To find the class average, I need to add up all the scores and divide by the number of students. I can use a shortcut to add up the scores, since they are grouped by value. For example, instead of adding 80 + 80 + ... + 80 twenty times, I can just multiply 80 by 20 to get the same result. Similarly, I can multiply 90 by 8 and 100 by 2 to get the sums of those groups. So, the total sum of the scores is 80 * 20 + 90 * 8 + 100 * 2. I can use the distributive property of multiplication to simplify this expression. It is equivalent to 80 * (20 + 1) + 90 * 8 + 100. That is, 80 * 21 + 720 + 100. I can do the arithmetic in my head or use a calculator. 80 * 21 is 1680, so the total sum is 1680 + 720 + 100, which is 2500. Now, to find the average, I need to divide 2500 by 30, the number of students. I can use long division or a calculator for this step. 2500 divided by 30 is 83.33, repeating. This means that the class average is 83.33, or 83 and one-third. Answer: 83.33

Find the unique odd integer $t$ such that $0 < t < 23$ and $t + 2$ is the inverse of $t$ modulo 23.

1. I need to find an odd integer $t$ between 0 and 23 such that $t + 2$ is the inverse of $t$ modulo 23.
2. That means that $t$ and $t + 2$ multiply to give 1 modulo 23.
3. I can write this as $t(t + 2) \equiv 1 \pmod{23}$.
4. I can try to guess and check values of $t$ until I find one that works, but that might take a long time.
5. I can also try to use some properties of modular arithmetic to simplify the equation.
6. For example, I can subtract 23 from $t + 2$ and still get the same remainder modulo 23.
7. So I can write $t(t + 2) \equiv t(t - 21) \pmod{23}$.
8. This looks simpler, because now I have a difference of two multiples of $t$.
9. I can factor out $t$ and get $t(t - 21) \equiv t(1 - 21) \pmod{23}$.
10. Now I can divide both sides by $t$, as long as $t$ is not a multiple of 23.
11. This gives me $t - 21 \equiv 1 - 21 \pmod{23}$.
12. I can simplify this further by adding 21 to both sides.
13. This gives me $t \equiv 1 \pmod{23}$.
14. This means that $t$ has the same remainder as 1 when divided by 23.
15. The smallest positive value of $t$ that satisfies this is $t = 1$.
16. This is also an odd integer, so it meets the criteria of the problem.
17. I can check that $t = 1$ works by plugging it into the original equation.
18. I get $1(1 + 2) \equiv 1 \pmod{23}$, which is true.
19. Therefore, the unique odd integer $t$ that satisfies the given conditions is $t = 1$.
20. Answer: 1

I need to find a value of $x$ that makes the ratio of Denali's pay to Nate's pay the same in two different scenarios. I can assume that the pay per dog is constant, so the ratio of their pay is the same as the ratio of the number of dogs they walk. Let me write an equation for the first scenario, where Denali walks $4x$ more dogs and Nate stays at 12 dogs. The ratio of Denali's dogs to Nate's dogs is $\frac{16+4x}{12}$. Now let me write an equation for the second scenario, where $x$ of Nate's dogs are reassigned to Denali. The ratio of Denali's dogs to Nate's dogs is $\frac{16+x}{12-x}$. I want these two ratios to be equal, so I can set them equal to each other and solve for $x$. I get $\frac{16+4x}{12} = \frac{16+x}{12-x}$. I can cross-multiply and simplify this equation. I get $(16 + 4x)(12 - x) = (16 + x)(12)$. Expanding the brackets, I get $192 - 16x + 48x - 4x^2 = 192 + 12x$. Subtracting 192 from both sides, I get $-16x + 48x - 4x^2 = 12x$. Rearranging the terms, I get $-4x^2 + 20x = 12x$. Subtracting $12x$ from both sides, I get $-4x^2 + 8x = 0$. I can factor out $-4x$ from both terms, and get $-4x(x - 2) = 0$. This means that either $-4x = 0$ or $x - 2 = 0$. Solving for $x$, I get $x = 0$ or $x = 2$. However, the problem says that $x \neq 0$, so I can ignore that solution. Therefore, the only valid solution is $x = 2$. **Answer: 2**

The core gap we address is that correct answers do not guarantee correct reasoning.

Process vs. Outcome Supervision

We detail step‑level feedback collection and reward‑model training for whole‑solution and per‑step scoring.

We compare outcome-supervised and process-supervised reward models by training them on outputs from the same generator and evaluating their ability to pick a correct solution from a set of uniformly sampled candidates.

The outcome-supervised reward model (ORM) treats each generated solution as a single item and learns to predict whether the final answer is correct.

Why can outcome supervision miss reasoning errors?

ORM only observes the final answer; if a solution arrives at the right result via a faulty derivation, the binary label is “correct,” so the model receives no signal to penalize the mistaken reasoning.

PRM is like a spell‑checker that flags each step as you write, rather than only checking the finished document.

How does PRM differ from a standard token‑level language‑model loss?

Standard LM loss treats every token as equally important for next‑token prediction, whereas PRM explicitly predicts a binary correctness token after each *step* and aggregates those predictions multiplicatively to score the whole solution.

Step 1 probability $0.9$ → contributes $0.9$ to the product.

Step 2 probability $0.4$ → product becomes $0.9 \times 0.4 = 0.36$.

Step 3 probability $0.8$ → final score $0.36 \times 0.8 = 0.288$.

The overall score drops sharply because the second step is likely wrong; inspecting the per-step probabilities shows which step is responsible.
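
Written out, the three-step example corresponds to the product below. The symbol $p_i$ is our shorthand here (not notation taken from the paper) for the PRM's probability that step $i$ is correct:

$$
\text{score} \;=\; \prod_{i=1}^{3} p_i \;=\; 0.9 \times 0.4 \times 0.8 \;=\; 0.288
$$

Because every factor is at most 1, a single low-probability step caps the score of the whole solution, no matter how confident the model is about the other steps.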

We run experiments in two regimes. In the large‑scale regime we fine‑tune GPT‑4‑class models and train the most capable ORM and PRM. In the small‑scale regime we train cheaper models and use the large‑scale model to generate supervision, enabling ablations that would be too expensive otherwise.

**Figure 1.** A screenshot of the interface used to collect feedback for each step in a solution.

**Figure 2.** Two solutions to the same problem, graded by the PRM. The solution on the left is correct while the solution on the right is incorrect. A green background indicates a high PRM score, and a red background indicates a low score. The PRM correctly identifies the mistake in the incorrect solution.

Human labelers annotate step‑by‑step solutions from the large‑scale generator, producing the PRM800K dataset (800 K step labels from 75 K solutions to 12 K problems). We preferentially surface “convincing wrong‑answer” solutions—those that the current PRM rates highly but end with an incorrect final answer.

Generate $N$ solutions per problem with the fixed generator.

Score each solution with the current PRM.

Select the top $K$ solutions that have high PRM scores yet wrong final answers.

Present the selected solutions to human labelers for step‑level feedback (positive/negative/neutral).

Aggregate the new labels into PRM800K and retrain the PRM.

Repeat the iteration several times.
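
The selection step in the loop above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `select_convincing_wrong`, the string solution IDs, and the dictionary lookups standing in for the PRM and the answer checker are all hypothetical.

```python
from typing import Callable, List


def select_convincing_wrong(
    solutions: List[str],
    prm_score: Callable[[str], float],
    is_correct: Callable[[str], bool],
    k: int,
) -> List[str]:
    """Return the k wrong-answer solutions that the current PRM scores highest."""
    wrong = [s for s in solutions if not is_correct(s)]
    return sorted(wrong, key=prm_score, reverse=True)[:k]


# Toy stand-ins (assumptions): a lookup table plays the role of the PRM,
# and another plays the role of the automatic final-answer checker.
toy_scores = {"A": 0.9, "B": 0.7, "C": 0.95, "D": 0.2}
toy_correct = {"A": False, "B": False, "C": True, "D": False}

picked = select_convincing_wrong(
    list(toy_scores), toy_scores.get, toy_correct.get, k=2
)
print(picked)  # ['A', 'B']
```

Solution C has the highest score but is excluded because its final answer is right; the labeling budget goes to the wrong answers that the current model finds most plausible.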

The MATH benchmark provides math problems with automatically checkable final answers, enabling large‑scale outcome supervision.

PRM800K is a collection of step‑level correctness labels harvested from human annotators, focused on the most misleading generator outputs.

Large-Scale Supervision Results

Process supervision beats outcome supervision and majority voting at large scale, and small-scale ablations confirm the gap across data budgets.

The process-supervised reward model attains higher best-of-1860 accuracy than the outcome-supervised reward model and the majority-voting baseline.

Figure 3 reports 78.2 % for PRM versus 72.4 % for ORM and 69.6 % for majority voting.

**Figure 3.** A comparison of outcome-supervised and process-supervised reward models, evaluated by their ability to search over many test solutions. Majority voting is shown as a strong baseline. For $N \leq 1000$, we visualize the variance across many subsamples of the 1860 solutions we generated in total per problem.
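
To make the two selection rules in this comparison concrete, here is a small sketch of best-of-N selection with a reward model next to majority voting. The `(answer, score)` pair representation and the toy numbers are assumptions for illustration only.

```python
from collections import Counter
from typing import List, Tuple

Sample = Tuple[str, float]  # (final answer, reward-model score)


def best_of_n(samples: List[Sample]) -> str:
    """Return the final answer of the sample with the highest reward-model score."""
    answer, _ = max(samples, key=lambda pair: pair[1])
    return answer


def majority_vote(samples: List[Sample]) -> str:
    """Return the most common final answer, ignoring the scores."""
    counts = Counter(answer for answer, _ in samples)
    return counts.most_common(1)[0][0]


samples = [("10", 0.91), ("12", 0.30), ("12", 0.25), ("12", 0.40), ("10", 0.55)]
print(best_of_n(samples))      # 10
print(majority_vote(samples))  # 12
```

The two rules disagree here: the reward model backs a minority answer that it scores highly, while majority voting follows the most frequent answer.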

We sample $1$–$200$ solutions per problem from a small‑scale generator and train three reward models on identical datasets: (i) process supervision from PRMlarge, (ii) outcome supervision from PRMlarge, and (iii) outcome supervision via final‑answer checking.

**Figure 4b.** Three reward models trained on 200 samples/problem using different forms of supervision, compared across many test-time compute budgets.

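Active learning uses a selector model (PRMselector), trained on a single sample per problem, to score $1000$ candidate solutions per problem. Each larger reward model is then trained on a selection in which $80\%$ of the samples are the most convincing wrong-answer solutions and the remaining $20\%$ are the most convincing of the solutions that are left, whether right or wrong.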
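
A rough sketch of this selection rule follows, reading the split as 80% most convincing wrong-answer samples plus 20% most convincing samples among those that remain. The `Candidate` type, the field names, and the integer rounding of the 80% share are assumptions made for illustration.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    solution_id: str
    selector_score: float  # score from a stand-in for PRMselector (hypothetical)
    answer_correct: bool


def select_training_samples(cands: List[Candidate], n: int) -> List[Candidate]:
    """Pick n samples: 80% top-scored wrong answers, 20% top-scored leftovers."""
    ranked = sorted(cands, key=lambda c: c.selector_score, reverse=True)
    n_wrong = (4 * n) // 5  # assumed rounding of the 80% share
    wrong = [c for c in ranked if not c.answer_correct][:n_wrong]
    chosen = {c.solution_id for c in wrong}
    rest = [c for c in ranked if c.solution_id not in chosen][: n - len(wrong)]
    return wrong + rest


pool = [
    Candidate("a", 0.9, False),
    Candidate("b", 0.8, True),
    Candidate("c", 0.7, False),
    Candidate("d", 0.6, False),
    Candidate("e", 0.5, False),
    Candidate("f", 0.4, False),
    Candidate("g", 0.3, True),
]
selected_ids = [c.solution_id for c in select_training_samples(pool, n=5)]
print(selected_ids)  # ['a', 'c', 'd', 'e', 'b']
```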

**Figure 4a.** Four series of reward models trained using different data collection strategies, compared across training sets of varying sizes.

Iteratively retraining PRMselector after each labeling round caused instability and did not improve downstream reward models.

**Table 1.** We measure out-of-distribution generalization using recent STEM tests. We evaluate the outcome-supervised RM, the process-supervised RM, and majority voting using 100 test samples per problem.

Out-of-Distribution Generalization

Process supervision improves out‑of‑distribution reasoning and eases credit assignment.

The paper’s central claim is that supervising each reasoning step (process supervision) yields more accurate and logically sound answers than supervising only the final answer (outcome supervision).

OOD Generalization measures whether a model’s reasoning ability transfers to problems that differ from its training distribution.

The PRM identifies that the first three algebraic manipulations are correct (high step scores) but flags a sign error in the fourth step (low score).

The ORM receives a single negative label for the whole solution, offering no clue about where the error occurred.

Using the PRM’s step‑wise feedback, a downstream correction module rewrites the fourth step, raising the final answer accuracy from 0 % to 78 % on this example.

Step‑level feedback lets the system pinpoint and repair mistakes, whereas a single outcome label leaves the error location ambiguous.

Credit assignment is dramatically easier under process supervision because the model receives a dense signal indicating which intermediate steps are correct, rather than a sparse binary label on the final answer.

From an AI-alignment perspective, process supervision yields more interpretable chains of thought and incurs a negative alignment tax: whereas safer methods often cost performance, here the more closely supervised method is also the better-performing one.

Potential test‑set contamination is mitigated by aggressive string‑matching removal of MATH problems from the training mix, yet subtle rephrasings may remain; however, any residual leakage would affect all supervision regimes equally, preserving the relative advantage of process supervision.

**Figure 5.** Two histograms comparing the distribution of problem difficulty levels and subjects in both the original MATH test set and in our 500 problem test subset.

Related Work

We situate our work among prior studies on supervision and reasoning.

Uesato et al. (2022) compared outcome and process supervision on grade‑school math, finding similar final‑answer error rates while process supervision required less data. Our work differs in three ways: we use a stronger model for data collection, evaluate on the more challenging MATH benchmark, and gather a larger quantity of process‑supervision data.

Gao et al. (2022) employed a large reward model to supervise smaller models, studying over-optimization in RLHF by substituting a gold-standard reward model for human feedback. Our approach mirrors this by using a large-scale reward model to supervise smaller reward models, but we focus on process supervision rather than pure outcome signals.

Recent work on natural-language reasoning includes Lewkowycz et al. (2022), who improved MATH performance via extensive technical finetuning; Wang et al. (2022), who showed self-consistency yields strong results without extra finetuning; and several studies (Wei et al., 2022; Nye et al., 2021) demonstrating that explicit intermediate reasoning steps, whether chain-of-thought or scratchpad, are crucial. Kojima et al. (2022) further revealed that zero-shot prompting can elicit such stepwise reasoning.

MathMix Dataset

Details of the MathMix pretraining corpus and its construction.

Appendix A describes the construction of MathMix, a curated 1.5 B‑token corpus used for a lightweight pretraining stage before fine‑tuning on smaller math datasets.

MathMix is a high‑quality, math‑focused token collection that supplies the language model with abundant problem‑solution pairs, discussion text, and synthetic examples, while deliberately omitting generic natural‑language material.

**Table 2.** MathMix dataset components.

When training smaller models (Section 4) we employ a 1 B‑token subset that excludes the critique‑grading data; large‑model experiments run for roughly three billion tokens (two epochs). Small models are trained for six epochs, amounting to about 6.6 B tokens.

We run decontamination checks against the MATH test split by stripping LaTeX and searching for matching n‑grams, but we cannot guarantee that all leakage has been eliminated.
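
As an illustration of what such a check might look like, here is a minimal n-gram overlap sketch. The normalization rules, the whitespace tokenization, and the choice of `n = 5` are all assumptions; the text above does not specify them.

```python
import re
from typing import List, Set, Tuple


def normalize(text: str) -> List[str]:
    """Crudely strip LaTeX markup and lowercase (the exact rules are an assumption)."""
    text = re.sub(r"\\[a-zA-Z]+", " ", text)  # drop commands such as \frac or \cdot
    text = re.sub(r"[${}^_\\]", " ", text)    # drop math delimiters, braces, scripts
    return text.lower().split()


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """All contiguous n-token windows of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(train_doc: str, test_problem: str, n: int = 5) -> bool:
    """Flag a training document that shares any n-gram with a test problem."""
    shared = ngrams(normalize(train_doc), n) & ngrams(normalize(test_problem), n)
    return bool(shared)


test_problem = r"Find the unique odd integer $t$ such that $0 < t < 23$"
leaked = "forum post: find the unique odd integer t such that 0 < t < 23 holds"
clean = "The weather is nice today and will stay nice tomorrow"

print(is_contaminated(leaked, test_problem))  # True
print(is_contaminated(clean, test_problem))   # False
```

A check of this kind catches near-verbatim copies but not paraphrases, which is consistent with the caveat that leakage cannot be ruled out entirely.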

PRM800K Dataset

Dataset construction details and composition statistics for PRM800K.

We describe how the PRM800K dataset was assembled, the two collection phases, and the resulting composition.

**Table 3.** Distribution of positive/negative steps/solutions.

Before Phase 2, labelers completed 30 screening questions and were admitted only if they matched gold labels on at least 75 % of the steps. During Phase 2 we injected 10–20 additional quality‑control problems per generation, using their outcomes to prune underperforming labelers and to create instructional material on common mistakes.

Evaluation and ORM Training

Details on evaluation splits, labeling protocol, and outcome‑supervised model training.

To avoid over‑fitting on the 7,500 MATH training problems we enlarged the training pool with 4,500 problems from the MATH test split and evaluate exclusively on the remaining 500 held‑out problems, which were chosen uniformly at random.

Figure 5 confirms that this 500‑problem subset mirrors the full MATH test set in difficulty levels and subject distribution; the exact split is available at the provided GitHub link.

Labelers examined each reasoning step and marked it as positive, negative, or neutral: a neutral step is appropriate, correct, and easily verifiable; a positive step is neutral and also advances the solution; all other steps are negative.

Labelers were given only the ground‑truth final answer—not reference solutions—to prevent bias toward a particular solution path while still allowing them to resolve misunderstandings using the correct answer.

In phase 1, labelers could insert their own steps when every candidate was negative, which often produced long chains of neutral steps that made little progress or required constant supervision.

Phase 2 instead pre‑generates complete solutions and terminates the task as soon as the first negative step appears, streamlining the labeling process.

Outcome‑supervised reward models (ORMs) are trained like token‑level verifiers (Cobbe et al., 2021) but with a single epoch, no dropout, and no auxiliary language‑modeling loss; performance proved robust to reasonable hyperparameter variations.

Training samples are drawn uniformly from the generator at temperature 1.0 without rebalancing positives or negatives; the model predicts a score for every token, using the same target for all tokens based on whether the whole solution is correct.

At inference time the score of the final token serves as the overall solution score, matching the evaluation protocol of the original token‑level verifiers.
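
The target construction and the inference-time readout described above can be sketched in a few lines. The function names are hypothetical, and the per-token scores in the usage example are made-up numbers, not model outputs.

```python
from typing import List


def orm_targets(num_tokens: int, solution_correct: bool) -> List[float]:
    """Every token shares one target: 1.0 if the final answer is right, else 0.0."""
    return [1.0 if solution_correct else 0.0] * num_tokens


def orm_solution_score(per_token_scores: List[float]) -> float:
    """At inference time, the prediction at the final token scores the solution."""
    return per_token_scores[-1]


print(orm_targets(4, solution_correct=False))    # [0.0, 0.0, 0.0, 0.0]
print(orm_solution_score([0.6, 0.5, 0.3, 0.2]))  # 0.2
```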

PRM Training Details

Implementation specifics for training and scoring Process Reward Models.

We fine‑tune the MathMix backbone to predict the probability of positive, negative, and neutral labels for each solution prefix that ends in a labeled step. Hyperparameters are swept on the first ≈ 10 % of PRM800K, and we observe that a low learning rate is essential for stable convergence when shifting from language modeling to this classification task.

All PRMs are trained for exactly two epochs. On the smallest datasets (phase 1 and early phase 2 generations) two epochs improve final accuracy compared with a single pass; on larger datasets the gain diminishes, but we keep the schedule for consistency across experiments.

Scoring a full solution proceeds by first computing a step‑level score: the model’s probability that the step’s label is positive. Two design decisions follow. First, we map neutral labels either to positive or to negative. Second, we aggregate step scores across the solution by either taking their product or their minimum.
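
The two design decisions can be sketched together as below. One assumption is made loudly here: treating neutral as positive is implemented by adding the neutral probability to the positive probability, and treating neutral as negative by using the positive probability alone. The dictionary layout and function names are illustrative.

```python
import math
from typing import Dict, List

StepProbs = Dict[str, float]  # keys: "positive", "neutral", "negative"


def step_score(probs: StepProbs, neutral_as_positive: bool) -> float:
    """Probability that a step counts as correct under the chosen neutral mapping."""
    score = probs["positive"]
    if neutral_as_positive:
        score += probs["neutral"]  # assumption: neutral mass counts as correct
    return score


def solution_score(
    steps: List[StepProbs], neutral_as_positive: bool = True, aggregate: str = "product"
) -> float:
    """Aggregate step scores over a solution by product or by minimum."""
    scores = [step_score(s, neutral_as_positive) for s in steps]
    if aggregate == "product":
        return math.prod(scores)
    if aggregate == "minimum":
        return min(scores)
    raise ValueError(f"unknown aggregate: {aggregate}")


steps = [
    {"positive": 0.8, "neutral": 0.1, "negative": 0.1},
    {"positive": 0.5, "neutral": 0.3, "negative": 0.2},
]
for neutral_as_positive in (True, False):
    for aggregate in ("product", "minimum"):
        value = solution_score(steps, neutral_as_positive, aggregate)
        print(neutral_as_positive, aggregate, round(value, 3))
```

The product penalizes long solutions slightly, since every additional step multiplies in a factor below 1, whereas the minimum depends only on the weakest step.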

**Table.** Comparison of performance metrics for different neutral label mappings using product and minimum aggregation methods.

We evaluate PRM and ORM across the five difficulty quintiles of the MATH benchmark. PRM’s accuracy remains high in every quintile, while ORM’s performance drops noticeably on the easiest problems, where adversarial samples can fool it. Increasing the number of generator samples benefits the hardest quintile the most, as expected for complex reasoning tasks.

For synthetic supervision we employ the large‑scale PRM (PRMlarge) to label steps of a candidate solution. Any step receiving a negative label with probability > 20 % is deemed incorrect; this threshold reflects PRMlarge’s slight over‑confidence toward positive labels. Process supervision returns the sequence of step labels up to the first negative, mimicking our human‑annotation pipeline. Outcome supervision marks a solution correct only if every step passes the same threshold.
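
A minimal sketch of how the two synthetic label types could be derived from the same per-step negative probabilities. The function names are hypothetical, and including the first incorrect step in the process labels is our reading of "up to the first negative".

```python
from typing import List

NEGATIVE_THRESHOLD = 0.20  # from the text: negative probability > 20% means incorrect


def process_labels(neg_probs: List[float]) -> List[bool]:
    """Step labels (True = correct), truncated after the first incorrect step."""
    labels: List[bool] = []
    for p in neg_probs:
        ok = p <= NEGATIVE_THRESHOLD
        labels.append(ok)
        if not ok:
            break  # assumption: the first incorrect step is the last one labeled
    return labels


def outcome_label(neg_probs: List[float]) -> bool:
    """A solution counts as correct only if every step passes the threshold."""
    return all(p <= NEGATIVE_THRESHOLD for p in neg_probs)


neg_probs = [0.05, 0.10, 0.35, 0.02]
print(process_labels(neg_probs))  # [True, True, False]
print(outcome_label(neg_probs))   # False
```

Both label types come from the same thresholded predictions, so the comparison isolates the form of supervision rather than the quality of the labeler.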

PRM Visualizations

Visual examples illustrate PRM’s ability to spot correct and incorrect reasoning steps.

**Figure 6.** A breakdown of ORM vs PRM performance by problem difficulty.

True Positive examples (Problems 1-4) are hard problems that the generator rarely solves (pass rates 0.1 %–4.5 %); the PRM correctly recognizes the full chain of thought as valid despite the difficulty of selecting the right identities.

True Negative examples (Problems 5‑8) illustrate common failure modes—misapplied algebraic formulas, incorrect simplifications, or flawed justifications—and the PRM reliably rejects these incorrect steps.

False Positive examples (Problems 9‑12) expose cases where the PRM is fooled, typically by subtle counting or algebraic oversights that look plausible but contain hidden mistakes.
