Self-Consistency Improves Chain-of-Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou

Self-consistency improves language model reasoning by sampling diverse reasoning paths and selecting the most frequent answer.

How can we improve the reasoning accuracy of large language models on complex tasks without retraining them?

Language models often fail at complex reasoning tasks because standard greedy decoding forces the model to commit to a single, potentially flawed, chain-of-thought path. Self-consistency replaces this with a "self-ensemble" strategy: it samples multiple diverse reasoning paths for a single question and selects the final answer via majority vote. This approach consistently achieves state-of-the-art performance across arithmetic and commonsense benchmarks, often by double-digit percentage margins.

Paper Primer

The core move is to treat reasoning as a process that can be explored through multiple paths. By sampling from the decoder, the model generates several candidate solutions; because correct reasoning paths tend to converge on the same answer while incorrect ones diverge, the most frequent answer is statistically likely to be correct.

Self-consistency significantly boosts accuracy on multi-step reasoning tasks compared to standard chain-of-thought prompting.

Performance gains were observed across four language models (UL2, GPT-3, LaMDA, PaLM) on benchmarks including GSM8K, SVAMP, and AQuA. With PaLM-540B, absolute accuracy rose by 17.9 percentage points on GSM8K and 12.2 percentage points on AQuA.

Unlike prior methods that require training auxiliary verifiers or collecting human annotations, self-consistency is entirely unsupervised and works off-the-shelf with existing pre-trained models.

Why does this approach work better than simply training a model to be more accurate?

The authors hypothesize that complex reasoning problems admit multiple valid paths to a unique correct answer. By sampling, the model explores this space, and the majority vote effectively filters out the noise of individual, flawed reasoning steps.

What is the scope of this method—does it apply to all text generation?

Currently, it is limited to tasks with a fixed answer set where consistency can be measured by frequency. Extending it to open-ended generation would require defining a new metric for consistency between disparate outputs.

Researchers can now achieve state-of-the-art reasoning performance without additional training or fine-tuning by simply shifting from greedy decoding to a majority-vote sampling strategy.

Motivation and Problem Framing

We expose why greedy decoding in chain‑of‑thought prompting fails and set up the self‑consistency fix.

Chain‑of‑thought prompting asks a model to articulate intermediate reasoning steps, but when paired with greedy decoding it yields a single, deterministic answer that can be wrong. Because only the highest‑probability token is chosen at each step, any early mistake propagates and cannot be corrected.

The model is prompted to generate a short reasoning chain before the final answer, mimicking how a person works through a problem step by step.

At every generation step the model selects the token with the highest probability, producing a single deterministic output.
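A minimal sketch of greedy token selection, assuming hypothetical per-step probabilities invented for illustration. A real model recomputes its distribution after every emitted token; the fixed table here only shows that argmax selection is deterministic, so a wrong token that is narrowly preferred (the `4` in the last step) wins every time.

```python
def greedy_decode(step_distributions):
    """Pick the highest-probability token at every step (argmax)."""
    return [max(dist, key=dist.get) for dist in step_distributions]


# Hypothetical token probabilities for three consecutive steps.
# These numbers are illustrative assumptions, not model outputs.
steps = [
    {"2": 0.70, "3": 0.30},
    {"/": 0.90, "+": 0.10},
    {"4": 0.55, "5": 0.45},  # the wrong denominator is slightly more likely
]

tokens = greedy_decode(steps)
print("".join(tokens))
```

Running the function twice on the same input returns the same tokens, which is why a greedy chain has no second chance to recover from a slip.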

Step 1 (greedy): “There are 2 red balls.” – correct.

Step 2 (greedy): “There are 5 balls total.” – correct.

Step 3 (greedy): “Probability = 2/4 = 0.5.” – the model mistakenly divides by 4 instead of 5.

Final answer (greedy): “0.5.” – wrong because the single erroneous token in the fraction cannot be revised.

Greedy decoding provides only one chain; a single slip in any intermediate computation irrevocably corrupts the final answer.

Q: John found that the average of 15 numbers is 40. If 10 is added to each number then the mean of the numbers is? Answer Choices: (a) 50 (b) 45 (c) 65 (d) 78 (e) 64
A: If 10 is added to each number, then the mean of the numbers also increases by 10. So the new mean would be 50. The answer is (a).

Q: If a / b = 3/4 and 8a + 5b = 22, then find the value of a. Answer Choices: (a) 1/2 (b) 3/2 (c) 5/2 (d) 4/2 (e) 7/2
A: If a / b = 3/4, then b = 4a / 3. So 8a + 5(4a / 3) = 22. This simplifies to 8a + 20a / 3 = 22, which means 44a / 3 = 22. So a is equal to 3/2. The answer is (b).

Q: A person is traveling at 20 km/hr and reached his destiny in 2.5 hr then find the distance? Answer Choices: (a) 53 km (b) 55 km (c) 52 km (d) 60 km (e) 50 km
A: The distance that the person traveled would have been 20 km/hr * 2.5 hrs = 50 km. The answer is (e).

Q: How many keystrokes are needed to type the numbers from 1 to 500? Answer Choices: (a) 1156 (b) 1392 (c) 1480 (d) 1562 (e) 1788
A: There are 9 one-digit numbers from 1 to 9. There are 90 two-digit numbers from 10 to 99. There are 401 three-digit numbers from 100 to 500. 9 + 90(2) + 401(3) = 1392. The answer is (b).

Greedy decoding makes chain‑of‑thought reasoning fragile: a single token error determines the final answer.

The Self-Consistency Strategy

Self-consistency samples multiple reasoning chains and selects the most frequent answer to improve robustness.

Greedy decoding often yields a single reasoning chain that can go wrong, leaving the model without a fallback. By generating several chains we can let the model “vote” on the answer, reducing the impact of any single mistake.

Think of the model as a panel of experts: each sampled reasoning path is an expert’s solution, and we adopt the answer that most experts agree on. This consensus mitigates the brittleness of a single greedy chain.

Count occurrences: $18$ appears 3 times, $14$ appears 2 times.

The majority vote selects $18$ as the final answer because it has the highest count.

If a tie had occurred, the model would fall back to the answer with higher conditional probability $P(r_i,a_i\mid\text{prompt})$.

Even when some sampled paths are incorrect, the correct answer often dominates the vote, yielding a more reliable result than any single greedy decode.
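The counting and tie-break steps above can be sketched as follows. The path probabilities are hypothetical, and breaking ties by the summed probability of each tied answer is one assumed reading of the fallback rule, not a confirmed detail of the method.

```python
from collections import Counter, defaultdict


def majority_vote(samples):
    """samples: list of (answer, path_probability) pairs.

    Returns the most frequent answer. Ties are broken by the summed
    path probability of each tied answer (an assumed tie-break rule).
    """
    counts = Counter(answer for answer, _ in samples)
    mass = defaultdict(float)
    for answer, prob in samples:
        mass[answer] += prob
    return max(counts, key=lambda a: (counts[a], mass[a]))


# Five sampled paths: "18" appears 3 times, "14" appears 2 times.
# The probabilities are made up for illustration.
samples = [("18", 0.20), ("14", 0.25), ("18", 0.15), ("18", 0.10), ("14", 0.30)]
print(majority_vote(samples))
```

With one vote each, as in `[("18", 0.2), ("14", 0.5)]`, the count is tied and the function falls back to the higher probability mass.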

Self‑consistency decoding – sample $m$ paths and vote on the answer.
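Written as a formula, and assuming the $m$ sampled outputs are parsed into pairs $(r_i, a_i)$ of reasoning path and final answer, the unweighted vote picks the answer with the largest count:

```latex
\hat{a} \;=\; \arg\max_{a} \sum_{i=1}^{m} \mathbb{1}(a_i = a)
```

For the counts given earlier ($18$ three times, $14$ twice, so $m = 5$), the sum is $3$ for $a = 18$ and $2$ for $a = 14$, giving $\hat{a} = 18$.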

How does self‑consistency differ from the usual greedy decoding used in chain‑of‑thought prompting?

Greedy decoding picks the single most probable token at each step, producing one reasoning chain; self‑consistency instead samples many chains, then aggregates their final answers, so the decision is based on consensus rather than a single deterministic path.

Empirical Results

Self-consistency consistently outperforms greedy CoT across all models and benchmarks.

Self-consistency improves accuracy over greedy CoT by up to 23 percentage points.

Table 2 shows that GPT-3 Code-davinci-001 gains 23 percentage points on MultiArith compared to greedy CoT.

**Figure 2.** Self-consistency (blue) significantly improves accuracy over CoT-prompting with greedy decoding (orange) across arithmetic and commonsense reasoning tasks, over LaMDA-137B. Sampling a higher number of diverse reasoning paths consistently improves reasoning accuracy.
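One rough way to see why more sampled paths can help is a toy voting model. It assumes only two possible answers and that each path is independently correct with the same probability `p`. Real sampled paths are neither independent nor binary, so this illustrates the trend only and is not the paper's analysis.

```python
from math import comb


def majority_correct_probability(p, m):
    """Probability that more than half of m independent paths are correct,
    when each path is correct with probability p (two-answer toy model)."""
    if m % 2 == 0:
        raise ValueError("use an odd m so that a strict majority always exists")
    return sum(
        comb(m, k) * p**k * (1 - p) ** (m - k)
        for k in range(m // 2 + 1, m + 1)
    )


# p = 0.6 is an arbitrary illustrative value.
for m in (1, 5, 15, 41):
    print(m, round(majority_correct_probability(0.6, m), 3))
```

Under these assumptions the majority is right more often as `m` grows whenever `p` exceeds 0.5, and less often when `p` is below 0.5.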

Self‑consistency consistently outperforms greedy CoT across all tested models and benchmarks.

Robustness and Ablations

Ablation results confirm self‑consistency stays effective across sampling, prompts, and ensembling variations.

We first test whether the self‑consistency trick survives changes in the sampling temperature $T$ and top‑$k$ cutoff $k$.

Varying $T$ and $k$ does not materially degrade self-consistency accuracy; performance stays close to its peak across the tested settings.

Figure 6 shows flat accuracy curves across a wide range of $T$ and $k$ on LaMDA‑137B.

**Figure 6.** GSM8K accuracy over LaMDA-137B. Self-consistency works under various sampling strategies and sampling parameters.

Next we compare a single greedy path to the multi‑path self‑consistency pipeline on two large models.

Self‑consistency outperforms greedy decoding by up to +18 % absolute accuracy on both LaMDA‑137B and PaLM‑540B.

Figures 7 (LaMDA‑137B) and 8 (PaLM‑540B) report greedy ≈ 57 % versus self‑consistency ≈ 75 %.

**Figure 7.** Self-consistency (blue) significantly improves accuracy across various arithmetic and commonsense reasoning tasks, over LaMDA-137B. Sampling a higher number of diverse reasoning paths consistently improves reasoning accuracy.

Qualitative examples show that diverse sampled reasoning paths can correct errors made by greedy decoding.

**Table 12.** Additional examples where self-consistency helps repair the errors over greedy decode on LaMDA-137B. Two sampled reasoning paths that are consistent with the ground truth are shown.

We also examine sensitivity to the wording of the chain‑of‑thought prompt.

Across three manually crafted prompt sets, self-consistency adds a steady gain of roughly 17 percentage points over standard CoT.

Table 9 reports self‑consistency accuracies of 73.9 %, 74.2 %, and 74.5 % versus CoT scores 56.5 %–57.1 %.

**Table 9.** GSM8K accuracy over PaLM-540B. The results show robustness of self-consistency with respect to different prompts in the input.

We compare self‑consistency to a conventional multi‑model ensemble that averages predictions from three distinct LMs.

**Table 10.** Comparison of GSM8K accuracy over multiple-model ensembles.

Finally, we test whether self‑consistency can be combined with other ensembling tricks.

Adding 40 diverse prompts to self-consistency raises accuracy to 75.4 %, an improvement of 1.0 percentage point over self-consistency alone.

Table 11 lists 74.4 % for pure self‑consistency and 75.4 % for self‑consistency + different prompts (×40).

**Table 11.** Combining self-consistency with other ensembling strategies.

Self-consistency remains robust to sampling temperature, top-$k$ choices, and prompt variations, and it even outperforms multi-model ensembles.

Related Work and Discussion

Self-consistency samples multiple reasoning paths and picks the most frequent answer, improving robustness over greedy chain‑of‑thought prompting.

Research on language‑model reasoning has explored two broad families: (1) prompting techniques that ask the model to generate intermediate steps, and (2) decoding strategies that diversify the generated text. The former includes chain‑of‑thought prompting, while the latter spans temperature, top‑k, nucleus, and minimum Bayes‑risk decoding, as well as explicit re‑ranking pipelines.

Chain-of-thought prompting: a prompting style where the model first produces a sequence of reasoning steps before emitting the final answer.

Temperature sampling: randomly draws tokens from the softmax distribution after the logits are divided by a temperature parameter, which controls randomness.

Top-k sampling: restricts sampling to the k most probable tokens at each step, discarding the tail of the distribution.

Nucleus sampling: samples from the smallest set of tokens whose cumulative probability exceeds a threshold p.

Minimum Bayes-risk decoding: chooses the output that minimizes expected loss under a model-defined risk function, typically using multiple samples.

Re-ranking: generates several candidate answers, then trains a separate model (or uses human annotations) to score and select the best.

Models can produce contradictory statements across turns, explanations, or factual queries.

Generate several complete answers, then return the answer that appears most often among them.
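The sampling strategies described above can be sketched in plain Python on a single next-token distribution. The logits, function names, and parameter values are illustrative assumptions; production decoders apply the same ideas to full vocabulary tensors.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn {token: logit} into probabilities after dividing logits by T."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def renormalize(probs):
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}


def top_k_filter(probs, k):
    """Keep only the k most probable tokens, then renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    return renormalize(dict(kept))


def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    kept, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return renormalize(kept)


def sample_token(probs, rng):
    """Draw one token in proportion to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


logits = {"18": 2.0, "14": 1.0, "26": 0.0, "8": -1.0}  # hypothetical values
probs = softmax_with_temperature(logits, temperature=0.7)
rng = random.Random(0)  # seeded so the draw is reproducible
token = sample_token(top_k_filter(probs, k=2), rng)
```

Lower temperatures concentrate mass on the top token, while the two filters trim the tail before sampling; self-consistency needs some randomness here so that the sampled chains actually differ.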

Our experiments show that self-consistency, which builds on Sample-and-Rank, yields sizable accuracy gains on arithmetic and commonsense benchmarks across four language models of varying scale. Beyond raw performance, the approach supplies rationales, offers a simple uncertainty estimate (answer frequency), and improves calibration. The main trade-off is extra compute: sampling a handful of paths (≈5–10) recovers most of the benefit while keeping cost modest. Future work could use self-consistency-generated data to fine-tune a model for single-pass inference, and further research is needed to curb occasional nonsensical reasoning paths.
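The answer-frequency uncertainty estimate mentioned above can be sketched as the share of sampled paths that agree with the winning answer. The function name and the example answers are hypothetical.

```python
from collections import Counter


def vote_with_confidence(answers):
    """Return (majority answer, share of sampled paths that agree with it)."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Five parsed final answers from five sampled paths (illustrative values).
answer, confidence = vote_with_confidence(["18", "18", "14", "18", "14"])
print(answer, confidence)
```

A low share signals that the sampled paths disagree, which a caller could treat as a cue to sample more paths or flag the question for review.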

Implementation Details

Resource configurations, inference timings, and complete few‑shot prompt sets for all evaluated models.

UL2 runs on a TPU v3 (2×2 configuration, 4 chips, 8 cores); LaMDA‑137B uses a TPU v3 (8×8, 64 chips, 128 cores); PaLM‑540B runs on a TPU v4 (4×4×12, 192 chips, 384 cores). GPT‑3 models are accessed via the public API.

Inference jobs typically finish in 1–4 hours for UL2 and LaMDA‑137B (≈1 000 examples per task) and in 2–12 hours for PaLM‑540B; the longest tasks, such as commonsense reasoning, stay under two days.

For all GPT‑3 variants we cap generation at 128 tokens, apply no frequency or presence penalties, and stop parsing when the next “Q:” token appears, matching the prompting format used throughout.
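A minimal sketch of the post-processing this implies, assuming completions follow the "The answer is ..." pattern of the few-shot exemplars. The helper names are hypothetical, and the 128-token cap is applied when the request is made, so it is not shown here.

```python
import re


def truncate_at_next_question(completion, stop="Q:"):
    """Keep only the text before the first stop marker, if one appears."""
    return completion.split(stop, 1)[0].strip()


def extract_answer(completion):
    """Return the text after 'The answer is', or None if the phrase is absent."""
    text = truncate_at_next_question(completion)
    match = re.search(r"The answer is\s*(.+?)\.?\s*$", text)
    return match.group(1) if match else None


# Illustrative completion in which the model starts a new question.
completion = (
    "If 10 is added to each number, the mean also increases by 10. "
    "The answer is (a).\n\nQ: If a / b = 3/4"
)
print(extract_answer(completion))
```

Returning `None` for unparseable completions lets the voting step skip them instead of counting a malformed string as an answer.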

The appendix lists raw greedy‑decode outputs alongside two sampled reasoning paths for each of MultiArith, SVAMP, AQuA, CommonsenseQA, ARC‑easy, ARC‑challenge, and GSM8K, illustrating the variability that self‑consistency later aggregates.

Tables 14–18 and 20–21 provide the complete few‑shot exemplars for AQUA‑RAT, ARC, HotpotQA, arithmetic reasoning, ANLI, RTE, and BoolQ, respectively; each entry shows the prompt question, answer choices, and the model‑generated rationale.
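As a sketch of how such exemplars might be assembled into a prompt, the snippet below joins question and rationale pairs in the "Q: ... A: ..." layout and leaves the final "A:" open for the model. The blank-line separator and the function name are assumptions, and the new question is a placeholder.

```python
def build_few_shot_prompt(exemplars, question):
    """exemplars: list of (question, rationale_with_answer) pairs."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    blocks.append(f"Q: {question}\nA:")  # the model continues from here
    return "\n\n".join(blocks)


exemplars = [
    (
        "A person is traveling at 20 km/hr and reached his destiny in 2.5 hr "
        "then find the distance? Answer Choices: (a) 53 km (b) 55 km "
        "(c) 52 km (d) 60 km (e) 50 km",
        "The distance that the person traveled would have been "
        "20 km/hr * 2.5 hrs = 50 km. The answer is (e).",
    ),
]

prompt = build_few_shot_prompt(exemplars, "<new question with answer choices>")
print(prompt)
```

The same prompt is reused for every sampled path, so variation across paths comes from the decoder's sampling rather than from the input.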

**Table 19.** Few-shot exemplars for e-SNLI (the rationales are crowd-sourced annotations from the original dataset).

**Figure 8.** Self-consistency (blue) significantly improves accuracy across various arithmetic and commonsense reasoning tasks, over PaLM-540B. Sampling a higher number of diverse reasoning paths consistently helps reasoning accuracy.
