Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou

Chain-of-thought prompting enables large language models to solve complex reasoning tasks by generating intermediate natural language steps.

How can we elicit multi-step reasoning from large language models without fine-tuning?

Large language models often struggle with multi-step reasoning tasks like arithmetic or symbolic manipulation, even as their scale increases. The authors introduce chain-of-thought prompting: a method that augments few-shot exemplars with a series of intermediate natural language reasoning steps before the final answer. This approach allows models to decompose complex problems into manageable parts, achieving state-of-the-art performance on benchmarks like GSM8K without requiring model fine-tuning.

Paper Primer

Standard few-shot prompting forces models to map inputs directly to outputs, which fails for tasks requiring multi-hop logic. Chain-of-thought prompting changes this by providing the model with a "reasoning trace" in the prompt, effectively allocating more computation to the problem-solving process.

Chain-of-thought prompting significantly boosts performance on complex reasoning tasks compared to standard prompting.

On the GSM8K math benchmark, PaLM 540B with chain-of-thought prompting achieves state-of-the-art accuracy, outperforming fine-tuned models. Performance on GSM8K more than doubled for the largest GPT and PaLM models compared to standard prompting.

The benefit of chain-of-thought prompting is an emergent property of model scale. Small models (under 10B parameters) often generate fluent but illogical chains that degrade performance, while models at the 100B+ scale show dramatic gains in reasoning capability.

Why use natural language chains instead of just asking the model to output a mathematical equation?

For semantically complex problems, models struggle to translate the entire question into a single equation. Natural language steps allow the model to reason through each part of the problem sequentially, which is more robust for multi-step tasks.

Is this method sensitive to the specific way the reasoning steps are written?

While prompt engineering can improve performance, the method is robust to different annotators and exemplar sets. Experiments show that even chains written by non-experts consistently outperform standard prompting by a large margin.

Researchers can now unlock reasoning capabilities in large off-the-shelf models simply by changing the prompt structure, shifting the focus from expensive fine-tuning to effective in-context demonstration.

Introduction to Chain-of-Thought

We define Chain-of-Thought prompting and contrast it with standard prompting.

Large language models excel when they can be guided to reason step‑by‑step, yet raw scaling does not automatically yield strong performance on multi‑step tasks such as arithmetic or commonsense reasoning.

In chain-of-thought prompting, instead of asking the model for a final answer directly, we ask it to spell out each reasoning sub‑step in natural language before producing the answer.

In standard prompting, we present the problem and ask the model to produce only the final answer, without any explicit reasoning trace.

For a toy problem such as 2 + 3 − 1, standard prompting generates a single short output: “4”.

CoT prompting generates: “Start with 2. Add 3 → 5. Subtract 1 → 4.” followed by “4”.

The CoT output is several times longer than the standard output, illustrating the extra token budget required for intermediate reasoning.

Even a tiny problem shows that CoT prompting expands the output length, which in effect lets the model spend more computation on problems that need more reasoning steps.
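The contrast between the two prompting styles can be sketched in code. This is a minimal illustration, not the authors' implementation: the `Exemplar` class, the `Q:`/`A:` layout, and the helper names are assumptions introduced here, and the exemplar is the toy problem rather than one from the paper's prompts.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    question: str
    rationale: str  # intermediate reasoning steps; only used for CoT prompts
    answer: str


def format_exemplar(ex: Exemplar, use_cot: bool) -> str:
    """Render one few-shot exemplar, with or without its reasoning steps."""
    if use_cot:
        target = f"{ex.rationale} The answer is {ex.answer}."
    else:
        target = f"The answer is {ex.answer}."
    return f"Q: {ex.question}\nA: {target}"


def build_prompt(exemplars: list[Exemplar], test_question: str, use_cot: bool) -> str:
    """Concatenate the exemplars and leave the final answer slot open."""
    blocks = [format_exemplar(ex, use_cot) for ex in exemplars]
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)


# Hypothetical toy exemplar (assumption, for illustration only).
exemplars = [
    Exemplar(
        question="What is 2 + 3 - 1?",
        rationale="Start with 2. Add 3 to get 5. Subtract 1 to get 4.",
        answer="4",
    )
]

standard_prompt = build_prompt(exemplars, "What is 7 + 2 - 4?", use_cot=False)
cot_prompt = build_prompt(exemplars, "What is 7 + 2 - 4?", use_cot=True)
print(cot_prompt)
```

The only difference between the two prompts is the exemplar target: the model weights, the test question, and the number of exemplars stay the same.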

**Figure 1.** Chain-of-thought prompting enables large language models to tackle complex arithmetic, commonsense, and symbolic reasoning tasks. Chain-of-thought reasoning processes are highlighted.

**Figure 2.** PaLM 540B uses chain-of-thought prompting to achieve new state-of-the-art performance on the GSM8K benchmark of math word problems. Finetuned GPT-3 and prior best are from Cobbe et al. (2021).

CoT bridges the gap between input and output via intermediate steps.

Arithmetic Reasoning Performance

Chain‑of‑thought prompting dramatically improves arithmetic problem solving for large models.

We evaluate arithmetic word‑problem solving under two prompting regimes across five benchmark datasets and five large language models.

Arithmetic reasoning is the capability to read a natural‑language problem, extract the relevant numbers, and combine them with the correct arithmetic operations to reach the answer.

How does arithmetic reasoning differ from the symbolic reasoning evaluated later in the paper?

Arithmetic reasoning focuses on concrete numeric manipulation (addition, subtraction, etc.) within a natural‑language context, whereas symbolic reasoning tests manipulation of abstract symbols, equations, or logical forms that may not map directly to numeric values.

Chain‑of‑thought prompting more than doubles GSM8K accuracy for the 540 B PaLM model, surpassing prior fine‑tuned state‑of‑the‑art.

Table 1 shows PaLM 540B achieving 56.9 % with CoT versus 17.9 % with standard prompting, more than tripling accuracy.

**Figure 4.** Chain-of-thought prompting enables large language models to solve challenging math problems. Notably, chain-of-thought reasoning is an emergent ability of increasing model scale. Prior best numbers are from Cobbe et al. (2021) for GSM8K, Jie et al. (2022) for SVAMP, and Lan et al. (2021) for MAWPS.

**Figure 6.** Chain-of-thought prompting has variance for different prompt examples (as expected) but outperforms standard prompting for various annotators as well as for different exemplars.

The ablation study isolates three hypothesized contributors to CoT’s success and finds that only the natural‑language reasoning steps themselves drive the gains.

Commonsense Reasoning

Chain‑of‑thought prompting lifts commonsense solve rates, especially at large model scale.

Commonsense reasoning is the ability to answer questions that require everyday world knowledge about physical events or human behavior.

Chain‑of‑thought prompting raises solve rates on commonsense benchmarks, most notably on StrategyQA, where it reaches 75.6 % against a prior state of the art of 69.4 %.
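When reading such comparisons, it helps to separate the absolute gain (in percentage points) from the relative gain. A small sketch using the two StrategyQA figures quoted above; the helper name `gains` is an assumption introduced here.

```python
def gains(new: float, baseline: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = new - baseline
    relative = absolute / baseline * 100.0
    return absolute, relative


# StrategyQA: 75.6 % with chain of thought vs. 69.4 % prior state of the art.
abs_gain, rel_gain = gains(75.6, 69.4)
print(f"{abs_gain:.1f} points absolute, {rel_gain:.1f}% relative")
```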

Table 4

**Figure 7.** Chain-of-thought prompting also improves the commonsense reasoning abilities of language models. The language model shown here is PaLM. Prior best numbers are from the leaderboards of CSQA (Talmor et al., 2019) and StrategyQA (Geva et al., 2021) (single-model only, as of May 5, 2022). Additional results using various sizes of LaMDA, GPT-3, and PaLM are shown in Table 4.

Symbolic Reasoning

Chain‑of‑thought prompting drives near‑perfect symbolic reasoning and length generalization.

Chain‑of‑thought prompting reaches near‑perfect solve rates on symbolic reasoning tasks at 540 B parameters.

Figure 8 shows PaLM 540B with CoT achieving ≈100 % solve rate on both tasks. Standard prompting stays under 10 % on last‑letter concatenation, although at this scale it already solves the in‑domain coin‑flip task.

Symbolic reasoning asks a model to manipulate abstract tokens (letters, coin states) according to a rule, without any underlying arithmetic.
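The rules behind the two symbolic tasks are simple enough to write as reference functions, which is what makes them useful for testing length generalization: the same rule applies to inputs with more words or more flips than the exemplars show. The sketch below is an assumption about how ground truth could be computed, not code from the paper, and the function names are hypothetical.

```python
def last_letter_concat(name: str) -> str:
    """Concatenate the last letter of every word in a name."""
    return "".join(word[-1] for word in name.split())


def coin_flip(flips: list[bool], starts_heads_up: bool = True) -> bool:
    """Track a coin's state; each True entry means that person flipped it.

    Returns True if the coin ends heads up.
    """
    heads_up = starts_heads_up
    for flipped in flips:
        if flipped:
            heads_up = not heads_up
    return heads_up


print(last_letter_concat("Amy Brown"))           # two words
print(last_letter_concat("Amy Brown Lee Park"))  # longer input, same rule
print(coin_flip([True, False]))                  # one flip: ends tails up
print(coin_flip([True, False, True, True]))      # three flips: ends tails up
```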

**Figure.** Comparison of solve rates between standard prompting and chain-of-thought prompting across different model scales and tasks.

**Table 5.** Standard prompting versus chain of thought prompting enables length generalization to longer inference examples on two symbolic manipulation tasks.

Scaling and CoT

Key ablations reveal why each component of chain‑of‑thought prompting matters.

Scaling PaLM from 62 B to 540 B parameters fixes a substantial portion of the semantic‑understanding and one‑step‑missing errors that undermine chain‑of‑thought reasoning. The error analysis of 45 failures suggests that larger models acquire richer world knowledge and more reliable logical steps, which smaller models lack. Consequently, the emergent ability to follow multi‑step prompts appears only beyond a certain scale.

Prompt engineering remains a decisive factor: different annotators, exemplar sets, and ordering all shift performance, yet chain‑of‑thought still beats standard prompting by a wide margin. Variance is especially pronounced on the coin‑flip task (99.6 % vs. 71.4 % across annotators) while standard prompting hovers around 50 %. The robustness experiments (Table 7) confirm that the technique tolerates prompt changes but benefits from careful design.

Chain‑of‑thought helps most when the task demands multi‑step reasoning, a large model is available, and the scaling curve is flat—conditions satisfied by GSM8K with PaLM 540 B. When tasks are trivial or the model already nears ceiling performance (e.g., single‑step MAWPS subsets), gains shrink dramatically. Thus, practitioners should assess task difficulty and model size before expecting large improvements.

Providing only the final equation as an intermediate step fails on semantically rich questions (e.g., the ping‑pong example) because the model cannot translate the full narrative into a single formula. Full chain‑of‑thought decomposes the problem into natural‑language substeps, allowing the model to reason about each clause before assembling the final answer. Hence, the “equation‑only” ablation drops accuracy sharply on datasets like GSM8K.
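The difference between the full chain-of-thought exemplar and the equation-only ablation comes down to what precedes the final answer in the few-shot target. The sketch below illustrates that difference only; the exemplar text, the function names, and the "The answer is" suffix are illustrative assumptions rather than the exact prompts used in the ablation.

```python
def full_cot_target(steps: list[str], answer: int) -> str:
    """Few-shot target with natural-language reasoning before the answer."""
    return " ".join(steps) + f" The answer is {answer}."


def equation_only_target(equation: str, answer: int) -> str:
    """Ablation target: a single equation in place of the reasoning steps."""
    return f"{equation} The answer is {answer}."


# Hypothetical exemplar about buying cans of tennis balls (assumption).
steps = [
    "Roger started with 5 balls.",
    "2 cans of 3 tennis balls each is 6 tennis balls.",
    "5 + 6 = 11.",
]
cot = full_cot_target(steps, 11)
eq = equation_only_target("5 + 2 * 3 = 11.", 11)

print(cot)
print(eq)
```

For a one- or two-step problem the equation is easy to write directly; the natural-language steps matter when the question's wording has to be interpreted clause by clause before any equation can be formed.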

**Table 7.** Ablation and robustness results for four datasets in commonsense and symbolic reasoning. Chain of thought generally outperforms ablations by a large amount. Chain of thought prompting has variance (as expected) when used with prompts written by different annotators or when using other exemplars, but still outperforms standard prompting by a large margin. Standard deviation shown is for different order of few-shot prompting exemplars, with five different random seeds. Results here are shown for LaMDA 137B, as additional queries for GPT-3 and PaLM are both limited and expensive. The exception is that we run SayCan using PaLM here, as the SayCan evaluation set is only 120 examples and therefore less expensive to run multiple times.

**Table.** Performance comparison across different prompting methods, ablations, and robustness checks on Commonsense (Date, Sports, SayCan) and Symbolic (Concat, Coin) tasks.
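The standard deviation over exemplar orderings described in the Table 7 caption can be sketched as follows. This is an assumed procedure, not the authors' script: the seeds, the exemplar labels, and above all the accuracy values are made-up placeholders, and whether the paper reports a sample or population standard deviation is not stated here (the sketch uses the sample version).

```python
import random
import statistics


def shuffled_orders(exemplars: list[str], seeds: list[int]) -> list[list[str]]:
    """Return one shuffled copy of the exemplar list per random seed."""
    orders = []
    for seed in seeds:
        rng = random.Random(seed)  # independent, reproducible generator
        order = list(exemplars)
        rng.shuffle(order)
        orders.append(order)
    return orders


def summarize(accuracies: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation across the runs."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


exemplars = ["ex1", "ex2", "ex3", "ex4"]       # placeholder exemplar IDs
orders = shuffled_orders(exemplars, seeds=[0, 1, 2, 3, 4])

# PLACEHOLDER accuracies, one per ordering; not results from the paper.
placeholder_accuracies = [50.0, 52.0, 48.0, 51.0, 49.0]
mean_acc, std_acc = summarize(placeholder_accuracies)
print(f"{mean_acc:.1f} +/- {std_acc:.1f}")
```

In a real evaluation, each ordering would be turned into a prompt and scored on the full evaluation set before the summary is computed.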

Related Work

We situate chain-of-thought prompting among prior prompting, explanation, program, and reasoning work.

Chain-of-thought prompting draws on prior lines of work such as prompting, natural‑language explanations, program synthesis, numeric/logical reasoning, and intermediate language steps.

Prompting research optimizes input prompts so a single large language model can handle many tasks (Brown 2020; Liu 2021). Unlike instruction‑based prompting, which prepends task descriptions, chain‑of‑thought prompting appends reasoning steps to the model’s output.

Natural language explanations (NLEs) have been used mainly to improve interpretability in tasks like natural‑language inference (Camburu 2018). In contrast, chain‑of‑thought prompting inserts reasoning steps before the final answer, using interpretability only as a side effect.

Program synthesis approaches generate code that is then executed, a paradigm explored in early work (Zaremba 2014) and recent large‑model studies (Chen 2021; Austin 2021). Our method generalizes this idea to open‑domain natural language, avoiding domain‑specific primitives.

Numeric and logical reasoning has been enhanced by augmenting models with executable operations or graph neural networks (Andor 2019; Ran 2019). Chain‑of‑thought prompting achieves similar capabilities across tasks without finetuning, unlike many task‑specific solutions.

Prior work shows that training models to emit intermediate steps can boost performance, robustness, and training speed (Zaidan 2007; Chen 2022). Our approach elicits such steps purely via prompting, sidestepping the need for annotated data or finetuning.

Error Analysis

We examine correctness patterns, error types, and exemplar robustness of chain‑of‑thought prompting.

We first assess the 50 chains of thought generated by LaMDA 137B that produced correct answers on GSM8K.

The image displays a list of four multiple-choice questions, each followed by a "Model Answer" that evaluates the correctness of the choice. The table format presents these as a series of rows separated by horizontal lines, where each row contains a question, a set of five answer choices (a-e), and a model's reasoning and final selection, marked with either a green checkmark for correct answers or a blue X for incorrect ones.

Across the 50 correct cases, 49 exhibit sound logic and arithmetic, while the remaining one reaches the right answer despite flawed reasoning. Five of the 49 contain minor flaws such as underspecified statements, unrelated but true remarks, omitted equation steps, or inverted semantics.

**Table 9.** Of 50 examples that the model got correct, there were 7 salient cases where the chain of thought the model generated was imperfect. We found the other 43 to be equivalent to what a human would potentially generate as a ground truth.

These observations suggest that for free‑response math questions, a correct final answer almost always implies a correct reasoning path, whereas multiple‑choice or binary tasks admit more accidental correctness.

We next turn to the 50 randomly sampled incorrect chains from the same model and dataset.

Why can a model sometimes output a correct answer even when its chain of thought is flawed?

Because the final numeric computation can coincide with the true answer by chance; this is unlikely for open‑ended arithmetic but more probable for tasks where the answer space is small (e.g., multiple‑choice).
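Accidental correctness is possible because scoring typically looks only at the final answer, not at the chain that precedes it. A minimal sketch, assuming each chain ends with a sentence of the form "The answer is X." (the format seen in the model outputs quoted in this document); the regular expression and helper names are assumptions introduced here.

```python
import re
from typing import Optional

# Capture whatever follows "The answer is" up to an optional final period.
ANSWER_PATTERN = re.compile(r"The answer is\s+(.+?)\.?\s*$")


def extract_answer(chain: str) -> Optional[str]:
    """Return the final answer string, or None if the pattern is absent."""
    match = ANSWER_PATTERN.search(chain.strip())
    return match.group(1) if match else None


def is_correct(chain: str, gold: str) -> bool:
    """Score a chain by its final answer only; the reasoning is ignored."""
    return extract_answer(chain) == gold


# Hypothetical chain with a wrong step that still lands on the right answer.
flawed_chain = "2 + 3 = 6. 6 - 1 = 4. The answer is 4."
print(is_correct(flawed_chain, "4"))
```

With a free-response numeric answer, a flawed chain rarely lands on the gold value; with five answer choices or a yes/no label, it does so far more often.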

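Beyond three families of minor errors (calculator errors, symbol mapping errors, and one missing reasoning step), which together account for 46 % of the incorrect chains, the remaining 54 % exhibit deeper problems such as semantic misunderstanding or incoherent reasoning, as illustrated in Table 11.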

Finally, we assess whether the advantage of chain‑of‑thought prompting persists when the number of few‑shot exemplars varies.

**Figure 11.** The improvement of chain of thought prompting over standard prompting appears robust to varying the number of few-shot exemplars in the prompt.

Table 12 summarizes the math‑word‑problem benchmarks used throughout the study, listing dataset sizes and a representative example for each.

Qualitative Examples

Shows representative correct and incorrect chain‑of‑thought outputs across tasks.

Across the seven benchmark tasks, the sampled chains of thought (from LaMDA 137B, and from PaLM 540B for SayCan) are correct for 12 out of 25 sampled prompts.

Aggregated from Tables 13–19, which list per‑task correct and incorrect examples.

Accuracy varies by task: letter concatenation and robot planning achieve 50 % correct, while the commonsense set drops to 40 %.

**Table 13.** Examples of correct and incorrect chains of thought produced by LaMDA 137B on the letter concatenation task.

**Table 16.** Examples of correct and incorrect chains of thought produced by LaMDA 137B on StrategyQA.

**Table 19.** Examples of correct and incorrect chains of thought produced by PaLM 540B on SayCan robot planning tasks.

Annotator Robustness

Evaluates how swapping chain‑of‑thought annotators affects math word‑problem prompting.

We test whether the specific wording of chain‑of‑thought (CoT) exemplars matters by swapping the annotator who writes the reasoning steps. Table 29 shows the few‑shot exemplars authored by Annotator B, while Table 30 contains exemplars from Annotator C. Both annotators are co‑authors familiar with the CoT goal.

**Table 30.** Few-shot exemplars for full chain of thought prompt for math word problems. These exemplars are the same as in Table 20, except that the chains of thought were written by a different annotator (“Annotator C” instead of “Annotator A”).

Discussion and Emergent Abilities

We examine why chain‑of‑thought prompting works, what emergent abilities it reveals, and how model scale underlies the effect.

Chain‑of‑thought prompting asks a language model to generate a sequence of reasoning steps before producing an answer, and this simple trick consistently lifts performance on complex tasks.

The gain is largest where standard prompting has a flat scaling curve: on many reasoning benchmarks, larger models stop improving unless they are given an explicit chain of thought.

An emergent ability is a capability that suddenly appears once a model exceeds a certain size, even though smaller models lack any trace of it.

Why does the ability emerge abruptly instead of improving gradually with size?

A possible explanation is that once the parameter count passes a critical threshold, the model has enough capacity to represent the intermediate reasoning patterns that chain‑of‑thought prompting requires; below that threshold, the chains it produces are fluent but illogical and do not help.

As a hypothetical illustration, a small model produces only the final answer (e.g., “42”) with no usable intermediate steps.

A large model generates three explicit steps (e.g., “1 + 2 = 3; 3 × 10 = 30; 30 + 12 = 42”) before the answer.

Illustrative accuracy (not a reported result): small model 20 % correct, large model 78 % correct.

Under this explanation, the jump in performance comes from crossing the capacity threshold that lets the model follow the multi‑step pattern, rather than from more data.

Model scale refers to the total number of trainable parameters, which determines how much knowledge and computational sophistication a model can encode.

Why does simply adding more parameters enable chain‑of‑thought prompting to work, whereas a smaller model fails even with the same prompt?

One hypothesis is that more parameters give the model enough representational capacity to keep intermediate results distinct across reasoning steps, whereas a smaller model lacks the capacity to do so.

As a hypothetical illustration, Model A outputs only the final answer (“yes”).

Model B outputs “Step 1: …; Step 2: …; Answer: yes”.

On a held‑out set, Model A might solve 15 % of items and Model B 62 % (illustrative figures, not reported results).

Under this view, the larger model turns a single‑shot task into a multi‑step computation.

**Figure 5.** Ablation study for different variations of prompting using LaMDA 137B and PaLM 540B. Results for other datasets are given in Appendix Table 6 and Table 7.

Despite these gains, several open questions remain: can reasoning improve further with even larger models, and what alternative prompting methods might close the gap for smaller models? Limitations include the lack of a guarantee that generated reasoning chains are correct, the annotation cost for fine‑tuning, and the high inference cost of very large models.

Data and Scaling Details

Key scaling trends and error patterns for large language models.

**Figure 9.** Error analysis of 45 problems that PaLM 62B got incorrect. These errors were categorized as semantic understanding, one step missing, and other. The other category includes hallucinations, repetitive outputs, and symbol mapping errors. Scaling PaLM to 540B fixed a substantial portion of errors in all categories.

**Figure 10.** Examples of semantic understanding and one-step missing errors that were fixed by scaling PaLM from 62B to 540B.

Version History

Key logistical details about versioning, reproducibility, compute, and data sources.

Version history records incremental additions: V5→V6 fixes a typo in Figure 3; V4→V5 adds Codex and UL2 results; V3→V4 corrects a typo and adds citations; V2→V3 introduces GPT‑3 results, new math and commonsense eval sets, extended related work, ablations, FAQ, and raw results; V1→V2 incorporates PaLM results.

The reproducibility statement (E.1) releases all exact input prompts (Tables 20–27, Appendix G), uses only off‑the‑shelf models without fine‑tuning, and provides LaMDA 137B inputs, targets, and predictions as a downloadable zip.

Computational resources (E.2) employ TPU v3 (8×8, 64 chips / 128 cores) for LaMDA 137B, TPU v4 (4×4×12, 192 chips / 384 cores) for PaLM 540B, and the public GPT‑3 API for the GPT‑3 models.

Dataset details (E.3) enumerate the arithmetic benchmarks (Math Word Problem Repository, AddSub, MultiArith, ASDiv, AQuA, GSM8K, SVAMP) and the commonsense benchmarks (CSQA, StrategyQA, date and sports understanding from BIG‑Bench, and SayCan), each linked to its source repository and license.

Prompting Exemplars

All few‑shot Chain‑of‑Thought prompts used across benchmarks are listed in full.

This appendix enumerates the exact few‑shot Chain‑of‑Thought prompts supplied to each benchmark.

Math word‑problem prompt

AQuA algebraic prompt

Last‑letter concatenation prompt

Coin‑flip reasoning prompt

**Table 23.** Few-shot exemplars for full chain of thought prompt for the coinflip task.

CSQA prompt

StrategyQA prompt

**Table 25.** Few-shot exemplars for full chain of thought prompt for StrategyQA.

Date‑understanding prompt

Sports‑understanding prompt

SayCan robot‑planning prompt

**QUESTION:** Gretchen has 110 coins. There are 30 more gold coins than silver coins. How many gold coins does Gretchen have?

**MODEL ANSWER (INCORRECT; SEMANTIC UNDERSTANDING ERROR):** Gretchen has 110 coins. There are 30 more gold coins than silver coins. So there are 110 - 30 = 80 silver coins. So there are 80 silver coins and 110 - 80 = 30 gold coins. The answer is 30. ✗

**EXPLANATION FOR ERROR CATEGORY:** The model does not understand that 110 coins are in total and that it must solve an equation to get the number of gold and silver coins.

**QUESTION:** In a dance class of 20 students, 20% enrolled in contemporary dance, 25% of the remaining enrolled in jazz dance, and the rest enrolled in hip-hop dance. What percentage of the entire students enrolled in hip-hop dance?

**MODEL ANSWER (INCORRECT; INCOHERENT CHAIN OF THOUGHT ERROR):** 20% of the students enrolled in contemporary dance. 25% of the remaining students enrolled in jazz dance. The rest enrolled in hip-hop dance. So the percentage of the entire students enrolled in hip-hop dance is the percentage of the students enrolled in hip-hop dance, minus the percentage of the students enrolled in contemporary dance, minus the percentage of the students enrolled in jazz dance. So the percentage of the entire students enrolled in hip-hop dance is (25 + 20) - (25 + 20) = 100%. The answer is 100%. ✗

**EXPLANATION FOR ERROR CATEGORY:** This chain of thought is incoherent in that the percentage of all students enrolled in hip-hop dance cannot be the percentage of students enrolled in hip-hop dance minus another term.
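For reference, both questions above have short correct solutions, worked here as a sketch. The variable names are introduced for illustration; the arithmetic follows directly from the problem statements.

```python
# Gretchen's coins: gold + silver = 110 and gold = silver + 30.
# Substituting gives 2 * silver + 30 = 110.
total_coins, difference = 110, 30
silver = (total_coins - difference) // 2
gold = silver + difference

# Dance class: work in percentages of the whole class.
contemporary = 20.0                       # 20% of all students
remaining = 100.0 - contemporary          # 80% left after contemporary
jazz = 0.25 * remaining                   # 25% of the remaining students
hip_hop = 100.0 - contemporary - jazz     # everyone else

print(f"gold coins: {gold}")
print(f"hip-hop share: {hip_hop:.0f}%")
```

The model's answers (30 gold coins and 100%) differ from these values because the first chain treats the difference as a count of silver coins and the second never computes the share of students remaining after contemporary dance.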
