Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
Scaling language models to 175 billion parameters enables competitive few-shot performance without weight updates.
Can scaling up language models to 175 billion parameters enable them to perform diverse NLP tasks via in-context learning without any gradient-based fine-tuning?
Current language models require thousands of task-specific examples and gradient-based fine-tuning to achieve high performance, limiting their practical utility and adaptability. GPT-3 addresses this by scaling a transformer-based autoregressive model to 175 billion parameters, allowing it to perform tasks via "in-context learning"—conditioning on natural language instructions and a few demonstrations within the model's context window. This approach achieves competitive or state-of-the-art results on numerous benchmarks without any weight updates, demonstrating that model capacity alone can drive rapid task adaptation.
Paper Primer
The core mechanism is in-context learning: the model treats the input prompt as a sequence of examples, using its pre-trained pattern-recognition abilities to predict the completion of a new, unseen instance. By scaling the model size to 175 billion parameters, the authors show that the model can absorb a vast range of skills during pre-training, which it then accesses at inference time through simple text-based conditioning.
GPT-3 achieves state-of-the-art performance on several closed-book question answering tasks without fine-tuning.
On the TriviaQA dataset, the few-shot model achieves 71.2% accuracy, outperforming previous systems that relied on fine-tuning and external retrieval mechanisms. This is a 3.2-point improvement over the previous state of the art for open-domain systems.
The model demonstrates significant proficiency in rapid adaptation tasks, such as unscrambling words, performing multi-digit arithmetic, and using novel words in sentences after a single definition. These results suggest that larger models are more effective meta-learners, with the performance gap between zero-shot and few-shot settings widening as model capacity increases.
Why is this approach preferred over the standard fine-tuning paradigm?
Fine-tuning requires large, task-specific datasets for every new application, which is often impractical. Furthermore, fine-tuned models are prone to exploiting spurious correlations in narrow training distributions, whereas in-context learning allows for greater fluidity and generality across tasks.
What is the scope of this model's capabilities?
GPT-3 excels at tasks involving factual knowledge, translation, and pattern completion, but it still struggles with certain reading comprehension datasets and natural language inference tasks like ANLI, where performance remains modest.
Researchers can now treat language models as general-purpose task solvers by simply formatting prompts, shifting the focus from collecting massive labeled datasets to optimizing in-context demonstration strategies.
Introduction to Few-Shot Learning
Large language models can learn new tasks from a few examples without gradient updates.
Pre‑training followed by task‑specific fine‑tuning has driven recent NLP progress, yet it still demands thousands of labeled examples per task.
The model treats the prompt as a short program: it reads a natural‑language instruction plus a few examples, then continues generating outputs that obey the demonstrated pattern.
Fine‑tuning: collect $10{,}000$ examples, run gradient descent for many epochs → model parameters change.
In‑context: embed $5$ examples in the prompt (≈ 50 tokens) and run a single forward pass → no weight updates.
Result: the in‑context approach requires orders of magnitude less data and computation at test time.
The contrast highlights why eliminating gradient updates matters: data collection and compute cost drop dramatically while still retaining useful performance.
Few‑shot: a handful of task demonstrations (typically $10$–$100$, as many as fit in the context window) are provided in the prompt; the model must infer the underlying rule from these examples alone.
Zero‑shot: the model receives only a textual instruction (no examples) and must perform the task solely from its pre‑trained knowledge.
Traditional adaptation: update the model’s weights on a task‑specific labeled dataset using gradient descent.
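The difference between the zero-, one-, and few-shot settings comes down to how many demonstrations are placed in the prompt. The sketch below is a minimal illustration under stated assumptions: the function name `build_prompt`, the `=>` separator, and the translation pairs are chosen for this example and are not the exact prompts used by the authors.

```python
from typing import List, Tuple


def build_prompt(instruction: str, demos: List[Tuple[str, str]], query: str, k: int) -> str:
    """Assemble a prompt that contains the first k demonstrations.

    k = 0 gives a zero-shot prompt, k = 1 one-shot, k > 1 few-shot.
    No model weights are involved; only the text changes.
    """
    if k < 0 or k > len(demos):
        raise ValueError("k must be between 0 and len(demos)")
    lines = [instruction]
    for source, target in demos[:k]:
        lines.append(f"{source} => {target}")
    # The final line leaves the target blank so the model completes it.
    lines.append(f"{query} =>")
    return "\n".join(lines)


# Hypothetical demonstrations, used only to show the layout.
demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
zero_shot = build_prompt("Translate English to French:", demos, "peppermint", k=0)
few_shot = build_prompt("Translate English to French:", demos, "peppermint", k=2)
print(few_shot)
```

With `k=2` the printed prompt has four lines: the instruction, two completed demonstrations, and the open query `peppermint =>`.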
**Figure 1.1:** Language model meta-learning. During unsupervised pre-training, a language model develops a broad set of skills and pattern recognition abilities. It then uses these abilities at inference time to rapidly adapt to or recognize the desired task. We use the term “in-context learning” to describe the inner loop of this process, which occurs within the forward-pass upon each sequence. The sequences in this diagram are not intended to be representative of the data a model would see during pre-training, but are intended to show that there are sometimes repeated sub-tasks embedded within a single sequence.
**Figure 2.1.** Zero-shot, one-shot and few-shot, contrasted with traditional fine-tuning. The panels above show four methods for performing a task with a language model – fine-tuning is the traditional method, whereas zero-, one-, and few-shot, which we study in this work, require the model to perform the task with only forward passes at test time. We typically present the model with a few dozen examples in the few shot setting. Exact phrasings for all task descriptions, examples and prompts can be found in Appendix G.
Despite strong gains, GPT‑3 still struggles on several benchmarks (e.g., ANLI, RACE, QuAC) and can be affected by data contamination from web‑scale pre‑training corpora.
The key shift is moving from gradient‑based adaptation (fine‑tuning) to in‑context conditioning, enabling task flexibility with only a few textual examples.
Model Architecture and Training
We define the evaluation spectrum and the scaling‑law principle that predicts performance across model sizes.
Fine‑tuning delivers strong benchmark scores but forces a new large labeled dataset for every task and can overfit to spurious patterns, making it costly and brittle for task‑agnostic use.
Model loss improves predictably as a smooth power‑law function of model size, data quantity, and compute, so we can extrapolate performance without training every intermediate scale.
Illustrative toy law: L = 0.5 · (N·D)^{‑0.07}, with parameter count N and token count D. Compute N·D: 2 M × 10 B = 2 × 10^{16}; 4 M × 10 B = 4 × 10^{16}; 8 M × 10 B = 8 × 10^{16}.
Raise each product to the –0.07 power: (2 × 10^{16})^{‑0.07} ≈ 0.072, (4 × 10^{16})^{‑0.07} ≈ 0.069, (8 × 10^{16})^{‑0.07} ≈ 0.066.
Multiply by the prefactor 0.5: losses ≈ 0.036, 0.034, 0.033 respectively.
Doubling N multiplies the loss by 2^{‑0.07} ≈ 0.953, a reduction of roughly 5 % each step, matching the observed smooth decay.
The example shows that even a modest increase in parameters yields a predictable loss drop, confirming why a few trained points can forecast the whole scaling curve.
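The worked example can be reproduced, and the extrapolation claim checked, with a few lines of code. This is a sketch under assumptions: the prefactor 0.5 and exponent -0.07 are the illustrative constants from the example (not values fitted by the authors), and `fit_power_law` is a hypothetical helper that fits a straight line in log-log space.

```python
import math

PREFACTOR = 0.5   # illustrative constant from the worked example
EXPONENT = -0.07  # illustrative exponent from the worked example


def loss(n_params: float, n_tokens: float) -> float:
    """Toy scaling law: L = PREFACTOR * (N * D) ** EXPONENT."""
    return PREFACTOR * (n_params * n_tokens) ** EXPONENT


def fit_power_law(xs, ys):
    """Least-squares line in log-log space; returns (prefactor, exponent)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    slope = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sum((a - mx) ** 2 for a in lx)
    return math.exp(my - slope * mx), slope


tokens = 10e9
sizes = [2e6, 4e6, 8e6]
losses = [loss(n, tokens) for n in sizes]

# Fit on the three "trained" points, then extrapolate to a 16 M model.
prefactor, exponent = fit_power_law([n * tokens for n in sizes], losses)
predicted = prefactor * (16e6 * tokens) ** exponent
print(losses, predicted)
```

Because a power law is a straight line in log-log space, three points are enough to recover the exponent and forecast the next model size, which is the sense in which a few trained points can predict the rest of the curve.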
Why does a simple power‑law capture loss across orders of magnitude rather than a more complex curve?
The paper does not derive the power law from first principles; it is an empirical fit. The trend first reported in [KMH+20] is observed to continue for roughly two further orders of magnitude of compute with only small deviations, and that empirical regularity is what justifies extrapolating from a few trained points.
Our transformer layers interleave standard dense attention with locally banded sparse attention, a pattern borrowed from the Sparse Transformer, which reduces quadratic cost while preserving long‑range connectivity.
**Figure 1.2:** Larger models make increasingly efficient use of in-context information. We show in-context learning performance on a simple task requiring the model to remove random symbols from a word, both with and without a natural language task description (see Sec. 3.9.2). The steeper “in-context learning curves” for large models demonstrate improved ability to learn a task from contextual information. We see qualitatively similar behavior across a wide range of tasks.
**Table 2.1.** Sizes, architectures, and learning hyper-parameters (batch size in tokens and learning rate) of the models which we trained. All models were trained for a total of 300 billion tokens.
**Table 2.2:** Datasets used to train GPT-3. “Weight in training mix” refers to the fraction of examples during training that are drawn from a given dataset, which we intentionally do not make proportional to the size of the dataset. As a result, when we train for 300 billion tokens, some datasets are seen up to 3.4 times during training while other datasets are seen less than once.
**Figure 2.2:** Total compute used during training. Based on the analysis in Scaling Laws For Neural Language Models [KMH+20] we train much larger models on many fewer tokens than is typical. As a consequence, although GPT-3 3B is almost 10x larger than RoBERTa-Large (355M params), both models took roughly 50 petaflop/s-days of compute during pre-training. Methodology for these calculations can be found in Appendix D.
Performance Scaling Results
Key scaling trends and benchmark gains demonstrate GPT‑3’s broad in‑context abilities.
Validation loss follows a power‑law $L = 2.57\cdot C^{-0.048}$ across three decades of compute.
Figure 3.1 (scaling of performance with compute) shows the fitted curve and the data points.
**Figure 3.1:** Smooth scaling of performance with compute. Performance (measured in terms of cross-entropy validation loss) follows a power-law trend with the amount of compute used for training. The power-law behavior observed in [KMH+20] continues for an additional two orders of magnitude with only small deviations from the predicted curve. For this figure, we exclude embedding parameters from compute and parameter counts.
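To make the fitted exponent concrete, the effect of a tenfold increase in compute follows directly from the formula $L = 2.57\cdot C^{-0.048}$. This is a worked step added for illustration, not a number reported by the authors:

$$
\frac{L(10C)}{L(C)} = \frac{2.57\,(10C)^{-0.048}}{2.57\,C^{-0.048}} = 10^{-0.048} \approx 0.895
$$

Each additional order of magnitude of compute therefore lowers the validation loss by roughly 10.5 %, and three orders of magnitude lower it by about 28 % ($10^{-0.144} \approx 0.718$).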
**Figure 1.3.** Aggregate performance for all 42 accuracy-denominated benchmarks. While zero-shot performance improves steadily with model size, few-shot performance increases more rapidly, demonstrating that larger models are more proficient at in-context learning. See Figure 3.8 for a more detailed analysis on SuperGLUE, a standard NLP benchmark suite.
On the Penn Tree Bank, the 175 B‑parameter model achieves a zero‑shot perplexity of 20.50, a 15‑point improvement over the previous state of the art.
On LAMBADA, zero‑shot accuracy reaches 76 %, an 8 % absolute gain over the prior state of the art, and few‑shot accuracy reaches 86.4 %, more than 18 % above it.
HellaSwag scores 78.1 % in one‑shot and 79.3 % in few‑shot, surpassing a fine‑tuned 1.5 B model (75.4 %) while still trailing the overall SOTA (85.6 %).
StoryCloze zero‑shot accuracy is 83.2 % and few‑shot 87.7 %, still 4.1 % below the fine‑tuned BERT‑based SOTA.
Closed‑book QA results: TriviaQA reaches 71.2 % few‑shot (up 3.2 % over one‑shot), WebQuestions climbs to 41.5 % few‑shot, and Natural Questions to 29.9 % few‑shot, each scaling smoothly with model size.
**Figure 3.2:** On LAMBADA, the few-shot capability of language models results in a strong boost to accuracy. GPT-3 2.7B outperforms the SOTA 17B parameter Turing-NLG [Tur20] in this setting, and GPT-3 175B advances the state of the art by 18%. Note zero-shot uses a different format from one-shot and few-shot as described in the text.
**Figure 3.3:** On TriviaQA GPT-3's performance grows smoothly with model size, suggesting that language models continue to absorb knowledge as their capacity increases. One-shot and few-shot performance make significant gains over zero-shot behavior, matching and exceeding the performance of the SOTA fine-tuned open-domain model, RAG [LPP+20].
**Figure 3.4:** Few-shot translation performance on 6 language pairs as model capacity increases. There is a consistent trend of improvement across all datasets as the model scales, as well as a tendency for translation into English to be stronger than translation from English.
**Figure 3.5:** Zero-, one-, and few-shot performance on the adversarial Winogrande dataset as model capacity scales. Scaling is relatively smooth with the gains to few-shot learning increasing with model size, and few-shot GPT-3 175B is competitive with a fine-tuned RoBERTa-large.
Performance follows power‑law scaling with compute.
Task-Specific Performance
Synthetic tasks reveal GPT‑3’s 100 % few‑shot accuracy on two‑digit addition and the limited ability of human readers to detect its generated text.
We evaluate GPT‑3 on a suite of synthetic benchmarks designed to probe arithmetic, lexical manipulation, analogy reasoning, and text generation.
GPT‑3 175B attains 100 % few‑shot accuracy on two‑digit addition.
The score is computed on 2,000 random instances of the task (Figure 3.10).
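An evaluation of this kind can be sketched as follows. The question template, the operand range, and the `oracle` stand-in are assumptions made for illustration; a real run would replace `oracle` with a call to the language model and would prepend few-shot demonstrations to each question.

```python
import random


def make_addition_problem(rng: random.Random, digits: int = 2):
    """Return (prompt, answer) for two operands drawn uniformly from [0, 10**digits)."""
    a = rng.randrange(0, 10 ** digits)
    b = rng.randrange(0, 10 ** digits)
    return f"Q: What is {a} plus {b}?\nA:", str(a + b)


def exact_match_accuracy(predictions, answers) -> float:
    """Fraction of predictions that equal the reference string exactly."""
    correct = sum(p.strip() == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def oracle(prompt: str) -> str:
    """Stand-in for a model: parses the two operands and adds them."""
    words = prompt.split()
    return str(int(words[3]) + int(words[5].rstrip("?")))


rng = random.Random(0)
problems = [make_addition_problem(rng) for _ in range(2000)]
predictions = [oracle(prompt) for prompt, _ in problems]
accuracy = exact_match_accuracy(predictions, [answer for _, answer in problems])
```

Exact string match is a strict metric: an answer with an extra digit or a trailing word counts as wrong, which is why a score of 100 % on 2,000 instances is a strong result.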
**Figure 3.10:** Results on all 10 arithmetic tasks in the few-shot settings for models of different sizes. There is a significant jump from the second largest model (GPT-3 13B) to the largest model (GPT-3 175B), with the latter being able to perform reliably accurate 2 digit arithmetic, usually accurate 3 digit arithmetic, and give correct answers a significant fraction of the time on 4-5 digit arithmetic, 2 digit multiplication, and compound operations. Results for one-shot and zero-shot are shown in the appendix.
**Figure 3.11:** Few-shot performance on the five word scrambling tasks for different sizes of model. There is generally a smooth improvement with model size although the random insertion task shows an upward slope of improvement with the 175B model solving the task the majority of the time. Scaling of one-shot and zero-shot performance is shown in the appendix. All tasks are done with $K = 100$.
**Figure 3.12:** Zero-, one-, and few-shot performance on SAT analogy tasks, for different sizes of model. The largest model achieves 65% accuracy in the few-shot setting, and also demonstrates significant gains to in-context learning which are not present in smaller models.
**Figure 3.16.** Representative GPT-3 completions for the few-shot task of using a new word in a sentence. Boldface is GPT-3's completions, plain text is human prompts. In the first example both the prompt and the completion are provided by a human; this then serves as conditioning for subsequent examples where GPT-3 receives successive additional prompts and provides the completions. Nothing task-specific is provided to GPT-3 other than the conditioning shown here.
Data Contamination Analysis
We examine how benchmark overlap can inflate results and how to detect it.
Large language models learn to solve new tasks by conditioning on a few examples (in‑context learning) instead of updating weights.
When a benchmark’s test examples already appear in the pre‑training corpus, the model can simply recall them instead of solving the task.
**Figure 4.1:** GPT-3 training curves. We measure model performance during training on a deduplicated validation split of our training distribution. Though there is some gap between training and validation performance, the gap grows only minimally with model size and training time, suggesting that most of the gap comes from a difference in difficulty rather than overfitting.
**Figure 4.2:** Benchmark contamination analysis. We constructed cleaned versions of each of our benchmarks to check for potential contamination in our training set. The x-axis is a conservative lower bound for how much of the dataset is known with high confidence to be clean, and the y-axis shows the difference in performance when evaluating only on the verified clean subset. Performance on most benchmarks changed negligibly, but some were flagged for further review. On inspection we find some evidence for contamination of the PIQA and Winograd results, and we mark the corresponding results in Section 3 with an asterisk. We find no evidence that other benchmarks are affected.
Overall, the authors find that while a non‑trivial fraction of benchmark examples are flagged as overlapping, the resulting performance differences are typically negligible, indicating that GPT‑3’s large data scale mitigates severe memorization.
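The overlap check discussed in this section can be sketched as an n-gram lookup. This is a simplified illustration, not the authors' pipeline: the function names are hypothetical, whitespace tokenization is an assumption, and the window size `n=8` is chosen only for the example.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """All word n-grams in text; empty if the text has fewer than n words."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(training_docs: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram that occurs anywhere in the training documents."""
    index: Set[NGram] = set()
    for doc in training_docs:
        index |= ngrams(doc, n)
    return index


def is_dirty(example: str, index: Set[NGram], n: int) -> bool:
    """Flag an example if it shares at least one n-gram with the training data."""
    return not ngrams(example, n).isdisjoint(index)


def split_clean_dirty(examples: List[str], index: Set[NGram], n: int):
    clean = [e for e in examples if not is_dirty(e, index, n)]
    dirty = [e for e in examples if is_dirty(e, index, n)]
    return clean, dirty


def relative_difference(clean_score: float, overall_score: float) -> float:
    """Relative change when scoring only the clean subset."""
    return (clean_score - overall_score) / overall_score


# Made-up documents, used only to exercise the functions.
train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
index = build_index(train, n=8)
examples = [
    "he said the quick brown fox jumps over the lazy dog",
    "a completely unrelated sentence about language models and their evaluation methods",
]
clean, dirty = split_clean_dirty(examples, index, n=8)
```

A flagged example is only a candidate for contamination. As the section notes, overlap with a shared source passage does not imply that the question or answer was seen during training, so the relative difference on the clean subset is the more informative number.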
Methodological Limitations
We examine overlap metric pitfalls and compute estimates, highlighting their limitations.
Our overlap metric flags many false positives on datasets that contain background information (e.g., SQuAD) or on examples shorter than eight words, which we otherwise filter out. This inflates the “dirty” count without reflecting true memorization. Consequently, the metric can be misleading for such corpora.
DROP illustrates a failure mode: 94 % of its examples are marked dirty, yet the required information resides in the provided passage, not in the question‑answer pair. Our audit confirmed that matching training documents contain only the source passage, never the questions or answers. The modest performance drop is therefore attributed to a slight distribution shift in the remaining 6 % of examples.
When estimating total training compute we make three simplifying assumptions. First, we ignore the attention operation because it accounts for less than ten percent of the overall compute for the models we study. Second, for encoder‑decoder architectures such as T5 only half of the parameters are active per token. Third, each active parameter performs one addition and one multiplication in the forward pass, and we multiply the resulting forward cost by three to capture the backward pass.
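The three assumptions above reduce to a short formula: 2 FLOPs per active parameter per token in the forward pass, multiplied by 3 to cover the backward pass, then multiplied by parameter and token counts. The sketch below encodes that arithmetic; the function names are hypothetical, and the 175e9 parameter count is a round figure used for illustration.

```python
SECONDS_PER_DAY = 86_400
FLOPS_PER_PFLOPS_DAY = 1e15 * SECONDS_PER_DAY  # 8.64e19 FLOPs


def training_flops(n_params: float, n_tokens: float, active_fraction: float = 1.0) -> float:
    """Estimate total training FLOPs, ignoring attention.

    active_fraction is 1.0 for decoder-only models and 0.5 for
    encoder-decoder models such as T5, where half the parameters
    are active per token.
    """
    forward_flops_per_param = 2  # one multiply and one add
    backward_multiplier = 3      # forward plus backward costs about 3x forward
    flops_per_param_per_token = forward_flops_per_param * backward_multiplier * active_fraction
    return flops_per_param_per_token * n_params * n_tokens


def to_pflops_days(flops: float) -> float:
    return flops / FLOPS_PER_PFLOPS_DAY


flops = training_flops(n_params=175e9, n_tokens=300e9)
print(f"{flops:.3g} FLOPs = {to_pflops_days(flops):.0f} petaflop/s-days")
```

For a 175e9-parameter decoder-only model trained on 300 billion tokens this gives about 3.15e23 FLOPs, or roughly 3,650 petaflop/s-days.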
**Table.** Performance metrics across various datasets, comparing total, dirty, and clean data splits, including accuracy/F1/BLEU scores and the relative difference between clean and overall performance.
**Table D.1.** Starting from the right hand side and moving left, we begin with the number of training tokens that each model was trained with. Next we note that since T5 uses an encoder-decoder model, only half of the parameters are active for each token during a forward or backwards pass. We then note that each token is involved in a single addition and a single multiply for each active parameter in the forward pass (ignoring attention). Then we add a multiplier of 3x to account for the backwards pass (as computing both $\frac{\partial params}{\partial loss}$ and $\frac{\partial acts}{\partial loss}$ uses a similar amount of compute as the forwards pass). Combining the previous two numbers, we get the total flops per parameter per token. We multiply this value by the total training tokens and the total parameters to yield the number of total flops used during training. We report both flops and petaflop/s-days (each of which is 8.64e+19 flops).
**Table E.1:** Participant details and article lengths for each experiment to evaluate human detection of ~ 200 word model generated news articles. Participants were excluded due to internet check fails.
**Figure E.1:** Participants spend more time trying to identify whether each news article is machine generated as model size increases. Duration on the control model is indicated with the dashed line. Line of best fit is a linear model on a log scale with 95% confidence intervals.
**Table E.2:** Participant details and article lengths for the experiments investigating human detection of ~ 500 word model generated news articles. Participants were excluded due to internet check fails.
Model Limitations
This section outlines performance gaps and scaling limits revealed by the evaluation table.
The comprehensive scores in Table H.1 expose clear performance gaps: even the 175 B model still falls short on several reasoning benchmarks, and gains diminish for many tasks as model size grows.
**Figure G.49.** Formatted dataset example for Arithmetic 4D+
**Figure G.50:** Formatted dataset example for Arithmetic 5D-
Across zero‑, one‑, and few‑shot regimes the table reveals that tasks requiring deeper world knowledge (e.g., Winograd, ARC‑Challenge) improve only modestly, indicating a ceiling that scaling alone does not overcome.
Related Work
Prior studies on large language models and prompting are surveyed.
Brown et al. (2020) introduced GPT‑3, a 175 B‑parameter model that popularized prompting large language models with example inputs.
Kaplan et al. (2020) and Hoffmann et al. (2022) derived empirical scaling laws that relate model size, data quantity, and compute budget to language‑modeling loss.
Prompt‑engineering techniques such as chain‑of‑thought (Wei et al., 2022) and self‑consistency (Wang et al., 2022) aim to improve reasoning reliability without updating model weights.
Parameter‑efficient fine‑tuning methods, including adapters (Houlsby et al., 2019) and LoRA (Hu et al., 2021), add lightweight modules to pretrained models as alternatives to pure prompting.
Analyses of memorization (Carlini et al., 2022) and task contamination (Lee et al., 2022) have highlighted privacy and evaluation concerns for large pretrained models.
The in‑context learning paradigm continues to drive research on few‑shot adaptation and model scaling.
Task Formatting Details
Shows how each benchmark task is formatted as context‑question‑answer pairs.
This meta section catalogs the concrete textual layouts used for every benchmark in the study, ranging from multiple‑choice QA to simple arithmetic prompts.
Each benchmark entry is a plain‑text snippet that starts with a “Context” line, follows with a question, and then marks the correct answer and one or more distractors.
Read the context line and store it as the background paragraph.
Identify the question line (“Neither?”) and note that the answer is a three‑way true/false/neither choice.
Parse the “Correct Answer” line and record “Neither” as the gold label.
Parse the “Incorrect Answers” line, split on commas, and store “True” and “False” as distractors.
Emit the four lines in the exact order required by the task‑formatting specification.
This example shows how a three‑way true/false/neither task is encoded with an explicit “Neither” option, which many models mishandle if the prompt assumes only true/false.
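The parsing steps above can be sketched as a small round-trip parser. The layout is an assumption based on the description in this section: one field per line, a `→` separator after each label, and a comma-separated "Incorrect Answers" line. The example text is invented for illustration and is not taken from the benchmark.

```python
from dataclasses import dataclass
from typing import List

ARROW = "→"  # assumed separator between a field label and its value


@dataclass
class FormattedExample:
    context: str
    question: str
    correct: str
    incorrect: List[str]


def parse_example(snippet: str) -> FormattedExample:
    """Read the four labelled lines into a structured record."""
    fields = {}
    for line in snippet.strip().splitlines():
        if not line.strip():
            continue
        label, _, value = line.partition(ARROW)
        fields[label.strip()] = value.strip()
    return FormattedExample(
        context=fields["Context"],
        question=fields["Question"],
        correct=fields["Correct Answer"],
        incorrect=[a.strip() for a in fields["Incorrect Answers"].split(",")],
    )


def emit_example(example: FormattedExample) -> str:
    """Write the four lines back out in the same fixed order."""
    return "\n".join([
        f"Context {ARROW} {example.context}",
        f"Question {ARROW} {example.question}",
        f"Correct Answer {ARROW} {example.correct}",
        f"Incorrect Answers {ARROW} {', '.join(example.incorrect)}",
    ])


# Invented example text.
snippet = """Context → The museum opened a new wing last spring.
Question → The new wing is the largest in the museum. True, False, or Neither?
Correct Answer → Neither
Incorrect Answers → True, False"""

parsed = parse_example(snippet)
```

Keeping the gold label and the distractors as separate fields makes it straightforward to score each candidate answer independently and compare their likelihoods.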
Present the instruction line verbatim.
Show the scrambled token “asinoc” followed by an equals sign.
Place the target word “casino” after the equals sign, matching the required “Target Completion →” prefix.
Terminate the snippet with a newline so the model sees a single self‑contained example.
The format pairs a deterministic transformation (letter shuffle) with its solution, enabling a pure language‑modeling test that does not rely on external knowledge.
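The unscrambling layout can be sketched in a few lines. The instruction wording, the helper names, and the extra demonstration pair are assumptions for illustration; only the `asinoc = casino` pairing comes from the description above.

```python
import random

# Assumed instruction wording; the exact text is not given in this section.
INSTRUCTION = "Please unscramble the letters into a word, and write that word:"


def scramble(word: str, rng: random.Random) -> str:
    """Return the letters of word in a random order."""
    letters = list(word)
    rng.shuffle(letters)
    return "".join(letters)


def format_example(scrambled: str, target: str) -> str:
    """One self-contained demonstration, terminated by a newline."""
    return f"{INSTRUCTION}\n{scrambled} = {target}\n"


def format_query(scrambled: str) -> str:
    """The open item: everything up to the equals sign, target left blank."""
    return f"{INSTRUCTION}\n{scrambled} ="


def is_valid_pair(scrambled: str, target: str) -> bool:
    """A scramble must be a permutation of the target's letters."""
    return sorted(scrambled) == sorted(target)


demo = format_example("asinoc", "casino")
rng = random.Random(0)
query = format_query(scramble("language", rng))
prompt = demo + query
```

Because the target is fully determined by the input letters, scoring needs only an exact string comparison between the model's completion and the target word.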
**Figure G.2.** Formatted dataset example for ANLI R2
**Figure G.3.** Formatted dataset example for RACE-m. When predicting, we normalize by the unconditional probability of each answer as described in Section 2.
**Figure G.4.** Formatted dataset example for PIQA
**Figure G.7.** Formatted dataset example for ANLI R1
**Figure G.8:** Formatted dataset example for OpenBookQA. When predicting, we normalize by the unconditional probability of each answer as described in Section 2.
The table presents a context-based multiple-choice question. The **Context** describes a scene involving a woman and a girl making cake pops in a kitchen. The table lists one **Correct Answer** and three **Incorrect Answer** options regarding the sequence of actions taken by the individuals.
**Figure G.14:** Formatted dataset example for Winogrande. The 'partial' evaluation method we use compares the probability of the completion given a correct and incorrect context.
**Figure G.15.** Formatted dataset example for MultiRC. There are three levels within MultiRC: (1) the passage, (2) the questions, and (3) the answers. During evaluation, accuracy is determined at the per-question level, with a question being considered correct if and only if all the answers within the question are labeled correctly. For this reason, we use $K$ to refer to the number of questions shown within the context.
**Figure G.17.** Formatted dataset example for StoryCloze
**Figure G.18.** Formatted dataset example for CoQA
**Figure G.20.** Formatted dataset example for DROP
**Figure G.28.** Formatted dataset example for SQuADv2
**Figure G.31.** Formatted dataset example for RTE
**Figure G.36:** Formatted dataset example for De→En. This is the format for one- and few-shot learning for this and other language tasks; the format for zero-shot learning is “Q: What is the {language} translation of {sentence} A: {translation}.”
**Figure G.41.** Formatted dataset example for Ro→En
Broader Impacts
Broad evaluation reveals strong performance across diverse language tasks, highlighting both opportunities and responsibilities.
We evaluate the model on dozens of benchmarks spanning SuperGLUE reasoning tasks (COPA, RTE, WiC, WSC), SuperGLUE multi‑choice reading tasks (MultiRC, ReCoRD), adversarial NLI (ANLI R1‑R3), and a variety of synthetic and linguistic probes (2D+, 2D-, 3D+, 3D-, 4D+, 4D-, 5D+, 5D-, 2Dx, 1DC, Cycled Letters, Anagrams, Symbol Insertion, Reversed Words, SAT Analogies). On many of these tasks the model attains high accuracy and F1 scores, demonstrating a broad set of language capabilities. This breadth suggests both powerful downstream applications and heightened responsibility to mitigate misuse.
Conclusion
The extensive evaluation (Table H.1) confirms that large‑scale language models reliably acquire in‑context learning abilities across diverse tasks.
Training on massive, heterogeneous corpora endows models with a general‑purpose pattern‑recognition capability that can be harnessed through a few examples at inference time, without any weight updates. The numbers in Table H.1 demonstrate that this in‑context learning behavior scales smoothly: larger models and richer data achieve higher scores across most evaluated tasks, while even modestly sized models retain useful few‑shot performance. These findings suggest that future work should focus on improving data diversity and scaling efficiency rather than on task‑specific fine‑tuning.