GPT-4 Technical Report

GPT-4 is a large-scale, multimodal Transformer model demonstrating human-level performance on professional and academic benchmarks.

How does GPT-4 achieve predictable performance scaling and multimodal capability while maintaining safety through extensive red-teaming and RLHF?

Large language models often behave unpredictably as they scale, making it difficult to anticipate their capabilities or safety risks before training is complete. The authors developed a deep learning infrastructure that scales predictably, allowing them to forecast final loss and performance on specific tasks using models trained with up to 10,000x less compute. GPT-4 achieves human-level performance on a wide range of professional and academic exams, including passing a simulated bar exam in the top 10% of test-takers.

Paper Primer

GPT-4 is a multimodal Transformer pre-trained to predict the next token in a document, then fine-tuned using Reinforcement Learning from Human Feedback (RLHF). The core technical contribution is a predictable scaling stack that enables accurate performance forecasting from small-scale training runs.

GPT-4 achieves human-level performance on professional and academic benchmarks.

On a simulated Uniform Bar Exam, GPT-4 scores in the top 10% of test-takers, compared to the bottom 10% for GPT-3.5. On MMLU, a benchmark spanning 57 subjects, GPT-4 outperforms prior models in English, and on translated versions it surpasses the English-language state of the art in 24 of 26 tested languages.

The model demonstrates improved safety and factuality, reducing hallucinations by 19 percentage points over GPT-3.5 on internal evaluations. Safety is further reinforced by a model-assisted pipeline using rule-based reward models (RBRMs) to steer behavior during RLHF.

Why is predictable scaling a critical requirement for this project?

For training runs of this magnitude, extensive model-specific tuning is infeasible. Predictable scaling allows researchers to register performance expectations before training begins, increasing confidence in alignment and safety outcomes.

What is the scope of the visual input capability?

GPT-4 accepts arbitrarily interlaced text and image inputs, allowing it to process documents, diagrams, and screenshots with capabilities similar to its text-only performance.

Researchers can now treat large-scale model performance as a predictable engineering outcome rather than an emergent mystery, shifting the focus toward safety-aligned deployment and adversarial testing.

Introduction

Introducing GPT‑4, its multimodal scope, benchmark performance, and the predictable‑scaling challenge.

GPT‑4 is presented as a large‑scale multimodal Transformer that processes images and text to produce textual responses.

GPT‑4 combines massive model scale, multimodal input handling, and an integrated safety pipeline to achieve predictable performance growth.

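To see why context length is costly, consider the attention-score matrix for a 4,000-token context (illustrative arithmetic, not figures from the report). Compute the number of entries: $4{,}000 \times 4{,}000 = 16\text{M}$.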

Convert entries to memory: $16\text{M} \times 2\text{ bytes} = 32\text{ MB}$ (float16), doubled for two‑sided storage → $64\text{ MB}$; adding overhead gives $\approx128\text{ MB}$.

Scale context to $32\text{K}$: $32{,}000 \times 32{,}000 = 1.0\text{B}$ entries → $2\text{ bytes} \times 1.0\text{B} = 2\text{ GB}$ per matrix; with multiple heads and layers the total approaches $200\text{ GB}$.

Memory grows quadratically with context length, making naïve attention infeasible at GPT‑4 scale without specialized optimizations.
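The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that assumes a dense score matrix with 2 bytes per entry (float16); the function name `attention_matrix_bytes` is made up for illustration, and nothing here reflects how GPT‑4 actually stores attention.

```python
def attention_matrix_bytes(context_len: int, bytes_per_entry: int = 2) -> int:
    """Bytes needed for one dense context_len x context_len score matrix."""
    return context_len * context_len * bytes_per_entry


for n in (4_000, 32_000):
    megabytes = attention_matrix_bytes(n) / 1e6
    print(f"context={n:>6}: {megabytes:,.0f} MB per matrix")

# Growing the context 8x grows the matrix 8**2 = 64x.
ratio = attention_matrix_bytes(32_000) / attention_matrix_bytes(4_000)
print(f"32K vs 4K: {ratio:.0f}x more memory")
```

The per-matrix cost is only the starting point: multiplying by the number of heads and layers is what pushes the naive total far beyond a single accelerator's memory.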

GPT‑4 attains human‑level results on professional exams, reaching the top 10 % on a simulated bar exam, whereas its predecessor GPT‑3.5 falls in the bottom 10 %.

On the Massive Multitask Language Understanding (MMLU) benchmark, GPT‑4 surpasses prior models in English and dominates in 24 of 26 evaluated languages.

The project built deep‑learning infrastructure and optimization methods that behave predictably across a wide range of scales, allowing performance forecasts from models trained with only 0.1 % of GPT‑4’s compute.

Despite its strengths, GPT‑4 still hallucinates, has a limited context window, and does not learn from experience.

Safety challenges such as bias, disinformation, over‑reliance, privacy, cybersecurity, and proliferation are addressed through adversarial red‑teaming with domain experts and a model‑assisted RLHF safety pipeline.

GPT‑4’s multimodal design and human‑level benchmark scores mark a major step forward, while its safety work underscores the need for careful deployment.

Predictable Scaling

We show how a simple power‑law model lets us forecast loss and capability across compute scales.

Training a model the size of GPT‑4 leaves no room for per‑model hyperparameter search; we need a way to know how loss and capability will behave before the run finishes.

We fit a tiny family of power‑law curves to early, cheap runs and extrapolate them to the full compute budget, treating the offset as an irreducible loss floor.

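Fit $L(C)=a\,C^{b}+c$ to three small runs (illustrative numbers, not values from the report) by solving the system: $2.995=a\cdot0.001^{b}+c$, $2.585=a\cdot0.01^{b}+c$, $2.259=a\cdot0.1^{b}+c$.

Solution yields $a\approx1.0$, $b\approx-0.1$, $c\approx1.0$; the exponent is negative because loss falls as compute grows.

Extrapolate to $C=1$ (GPT‑4 scale, with compute normalized so that GPT‑4 is 1): $L(1)=1.0\cdot1^{-0.1}+1.0=2.0$. In the report, the corresponding fit predicted GPT‑4's final loss with high accuracy.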
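A fit of the form $L(C)=a\,C^{b}+c$ can be sketched in a few lines. The example below assumes three runs whose compute values are geometrically spaced, which admits a closed-form solution; the data are synthetic (generated from $a=1$, $b=-0.1$, $c=1$), and the function name is hypothetical. It is not the fitting procedure used in the report.

```python
import math


def fit_power_law_with_offset(points):
    """Solve L(C) = a * C**b + c exactly through three (C, L) points.

    Assumes the compute values are geometrically spaced
    (C2 / C1 == C3 / C2), which gives a closed-form solution.
    """
    (c1, l1), (c2, l2), (c3, l3) = sorted(points)
    r = c2 / c1
    if not math.isclose(c3 / c2, r, rel_tol=1e-9):
        raise ValueError("compute values must be geometrically spaced")
    # (l2 - l3) / (l1 - l2) equals r**b, so b follows from a log ratio.
    b = math.log((l2 - l3) / (l1 - l2)) / math.log(r)
    a = (l1 - l2) / (c1**b - c2**b)
    c = l3 - a * c3**b
    return a, b, c


def synthetic_loss(compute):
    """Made-up ground truth used only to generate example data."""
    return 1.0 * compute**-0.1 + 1.0


small_runs = [(C, synthetic_loss(C)) for C in (1e-3, 1e-2, 1e-1)]
a, b, c = fit_power_law_with_offset(small_runs)
predicted = a * 1.0**b + c  # extrapolate to normalized compute C = 1
print(f"a={a:.3f}, b={b:.3f}, c={c:.3f}, L(1)={predicted:.3f}")
```

With noisy real measurements and more than three runs, a least-squares fit would be the natural replacement for the exact solve.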

For capability, fit a power law to the mean log pass rate, $-\mathbb{E}[\log(\text{pass\_rate}(C))]=\alpha\,C^{-k}$, on the same small runs, obtaining (again illustratively) $\alpha\approx0.7$, $k\approx0.2$.

Predict $\text{pass\_rate}$ at $C=1$: $-\mathbb{E}[\log(\text{pass\_rate})]=0.7\cdot1^{-0.2}=0.7$, so the mean log pass rate is $-0.7$ and the geometric‑mean pass rate is $e^{-0.7}\approx0.50$.

The offset $c$ keeps the loss curve from unrealistically approaching zero, and the power law in mean log pass rate lets us forecast task performance from a handful of cheap runs.
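The capability forecast can be sketched the same way. The snippet below assumes the functional form $-\mathbb{E}[\log(\text{pass\_rate}(C))]=\alpha\,C^{-k}$ and fits it by ordinary least squares in log-log space; the data points are synthetic ($\alpha=0.7$, $k=0.2$) and the function name is hypothetical.

```python
import math


def fit_capability_power_law(compute, neg_mean_log_pass):
    """Least-squares fit of y = alpha * C**(-k) in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(y) for y in neg_mean_log_pass]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return math.exp(intercept), -slope  # (alpha, k)


compute = [1e-3, 1e-2, 1e-1]  # normalized so the full-scale run is 1
neg_mean_log_pass = [0.7 * C**-0.2 for C in compute]  # synthetic data

alpha, k = fit_capability_power_law(compute, neg_mean_log_pass)
forecast = alpha * 1.0**-k  # predicted -E[log pass_rate] at C = 1
pass_rate = math.exp(-forecast)  # geometric-mean pass rate
print(f"alpha={alpha:.3f}, k={k:.3f}, pass_rate={pass_rate:.3f}")
```

Because the metric is a mean of logs, exponentiating the forecast gives a geometric-mean pass rate, which always stays between 0 and 1.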

**Figure 2.** Performance of GPT-4 and smaller models. The metric is mean log pass rate on a subset of the HumanEval dataset. A power law fit to the smaller models (excluding GPT-4) is shown as the dotted line; this fit accurately predicts GPT-4's performance. The x-axis is training compute normalized so that GPT-4 is 1.

**Figure 3.** Performance of GPT-4 and smaller models on the Hindsight Neglect task. Accuracy is shown on the y-axis, higher is better. ada, babbage, and curie refer to models available via the OpenAI API [47].

Why does the loss model include an irreducible term $c$ instead of a pure power law?

Because even with infinite compute a model cannot achieve zero loss: data noise, optimization limits, and model‑capacity ceilings leave a non‑zero error floor. The $c$ term captures that floor, so extrapolations stay realistic and avoid the absurd prediction of vanishing loss.

Capabilities Overview

GPT‑4 reaches top‑percentile scores across a broad suite of exams and benchmarks.

GPT‑4 attains roughly the 90th percentile on the Uniform Bar Exam with or without vision: both GPT‑4 and GPT‑4 (no vision) score $298/400$ (~90th percentile), versus $213/400$ (~10th percentile) for GPT‑3.5.

Across 30+ benchmarks GPT‑4 consistently outperforms GPT‑3.5, and the multimodal (vision) component yields negligible gains on purely textual exams.

Benchmark Performance

GPT‑4 delivers human‑level scores on professional exams and dominates language benchmarks.

GPT‑4 passes the Uniform Bar Examination, ranking in the top 10 % of test‑takers.

Table 1 shows GPT‑4’s score exceeds the 90th‑percentile cutoff for the simulated bar exam.

**Table 1.** GPT performance on academic and professional exams. In each case, we simulate the conditions and scoring of the real exam. We report GPT-4's final score graded according to exam-specific rubrics, as well as the percentile of test-takers achieving GPT-4's score.

**Figure 4.** GPT performance on academic and professional exams. In each case, we simulate the conditions and scoring of the real exam. Exams are ordered from low to high based on GPT-3.5 performance. GPT-4 outperforms GPT-3.5 on most exams tested. To be conservative we report the lower end of the range of percentiles, but this creates some artifacts on the AP exams which have very wide scoring bins. For example although GPT-4 attains the highest possible score on AP Biology (5/5), this is only shown in the plot as 85th percentile because 15 percent of test-takers achieve that score.

**Table 2.** Performance of GPT-4 on academic benchmarks. We compare GPT-4 alongside the best SOTA (with benchmark-specific training) and the best SOTA for an LM evaluated few-shot. GPT-4 outperforms existing LMs on all benchmarks, and beats SOTA with benchmark-specific training on all datasets except DROP. For each task we report GPT-4's performance along with the few-shot method used to evaluate. For GSM-8K, we included part of the training set in the GPT-4 pre-training mix (see Appendix E), and we use chain-of-thought prompting [11] when evaluating. For multiple-choice questions, we present all answers (ABCD) to the model and ask it to choose the letter of the answer, similarly to how a human would solve such a problem.

**Figure 5.** Performance of GPT-4 in a variety of languages compared to prior models in English on MMLU. GPT-4 outperforms the English-language performance of existing language models [2, 3] for the vast majority of languages tested, including low-resource languages such as Latvian, Welsh, and Swahili.

GPT‑4 achieves human‑level scores on professional exams and dominates academic benchmarks.

Multimodal Capabilities

We sketch how a model like GPT‑4 could interleave image and text tokens to handle multimodal prompts; the report itself does not disclose architectural details.

GPT‑4 must answer tasks that combine photographs, diagrams, or screenshots with surrounding text. The pain point is that a language‑only model cannot see the visual content, so the system needs a unified representation that preserves the order of interleaved modalities.

One plausible design, used here purely for illustration, treats an image like a special token: a visual encoder turns the picture into a fixed‑size embedding, inserts that embedding into the token stream at the exact position where the image appears in the prompt, and then the standard transformer processes the mixed sequence exactly as it would pure text.

Tokenize the text → $[\,\text{"cat"},\ \text{"sleeping"}\,]$.

Insert the image embedding at the placeholder position → $[\,\text{"cat"},\ v,\ \text{"sleeping"}\,]$.

Add positional encodings: positions 1, 2, 3 become $p_1,p_2,p_3$ and are added to each vector.

Feed the three‑element sequence to the transformer; self‑attention computes pairwise affinities, so the word “cat” can attend to the visual embedding $v$ and vice‑versa.

The final hidden state for the word “sleeping” now contains information from both the preceding text token and the image, enabling a response like “The cat in the picture is sleeping.”

In this sketch the model reasons about visual content inside the same attention graph as the words: once the encoder has produced the image embedding, no separate vision‑only reasoning pass is needed.
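The steps above can be mimicked with toy vectors. This is a minimal sketch of the interleaving idea only: the embeddings, the `<image>` placeholder, and the additive position signal are all invented for illustration, and the report does not confirm that GPT‑4 works this way.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def interleave(
    segments: Sequence[str],
    text_table: Dict[str, Vector],
    image_embedding: Vector,
    placeholder: str = "<image>",
) -> List[Vector]:
    """Swap the placeholder for the image vector, then add toy positions."""
    sequence = [
        image_embedding if seg == placeholder else text_table[seg]
        for seg in segments
    ]
    # Toy positional signal: add the 1-based position to every component.
    return [[x + pos for x in vec] for pos, vec in enumerate(sequence, start=1)]


text_table = {"cat": [1.0, 0.0], "sleeping": [0.0, 1.0]}  # made-up embeddings
v = [0.5, 0.5]  # stand-in for a visual encoder's output

mixed = interleave(["cat", "<image>", "sleeping"], text_table, v)
for position, vector in enumerate(mixed, start=1):
    print(position, vector)
```

The resulting three-element sequence is what a transformer would attend over; the image vector sits at position 2, between the two words, rather than at the end.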

How does this differ from simply concatenating a pre‑computed image feature vector to the end of the text token list?

Appending the image vector would fix its position after all words, so the model could not use where the image sits in the prompt (for example, which of two images “the first picture” refers to, or what “the diagram above” points at). By inserting the embedding at the exact placeholder, the image token receives the same positional context as surrounding words, so attention can relate it to the text immediately before and after it.

**Figure 1.** A humorous depiction of a "Cable Mania" accessory that mimics the appearance of a legacy VGA connector while functioning as a Lightning cable adapter.

This table compares the responses of "Early GPT-4" and "Latest GPT-4" to a query regarding cheap cigarettes.

Despite these capabilities, GPT‑4 inherits the same reliability gaps as its predecessors: it can hallucinate facts, make reasoning slips, and remain unaware of events after September 2021.

**Table 4.** Example of GPT-4 giving correct and incorrect responses on TruthfulQA

Safety and Alignment

We detail how GPT‑4’s safety pipeline and expert red‑teaming reduce harmful behavior.

GPT‑4’s increased capabilities expose new failure modes, so we built a two‑pronged safety program: domain‑expert red‑teaming and a model‑assisted RLHF pipeline.

We ask specialists to deliberately provoke the model with high‑risk queries, then use their feedback to harden the system.

Expert A submits a safe request for cooking tips → model returns a normal answer (accept).

Expert A submits a disallowed request for synthesizing a toxin → model refuses in the desired style (refuse‑desired).

Expert B submits a disallowed request for weapon design → model produces a vague disclaimer (refuse‑undesired).

Expert C submits a safe medical advice query → model answers correctly (accept).

Aggregating the four outcomes yields 2 accepts, 1 refuse‑desired, 1 refuse‑undesired.

Even with many experts, a single “refuse‑undesired” case signals a gap that can be closed by targeted data.
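Tallying the four listed outcomes is a one-liner with `collections.Counter`. The record layout below (expert, prompt, outcome) is a hypothetical format chosen for this sketch, not the report's actual logging schema.

```python
from collections import Counter

# Hypothetical log entries mirroring the walkthrough above.
results = [
    ("A", "cooking tips", "accept"),
    ("A", "toxin synthesis", "refuse-desired"),
    ("B", "weapon design", "refuse-undesired"),
    ("C", "medical advice", "accept"),
]

tally = Counter(outcome for _, _, outcome in results)

# Cases that call for targeted follow-up data.
gaps = [
    (expert, prompt)
    for expert, prompt, outcome in results
    if outcome == "refuse-undesired"
]

print(dict(tally))
print("needs targeted data:", gaps)
```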

How does adversarial red‑teaming differ from ordinary stress testing?

Stress testing throws generic high‑volume inputs at the model; red‑teaming recruits domain experts who craft queries that target specific, high‑impact failure modes, exposing risks that generic loads never hit.

We fine‑tune GPT‑4 with human‑rated feedback, then add a second reward signal from rule‑based classifiers that enforce safety constraints.

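Suppose the policy drafts a response to a disallowed request, and the draft contains the requested content. The RBRM parses the rubric and matches the draft to category C (disallowed content).

It assigns a reward of –1 for category C (the ±1 values in this walkthrough are illustrative).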

The RLHF optimizer combines this –1 with the base reward model’s score, lowering the policy’s probability of generating that output.

In the next training iteration the policy produces a refusal in style A (“I cannot help with that.”).

The RBRM now classifies the output as category A and gives a +1 reward, reinforcing the correct behavior.

The RBRM acts like a safety “gate” that catches disallowed generations before they receive a positive RLHF signal.
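The reward combination in this walkthrough can be written down directly. The sketch below assumes a simple additive combination with a tunable weight; the category labels follow the walkthrough, while the ±1 values, the `weight` parameter, and the function name are assumptions rather than details from the report.

```python
# Illustrative RBRM rewards per category (values are assumptions).
RBRM_REWARD = {
    "A": 1.0,   # refusal in the desired style
    "C": -1.0,  # contains disallowed content
}


def combined_reward(base_reward: float, rbrm_category: str, weight: float = 1.0) -> float:
    """Add the rule-based signal to the human-preference reward model's score."""
    return base_reward + weight * RBRM_REWARD[rbrm_category]


# First iteration: the draft is fluent (base reward 0.5) but disallowed.
first = combined_reward(0.5, "C")
# Later iteration: the policy refuses in the desired style.
later = combined_reward(0.5, "A")
print(first, later)
```

Even when the base reward model likes a draft, the rule-based term pulls the total below zero for category C, which is the "gate" behavior described above.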

Why not rely solely on the human‑derived reward model instead of adding RBRMs?

Human‑derived rewards can be underspecified for rare unsafe scenarios; RBRMs encode explicit safety rules that target those edge cases directly, making it less likely that the policy receives accidental positive reinforcement on disallowed content.

**Figure 6.** Performance of GPT-4 on nine internal adversarially-designed factuality evaluations. Accuracy is shown on the y-axis, higher is better. An accuracy of 1.0 means the model's answers are judged to be in agreement with human ideal responses for all questions in the eval. We compare GPT-4 to three earlier versions of ChatGPT [64] based on GPT-3.5; GPT-4 improves on the latest GPT-3.5 model by 19 percentage points, with significant gains across all topics.

**Figure 8.** Left: Calibration plot of the pre-trained GPT-4 model on a subset of the MMLU dataset. On the x-axis are bins according to the model’s confidence (logprob) in each of the A/B/C/D choices for each question; on the y-axis is the accuracy within each bin. The dotted diagonal line represents perfect calibration. Right: Calibration plot of the post-trained GPT-4 model on the same subset of MMLU. The post-training hurts calibration significantly.

**Figure 9.** Rate of incorrect behavior on sensitive and disallowed prompts. Lower values are better. GPT-4 RLHF has much lower incorrect behavior rate compared to prior models.

**Table 5.** Expert Red Teaming: Example prompt and completions from various models.

**Table.** Performance of various models on academic and professional exams.

Despite these gains, jailbreaks still succeed against GPT‑4, so we complement model‑level fixes with runtime monitoring and rapid‑iteration loops.

We also work with external researchers to broaden impact assessments and to develop future‑proof evaluation suites for emerging capabilities.

Benchmark Methodology

We detail how the exam benchmarks were assembled, prompted, scored, and inspected for contamination.

Recall that GPT‑4 is built for predictable scaling and high‑performance reasoning while safety is reinforced through iterative red‑team testing and RLHF.

A.1 Sourcing. We collected the most recent official past exams or commercially published practice exams (2022‑2023) and cross‑checked each source against the model’s training corpus to quantify any overlap.

A.2 Prompting: multiple‑choice. Each multiple‑choice section received a few‑shot prompt containing gold‑standard explanations; we sampled explanations at temperature 0.3 to obtain a letter answer. Holdout exams were evaluated once after iterating on methodology with a paired non‑holdout set.

A.3 Prompting: free‑response. Free‑response items were presented as plain instruction‑following prompts and sampled at temperature 0.6. Grading used official rubrics for AP, GRE, SAT essays and third‑party contractors for the remaining subjects.

A.4 Images. For text‑only models we inserted an “IMAGE:” tag with a dummy filename; multimodal models received the actual image embedded in the prompt. All free‑response images were transcribed to keep the prompt purely textual.

A.5 Scoring. Raw multiple‑choice accuracies were mapped to official exam scales (e.g., SAT scaled scores, GRE 130‑170) using publicly available conversion charts; percentiles were derived from the latest published distributions.

A.6 Codeforces rating. We ran each model on ten recent contests, giving ten attempts per problem, and iterated ELO adjustments until convergence. The average equilibrium rating across 100 simulations per contest is reported.
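One way to picture "iterating Elo adjustments until convergence" is as a fixed-point search: keep nudging the rating until the expected score against the field equals the observed score. The sketch below uses the standard Elo expected-score formula and assumes per-opponent win/loss outcomes; the step size `k`, the starting rating, and the pairwise framing are assumptions, since the summary does not specify them.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expected score of `rating` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def equilibrium_rating(opponents, scores, start=1500.0, k=400.0,
                       tol=1e-9, max_iter=10_000):
    """Repeat Elo-style updates until the rating stops moving."""
    rating = start
    for _ in range(max_iter):
        surplus = sum(
            s - expected_score(rating, o) for o, s in zip(opponents, scores)
        )
        step = k * surplus / len(opponents)
        rating += step
        if abs(step) < tol:
            break
    return rating


# Hypothetical contest: beat the two lower-rated entrants, lost to the rest.
opponents = [1000.0, 1200.0, 1400.0, 1600.0]
scores = [1.0, 1.0, 0.0, 0.0]
rating = equilibrium_rating(opponents, scores)
print(round(rating, 2))
```

At the equilibrium the expected and observed total scores match, so further updates leave the rating unchanged; averaging such equilibria over repeated simulations gives a single reported number.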

A.7 Model snapshot details. GPT‑4 multiple‑choice runs used the March 1 2023 snapshot; free‑response runs used the February 23 2023 snapshot. GPT‑3.5 evaluations employed the standard ChatGPT snapshot, and an earlier December 2022 GPT‑4 snapshot was used for the USABO semifinal.

A.8 Example few‑shot prompts. The appendix lists a full few‑shot prompt for AP Art History multiple‑choice items, illustrating the format of answer‑key, explanation, and final answer extraction used throughout the evaluation.

Contamination occurs when exam questions appear in the model’s training data, potentially inflating performance because the model has seen the answer beforehand.

**Table 10.** Contamination data for Exams (Details). Detailed contamination information on each of the exams tested are shown in this table, listed from most-to-least contaminated. Exams with both multiple choice questions (MCQ) and free-response questions (FRQ) are split into separate rows. For each set, we list the number of questions and fraction which are contaminated (appear in the training set). We then report GPT-4's performance (as percentage of max score) on the overall set, on the non-contaminated questions, and on only the contaminated set. The degradation (non-contaminated percent minus contaminated) is generally small and as often positive as negative, from which we conclude that contamination is not a substantive confounder on the overall results.
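A contamination check of this kind can be approximated with substring matching. The sketch below is one simple approach under stated assumptions: text is normalized by dropping everything except letters and digits, a few 50-character probes are sampled from each question, and a question is flagged if any probe occurs in a training document. The function names, sample count, and probe length are illustrative choices.

```python
import random
import re


def normalize(text: str) -> str:
    """Keep only letters and digits so formatting differences do not matter."""
    return re.sub(r"[^A-Za-z0-9]", "", text)


def is_contaminated(question, training_docs, n_samples=3, length=50, seed=0):
    """Flag a question if a sampled substring appears in any training doc."""
    rng = random.Random(seed)
    q = normalize(question)
    if not q:
        return False
    if len(q) <= length:
        probes = [q]  # short question: use it whole
    else:
        starts = [rng.randrange(len(q) - length + 1) for _ in range(n_samples)]
        probes = [q[s:s + length] for s in starts]
    corpus = [normalize(doc) for doc in training_docs]
    return any(probe in doc for probe in probes for doc in corpus)


docs = ["Intro text. What is the capital of France? Answer: Paris."]
seen = is_contaminated("What is the capital of France?", docs)
unseen = is_contaminated("Name the largest moon of Saturn.", docs)
print(seen, unseen)
```

Once each question carries a contaminated/clean flag, scores can be recomputed on the clean subset alone and compared with the overall score, which is the comparison Table 10 reports.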

System Card

GPT‑4 is a large multimodal model built for predictable scaling and integrated safety via red‑teaming and RLHF.

The abstract frames the system card as an audit of GPT‑4’s capabilities, safety challenges, and the mitigations OpenAI deployed. It also warns that the document contains potentially disturbing content.

Section 1 introduces large language models as pervasive tools and outlines GPT‑4’s training pipeline: pre‑training on massive web text followed by reinforcement learning from human feedback (RLHF) to align outputs with human preferences.

1.1 clarifies that the card focuses on two model releases—GPT‑4‑early (minimal mitigations) and GPT‑4‑launch (enhanced safety)—and explicitly excludes multimodal fine‑tuning and custom extensions.

Section 2 enumerates the broad safety challenge taxonomy: hallucinations, harmful content, representation bias, disinformation, dual‑use weaponization, privacy leakage, cybersecurity misuse, emergent risky behaviors, system‑of‑systems interactions, economic disruption, acceleration risk, and overreliance.

2.1 describes the dual evaluation strategy. Qualitative red‑teaming (expert “red‑team” probing) is paired with quantitative classifier‑based tests that flag hate speech, self‑harm advice, and illicit advice.

2.2 reports that GPT‑4‑launch scores 19 percentage points higher than GPT‑3.5 at avoiding open‑domain hallucinations and 29 percentage points higher at avoiding closed‑domain hallucinations, a substantial but still imperfect improvement.

2.3 lists the categories of harmful content observed in GPT‑4‑early: self‑harm advice, graphic erotic or violent material, harassing language, instructions for violent attacks, and facilitation of illegal content.

2.4 notes that GPT‑4 continues to amplify societal biases and can produce stereotypical or demeaning associations, especially for marginalized groups.

2.5 highlights the disinformation risk: the model can generate realistic, targeted narratives that rival human propagandists, making it a potent tool for misinformation campaigns.

2.6 explains dual‑use concerns: GPT‑4 can supply general information for nuclear, radiological, biological, and chemical weapon pathways, suggest vulnerable targets, and aid in chemical synthesis planning, though it cannot autonomously create novel biochemicals.

2.7 warns that the model can synthesize publicly available personal data, enabling attempts to identify individuals when combined with external sources.

2.8 finds that GPT‑4 aids social‑engineering drafts and vulnerability explanations but struggles with full exploit generation; its utility improves when supplied with detailed background on a target.

2.9 describes emergent “agentic” behaviors such as power‑seeking planning; ARC’s early tests found GPT‑4 ineffective at autonomous replication or resource acquisition without fine‑tuning.

2.10 reports that chaining GPT‑4 with external tools (search, embedding, chemistry databases) enables complex pipelines, exemplified by a chemistry‑focused workflow that could locate purchasable analogs of a drug.

2.11 surveys economic impacts: potential job displacement, wage pressure, unequal access to AI benefits, and the risk of reinforcing existing power structures.

2.12 discusses acceleration risk: rapid deployment may trigger a safety‑race, and expert forecasts suggest that delaying release or quiet communication could mitigate some of this pressure.

2.13 warns of overreliance: GPT‑4’s authoritative tone and improved steerability can cause users to trust incorrect outputs, especially when the model’s hedging cues are ignored.

Section 3 outlines the deployment preparation workflow: evaluation, model‑level mitigations, and system‑level safety measures were iterated from August 2022 onward.

3.1 details model‑level mitigations: pre‑training data filtering for erotic content, supervised fine‑tuning (SFT), reward‑model training, PPO reinforcement, and rule‑based reward models (RBRMs) that enforce refusal styles.

3.2 describes system safety: usage policies, automated classifiers, human reviewers, and enforcement actions (warnings, suspensions, bans) that together curb policy‑violating behavior.

4.1 reiterates the policy framework and monitoring pipeline, emphasizing that repeated violations trigger escalating sanctions.

4.2 explains how GPT‑4 itself was leveraged to bootstrap moderation taxonomies and generate labeled data, accelerating the development of new content classifiers.

The acknowledgements thank the expert red‑teamers, Microsoft partners, and internal safety staff for their contributions to the system card.

GPT‑4’s capabilities are impressive, yet its safety remains brittle: hallucinations, bias, disinformation, dual‑use, privacy leakage, and overreliance persist despite extensive mitigations.

Conclusion

This section lists contributors, their roles, and citation guidance.

We characterize GPT‑4 as a large multimodal model that attains human‑level performance on demanding professional and academic benchmarks, surpasses existing large language models on a broad suite of NLP tasks, and demonstrates that predictable scaling enables accurate loss and capability forecasts.

Increased capability introduces new risks, prompting extensive red‑team testing and alignment work to improve safety, though further effort remains necessary before broad deployment.

Please cite this work as “OpenAI (2023)”.

Pretraining was delivered by core contributors (Christopher Berner, Greg Brockman, Trevor Cai, David Farhi, Chris Hesse, Shantanu Jain, Kyle Kosic, Jakub Pachocki, Alex Paino, Mikhail Pavlov, Michael Petrov, Nick Ryder, Szymon Sidor, Nikolas Tezak, Phil Tillet, Amin Tootoonchian, Qiming Yuan, Wojciech Zaremba) and supported by dedicated compute‑cluster scaling, data, distributed‑training infrastructure, hardware‑correctness, optimization & architecture, and training‑run babysitting teams.

Long‑context work was led by Gabriel Goh (co‑lead) and Łukasz Kaiser (lead), with Ben Wang shaping attention architecture, Clemens Winter co‑leading, and a research group (Mo Bavarian, Gabriel Goh, Heewoo Jun, Łukasz Kaiser, Chak Ming Li, Ben Wang, Clemens Winter) plus kernel contributions from Phil Tillet.
