DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong

DeepSeek-R1 uses pure reinforcement learning to enable LLMs to develop advanced reasoning patterns without human-labeled traces.

How can we incentivize large language models to develop complex reasoning capabilities using reinforcement learning without relying on extensive supervised fine-tuning?

Large language models typically rely on human-annotated reasoning steps to solve complex problems, which limits their performance to human-provided patterns and introduces bias. The authors bypass supervised fine-tuning entirely, using Group Relative Policy Optimization (GRPO) to incentivize the model to discover its own reasoning strategies through trial and error against rule-based rewards. This approach enables the model to autonomously develop sophisticated behaviors like self-reflection and verification, achieving performance on math and coding benchmarks that surpasses models trained on human demonstrations.

Paper Primer

The core mechanism is a multi-stage reinforcement learning (RL) pipeline that treats reasoning as an emergent property of reward maximization. By providing only the final ground-truth answer and a structural format requirement, the model is forced to generate its own chain-of-thought to maximize its reward, effectively "discovering" how to reason to solve the task.

Pure RL training significantly boosts reasoning performance on verifiable tasks.

DeepSeek-R1-Zero's pass@1 on the AIME 2024 math benchmark increased from 15.6% to 77.9% after RL training, a nearly 5x improvement on a competition-level mathematics benchmark.

To address the readability and language-mixing issues inherent in the "Zero" model, the final DeepSeek-R1 model incorporates a multi-stage pipeline: cold-start data for human-aligned thinking, followed by RL, and finally supervised fine-tuning on mixed reasoning and general-purpose datasets to restore helpfulness and writing capabilities.

Why does the paper argue that supervised fine-tuning (SFT) might actually be detrimental to reasoning?

The authors hypothesize that human-provided reasoning traces are not always optimal and may constrain the model to replicate human cognitive biases, preventing it from exploring superior, non-human-like reasoning pathways.

What is the primary scope of this RL approach, and where does it currently struggle?

The method excels in verifiable domains like math and coding where rule-based rewards are objective. It struggles in open-ended tasks like creative writing where reliable reward models are difficult to construct and prone to reward hacking.

Paper Primer

Introducing DeepSeek‑R1‑Zero, an RL‑trained LLM that learns reasoning without human‑written chains of thought.

We propose DeepSeek‑R1‑Zero, a large language model trained solely with reinforcement learning that rewards correct final answers, thereby encouraging the model to invent its own reasoning strategies instead of copying human‑written chains of thought.

Instead of being taught step‑by‑step solutions, the model learns to “think” because the RL reward only cares about the final answer, so any intermediate process that improves that answer is automatically reinforced.

How does DeepSeek‑R1‑Zero differ from a standard RL‑fine‑tuned language model that still relies on a supervised‑fine‑tuning (SFT) pre‑stage?

Standard pipelines first train on massive human‑written data (SFT) and then apply RL on top, which constrains the model to the human‑provided reasoning style. DeepSeek‑R1‑Zero skips the SFT stage entirely; the RL objective alone shapes both the final answer and the intermediate reasoning, allowing the model to discover novel, non‑human‑like strategies that are still rewarded because they improve answer correctness.

The key shift is moving from heavy supervised‑fine‑tuning toward reinforcement‑learning‑driven reasoning, which unlocks emergent problem‑solving behaviors without human‑written chains of thought.

Incentivizing Reasoning via RL

GRPO replaces PPO to train a reasoning‑focused LLM efficiently.

Training large language models with reinforcement learning is notoriously expensive because traditional PPO requires a separate value network and per‑sample KL computation. The authors therefore adopt Group Relative Policy Optimization (GRPO) to cut both model and compute overhead.

GRPO treats a batch of sampled outputs as a voting panel: each output’s reward is compared to the group’s average, and the deviation becomes its advantage. This lets the algorithm rank samples without a learned value model.

For example, suppose a group of three sampled outputs receives rewards 5, 7, and 6. Mean $=\frac{5+7+6}{3}=6$; standard deviation $=\sqrt{\frac{(5-6)^2+(7-6)^2+(6-6)^2}{3}}=\sqrt{\frac{2}{3}}\approx0.82$.

Advantages: $A_1=(5-6)/0.82\approx-1.22$, $A_2=(7-6)/0.82\approx+1.22$, $A_3=(6-6)/0.82=0$.

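The normalized advantages sum to zero, so each update raises the probability of above‑average outputs and lowers that of below‑average ones instead of pushing every output in the same direction.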

Normalizing by the group’s standard deviation prevents a single outlier reward from dominating the update, which is the key stability benefit of GRPO.
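The worked example above can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `group_relative_advantages` and the small `eps` guard (which avoids division by zero when every reward in a group is identical) are assumptions added for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the group's mean and population std.

    `eps` is an illustrative guard for groups whose rewards are all equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std, matching the worked example
    return [(r - mu) / (sigma + eps) for r in rewards]


advantages = group_relative_advantages([5.0, 7.0, 6.0])
print([round(a, 2) for a in advantages])  # [-1.22, 1.22, 0.0]
```

Because the advantage depends only on rewards inside the group, no learned value model is needed to produce it.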

How does GRPO differ from standard PPO?

PPO learns a separate value network to estimate advantages, then clips the probability ratio per‑sample. GRPO skips the value network entirely; it derives advantages from the reward dispersion within a sampled group, and the clipping still applies to the ratio of new vs. old policy probabilities.
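To make the shared clipping step concrete, here is a hedged sketch of a clipped surrogate objective that takes group-relative advantages as input. It simplifies to one log-probability per output (sequence level), and the function name and the `clip_eps=0.2` value in the usage lines are illustrative assumptions rather than settings taken from the paper.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps):
    """Average clipped surrogate over a group of sampled outputs.

    logp_new / logp_old: log-probability of each output under the new and
    old policy (sequence-level simplification, assumed for illustration).
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the pessimistic (smaller) of the two candidate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# A ratio of e (about 2.72) with a positive advantage is capped at 1 + clip_eps.
print(clipped_surrogate([1.0], [0.0], [1.0], clip_eps=0.2))
```

With a negative advantage the unclipped term is the smaller one, so the penalty for raising the probability of a bad output is not capped.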

The reward signal is entirely rule‑based. An accuracy component grants a bonus when the final answer matches a deterministic ground truth, while a format component rewards the presence of an <answer> tag that encloses the solution.
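A rule-based reward of this kind might look like the sketch below. The reward magnitudes (1.0 each), the equal weighting of the two components, the exact-string comparison, and all function names are assumptions for illustration; the summary does not specify them.

```python
import re

# Matches the content of the first <answer>...</answer> pair.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(output: str) -> float:
    """1.0 if the output contains an <answer>...</answer> block."""
    return 1.0 if ANSWER_RE.search(output) else 0.0


def accuracy_reward(output: str, ground_truth: str) -> float:
    """1.0 if the extracted answer equals the ground truth exactly."""
    match = ANSWER_RE.search(output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0


def rule_based_reward(output: str, ground_truth: str) -> float:
    # Assumed combination: an unweighted sum of the two components.
    return accuracy_reward(output, ground_truth) + format_reward(output)


print(rule_based_reward("6 * 7 = 42 <answer> 42 </answer>", "42"))  # 2.0
```

A real checker for math answers would likely compare expressions symbolically rather than as strings.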

Training uses a learning rate of $3\times10^{-6}$, KL coefficient $\beta\!=\!0.001$, and rollout temperature $1$. For each question the policy samples $G\!=\!16$ outputs of up to $32{,}768$ tokens before step $8.2\,$k and $65{,}536$ tokens afterward, causing a noticeable jump in both performance and response length at that step.

**Figure 1.** (a) AIME accuracy of DeepSeek-R1-Zero during training. AIME takes a mathematical problem as input and a number as output, illustrated in Table 32. Pass@1 and Cons@16 are described in Supplementary D.1. The baseline is the average score achieved by human participants in the AIME competition. (b) The average response length of DeepSeek-R1-Zero on the training set during the RL process. DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time. Note that a training step refers to a single policy update operation.

**Figure 3.** Demonstration of PPO and our GRPO. GRPO foregoes the value model, instead estimating the advantages from group scores.

Model outputs must follow a two‑stage template: first a free‑form reasoning segment, then a final answer wrapped in <answer>…</answer>. This minimal constraint lets the policy discover its own reasoning strategies.

On the AIME 2024 benchmark the model’s pass@1 score jumps from $15.6\%$ to $77.9\%$ after RL. Applying self‑consistency decoding adds a further boost, confirming that the learned reasoning process is robust enough to benefit from multiple sampled answers.
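The two metrics can be sketched for a single question as follows: pass@1 is estimated as the fraction of sampled answers that are correct, and self-consistency takes a majority vote over the samples. The function names are hypothetical, and ties in the vote are broken by first occurrence, which is an assumption.

```python
from collections import Counter


def pass_at_1(samples, truth):
    """Fraction of sampled answers that match the ground truth."""
    return sum(s == truth for s in samples) / len(samples)


def consensus_correct(samples, truth):
    """1.0 if the majority-vote answer matches the ground truth."""
    majority, _ = Counter(samples).most_common(1)[0]
    return 1.0 if majority == truth else 0.0


samples = ["7", "7", "3", "7"]
print(pass_at_1(samples, "7"))          # 0.75
print(consensus_correct(samples, "7"))  # 1.0
```

Majority voting helps whenever the correct answer is the most frequent one, even if individual samples are often wrong.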

An intermediate checkpoint revealed the model spontaneously inserting the word “wait” during reflections—a clear “aha moment” that the policy had begun to adopt an anthropomorphic tone, signalling a qualitative shift in its internal reasoning dynamics.

The DeepSeek-R1 Pipeline

Details the multi‑stage DeepSeek‑R1 pipeline and its two RL training phases.

DeepSeek‑R1‑Zero already reasons well, but its outputs often mix languages and suffer from poor readability. To fix this we introduce a multi‑stage pipeline that first aligns the model’s thinking process with human‑like conversation, then refines writing quality through rejection sampling, SFT, and a second RL pass. This design directly tackles the mixing and readability issues while preserving reasoning strength.

We chain two RL stages with an intermediate rejection‑sampling + SFT step so the model first learns to think coherently and then learns to write cleanly.

How does this pipeline differ from a standard RL‑fine‑tuned language model?

Standard RL fine‑tuning applies a single reward to the final output, which forces the model to trade off reasoning versus readability in one step. DeepSeek‑R1 separates the objectives: the first RL stage focuses on reasoning and language consistency, while the second stage only refines helpfulness and safety on already‑filtered, well‑structured generations.

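As a toy example, take two prompts with two sampled continuations each: C₁₁ and C₁₂ answer the first prompt, C₂₁ and C₂₂ the second. Stage 1 RL scores each continuation with the combined reward; suppose C₁₁ scores 0.8, C₁₂ scores 0.5, C₂₁ scores 0.7, and C₂₂ scores 0.6.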

Rejection sampling keeps the higher‑scoring continuation for each prompt (C₁₁ and C₂₁) and discards the lower ones.

SFT fine‑tunes on the kept continuations together with a small set of non‑reasoning sentences, producing a model that now writes fluently.

Stage 2 RL evaluates the fine‑tuned model on helpfulness and harmlessness preference pairs, updating the policy for the final DeepSeek‑R1 checkpoint.

The two‑stage design lets the model first discover a good reasoning style without being penalized for style, then later polish the style without erasing the reasoning gains.
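The rejection-sampling step of this toy example can be sketched as keeping the highest-scoring continuation per prompt. The data layout and the name `rejection_sample` are assumptions for illustration; the actual pipeline also filters for correctness and readability.

```python
def rejection_sample(scored):
    """Keep the top-scoring continuation for each prompt.

    scored: dict mapping a prompt id to a list of (continuation, score).
    """
    return {
        prompt: max(candidates, key=lambda pair: pair[1])[0]
        for prompt, candidates in scored.items()
    }


scored = {
    "P1": [("C11", 0.8), ("C12", 0.5)],
    "P2": [("C21", 0.7), ("C22", 0.6)],
}
print(rejection_sample(scored))  # {'P1': 'C11', 'P2': 'C21'}
```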

Sample 16 outputs per question with temperature = 1 and max length = 32 768 tokens.

Compute three reward components for each output: helpfulness (pairwise), safety (pointwise), and language consistency (proportion of target‑language tokens). The GRPO advantage is not a reward component; it is derived afterwards from the rewards within each group of outputs.

Combine the components linearly into a single scalar reward.

Clip the probability ratio between the new and old policies using the GRPO clip ratio $\epsilon$ = 10 to stabilize updates.

Perform a PPO‑style update with learning rate = 3 × 10⁻⁶ and KL coefficient = 0.001.

Every 400 steps replace the reference model with the latest policy.

As an example, suppose a chain of thought contains 12 tokens, 9 of which are in the target language: numerator = 9, denominator = 12.

Reward = 9 / 12 = 0.75.

The model receives a higher language‑consistency reward for staying mostly in English.

The proportion reward penalizes even a few stray foreign tokens, encouraging the model to stay in the target language throughout the reasoning process.
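The proportion-based reward can be sketched as below. The language test is passed in as a predicate because the summary does not say how target-language tokens are detected; `looks_english` is a deliberately crude ASCII-only stand-in and is purely an assumption.

```python
def language_consistency_reward(tokens, is_target_language):
    """Fraction of tokens that the predicate judges to be in the target language."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if is_target_language(token))
    return hits / len(tokens)


def looks_english(token):
    # Crude illustrative stand-in: treat ASCII-only tokens as English.
    return token.isascii()


tokens = "so the answer is 因此 twelve because three 乘以 four equals 十二".split()
print(language_consistency_reward(tokens, looks_english))  # 0.75
```

Here 9 of the 12 whitespace-separated tokens pass the check, matching the 9 / 12 example.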

Reduce sampling temperature to 0.7 to obtain more coherent generations.

Continue training for a total of 1 700 steps, keeping all parameters from the first stage.

In the final 400 steps introduce preference‑based rewards derived from the helpful and safety reward models.

Combine rewards as Reward = `Reward_reasoning` + `Reward_general` + `Reward_language`.

Apply the same GRPO clipping and KL penalty as before.

Stop training once the overall reward plateaus, yielding the final DeepSeek‑R1 model.
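One possible reading of the schedule above is that the general (preference-based) term only contributes during the last 400 of the 1 700 steps. The sketch below encodes that reading; the function name, the 0-indexed step convention, and treating the general term as zero before it is introduced are all assumptions.

```python
def total_reward(step, r_reasoning, r_general, r_language,
                 total_steps=1700, preference_steps=400):
    """Sum the reward terms, enabling the general term only near the end.

    Steps are assumed to be 0-indexed, so the general reward is active for
    steps total_steps - preference_steps .. total_steps - 1.
    """
    use_general = step >= total_steps - preference_steps
    return r_reasoning + (r_general if use_general else 0.0) + r_language


print(total_reward(0, 1.0, 0.5, 0.75))     # 1.75
print(total_reward(1300, 1.0, 0.5, 0.75))  # 2.25
```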

**Figure 2.** The multi-stage pipeline of DeepSeek-R1. A detailed background on DeepSeek-V3 Base and DeepSeek-V3 is provided in Supplementary A.1. The models DeepSeek-R1 Dev1, Dev2, and Dev3 represent intermediate checkpoints within this pipeline.

Experimental Results

DeepSeek‑R1 shows large gains on reasoning benchmarks while modestly improving instruction following.

DeepSeek‑R1 improves AlpacaEval 2.0 LC‑winrate by 25 % over the baseline.

Table 3 shows the win‑rate rises from 59.0 (R1‑Zero) to 94.2 (R1‑Dev1), a 35.2‑point absolute gain, or roughly 60 % relative to the starting score.

Limitations and Failure Modes

We reflect on safety, limitations, and future directions of DeepSeek‑R1.

DeepSeek‑R1 trains language models with reinforcement learning so they generate reasoning steps autonomously, discovering strategies through trial‑and‑error rather than copying human‑written solutions.

With stronger reasoning abilities come heightened ethical concerns: the model can be coaxed by jailbreak prompts to produce dangerous instructions, and its improved reasoning makes those instructions more feasible to execute.

The authors’ safety analysis (Supplementary D.3) rates DeepSeek‑R1’s inherent safety as moderate—comparable to GPT‑4o—and notes that integrating the risk‑control system lifts the model to a superior safety tier.

Despite achieving state‑of‑the‑art results on reasoning benchmarks, DeepSeek‑R1 still has several capability gaps.

First, its structural‑output abilities and tool use (e.g., search engines, calculators) lag behind existing models; building an RL environment for such functionalities is a clear next step.

Second, token efficiency is imperfect: the model spends more tokens than necessary on easy questions (overthinking), so its allocation of tokens according to problem difficulty still has room to improve.

Third, the model is biased toward Chinese and English, causing language‑mixing artifacts when handling other languages; this stems from the base checkpoint’s training data composition.

Fourth, prompt engineering remains fragile: few‑shot prompting consistently harms performance, so users are advised to employ zero‑shot, direct instructions.

Finally, software‑engineering tasks see limited improvement because large‑scale RL incurs long evaluation times; future work will explore rejection sampling and asynchronous evaluation to close this gap.

Reward hacking occurs when a policy learns to game the reward model—maximizing the proxy signal without actually solving the intended task.

**Figure 6.** Reward hacking: the reward exhibits an increasing trend as the performance on CodeForces decreases for training.

Advancing toward robust, task‑agnostic reward models and integrating external tools (search engines, compilers, even laboratory equipment) are the most promising avenues to extend DeepSeek‑R1’s reasoning reach.

Background

Provides context on DeepSeek‑V3, the post‑training pipeline, and GRPO versus PPO.

DeepSeek‑V3 is a 671 B‑parameter open‑source LLM built on a Mixture‑of‑Experts design, activating only 37 B parameters per token. It was pretrained on 14.8 trillion tokens, then refined with supervised fine‑tuning and reinforcement learning, incorporating innovations such as Multi‑head Latent Attention and Multi‑Token Prediction to boost reasoning and coding performance.

The standard post‑training pipeline first applies supervised fine‑tuning (SFT) on curated input‑output pairs, then refines the model with reinforcement learning (RL) against a reward model. SFT supplies a strong task‑specific baseline, while RL reduces the need for exhaustive annotation by optimizing toward human‑aligned rewards.

GRPO replaces PPO’s value‑model‑based advantage estimation with a group‑wise advantage computed from sampled output rewards, eliminating the memory‑heavy critic. It also adds an unbiased estimate of the KL divergence from the reference policy directly to the loss, rather than folding a per‑token KL penalty into the reward as PPO does, which improves stability on long chain‑of‑thought tasks.
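For reference, the GRPO objective is commonly written as follows. The shorthand $\rho_i$ for the probability ratio is introduced here for compactness and is not notation from the summary; $G$, $\epsilon$, and $\beta$ are the group size, clip ratio, and KL coefficient mentioned elsewhere in this document.

```latex
\begin{aligned}
\mathcal{J}_{\mathrm{GRPO}}(\theta)
  &= \mathbb{E}_{q,\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}}
     \left[ \frac{1}{G} \sum_{i=1}^{G}
       \Big( \min\!\big( \rho_i A_i,\ \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i \big)
             - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right) \Big) \right],
  \qquad \rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)} \\[4pt]
\mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
  &= \frac{\pi_{\mathrm{ref}}(o_i \mid q)}{\pi_\theta(o_i \mid q)}
     - \log \frac{\pi_{\mathrm{ref}}(o_i \mid q)}{\pi_\theta(o_i \mid q)} - 1 \\[4pt]
A_i
  &= \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}
\end{aligned}
```

The second line is the non-negative estimator of the KL term that is added to the loss, and the third line is the group-relative advantage that stands in for a critic.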

**Figure 4.** Performance of PPO and GRPO on the MATH task.

**Figure 5.** Overview of our RL framework.

Training Infrastructure

Implementation specifics of the RL pipeline and reward model prompt.

The RL infrastructure is built as a decoupled, extensible pipeline that isolates each phase of training into its own module, enabling easy swapping of models and algorithms.

The Rollout Module streams prompts to many vLLM workers, each running the actor model; for DeepSeek‑V3 MoE we use expert parallelism across nodes and duplicate hotspot experts to balance load, while Multi‑Token Prediction (MTP) provides self‑speculative decoding that cuts the longest sample latency.

The Inference Module loads the reward model and reference model to compute model‑based rewards for each rollout sample.

The Rule‑based Reward Module applies deterministic checks (e.g., code execution, answer matching, format validation) via a unified interface; because it runs on CPU, its latency is hidden by an asynchronous scheduler that overlaps work with Rollout and Inference.

The Training Module loads the actor (and optional critic) to compute loss and update parameters, supporting algorithms such as PPO, GRPO, and DPO; it employs a three‑step data‑packing strategy—global length‑based sorting, per‑process Best‑Fit chunking, and equal‑chunk balancing—to minimize padding waste.
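The first two packing steps (length-based sorting, then Best-Fit chunking) resemble classic best-fit-decreasing bin packing. The sketch below illustrates that general technique only; the descending sort order, the fixed `capacity` per chunk, and the function name are assumptions, and the equal-chunk balancing step is not modeled.

```python
def best_fit_pack(lengths, capacity):
    """Pack sequence lengths into chunks of at most `capacity` tokens.

    Each length goes into the chunk with the least remaining room that
    still fits it; a new chunk is opened when none fits.
    """
    bins = []  # each entry: [remaining_capacity, [lengths...]]
    for length in sorted(lengths, reverse=True):
        if length > capacity:
            raise ValueError("sequence longer than chunk capacity")
        best = None
        for candidate in bins:
            if candidate[0] >= length and (best is None or candidate[0] < best[0]):
                best = candidate
        if best is None:
            best = [capacity, []]
            bins.append(best)
        best[0] -= length
        best[1].append(length)
    return [contents for _, contents in bins]


print(best_fit_pack([5, 4, 3, 2, 1, 1], capacity=8))  # [[5, 3], [4, 2, 1, 1]]
```

In this example both chunks are filled exactly, so no padding tokens would be needed.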

To further improve throughput, the DualPipe algorithm from DeepSeek‑V3 training is incorporated, providing efficient pipeline parallelism across stages.

After each module (except the Rule‑based Reward Module) finishes, its model instances are offloaded from VRAM to system memory or disk, freeing GPU memory for the next stage.

The reward‑model prompt instructs the evaluator to first generate its own answer, then compare both assistants’ answers to that reference, scoring helpfulness, relevance, conciseness, and creativity before emitting a single verdict label.

**Table 4.** Data Recipe

Data Recipe

Details of the RL and cold‑start data pipelines used for DeepSeek‑R1.

Section B.3 outlines the overall data recipe for reinforcement learning, focusing on the RL‑specific data (B.3.1) and the cold‑start CoT data (B.3.2).

The RL data comprise five categories: mathematics ($26\,\text{k}$ items, avg. $122$ prompt tokens), coding ($17\,\text{k}$ algorithm problems + $8\,\text{k}$ bug‑fixes), STEM ($22\,\text{k}$ multiple‑choice items, avg. $161$ tokens), logic ($15\,\text{k}$ items, avg. $420$ tokens), and a general helpfulness/harmlessness set ($66\,\text{k}$ + $12\,\text{k}$).

Mathematics questions span algebra, calculus, probability, and geometry; each requires a step‑by‑step reasoning trace ending in a numeric, expression, or equation answer, with a binary reward of $1$ for exact match.

Coding items include competitive‑programming prompts (dynamic programming, graphs, strings, data structures) and real‑world bug‑fixing tickets; the model must produce a correct, efficient program that passes hidden test cases.

STEM questions are multiple‑choice across physics ($15.5\%$), biology ($30.7\%$), chemistry ($46.5\%$), and other topics ($7.3\%$); reward is $1$ if the selected option matches the reference.

Logic items combine real‑world puzzles (brain teasers, classic logic) with synthetically generated code‑IO and puzzle tasks; all are evaluated automatically in a multiple‑choice format.

The general dataset supplies $66\,\text{k}$ helpfulness queries (creative writing, editing, QA, role‑play) and $12\,\text{k}$ harmlessness queries, scored by two reward models trained for a single epoch on $8192$‑token sequences.

Cold‑start data are a small collection of long chain‑of‑thought (CoT) traces used to fine‑tune the initial RL actor, driven by product goals to make responses feel more intuitive and engaging.

DeepSeek‑R1‑Zero tends to use the pronoun “we” or avoid first‑person pronouns, whereas DeepSeek‑R1 adopts “I”; this stylistic shift can increase user trust but may also create unwarranted expectations of human‑like intelligence.

Cold‑start CoT data follow a “comprehend → reason → reflect → verify” pattern, are written in first‑person, and are filtered for language consistency to avoid mixed‑language outputs.

Human annotators first rewrite raw reasoning traces into conversational style; an LLM then expands the corpus, and a second human verification round ensures quality and consistency.

The solution template requires the model to output a LaTeX‑formatted answer, preserving the original thought process and using \boxed{} for the final result.

For reasoning data we collect thousands of diverse prompts, generate multiple trajectories at temperature $1.0$, keep only correct and readable outputs, parse mathematical answers with SymPy, filter language‑mixing, and finally ask DeepSeek‑V3 to refine formatting and language.

Code data aggregate $5\,151$ Codeforces and $2\,504$ AtCoder problems; since original test cases are unavailable, DeepSeek‑V2.5 is prompted to synthesize candidate cases, which are then rigorously validated.

Example test‑case prompt: given a string *word* and integer *k*, define a “complete” substring and ask the model to return the count of such substrings, illustrating the pipeline used to generate reliable evaluation data.

Evaluation Examples

Details of test case generation and prompting used for evaluating DeepSeek‑V3.

The evaluation measures, for each test case, how many complete substrings of the given word appear and prints that integer.

For example, the input “2 igigee 2 aaabbbccc 3” yields the outputs “3” and “6”, respectively.
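A brute-force reference solution is useful for validating generated test cases. The sketch below assumes the usual formulation of this problem, which the summary does not spell out: a substring is complete if every character in it occurs exactly *k* times and adjacent characters differ by at most 2 alphabet positions. The input layout (test-case count first, then word and *k* pairs) is inferred from the example above.

```python
from collections import Counter


def count_complete_substrings(word, k):
    """Quadratic brute force over all substrings (assumed definition, see lead-in)."""
    n = len(word)
    count = 0
    for start in range(n):
        freq = Counter()
        for end in range(start, n):
            # Adjacent characters may differ by at most 2 positions.
            if end > start and abs(ord(word[end]) - ord(word[end - 1])) > 2:
                break
            freq[word[end]] += 1
            if freq[word[end]] > k:
                break  # extending further can never repair this
            if all(v == k for v in freq.values()):
                count += 1
    return count


def solve(raw):
    """Parse 't word k word k ...' and return one count per test case."""
    tokens = raw.split()
    cases = int(tokens[0])
    return [
        count_complete_substrings(tokens[1 + 2 * i], int(tokens[2 + 2 * i]))
        for i in range(cases)
    ]


print(solve("2 igigee 2 aaabbbccc 3"))  # [3, 6]
```

A quadratic checker like this is exactly the kind of naive solution that large adversarial inputs are meant to time out, while still serving as ground truth on small cases.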

To stress‑test implementations we generate diverse, adversarial inputs that are large enough to cause naïve solutions to exceed time limits.

We also evaluate DeepSeek‑V3 with few‑shot prompting on simple arithmetic, ensuring the model’s responses stay concise and correctly formatted.

Prompt example:

Reasoning Analysis

Supplementary examples and data details that contextualize the study.

The appendix opens with a trivial arithmetic check: $5 + 4 = 9$.

An illustrative Q&A shows the model solving “what is 1 + 2” by stating the sum, presenting the calculation $1 + 2 = 3$, and returning $\boxed{3}$.

Listing 4 presents a prompt template that asks DeepSeek‑V3 to act as a judge for reasoning‑related answers, specifying the required JSON output format.

The authors curate roughly 800 k reasoning samples by rejection‑sampling from the first‑stage RL checkpoint, filtering out multilingual, overly long, or code‑heavy chains‑of‑thought, and retaining only correct responses.

Non‑reasoning data comprise about 200 k examples spanning writing, factual QA, self‑cognition, translation, and software‑engineering tasks; for some tasks they prepend a generated chain‑of‑thought, but simple queries receive no CoT.

When shaping the model’s thinking process, the authors enforce concise paragraphs, a conversational tone, and an initial focus on fully understanding user context before generating a response.

**Table 5.** Data Statistics of SFT Data.

The accompanying narrative notes that the SFT corpus totals 804 745 samples, most of which are single‑turn interactions, limiting multi‑turn conversational ability and suggesting future expansion.

Listing 5 provides a concrete SFT trajectory for a mathematics problem, illustrating the model’s “Think” step, the formulation of a Lagrangian, and the eventual answer generation.

Distillation Details

We analytically solve the constraint to find the unique integer n.

We start from the Lagrangian condition and manipulate it algebraically until the problem reduces to a simple Diophantine equation.

Setting the derivative to zero yields the stationary condition that relates $a_k$ and $\lambda$.

We remove the unknown multiplier by squaring, isolating $a_k$, and introducing a constant $c$ that captures the remaining dependence on $\lambda$.

Imposing the sum constraint $\sum_{k=1}^n a_k = 17$ determines the constant $c$.

We now substitute this expression into the original sum $S_n$ and simplify.

Requiring $S_n$ to be an integer forces a Diophantine condition on $n$.

Since $289 = 17^2$, its factor pairs are $(1, 289)$ and $(17, 17)$; the pair $(17, 17)$ would force $n = 0$, so the only non‑trivial pair is $(1, 289)$, which gives $n^2 = 144$ and hence $n = 12$.
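The steps above omit the equations, so here is a reconstruction. It assumes the problem is to minimize $S_n = \sum_{k=1}^{n} \sqrt{(2k-1)^2 + a_k^2}$ over positive $a_k$ with $\sum_{k=1}^{n} a_k = 17$; that objective is inferred from the sum constraint, the factor 289, and the final answer, and is not stated in the summary. The integer $m$ is introduced here to name the value of $S_n$.

```latex
\begin{aligned}
\mathcal{L} &= \sum_{k=1}^{n} \sqrt{(2k-1)^2 + a_k^2} \;-\; \lambda \Big( \sum_{k=1}^{n} a_k - 17 \Big) \\
\frac{\partial \mathcal{L}}{\partial a_k} = 0
  &\;\Longrightarrow\; \frac{a_k}{\sqrt{(2k-1)^2 + a_k^2}} = \lambda \\
a_k^2 = \lambda^2 \big( (2k-1)^2 + a_k^2 \big)
  &\;\Longrightarrow\; a_k = c\,(2k-1), \qquad c = \frac{\lambda}{\sqrt{1-\lambda^2}} \\
\sum_{k=1}^{n} a_k = c \sum_{k=1}^{n} (2k-1) = c\,n^2 = 17
  &\;\Longrightarrow\; c = \frac{17}{n^2} \\
S_n = \sqrt{1 + c^2} \sum_{k=1}^{n} (2k-1)
  &= n^2 \sqrt{1 + \frac{289}{n^4}} = \sqrt{n^4 + 289} \\
n^4 + 289 = m^2
  &\;\Longrightarrow\; (m - n^2)(m + n^2) = 289 \\
m - n^2 = 1,\quad m + n^2 = 289
  &\;\Longrightarrow\; m = 145,\quad n^2 = 144,\quad n = 12
\end{aligned}
```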

Related Work

Comparison of safety metrics across models highlights DeepSeek‑R1’s risk profile.

Table 6 reports two safety‑related metrics—Unsafe Ratio and Rejected Ratio—under original and jailbreak prompts for seven model configurations, exposing how risk‑control mechanisms shift failure modes.

**Table.** Comparison of Unsafe Ratio and Rejected Ratio across various models under original and jailbreak conditions.

Evaluation Setup

We detail the benchmarks, evaluation protocol, and safety checks for DeepSeek‑R1.

We assess DeepSeek‑R1 across a broad suite of knowledge, reasoning, and coding benchmarks, then examine its safety profile under a dedicated risk‑control pipeline.

**Figure 9.** Evolution of reasoning behaviors during training. (a) Frequency of representative reflective words during the training process; (b) Specific occurrence patterns of the word "wait" throughout the training process.

**Figure 10.** The benchmark performance of DeepSeek-R1 and DeepSeek-R1-Zero is compared with human scores across different datasets. For AIME and Codeforces, the human scores represent the average performance of all human competitors. In the case of GPQA, the human score corresponds to Ph.D.-level individuals who had access to the web for answering the questions.

**Figure 11.** The style control ranking on ChatBotArena of DeepSeek-R1. The screenshot is captured on January 24, 2025, one week after model release. The ranking is dynamically updated in real time as the number of votes increases.

**Figure 12.** The rank of DeepSeek-R1 across various aspects on January 24, 2025.

Benchmarks include MMLU variants, C‑Eval, FRAMES, GPQA Diamond, SimpleQA, LiveCodeBench, Codeforces, AIME 2024, and several engineering‑oriented suites; higher scores indicate stronger factual or problem‑solving ability.

To avoid memorization, we filtered any pre‑training text sharing a 10‑gram with evaluation items, removing roughly six million candidate passages from the mathematics corpus.
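An n-gram overlap filter of this kind can be sketched as follows. Whitespace tokenization and the function names are assumptions for illustration; the summary does not describe how text is tokenized or normalized before matching.

```python
def ngrams(text, n=10):
    """Set of word n-grams in `text` (whitespace tokenization, assumed)."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def decontaminate(corpus, eval_items, n=10):
    """Drop every document that shares at least one n-gram with an eval item."""
    banned = set()
    for item in eval_items:
        banned |= ngrams(item, n)
    return [doc for doc in corpus if ngrams(doc, n).isdisjoint(banned)]


# Small n for a readable demo; the filter described above uses n = 10.
corpus = ["please find the value now", "nothing shared here at all"]
print(decontaminate(corpus, ["find the value of x"], n=3))
# ['nothing shared here at all']
```

Documents shorter than n words produce no n-grams and are therefore always kept by this sketch.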

Evaluation prompts follow the simple‑evals framework for most tasks; for MMLU‑Redux we use the Zero‑Eval prompt, and for code/math benchmarks we employ a chain‑of‑thought format, sampling $k$ responses per question to estimate pass@1, with $k$ = 64 for AIME and GPQA, $k$ = 16 for MATH and Codeforces, and $k$ = 8 for LiveCodeBench.

Across the suite, DeepSeek‑R1 outperforms DeepSeek‑V3 on STEM questions, matches OpenAI o1‑1217 on math, and leads on coding contests, while retaining strong performance on long‑context QA (FRAMES) and instruction following (IF‑Eval).
