DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, Daya Guo

DeepSeekMath 7B achieves state-of-the-art open-source math performance via high-quality data mining and efficient RL.

How can we scale mathematical reasoning in open-source language models by combining high-quality data curation with efficient reinforcement learning?

Open-source language models consistently lag behind proprietary systems in complex mathematical reasoning, often due to limited high-quality training data and inefficient alignment techniques. The authors introduce a massive, filtered 120B-token math corpus and Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that replaces the resource-heavy critic model with group-based reward normalization. DeepSeekMath 7B reaches 51.7% on the competition-level MATH benchmark, outperforming all open-source models and rivaling proprietary systems like Gemini-Ultra.

Paper Primer

The authors construct the DeepSeekMath Corpus by iteratively training a fastText classifier on high-quality seed data to mine mathematical content from Common Crawl. This process yielded 120B tokens, which the authors demonstrate is significantly more effective for mathematical reasoning than standard arXiv-based corpora.

GRPO optimizes the model by sampling a group of outputs for each question and calculating advantages based on relative rewards within that group. By eliminating the need for a separate value function (critic model), GRPO significantly reduces the memory and computational overhead typically required for Proximal Policy Optimization (PPO).

DeepSeekMath 7B sets a new performance standard for open-source models on the MATH benchmark.

The model achieves 51.7% accuracy on MATH without external tools, surpassing all open-source counterparts and approaching proprietary models. The authors also report an absolute improvement of more than 10% over existing open-source base models.

Code training acts as a catalyst for mathematical reasoning.

Initializing the model with a code-trained base (DeepSeek-Coder-Base-v1.5) consistently outperformed general-purpose LLM initialization across all mathematical benchmarks. The gains are significant in both tool-integrated and non-tool mathematical problem solving.

Why does the paper prioritize code-trained models as the starting point for mathematical reasoning?

The authors observe that code training improves a model's ability to perform multi-step reasoning and tool use, which directly translates to higher performance on quantitative mathematical tasks compared to general-purpose language model initialization.

What is the primary limitation of using arXiv papers for math pre-training, according to the authors?

The authors find that arXiv-based corpora, despite being standard in the field, show no notable improvements or even lead to performance deterioration on quantitative reasoning benchmarks like GSM8K and MATH.

Researchers should shift focus from parameter scale to high-quality, domain-specific data curation and efficient RL alignment, as these factors can allow 7B models to rival significantly larger proprietary systems.

Introduction and Motivation

We expose the math reasoning bottleneck and show how curated data plus GRPO close it.

Mathematical reasoning remains a hard problem for large language models because the data they ingest is noisy and the training pipelines cannot efficiently capture the precise logical structure required for symbolic computation. This paper tackles that gap by curating a massive, high‑quality math corpus and replacing the standard PPO reinforcement‑learning loop with a group‑based variant that eliminates the need for a separate value model.

LLMs struggle with math when the training data is noisy and the learning process cannot efficiently encode the exact symbolic relationships that mathematics demands.

Assume $N=4096$ and $d=4096$; the raw attention tensor has $N^2 \times d = 4096^3 \approx 68.7 \times 10^9$ entries.

At 4 bytes per 32-bit float, this comes to $68.7 \times 10^9 \times 4 \approx 275$ GB of memory.

Even with mixed precision (2 bytes per entry) the requirement is about 137 GB, exceeding the capacity of a single high-end GPU.

This calculation shows why naïve pre‑training on the entire web‑scale math corpus is infeasible without a selective data pipeline.

**Figure 1.** Top1 accuracy of open-source models on the competition-level MATH benchmark (Hendrycks et al., 2021) without the use of external toolkits and voting techniques.

The key shift is moving from generic pre‑training to a math‑specific, high‑quality data curation pipeline, which together with GRPO unlocks strong reasoning without massive compute.

Data Curation and Pre-Training

Building a clean, massive math corpus and a fast iterative pipeline to feed it.

Pre‑training LLMs on mathematics is crippled by noisy web data and sparse high‑quality sources, which limits reasoning depth and multilingual coverage.

The DeepSeekMath Corpus is a curated, multilingual collection of 120B tokens that concentrates on clean, high-quality mathematical text.

How does the DeepSeekMath Corpus differ from simply scaling up an existing math dataset?

Scaling adds more of the same noisy material; the iterative pipeline actively seeks previously missed domains, filters benchmark leakage, and de‑duplicates, so each added token brings new, high‑quality signal rather than redundant noise.

Exact‑match n‑gram filters strip any snippet that could overlap with evaluation benchmarks, preventing the model from memorizing test data.

Why use a 10‑gram exact match instead of a shorter n‑gram?

A 10‑gram is long enough that accidental overlap is extremely unlikely, so removing matches reliably targets genuine benchmark excerpts while preserving the vast majority of legitimate mathematical prose.
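The exact-match filter described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the whitespace tokenization, the lowercasing, and the function names `word_ngrams`, `build_benchmark_index`, and `decontaminate` are all assumptions made for the example.

```python
from typing import Iterable, List, Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized n-grams in `text`."""
    tokens = text.lower().split()
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts: Iterable[str], n: int = 10) -> Set[Tuple[str, ...]]:
    """Collect every n-gram that appears in any benchmark problem or solution."""
    index: Set[Tuple[str, ...]] = set()
    for text in benchmark_texts:
        index |= word_ngrams(text, n)
    return index


def decontaminate(segments: Iterable[str], benchmark_index: Set[Tuple[str, ...]], n: int = 10) -> List[str]:
    """Keep only the segments that share no n-gram with the benchmark index."""
    return [s for s in segments if word_ngrams(s, n).isdisjoint(benchmark_index)]


# Hypothetical data, for illustration only.
benchmark = ["what is the sum of the first ten positive even integers in total"]
index = build_benchmark_index(benchmark)
leaked = "Homework: what is the sum of the first ten positive even integers in total ? Answer 110"
clean = "The sum of an arithmetic series equals the number of terms times the average of the first and last term"
kept = decontaminate([leaked, clean], index)  # only `clean` survives
```

Storing n-grams in a set keeps each lookup cheap, which matters when the candidate pool is web-scale; a production filter would likely also normalize punctuation and whitespace before matching.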

The pipeline treats corpus building as a feedback loop: a fastText retriever learns from the current seed, pulls more pages, the new pages enrich the seed, and the cycle repeats until marginal gain vanishes.

Train fastText on the 4 examples for 1 epoch → obtain embeddings for each token.

Score 8 candidate crawl pages; top‑2 scores belong to pages C₁ and C₂.

Deduplicate C₁ and C₂ (they share no URLs with the seed).

Ranked list: C₁ (score 0.87), C₂ (score 0.82), others lower.

Keep C₁ and C₂, expanding the seed to 4 positive pages.

Re‑train fastText on the enlarged seed; the model now assigns higher scores to previously missed pages D₁, D₂ in the next iteration.

The loop quickly captures new domains because each enrichment round reshapes the embedding space, allowing the retriever to recognize patterns it could not see with the initial narrow seed.
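The feedback loop above can be sketched as follows. To keep the example self-contained, a toy word-count scorer stands in for the fastText classifier; `train_scorer`, `score`, `mine`, `top_k`, and `max_rounds` are hypothetical names and parameters, and the pages are invented.

```python
from collections import Counter
from typing import Dict, List


def train_scorer(positives: List[str], negatives: List[str]) -> Dict[str, float]:
    """Toy stand-in for fastText: weight = (#positive docs with word) - (#negative docs with word)."""
    pos = Counter(w for doc in positives for w in set(doc.lower().split()))
    neg = Counter(w for doc in negatives for w in set(doc.lower().split()))
    return {w: float(pos[w] - neg[w]) for w in set(pos) | set(neg)}


def score(weights: Dict[str, float], page: str) -> float:
    """Average word weight of a page; unknown words count as zero."""
    tokens = page.lower().split()
    if not tokens:
        return 0.0
    return sum(weights.get(t, 0.0) for t in tokens) / len(tokens)


def mine(seed: List[str], negatives: List[str], pool: List[str],
         top_k: int = 2, max_rounds: int = 4) -> List[str]:
    """Retrain on the growing seed each round and stop when no page scores above zero."""
    seed = list(seed)
    # Deduplicate the pool and drop pages already in the seed.
    remaining = [p for p in dict.fromkeys(pool) if p not in seed]
    for _ in range(max_rounds):
        weights = train_scorer(seed, negatives)
        ranked = sorted(remaining, key=lambda p: score(weights, p), reverse=True)
        new_pages = [p for p in ranked[:top_k] if score(weights, p) > 0]
        if not new_pages:  # marginal gain has vanished
            break
        seed.extend(new_pages)
        remaining = [p for p in remaining if p not in new_pages]
    return seed


seed_pages = ["solve the equation x + 2 = 5", "prove the theorem by induction"]
negative_pages = ["buy cheap shoes online today", "celebrity news and gossip today"]
crawl_pool = [
    "the equation of a circle",
    "integral of the function by parts",
    "cheap flights today",
    "a tangent line touches a circle",
]
mined = mine(seed_pages, negative_pages, crawl_pool)
```

In this toy run the geometry page "a tangent line touches a circle" scores zero against the initial seed and is only recovered in the second round, after "the equation of a circle" has joined the positives. That is the same enrichment effect the text describes, at a much smaller scale.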

**Figure 2.** An iterative pipeline that collects mathematical web pages from Common Crawl.

With a clean, massive corpus in hand, we evaluate its impact by pre‑training a 1.3 B LLM on each candidate corpus for 150 B tokens under a shared training regime.

Results (Table 1, Figure 3) show three clear advantages: higher absolute scores, consistent gains in both English and Chinese benchmarks, and a steeper learning curve that delays saturation.

**Figure 3.** Benchmark curves of DeepSeek-LLM 1.3B trained on different mathematical corpora.

DeepSeekMath-Base 7B is a 7B-parameter base model that is initialized from the code-trained weights of DeepSeek-Coder-Base-v1.5 and further trained on a math-heavy mixture.

Tables 2–4 confirm that DeepSeekMath‑Base 7B surpasses all open‑source baselines on math, reasoning, and coding tasks, while matching the coder predecessor on pure code benchmarks.

Minerva is a closed-source model family (7B and larger) trained on massive math data, cited as a strong baseline in Table 2.

Supervised Fine-Tuning

We fine‑tune the base model on curated math instructions and evaluate its step‑by‑step reasoning.

Supervised Fine‑Tuning (SFT) tackles the data‑quality bottleneck by adapting the pre‑trained base model on a high‑quality, instruction‑style math corpus. The goal is to teach the model to produce coherent chain‑of‑thought or tool‑integrated solutions before any reinforcement learning.

SFT treats each math problem–solution pair as a supervised example, letting the model learn to map a question to a step‑by‑step answer just like a student practices solved exercises.

Example E1: “What is 2 + 3?” → Target tokens: “2 + 3 = 5”.

Example E2: “Factor 6 × 9.” → Target tokens: “6 × 9 = 54 → 6 × 9 = (2 × 3) × (3 × 3) = 2 × 3³ = 54”.

Example E3: “Integrate x² dx.” → Target tokens: “∫ x² dx = x³/3 + C”.

During one training step the model sees a random batch of these three examples, computes the loss for each token, and updates its parameters via gradient descent.

After a few hundred such steps the model begins to emit the exact two‑step reasoning pattern for new problems.

Even with a handful of examples the model learns the template “problem → explicit reasoning → answer”, which later scales to the full 776 K‑example corpus.

How does SFT differ from the earlier pre‑training stage?

Pre‑training predicts the next token from massive raw text, learning generic language patterns. SFT, by contrast, provides a concrete input–output mapping for math problems, forcing the model to internalise explicit reasoning steps rather than just statistical continuation.

Training DeepSeekMath-Instruct 7B uses the SFT dataset with a maximum context of 4K tokens. We run 500 optimization steps with a batch size of 256 and a constant learning rate of $5 \times 10^{-5}$.
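These hyperparameters put a rough ceiling on how many tokens the model sees during SFT. The figure below is an upper bound only, under two assumptions: "4K" means 4096 tokens, and every sequence in every batch fills the full context.

$$
500 \text{ steps} \times 256 \text{ sequences} = 128{,}000 \text{ sequences}, \qquad 128{,}000 \times 4096 = 524{,}288{,}000 \approx 5.2 \times 10^{8} \text{ tokens}.
$$

At roughly half a billion tokens at most, the SFT stage is small next to the pre-training budgets quoted earlier.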

Evaluation is performed on four quantitative-reasoning benchmarks, both with tool use disabled and enabled. Without tools, the model outperforms all open-source baselines by at least 9% absolute on the MATH dataset; with tool integration it approaches 60% accuracy on MATH, surpassing every open-source competitor.

Reinforcement Learning with GRPO

We introduce GRPO, a group-based reinforcement learning approach that eliminates the need for a separate value model.

Standard reinforcement learning for LLMs typically relies on PPO (Proximal Policy Optimization), which requires a value model to estimate the expected reward of a state. Because this value model is often as large as the policy model itself, it imposes a heavy memory and computational burden during training.

Instead of relying on a separate value model to predict rewards, GRPO samples a group of outputs for the same question and uses their relative performance to estimate the baseline.

Suppose three outputs sampled for the same question receive rewards of 10, 2, and 6. Calculate group mean: $(10+2+6)/3 = 6$.

Calculate group standard deviation: $\sqrt{((10-6)^2 + (2-6)^2 + (6-6)^2)/3} \approx 3.27$.

Normalize rewards: $\hat{A}_1 = (10-6)/3.27 \approx 1.22$; $\hat{A}_2 = (2-6)/3.27 \approx -1.22$; $\hat{A}_3 = (6-6)/3.27 = 0$.

Normalization centers the advantages around zero within the group, ensuring that the policy is pushed to favor outputs that perform better than the group average.
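The normalization in the worked example above can be reproduced with a short function. This is a sketch under two assumptions: it uses the population standard deviation (dividing by the group size, which matches the 3.27 above), and it adds a small `eps` term, not mentioned in the text, to avoid division by zero when every output in the group gets the same reward.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its own group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)  # population variance
    std = math.sqrt(variance)
    # eps is an assumption: it keeps the result finite when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]


advantages = group_relative_advantages([10.0, 2.0, 6.0])
print([round(a, 2) for a in advantages])  # [1.22, -1.22, 0.0]
```

No learned value model appears anywhere in this computation; the group itself supplies the baseline.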

GRPO supports both outcome and process supervision. In outcome supervision, the normalized reward is applied to the entire output, whereas process supervision provides rewards at each reasoning step, allowing the model to receive feedback on intermediate logic.
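The difference between the two modes shows up in how per-token advantages are assigned. The sketch below assumes the rewards are already group-normalized. For process supervision it follows the paper's rule that a token's advantage is the sum of the normalized rewards of all steps ending at or after that token. The function names and the zero-based `step_end_index` convention are choices made for this example.

```python
from typing import List


def outcome_token_advantages(normalized_reward: float, num_tokens: int) -> List[float]:
    """Outcome supervision: every token of the output shares one advantage."""
    return [normalized_reward] * num_tokens


def process_token_advantages(step_rewards: List[float],
                             step_end_index: List[int],
                             num_tokens: int) -> List[float]:
    """Process supervision: token t receives the sum of normalized rewards
    of every reasoning step whose final token index is >= t."""
    advantages = []
    for t in range(num_tokens):
        total = sum(r for r, end in zip(step_rewards, step_end_index) if end >= t)
        advantages.append(total)
    return advantages


# Hypothetical output with 6 tokens and two reasoning steps ending at tokens 2 and 5.
outcome = outcome_token_advantages(1.22, 6)
process = process_token_advantages([0.5, -1.0], [2, 5], 6)
print(process)  # [-0.5, -0.5, -0.5, -1.0, -1.0, -1.0]
```

Under process supervision the early tokens are credited with the first step's positive reward and penalized for the flawed second step, whereas outcome supervision cannot distinguish the two.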

**Figure 4.** Demonstration of PPO and our GRPO. GRPO foregoes the value model, instead estimating the baseline from group scores, significantly reducing training resources.

To maintain performance as the policy improves, we employ iterative RL. We continually update the reward model using samples from the current policy and a replay buffer, ensuring the supervision signal remains relevant to the model's evolving capabilities.
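One way to picture the replay mechanism is as a data-mixing step that runs before each reward-model update. The sketch below is illustrative only: `build_reward_training_set`, the `replay_fraction` parameter, its default of 0.1, and its interpretation as a share of the historical data are assumptions, not details confirmed by the text above.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def build_reward_training_set(new_samples: Sequence[T],
                              history: Sequence[T],
                              replay_fraction: float = 0.1,
                              seed: int = 0) -> List[T]:
    """Mix fresh policy samples with a random subset of historical samples."""
    rng = random.Random(seed)
    # Assumption: replay_fraction is the share of the history that is replayed.
    num_replay = min(len(history), round(replay_fraction * len(history)))
    replayed = rng.sample(list(history), num_replay)
    mixed = list(new_samples) + replayed
    rng.shuffle(mixed)
    return mixed


# Hypothetical sample identifiers, for illustration only.
fresh = [f"new-{i}" for i in range(20)]
past = [f"old-{i}" for i in range(100)]
training_set = build_reward_training_set(fresh, past)
```

Keeping some older samples in each round is a common guard against the reward model forgetting what earlier, weaker policies produced.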

**Table.** Performance comparison of various models on English and Chinese mathematical benchmarks, categorized by Chain-of-Thought Reasoning and Tool-Integrated Reasoning, and further subdivided into Closed-Source and Open-Source models.

Discussion and Findings

Code pre‑training synergizes with math training to boost reasoning performance.

Recall that DeepSeekMath tackles the mathematical‑reasoning bottleneck by curating a massive math corpus and replacing PPO with Group Relative Policy Optimization (GRPO), which removes the need for a separate value model.

Two‑stage code‑then‑math pre‑training raises program‑aided reasoning accuracy to 19.1 % on GSM8K + Python, far above the 12.3 % achieved without code pre‑training.

Table 6 shows 19.1 % for the “Code → Math” two‑stage setting versus 12.3 % for the “General → Math” baseline.

**Table 6.** Investigation of how code affects mathematical reasoning under different training settings. We experiment with DeepSeek-LLM 1.3B, and evaluate its mathematical reasoning performance without and with tool use via few-shot chain-of-thought prompting and few-shot program-of-thought prompting, respectively.

**Table 7.** Investigation of how different settings of code and math training affect model performance of language understanding, reasoning, and coding. We experiment with DeepSeek-LLM 1.3B. We evaluate the models on MMLU and BBH using few-shot chain-of-thought prompting. On HumanEval and MBPP, we conduct zero-shot and few-shot evaluations, respectively.

**Table 8.** Effect of math training on different arXiv datasets. Model performance is evaluated with few-shot chain-of-thought prompting.

**Table 9.** Effect of math training on different arXiv corpora, the base model being DeepSeek-Coder-Base-v1.5 7B. We evaluate informal-to-formal proving in Isabelle.

**Table 10.** The data source and gradient coefficient of different methods. $P_{sft}$ denotes the data distribution of supervised fine-tuning datasets. $\pi_{\theta_{sft}}$ and $\pi_{\theta}$ denote the supervised fine-tuned model and the real-time policy model during the online training process, respectively.

The synergy between code training and mathematical reasoning is the key driver of DeepSeekMath’s gains.

Technical Appendix

Detailed comparison of RL objectives, data sources, rewards, and gradient coefficients.

This appendix tabulates the key ingredients of each reinforcement‑learning variant discussed in the paper.

Across methods the data source shifts from a static SFT corpus to live policy samples, while the reward signal moves from human selection toward learned reward models; gradient coefficients range from a fixed scalar to advantage‑based terms that depend on group statistics.
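The comparison can be condensed into one gradient expression. The form below is reconstructed from the paper's unified formulation, so treat the exact notation as a sketch: $\mathcal{A}$ is the training method, $\mathcal{D}$ its data source, $\pi_{rf}$ the reward function, and $GC_{\mathcal{A}}$ the gradient coefficient.

$$
\nabla_{\theta} \mathcal{J}_{\mathcal{A}}(\theta) = \mathbb{E}_{(q, o) \sim \mathcal{D}} \left[ \frac{1}{|o|} \sum_{t=1}^{|o|} GC_{\mathcal{A}}(q, o, t, \pi_{rf}) \, \nabla_{\theta} \log \pi_{\theta}(o_t \mid q, o_{<t}) \right]
$$

Two instances illustrate the range described above. For SFT, $\mathcal{D} = P_{sft}$ and the coefficient is the constant $GC_{SFT} = 1$. For GRPO, outputs are sampled from the live policy $\pi_{\theta}$ and the coefficient combines the group-relative advantage with a KL-regularization term:

$$
GC_{GRPO}(q, o, t, \pi_{rf}) = \hat{A}_{i,t} + \beta \left( \frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})} - 1 \right)
$$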
