DeepSeek-V3 Technical Report

DeepSeek-V3 is a 671B-parameter MoE model achieving state-of-the-art open-source performance via FP8 training and architectural innovations.

How can we scale a Mixture-of-Experts model to 671B parameters while maintaining efficient inference and training costs?

Large-scale Mixture-of-Experts (MoE) models often struggle with high training costs, communication bottlenecks, and performance degradation caused by traditional load-balancing auxiliary losses. DeepSeek-V3 addresses these by combining Multi-head Latent Attention (MLA) for efficient inference with an auxiliary-loss-free load-balancing strategy and a multi-token prediction training objective. The system utilizes a custom FP8 mixed-precision framework and the DualPipe algorithm to overlap computation and communication, enabling training on 14.8 trillion tokens at a fraction of the cost of dense models. The model outperforms leading open-source alternatives and rivals closed-source models like GPT-4o on major benchmarks, requiring only 2.788 million H800 GPU hours for full training.

Paper Primer

The core architectural innovation is the auxiliary-loss-free load-balancing strategy, which replaces rigid auxiliary losses with a dynamic bias term added to expert affinity scores. This allows the model to maintain balanced expert utilization without the performance penalties typically associated with forcing load balance during training.

To overcome the communication bottleneck inherent in cross-node MoE training, the authors developed DualPipe. This pipeline parallelism algorithm schedules micro-batches to overlap forward and backward computation with all-to-all communication, achieving near-zero communication overhead as the model scales.

DeepSeek-V3 achieves state-of-the-art performance among open-source models on math and coding benchmarks.

The model outperforms LLaMA-3.1 405B and Qwen2.5 72B on the majority of technical benchmarks, including MATH-500 and LiveCodeBench. It achieves 88.5 on MMLU and 59.1 on GPQA, narrowing the gap with leading closed-source models.

Why use a multi-token prediction (MTP) objective during training?

MTP densifies training signals by requiring the model to predict multiple future tokens at each position, which improves data efficiency and allows the model to pre-plan representations for better future-token prediction.

How does the model maintain performance while using low-precision FP8 training?

The authors use fine-grained tile-wise and block-wise quantization to mitigate dynamic range issues, combined with a promotion-to-CUDA-Cores strategy that accumulates partial results in FP32 to maintain numerical stability.

DeepSeek-V3 demonstrates that extreme training efficiency and state-of-the-art performance are achievable for massive MoE models through the co-design of hardware-aware communication algorithms and low-precision training frameworks.

Introduction and Performance Overview

DeepSeek‑V3 scales to 671 B parameters while keeping training cost under 3 M GPU‑hours.

DeepSeek‑V3 is an open‑source Mixture‑of‑Experts language model that pushes the frontier of scale (671 B parameters) while remaining cost‑effective, thanks to a suite of architectural and systems innovations.

The authors aim to demonstrate that open‑source LLMs can achieve both massive scale and practical efficiency, proving that size need not entail prohibitive compute or memory costs.

**Figure 1.** Benchmark performance of DeepSeek-V3 and its counterparts.

**Table 1.** Training costs of DeepSeek-V3, assuming the rental price of H800 is \$2 per GPU hour.

**Figure d.** Layers 19-25

**Figure e.** Layers 25-27

DeepSeek‑V3 reaches 671 B total parameters while activating only 37 B parameters per token.

Model description in the introduction specifies the total and per‑token active parameter counts.

Training the full DeepSeek‑V3 pipeline required just 2.788 M H800 GPU‑hours.

Table 1 provides the detailed cost breakdown confirming the total.
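The headline budget can be cross-checked with simple arithmetic. The sketch below uses only figures quoted in this summary (2.788 M GPU-hours, \$2 per GPU-hour, 180 K GPU-hours per trillion tokens, 14.8 T tokens). Treating the leftover hours as "everything after pre-training" is an assumption about how Table 1 splits the total.

```python
# Back-of-the-envelope check of the training budget quoted in the text.
TOTAL_GPU_HOURS = 2_788_000              # full training, H800 GPU-hours
PRICE_PER_GPU_HOUR_USD = 2               # rental price assumed in Table 1
GPU_HOURS_PER_TRILLION_TOKENS = 180_000  # pre-training efficiency
TOKENS_IN_TENTHS_OF_TRILLIONS = 148      # 14.8 T tokens, kept as an integer

total_cost_usd = TOTAL_GPU_HOURS * PRICE_PER_GPU_HOUR_USD
pretraining_gpu_hours = (
    GPU_HOURS_PER_TRILLION_TOKENS * TOKENS_IN_TENTHS_OF_TRILLIONS // 10
)
# Assumption: whatever is left covers the stages after pre-training.
remaining_gpu_hours = TOTAL_GPU_HOURS - pretraining_gpu_hours

print(f"total cost: ${total_cost_usd:,}")
print(f"pre-training: {pretraining_gpu_hours:,} GPU-hours")
print(f"remaining: {remaining_gpu_hours:,} GPU-hours")
```

Under these numbers, pre-training accounts for the large majority of the GPU-hours.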

DeepSeek‑V3 proves that a 671 B‑scale open‑source model can be built and trained efficiently, delivering top‑tier performance at a modest hardware budget.

Model Architecture and Training Objectives

DeepSeek-V3 integrates Multi-head Latent Attention, DeepSeekMoE, and Multi-Token Prediction for efficient, scalable performance.

DeepSeek-V3 builds upon the Transformer framework, optimizing for both inference efficiency and training economy. The architecture centers on two primary mechanisms: Multi-head Latent Attention (MLA) to compress the KV cache and DeepSeekMoE to enable sparse, expert-based computation.

MLA shrinks the key-value (KV) cache needed during inference through low-rank joint compression of keys and values: the model caches one compressed latent vector per token (plus a small decoupled positional key) instead of full per-head keys and values.
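To get a feel for the savings, the following sketch compares per-token, per-layer cache sizes for standard multi-head attention and for MLA. The dimensions (128 heads of size 128, a 512-dimensional latent, a 64-dimensional shared positional key) are not stated in this summary; they are assumed values for illustration and should be checked against the report.

```python
# Per-token, per-layer KV-cache size: multi-head attention vs. MLA.
# All dimensions below are assumptions made for illustration.
N_HEADS = 128        # attention heads
HEAD_DIM = 128       # per-head key/value dimension
KV_LATENT_DIM = 512  # compressed joint KV latent
ROPE_KEY_DIM = 64    # decoupled positional key, shared across heads


def mha_cache_elements(n_heads: int, head_dim: int) -> int:
    """Full keys and values are cached for every head."""
    return 2 * n_heads * head_dim


def mla_cache_elements(latent_dim: int, rope_dim: int) -> int:
    """Only one latent vector and one shared positional key are cached."""
    return latent_dim + rope_dim


mha = mha_cache_elements(N_HEADS, HEAD_DIM)
mla = mla_cache_elements(KV_LATENT_DIM, ROPE_KEY_DIM)
print(mha, mla, round(mha / mla, 1))
```

With these assumed sizes the cache shrinks by well over an order of magnitude; the exact ratio depends entirely on the chosen dimensions.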

DeepSeekMoE improves training economy by using finer-grained experts and isolating specific experts as "shared" to capture common knowledge, while routing others dynamically.

**Figure 2.** Illustration of the basic architecture of DeepSeek-V3. Following DeepSeek-V2, we adopt MLA and DeepSeekMoE for efficient inference and economical training.

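Unlike approaches that predict several future tokens in parallel with independent output heads, DeepSeek-V3 predicts the additional tokens sequentially, so each prediction depth keeps the complete causal chain. The extra targets densify the training signal and encourage representations that anticipate tokens beyond the immediate next one.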

**Figure 3.** Illustration of our Multi-Token Prediction (MTP) implementation. We keep the complete causal chain for the prediction of each token at each depth.
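A minimal sketch of how multi-token prediction targets line up with positions, and how the per-depth losses might be folded into one objective. The function names and the `weight` hyper-parameter are hypothetical; the sketch shows only the bookkeeping, not the MTP modules themselves.

```python
def mtp_targets(tokens: list[int], depth: int) -> dict[int, list[tuple[int, int]]]:
    """Map prediction depth k to (position, target_token) pairs.

    Depth 0 is ordinary next-token prediction; depth k looks k tokens further.
    """
    targets = {}
    for k in range(depth + 1):
        offset = k + 1
        targets[k] = [(i, tokens[i + offset]) for i in range(len(tokens) - offset)]
    return targets


def combined_loss(main_loss: float, mtp_losses: list[float], weight: float) -> float:
    """Main loss plus a weighted average of the per-depth MTP losses.

    `weight` is a hypothetical hyper-parameter controlling the MTP share.
    """
    if not mtp_losses:
        return main_loss
    return main_loss + weight * sum(mtp_losses) / len(mtp_losses)


example = mtp_targets([10, 11, 12, 13, 14], depth=1)
print(example[0])  # next-token targets
print(example[1])  # targets one token further ahead
```

Each extra depth adds one more supervised target per position, at the cost of losing one position at the end of the sequence.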

Infrastructure and Training Optimizations

Efficient pipeline and FP8 tricks cut communication stalls and memory use.

Cross‑node expert parallelism makes the compute‑to‑communication ratio hover around $1\!:\!1$, turning communication into a hard bottleneck. The authors answer this by redesigning the pipeline so that forward and backward phases overlap, and by squeezing precision to FP8.

DualPipe schedules forward and backward micro‑batches in opposite directions, letting the two halves of a pipeline hide each other’s communication.

How does DualPipe differ from the classic 1F1B pipeline that also overlaps forward and backward?

Classic 1F1B alternates one forward and one backward chunk per stage, and its schedule leaves pipeline bubbles and exposed communication whenever a chunk must wait on a transfer. DualPipe instead feeds micro-batches from both ends of the pipeline, so the communication of a chunk travelling in one direction can be hidden behind the computation of a chunk travelling in the other. This shrinks the bubbles substantially, although it does not remove them entirely.

Time 0: Stage 0 starts forward chunk F0; Stage 1 starts backward chunk B3 (the last micro‑batch).

Time 1: Stage 0 finishes dispatch of F0 and immediately begins compute of B0; Stage 1 finishes compute of B3 and begins dispatch of B3.

Time 2: Stage 0’s compute of B0 overlaps with Stage 1’s dispatch of B3; both communications are hidden.

Time 3: All four micro‑batches have progressed one step without any idle slot in this idealized trace; in a full schedule, warm‑up and cool‑down still leave a small bubble.

By feeding micro‑batches from opposite ends, DualPipe turns the communication phases of one direction into “free” work for the other, achieving near‑zero idle time.
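The effect on idle time can be sketched with closed-form bubble estimates. The formulas below are recalled from the report's pipeline comparison table and should be treated as assumptions, as should the timing values: `f` is the time of a forward chunk, `b` of a full backward chunk, `w` of the backward-for-weights part, and `fb_overlapped` of a pair of mutually overlapped forward and backward chunks.

```python
def bubble_1f1b(pp: int, f: int, b: int) -> int:
    """Idle time per rank for classic 1F1B (assumed formula)."""
    return (pp - 1) * (f + b)


def bubble_zb1p(pp: int, f: int, b: int, w: int) -> int:
    """Idle time per rank for a zero-bubble style schedule (assumed formula)."""
    return (pp - 1) * (f + b - 2 * w)


def bubble_dualpipe(pp: int, fb_overlapped: int, b: int, w: int) -> int:
    """Idle time per rank for DualPipe (assumed formula)."""
    return (pp // 2 - 1) * (fb_overlapped + b - 3 * w)


# Hypothetical timings in arbitrary units; overlap is assumed to save nothing.
PP, F, B, W = 8, 1, 2, 1
FB_OVERLAPPED = F + B

print(bubble_1f1b(PP, F, B))
print(bubble_zb1p(PP, F, B, W))
print(bubble_dualpipe(PP, FB_OVERLAPPED, B, W))
```

Even under the pessimistic assumption that overlapping a forward and backward chunk saves no time, the bidirectional schedule roughly halves the number of bubble slots because each direction only spans half of the ranks.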

**Figure 4.** Overlapping strategy for a pair of individual forward and backward chunks (the boundaries of the transformer blocks are not aligned). Orange denotes forward, green denotes "backward for input", blue denotes "backward for weights", purple denotes PP communication, and red denotes barriers. Both all-to-all and PP communication can be fully hidden.

**Figure 5.** Example DualPipe scheduling for 8 PP ranks and 20 micro-batches in two directions. The micro-batches in the reverse direction are symmetric to those in the forward direction, so we omit their batch ID for illustration simplicity. Two cells enclosed by a shared black border have mutually overlapped computation and communication.

FP8 training keeps the bulk of matrix multiplications in an 8‑bit format while rescuing accuracy with fine‑grained scaling and occasional high‑precision accumulation.

Why not simply train in BF16?

BF16 spends 16 bits per value (1 sign, 8 exponent, and 7 mantissa bits), twice the storage of FP8. FP8 halves the memory and bandwidth needed for the tensors stored in it, and fine‑grained scaling plus periodic FP32 accumulation recover most of the lost precision, giving loss curves comparable to BF16.

Compute per‑tile max: $\max|X|=0.8$, $\max|W|=0.6$.

Derive scaling factors $s_X=2^{\lceil\log_2 0.8\rceil}=1$, $s_W=2^{\lceil\log_2 0.6\rceil}=1$ (no change).

Cast $X$ and $W$ to FP8 (E4M3) using the scales; a tile such as $[0.5, -0.3, 0.8, -0.1]$ is rounded to the nearest representable values, here $[0.5, -0.3125, 0.8125, -0.1015625]$.

Perform GEMM in FP8; after every $N_C=2$ products, accumulate into a FP32 register.

Final FP32 result matches the BF16 baseline within $0.2\%$ error.

Because every tile carries its own scale, a single large outlier can distort only the tile it falls in rather than the whole tensor, and periodic FP32 accumulation keeps rounding error from building up across a long dot product.
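The role of per-tile scales can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' kernel: `round_to_e4m3` is a hypothetical helper that imitates E4M3 rounding (4 significant bits, subnormal spacing of $2^{-9}$, maximum 448, special values ignored), and each tile's maximum is mapped onto 448, a common convention that differs from the power-of-two scales in the toy numbers above.

```python
import math

E4M3_MAX = 448.0           # largest finite E4M3 magnitude
E4M3_MIN_NORMAL = 2.0 ** -6
E4M3_SUBNORMAL_STEP = 2.0 ** -9


def round_to_e4m3(x: float) -> float:
    """Imitate E4M3 rounding (simplified: no NaN handling)."""
    if abs(x) < E4M3_MIN_NORMAL:
        # Subnormal range: fixed spacing, tiny values flush to zero.
        return round(x / E4M3_SUBNORMAL_STEP) * E4M3_SUBNORMAL_STEP
    mantissa, exponent = math.frexp(x)  # x = mantissa * 2**exponent
    rounded = math.ldexp(round(mantissa * 16) / 16, exponent)
    return max(-E4M3_MAX, min(E4M3_MAX, rounded))


def quantize_tiles(values: list[float], tile: int) -> list[float]:
    """Quantize and dequantize with one scale per tile of `tile` values."""
    out = []
    for start in range(0, len(values), tile):
        chunk = values[start:start + tile]
        amax = max(abs(v) for v in chunk)
        scale = E4M3_MAX / amax if amax > 0 else 1.0
        out.extend(round_to_e4m3(v * scale) / scale for v in chunk)
    return out


data = [0.001, 0.002, 10000.0, 5000.0]
print(quantize_tiles(data, tile=4))  # one scale: small values are lost
print(quantize_tiles(data, tile=2))  # per-tile scales: small values survive
```

With a single scale the outlier pushes the small entries below the representable range, while per-tile scales keep them; real kernels use much larger tiles than this toy.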

**Figure 6.** The overall mixed precision framework with FP8 data format. For clarification, only the Linear operator is illustrated.

**Figure 7.** (a) We propose a fine-grained quantization method to mitigate quantization errors caused by feature outliers; for illustration simplicity, only Fprop is illustrated. (b) In conjunction with our quantization strategy, we improve the FP8 GEMM precision by promoting to CUDA Cores at an interval of $N_C = 128$ elements MMA for the high-precision accumulation.
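Why promotion at a fixed interval helps can be shown with a toy accumulator. The sketch assumes a limited-precision accumulator that keeps about 14 significant bits (an assumption for illustration) and uses a Python float as a stand-in for the FP32 register; `round_to_bits` is a hypothetical helper.

```python
import math


def round_to_bits(x: float, bits: int) -> float:
    """Keep only `bits` significant bits of x (round to nearest, ties to even)."""
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(x)
    scale = 1 << bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)


def accumulate_naive(values: list[float], bits: int = 14) -> float:
    """Sum everything in the limited-precision accumulator."""
    acc = 0.0
    for v in values:
        acc = round_to_bits(acc + v, bits)
    return acc


def accumulate_promoted(values: list[float], bits: int = 14, interval: int = 128) -> float:
    """Flush the limited-precision partial sum to a wide accumulator every `interval` terms."""
    total = 0.0    # wide accumulator (stand-in for FP32)
    partial = 0.0  # limited-precision partial sum
    for i, v in enumerate(values, start=1):
        partial = round_to_bits(partial + v, bits)
        if i % interval == 0:
            total += partial
            partial = 0.0
    return total + partial


values = [1.0] * 20_000
print(accumulate_naive(values), accumulate_promoted(values))
```

In the naive loop the running sum stalls once its spacing exceeds the size of the addends, whereas the promoted version keeps each partial sum small enough to stay exact.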

Together, DualPipe’s bidirectional scheduling and FP8’s fine‑grained quantization let DeepSeek‑V3 scale to 671 B parameters without exploding memory or communication costs.

Pre-Training Data and Strategy

Key pre‑training results and the data‑centric tricks that enable them.

DeepSeek‑V3 attains a training efficiency of 180 K H800 GPU‑hours per trillion tokens, far lower than dense baselines of comparable scale.

Measured under the authors’ training framework; a 72 B dense model requires >300 K GPU‑hours per trillion tokens.

Fill-in-Middle (FIM) training complements next‑token prediction: for a fraction of documents, a middle segment is cut out and moved to the end, so the model learns to produce the missing span given both the prefix and the suffix.

How does FIM differ from ordinary next‑token prediction?

Ordinary next‑token prediction conditions each token only on the text to its left in the original order. FIM keeps the same left‑to‑right objective but reorders the document as prefix, suffix, middle (the PSM layout), so that by the time the model generates the middle it has already read the text on both sides. The architecture and loss stay autoregressive; only the data layout changes.
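A minimal sketch of a prefix-suffix-middle rearrangement for FIM training data. The sentinel strings are assumptions recalled from the report and may not match the tokenizer exactly; `to_psm` and its arguments are hypothetical names.

```python
# Sentinel strings are assumptions; check them against the actual tokenizer.
FIM_BEGIN = "<|fim_begin|>"
FIM_HOLE = "<|fim_hole|>"
FIM_END = "<|fim_end|>"
EOS = "<|eos_token|>"


def to_psm(document: str, hole_start: int, hole_end: int) -> str:
    """Rearrange a document as prefix, suffix, middle so the middle comes last."""
    if not 0 <= hole_start <= hole_end <= len(document):
        raise ValueError("hole must lie inside the document")
    prefix = document[:hole_start]
    middle = document[hole_start:hole_end]
    suffix = document[hole_end:]
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}{EOS}"


print(to_psm("def add(a, b):\n    return a + b\n", 19, 31))
```

Because the rearranged string is still trained left to right, the same model and loss serve both ordinary and fill-in-middle samples.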

YaRN rescales the frequencies of the rotary position embedding (RoPE) and adjusts the attention temperature, so that a model trained on short sequences can attend reliably over much longer ones, here up to 128 K tokens.

Why modify only the decoupled shared key $k_t^R$?

In MLA, positional information enters the keys only through the decoupled key $k_t^R$, which carries RoPE and is shared across attention heads; the compressed KV latent is position‑free. Because YaRN is a change to RoPE, applying it to $k_t^R$ is sufficient and leaves the rest of the attention computation untouched.
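One ingredient of YaRN is a scaling factor applied to attention as the context is stretched. The sketch below evaluates the usual form $\sqrt{t} = 0.1 \ln s + 1$; both the formula and the scale $s = 40$ are recalled from the report rather than stated in this summary, so treat them as assumptions.

```python
import math


def yarn_attention_scale(scale_factor: float) -> float:
    """Return sqrt(t) = 0.1 * ln(s) + 1 for context scale factor s (assumed form)."""
    return 0.1 * math.log(scale_factor) + 1.0


# s = 40 is an assumed setting, used here only as an example.
print(round(yarn_attention_scale(40.0), 4))
print(yarn_attention_scale(1.0))  # no extension, no rescaling
```

The factor grows only logarithmically with the extension ratio, so even a large stretch changes the attention scale modestly.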

Data construction raises the share of math and programming samples, extends multilingual coverage beyond English and Chinese, and refines the processing pipeline to reduce redundancy while preserving diversity; documents are packed into training sequences. The tokenizer’s 128 K vocabulary introduces tokens that combine punctuation with line breaks, which can cause token‑boundary bias when a prompt ends without a trailing line break; randomly splitting a proportion of such tokens during training mitigates this effect.

The hyper‑parameters yield a 671 B model with 37 B activated parameters per token. MoE layers replace most FFNs; each MoE layer has 1 shared expert and 256 routed experts, of which 8 routed experts are activated per token. Multi‑Token Prediction (MTP) predicts one additional token per position (depth 1), and the MTP module can be discarded at inference so that serving cost is unchanged.

Training uses AdamW ($\beta_1\!=\!0.9$, $\beta_2\!=\!0.95$, weight decay = 0.1), a learning‑rate schedule with cosine decay that ends at $7.3\times10^{-6}$, and gradient clipping at norm 1.0. Batch size grows from 3 072 to 15 360 early in training, then stays constant. Pipeline parallelism places different layers on different GPUs; within each layer, the routed experts are spread uniformly over 64 GPUs on 8 nodes, and each token is sent to at most four nodes ($M = 4$).
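The learning-rate schedule can be sketched as a piecewise function of tokens consumed. Only the final value $7.3\times10^{-6}$ comes from this summary; the peak rate, the phase boundaries, and the intermediate plateau are recalled from the report and should be treated as assumptions. The short warm-up at the start is omitted.

```python
import math

# Assumed values, except FINAL_LR which is quoted in the text.
PEAK_LR = 2.2e-4
DECAYED_LR = 2.2e-5
FINAL_LR = 7.3e-6
T = 1e12  # one trillion tokens


def learning_rate(tokens_seen: float) -> float:
    """Piecewise schedule over tokens consumed (warm-up omitted)."""
    if tokens_seen < 10 * T:          # constant phase
        return PEAK_LR
    if tokens_seen < 14.3 * T:        # cosine decay over 4.3 T tokens
        progress = (tokens_seen - 10 * T) / (4.3 * T)
        cosine = 0.5 * (1 + math.cos(math.pi * progress))
        return DECAYED_LR + (PEAK_LR - DECAYED_LR) * cosine
    if tokens_seen < 14.633 * T:      # first part of the final 500 B tokens
        return DECAYED_LR
    return FINAL_LR                   # last stretch of training


for tokens in (5e12, 12.15e12, 14.5e12, 14.7e12):
    print(f"{tokens:.3e} tokens -> lr {learning_rate(tokens):.3e}")
```

Expressing the schedule in tokens rather than steps keeps it independent of the batch-size ramp.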

**Figure 8.** Evaluation results on the "Needle In A Haystack" (NIAH) tests. DeepSeek-V3 performs well across all context window lengths up to 128K.

Table 3 compares DeepSeek‑V3‑Base against DeepSeek‑V2‑Base, Qwen2.5 72 B, and LLaMA‑3.1 405 B. DeepSeek‑V3 leads on most benchmarks, especially math and code, and matches or exceeds the 405 B dense model despite activating only 37 B parameters per token.

Enriching the pre‑training mix with more math, code, and multilingual data directly expands the model’s capability envelope across diverse evaluation suites.

Supervised Fine-Tuning

Post‑training boosts performance across benchmarks, notably a 20% win‑rate gain.

DeepSeek‑V3 attains an 85.5% win rate on the Arena‑Hard benchmark, and on AlpacaEval 2.0 it improves on DeepSeek‑V2.5‑0905 by roughly 20 percentage points.

Table 7 reports 85.5 % on Arena‑Hard and 70.0 % on AlpacaEval 2.0 (length‑controlled win rate) for DeepSeek‑V3.

The authors curate 1.5 M instruction instances, generating reasoning samples with DeepSeek‑R1 and non‑reasoning samples with DeepSeek‑V2.5 plus human verification. Supervised fine‑tuning runs for two epochs with a learning rate that follows cosine decay from 5e‑6 to 1e‑6, and a sample‑masking strategy keeps packed sequences isolated from one another. Reinforcement learning combines rule‑based and model‑based reward models and optimizes the policy with Group Relative Policy Optimization (GRPO), which samples a group of outputs from the old policy for each prompt.
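The "group relative" part of GRPO can be sketched as normalizing each sampled output's reward against the other outputs in its group. Using the population standard deviation and returning zeros for a group with identical rewards are choices made for this sketch, not details taken from the report.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std: an assumption of this sketch
    if sigma == 0:
        # Identical rewards carry no preference signal within the group.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Because the baseline comes from the group itself, no separate value model is needed to estimate advantages.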

Table 6 ranks DeepSeek‑V3 as the top open‑source model across English, code, math, and Chinese benchmarks, with especially strong scores on AIME (39.2) and HumanEval‑Mul (82.6). It attains 91.6 F1 on DROP (3‑shot) and sets new state‑of‑the‑art scores on AIME and MATH‑500, exceeding the second‑best by roughly 10 % absolute. Open‑ended evaluations in Table 7 confirm its 85.5 % win rate on Arena‑Hard, the highest among all models.

**Table 6.** Comparison of DeepSeek-V3 with other state-of-the-art models across various benchmarks including English, Code, Math, and Chinese tasks.

**Table 7.** English open-ended conversation evaluations. For AlpacaEval 2.0, we use the length-controlled win rate as the metric.

**Table 8.** Performances of GPT-4o, Claude-3.5-sonnet and DeepSeek-V3 on RewardBench.

Ablation Studies and Expert Analysis

We examine how each component affects performance and expert specialization.

Instead of adding a separate loss term to push tokens toward under‑used experts, the model adds a per‑expert bias to the affinity scores when selecting the top‑K experts, and adjusts that bias after each training step: it is lowered for overloaded experts and raised for underloaded ones. The bias influences only which experts are selected; the gating weights are still computed from the original affinity scores.

How does this differ from the classic auxiliary‑loss approach used in earlier MoE models?

The classic method adds a loss term that penalizes deviation from a uniform expert load, which injects an extra gradient into training and requires tuning its weight; too large a weight hurts model quality. The auxiliary‑loss‑free method steers load through the routing bias alone, so no balancing gradient interferes with the language‑modeling objective. The only added knob is the bias update speed.
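A minimal sketch of bias-based load balancing, that is, a dynamic bias term added to expert affinity scores for selection only. The function names, the update speed `gamma`, and the toy scores are hypothetical; affinity scores are assumed to be positive.

```python
def route(scores: list[float], bias: list[float], k: int) -> list[tuple[int, float]]:
    """Pick top-k experts by biased score; weight them by the unbiased score."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:k]
    norm = sum(scores[e] for e in chosen)
    return [(e, scores[e] / norm) for e in chosen]


def expert_loads(batch: list[list[float]], bias: list[float], k: int) -> list[int]:
    """Count how many tokens in the batch are routed to each expert."""
    loads = [0] * len(bias)
    for scores in batch:
        for expert, _ in route(scores, bias, k):
            loads[expert] += 1
    return loads


def update_bias(bias: list[float], loads: list[int], gamma: float) -> list[float]:
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    target = sum(loads) / len(loads)
    updated = []
    for b, load in zip(bias, loads):
        if load > target:
            updated.append(b - gamma)
        elif load < target:
            updated.append(b + gamma)
        else:
            updated.append(b)
    return updated


batch = [[0.9, 0.1, 0.2], [0.8, 0.3, 0.1]]  # toy affinity scores, one row per token
bias = [0.0, 0.0, 0.0]
before = expert_loads(batch, bias, k=1)
bias = update_bias(bias, before, gamma=0.5)
after = expert_loads(batch, bias, k=1)
print(before, bias, after)
```

In this toy run both tokens initially pick the same expert; after one bias update they spread out, while the gating weights would still reflect the original scores. The exaggerated `gamma` is chosen only to make the effect visible in a single step.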

Table 4 shows that adding the Multi‑Token Prediction (MTP) objective consistently raises scores on most benchmarks, with the largest gains on HumanEval, GSM8K, and MATH.

Table 5 compares the auxiliary‑loss‑based and auxiliary‑loss‑free balancing strategies; the latter outperforms the former on almost every benchmark.

**Figure 9.** Expert load of auxiliary-loss-free and auxiliary-loss-based models on three domains in the Pile test set. The auxiliary-loss-free model shows greater expert specialization patterns than the auxiliary-loss-based one. The relative expert load denotes the ratio between the actual expert load and the theoretically balanced expert load. Due to space constraints, we only present the results of two layers as an example, with the results of all layers provided in Appendix C.

**Figure a.** Layers 1-7

**Figure.** Relative expert load across layers 7-12 for Aux-Loss-Based and Aux-Loss-Free models, comparing performance across Wikipedia (en), Github, and DM Mathematics datasets.

**Figure c.** Layers 13-19

Removing the auxiliary loss yields a cleaner, more specialized expert usage pattern and measurable accuracy gains.
