DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, Wenfeng Liang

DeepSeekMoE improves Mixture-of-Experts efficiency by segmenting experts into finer grains and isolating shared knowledge.

How can we improve Mixture-of-Experts (MoE) models by replacing coarse-grained experts with fine-grained ones and shared expert isolation?

Conventional Mixture-of-Experts (MoE) models struggle with knowledge hybridity and redundancy, as a limited number of experts are forced to learn diverse, overlapping information. DeepSeekMoE addresses this by splitting standard experts into smaller, fine-grained units to increase routing flexibility and by isolating a few always-active "shared" experts to capture knowledge common to all contexts. This architecture achieves performance comparable to dense models at significantly lower computational cost, using roughly 40% of the FLOPs required by comparable dense baselines.

Paper Primer

Standard MoE architectures route tokens to a small set of experts, which often leads to "knowledge hybridity"—where one expert must learn too many unrelated concepts—and "knowledge redundancy"—where multiple experts redundantly learn the same common facts. DeepSeekMoE solves this by increasing the number of experts while reducing their individual size, allowing for more precise, specialized knowledge decomposition.

The core mechanism relies on two strategies: fine-grained expert segmentation and shared expert isolation. By splitting each expert into $m$ smaller units and activating $m$ times as many, the model gains massive combinatorial routing flexibility; simultaneously, it forces common knowledge into dedicated shared experts that are always active, preventing routed experts from wasting capacity on redundant information.

DeepSeekMoE 16B achieves performance comparable to dense 7B models while using only ~40% of the computational budget.

Evaluations across language modeling, reasoning, and code generation benchmarks (e.g., Pile, HellaSwag, HumanEval) show that DeepSeekMoE 16B matches or exceeds LLaMA2 7B and DeepSeek 7B despite using significantly fewer FLOPs per token, a roughly 60% reduction in computational cost for comparable performance.

Why does this approach outperform standard MoE architectures like GShard?

Standard MoE models suffer from limited expert specialization because tokens are routed to a small number of large, monolithic experts. DeepSeekMoE’s fine-grained segmentation allows the model to select a more precise combination of experts for any given token, while shared experts offload common knowledge, ensuring routed experts remain highly specialized.

What is the primary trade-off or limitation of this architecture?

The model exhibits limitations in multiple-choice tasks (like MMLU) compared to dense models of similar total parameter counts. The authors attribute this to the relatively small number of attention parameters compared to the FFN-heavy MoE layers, suggesting that attention capacity remains a bottleneck for certain reasoning tasks.

DeepSeekMoE demonstrates that an MoE model can approach the performance of a dense model with the same total parameter count, which serves as the upper bound for MoE models, by optimizing expert granularity and knowledge distribution. Researchers can scale MoE models more efficiently by prioritizing expert specialization over simply adding more monolithic experts.

Introduction

Framing the expert specialization problem that motivates DeepSeekMoE.

Mixture‑of‑Experts (MoE) promises to scale language models by activating only a subset of experts per token, thus keeping compute modest while expanding parameter count. In practice, conventional MoE layers activate the top‑$K$ out of $N$ experts, but this coarse activation limits how specialized each expert can become. The resulting knowledge hybridity and redundancy hinder the model from reaching the theoretical upper bound of MoE performance.

Standard MoE forces each activated expert to juggle many unrelated concepts, preventing the experts from developing focused, non‑overlapping knowledge.

Each expert stores its $128 \times 1024$ activation matrix, requiring $128 \times 1024 \times 4\text{ bytes} \approx 0.5\text{ MiB}$ of memory (assuming 32‑bit floats).

Activating $K=2$ experts per token doubles this to $2 \times 0.5\text{ MiB} = 1\text{ MiB}$ per token.

For a modest batch of $B=32$ tokens, the total activation memory reaches $32 \times 1\text{ MiB} = 32\text{ MiB}$ for a single MoE layer, and the figure grows linearly with batch size, $K$, and the number of MoE layers.

This toy calculation shows how even a small MoE configuration can consume tens of megabytes of activation memory, illustrating why coarse expert granularity quickly becomes a bottleneck at larger scales.
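The arithmetic in this toy calculation can be reproduced with a few lines of Python. This is only a sketch: the $128 \times 1024$ shape, 32-bit floats, $K=2$, and $B=32$ are the illustrative values from the text, and the function name `activation_mib` is hypothetical.

```python
# Toy activation-memory estimate using the illustrative sizes from the text.
BYTES_PER_FLOAT32 = 4
MIB = 1024 * 1024


def activation_mib(rows: int, cols: int, experts_per_token: int, batch_tokens: int) -> float:
    """Memory in MiB when each of `batch_tokens` tokens touches `experts_per_token` experts."""
    per_expert_bytes = rows * cols * BYTES_PER_FLOAT32
    return per_expert_bytes * experts_per_token * batch_tokens / MIB


per_expert = activation_mib(128, 1024, experts_per_token=1, batch_tokens=1)
per_token = activation_mib(128, 1024, experts_per_token=2, batch_tokens=1)
per_batch = activation_mib(128, 1024, experts_per_token=2, batch_tokens=32)
print(per_expert, per_token, per_batch)  # 0.5 1.0 32.0
```

The estimate scales linearly in every argument, so doubling the batch or the number of active experts doubles the result.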

The shift from coarse‑ to fine‑grained expert specialization is the core advance of DeepSeekMoE.

Standard MoE Architecture

We describe the standard MoE layer that replaces FFNs with sparsely activated expert networks.

Standard Transformers stack identical blocks, each consisting of a self‑attention sublayer followed by a feed‑forward network, both with residual connections.

Instead of a single feed‑forward sublayer, a MoE layer routes each token to a few specialized FFN experts, keeping compute low while expanding capacity.

Apply softmax to the router logits: $s = [0.31,\ 0.12,\ 0.51,\ 0.06]$ (the $s_{i,t}$ values).

Select the top‑2 experts (indices 3 and 1): keep $s_{3,t} = 0.51$ and $s_{1,t} = 0.31$; set the others to 0.

Gate values: $g = [0.31,\ 0,\ 0.51,\ 0]$.

Each selected expert applies its FFN to the token vector $\mathbf{h}^{t}_{\ell-1}$, producing for example $e_1 = [0.8,\ 0.2]$ and $e_3 = [0.5,\ 0.7]$.

Combine: $\mathbf{h}^{t}_{\ell} = 0.31\,e_1 + 0.51\,e_3 = [0.31\cdot0.8 + 0.51\cdot0.5,\ 0.31\cdot0.2 + 0.51\cdot0.7] = [0.503,\ 0.419] \approx [0.50,\ 0.42]$.

This toy example shows how the softmax‑top‑K gate yields a sparse weighted sum that limits computation to $K$ experts; in the full layer, the residual connection adds the token's input back, preserving its original information.
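The gating steps above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the logits are hypothetical values chosen so that the softmax reproduces the scores in the example, experts are indexed from 0 rather than 1, and the residual connection is left out to mirror the toy calculation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(scores, k):
    """Keep the k largest scores as gate values and zero out the rest."""
    keep = set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])
    return [s if i in keep else 0.0 for i, s in enumerate(scores)]


def combine(gates, expert_outputs):
    """Gate-weighted sum of expert outputs; experts with a zero gate are never read."""
    dim = len(next(iter(expert_outputs.values())))
    mixed = [0.0] * dim
    for i, gate in enumerate(gates):
        if gate > 0.0:
            for d, value in enumerate(expert_outputs[i]):
                mixed[d] += gate * value
    return mixed


# Hypothetical logits, chosen so the softmax reproduces the scores in the text.
logits = [math.log(s) for s in (0.31, 0.12, 0.51, 0.06)]
scores = softmax(logits)
gates = top_k_gate(scores, k=2)
# Toy outputs of the two selected experts (experts 1 and 3 in the text, 0-based here).
expert_outputs = {0: [0.8, 0.2], 2: [0.5, 0.7]}
mixed = combine(gates, expert_outputs)

print([round(g, 2) for g in gates])  # [0.31, 0.0, 0.51, 0.0]
print([round(x, 3) for x in mixed])  # [0.503, 0.419]
```

Only the two selected experts are ever evaluated, which is where the compute saving comes from.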

How does this MoE gating differ from a hard assignment that sends each token to a single expert?

Hard assignment would pick the single highest‑scoring expert and use its output unchanged, discarding the other scores. The softmax‑top‑K gate retains the relative strengths of the top $K$ experts as weights $g_{i,t}$, allowing a token to blend multiple expert transformations. This yields richer representations and smoother gradients during training.

DeepSeekMoE Architecture

DeepSeekMoE restructures experts to boost specialization without extra cost.

When a MoE layer has only a handful of experts, each expert must absorb many unrelated patterns, which hampers specialization.

Instead of a few large experts, we split every expert's feed‑forward network into $m$ smaller experts and activate $m$ times as many per token ($mK$ instead of $K$), keeping total parameters and FLOPs unchanged.

How does this differ from simply increasing the total number of experts in a standard MoE?

Increasing N adds more parameters, raising memory and compute. Fine‑grained segmentation keeps the parameter count fixed by shrinking each expert’s hidden size, so the model gains combinatorial routing diversity without extra cost.

Original expert A’s FFN is partitioned into sub‑experts A₁ and A₂, each with hidden size 4.

Original expert B’s FFN is partitioned into sub‑experts B₁ and B₂, likewise.

For a given token, the router scores all 4 sub‑experts and picks the top 4 (so all are active).

Each sub‑expert computes its output, which is scaled by its gate value $g_{i,t}$; the four weighted outputs are summed to form the token's final representation.

The total FLOPs equal those of the original 2 experts with hidden 8, because 2 × (4‑dim FFN) × 4 active sub‑experts = 2 × (8‑dim FFN) × 2 active experts.

Splitting preserves the overall compute budget while exploding the number of distinct expert‑combinations, enabling finer knowledge partitioning.
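A short Python sketch makes both halves of that claim concrete. The configuration below ($N=16$ experts, $K=2$, split factor $m=4$, model width 1024, expert hidden size 4096) is an illustrative assumption, and the FLOP count is a rough multiply-accumulate estimate for two matrix multiplications per expert, not a measured figure.

```python
from math import comb


def segment(num_experts: int, top_k: int, d_hidden: int, m: int):
    """Split each expert into m smaller ones and activate m times as many."""
    return num_experts * m, top_k * m, d_hidden // m


def ffn_macs_per_token(d_model: int, d_hidden: int, active_experts: int) -> int:
    """Rough cost: one up-projection and one down-projection per active expert."""
    return 2 * d_model * d_hidden * active_experts


# Illustrative (assumed) baseline configuration.
d_model, d_hidden, n_experts, top_k, m = 1024, 4096, 16, 2, 4

n_fine, k_fine, hidden_fine = segment(n_experts, top_k, d_hidden, m)

coarse_combos = comb(n_experts, top_k)
fine_combos = comb(n_fine, k_fine)
coarse_macs = ffn_macs_per_token(d_model, d_hidden, top_k)
fine_macs = ffn_macs_per_token(d_model, hidden_fine, k_fine)

print(coarse_combos)             # 120
print(fine_combos)               # 4426165368
print(coarse_macs == fine_macs)  # True
```

With these assumed sizes the per-token cost is unchanged, while the number of possible expert combinations grows from 120 to more than four billion.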

A small set of always‑active “shared experts” captures common knowledge, while the remaining routed experts focus on specialized patterns.

Why not simply increase K instead of adding dedicated shared experts?

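Increasing $K$ adds more routed experts per token, which raises compute, and every extra routed slot can still be spent relearning common knowledge. Shared experts bypass the router and are always active; to keep per‑token compute constant, the number of activated routed experts is reduced by the number of shared experts, so common knowledge is captured once while the remaining routed slots are reserved for specialization.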

Expert S (the shared expert) processes the token unconditionally.

The router scores the remaining 5 routed sub‑experts and selects the top 2 of them (K = 2).

Outputs from S, plus the two selected routed experts, are summed to produce the token’s final output.

Because S is shared, its parameters are updated by every token, efficiently learning generic features.

The routed experts receive only token‑specific gradients, encouraging specialization.

Deterministic shared processing removes redundancy among routed experts while keeping the per‑token compute identical to the baseline.
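The forward pass described above can be sketched as follows. This is a hedged illustration rather than the paper's code: `scaling_expert` is a hypothetical stand-in for a real FFN expert, the router logits are made up, and the layer is assumed to add shared-expert outputs unweighted, routed-expert outputs weighted by their gate values, and the input itself through a residual connection.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scaling_expert(factor: float) -> Expert:
    """Hypothetical stand-in for an FFN expert: multiplies its input by a constant."""
    def apply(x: Vector) -> Vector:
        return [factor * v for v in x]
    return apply


def moe_with_shared_experts(
    u: Vector,
    shared: Sequence[Expert],
    routed: Sequence[Expert],
    router_logits: Sequence[float],
    top_k: int,
) -> Tuple[Vector, List[int]]:
    out = list(u)  # residual connection
    for expert in shared:  # shared experts: always active, no gating
        out = [o + y for o, y in zip(out, expert(u))]
    scores = softmax(router_logits)
    chosen = sorted(range(len(routed)), key=lambda i: scores[i], reverse=True)[:top_k]
    for i in chosen:  # routed experts: only the top-k, weighted by their gate
        out = [o + scores[i] * y for o, y in zip(out, routed[i](u))]
    return out, sorted(chosen)


u = [1.0, 2.0]
shared = [scaling_expert(0.5)]                                   # 1 shared expert
routed = [scaling_expert(f) for f in (1.0, 2.0, 3.0, 4.0, 5.0)]  # 5 routed experts
router_logits = [2.0, 0.0, 1.0, 0.0, 0.0]                        # made-up router scores

output, active = moe_with_shared_experts(u, shared, routed, router_logits, top_k=2)
print(active)  # [0, 2]
```

Because the shared expert never passes through the router, it sees every token, while each routed expert only sees the tokens that select it.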

**Figure 2.** Illustration of DeepSeekMoE. Subfigure (a) showcases an MoE layer with the conventional top-2 routing strategy. Subfigure (b) illustrates the fine-grained expert segmentation strategy. Subsequently, subfigure (c) demonstrates the integration of the shared expert isolation strategy, constituting the complete DeepSeekMoE architecture. It is noteworthy that across these three architectures, the number of expert parameters and computational costs remain constant.

Load Balancing Strategy

Load imbalance can cripple MoE training, so we add expert‑ and device‑level balance losses.

When routing is learned automatically, two problems emerge: the model may collapse onto a few experts, and when experts span multiple devices the uneven token distribution creates compute stalls.

The trick is to penalize both per‑expert overload and per‑device overload, steering the routing toward a uniform spread without over‑constraining the model.

How does this load‑balancing differ from the classic MoE loss?

Classic MoE uses only an expert‑level term that forces each expert to see the same number of tokens. Here we add a device‑level term that groups experts by the hardware they run on, allowing the expert‑level term to be relaxed while still preventing any single device from becoming a bottleneck.

Count routed tokens per expert, assuming $T = 4$ tokens, $N' = 3$ routed experts, and top‑1 routing ($K' = 1$): $c_1 = 2$, $c_2 = 1$, $c_3 = 1$.

Normalize the counts into selection frequencies: $f_i = \frac{N'}{K'T}\,c_i = \frac{3}{4}\,c_i$, so $f_1 = 1.5$, $f_2 = 0.75$, $f_3 = 0.75$ (a perfectly balanced router gives $f_i = 1$).

Average the router probabilities over tokens, $P_i = \frac{1}{T}\sum_{t} s_{i,t}$; for illustration, assume $P_1 = 0.5$, $P_2 = 0.25$, $P_3 = 0.25$.

Expert‑level loss: $L_{\text{ExpBal}} = \alpha_1 \sum_i f_i P_i = \alpha_1 (1.5\cdot0.5 + 0.75\cdot0.25 + 0.75\cdot0.25) = \alpha_1 \times 1.125$, above the value $\alpha_1 \times 1$ obtained under uniform routing.

Assign experts to devices: $E_1=\{ \text{Expert 1},\text{Expert 2}\}$, $E_2=\{ \text{Expert 3}\}$ (so $|E_1|=2$, $|E_2|=1$).

Device‑level statistics: $f'_1 = \frac{1}{2}(f_1+f_2) = 1.125$ and $f'_2 = f_3 = 0.75$; $P'_1 = P_1 + P_2 = 0.75$ and $P'_2 = P_3 = 0.25$.

Device‑level loss: $L_{\text{DevBal}} = \alpha_2 \sum_i f'_i P'_i = \alpha_2 (1.125\cdot0.75 + 0.75\cdot0.25) = \alpha_2 \times 1.03125$.

The example shows how the two losses jointly penalize an overloaded expert (Expert 1) and an overloaded device (Device 1), driving the routing toward a more uniform distribution.
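The two balance terms can be computed with a small Python sketch. It follows the definitions in the DeepSeekMoE paper as understood here: $f_i$ is the selection count scaled by $N'/(K'T)$, $P_i$ is the mean router probability, the device-level $f'$ averages $f$ over a device's experts, and the device-level $P'$ sums $P$ over them. The token assignments, router probabilities, and the choice $\alpha_1 = \alpha_2 = 1$ are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def expert_stats(
    selections: Sequence[Sequence[int]],
    probs: Sequence[Sequence[float]],
    num_experts: int,
    top_k: int,
) -> Tuple[List[float], List[float]]:
    """Return (f, P): normalized selection frequency and mean router probability."""
    num_tokens = len(selections)
    counts = [0] * num_experts
    for chosen in selections:
        for i in chosen:
            counts[i] += 1
    f = [num_experts / (top_k * num_tokens) * c for c in counts]
    p = [sum(row[i] for row in probs) / num_tokens for i in range(num_experts)]
    return f, p


def expert_balance_loss(f: Sequence[float], p: Sequence[float], alpha1: float) -> float:
    return alpha1 * sum(fi * pi for fi, pi in zip(f, p))


def device_balance_loss(
    f: Sequence[float], p: Sequence[float], groups: Sequence[Sequence[int]], alpha2: float
) -> float:
    total = 0.0
    for group in groups:
        f_dev = sum(f[j] for j in group) / len(group)  # mean frequency on the device
        p_dev = sum(p[j] for j in group)               # summed probability on the device
        total += f_dev * p_dev
    return alpha2 * total


# Illustrative toy input: 4 tokens, 3 routed experts, top-1 routing.
selections = [[0], [0], [1], [2]]
probs = [
    [0.7, 0.2, 0.1],
    [0.7, 0.1, 0.2],
    [0.3, 0.5, 0.2],
    [0.3, 0.2, 0.5],
]
f, p = expert_stats(selections, probs, num_experts=3, top_k=1)
exp_loss = expert_balance_loss(f, p, alpha1=1.0)
dev_loss = device_balance_loss(f, p, groups=[[0, 1], [2]], alpha2=1.0)
print(f)                                     # [1.5, 0.75, 0.75]
print(round(exp_loss, 5), round(dev_loss, 5))  # 1.125 1.03125
```

Under perfectly uniform routing both terms would equal their scaling factor, so any value above it reflects imbalance.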

Validation Setup

Key validation configuration and model scale for DeepSeekMoE.

In the validation configuration, the total expert parameters of DeepSeekMoE equal 16× those of a standard FFN, while the overall model stays at roughly 2 B parameters.

Expert parameters equal 16× a standard FFN; total parameter count ≈2 B.

Compared to prior MoE baselines that rely on a small number of coarse‑grained experts, DeepSeekMoE's fine‑grained expert segmentation and shared‑expert isolation enable a much richer expert pool without inflating the total parameter budget.

Validation Results and Ablations

Ablation and comparison results show DeepSeekMoE’s components drive its superior validation performance.

We evaluate on a suite of benchmarks: language modeling (Pile loss), language understanding and reasoning (HellaSwag, PIQA, ARC‑easy/challenge accuracy), reading comprehension (RACE‑high/middle accuracy), code generation (HumanEval, MBPP Pass@1), and closed‑book QA (TriviaQA, NaturalQuestions EM). Baselines include a dense Transformer (0.2 B parameters), Hash Layer and Switch Transformer (2.0 B total, 0.2 B activated), and GShard (2.0 B total, 0.3 B activated); DeepSeekMoE uses 1 shared and 63 routed experts for a total of ≈2 B parameters. After training on 100 B tokens, the sparse MoE models beat the dense baseline, GShard slightly surpasses Switch Transformer, and DeepSeekMoE outperforms all of them at the same total parameter budget.

**Figure 3.** Ablation studies for DeepSeekMoE. The performance is normalized by the best performance for clarity in presentation. All compared models have the same number of parameters and activated parameters. We can find that fine-grained expert segmentation and shared expert isolation both contribute to stronger overall performance.

**Figure 4.** Pile loss with regard to different ratios of disabled top routed experts. Notably, DeepSeekMoE exhibits greater sensitivity to the ratio of disabled top routed experts, indicating lower redundancy among routed experts in DeepSeekMoE.

**Figure 5.** Pile loss with regard to different numbers of activated routed experts in DeepSeekMoE. With only 4 routed experts activated, DeepSeekMoE achieves a Pile loss comparable with GShard.

**Figure 6.** Comparison between GShard and DeepSeekMoE with half the activated experts (trained from scratch). With the same total expert parameters and only half of the activated expert parameters, DeepSeekMoE still outperforms GShard.

Scaling to 16B

DeepSeekMoE 16 B delivers strong benchmark performance while using only a fraction of the compute of comparable dense models.

We scale DeepSeekMoE to $16\,\text{B}$ total parameters and train it on $2\,\text{T}$ tokens. The resulting model attains strong benchmark scores while consuming only a fraction of the compute required by comparable dense baselines.

DeepSeekMoE 16 B outperforms LLaMA2 7 B on the majority of benchmarks while using only about $40\%$ of the FLOPs.

Table 4 shows DeepSeekMoE 16 B achieving higher accuracy on most tasks despite $74.4\,\text{T}$ FLOPs per 4K tokens versus LLaMA2 7 B’s $187.9\,\text{T}$.

Despite having $16.4\,\text{B}$ total parameters, DeepSeekMoE 16 B activates only $2.8\,\text{B}$ parameters during inference.

Table 3 lists $2.8\,\text{B}$ activated parameters for DeepSeekMoE 16 B versus $6.9\,\text{B}$ for the dense DeepSeek 7 B, whose parameters are all active for every token.
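The headline ratios follow directly from the figures quoted above. The sketch below only redoes that arithmetic; the helper name `percent` is hypothetical, and the inputs are the rounded values from the text, so the results are approximate.

```python
def percent(part: float, whole: float) -> float:
    """Return part / whole as a percentage."""
    return 100.0 * part / whole


# Figures quoted in the text (TFLOPs per 4K tokens, parameters in billions).
flops_moe_16b = 74.4
flops_llama2_7b = 187.9
total_params_b = 16.4
activated_params_b = 2.8

flops_share = percent(flops_moe_16b, flops_llama2_7b)
activated_share = percent(activated_params_b, total_params_b)
print(round(flops_share, 1))      # 39.6
print(round(activated_share, 1))  # 17.1
```

The first figure matches the 39.6% stated in the Table 4 caption; the second shows that roughly one sixth of the model's parameters are used for any given token.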

**Figure 1.** Comparison between DeepSeekMoE 16B and open source models on the Open LLM Leaderboard. The red dashed line is linearly fitted from data points of all models except DeepSeekMoE 16B. DeepSeekMoE 16B consistently outperforms models with a similar number of activated parameters by a large margin, and achieves comparable performance with LLaMA2 7B, which has approximately 2.5 times the activated parameters.

**Figure 7.** Benchmark curves during training of DeepSeekMoE 16B and DeepSeek 7B (Dense).

**Table 3.** Comparison between DeepSeek 7B and DeepSeekMoE 16B. Bold font indicates the best or near the best. With only 40.5% of computations, DeepSeekMoE 16B achieves comparable performance with DeepSeek 7B.

**Table 4.** Comparison between LLaMA2 7B and DeepSeekMoE 16B. With only 39.6% of computations, DeepSeekMoE 16B outperforms LLaMA2 7B on the majority of benchmarks.

Alignment for 16B

DeepSeekMoE Chat 16B matches dense 7 B models while using ~40% of the compute.

DeepSeekMoE Chat 16B uses roughly $40\%$ of the FLOPs of comparable $7\,$B dense chat models while matching their performance across most benchmarks.

Table 5 shows DeepSeekMoE Chat 16B at $74.4\,\text{T}$ FLOPs per $4\,\text{K}$ tokens versus $187.9\,\text{T}$ for LLaMA2 SFT 7B, with comparable accuracy scores.

We fine‑tuned DeepSeekMoE 16B on $1.4\,$M in‑house bilingual examples (English and Chinese) using supervised fine‑tuning with a batch size of $1024$, $8$ epochs, AdamW optimizer, a constant learning rate of $1\times10^{-5}$, and a maximum sequence length of $4\,\text{K}$ tokens.
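For reference, the fine-tuning settings above can be collected into a small configuration object. This is a sketch only: the class name `SFTConfig` and its fields are hypothetical, "1.4 M" and "4 K" are taken as 1,400,000 and 4096, and the derived step counts are therefore approximate.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SFTConfig:
    """Supervised fine-tuning settings as stated in the text (names are hypothetical)."""
    num_examples: int = 1_400_000  # "1.4 M" in the text; the exact count is not given
    batch_size: int = 1024
    epochs: int = 8
    optimizer: str = "AdamW"
    learning_rate: float = 1e-5    # constant, no decay schedule
    max_seq_len: int = 4096        # "4 K" tokens, assumed to mean 4096

    def steps_per_epoch(self) -> int:
        return math.ceil(self.num_examples / self.batch_size)

    def total_steps(self) -> int:
        return self.steps_per_epoch() * self.epochs


config = SFTConfig()
print(config.steps_per_epoch(), config.total_steps())  # 1368 10944
```

Under these assumptions the whole fine-tuning run amounts to roughly eleven thousand optimizer steps.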

**Table 5.** Comparison of model performance across various benchmarks.

Across language understanding, reasoning, reading comprehension, and mathematics benchmarks, DeepSeekMoE Chat 16B matches the dense $7\,$B models despite its lower compute. It notably surpasses LLaMA2 SFT 7B on code generation tasks (HumanEval and MBPP) and exceeds it on all Chinese benchmarks, while trailing DeepSeek 7B on multiple‑choice QA but with a narrowed gap after fine‑tuning.

Scaling to 145B

DeepSeekMoE 145B matches the quality of the dense DeepSeek 67B while using a fraction of the compute (the paper reports 28.5%).

DeepSeekMoE 145B surpasses GShard 137B by a measurable margin on the full benchmark suite while using a similar parameter budget.

Table 6 shows higher average scores for DeepSeekMoE 145B across language‑modeling and knowledge‑intensive tasks.

The 145 B model was trained on 245 B tokens, far fewer than the 2 T tokens used for the 16 B model, and reused the same tokenization pipeline, indicating that the architecture scales without redesign.

**Table 6.** Comparison of model performance and architecture metrics across DeepSeek 67B (Dense), GShard 137B, DeepSeekMoE 145B, and DeepSeekMoE 142B (Half Activated).

Related Work

We survey prior MoE architectures and position GShard as the baseline.

GShard partitions a model into many expert shards and routes each token to the top‑1 or top‑2 experts, enabling massive scaling while keeping per‑token compute low.

Jacobs et al. (1991) introduced the MoE framework, in which independent expert modules handle different data samples, enabling specialization.

Jordan and Jacobs (1994) extended the MoE idea of handling different samples with independent expert modules, emphasizing modularity.

Shazeer et al. (2017) integrated MoE layers into large‑scale LSTM language models, showing that sparse expert activation scales language modeling.

GShard scaled MoE to extremely large models using learnable top‑2 routing, establishing a strong baseline for sparse‑activation scaling.

Switch Transformer introduced a top‑1 routing strategy that activates a single expert per token, simplifying the routing logic while retaining scalability.

Hash Layer uses a fixed hash function for routing, providing deterministic expert assignment and improved training stability.

StableMoE combines fixed routing with additional regularization to achieve stable training of large MoE models.

Expert‑choice routing (Zhou et al., 2022) allows each token to be assigned to a variable number of experts, increasing flexibility in expert utilization.

ST‑MoE (Zoph, 2022) addresses training instability and fine‑tuning difficulty in MoE models by introducing stabilization techniques.

Recent models build on existing MoE architectures to scale language and multimodal tasks to billions of parameters.

Appendix

Key configuration tables and comparative results for DeepSeekMoE.

This appendix collects the concrete settings used for each DeepSeekMoE configuration and places the model side‑by‑side with larger MoE and dense baselines.

The table provides architectural and training hyperparameters for three different model configurations, detailing parameters, layers, hidden size, attention heads, shared experts, routed experts (with activation counts), relative expert size, sequence length, batch size, and learning rate.

**Table 8.** Comparison between DeepSeekMoE and larger GShard models.

**Table 9.** Comparison between DeepSeekMoE and larger dense baselines.
