Mixtral of Experts
Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed
Mixtral 8x7B is a sparse mixture-of-experts model that outperforms Llama 2 70B using only 13B active parameters.
How does a Sparse Mixture of Experts (SMoE) architecture allow a model to achieve the performance of a much larger dense model while maintaining the inference speed of a smaller one?
Large language models face a trade-off between model capacity and inference cost, as increasing parameter counts typically forces a linear increase in compute per token. Mixtral 8x7B addresses this by replacing standard feedforward blocks with sparse mixture-of-experts layers: for every token, a router selects only two of eight available expert networks to process the input. This architecture allows the model to maintain a large total parameter count of 47B while using only 13B active parameters per token, resulting in performance that matches or exceeds Llama 2 70B and GPT-3.5 across major benchmarks.
Paper Primer
The core mechanism is a decoder-only transformer where each feedforward sub-block is replaced by a sparse mixture-of-experts layer. A gating network computes a softmax over the top-2 logits for each token, routing the input to two distinct SwiGLU expert networks whose outputs are then summed.
Mixtral 8x7B achieves superior performance on reasoning, math, and code benchmarks compared to Llama 2 70B.
On the MMLU benchmark, Mixtral scores 70.6% compared to 69.9% for Llama 2 70B, while using 5x fewer active parameters. Significant gains are observed in code (HumanEval: 40.2% vs 29.3%) and math (GSM8K: 74.4% vs 69.6%).
The instruction-tuned variant, Mixtral 8x7B-Instruct, is trained with supervised fine-tuning and Direct Preference Optimization (DPO). It surpasses GPT-3.5 Turbo, Claude-2.1, and Gemini Pro on human evaluation benchmarks such as the LMSys Arena leaderboard, and it scores 8.30 on MT-Bench.
Why use a sparse mixture-of-experts approach instead of simply scaling a dense model?
Sparse models decouple the total parameter count from the computational cost per token. This allows the model to benefit from the knowledge capacity of 47B parameters while maintaining the inference speed and throughput of a much smaller 13B parameter model.
Does the router show specialized behavior, such as assigning specific experts to specific topics?
No; the authors observed no obvious domain-specific expert assignment. The router appears to exhibit structured syntactic behavior, often routing consecutive tokens or specific functional tokens (like indentation in code) to the same experts.
Mixtral demonstrates that high-performance models can be served efficiently by leveraging sparse computation, effectively shifting the bottleneck from raw compute to memory bandwidth and expert-routing overhead.
Introduction
We expose the inference bottleneck of dense LLMs and propose sparse mixture of experts to cut active compute.
Dense language models grow compute proportionally to their parameter count, so scaling to hundreds of billions of parameters makes inference prohibitively slow. This scaling tension motivates a redesign of the compute pathway. Mixtral tackles the problem by decoupling model capacity from the amount of compute that runs for each token.
Instead of a single monolithic feed‑forward block, each transformer layer contains several expert blocks, and a lightweight router activates only a few of them for each token, letting the model keep a huge parameter pool while spending far less compute per token.
Consider a toy layer with 4 experts of 5 parameters each. All experts together contain $4 \times 5 = 20$ parameters (the “total” pool).
The router activates two experts per token, so each token uses $2 \times 5 = 10$ parameters (the “active” set).
If the model processes a sequence of three tokens, each token may select a different pair, but every token still uses exactly 10 parameters ($3 \times 10 = 30$ parameter usages in total) while the total pool remains 20.
Even a tiny toy model shows how the active‑parameter count stays low while the total capacity can be much larger, illustrating the core efficiency gain of SMoE.
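The toy arithmetic above can be reproduced with a few lines of Python. This is only an illustrative sketch: the constants mirror the toy example (4 experts, 5 parameters each, 2 active per token), and the `routing` list is a made-up set of router decisions, not data from the paper.

```python
# Toy sizes taken from the worked example above.
NUM_EXPERTS = 4
PARAMS_PER_EXPERT = 5
TOP_K = 2

# Total pool: every expert's parameters, whether or not they are used.
total_params = NUM_EXPERTS * PARAMS_PER_EXPERT

# Active set: parameters touched by a single token.
active_params_per_token = TOP_K * PARAMS_PER_EXPERT

# Hypothetical routing decisions for a three-token sequence
# (each tuple lists the two experts chosen for that token).
routing = [(0, 2), (1, 3), (0, 1)]

# Each token pays for exactly TOP_K experts, whichever pair it picks.
total_usages = sum(len(pair) * PARAMS_PER_EXPERT for pair in routing)

print(total_params, active_params_per_token, total_usages)
```

The per-token cost depends only on `TOP_K`, so adding more experts would raise `total_params` without changing `active_params_per_token`.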
Mixtral is trained with a 32k-token context, enabling it to retrieve information from far-away positions. On a suite of benchmarks it matches or surpasses Llama 2 70B and GPT-3.5, especially on mathematics, code generation, and multilingual tasks. The instruction-tuned variant, built with supervised fine-tuning and Direct Preference Optimization, further outperforms leading chat models on human evaluations.
The shift from dense to sparse compute lets large‑capacity LLMs run with a fraction of the active parameters.
The Mixtral Architecture
Mixtral replaces dense feed‑forward blocks with a sparse expert router that limits per‑token compute.
Mixtral is a transformer that supports a fully dense context length of 32k tokens. To keep inference cost low at this capacity, its feed-forward blocks are replaced by Mixture-of-Experts layers that activate only two expert sub-networks for each token.
**Figure 1: Mixture of Experts Layer.** Each input vector is assigned to 2 of the 8 experts by a router. The layer's output is the weighted sum of the outputs of the two selected experts. In Mixtral, an expert is a standard feedforward block as in a vanilla transformer architecture.
The router decides which experts will process a token by scoring all experts, keeping only the top K, and turning those scores into a probability distribution.
For a toy layer with four experts and $K = 2$, compute raw logits $xW_g = [1.2, 0.5, 2.0, 0.3]$.
Top-2 keeps the entries $2.0$ (expert 2) and $1.2$ (expert 0); the others become $-\infty$.
Apply Softmax to $[1.2, -\infty, 2.0, -\infty]$ → weights $[0.31, 0, 0.69, 0]$ (approximately).
Only experts 0 and 2 are evaluated; their outputs $E_0(x)$ and $E_2(x)$ are multiplied by $0.31$ and $0.69$ respectively and summed.
The router’s top‑K step guarantees that the cost per token depends only on $K$, not on the total number of experts $n$.
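The routing steps above (score all experts, keep the top $K$, softmax over the survivors, weighted sum) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: `top_k_gating` and `moe_output` are hypothetical helper names, and the experts are stand-in functions rather than SwiGLU blocks.

```python
import math

def top_k_gating(logits, k):
    """Return one gate weight per expert: softmax over the top-k logits, 0 elsewhere."""
    # Indices of the k largest logits.
    top_idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the kept logits; dropped experts act as -inf (weight 0).
    max_logit = max(logits[i] for i in top_idx)
    exps = {i: math.exp(logits[i] - max_logit) for i in top_idx}
    norm = sum(exps.values())
    return [exps[i] / norm if i in exps else 0.0 for i in range(len(logits))]

def moe_output(x, weights, experts):
    """Weighted sum of the selected experts' outputs; unselected experts are never called."""
    out = [0.0] * len(x)
    for i, w in enumerate(weights):
        if w == 0.0:
            continue  # skipped experts cost no compute
        y = experts[i](x)
        out = [o + w * yj for o, yj in zip(out, y)]
    return out

# Stand-in experts (assumption for illustration): expert i scales its input by i + 1.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

weights = top_k_gating([1.2, 0.5, 2.0, 0.3], k=2)
output = moe_output([1.0, 1.0, 1.0], weights, experts)
print(weights, output)
```

Only two entries of `weights` are nonzero and they sum to 1, so `moe_output` calls exactly two experts regardless of how many exist.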
“Total parameters” count every weight in the whole model (including all experts), while “active parameters” count only the weights actually used for a given token.
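To make the total-versus-active distinction concrete, the sketch below estimates both counts for a Mixtral-sized model. The hyperparameters are an assumption taken from the publicly released Mixtral 8x7B configuration and are not stated in this summary; the count also assumes untied input and output embeddings and ignores any biases, so it is an approximation.

```python
# Assumed hyperparameters (public Mixtral 8x7B configuration; not given in this text).
DIM = 4096
N_LAYERS = 32
HEAD_DIM = 128
HIDDEN_DIM = 14336
N_HEADS = 32
N_KV_HEADS = 8
VOCAB_SIZE = 32000
NUM_EXPERTS = 8
TOP_K = 2

def count_params(experts_counted):
    """Approximate parameter count when `experts_counted` experts per layer are included."""
    expert = 3 * DIM * HIDDEN_DIM                      # SwiGLU: gate, up, and down projections
    attention = 2 * DIM * N_HEADS * HEAD_DIM           # query and output projections
    attention += 2 * DIM * N_KV_HEADS * HEAD_DIM       # key and value projections (grouped-query)
    router = DIM * NUM_EXPERTS                         # gating matrix W_g
    norms = 2 * DIM                                    # two normalization layers per block
    per_layer = experts_counted * expert + attention + router + norms
    embeddings = 2 * VOCAB_SIZE * DIM                  # input embedding + output head (assumed untied)
    return N_LAYERS * per_layer + embeddings + DIM     # + final normalization layer

total = count_params(NUM_EXPERTS)   # every expert counted
active = count_params(TOP_K)        # only the experts a single token actually uses

print(f"total ~ {total / 1e9:.1f}B, active ~ {active / 1e9:.1f}B")
```

Under these assumptions the two counts land close to the 47B total and 13B active figures quoted in the text. The attention, router, and embedding weights are shared by every token, which is why the active count is well above a quarter of the total.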
In Mixtral, each transformer block swaps the standard feed-forward sub-layer for an MoE layer with eight SwiGLU experts, of which two ($K = 2$) are selected per token.
The formulation is similar to GShard, with two differences: Mixtral replaces every FFN sub-block with an MoE layer, whereas GShard replaces every other one, and GShard uses a more elaborate gating strategy for the second expert, whereas Mixtral uses the simpler top-K softmax routing.
Performance Benchmarks
Mixtral matches or exceeds Llama 2 70B while using five times fewer active parameters.
Mixtral 8x7B matches or outperforms Llama 2 70B on all benchmarks except reading comprehension, while using five times fewer active parameters.
Table 2 shows Mixtral’s $70.6\%$ MMLU versus Llama 2 70B’s $69.9\%$, and its $13\text{B}$ active parameters versus $70\text{B}$.
Llama 2 is a family of dense transformers in which every token passes through the full parameter set at every layer, so compute per token scales linearly with model size.
GPT‑3.5 is OpenAI’s instruction‑tuned large language model, widely used as a commercial performance reference.
**Figure 2.** Performance of Mixtral and different Llama models on a wide range of benchmarks. All models were re-evaluated on all metrics with our evaluation pipeline for accurate comparison. Mixtral outperforms or matches Llama 2 70B on all benchmarks. In particular, it is vastly superior in mathematics and code generation.
Mixtral delivers Llama 2 70B-level quality at a fraction of the active-parameter cost.
Extended Capabilities
Dense models grow linearly; Mixtral’s sparse MoE keeps active compute low while matching large models.
The central premise remains: dense scaling is compute‑heavy, while Mixtral’s Sparse Mixture of Experts activates only a fraction of parameters per token.
Mixtral attains perfect (100 %) passkey retrieval accuracy across all tested sequence lengths.
Figure 4 (left panel) shows 100% accuracy for lengths up to 28K tokens.
Maintaining accuracy across the full context window matters for retrieval-augmented and reasoning tasks that draw on tens of thousands of tokens.
**Figure 4.** Long range performance of Mixtral. (Left) Mixtral has 100% retrieval accuracy of the Passkey task regardless of the location of the passkey and length of the input sequence. (Right) The perplexity of Mixtral on the proof-pile dataset decreases monotonically as the context length increases.
We also increase the share of multilingual data during pre‑training, allowing Mixtral to excel on non‑English benchmarks while preserving English performance.
Beyond multilingual ability, we evaluate bias using BBQ and BOLD; Mixtral shows reduced bias relative to Llama 2 70B.
Alignment and Expert Analysis
We evaluate Mixtral‑Instruct’s fine‑tuning and examine its routing behavior.
The paper’s core idea is that a sparse mixture of experts decouples model capacity from the amount of compute active at inference, letting a modest number of active parameters deliver large‑model quality.
SFT aligns the raw MoE model to follow explicit instruction‑response pairs, giving it a usable conversational baseline.
DPO refines the SFT model using a paired feedback set, directly maximizing the likelihood of preferred responses without a separate reward model.
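As a rough illustration of how DPO works without a reward model, the sketch below computes the standard DPO loss for a single preference pair. This is the generic objective from the DPO literature, not a detail reported for Mixtral: the function name `dpo_loss`, the argument names, and the default `beta` are assumptions, and the inputs are summed log-probabilities of each response under the policy being trained and under a frozen reference model (here, the SFT model).

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of responses."""
    # How much more (in log space) the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Implicit reward margin between the preferred and the rejected response.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical log-probabilities, for illustration only.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)   # policy identical to reference
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)   # policy now favors the chosen response
print(neutral, improved)
```

The loss falls as the policy raises the likelihood of the preferred response relative to the rejected one, which is the sense in which the preference signal is optimized directly.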
Mixtral‑Instruct attains an MT‑Bench score of 8.30, the highest among open‑weights models as of December 2023.
This score is reported on MT-Bench for the instruction-tuned model; the human-preference comparison comes from the LMSys Arena ratings in Figure 6.
Mixtral uses only 13 B active parameters per token while outperforming Llama 2 70B, which consumes 70 B active parameters per token.
Parameter counts are reported in the model specifications; the performance comparison is based on the benchmark results in Table 2 and Figure 2.
**Figure 5.** Bias Benchmarks. Compared to Llama 2 70B, Mixtral presents less bias (higher accuracy on BBQ, lower std on BOLD) and displays more positive sentiment (higher avg on BOLD).
**Figure 6.** LMSys Leaderboard. (Screenshot from Dec 22, 2023) Mixtral 8x7B Instruct v0.1 achieves an Arena Elo rating of 1121 outperforming Claude-2.1 (1117), all versions of GPT-3.5-Turbo (1117 best), Gemini Pro (1111), and Llama-2-70b-chat (1077). Mixtral is currently the best open-weights model by a large margin.
We next probe the router’s expert‑selection patterns across three depths of the model.
**Figure 7.** Proportion of tokens assigned to each expert on different domains from The Pile dataset for layers 0, 15, and 31. The gray dashed vertical line marks 1/8, i.e. the proportion expected with uniform sampling. Here, we consider experts that are either selected as a first or second choice by the router. A breakdown of the proportion of assignments done in each case can be seen in Figure 9 in the Appendix.
**Table 5.** Percentage of expert assignment repetitions. We evaluate the proportion of times the same expert is assigned to a token $i$ and its following token $i+1$. We report whether the first chosen expert is the same, or whether the same expert is observed as first or second choice in consecutive tokens. For reference, the expected proportion of repetitions in the case of random assignments is $\frac{1}{8} = 12.5\%$ for "First choice" and $1 - \frac{6}{8} \cdot \frac{5}{7} \approx 46\%$ for "First and second choice". Repetitions at the first layer are close to random, but are significantly higher at layers 15 and 31. The high number of repetitions shows that expert choice exhibits high temporal locality at these layers.
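The two random-assignment baselines quoted in the caption can be checked by brute force. The sketch below assumes that routing picks an unordered pair of distinct experts uniformly at random and counts how often two consecutive tokens share at least one expert; the helper name `overlap_probability` is illustrative.

```python
from fractions import Fraction
from itertools import combinations

NUM_EXPERTS = 8
TOP_K = 2

def overlap_probability(num_experts, k):
    """Chance that a uniformly random k-subset shares an expert with a fixed k-subset."""
    subsets = list(combinations(range(num_experts), k))
    reference = set(subsets[0])  # by symmetry, any fixed reference subset gives the same answer
    overlapping = sum(1 for s in subsets if reference & set(s))
    return Fraction(overlapping, len(subsets))

# "First choice": the next token's top expert equals this token's top expert.
first_choice = Fraction(1, NUM_EXPERTS)

# "First and second choice": the two tokens' expert pairs share at least one expert.
either_choice = overlap_probability(NUM_EXPERTS, TOP_K)

print(float(first_choice), float(either_choice))
```

The enumeration agrees with the closed form $1 - \frac{6}{8} \cdot \frac{5}{7}$, so repetition rates well above these values indicate genuine temporal locality rather than chance.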
**Figure 8.** Text samples where each token is colored with the first expert choice. The selection of experts appears to be more aligned with the syntax rather than the domain, especially at the initial and final layers.
**Figure 9.** Proportion of tokens assigned to each expert on different subsets from The Pile dataset, separated by whether the expert was selected as first or second choice, or either. The “Either choice” case is equivalent to Figure 7. The gray dashed vertical line marks 1/8, i.e. the proportion expected with uniform sampling.
**Figure 10.** Repeated consecutive assignments per MoE layer. Repeated assignments occur much more often than they would with uniform assignments (shown by the dashed lines). Patterns are similar across datasets, with fewer repetitions for DM Mathematics.
In summary, Mixtral 8x7B demonstrates that a sparsely activated MoE can match or exceed dense‑model performance while using far fewer active parameters, and its routing dynamics exhibit syntactic consistency and temporal locality that can be harnessed for efficient inference.