The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence

MiniMax, Aili Chen, Aonian Li, Baichuan Zhou, Bangwei Gong, Binyang Jiang, Boji Dan, Changqing Yu, Chao Wang, Cheng Ma, Cheng Zhong, Cheng Zhu, Chengjun Xiao, Chengyi Yang, Chengyu Du, Chenyang Zhang, Chi Zhang, Chuangyi Huang, Chunhao Zhang, Chunhui Du, Chunyu Zhao, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dongyu Zhang, Enhui Yang, Fei Yu, Guang Zheng, Guodong Zheng, Guohong Li, Haichao Zhu, Haigang Zhou, Haimo Zhang, Han Ding, Hao Zhang, Haohai Sun, Haolin Lyu, Haonan Lu, Haoyu Wang, Huajie Shi, Huiyang Li, Jiacheng Chen, Jian Zhang, Jiaqi Zhuang, Jiaren Cai, Jiaxin Pan, Jiayao Li, Jiayuan Song, Jichuan Zhang, Jie Wang, Jihao Gu, Jin Zhu, Jingwei Dong, Jingyang Li, Jingyu Zhang, Jingze Zhuang, Jinhao Tian, Jinli Liu, Jinyi Hu

A 229.9B-parameter Mixture-of-Experts model optimized for agentic workflows via large-scale, verifiable data pipelines.

How can a sparse Mixture-of-Experts model achieve frontier-level reasoning and agentic performance while keeping activated parameters low?

Large language models struggle to maintain efficiency and reliability when transitioning from simple chat to long-horizon agentic tasks like production coding and office automation. The MiniMax-M2 series addresses this by using a fine-grained Mixture-of-Experts architecture that activates only 9.8B parameters per token, paired with an agent-native reinforcement learning system that decouples training, inference, and agent logic. The flagship M2.7 model achieves frontier-tier performance on complex agentic benchmarks while maintaining a significantly smaller active compute footprint than dense models.

Paper Primer

The M2 architecture relies on a 62-layer decoder-only Transformer with 256 fine-grained experts and sigmoid gating, which removes the zero-sum constraint of traditional softmax routing. This design allows multiple experts to activate simultaneously, improving load balancing and reducing reliance on auxiliary losses.

To support agentic deployment, the authors built Forge: an agent-native reinforcement learning system. Forge uses windowed-FIFO scheduling and prefix-tree merging to handle long-horizon trajectories, while decoupling the training loop from the agent's specific architecture to support both white-box and black-box models.

M2.7 achieves frontier-tier agentic performance with a sparse activation footprint.

Benchmark scores include 56.2 on SWE-bench Pro, 94.2 on AIME 2026, and 89.8 on GPQA-Diamond. The model delivers these results while activating only ~10B parameters per token out of a 229.9B total parameter count.

The series demonstrates early self-evolution in the M2.7 checkpoint, where the model autonomously triages failed training runs and modifies its own agent scaffold, effectively automating a significant bottleneck in frontier model development.

Why does the paper prioritize full attention over hybrid or sparse attention mechanisms?

The authors found that while hybrid attention variants performed well on shorter tasks, they consistently degraded in performance on long-context retrieval and multi-hop reasoning, which are critical for agentic reliability.

How does the data pipeline ensure the quality of agentic trajectories?

The pipeline uses "Agent-as-a-Verifier" (AaaV) and test-based reward construction, where trajectories are only accepted if they pass execution-layer checks in sandboxed environments or meet multi-axis rubrics for reasoning and aesthetics.

The M2 series demonstrates that agentic capability is less about raw parameter count and more about the quality of verifiable, environment-grounded data pipelines and stable RL infrastructure.

Introduction

We frame the efficiency and capability challenges that motivate the MiniMax‑M2 series.

Large language models are moving from single‑turn dialogue to long‑horizon agentic workflows—coding, web navigation, and office‑task automation. This shift exposes two difficulties: (1) ultra‑long contexts make training and inference prohibitively expensive, and (2) real‑world deployment requires solving complex, high‑stakes problems such as production‑grade software engineering.

MiniMax‑M2 is a sparse Mixture‑of‑Experts model that activates only a small fraction of its parameters per token, keeping per‑token compute low while scaling overall capacity.

How does MiniMax‑M2 differ from a standard MoE that activates many experts per token?

An MoE that activates many experts per token incurs near‑dense compute. MiniMax‑M2 instead routes each token to only a handful of its fine‑grained experts, scored with sigmoid gating, so the per‑token FLOP count stays roughly constant even as the total expert pool grows to hundreds of billions of parameters.

Consider a toy model with 10 experts of 1 B parameters each: total parameters = 10 × 1 B = 10 B.

Gate selects 2 experts → activated parameters = 2 × 1 B = 2 B.

Activation ratio = 2 B / 10 B = 0.2 (20 %).

Even with a modest expert pool, the activation fraction stays low, illustrating how MiniMax‑M2 keeps per‑token compute bounded while the overall model size grows.
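The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: `activation_ratio` is a hypothetical helper, the toy numbers come from the example, and the M2 figures (9.8B activated, 229.9B total) are the ones reported earlier in this summary.

```python
def activation_ratio(num_experts: int, params_per_expert: float, active_experts: int) -> float:
    """Fraction of expert parameters touched by one token in a toy MoE."""
    total = num_experts * params_per_expert
    active = active_experts * params_per_expert
    return active / total


# Toy example from the text: 10 experts of 1B parameters, 2 selected per token.
toy_ratio = activation_ratio(num_experts=10, params_per_expert=1e9, active_experts=2)

# Reported M2 figures: 9.8B activated out of 229.9B total parameters.
m2_ratio = 9.8e9 / 229.9e9

print(f"toy ratio: {toy_ratio:.1%}")
print(f"M2 ratio:  {m2_ratio:.1%}")
```

The toy model activates 20 % of its parameters, while the reported M2 figures correspond to a ratio between 4 % and 5 %.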

The series rests on three pillars: (i) agent‑driven data pipelines that generate verifiable trajectories for coding and cowork tasks; (ii) Forge, a scalable RL system with windowed‑FIFO scheduling, prefix‑tree merging, and a clean separation of training, inference, and agent; and (iii) the M2.7 checkpoint, which autonomously debugs training runs and rewrites its own scaffold.

The core philosophy is that mini activations unlock maximal real‑world intelligence.

Architecture and Design

We detail the MoE feed‑forward and the Multi‑Token Prediction module that enable low per‑token compute while scaling capacity.

Scaling language models traditionally inflates the compute each token requires, limiting throughput. Our design tackles this by sparsifying the feed‑forward layer with a Mixture‑of‑Experts and by adding a Multi‑Token Prediction objective that amortizes computation across several future tokens.

Think of the feed‑forward layer as a team of specialist subnetworks; each token consults only a few experts, keeping the work per token small while the overall model capacity grows with the number of experts.

Compute sigmoid scores $s_i = \sigma(w_i^\top x + b_i)$ for each expert $i\in\{1,\dots,4\}$.

Select the two highest scores (e.g., experts 2 and 3) as the active set.

Pass the token representation through the two selected experts, producing two $64$‑dim outputs.

Average the two outputs (or concatenate and project) to obtain the final MoE feed‑forward result.

Update the bias terms $b_i$ via gradient descent; experts that are under‑used receive a small upward push.

Even with only two active experts, the combinatorial routing ($\binom{4}{2}=6$ possible pairs) yields richer expressivity than a single monolithic feed‑forward layer.
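The routing steps above can be sketched in plain Python. This is an illustrative toy, not the model's implementation: the experts are hypothetical scaling functions, the gate weights are set to zero so that the biases alone decide the routing, and the bias values are chosen so that experts 2 and 3 win, as in the example. The bias update step is left out.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def route_token(x, gate_w, gate_b, experts, k=2):
    """Score each expert with an independent sigmoid, keep the top-k, average their outputs."""
    scores = [
        sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
        for w, b in zip(gate_w, gate_b)
    ]
    active = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    outputs = [experts[i](x) for i in active]
    mixed = [sum(out[d] for out in outputs) / k for d in range(len(x))]
    return sorted(active), mixed, scores


DIM = 64
x = [1.0] * DIM                                  # toy token representation
gate_w = [[0.0] * DIM for _ in range(4)]         # zero weights: biases decide routing (assumption)
gate_b = [-1.0, 2.0, 1.0, 0.0]                   # hypothetical biases favouring experts 2 and 3
# Hypothetical experts: each one simply scales its input by a constant.
experts = [lambda v, s=s: [s * e for e in v] for s in (0.5, 1.0, 1.5, 2.0)]

active, mixed, scores = route_token(x, gate_w, gate_b, experts)
print("active experts (1-indexed):", [i + 1 for i in active])
print("first output value:", mixed[0])
```

Averaging is used here for the combination step; the concatenate-and-project alternative mentioned above would replace the `mixed` line with a learned projection.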

How does sigmoid gating differ from the classic top‑$k$ softmax gating used in earlier MoE models?

Top‑$k$ softmax forces the scores to sum to one, so increasing one expert’s score necessarily depresses the others. Sigmoid gating gives each expert an independent probability, allowing all $k$ experts to be high‑confidence simultaneously and avoiding the zero‑sum competition that can cause routing instability.
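The zero-sum effect is easy to see numerically. In the sketch below (the logits are made-up values), raising one expert's logit lowers every other expert's softmax score but leaves the other sigmoid scores untouched.

```python
import math


def softmax(logits):
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_scores(logits):
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


logits = [2.0, 2.0, -1.0, -1.0]           # hypothetical gate logits for 4 experts
boosted = [4.0, 2.0, -1.0, -1.0]          # same logits, expert 1 boosted

softmax_before, softmax_after = softmax(logits), softmax(boosted)
sigmoid_before, sigmoid_after = sigmoid_scores(logits), sigmoid_scores(boosted)

print("softmax score of expert 2:", softmax_before[1], "->", softmax_after[1])
print("sigmoid score of expert 2:", sigmoid_before[1], "->", sigmoid_after[1])
```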

Instead of predicting a single next token, the model predicts a short horizon of tokens in one forward pass, sharing the same hidden states so the cost of the transformer layers is amortized over multiple predictions.

Why copy‑initialize the hidden state from one MTP module to the next instead of learning separate parameters for each step?

Copy‑initialization guarantees that each speculative step begins from the exact representation the main model would have produced after generating the preceding token, preserving coherence across the predicted horizon while reusing the same parameters, which keeps the parameter count low.

**Figure 2.** Multi-Token Prediction (MTP) module architecture used in M2.

**Table.** Model configuration and benchmark results comparing Baseline, w/ MTP, and w/ Fine-Grained approaches.

Pre-Training and Data

Ablations show SWA modestly improves some benchmarks but hurts performance on long‑context tasks.

SWA (sliding window attention) restricts each token to attending over a fixed‑size window of recent tokens instead of the full context, which lowers attention cost on long sequences.

How does SWA differ from full attention?

Full attention lets every token attend directly to all preceding tokens. With SWA, tokens outside the window are reachable only indirectly, through information propagated across layers, which is consistent with the long‑context degradation seen in the ablation.

Table 2 reports the following Baseline versus w/ SWA results:

- RULER 128K CWE: accuracy drops from 90.0 % to 72.0 %.

- MTOB K‑e BLEURT: the score falls from 60.0 to 45.0.

- HELMET ICL: performance falls by 3.1 points (75.8 → 72.7).

- MMLU: a modest gain of 0.1 points (85.5 → 85.6).

**Table 2.** Comparison of performance between the Baseline and the model w/ SWA (sliding window attention) across General Benchmarks and Agent Benchmarks.

The pre‑training corpus totals 19.9 T tokens in the constant phase, with an additional 9.3 T tokens during the decay phase that progressively extends the context window from 8 K to 192 K tokens using high‑quality code concatenations and long‑form PDFs.

Application Development Tasks

We describe the expert-in-the-loop pipeline that creates and verifies full application tasks, and the Terminal‑Gym synthesis system.

Building full applications from scratch demands runtime verification and design quality beyond static code checks, so we introduce an expert‑in‑the‑loop pipeline that synthesizes tasks, samples trajectories, and validates them with an Agent‑as‑a‑Verifier.

Domain experts write meta‑query templates; the system expands them into concrete development tasks, generates execution trajectories, and discards failures via a three‑layer verifier.

How does this pipeline differ from a plain LLM‑as‑a‑judge approach?

The pipeline adds three verification layers that actually execute the generated code, interact with it, and assess visual design, whereas a plain LLM‑as‑a‑judge only scores static snippets or screenshots without running the program.

In a worked example with four synthesized tasks, each task is sent to the LLM with a high‑temperature prompt, producing one candidate application per task.

The AaaV sandbox attempts to build and run each candidate; two fail the Execution Layer (syntax error, missing dependency).

The remaining two pass Execution, then the Interaction Layer checks for a clickable “Submit” button; one fails because the button is missing.

The sole survivor proceeds to the Visual Aesthetics Layer, where a rubric scores layout professionalism; it receives a passing score.

The three‑layer verifier quickly discards non‑functional code while preserving only fully runnable, interactive, and well‑designed applications.
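The walkthrough above (four candidates, two failing Execution, one failing Interaction, one surviving) can be sketched as a sequential filter. This is only an illustration: the `Candidate` fields stand in for the results of real sandbox runs, and the names, the aesthetics threshold of 0.7, and the scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    builds: bool               # stand-in for the sandbox build-and-run result
    has_submit_button: bool    # stand-in for the interaction probe
    aesthetics_score: float    # stand-in for the rubric-based visual score


def execution_layer(c: Candidate) -> bool:
    return c.builds


def interaction_layer(c: Candidate) -> bool:
    return c.has_submit_button


def aesthetics_layer(c: Candidate, threshold: float = 0.7) -> bool:
    return c.aesthetics_score >= threshold


def verify(candidates):
    """Apply the three layers in order; record which layer rejected each candidate."""
    layers = [
        ("execution", execution_layer),
        ("interaction", interaction_layer),
        ("aesthetics", aesthetics_layer),
    ]
    survivors, rejected = list(candidates), {}
    for layer_name, check in layers:
        passed = []
        for c in survivors:
            if check(c):
                passed.append(c)
            else:
                rejected[c.name] = layer_name
        survivors = passed
    return survivors, rejected


candidates = [
    Candidate("app_a", builds=False, has_submit_button=False, aesthetics_score=0.0),  # syntax error
    Candidate("app_b", builds=False, has_submit_button=False, aesthetics_score=0.0),  # missing dependency
    Candidate("app_c", builds=True, has_submit_button=False, aesthetics_score=0.8),
    Candidate("app_d", builds=True, has_submit_button=True, aesthetics_score=0.85),
]
survivors, rejected = verify(candidates)
print([c.name for c in survivors], rejected)
```

Ordering the layers from cheapest to most expensive means most failures are discarded before the costlier interaction and visual checks run.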

We turn curated Stack Overflow posts into verifiable terminal tasks by generating Docker environments, tests, and difficulty‑calibrated variants.

Stage 1: For each of two example posts we generate a Dockerfile (installing coreutils for the first and python‑venv for the second) and a test script that runs `ls` or `pip install` respectively; both tests pass.

Stage 2: We strip the explicit command `ls -la` from the first task, leaving only “list directory contents”, and replace the concrete package name in the second task with a generic placeholder.

Stage 3: We evaluate a reference solver on the abstracted tasks; the first task has a 70 % pass rate, the second 45 %. Both are kept because they are below the 80 % threshold, indicating higher difficulty.

By iteratively abstracting hints and filtering on solver performance, the pipeline yields terminal tasks that remain challenging even after removing surface cues.
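The Stage 3 filter can be sketched as follows. The 80 % threshold and the 70 % and 45 % pass rates come from the example above; the attempt counts, the task names, and the extra "easy" task are hypothetical and added only to show a rejection.

```python
def pass_rate(outcomes):
    """Fraction of reference-solver attempts that passed the task's tests."""
    return sum(outcomes) / len(outcomes)


def keep_task(rate: float, max_pass_rate: float = 0.80) -> bool:
    """Keep a task only if the reference solver passes it less often than the threshold."""
    return rate < max_pass_rate


# Hypothetical solver outcomes (True = tests passed) for each abstracted task.
solver_runs = {
    "list directory contents": [True] * 7 + [False] * 3,       # 70 % pass rate
    "install a package in a venv": [True] * 9 + [False] * 11,  # 45 % pass rate
    "hypothetical easy task": [True] * 19 + [False] * 1,       # 95 % pass rate
}

rates = {task: pass_rate(runs) for task, runs in solver_runs.items()}
kept = [task for task, rate in rates.items() if keep_task(rate)]
print(kept)
```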

**Figure 3.** The agentic coding data pipelines for SWE and AppDev tasks.

Agentic Cowork

Agentic Cowork defines a unified pipeline for collecting professional‑task data across diverse domains.

Real‑world deployment demands agents that can move between heterogeneous professional environments—searching the open web, handling spreadsheets, drafting slides, and producing office artifacts. Existing pipelines either focus on a single tool or rely on generic judges that cannot enforce domain‑specific quality, leading to brittle behaviours and costly manual curation.

Agentic Cowork is a single, domain‑agnostic pipeline that turns real workspaces into training tasks, distills teacher trajectories, and validates outputs with artifact‑aligned signals.

How does Agentic Cowork differ from a generic multi‑task data‑collection pipeline?

Generic pipelines usually treat every task as a text‑only prompt and rely on a single, often heuristic, judge. Agentic Cowork ties each task to a concrete, executable workspace, uses domain‑specific verification (e.g., spreadsheet formula correctness), and applies a two‑stage selection (pairwise comparison + rubric) for non‑verifiable outputs, ensuring that the collected data respects the artifact’s structural constraints.

Domain 1 (spreadsheet): Teacher A fills cell B1 with “42”, Teacher B fills B1 with “41”. Both trajectories are recorded.

Domain 2 (slide): Teacher A adds a title “Q1 Results”, Teacher B adds a title “Quarter 1 Summary”. Both trajectories are recorded.

Verification: Spreadsheet acceptance checks that B1 contains a numeric value; both candidates pass. Slide acceptance checks that the title field is non‑empty; both pass.

Pairwise comparison: For each domain we ask a weak evaluator to rank the two trajectories on “reasoning‑and‑action coherence”. The evaluator prefers Teacher A for the spreadsheet (exact match) and Teacher B for the slide (more descriptive).

Rubric filtering: A final rubric requires the spreadsheet to match a target value (±1) and the slide title to be ≤ 5 words. Both selected candidates satisfy the rubric, so they are added to the training corpus.

The example shows that the same pipeline can handle both automatically verifiable artifacts (numeric spreadsheet cells) and artifacts that need human‑style ranking (slide titles), while still producing a single, high‑quality training example per domain.
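A compact way to picture the selection flow in this example is a three-step function: verify, compare pairwise, then apply the rubric. Everything below is a hypothetical stand-in. The `prefer` callables replace the evaluator model with simple heuristics, and the target value of 42 is inferred from the "exact match" remark above.

```python
def select_trajectory(candidates, verify, prefer, rubric):
    """Verify candidates, pick a winner by pairwise comparison, then apply the final rubric."""
    valid = [c for c in candidates if verify(c)]
    if not valid:
        return None
    best = valid[0]
    for challenger in valid[1:]:
        best = prefer(best, challenger)
    return best if rubric(best) else None


TARGET = 42  # assumed target value for cell B1

spreadsheet_pick = select_trajectory(
    [42, 41],
    verify=lambda v: isinstance(v, (int, float)),                         # B1 is numeric
    prefer=lambda a, b: a if abs(a - TARGET) <= abs(b - TARGET) else b,   # stand-in evaluator
    rubric=lambda v: abs(v - TARGET) <= 1,                                # target value +/- 1
)

slide_pick = select_trajectory(
    ["Q1 Results", "Quarter 1 Summary"],
    verify=lambda t: bool(t.strip()),                                     # title is non-empty
    prefer=lambda a, b: a if len(a) >= len(b) else b,                     # crude proxy for "more descriptive"
    rubric=lambda t: len(t.split()) <= 5,                                 # at most 5 words
)

print(spreadsheet_pick, "|", slide_pick)
```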

Deep Search and Open‑Web Research applies the pipeline to web browsing. Starting from a seed question, we iteratively rewrite and obscure entities to control difficulty, then require the retrieved answer to be grounded in actual web evidence rather than model memory.

Knowledge‑Worker Office Tasks extends the pipeline to end‑to‑end professional deliverables. We anchor the corpus to the GDPval benchmark, hierarchically synthesize tasks across occupational categories, and enforce a multi‑axis rubric that checks factuality, regional appropriateness, and depth of reasoning.

Financial Analysis and Spreadsheet Operations uses two parallel pipelines. The first inverts the usual authoring order: we first run real financial tools, then reverse‑derive tasks that are guaranteed to be executable. The second walks through a workbook, records atomic operations, and synthesizes tasks from the resulting trajectories.

Slide Generation and Editing treats deck creation as an open‑ended generation problem and slide editing as a localized intervention problem. We curate source documents, generate diverse queries, and validate outputs with a combination of execution success, functional correctness, rule‑based layout checks, and a visual scorer.

Reinforcement Learning Algorithm

6. Reinforcement Learning

6.1. RL Algorithm

6.1.1. Agent RL Modeling

We formulate agent reinforcement learning by treating the LLM as a policy and everything outside the model’s generation process—including context management, memory access, and agent state transition—as the environment. This separation provides a clean abstraction that naturally extends the standard RL framework to accommodate the complexity of agentic systems.

6.1.2. MDP Formulation

We model the agent‑environment interaction as a Markov Decision Process $M = (S, A, T, R, \gamma)$. At each step $t$, the agent observes a state $s_t \in S$—comprising the current context window content, including the task instruction, prior conversation history, tool outputs and any artifacts produced during the agent loop—and produces an action $a_t \in A$, defined as a single‑step LLM completion. This completion may contain natural language reasoning, a tool invocation request, an explicit context management operation, a communication with a sub‑agent, or any combination thereof. The environment then executes the requested operations and returns an observation $o_t$, which, together with possible context management operations, determines the next state:

$$ s_{t+1} = f_{\text{trans}}(s_t, a_t, o_t), $$

where $f_{\text{trans}}$ denotes an arbitrary state transition function that may change the accumulated context and the internal state of the agent loop. The trajectory $\tau = (s_0, a_0, s_1, a_1, \dots, s_T, a_T)$ constitutes a complete episode, and the policy $\pi_\theta(a_t \mid s_t)$ is parameterized by the LLM weights $\theta$.

A key design principle is that the environment boundary is drawn at the model’s generation interface. All components that process, transform, or respond to the model’s outputs are treated as part of the environment dynamics:

- **Tool Environments:** external tool execution (code interpreters, search engines, APIs) that returns structured observations in response to tool‑call actions.

- **Agent Harness:** the harness‑level control flow that governs how the agent proceeds between LLM calls—including context management, branching logic, sub‑agent delegation, and interactions with external modules.

6.1.3. Training Objective

A key consequence of this modeling is that $\pi_\theta$ is not required to explicitly reason about or control the environment and state transitions. Training operates on individual $(s_t, a_t)$ pairs as atomic units: each pair constitutes a single training sample for the policy gradient. This decouples the policy from the mechanics of state evolution—the model need not be aware of whether $s_t$ resulted from a simple message append, an aggressive context truncation, or a complete history rewrite. Meanwhile, credit assignment, advantage estimation, and reward propagation can still be performed at the episode level over the full trajectory $\tau$, ensuring that the contribution of each $(s_t, a_t)$ pair is evaluated in the context of the overall task outcome.
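The sketch below illustrates this decoupling under simplifying assumptions: states are tuples of strings, `f_trans` appends the action and observation and then truncates old context while keeping the task instruction, and the rollout collects $(s_t, a_t)$ pairs without the policy ever seeing how each state was produced. `toy_policy`, `toy_env`, and `max_items` are hypothetical.

```python
def f_trans(state, action, observation, max_items=6):
    """Toy transition: append action and observation, then truncate older context."""
    new_state = state + (action, observation)
    if len(new_state) > max_items:
        # Keep the task instruction (first item) plus the most recent items.
        new_state = new_state[:1] + new_state[-(max_items - 1):]
    return new_state


def rollout(policy, env_step, instruction, max_steps=4):
    """Run one episode and return the (state, action) pairs used as training samples."""
    state = (instruction,)
    samples = []
    for _ in range(max_steps):
        action = policy(state)            # a single-step LLM completion in the real system
        samples.append((state, action))   # atomic training unit
        if action == "DONE":
            break
        observation = env_step(action)    # tool execution lives in the environment
        state = f_trans(state, action, observation)
    return samples


def toy_policy(state):
    # Hypothetical stand-in for the LLM: stop once two tool results are in context.
    return "DONE" if len(state) >= 5 else "tool:echo"


def toy_env(action):
    return f"obs for {action}"


samples = rollout(toy_policy, toy_env, "fix the bug")
truncated = f_trans(("0", "1", "2", "3", "4", "5"), "a", "o")
print(len(samples), truncated)
```

Swapping `f_trans` for a different context-management strategy changes the states the policy sees but not the shape of the training samples.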

6.1.4. Policy Optimization

**CISPO.** We adapt Clipped Importance Sampling Policy Optimization (CISPO) (MiniMax, 2025a) to M2 series RL training. The objective function is:

$$ J_{\text{CISPO}}(\theta) = \mathbb{E}_{(q,a)\sim D,\ \{o_i\}_{i=1}^{G}} \left[ \frac{1}{\sum_{i=1}^{G} |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \operatorname{sg}\!\big(\hat r_{i,t}(\theta)\big)\, \hat A_{i,t} \log \pi_\theta\big(o_{i,t} \mid q, o_{i,<t}\big) \right], $$

where $G$ is the number of rollout trajectories per prompt, $|o_i|$ is the token length of trajectory $i$, and $\operatorname{sg}(\cdot)$ denotes the stop‑gradient operator that prevents gradient flow through the importance weight.

The importance sampling ratio is clipped asymmetrically:

$$ \hat r_{i,t}(\theta) = \operatorname{clip}\!\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})},\; 0,\; 1 + \epsilon_{\text{IS}}^{\text{high}} \right), $$

where the upper bound $1 + \epsilon_{\text{IS}}^{\text{high}}$ prevents excessively large policy updates, while the zero lower bound permits aggressive down‑weighting of actions that become improbable under the current policy.

The stop‑gradient on the clipped ratio ensures that the importance weight modulates the gradient magnitude without introducing second‑order terms, yielding a stable first‑order update rule.
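To make the role of the stop-gradient concrete, the sketch below computes the coefficient that multiplies each token's $\nabla_\theta \log \pi_\theta$ term: the clipped ratio times the advantage, normalized by the total token count. It is a plain-Python illustration with no autograd; in a real framework the clipped ratio would be detached before being multiplied in. The value `eps_high=0.2` and all log-probabilities are made-up.

```python
import math


def cispo_token_weights(logp_new, logp_old, advantages, eps_high=0.2):
    """Per-token coefficients sg(clipped ratio) * advantage / total tokens.

    Each argument is a list of trajectories, each a list of per-token values.
    """
    total_tokens = sum(len(traj) for traj in logp_new)
    weights = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        row = []
        for new, old, a in zip(lp_new, lp_old, adv):
            ratio = math.exp(new - old)                        # pi_theta / pi_theta_old
            clipped = min(max(ratio, 0.0), 1.0 + eps_high)     # clip to [0, 1 + eps_high]
            row.append(clipped * a / total_tokens)             # treated as a constant (stop-gradient)
        weights.append(row)
    return weights


def cispo_objective(logp_new, logp_old, advantages, eps_high=0.2):
    """Value of the surrogate objective: sum of weight * log pi_theta over all tokens."""
    weights = cispo_token_weights(logp_new, logp_old, advantages, eps_high)
    return sum(
        w * lp
        for w_row, lp_row in zip(weights, logp_new)
        for w, lp in zip(w_row, lp_row)
    )


# Two toy trajectories (G = 2) with 2 and 1 tokens.
logp_new = [[-1.0, -0.5], [-2.0]]
logp_old = [[-1.0, -1.5], [-1.0]]
advantages = [[1.0, 1.0], [-1.0]]

weights = cispo_token_weights(logp_new, logp_old, advantages)
print(weights, cispo_objective(logp_new, logp_old, advantages))
```

The second token's ratio of roughly 2.72 is capped at 1.2, while the third token's ratio of roughly 0.37 passes through unclipped because the lower bound is zero. No token is dropped from the gradient; clipping only rescales its contribution.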

The advantage estimate is computed via reward‑to‑go with a trajectory‑level baseline:

$$ \hat A_{i,t} = \sum_{p=t}^{T} r_p - B_i, $$

where $r_p$ is the composite reward at step $p$ (defined below) and $B_i$ is the baseline computed over trajectory $i$ for variance reduction.

6.1.5. Reward Design

Standard outcome‑based rewards are insufficient for credit assignment in agent trajectories that may span up to 192 K tokens with thousands of intermediate actions. We design a composite reward framework with three components.

**Process Reward.** We assign dense, intermediate rewards that target specific behavioral patterns throughout the trajectory, including penalties for language mixing and tool invocation format errors, and rewards for well‑structured intermediate reasoning steps. These process rewards provide fine‑grained supervisory signal at each $(s_t, a_t)$ pair, substantially improving credit assignment granularity over sparse outcome‑only feedback.

**Task Completion Time Reward.** Traditional RL objectives optimize solely for correctness, neglecting execution efficiency. For agentic tasks, functionally equivalent trajectories may differ dramatically in wall‑clock latency due to sequential versus parallel tool execution and sub‑agent invocation overhead. We incorporate relative completion time as an explicit reward:

$$ r^{\text{speed}}_t = h\!\left( \frac{T_{\text{completion}}}{T_{\text{baseline}}} \right), $$

where $h(\cdot)$ is a monotonically decreasing shaping function, $T_{\text{completion}}$ is the wall‑clock time taken by the rollout, and $T_{\text{baseline}}$ is a reference completion time. This incentivizes the policy to discover and exploit parallelism opportunities, producing solutions that are both correct and efficient.

**Reward‑to‑Go with Baseline.** To reduce gradient variance in long‑horizon tasks, we adopt a reward‑to‑go formulation:

$$ G_t = \sum_{p=t}^{T} \gamma^{p-t} r_p, $$

combined with the trajectory‑level baseline. This formulation concentrates gradient signal on actions whose consequences are not yet accounted for, improving credit assignment precision and stabilizing the optimization.
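The reward-to-go can be computed in a single backward pass over the per-step rewards. In this minimal sketch the baseline is passed in as a plain number, since the text does not specify how it is computed; setting `gamma=1.0` recovers the undiscounted sum used in the advantage estimate.

```python
def rewards_to_go(rewards, gamma=1.0):
    """Discounted sum of rewards from each step t to the end of the trajectory."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def advantages(rewards, baseline, gamma=1.0):
    """Reward-to-go minus a trajectory-level baseline (baseline value is an assumption)."""
    return [g - baseline for g in rewards_to_go(rewards, gamma)]


step_rewards = [1.0, 0.0, 2.0]   # made-up per-step composite rewards
print(rewards_to_go(step_rewards))
print(rewards_to_go(step_rewards, gamma=0.5))
print(advantages(step_rewards, baseline=2.0))
```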

The composite reward at each step is:

$$ r_t = \alpha \, r^{\text{process}}_t + \beta \, r^{\text{speed}}_t + r^{\text{perf}}_t, $$

where $\alpha$ and $\beta$ are coefficients balancing dense behavioral feedback and efficiency incentives against the primary task performance signal.
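As an illustration of how the three terms combine, the sketch below uses $h(x) = e^{-x}$ as one possible monotonically decreasing shaping function. The text does not specify the form of $h$ or the coefficient values, so that choice and the values of `alpha` and `beta` are assumptions.

```python
import math


def speed_reward(t_completion: float, t_baseline: float) -> float:
    """Speed term h(T_completion / T_baseline) with h(x) = exp(-x) (assumed shape)."""
    return math.exp(-t_completion / t_baseline)


def composite_reward(r_process: float, r_speed: float, r_perf: float,
                     alpha: float = 0.1, beta: float = 0.2) -> float:
    """r_t = alpha * r_process + beta * r_speed + r_perf (coefficients are placeholders)."""
    return alpha * r_process + beta * r_speed + r_perf


fast = speed_reward(t_completion=30.0, t_baseline=60.0)   # e.g. parallel tool calls
slow = speed_reward(t_completion=90.0, t_baseline=60.0)   # e.g. sequential tool calls

# Same correctness and process reward, different wall-clock time.
r_fast = composite_reward(r_process=1.0, r_speed=fast, r_perf=1.0)
r_slow = composite_reward(r_process=1.0, r_speed=slow, r_perf=1.0)
print(r_fast, r_slow)
```

Two equally correct rollouts receive different totals only through the speed term, which is the incentive for parallelism described above.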

6.1.6. Mixed‑Domain RL Training

A critical challenge in training general‑purpose agents is avoiding the trade‑off between task‑specific optimization and broad capability preservation. Single‑domain RL training—fine‑tuning exclusively on agentic tasks—risks catastrophic forgetting of the model’s foundational reasoning and general knowledge capabilities. Conversely, sequential multi‑stage training across domains induces negative transfer, as gains in one domain erode performance in previously trained domains.

We adopt a mixed‑domain RL training strategy that addresses both issues. Training proceeds through multiple stages, and within each stage, training data is drawn simultaneously from four domains: reasoning, coding, agent, and general. This joint optimization ensures that the policy gradient updates are informed by a diverse task distribution at every training step, preventing the optimizer from overfitting to any single domain’s reward landscape.

Across stages, we systematically adjust three axes:

- **Domain mixing ratios.** The relative proportions of data from each domain are tuned per stage. Early stages emphasize foundational capabilities (reasoning and general domains) to consolidate the model’s base competence, while later stages progressively increase the proportion of agent and coding tasks to sharpen task‑specific performance.

- **Context length.** We expand the maximum context length at a per‑domain granularity across stages. This curriculum‑style progression enables the model to first master short‑horizon decision‑making before extending to the long‑context trajectories characteristic of complex agent tasks.

- **Difficulty distribution.** Within each domain, the difficulty distribution of training tasks shifts progressively toward harder instances. Early stages include a broad mix to establish robust foundations, while later stages concentrate on challenging scenarios that push the policy’s frontier.

This mixed‑domain strategy yields compounding benefits: it simultaneously improves the model’s foundational reasoning ability, task‑specific quality across all target domains, and end‑to‑end user experience—since agents deployed in practice encounter a heterogeneous mix of requests that spans all four domains.
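A staged mixing schedule of this kind can be sketched as a table of per-stage sampling weights. The ratios below are invented purely for illustration and are not taken from the paper; they only reproduce the stated trend of shifting weight from reasoning and general data toward agent and coding data.

```python
import random

DOMAINS = ("reasoning", "coding", "agent", "general")

# Hypothetical per-stage mixing ratios (not from the paper).
STAGES = [
    {"reasoning": 0.35, "coding": 0.15, "agent": 0.15, "general": 0.35},
    {"reasoning": 0.25, "coding": 0.25, "agent": 0.25, "general": 0.25},
    {"reasoning": 0.15, "coding": 0.30, "agent": 0.40, "general": 0.15},
]


def sample_batch(stage_ratios, batch_size, rng):
    """Draw the domain of each sample in a batch, so every step mixes all four domains."""
    weights = [stage_ratios[d] for d in DOMAINS]
    return rng.choices(DOMAINS, weights=weights, k=batch_size)


rng = random.Random(0)  # seeded for reproducibility
batches = [sample_batch(ratios, batch_size=8, rng=rng) for ratios in STAGES]
for stage_index, batch in enumerate(batches, start=1):
    print(f"stage {stage_index}: {batch}")
```

Per-domain context limits and difficulty distributions could be added to each stage entry in the same way.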

Forge RL Infrastructure

The MiniMax‑M2 series keeps per‑token compute low while scaling capacity, but training agents still faces the RL impossible triangle.

The MiniMax‑M2 series uses a sparse Mixture‑of‑Experts to keep per‑token compute low while scaling capacity, enabling strong agentic reasoning. To train such agents at scale we must juggle throughput, stability, and flexibility—the RL impossible triangle.

Three goals—maximizing raw token throughput, bounding policy‑gradient variance, and supporting arbitrary agent architectures—pull in opposite directions, so improving any one typically harms another.

How does this “impossible triangle” differ from the classic compute‑memory‑accuracy trade‑off?

The classic trade‑off balances hardware resources against model quality, whereas the RL triangle balances three *dynamic* properties of the training pipeline: raw token processing speed, statistical stability of policy updates, and the ability to plug in any agent design without code changes.

Forge splits the system into three independent modules—Agent Side, Middleware Abstraction Layer, and Training/Inference Side—so each can scale without pulling the others down.

Why does separating the agent and training sides improve throughput?
