Redesign Mixture-of-Experts Routers with Manifold Power Iteration

Songhao Wu, Ang Lv, Ruobing Xie, Yankai Lin

Manifold Power Iteration aligns Mixture-of-Experts routers with expert features for faster, more stable training.

How can we improve the router matrix in Mixture-of-Experts models to ensure better alignment with expert weights and faster convergence?

Mixture-of-Experts (MoE) models rely on routers to assign tokens to experts, but conventional router designs lack a mechanism to ensure these assignments reflect the actual features of the experts themselves. The authors introduce Manifold Power Iteration (MPI), a "Power-then-Retract" paradigm that forces router rows to align with the principal singular direction of their associated expert weight matrices. This principled alignment accelerates convergence and improves downstream performance across model scales up to 11B parameters, with negligible training overhead.

Paper Primer

The router matrix in an MoE model acts as a proxy for expert identity, yet it is typically optimized without explicit constraints to capture the expert's intrinsic geometry. This leads to sub-optimal token-expert assignment, where the router's internal representation fails to faithfully reflect the expert's most dominant features.

MPI is a router redesign: it uses a single step of power iteration to align router rows with the principal singular direction of the expert matrix, followed by a retraction step to maintain norm stability. The router acts like a compass needle: it continuously adjusts its orientation toward the expert's most informative direction, while the retraction keeps the needle from spinning out of control.

MPI consistently accelerates convergence and improves downstream performance across diverse model scales.

Pretraining experiments on 3B and 11B models show consistent perplexity reduction on validation, Math, and Code datasets compared to vanilla MoE. MPI maintains a persistent loss advantage throughout pretraining and improves accuracy on nine core benchmarks, including MMLU and GSM8K.

The design is computationally efficient and compatible with standard training pipelines.

MPI incurs only a 0.2% slowdown in training throughput and requires zero communication overhead. The method is optimizer-agnostic and requires no changes to inference engines, as router weights can be pre-computed.

Why is a single power iteration step sufficient for this alignment?

The authors observe that aggressive alignment via multiple iterations disrupts optimization stability. A single step provides a robust, efficient approximation that avoids the performance degradation seen with full convergence.

Does this method require a specific optimizer to function?

No, MPI is optimizer-agnostic. Experiments confirm it provides intrinsic improvements across different optimizers, including AdamW and MuonH, by imposing a structural constraint on the router's representation.

By replacing unconstrained router updates with a mathematically grounded alignment to expert singular directions, researchers can achieve more stable and efficient MoE training without sacrificing throughput.

Introduction and Motivation

We expose the MoE router bottleneck and propose Manifold Power Iteration to align rows with expert singular directions.

In Mixture‑of‑Experts models the router matrix sits at the core, projecting each token onto a set of expert rows to decide which experts are activated. Ideally each row should encode the essential features of its expert, but no principled constraint forces this condensation, leaving routers under‑specified and often harming convergence and overall model competence.

The router is the bottleneck of MoE efficiency; aligning each router row with the principal singular direction of its expert captures the most informative aspect of the expert’s weight matrix.

We therefore introduce Manifold Power Iteration (MPI), a “Power‑then‑Retract” update: a single power‑iteration step refines router rows toward the expert’s dominant singular vector, then a retraction enforces a fixed L2 norm to keep updates stable and efficient.

How does this differ from simply normalizing router rows after each update?

Normalization only rescales existing rows; it does not steer them toward the expert’s dominant singular direction. MPI explicitly moves the rows in the direction of maximal variance (via power iteration) before re‑scaling, ensuring the rows capture the most informative component of the expert weight matrix.


Mixture-of-Experts Foundations

Background on MoE routers and the Router Matrix.

Mixture-of-Experts (MoE) language models route each input token to a sparse subset of expert modules. The router is implemented as a linear projection that produces a gating weight vector over the N experts.

The router stores a two‑dimensional weight matrix $R\in\mathbb{R}^{N\times D}$; each row $R[i]$ is a feature vector for expert $i$, and multiplying an input $x\in\mathbb{R}^{D}$ by $R^{\top}$ yields raw scores that become gating weights after selection and normalization.

In the standard design no explicit constraint forces each router row to preserve the geometry of its corresponding expert’s weights. Consequently, the affinity $x\cdot R[i]$ can be a poor proxy for the true compatibility between input $x$ and expert $i$.
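To make the baseline concrete, here is a minimal sketch of a conventional router under stated assumptions: the softmax-then-top-k ordering, the renormalization of the selected gates, and the names `route`, `R`, `x`, and `k` are illustrative choices, not details confirmed by the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(x, R, k):
    """Score token x against every router row, then keep the top-k experts.

    x: list of D floats; R: list of N rows, each a list of D floats.
    Returns (expert_index, gate) pairs. Renormalizing the kept gates so they
    sum to 1 is an assumption made for this sketch.
    """
    scores = [sum(xj * rj for xj, rj in zip(x, row)) for row in R]  # x . R[i]
    probs = softmax(scores)
    top = sorted(range(len(R)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Toy setup: N = 3 experts, D = 2 features (values are made up for illustration).
R = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
x = [2.0, 1.0]
gates = route(x, R, k=2)
print(gates)
```

Nothing in this baseline ties a row of `R` to the weights of the expert it selects, which is the gap the next section addresses.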

Manifold Power Iteration

Dynamic routers align themselves with expert weights via a lightweight power‑iteration loop.

Static linear projections in MoE routers often fail to capture the dominant direction of each expert’s weight matrix, leading to noisy routing decisions. Aligning router rows with the principal singular direction of their experts promises more stable and accurate token‑expert assignments.

Iteratively pull each router row toward the dominant singular direction of its expert weights, then renormalize to keep the row norm bounded.

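Worked example in two dimensions: take an initial router row $R[1] = (1,0)$ and an expert whose Gram matrix is $W_1^{g}\,W_1^{g\top} = \begin{pmatrix}4 & 0\\0 & 1\end{pmatrix}$. Power step: $\hat{R}[1] = (1,0)\,W_1^{g}\,W_1^{g\top} = (1,0)\,\begin{pmatrix}4 & 0\\0 & 1\end{pmatrix} = (4,0)$.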

Norm of $\hat{R}[1]$ is $\|\hat{R}[1]\|_2 = 4$.

Retraction with $C=0.5$: $R'[1] = 0.5 \times \frac{(4,0)}{4} = (0.5,\,0)$.

Because this row already lay along the principal direction (first axis), the power step leaves its direction unchanged; after retraction the row still points exactly along that axis, with its norm set to $C = 0.5$.

The power step amplifies alignment with the top singular direction, while the retraction prevents norm blow‑up, guaranteeing stable updates even as training progresses.
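The arithmetic of the worked example can be checked with a few lines of plain Python. This is a sketch, not the authors' code: the helper name `power_then_retract` is hypothetical, and the second call with a tilted starting row $(1,1)$ is an added illustration that does not appear in the original example.

```python
import math

def power_then_retract(r, gram, C):
    """One power step (r times gram) followed by rescaling to L2 norm C."""
    r_hat = [sum(r[j] * gram[j][k] for j in range(len(r)))
             for k in range(len(gram[0]))]
    norm = math.sqrt(sum(v * v for v in r_hat))
    return [C * v / norm for v in r_hat]

gram = [[4.0, 0.0], [0.0, 1.0]]   # W_1^g W_1^{g T} from the worked example

# Row from the worked example: already on the first axis.
aligned = power_then_retract([1.0, 0.0], gram, C=0.5)
print(aligned)   # [0.5, 0.0]

# Added illustration: a row at 45 degrees is pulled toward the first axis.
tilted = power_then_retract([1.0, 1.0], gram, C=0.5)
print(tilted)
```

For the tilted row the power step yields $(4,1)$ before rescaling, so the first component ends up four times the second while the norm is still $0.5$.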

Fetch the expert weight matrix $W_i^{g}$ for router row $R[i]$.

Compute $\hat{R}[i] = R[i]\,W_i^{g}\,W_i^{g\top}$ (power iteration).

Normalize: $R'[i] = C \cdot \frac{\hat{R}[i]}{\|\hat{R}[i]\|_2}$ (L2 retraction).

Scale constant $C = C' / \sqrt{N}$ to keep logits $O(1)$.

Form logits $z = x\,R'^{\top}$, apply softmax, and take top‑$k$ gates.

Route tokens to the selected experts using the gated weights.

PyTorch‑style implementation of the Power‑then‑Retract router.
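As a stand-in for a framework-specific listing, the following dependency-free Python sketch walks through the steps above on nested lists. It is an illustration under assumptions, not the authors' implementation: the function names, the `c_prime` argument (standing for $C'$), the $D \times H$ shape assumed for each $W_i^{g}$, and the toy numbers are all hypothetical.

```python
import math

def row_times_matrix(r, M):
    """Row vector r (length D) times matrix M (D x H) -> list of length H."""
    return [sum(r[d] * M[d][h] for d in range(len(r))) for h in range(len(M[0]))]

def transpose(M):
    return [list(col) for col in zip(*M)]

def mpi_router_rows(R, expert_weights, c_prime=1.0):
    """Return R', where each row gets one power step and is then retracted.

    R: N rows of length D. expert_weights[i]: D x H matrix standing in for W_i^g.
    Assumes no power-iterated row is exactly zero.
    """
    C = c_prime / math.sqrt(len(R))                    # C = C' / sqrt(N)
    R_new = []
    for r, W in zip(R, expert_weights):
        r_hat = row_times_matrix(row_times_matrix(r, W), transpose(W))  # r W W^T
        norm = math.sqrt(sum(v * v for v in r_hat))
        R_new.append([C * v / norm for v in r_hat])    # L2 retraction to norm C
    return R_new

def mpi_route(x, R_new, k):
    """Logits z = x R'^T, softmax over experts, then keep the top-k gates."""
    z = [sum(a * b for a, b in zip(x, row)) for row in R_new]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(z)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in top]

# Toy example: N = 2 experts, D = 2, H = 2 (all numbers are illustrative).
R = [[1.0, 1.0], [1.0, 1.0]]
experts = [
    [[2.0, 0.0], [0.0, 1.0]],   # principal direction: first axis
    [[1.0, 0.0], [0.0, 3.0]],   # principal direction: second axis
]
R_prime = mpi_router_rows(R, experts)
print(R_prime)
print(mpi_route([1.0, 0.0], R_prime, k=1))   # token along the first axis
```

Both raw rows start out identical, yet after the power step each row leans toward its own expert's dominant axis, so a token along the first axis is routed to the first expert. Since $R'$ depends only on the weights, it can be computed once before inference, in line with the summary's note that router weights can be pre-computed.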

Design Principles and Optimization

We derive the manifold-constrained update rule for router weights, showing how it aligns with power iteration.

To optimize router weights while maintaining their position on a spherical manifold, we must constrain updates to the tangent space. We formulate this as a constrained maximization problem, where the router row $R'[i]$ is updated by $\Delta r$ to maximize the projection onto the expert weights $W_i^*$.

To solve this under the constraint that the router remains on the sphere, we use a first-order Taylor approximation. By projecting the gradient $G$ onto the tangent space, we derive the manifold-constrained update $\Delta r_g$.

Our router update $\Delta r_M$ approximates the exact power iteration, providing an adaptive step-size that slows down as the router aligns with the expert's principal singular vector.
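The adaptive step-size behaviour can be illustrated numerically. The sketch below assumes the objective is the quadratic form $f(r) = r A r^{\top}$ with a symmetric matrix $A$ standing in for the expert's Gram matrix; this is an assumption made for illustration, and the code does not reproduce the paper's exact objective or Equation 10. It projects the gradient onto the tangent space of the sphere and prints its norm as the row approaches the principal direction.

```python
import math

def tangent_gradient(r, A):
    """Project the gradient of f(r) = r A r^T onto the sphere's tangent space at r.

    Assumes A is symmetric, so the Euclidean gradient is 2 r A.
    """
    g = [2.0 * sum(r[j] * A[j][k] for j in range(len(r))) for k in range(len(r))]
    rr = sum(v * v for v in r)
    radial = sum(gk * rk for gk, rk in zip(g, r)) / rr   # component along r
    return [gk - radial * rk for gk, rk in zip(g, r)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

A = [[4.0, 0.0], [0.0, 1.0]]   # symmetric; principal direction is the first axis

# angle = angle between the unit row r and the principal direction
for angle in (0.6, 0.3, 0.1, 0.0):
    r = [math.cos(angle), math.sin(angle)]
    print(angle, norm(tangent_gradient(r, A)))
```

The projected gradient is always orthogonal to `r`, and its norm shrinks toward zero as the angle closes, vanishing exactly when `r` is an eigenvector of `A`. That is the sense in which a tangent-space update slows down as the router aligns.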

Empirical Performance

MPI improves convergence and downstream performance across optimizers and scales.

MPI reduces pretraining loss by 0.013 compared to the vanilla MoE router.

Observed on the 1B MoE model trained with MuonH (Figure 2).

Table 1 shows that adding MPI raises average accuracy on the 25‑benchmark suite for every optimizer, reaching a peak of 43.98% with MuonH + MPI.

Table 2 reports perplexity (bits per byte) across validation, Math, and Code domains; MPI consistently lowers PPL, e.g., validation PPL drops from 0.764 to 0.754 for the 3B model.
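For readers unfamiliar with the unit, bits per byte is conventionally obtained by converting a summed negative log-likelihood from nats to bits and dividing by the number of UTF-8 bytes in the evaluated text. The sketch below shows that conversion only; the function name and the numbers are hypothetical, and the paper's evaluation pipeline is not reproduced here.

```python
import math

def bits_per_byte(total_nll_nats, text):
    """Convert a summed negative log-likelihood (in nats) into bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)

# Hypothetical numbers: a short string whose tokens sum to 10.4 nats of loss.
text = "routers pick experts"
print(len(text.encode("utf-8")))
print(bits_per_byte(10.4, text))
```

Normalizing by bytes rather than tokens keeps the metric comparable across tokenizers, and lower values mean better compression of the text.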

**Figure 2.** Convergence comparisons for MoE with MPI, exemplified by MuonH-1B. Our router design achieves a 0.013 reduction in pretraining loss. Similar observations for other optimizers are provided in the Appendix.

**Figure 3.** Convergence and Downstream Performance Comparison. Manifold Power Iteration facilitates faster convergence and superior downstream task performance throughout the entire course of 11B MoE pretraining.

**Figure 4.** Load balancing loss for 3B MoE with MPI

MPI consistently improves convergence across different optimizers.

Mechanism Analysis and Ablations

We dissect how each router component influences alignment, stability, and performance.

We first examine whether our router‑expert alignment improves along the principal singular direction, then isolate each design choice through ablations.

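Principal singular direction: the singular vector of an expert's weight matrix $W_i^{g}$ with the largest singular value, equivalently the dominant eigenvector of $W_i^{g}W_i^{g\top}$; it captures the direction of greatest variance in the expert's parameters.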

Why does a single power‑iteration capture the principal singular direction well enough?

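A single multiplication by $W_i^{g}W_i^{g\top}$ scales each component of the router row by the corresponding eigenvalue, so the component along the dominant eigenvector grows relative to all the others and the row rotates toward the principal singular direction; the larger the gap between the top eigenvalue and the rest, the larger the rotation. Additional iterations add compute without helping: the authors report that aggressive alignment disrupts optimization stability, and the 10-iteration ablation loses 1.39 points of downstream accuracy.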
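A small numerical sketch shows how much alignment one step buys. It uses a toy $2 \times 2$ Gram matrix with eigenvalues 4 and 1 and a row that starts 45 degrees off the principal direction; the matrix, the starting row, and the helper `power_steps` are illustrative assumptions, and the code says nothing about training stability.

```python
import math

def power_steps(r, A, steps):
    """Apply r <- normalize(r A) the given number of times."""
    for _ in range(steps):
        r = [sum(r[j] * A[j][k] for j in range(len(r))) for k in range(len(A[0]))]
        n = math.sqrt(sum(v * v for v in r))
        r = [v / n for v in r]
    return r

A = [[4.0, 0.0], [0.0, 1.0]]                    # eigenvalues 4 and 1
start = [1 / math.sqrt(2), 1 / math.sqrt(2)]    # 45 degrees from the axis (1, 0)

for steps in (0, 1, 2, 10):
    cosine = power_steps(start, A, steps)[0]    # cosine with the direction (1, 0)
    print(steps, cosine)
```

With this eigenvalue ratio the cosine rises from about 0.71 to about 0.97 after one step, and later steps can only close the small remaining gap. A flatter spectrum would give a smaller first-step gain.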

Router Retraction: a post‑update rescaling that returns each router row to the fixed L2 norm $C$, preventing norm explosion after the power iteration.

How is Router Retraction different from the simple row‑wise normalization used in the Power‑Iteration ablation?

Row‑wise normalization rescales the original router row $R[i]$, so its direction is unchanged and the information from the power iteration is discarded. Retraction rescales the power‑iterated row $\hat{R}[i]$ to produce $R'[i]$, preserving the alignment benefit while still enforcing a norm bound.
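The contrast can be seen on a toy row. In this sketch the matrix `A`, the row `R_i`, and the helper names are hypothetical; it compares only the resulting directions, not the training behaviour reported in the ablation.

```python
import math

def rescale(v, C):
    """Rescale v to L2 norm C without changing its direction."""
    n = math.sqrt(sum(x * x for x in v))
    return [C * x / n for x in v]

def power_step(r, A):
    """Single power step: row vector r times matrix A."""
    return [sum(r[j] * A[j][k] for j in range(len(r))) for k in range(len(A[0]))]

A = [[4.0, 0.0], [0.0, 1.0]]   # toy Gram matrix; principal direction is (1, 0)
R_i = [3.0, 3.0]
C = 1.0

normalized_only = rescale(R_i, C)              # direction of R_i is unchanged
retracted = rescale(power_step(R_i, A), C)     # direction moves toward (1, 0)

print(normalized_only)
print(retracted)
```

Both outputs have norm `C`, but only the retracted row has rotated toward the principal direction.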

**Table 5.** Comparison of $\lambda$ distributions. Router with Manifold Power Iteration exhibits an enhanced alignment between $R'_{[i]}$ and the principal singular direction of expert weights, manifested by significantly larger $\lambda$ values.

Removing Power Iteration reduces throughput by 5 % without any downstream accuracy gain.

Figure 5’s “MPI w.o Power‑Iter” line trails the full MPI curve, staying ≈5 % slower across the token range.

Omitting Router Retraction causes loss spikes and raises pre‑training loss by +0.003.

Figure 5’s “MPI w.o Retraction” line exhibits abrupt jumps and settles at a higher loss floor.

Increasing the power‑iteration count to 10 yields a 1.39‑point downstream accuracy drop.

Figure 5 shows the “MPI 10‑Iter” curve lagging behind the single‑iteration baseline, ending ≈1.39 pts lower.

**Figure 5.** Ablation studies for the key design choices: (1) Power Iteration and (2) Router Retraction. We observe pretraining collapses without Router Retraction when using AdamW and Muon, showcasing that Router Retraction is critical for maintaining training stability, especially for optimizers that lack weight constraints.

**Table 6.** Validation perplexity across choices of $C'$.

**Figure 6.** Pre-training loss comparison for a 1B MoE model across optimizers (AdamW, AdamH, Muon). MoE with MPI achieves a convergence advantage over all alternative setups.

Related Work and Conclusion

We discuss prior optimizers and close with the MPI router contribution.

We first situate our work among recent optimizer advances, then recap the MPI router contribution.

Muon orthogonalizes the momentum direction using a few Newton‑Schulz iterations, keeping updates aligned with the steepest‑descent direction under the spectral norm.

AdamW decouples weight decay from the Adam adaptive‑learning‑rate step, shrinking the weights directly rather than adding an L2 penalty to the loss gradient.

Works that impose explicit norm constraints on both model weights and gradient updates to stabilize training at scale.

Demonstrates that Muon can efficiently pre‑train models up to trillions of parameters, confirming its scalability.

In conclusion, we revisited MoE router design through a row‑wise expert‑proxy lens and introduced Manifold Power Iteration (MPI), an efficient iterative scheme that aligns router rows with the principal singular direction of expert weights, delivering stable, scalable routing.

Supplementary Derivations

Derivation and approximation of the router row update used in Equation 10.

This appendix provides the detailed algebraic steps that lead to the approximate router update used in Equation 10.

Table 7 (referenced in the main text) lists the hyperparameter settings such as model dimension, number of layers, and expert counts that instantiate the symbols used above.

Additional tables give the sequence‑length versus vocabulary‑size configurations and a detailed breakdown of dense, sparse, and total parameter counts for each model variant.

Experimental Details

Implementation and evaluation details for the pretraining experiments.

This appendix records the concrete settings used for pretraining, the optimizer configurations, and the downstream evaluation protocol.

**Table 7.** Hyperparameters of model architectures.

**Table 8.** Pretraining hyperparameters (1B AdamW)

**Table 9.** Task-specific performance comparisons for 1B MoE with different optimizers.

Downstream performance is measured with the OLMES benchmark suite, which aggregates 25 multiple‑choice tasks; the nine core tasks listed above are reported in detail, while the remaining tasks are summarized in Table 9.

Questions & answers

What is the main contribution of this paper?

The paper introduces Manifold Power Iteration (MPI), a 'Power-then-Retract' router update paradigm for Mixture-of-Experts models that forces each router row to align with the principal singular direction of its corresponding expert weight matrix, improving training stability and downstream performance.

What problem does MPI address in Mixture-of-Experts models?

Conventional MoE routers lack any principled constraint ensuring that router rows reflect the actual geometry of their associated experts, causing token-expert assignments to be poor proxies for true compatibility and leading to sub-optimal convergence and model performance.

Why does misalignment between router rows and expert weights matter?

Without alignment, the affinity score between an input token and an expert can fail to capture the expert's most dominant features, resulting in noisy routing decisions that harm both convergence speed and final model quality.

How does MPI work mechanically?

MPI applies a single power-iteration step to move each router row toward the principal singular direction of the corresponding expert weight matrix, then applies a retraction step that rescales the updated row to maintain a fixed L2 norm, preserving the alignment benefit while ensuring update stability.

Why is only a single power-iteration step used rather than multiple steps?

One multiplication by the expert's Gram matrix already amplifies the dominant eigen-direction relative to all others, moving the router row toward the principal singular direction. The paper reports that additional iterations disrupt optimization stability, and the 10-iteration ablation loses 1.39 points of downstream accuracy while costing extra compute.

How does MPI differ from simply normalizing router rows after each update?

Row-wise normalization only rescales the original router rows without steering them toward the expert's dominant singular direction, whereas MPI first moves the rows in the direction of maximal variance via power iteration and then rescales the updated rows, preserving the alignment information.

How does Router Retraction differ from row-wise normalization in the ablation?

Row-wise normalization rescales the original router row and discards the information gained from the power-iteration step, while retraction rescales the power-iterated row $\hat{R}[i]$ to produce $R'[i]$, preserving the alignment benefit while still enforcing a norm bound.

What downstream benchmark and evaluation protocol are used?

Downstream performance is measured with the OLMES benchmark suite, which aggregates 25 multiple-choice tasks; nine core tasks are reported in detail and the remaining tasks are summarized in a supplementary table.

What are the key quantitative results reported for MPI?

On the 25-benchmark OLMES suite, MPI raises average accuracy for every optimizer tested, reaching a peak of 43.98% with MuonH + MPI; on perplexity, MPI consistently lowers bits-per-byte across validation, Math, and Code domains, with validation perplexity dropping from 0.764 to 0.754 for the 3B model.

Does MPI require a specific optimizer to work?

No, MPI is optimizer-agnostic; the paper confirms it provides improvements across different optimizers, including AdamW and MuonH, by imposing a structural constraint on the router's representation rather than depending on any particular update rule.

What model scales are covered in the experiments?

The experiments cover 1B, 3B, and 11B MoE models; architecture details for each size are given in the supplementary hyperparameter tables.

What is the computational overhead of MPI?

The paper reports only a 0.2% slowdown in training throughput and zero communication overhead, since MPI adds just a single power-iteration step and a retraction per router update rather than full iterative convergence.

How is the MPI update derived mathematically?

The update is formulated as a constrained maximization problem on a spherical manifold; using a first-order Taylor approximation, the gradient is projected onto the tangent space to derive the manifold-constrained update, with detailed algebraic steps provided in the supplementary appendix.

How does MPI compare to prior MoE router designs?

Prior router designs use unconstrained linear projections with no mechanism to align rows with expert geometry; MPI introduces a mathematically grounded alignment to expert singular directions, which the paper shows consistently improves convergence and accuracy over these baselines across all tested optimizers.

What ablations are performed to validate the design choices?

The paper isolates the contribution of each design choice, including the number of power-iteration steps and the retraction versus simple row-wise normalization, confirming that a single step with retraction is the optimal configuration.

What perplexity domains are evaluated?

Perplexity (measured in bits per byte) is reported across three domains: validation, Math, and Code, with MPI consistently lowering perplexity in all three.

Who are the authors and where was this paper published?

The authors are Songhao Wu, Ang Lv, Ruobing Xie, and Yankai Lin. The publication venue is not specified in the provided text; the paper is available on arXiv at https://arxiv.org/abs/2606.12397.

Key terms

Mixture-of-Experts (MoE): A neural network architecture that routes each input token to a sparse subset of specialized sub-networks called experts, enabling large model capacity without activating all parameters for every input.
Router: A linear projection layer in an MoE model that computes gating scores to decide which experts each input token is assigned to.
Manifold Power Iteration (MPI): The paper's proposed router update method that aligns each router row with the principal singular direction of its expert's weight matrix using a single power-iteration step followed by a retraction onto a spherical manifold.
Power Iteration: An iterative numerical method that repeatedly multiplies a vector by a matrix and renormalizes it, converging toward the matrix's dominant eigenvector; applied to $W W^{\top}$, that eigenvector is the principal singular direction of $W$.
Principal Singular Direction: The direction in a weight matrix that captures the largest amount of variance, corresponding to the singular vector associated with the largest singular value.
Retraction: A step that maps an updated vector back onto a constrained manifold (here, a sphere of fixed L2 norm); in MPI it follows the power step, so the alignment gained from power iteration is preserved.
Spherical Manifold: The geometric surface defined by all vectors with a fixed L2 norm, used here as the constraint set for router rows to ensure norm stability.
Tangent Space: The set of directions in which one can move from a point on a manifold while remaining on the manifold to first order, used to project gradients for manifold-constrained optimization.
Gating Weight: A scalar score produced by the router for each expert that determines how much each expert contributes to processing a given token.
AdamW: A widely used adaptive gradient optimizer that combines Adam's moment estimates with decoupled weight decay regularization.
MuonH: An optimizer referenced in the paper as one of the alternatives to AdamW used in experiments to confirm MPI's optimizer-agnostic improvements.
OLMES: A benchmark suite aggregating 25 multiple-choice evaluation tasks used in the paper to measure downstream language model performance.
Bits per Byte (BPB): A perplexity-like metric that measures how many bits a model requires on average to encode each byte of text, with lower values indicating better language modeling.
Token-Expert Assignment: The process by which the router decides which expert or experts will process each input token in an MoE forward pass.
Dominant Eigenvalue: The largest eigenvalue of a matrix, whose associated eigenvector represents the direction of greatest variance and is the target of power iteration.
