Redesign Mixture-of-Experts Routers with Manifold Power Iteration
Songhao Wu, Ang Lv, Ruobing Xie, Yankai Lin
Manifold Power Iteration aligns Mixture-of-Experts routers with expert features for faster, more stable training.
How can we improve the router matrix in Mixture-of-Experts models to ensure better alignment with expert weights and faster convergence?
Mixture-of-Experts (MoE) models rely on routers to assign tokens to experts, but conventional router designs lack a mechanism to ensure these assignments reflect the actual features of the experts themselves. The authors introduce Manifold Power Iteration (MPI), a "Power-then-Retract" paradigm that forces router rows to align with the principal singular direction of their associated expert weight matrices. This principled alignment accelerates convergence and improves downstream performance across model scales up to 11B parameters, with negligible training overhead.
Paper Primer
The router matrix in an MoE model acts as a proxy for expert identity, yet it is typically optimized without explicit constraints to capture the expert's intrinsic geometry. This leads to sub-optimal token-expert assignment, where the router's internal representation fails to faithfully reflect the expert's most dominant features.
MPI is a router redesign: it uses a single step of power iteration to align router rows with the principal singular direction of the expert matrix, followed by a retraction step to maintain norm stability. The router acts like a compass needle: it continuously adjusts its orientation toward the expert's most informative direction, while the retraction keeps the needle from spinning out of control.
MPI consistently accelerates convergence and improves downstream performance across diverse model scales.
Pretraining experiments on 3B and 11B models show consistent perplexity reduction on validation, Math, and Code datasets compared to vanilla MoE. MPI maintains a persistent loss advantage throughout pretraining and improves accuracy on nine core benchmarks, including MMLU and GSM8K.
The design is computationally efficient and compatible with standard training pipelines.
MPI incurs only a 0.2% slowdown in training throughput and requires zero communication overhead. The method is optimizer-agnostic and requires no changes to inference engines, as router weights can be pre-computed.
Why is a single power iteration step sufficient for this alignment?
The authors observe that aggressive alignment via multiple iterations disrupts optimization stability. A single step provides a robust, efficient approximation that avoids the performance degradation seen with full convergence.
Does this method require a specific optimizer to function?
No, MPI is optimizer-agnostic. Experiments confirm it provides intrinsic improvements across different optimizers, including AdamW and MuonH, by imposing a structural constraint on the router's representation.
By replacing unconstrained router updates with a mathematically grounded alignment to expert singular directions, researchers can achieve more stable and efficient MoE training without sacrificing throughput.
Introduction and Motivation
We expose the MoE router bottleneck and propose Manifold Power Iteration to align router rows with expert singular directions.
In Mixture‑of‑Experts models the router matrix sits at the core, projecting each token onto a set of expert rows to decide which experts are activated. Ideally each row should encode the essential features of its expert, but no principled constraint forces this condensation, leaving routers under‑specified and often harming convergence and overall model competence.
The router is the bottleneck of MoE efficiency; aligning each router row with the principal singular direction of its expert captures the most informative aspect of the expert’s weight matrix.
We therefore introduce Manifold Power Iteration (MPI), a “Power‑then‑Retract” update: a single power‑iteration step refines router rows toward the expert’s dominant singular vector, then a retraction enforces a fixed L2 norm to keep updates stable and efficient.
How does this differ from simply normalizing router rows after each update?
Normalization only rescales existing rows; it does not steer them toward the expert’s dominant singular direction. MPI explicitly moves the rows in the direction of maximal variance (via power iteration) before re‑scaling, ensuring the rows capture the most informative component of the expert weight matrix.
Mixture-of-Experts Foundations
Background on MoE routers and the Router Matrix.
Mixture-of-Experts (MoE) language models route each input token to a sparse subset of expert modules. The router is implemented as a linear projection that produces a gating weight vector over the N experts.
The router stores a two‑dimensional weight matrix $R\in\mathbb{R}^{N\times D}$; each row $R[i]$ is a feature vector for expert $i$, and multiplying an input $x\in\mathbb{R}^{D}$ by $R^{\top}$ yields raw scores that become gating weights after selection and normalization.
In the standard design no explicit constraint forces each router row to preserve the geometry of its corresponding expert’s weights. Consequently, the affinity $x\cdot R[i]$ can be a poor proxy for the true compatibility between input $x$ and expert $i$.
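To make the routing computation concrete, here is a minimal plain-Python sketch of a vanilla router. The function name `route`, the toy numbers, and the choice to apply a softmax over only the selected scores are illustrative assumptions, not details taken from the paper.

```python
import math

def route(x, R, k):
    """Return (expert_index, gate) pairs for the top-k experts of token x."""
    # Raw affinity of the token with each router row: scores[i] = x . R[i]
    scores = [sum(xj * rj for xj, rj in zip(x, row)) for row in R]
    # Select the k highest-scoring experts.
    top = sorted(range(len(R)), key=lambda i: scores[i], reverse=True)[:k]
    # Normalize the selected scores with a softmax (assumed normalization).
    m = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Toy router with N = 3 experts and D = 2 features (made-up numbers).
R = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
gates = route([2.0, 1.0], R, k=2)
print(gates)  # experts 0 and 1 are selected; their gates sum to 1
```

Nothing in this computation ties `R[i]` to the weights of expert `i`, which is the gap the alignment constraint is meant to close.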
Manifold Power Iteration
Dynamic routers align themselves with expert weights via a lightweight power-iteration step.
Static linear projections in MoE routers often fail to capture the dominant direction of each expert’s weight matrix, leading to noisy routing decisions. Aligning router rows with the principal singular direction of their experts promises more stable and accurate token‑expert assignments.
Iteratively pull each router row toward the dominant singular direction of its expert weights, then renormalize to keep the row norm bounded.
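Worked example: take $R[1] = (1,0)$ and an expert with $W_1^{g}\,W_1^{g\top} = \begin{pmatrix}4 & 0\\0 & 1\end{pmatrix}$, whose principal direction is the first axis. Power step: $\hat{R}[1] = (1,0)\,W_1^{g}\,W_1^{g\top} = (1,0)\,\begin{pmatrix}4 & 0\\0 & 1\end{pmatrix} = (4,0)$.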
Norm of $\hat{R}[1]$ is $\|\hat{R}[1]\|_2 = 4$.
Retraction with $C=0.5$: $R'[1] = 0.5 \times \frac{(4,0)}{4} = (0.5,\,0)$.
After one iteration the router row points exactly along the principal direction (first axis) with reduced magnitude.
The power step amplifies alignment with the top singular direction, while the retraction prevents norm blow‑up, guaranteeing stable updates even as training progresses.
Fetch the expert weight matrix $W_i^{g}$ for router row $R[i]$.
Compute $\hat{R}[i] = R[i]\,W_i^{g}\,W_i^{g\top}$ (power iteration).
Normalize: $R'[i] = C \cdot \frac{\hat{R}[i]}{\|\hat{R}[i]\|_2}$ (L2 retraction).
Scale constant $C = C' / \sqrt{N}$ to keep logits $O(1)$.
Form logits $z = x\,R'^{\top}$, apply softmax, and take top‑$k$ gates.
Route tokens to the selected experts using the gated weights.
PyTorch‑style implementation of the Power‑then‑Retract router.
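The following is a minimal plain-Python sketch of the power and retraction steps above, not the authors' PyTorch code. The helper names (`matvec_row`, `transpose`, `power_then_retract`, `mpi_router_rows`) and the matrix `W1` are hypothetical, and the sketch assumes the power-iterated row is nonzero.

```python
import math

def matvec_row(r, M):
    """Row vector times matrix: (r M)[j] = sum_i r[i] * M[i][j]."""
    return [sum(r[i] * M[i][j] for i in range(len(r))) for j in range(len(M[0]))]

def transpose(M):
    return [list(col) for col in zip(*M)]

def power_then_retract(r, W, C):
    """One power step r W W^T, then rescaling to L2 norm C."""
    r_hat = matvec_row(matvec_row(r, W), transpose(W))  # power step
    norm = math.sqrt(sum(v * v for v in r_hat))         # assumed nonzero
    return [C * v / norm for v in r_hat]                # L2 retraction

def mpi_router_rows(R, experts, C_prime):
    """Apply Power-then-Retract to every row, with C = C' / sqrt(N)."""
    C = C_prime / math.sqrt(len(R))
    return [power_then_retract(r, W, C) for r, W in zip(R, experts)]

# One choice of expert matrix with W W^T = diag(4, 1), as in the worked example.
W1 = [[2.0, 0.0], [0.0, 1.0]]
print(power_then_retract([1.0, 0.0], W1, C=0.5))  # [0.5, 0.0]

# A row that starts off-axis is tilted toward the first axis and has norm C.
tilted = power_then_retract([1.0, 1.0], W1, C=0.5)
```

Because the resulting rows depend only on weights and not on the token, they can be computed once after training and stored for inference, which matches the summary's note that router weights can be pre-computed.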
Design Principles and Optimization
We derive the manifold-constrained update rule for router weights, showing how it aligns with power iteration.
To optimize router weights while maintaining their position on a spherical manifold, we must constrain updates to the tangent space. We formulate this as a constrained maximization problem, where the router row $R'[i]$ is updated by $\Delta r$ to maximize the projection onto the expert weights $W_i^*$.
To solve this under the constraint that the router remains on the sphere, we use a first-order Taylor approximation. By projecting the gradient $G$ onto the tangent space, we derive the manifold-constrained update $\Delta r_g$.
Our router update $\Delta r_M$ approximates the exact power iteration, providing an adaptive step-size that slows down as the router aligns with the expert's principal singular vector.
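The adaptive step size can be illustrated with a small sketch. It assumes the objective is $\tfrac{1}{2}\|r\,W\|_2^2$, whose gradient is $G = r\,W W^{\top}$, and uses the standard tangent-space projection on a sphere, $G - \frac{G\cdot r}{r\cdot r}\,r$. The paper's exact objective and update may differ, and the names `tangent_step` and `A` are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def tangent_step(r, A):
    """Project G = r A onto the tangent space of the sphere at r."""
    G = [sum(r[i] * A[i][j] for i in range(len(r))) for j in range(len(A[0]))]
    coef = dot(G, r) / dot(r, r)   # radial component of G along r
    return [g - coef * ri for g, ri in zip(G, r)]

A = [[4.0, 0.0], [0.0, 1.0]]       # stands in for W W^T; top direction is axis 0

norms = []
for angle_deg in (45, 20, 5, 0):   # angle between r and the top direction
    t = math.radians(angle_deg)
    r = [math.cos(t), math.sin(t)]
    step = tangent_step(r, A)
    norms.append(math.sqrt(dot(step, step)))

print([round(n, 4) for n in norms])  # shrinks toward 0 as r aligns
```

When `r` is already an eigenvector of `A`, the gradient is parallel to `r`, so its tangent component vanishes and the update stops on its own.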
Empirical Performance
MPI improves convergence and downstream performance across optimizers and scales.
MPI reduces pretraining loss by 0.013 compared to the vanilla MoE router.
Observed on the 1B MoE model trained with MuonH (Figure 2).
Table 1 shows that adding MPI raises average accuracy on the 25-benchmark suite for every optimizer, reaching a peak of 43.98% with MuonH + MPI.
Table 2 reports perplexity (bits per byte) across validation, Math, and Code domains; MPI consistently lowers PPL, e.g., validation PPL drops from 0.764 to 0.754 for the 3B model.
**Figure 2.** Convergence comparisons for MoE with MPI, exemplified by MuonH-1B. Our router design achieves a 0.013 reduction in pretraining loss. Similar observations for other optimizers are provided in the Appendix.
**Figure 3.** Convergence and Downstream Performance Comparison. Manifold Power Iteration facilitates faster convergence and superior downstream task performance throughout the entire course of 11B MoE pretraining.
**Figure 4.** Load balancing loss for 3B MoE with MPI
MPI consistently improves convergence across different optimizers.
Mechanism Analysis and Ablations
We dissect how each router component influences alignment, stability, and performance.
We first examine whether our router‑expert alignment improves along the principal singular direction, then isolate each design choice through ablations.
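Principal singular direction: the dominant singular vector of an expert's weight matrix $W_i^{g}$ (equivalently, the dominant eigenvector of $W_i^{g}\,W_i^{g\top}$), capturing the direction of greatest variance in its parameters.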
Why does a single power‑iteration capture the principal singular direction well enough?
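One multiplication by $W_i^{g}\,W_i^{g\top}$ scales each singular component of the router row by its squared singular value, so a single step already tilts the row toward the dominant direction, by an amount that grows with the gap between the top singular values. Further iterations are not just extra compute: the 10-iteration ablation loses 1.39 points of downstream accuracy, consistent with the authors' observation that aggressive alignment disrupts optimization stability.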
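A small numerical sketch shows how far one power step moves a row for two different spectral gaps. The matrices `wide` and `narrow` are made-up stand-ins for $W W^{\top}$, and the helper names are illustrative. The sketch covers only the geometry of power iteration, not its interaction with training.

```python
import math

def power_steps(r, A, steps):
    """Apply `steps` normalized power-iteration steps r <- r A / ||r A||."""
    for _ in range(steps):
        r = [sum(r[i] * A[i][j] for i in range(len(r))) for j in range(len(A[0]))]
        n = math.sqrt(sum(v * v for v in r))
        r = [v / n for v in r]
    return r

def cos_to_first_axis(r):
    """Cosine between r and the first axis (the top direction of A below)."""
    return abs(r[0]) / math.sqrt(sum(v * v for v in r))

wide = [[4.0, 0.0], [0.0, 1.0]]     # large gap between eigenvalues
narrow = [[1.21, 0.0], [0.0, 1.0]]  # small gap between eigenvalues
start = [1.0, 1.0]                  # 45 degrees from the top direction

one_step_wide = cos_to_first_axis(power_steps(start, wide, 1))
one_step_narrow = cos_to_first_axis(power_steps(start, narrow, 1))
print(round(one_step_wide, 4), round(one_step_narrow, 4))
```

With the wide gap a single step brings the cosine close to 1, while with the narrow gap the row moves only slightly.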
Router Retraction: a rescaling applied after the power step that sets each router row to a fixed L2 norm $C$, preventing norm explosion after power iteration.
How is Router Retraction different from the simple row‑wise normalization used in the Power‑Iteration ablation?
Row-wise normalization in that ablation rescales the original row $R[i]$, so the result carries no information from the expert weights. Retraction rescales the power-iterated row $\hat{R}[i]$, preserving the alignment benefit while still enforcing a norm bound.
**Table 5.** Comparison of $\lambda$ distributions. Router with Manifold Power Iteration exhibits an enhanced alignment between $R'[i]$ and the principal singular direction of expert weights, manifested by significantly larger $\lambda$ values.
Removing Power Iteration slows convergence by roughly 5% and brings no downstream accuracy gain.
Figure 5's "MPI w.o Power-Iter" line trails the full MPI curve, staying ≈5% slower across the token range.
Omitting Router Retraction causes loss spikes and raises pre-training loss by 0.003.
Figure 5’s “MPI w.o Retraction” line exhibits abrupt jumps and settles at a higher loss floor.
Increasing the power‑iteration count to 10 yields a 1.39‑point downstream accuracy drop.
Figure 5 shows the “MPI 10‑Iter” curve lagging behind the single‑iteration baseline, ending ≈1.39 pts lower.
**Figure 5.** Ablation studies for the key design choices: (1) Power Iteration and (2) Router Retraction. We observe pretraining collapses without Router Retraction when using AdamW and Muon, showcasing that Router Retraction is critical for maintaining training stability, especially for optimizers that lack weight constraints.
**Table 6.** Validation perplexity across choices of $C'$.
**Figure 6.** Pre-training loss comparison for a 1B MoE model across optimizers (AdamW, AdamH, Muon). MoE with MPI achieves a convergence advantage over all alternative setups.
Related Work and Conclusion
We discuss prior optimizers and close with the MPI router contribution.
We first situate our work among recent optimizer advances, then recap the MPI router contribution.
Muon orthogonalizes the momentum direction using a few Newton‑Schulz iterations, keeping updates aligned with the steepest‑descent direction under the spectral norm.
AdamW decouples weight decay from the Adam adaptive-learning-rate step, shrinking the weights directly instead of adding an L2 penalty to the gradient, where the moment estimates would rescale it.
Works that impose explicit norm constraints on both model weights and gradient updates to stabilize training at scale.
Demonstrates that Muon can efficiently pre‑train models up to trillions of parameters, confirming its scalability.
In conclusion, we revisited MoE router design through a row‑wise expert‑proxy lens and introduced Manifold Power Iteration (MPI), an efficient iterative scheme that aligns router rows with the principal singular direction of expert weights, delivering stable, scalable routing.
Supplementary Derivations
Derivation and approximation of the router row update used in Equation 10.
This appendix provides the detailed algebraic steps that lead to the approximate router update used in Equation 10.
Table 7 (referenced in the main text) lists the hyperparameter settings such as model dimension, number of layers, and expert counts that instantiate the symbols used above.
Additional tables give the sequence‑length versus vocabulary‑size configurations and a detailed breakdown of dense, sparse, and total parameter counts for each model variant.
Experimental Details
Implementation and evaluation details for the pretraining experiments.
This appendix records the concrete settings used for pretraining, the optimizer configurations, and the downstream evaluation protocol.
**Table 7.** Hyperparameters of model architectures.
**Table 8.** Pretraining hyperparameters (1B AdamW)
**Table 9.** Task-specific performance comparisons for 1B MoE with different optimizers.
Downstream performance is measured with the OLMES benchmark suite, which aggregates 25 multiple‑choice tasks; the nine core tasks listed above are reported in detail, while the remaining tasks are summarized in Table 9.