mHC: Manifold-Constrained Hyper-Connections

Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Kuai Yu, Liang Zhao, Shangyan Zhou, Zhean Xu, Zhengyan Zhang, Wangding Zeng, Shengding Hu, Yuqing Wang, Jingyang Yuan, Lean Wang, Wenfeng Liang

Manifold-Constrained Hyper-Connections (mHC) restores training stability to wide-residual architectures.

How can we stabilize the training of Hyper-Connections (HC) by constraining their residual mappings to a specific manifold?

Expanding residual streams improves model performance but breaks the identity mapping property, causing signal divergence and training instability at scale. mHC projects these residual connections onto the Birkhoff polytope using the Sinkhorn-Knopp algorithm, forcing the mapping to be doubly stochastic and preserving signal intensity across layers. This approach eliminates the signal explosion seen in prior methods while maintaining performance gains, adding only 6.7% training overhead.

Paper Primer

The core move is to treat the residual connection matrix as a convex combination of permutations. By constraining the matrix to be doubly stochastic (non-negative entries, with every row and column summing to 1), the model ensures that signal propagation is non-expansive and stable across arbitrary depths.

mHC significantly improves training stability compared to unconstrained Hyper-Connections (HC).

The maximum gain magnitude of the composite mapping, which measures signal amplification, drops from ~3000 in HC to ~1.6 in mHC, a reduction of roughly three orders of magnitude.

The method relies on the Sinkhorn-Knopp algorithm to iteratively normalize the residual matrix. To keep this efficient, the authors developed custom kernels that fuse these operations and overlap communication with computation, preventing the "memory wall" typically associated with wider residual streams.

Why does the identity mapping property matter so much for deep networks?

It acts as a conservation mechanism, ensuring that signal intensity remains invariant during forward and backward propagation. Without it, signals tend to explode or vanish, which prevents the model from scaling to deeper or larger architectures.

Is this method limited to specific model architectures?

The framework is general and was validated on Mixture-of-Experts (MoE) architectures ranging from 3B to 27B parameters. It is designed as a drop-in extension for any architecture relying on residual connections.

Researchers can now leverage the representational benefits of wider residual streams without sacrificing the training stability required for large-scale foundation models.

Introduction to Hyper-Connections

We expose why widening residual streams with Hyper‑Connections destabilises training and how mHC restores stability.

Deep networks rely on the residual connection $x_{l+1}=x_{l}+F(x_{l},W_{l})$, which preserves an identity path and thus stabilises training at scale.

HC widens the residual stream by inserting learnable linear maps before and after the core layer, thereby increasing the representational capacity of each block.

As a toy calculation, assume a dense $256\times256$ mapping $H_{\text{res}}$ stored in 4-byte floats: $256^{2}=65{,}536$ entries $\times$ 4 B $\approx$ 262 KB.

Across 24 layers, total $H_{\text{res}}$ memory = $24\times262\text{ KB}\approx6.3\text{ MB}$.

By contrast, the unexpanded case ($n=1$) would need only a $64\times64$ matrix (16 KB), which is 16× smaller.

Even modest expansion ($n=4$) inflates the residual-mapping memory by a factor of $n^{2}=16$ in this toy setting, a cost that also scales linearly with depth.

Because $H_{\text{res}}$ is unconstrained, the identity‑mapping property vanishes: the recursive expansion $x_{L}= \sum_{i} H_{\text{res}}^{L-i}x_{i}+ \dots$ no longer preserves the global feature mean, causing exploding or vanishing activations during large‑scale training. Moreover, the widened stream incurs substantial memory‑access overhead, limiting practical scalability.
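To make the recursion explicit, the sketch below unrolls the layer update from layer $l$ to layer $L$. It assumes the HC update has the form $x_{i+1} = H^{\text{res}}_i x_i + H^{\text{post}\,\top}_i F(H^{\text{pre}}_i x_i, W_i)$, which this section does not state explicitly; the per-layer subscripts are notation introduced here for illustration.

$$
\begin{aligned}
\text{standard residual:}\quad x_L &= x_l + \sum_{i=l}^{L-1} F(x_i, W_i),\\
\text{HC:}\quad x_L &= \Big(\prod_{i=1}^{L-l} H^{\text{res}}_{L-i}\Big)\, x_l \;+\; \sum_{i=l}^{L-1} \Big(\prod_{j=1}^{L-1-i} H^{\text{res}}_{L-j}\Big)\, H^{\text{post}\,\top}_i F\big(H^{\text{pre}}_i x_i, W_i\big).
\end{aligned}
$$

In the standard case the coefficient of $x_l$ is the identity at every depth. Under the assumed HC update it is a product of $L-l$ unconstrained matrices, so its norm can grow or shrink geometrically with depth.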

**Figure 1.** Illustrations of Residual Connection Paradigms. This figure compares the structural design of (a) standard Residual Connection, (b) Hyper-Connections (HC), and (c) our proposed Manifold-Constrained Hyper-Connections (mHC). Unlike the unconstrained HC, mHC focuses on optimizing the residual connection space by projecting the matrices onto a constrained manifold to ensure stability.

The key shift is moving from a standard residual connection to a wider HC stream, which raises capacity but demands manifold constraints to retain stability.

Related Architectural Advancements

We survey micro‑ and macro‑design trends and situate mHC among recent macro approaches.

Early deep‑learning architectures focused on micro‑design: convolution introduced parameter sharing and translation invariance, later refined by depthwise‑separable and grouped convolutions for efficiency. The Transformer shifted the paradigm to attention and Feed‑Forward Networks, spawning efficient variants such as Multi‑Query Attention, Grouped‑Query Attention, and Multi‑Head Latent Attention, while Mixture‑of‑Experts enabled massive scaling of FFNs without proportional compute.

Macro‑design governs how layers interconnect; ResNet introduced residual streams, followed by DenseNet, Fractal‑Net, and Deep Layer Aggregation, each increasing topological complexity. Recent work expands the width of the residual stream via Hyper‑Connections, Residual Matrix Transformers, and MUDDFormer, but these broadened connections break the identity‑mapping property and inflate memory traffic.

Manifold‑Constrained Hyper‑Connections (mHC) address the instability and memory overhead of unconstrained width expansion by projecting the residual connection onto a stable manifold, thereby restoring identity mapping while preserving the topological benefits of wider streams.

The Instability of Hyper-Connections

HC widens residual streams but uncontrolled mappings cause instability and memory overhead.

Hyper‑Connections (HC) expand the residual stream by a factor $n$, turning a $1\times C$ input into an $n\times C$ hidden matrix. This extra width decouples information capacity from the layer’s input dimension, but the three learnable mappings ($H_{\text{pre}}$, $H_{\text{post}}$, $H_{\text{res}}$) introduce uncontrolled transformations. The resulting instability and memory pressure become the primary obstacles for scaling HC.
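A single HC block can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the update $x_{l+1} = H_{\text{res}}\,x_l + H_{\text{post}}^{\top} F(H_{\text{pre}}\,x_l)$ with $H_{\text{pre}}, H_{\text{post}}$ of shape $1\times n$ and $H_{\text{res}}$ of shape $n\times n$, and the names `hc_layer` and `matmul` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def hc_layer(x, h_pre, h_post, h_res, f):
    """One Hyper-Connections block on an n x C hidden matrix x (assumed form).

    h_pre  : 1 x n, mixes the n streams into a single layer input
    h_post : 1 x n, writes the layer output back onto the n streams
    h_res  : n x n, mixes the streams along the residual path
    f      : the core layer, mapping a length-C list to a length-C list
    """
    layer_in = matmul(h_pre, x)[0]               # length C
    layer_out = [f(layer_in)]                    # 1 x C
    h_post_t = [[w] for w in h_post[0]]          # n x 1
    write_back = matmul(h_post_t, layer_out)     # n x C
    carried = matmul(h_res, x)                   # n x C
    return [[c + w for c, w in zip(c_row, w_row)]
            for c_row, w_row in zip(carried, write_back)]


# Sanity check with fixed mappings: uniform 1/n for h_pre, ones for h_post,
# identity for h_res. With n identical streams, every stream becomes x + f(x).
x = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]           # n = 2, C = 3
double = lambda v: [2 * t for t in v]            # stand-in for the core layer
out = hc_layer(x, [[0.5, 0.5]], [[1.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], double)
print(out)
```

With these fixed mappings the block reduces to an ordinary residual connection on each stream, which is why the learned mappings are the only source of extra capacity and of extra risk.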

As a toy calculation, take $n=4$, $C=512$, a sequence length of $128$ (written $L$ in this example only), and 4-byte floats. Residual width: $n\cdot C = 4\cdot512 = 2048$.

Attention matrix size: $L^2 = 128^2 = 16{,}384$ entries per head.

Total memory: $16{,}384 \times 2048 \times 4\text{ B} \approx 134\text{ MB}$.

This toy calculation shows how quickly memory grows even under modest expansion, illustrating why HC’s widened stream is a scalability bottleneck.

The unconstrained product $\mathbf{M}_{l\to L}$ can amplify or attenuate signals exponentially, breaking the residual‑learning premise of smooth gradient flow. To quantify this effect the authors define the $A_{\max}$ Gain Magnitude, measuring worst‑case row‑sum (forward) and column‑sum (backward) expansions of $\mathbf{M}_{l\to L}$.
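The gain measurement can be sketched in a few lines. This is an illustrative reading, not the authors' code: it takes the gain to be the largest absolute row sum (forward) and column sum (backward) of the composite product, and the example matrices and the helper names `composite` and `gain_magnitude` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def composite(layers):
    """Return H_{L-1} ... H_l for per-layer matrices ordered shallow to deep."""
    prod = layers[0]
    for h in layers[1:]:
        prod = matmul(h, prod)
    return prod


def gain_magnitude(m):
    """Largest |row sum| (forward) and |column sum| (backward) of m."""
    forward = max(abs(sum(row)) for row in m)
    backward = max(abs(sum(col)) for col in zip(*m))
    return forward, backward


depth = 10
unconstrained = [[1.0, 0.5], [0.5, 1.0]]          # rows and columns sum to 1.5
doubly_stochastic = [[0.75, 0.25], [0.25, 0.75]]  # rows and columns sum to 1

hc_fwd, hc_bwd = gain_magnitude(composite([unconstrained] * depth))
mhc_fwd, mhc_bwd = gain_magnitude(composite([doubly_stochastic] * depth))
print(f"unconstrained gain after {depth} layers: {hc_fwd:.2f}")
print(f"doubly stochastic gain after {depth} layers: {mhc_fwd:.2f}")
```

In this toy setting the unconstrained gain compounds as $1.5^{10}\approx 57.7$, while the doubly stochastic product keeps a gain of 1 at any depth.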

**Figure 2.** Training Instability of Hyper-Connections (HC). This figure illustrates (a) the absolute loss gap of HC relative to mHC, and (b) the comparisons of gradient norms. All results are based on 27B models.

**Figure 3.** Propagation Instability of Hyper-Connections (HC). This figure illustrates the propagation dynamics of (a) the single-layer mapping $\mathcal{H}_l^{\text{res}}$ and (b) the composite mapping $\prod_{i=1}^{L-l} \mathcal{H}_{L-i}^{\text{res}}$ within the 27B model. The layer index $l$ (x-axis) unrolls each standard Transformer block into two independent layers (Attention and FFN). The Amax Gain Magnitude (y-axis) is calculated as the maximum absolute row sum (for the forward signal) and column sum (for the backward gradient), averaged over all tokens in a selected sequence.

**Table 1.** Ablation Study of HC Components. When a specific mapping ($H_l^{pre}$, $H_l^{post}$, or $H_l^{res}$) is disabled, we employ a fixed mapping to maintain dimensional consistency: uniform weights of $1/n$ for $H_l^{pre}$, uniform weights of ones for $H_l^{post}$, and the identity matrix for $H_l^{res}$.

The mHC Mechanism

Manifold-Constrained HC projects residual mappings onto a doubly‑stochastic manifold to keep signals stable while allowing interaction.

Hyper‑Connections widen the residual stream, but without any restriction the learned mapping $H_{\text{res}}$ can amplify or cancel signals, leading to exploding gradients and unstable training.

Instead of letting the residual mapping be arbitrary, we force it onto the set of doubly‑stochastic matrices, which guarantees each row and column sums to one and all entries stay non‑negative.

How does Manifold Projection differ from ordinary weight‑normalization of $H_{\text{res}}$?

Weight‑normalization rescales each row (or column) independently but does not enforce the opposite dimension’s sum‑to‑one constraint. Manifold Projection simultaneously forces both rows and columns to sum to 1 and requires non‑negativity, which yields a true stochastic operator with guaranteed non‑expansiveness and closure under multiplication.

Row‑normalize: divide each row by its sum → $\begin{bmatrix}0.25 & 0.75\\ 0.78 & 0.22\end{bmatrix}$.

Column‑normalize: divide each column by its sum → $\begin{bmatrix}0.24 & 0.77\\ 0.76 & 0.23\end{bmatrix}$.

Repeating the two steps converges to roughly $\begin{bmatrix}0.235 & 0.765\\ 0.765 & 0.235\end{bmatrix}$, in which every row and column sums to 1. The limit is symmetric because a $2\times2$ doubly-stochastic matrix is fixed by a single entry.

The projection turns an arbitrary mixing matrix into a convex combination of the two permutation matrices $\begin{bmatrix}1&0\\0&1\end{bmatrix}$ and $\begin{bmatrix}0&1\\1&0\end{bmatrix}$, illustrating the geometric view of the Birkhoff polytope.
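The alternating normalization above can be sketched as follows. The starting matrix is hypothetical (the example does not state one); it was picked so that its first row normalization gives approximately the values shown. The function name `sinkhorn_knopp` and the iteration count are likewise choices made here.

```python
def sinkhorn_knopp(m, iters=50):
    """Alternate row and column normalization of a strictly positive matrix."""
    m = [row[:] for row in m]
    for _ in range(iters):
        m = [[v / sum(row) for v in row] for row in m]             # each row sums to 1
        col_sums = [sum(col) for col in zip(*m)]
        m = [[v / s for v, s in zip(row, col_sums)] for row in m]  # each column sums to 1
    return m


# Hypothetical starting matrix: its first row normalization is
# approximately [[0.25, 0.75], [0.78, 0.22]].
start = [[1.0, 3.0], [3.5, 1.0]]
p = sinkhorn_knopp(start)

row_sums = [sum(row) for row in p]
col_sums = [sum(col) for col in zip(*p)]

# For a 2 x 2 doubly stochastic matrix the Birkhoff decomposition is explicit:
# p = a * identity + (1 - a) * swap, with a = p[0][0].
a = p[0][0]
identity, swap = [[1, 0], [0, 1]], [[0, 1], [1, 0]]
rebuilt = [[a * identity[i][j] + (1 - a) * swap[i][j] for j in range(2)]
           for i in range(2)]
print([[round(v, 3) for v in row] for row in p])
```

The weight `a` is the share of the identity permutation and `1 - a` the share of the swap, which is the convex-combination view of the Birkhoff polytope in its smallest non-trivial case.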

The three theoretical benefits follow directly from the doubly‑stochastic constraint: (1) the spectral norm never exceeds 1, preventing gradient explosion; (2) multiplying any number of such matrices stays within the same set, so depth does not degrade stability; (3) because the set is the convex hull of permutations, the mapping continuously blends simple re‑orderings, providing a rich yet well‑behaved interaction among residual streams.
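The first two properties can be checked in a few lines. The sketch below writes $\mathbf{1}$ for the all-ones vector and uses the standard induced norms $\|A\|_1$ (maximum absolute column sum) and $\|A\|_\infty$ (maximum absolute row sum); this notation is introduced here and is not taken from the paper.

$$
\begin{aligned}
&A, B \ge 0,\qquad A\mathbf{1} = B\mathbf{1} = \mathbf{1},\qquad \mathbf{1}^{\top}A = \mathbf{1}^{\top}B = \mathbf{1}^{\top}\\
&\Longrightarrow\quad (AB)\mathbf{1} = A(B\mathbf{1}) = \mathbf{1},\qquad \mathbf{1}^{\top}(AB) = (\mathbf{1}^{\top}A)B = \mathbf{1}^{\top},\qquad AB \ge 0,\\
&\|A\|_2 \;\le\; \sqrt{\|A\|_1\,\|A\|_\infty} \;=\; \sqrt{1\cdot 1} \;=\; 1 .
\end{aligned}
$$

The product of two doubly stochastic matrices is therefore doubly stochastic, and by induction the same norm bound applies to a composite mapping of any depth.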

**Figure 7.** Propagation Stability of Manifold-Constrained Hyper-Connections (mHC). This figure illustrates the propagation dynamics of (a) the single-layer mapping $\mathcal{P}_{\mathcal{M}^{res}}(\mathcal{H}_l^{res})$ and (b) the composite mapping $\prod_{i=1}^{L-l} \mathcal{P}_{\mathcal{M}^{res}}(\mathcal{H}_{L-i}^{res})$ within the 27B model. The results demonstrate that mHC significantly enhances propagation stability compared to HC.

**Figure 8.** Visualizations of Learnable Mappings. This figure displays representative single-layer and composite mappings for HC (first row) and mHC (second row). Each matrix is computed by averaging over all tokens within a selected sequence. The labels annotated along the y-axis and x-axis indicate the forward signal gain (row sum) and the backward gradient gain (column sum), respectively.

**Table 2.** Comparison of Memory Access Costs Per Token. This analysis accounts for the overhead introduced by the residual stream maintenance in the forward pass, excluding the internal I/O of the layer function $\mathcal{F}$.

Infrastructure and Implementation

Optimized kernels and a refined DualPipe schedule keep mHC fast and memory‑light.

Manifold‑Constrained Hyper‑Connections (mHC) boost capacity but add latency and memory pressure; the infrastructure below removes those bottlenecks.

First, the hidden matrix $x_\ell\in\mathbb{R}^{n\times C}$ is flattened, RMS‑normalized, and passed through three linear projections to obtain $\tilde H^{\text{pre}}_\ell$, $\tilde H^{\text{post}}_\ell$, and $\tilde H^{\text{res}}_\ell$, which are then constrained by sigmoid or Sinkhorn‑Knopp to yield $H^{\text{pre}}_\ell$, $H^{\text{post}}_\ell$, $H^{\text{res}}_\ell$.
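The mapping computation described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' kernel: it assumes a plain sigmoid for the pre and post mappings, an exponential to make the residual logits positive before Sinkhorn-Knopp, no learnable gains or biases, and randomly initialized weights. All function names and the sizes `n` and `C` are hypothetical.

```python
import math
import random


def rms_norm(v, eps=1e-6):
    """Scale v to unit root-mean-square (no learnable gain in this sketch)."""
    rms = math.sqrt(sum(t * t for t in v) / len(v) + eps)
    return [t / rms for t in v]


def linear(v, w):
    """Apply weight matrix w (one row per output value) to vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in w]


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def sinkhorn_knopp(m, iters):
    """Alternate row and column normalization of a strictly positive matrix."""
    for _ in range(iters):
        m = [[v / sum(row) for v in row] for row in m]
        col_sums = [sum(col) for col in zip(*m)]
        m = [[v / s for v, s in zip(row, col_sums)] for row in m]
    return m


def mhc_mappings(x, w_pre, w_post, w_res, iters=20):
    """Return (h_pre, h_post, h_res) for an n x C hidden matrix x."""
    n = len(x)
    flat = rms_norm([t for row in x for t in row])        # length n * C
    h_pre = [sigmoid(t) for t in linear(flat, w_pre)]     # assumed: sigmoid
    h_post = [sigmoid(t) for t in linear(flat, w_post)]   # assumed: sigmoid
    raw = linear(flat, w_res)                             # n * n logits
    # Assumed: exponentiate so every entry is positive before Sinkhorn-Knopp.
    positive = [[math.exp(raw[i * n + j]) for j in range(n)] for i in range(n)]
    return h_pre, h_post, sinkhorn_knopp(positive, iters)


def init(rows, cols):
    """Small random weights, for illustration only."""
    return [[random.gauss(0, 0.02) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
n, C = 4, 8
x = [[random.gauss(0, 1) for _ in range(C)] for _ in range(n)]
h_pre, h_post, h_res = mhc_mappings(x, init(n, n * C), init(n, n * C), init(n * n, n * C))
print([round(sum(row), 6) for row in h_res])
```

Because the mappings are computed from the current hidden matrix, they change per token, which is why the projection has to be cheap enough to run inside every block.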

Instead of waiting for a whole pipeline stage to finish before sending its gradients, DualPipe runs a high‑priority compute stream that finishes the expensive MLP/Attention kernels while the communication stream is still busy.

Stage 0 computes $F_{\text{pre}}$ for layers 0‑1 on the normal stream.

As soon as $x_{l0}$ for layer 2 is cached, the high‑priority stream launches $F_{\text{post,res}}$ for layer 1.

While the communication stream transfers layer 1’s gradients, the high‑priority stream finishes $F_{\text{post,res}}$ for layer 2.

Synchronization occurs only after both streams finish, allowing the next stage to start without waiting for the full residual merge.

The overlap reduces the effective per‑stage latency from $2\cdot L_r$ to roughly $L_r+1$, matching the theoretical speedup of DualPipe.

Fused kernel pipeline for RMSNorm‑mat‑projection.

Recomputation discards the intermediate activations of the fused kernels after the forward pass and re‑executes those kernels during back‑propagation, cutting the stored activation volume from $(3n+1)C$ to $(n+1)C$ per block.
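As a quick worked instance of this saving, take an expansion rate of $n=4$ (a value chosen here for illustration):

$$
(3n+1)C = 13C \;\longrightarrow\; (n+1)C = 5C, \qquad \frac{(3n+1)C-(n+1)C}{(3n+1)C} = \frac{2n}{3n+1} = \frac{8}{13} \approx 62\%.
$$

The saved fraction $2n/(3n+1)$ approaches $2/3$ as $n$ grows, so the relative benefit of recomputation does not shrink for wider streams.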

**Table 3.** Activation storage requirements for mHC kernels.

**Figure 4.** Communication-Computation Overlapping for mHC. We extend the DualPipe schedule to handle the overhead introduced by mHC. Lengths of each block are illustrative only and do not represent actual duration. (F), (B), and (W) refer to the forward pass, backward pass, and weight gradient computation, respectively. $\mathcal{F}^A$ and $\mathcal{F}^M$ represent the kernels corresponding to Attention and MLP, respectively.

How does this DualPipe schedule differ from the original DualPipe schedule used for expert parallelism?

The original schedule only overlapped expert‑layer communication with the main compute stream. Here we add a third, high‑priority stream that runs the residual‑merge kernels as soon as their inputs are ready, so communication never blocks the critical path.

Experimental Results

mHC consistently outperforms HC and the baseline in stability and downstream accuracy.

Recall that Hyper‑Connections (HC) widen the residual stream but suffer from unstable mappings; Manifold‑Constrained Hyper‑Connections (mHC) project those mappings onto a stable manifold to restore training stability.

The DeepSeek‑V3 baseline is a mixture‑of‑experts (MoE) language model that serves as the reference architecture for all ablations.

How does the DeepSeek‑V3 baseline differ from a model that already uses Hyper‑Connections?

The baseline contains only the standard residual stream; HC inserts additional learned linear maps that widen that stream, while mHC further constrains those maps. Thus any performance gain over the baseline isolates the contribution of the connection mechanisms.

Manifold‑Constrained Hyper‑Connections improve BBH accuracy by 2.1 percentage points over Hyper‑Connections.

Table 4 shows 51.0 % for mHC versus 48.9 % for HC on the 3‑shot BBH benchmark.

**Figure 5.** Training Stability of Manifold-Constrained Hyper-Connections (mHC). This figure illustrates (a) the absolute training loss gap of mHC and HC relative to the baseline, and (b) the gradient norm of the three methods. All experiments utilize the 27B model. The results demonstrate that mHC exhibits improved stability in terms of both loss and gradient norm.

Table 4 reports zero‑shot and few‑shot scores on eight benchmarks; mHC attains the top score on seven of them, with the only exception being MATH where HC marginally leads.

**Figure 6.** Scaling properties of mHC compared to the Baseline. (a) Compute Scaling Curve. Solid lines depict the performance gap across different compute budgets. Each point represents a specific compute-optimal configuration of model size and dataset size, scaling from 3B and 9B to 27B parameters. (b) Token Scaling Curve. Trajectory of the 3B model during training. Each point represents the model's performance at different training tokens. Detailed architectures and training configurations are provided in Appendix A.1.

mHC consistently outperforms HC and the baseline in both training stability and downstream benchmark accuracy.

Model Specifications

Provides full model specifications and hyper‑parameters for the evaluated DeepSeek‑V3 variants.

This appendix enumerates every architectural and training hyper‑parameter used in the DeepSeek‑V3 experiments, organized by model scale.

**Table 5.** Detailed Model Specifications and Hyper-parameters. This table presents the architectural configurations for the 3B, 9B, and 27B models based on the DeepSeek-V3 (Liu et al., 2024b) architecture. It outlines the specific hyper-parameters for mHC and HC, including the residual stream expansion and Sinkhorn-Knopp settings, alongside the optimization and training protocols used in the experiments.
