On Subquadratic Architectures: From Applications to Principles

Anamaria-Roberta Hartl, Levente Zólyomi, David Stap, Pieter-Jan Hoedt, Niklas Schmidinger, Lukas Hauzenberger, Sebastian Böck, Günter Klambauer, Sepp Hochreiter

xLSTM outperforms Mamba-2 and Gated DeltaNet on complex tasks by combining robust state tracking with counting-like accumulation.

Which subquadratic sequence-modeling architectures best handle complex dependencies, and why do they differ in performance?

Subquadratic sequence models are essential for scaling foundation models, but their performance varies significantly on tasks requiring long-range, structured dependencies like code generation and time-series forecasting. The authors unify xLSTM, Mamba-2, and Gated DeltaNet into a single framework to isolate how their memory dynamics—specifically gating and state updates—enable or hinder accumulation and state tracking. xLSTM consistently leads across these complex domains, with synthetic experiments confirming it is the only architecture that effectively combines counting-like accumulation with finite-state tracking.

Paper Primer

The paper evaluates three leading subquadratic backbones—xLSTM, Mamba-2, and Gated DeltaNet—on data where architectural inductive biases diverge sharply. While these models perform similarly on standard language benchmarks, they show distinct failure modes on tasks requiring formal syntax, variable binding, or continuous-valued dynamics.

The authors propose a unified formulation that maps each architecture to its memory-update primitives: accumulation and state tracking. xLSTM is the superior design: it separates linear-attention matrix states from non-linear recurrent updates, allowing it to perform flexible memory correction while maintaining stable state tracking.

xLSTM backbones consistently outperform Mamba-2 and Gated DeltaNet on complex structured tasks.

In code-focused pre-training (400M parameters), xLSTM [7:1] leads the next-best backbone by up to 1.81 points on HumanEval pass@64. The lead is consistent across code generation, Transformer distillation, and time-series foundation modeling.

xLSTM is the only architecture that successfully generalizes both counting and state-tracking tasks.

Synthetic experiments show xLSTM [1:1] solves state-tracking (e.g., Parity, S3) while retaining counting extrapolation, whereas Mamba-2 and Gated DeltaNet fail on one or both primitives. xLSTM [1:1] achieves 100% accuracy on state-tracking tasks at 16x training length.

Why do these architectures perform similarly on standard benchmarks but diverge on code and time series?

Standard benchmarks often rely on commonsense reasoning where performance differences are small; code and time series impose structured, long-range dependencies that force models to rely on specific memory primitives like accumulation and state tracking.

What is the core architectural difference between xLSTM and the other subquadratic models?

Mamba-2 ties its input and forget gates, limiting expressivity, while Gated DeltaNet uses explicit overwriting that aids retrieval but hinders counting. xLSTM uses independent gates and a hybrid structure to combine linear-attention accumulation with non-linear recurrent state tracking.

For practitioners building foundation models on structured data, xLSTM-family backbones provide a more robust inductive bias than Mamba-2 or Gated DeltaNet by balancing accumulation and state tracking.

Introduction and Motivation

Subquadratic models promise cheaper sequence processing, but their effectiveness hinges on state handling.

Transformers dominate modern sequence modeling but their quadratic attention incurs substantial computational cost. Subquadratic architectures promise a cheaper alternative, yet it remains unclear which designs handle state accumulation and long‑range dependencies most effectively. This uncertainty motivates a systematic comparison of leading subquadratic backbones.

We need sequence models that scale to long inputs without exploding compute, while still preserving the ability to track and update a rich internal state.

Subquadratic mechanisms avoid the full n × n attention matrix by maintaining a compact state that is updated incrementally as tokens are processed.

**Figure 1.** **Tasks with complex dependencies.** Code (a) carries dependencies in formal structure: syntax trees, call graphs, variable bindings. Time series (b) carries them in partially observed dynamics: trajectories of complex systems (here, a Lorenz attractor) whose future depends on unobserved states over history. Both are representative of complex dependencies where modeling requires tracking many interacting states across long contexts.

Table 18 (see the appendix) enumerates the hyperparameters used for the synthetic length‑generalization experiments that probe accumulation and state‑tracking capabilities.

The core trade‑off is between the quadratic cost of full attention and the subquadratic scalability of recurrent‑style state operators.

Empirical Performance on Complex Dependencies

Empirical results show xLSTM backbones consistently outperform other subquadratic operators on code and time‑series tasks.

xLSTM [7:1] achieves the highest HumanEval pass@64 score, improving over the next‑best subquadratic backbone by +1.43 points at 20 B code tokens and by +0.90 points at 100 B tokens.

Pass@64 comparison on the Nemotron‑CC‑Code‑v1 pre‑training runs (20 B and 100 B tokens) shows xLSTM [7:1] ahead of Gated DeltaNet, the runner‑up.
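For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: draw n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k draws passes. The sketch below assumes that estimator; this summary does not state which variant the authors used, and the sample counts are made up for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem pass counts out of n = 100 samples each.
passes = [0, 3, 100, 12]
per_problem = [pass_at_k(100, c, 64) for c in passes]

# Benchmark score in percent: the mean over problems.
benchmark_pass_at_64 = 100.0 * sum(per_problem) / len(per_problem)
print(f"pass@64 = {benchmark_pass_at_64:.2f}")
```

Because the score is an average of per-problem probabilities, a gap of about one point at pass@64 corresponds to a small number of additional problems becoming solvable within 64 attempts.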

**Figure 2.** **HumanEval pass@k after code-focused pre-training.** Results for 400M-parameter hybrid language models trained under the matched pre-training recipe on Nemotron-CC-Code-v1 with two token budgets: 20B tokens and 100B tokens. At 100B tokens, the gap between the subquadratic backbones shrinks.

Performance gaps in code‑focused language models consistently favor xLSTM‑family backbones.

Architectural Analysis

We dissect the core linear‑attention tricks behind xLSTM, Mamba‑2, and Gated DeltaNet.

Subquadratic attention eliminates the quadratic cost of vanilla Transformers, but the recurrent formulation that enables O(T) scaling processes tokens one at a time and so underuses GPU parallelism. The community therefore introduced chunk-wise formulations to regain parallelism, yet the choice of gating and state-update mechanisms still determines how well these models accumulate information over long horizons.
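The equivalence between the token-by-token recurrence and a chunk-wise evaluation can be checked on a toy case. The sketch below uses plain ungated linear attention ($S_t = S_{t-1} + k_t \otimes v_t$, $o_t = q_t S_t$) as a simplifying assumption; the gated variants discussed in this section need extra decay factors that are omitted here. The function names and toy inputs are illustrative only.

```python
def read(q, S):
    """o = q S for a query vector q and a d_k x d_v state S."""
    return [sum(q[a] * S[a][b] for a in range(len(q))) for b in range(len(S[0]))]

def write(S, k, v):
    """In-place rank-one update S += k v^T."""
    for a in range(len(k)):
        for b in range(len(v)):
            S[a][b] += k[a] * v[b]

def recurrent(qs, ks, vs):
    """Strictly sequential: one state update per token."""
    S = [[0.0] * len(vs[0]) for _ in ks[0]]
    outs = []
    for q, k, v in zip(qs, ks, vs):
        write(S, k, v)
        outs.append(read(q, S))
    return outs

def chunkwise(qs, ks, vs, chunk=2):
    """Same outputs, but the state is only carried across chunk boundaries."""
    S = [[0.0] * len(vs[0]) for _ in ks[0]]
    outs = []
    for start in range(0, len(qs), chunk):
        q_c = qs[start:start + chunk]
        k_c = ks[start:start + chunk]
        v_c = vs[start:start + chunk]
        for t, q in enumerate(q_c):
            # Inter-chunk part: read the summary of all earlier chunks.
            o = read(q, S)
            # Intra-chunk part: causal attention over positions s <= t.
            # On a GPU this is one masked matrix product per chunk.
            for s in range(t + 1):
                score = sum(qa * ka for qa, ka in zip(q, k_c[s]))
                o = [ob + score * vb for ob, vb in zip(o, v_c[s])]
            outs.append(o)
        # Fold the finished chunk into the carried state.
        for k, v in zip(k_c, v_c):
            write(S, k, v)
    return outs

qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.5]]
ks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0], [0.0, 2.0]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 1.0]]

ref = recurrent(qs, ks, vs)
fast = chunkwise(qs, ks, vs, chunk=2)
max_gap = max(abs(a - b) for r, f in zip(ref, fast) for a, b in zip(r, f))
print("largest difference:", max_gap)
```

The sequential dependency shrinks from one step per token to one step per chunk, which is where the parallelism comes from.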

xLSTM stitches a linear‑attention accumulator with a traditional recurrent cell, letting the model both aggregate a fast‑moving state and apply nonlinear gating. Think of it as a conveyor belt that periodically passes items through a filter that can boost or suppress them.

t=1: compute $i_1=\exp(w_i x_1)=\exp(0.5)=1.65$, $f_1=\sigma(w_f x_1)=0.62$, $v_1=\tanh(W_v x_1)= (0.76,0.0)$. Update $C_1= i_1\,k_1\otimes v_1 = 1.65\cdot (1,0)^\top (0.76,0) = \begin{bmatrix}1.25 & 0\\0 & 0\end{bmatrix}$.

t=2: $i_2=\exp(w_i x_2)=\exp(0.3)=1.35$, $f_2=0.55$, $v_2=\tanh(W_v x_2)=(0,0.76)$. $C_2 = f_2 C_1 + i_2 k_2\otimes v_2 = 0.55\cdot C_1 + 1.35\cdot (0,1)^\top (0,0.76) = \begin{bmatrix}0.69 & 0\\0 & 1.03\end{bmatrix}$.

t=3: $i_3=\exp(w_i x_3)=\exp(0.8)=2.23$, $f_3=0.70$, $v_3=\tanh(W_v x_3)=(0.76,0.76)$. $C_3 = f_3 C_2 + i_3 k_3\otimes v_3 = 0.70\cdot C_2 + 2.23\cdot (1,1)^\top (0.76,0.76) \approx \begin{bmatrix}2.18 & 1.69\\1.69 & 2.42\end{bmatrix}$. Every entry receives $2.23\cdot 0.76 \approx 1.69$ from the new write, and the diagonal additionally keeps the decayed entries of $C_2$.

Final hidden output $h_3 = q_3 C_3 = e_1 C_3$ extracts the first row, yielding $h_3 \approx (2.18, 1.69)$.

The exponential input gate lets the model sharply overwrite earlier contributions, while the forget gate smoothly decays the accumulated matrix.
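The three steps above can be replayed in a few lines. This is a minimal sketch of the gated matrix-memory update $C_t = f_t C_{t-1} + i_t\,k_t \otimes v_t$ only; it is not the full xLSTM cell (normalizer and stabilizer terms are left out). The keys are read off the outer products in the worked example, the forget-gate values are copied from it, and treating the value entries 0.76 as $\tanh(1)$ is an assumption.

```python
import math

def outer(k, v):
    """Outer product k v^T as a list of rows."""
    return [[ki * vj for vj in v] for ki in k]

def gated_update(C, f, i, k, v):
    """C_t = f_t * C_{t-1} + i_t * (k_t outer v_t)."""
    kv = outer(k, v)
    return [[f * C[a][b] + i * kv[a][b] for b in range(len(v))]
            for a in range(len(k))]

t1 = math.tanh(1.0)  # assumed source of the 0.76 entries in the text

steps = [
    # (forget gate, input gate, key, value)
    (0.62, math.exp(0.5), [1.0, 0.0], [t1, 0.0]),
    (0.55, math.exp(0.3), [0.0, 1.0], [0.0, t1]),
    (0.70, math.exp(0.8), [1.0, 1.0], [t1, t1]),
]

C = [[0.0, 0.0], [0.0, 0.0]]
for f, i, k, v in steps:
    C = gated_update(C, f, i, k, v)

q = [1.0, 0.0]  # e_1 selects the first row of the memory
h = [sum(q[a] * C[a][b] for a in range(2)) for b in range(2)]
print([round(x, 2) for x in h])
```

Because the code keeps the gates unrounded, individual entries can differ in the last printed digit from arithmetic done by hand with two-decimal values.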

How does xLSTM’s linear‑attention head differ from the vanilla linear attention described earlier?

Vanilla linear attention simply accumulates $k_t\otimes v_t$ without any gating, so every past value contributes equally. xLSTM introduces an exponential input gate $i_t$ and a sigmoid forget gate $f_t$, which can amplify important timesteps and attenuate or erase stale ones, giving the model a soft‑max‑like ability to prioritize recent information.

Mamba‑2 reuses the linear‑attention state but forces the input and forget gates to share the same learned signal, making the update resemble a GRU’s single‑gate dynamics.

t=1: $i_1=\text{softplus}(0.5\cdot 1)=\text{softplus}(0.5)=0.97$, $f_1=1-\sigma(0.5)\cdot0.3=1-0.62\cdot0.3=0.81$. Update $C_1=0.81\cdot0 + 0.97\cdot v_1 = 0.97\,v_1$.

t=2: $i_2=\text{softplus}(0.5\cdot 2)=\text{softplus}(1)=1.31$, $f_2=1-\sigma(1)\cdot0.3=1-0.73\cdot0.3=0.78$. $C_2 = 0.78\,C_1 + 1.31\,v_2$.

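Because $i_t$ and $f_t$ are driven by the same signal, the model cannot raise $f_2$ independently to protect $v_1$: its weight falls from $0.97$ to $f_2\cdot 0.97 = 0.78\cdot 0.97 \approx 0.76$ after one step and keeps shrinking with every further update, illustrating limited ability to preserve early information.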

The tied gates cause early contributions to decay quickly, which explains Mamba‑2’s difficulty with counting long sequences.

Why does tying the input and forget gates make Mamba‑2 less expressive than xLSTM?

When the same linear projection controls both gates, a large input gate automatically forces a small forget gate (and vice‑versa). This coupling prevents the model from simultaneously writing a strong new value while keeping the old state, limiting the range of possible state trajectories.
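The coupling can be made concrete with the toy gates from the worked example, $i_t = \text{softplus}(0.5\,x_t)$ and $f_t = 1 - 0.3\,\sigma(0.5\,x_t)$. The sketch below only mirrors that toy parameterization; it is not Mamba-2's actual discretization, and the helper names are hypothetical.

```python
import math

def softplus(z):
    return math.log1p(math.exp(z))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tied_gates(x, w=0.5, scale=0.3):
    """Both gates are functions of the same pre-activation z = w * x."""
    z = w * x
    return softplus(z), 1.0 - scale * sigmoid(z)

def weights_on_inputs(xs):
    """Coefficient with which each past value v_s survives in the state."""
    weights = []
    for x in xs:
        i, f = tied_gates(x)
        # Every older contribution is decayed by f; the new one enters with i.
        weights = [f * w for w in weights] + [i]
    return weights

# A stronger write (larger i) always comes with a stronger decay (smaller f).
gates = [tied_gates(x) for x in (0.0, 1.0, 2.0, 4.0)]
for i, f in gates:
    print(f"i = {i:.2f}, f = {f:.2f}")

w_after_two = weights_on_inputs([1.0, 2.0])
print("weights on v_1, v_2:", [round(w, 2) for w in w_after_two])
```

With independent gates, the pair (large $i_t$, $f_t$ near 1) would be reachable; with a shared signal it is not.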

Gated DeltaNet augments the linear‑attention state with an orthogonal projection that erases any component aligned with the current key before writing, yielding a sharper overwrite mechanism.

Decay the previous state: $f_t C_{t-1} = 0.6\begin{bmatrix}0.5 & 0\\0 & 0.5\end{bmatrix} = \begin{bmatrix}0.3 & 0\\0 & 0.3\end{bmatrix}$.

Projection matrix: $\text{proj}_t = I - \frac{k_t\otimes k_t}{\|k_t\|^2} = \begin{bmatrix}0 & 0\\0 & 1\end{bmatrix}$.

Apply projection before writing: $\text{proj}_t\,f_t C_{t-1} = \begin{bmatrix}0 & 0\\0 & 0.3\end{bmatrix}$, so the old content of the first row (aligned with $k_t$) is cleared.

Compute gated addition: $i_t\,\frac{k_t}{\|k_t\|}\otimes v_t = 0.8 \cdot (1,0)^\top (0,1) = \begin{bmatrix}0 & 0.8\\0 & 0\end{bmatrix}$.

Write: $C_t = \text{proj}_t\,f_t C_{t-1} +$ gated addition $= \begin{bmatrix}0 & 0.8\\0 & 0.3\end{bmatrix}$.

The orthogonal projection completely removes any component that lies in the direction of the current key, which prevents the state from accumulating redundant information.

What practical effect does the orthogonal projection have compared to the plain linear‑attention update?

Without the projection, each new $k_t\otimes v_t$ would add a component that aligns with $k_t$, potentially causing interference when the same key appears later. The projection zeroes out that aligned component, so later writes can replace earlier ones cleanly, improving retrieval but also discarding any useful signal that happens to share the key direction.
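A small sketch makes the overwrite behaviour tangible. It follows the erase-before-write order stated in the text (decay, project out the key direction, then add the gated write) and uses the full projection from this summary; the published Gated DeltaNet update scales the erase by a learned strength, which is omitted here as a simplifying assumption. `delta_step` is a hypothetical helper name.

```python
import math

def delta_step(C_prev, k, v, f, i):
    """Decay the state, erase the direction of k, then write i * k_hat v^T."""
    n, m = len(k), len(v)
    norm2 = sum(x * x for x in k)
    # proj = I - k k^T / ||k||^2 removes the component along k.
    proj = [[(1.0 if a == b else 0.0) - k[a] * k[b] / norm2 for b in range(n)]
            for a in range(n)]
    kept = [[sum(proj[a][c] * f * C_prev[c][b] for c in range(n)) for b in range(m)]
            for a in range(n)]
    norm = math.sqrt(norm2)
    return [[kept[a][b] + i * (k[a] / norm) * v[b] for b in range(m)]
            for a in range(n)]

# Numbers from the worked example: C_{t-1} = 0.5 I, k = (1, 0), v = (0, 1).
C_t = delta_step([[0.5, 0.0], [0.0, 0.5]], k=[1.0, 0.0], v=[0.0, 1.0], f=0.6, i=0.8)
print(C_t)

# Writing twice under the same key: the second value replaces the first.
zero = [[0.0, 0.0], [0.0, 0.0]]
C_a = delta_step(zero, [1.0, 0.0], [1.0, 0.0], f=1.0, i=1.0)
C_b = delta_step(C_a, [1.0, 0.0], [0.0, 1.0], f=1.0, i=1.0)
readout = [sum(q * C_b[a][b] for a, q in enumerate([1.0, 0.0])) for b in range(2)]
print(readout)
```

The same property that gives clean retrieval also explains the counting weakness: repeated writes under one key replace each other instead of adding up.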

State Tracking and Accumulation

We ablate accumulation and state‑tracking components to see how each affects long‑range performance.

We evaluate each architecture on synthetic tasks that isolate accumulation (counting) and finite‑state tracking (parity, modular arithmetic, $S_3$ word problems). All models share the same training length $128$ and are tested at $128$, $512$, and $2048$ to probe extrapolation.

Accumulation continuously aggregates incoming information (e.g., a running sum), while state tracking maintains a bounded discrete state that can be updated and queried.

How does “state tracking” differ from simple accumulation?

Accumulation adds values into an unbounded running total; state tracking updates a bounded representation through state transitions, which lets it encode parity, modular residues, or order-sensitive properties such as products in $S_3$.
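The two primitives can be contrasted with reference solutions written as plain loops. These are illustrative ground-truth functions for the task families named in the text, not the authors' task generators; the encodings (bits for Majority and parity, index tuples for $S_3$) are assumptions.

```python
def majority(bits):
    """Accumulation: an unbounded counter that grows with sequence length."""
    balance = 0
    for b in bits:
        balance += 1 if b else -1
    return balance > 0

def parity(bits):
    """State tracking with two states: flip on every 1."""
    state = 0
    for b in bits:
        state ^= b
    return state

def s3_product(perms):
    """State tracking with six states: compose permutations of (0, 1, 2)."""
    state = (0, 1, 2)
    for p in perms:
        # Apply the running permutation first, then p.
        state = tuple(p[state[i]] for i in range(3))
    return state

swap01 = (1, 0, 2)
cycle = (1, 2, 0)

print(majority([1, 1, 0, 1, 0]), parity([1, 1, 0, 1, 0]))
print(s3_product([swap01, cycle]), s3_product([cycle, swap01]))
```

The counter in `majority` can reach any integer up to the sequence length, whereas `parity` and `s3_product` never leave a fixed finite set of states. Only the $S_3$ product changes when two inputs are swapped.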

Load a trained model checkpoint (trained at length $128$).

Generate synthetic inputs for each task at the target length ($128$, $512$, $2048$).

Run the model forward pass and record task‑specific accuracy.

Aggregate results across random seeds and plot the accuracy curves (Figure 4).
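The evaluation steps above can be sketched as a short harness. Everything here is a stand-in: `oracle` and `first_128_only` are hypothetical callables used in place of a loaded checkpoint, the parity generator is a simple assumption about the task format, and the seeds only vary the test data (the paper may instead aggregate over training seeds).

```python
import random

def make_parity_batch(length, n, rng):
    """n random bit strings of the given length with their parity labels."""
    xs = [[rng.randint(0, 1) for _ in range(length)] for _ in range(n)]
    ys = [sum(x) % 2 for x in xs]
    return xs, ys

def accuracy(model, xs, ys):
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(ys)

def evaluate(model, lengths=(128, 512, 2048), seeds=(0, 1, 2), n=50):
    """Mean accuracy per evaluation length, averaged over seeds."""
    results = {}
    for length in lengths:
        per_seed = []
        for seed in seeds:
            rng = random.Random(seed)
            xs, ys = make_parity_batch(length, n, rng)
            per_seed.append(accuracy(model, xs, ys))
        results[length] = sum(per_seed) / len(per_seed)
    return results

def oracle(x):
    """Stand-in for a model that tracks parity at any length."""
    return sum(x) % 2

def first_128_only(x):
    """Stand-in for a model that ignores everything past the training length."""
    return sum(x[:128]) % 2

print(evaluate(oracle))
print(evaluate(first_128_only))
```

The second stand-in is perfect at length 128 and falls to roughly chance beyond it, which is the shape of curve the length-generalization plots are designed to expose.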

xLSTM[1:0] attains high counting accuracy (e.g., $0.892$ on AnBn at length $2048$) while completely failing state‑tracking tasks.

Table 13 shows $0.892$ accuracy for AnBn at $2048$, but parity accuracy remains near $0.0$.

xLSTM[1:0] achieves essentially zero accuracy on all state‑tracking benchmarks (parity, modular arithmetic, $S_3$) even at the training length.

Parity accuracy is $0.0$ at length $128$ and stays $0.0$ up to $2048$ (Table 13).

xLSTM[1:1] solves every state‑tracking task perfectly (accuracy $1.0$) and retains useful counting performance (e.g., $0.763$ on Majority at length $2048$).

Figure 4 reports $1.0$ parity accuracy across all lengths and $0.763$ majority counting at $2048$.

Gated DeltaNet [−1, 1] solves parity at the training length, but its accuracy drops to $0.472$ at length $2048$, and modular arithmetic and counting also degrade.

Parity drops from $1.0$ at $128$ to $0.472$ at $2048$; modular arithmetic remains $0.452$ (Table 13).

Mamba‑2 collapses on both counting and state‑tracking, with AnBn accuracy falling to $0.241$ at length $2048$.

Table 13 records $0.241$ for AnBn at $2048$ and parity never exceeds $0.352$.

**Figure 4.** **Length generalization on accumulation and state-tracking.** Two representative tasks (Majority counting on the left, parity on the right) on which contemporary subquadratic designs diverge. Models are trained at length 128 (dotted line) and evaluated at 128, 512, and 2048; the break on the x-axis marks the 4× jump from 512 to 2048. xLSTM[1:1] is the only configuration that length-generalizes on both tasks: it achieves the highest counting accuracy at every length and solves parity perfectly throughout. Gated DeltaNet with the negative-eigenvalue parameterization of Grazzi et al. (2025) solves parity in-distribution but drops to 0.47 at length 2048; Mamba-2 never solves either.

Time-series Foundation Model Results

Time-series foundation models are benchmarked on GIFT‑Eval and synthetic tasks.

Subquadratic architectures trade quadratic attention for cheaper mixers, but their ability to accumulate state and capture long‑range dependencies determines performance. This section reports how the TSFM variants fare on that trade‑off.

xLSTM [3:1] attains the lowest MASE and CRPS at every parameter scale up to 40M; at 80M its gap to Mamba-2 narrows.

Table 12 shows xLSTM [3:1] with MASE $=0.566$ and CRPS $=0.521$ at 40M, versus Mamba-2 with MASE $=0.735$ and CRPS $=0.714$.
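As a reminder of what the two metrics measure, the sketch below gives the textbook per-series definitions: MASE divides the forecast's mean absolute error by that of a naive in-sample forecast, and CRPS is estimated from forecast samples as $\mathbb{E}|X-y| - \tfrac{1}{2}\mathbb{E}|X-X'|$. How GIFT-Eval normalizes and aggregates these across datasets is not described in this summary, so the code should not be read as the benchmark's scoring pipeline; all inputs are made-up numbers.

```python
def mase(y_true, y_pred, y_train, season=1):
    """Mean absolute scaled error relative to a seasonal-naive forecast."""
    mae = sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)
    naive_errors = [abs(y_train[t] - y_train[t - season])
                    for t in range(season, len(y_train))]
    return mae / (sum(naive_errors) / len(naive_errors))

def crps_from_samples(samples, y):
    """Sample-based CRPS estimate for one observation y."""
    m = len(samples)
    spread_to_truth = sum(abs(s - y) for s in samples) / m
    spread_within = sum(abs(a - b) for a in samples for b in samples) / (2 * m * m)
    return spread_to_truth - spread_within

history = [1.0, 2.0, 3.0, 4.0]
print(mase(y_true=[5.0, 6.0], y_pred=[5.5, 6.5], y_train=history))
print(crps_from_samples([0.0, 2.0], y=1.0))
```

A MASE below 1 means the forecast beats the naive baseline on that series; CRPS additionally rewards well-calibrated uncertainty, which is why the two can rank models differently.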

On the synthetic counting and state-tracking tasks (Table 13), xLSTM [1:1] reaches perfect accuracy on the state-tracking subtasks up to length $2048$ and keeps useful counting scores ($0.763$ on Majority at $2048$), while xLSTM [1:0] counts well ($0.892$ on $A^nB^n$ at $2048$) but fails parity. Gated DeltaNet variants also perform strongly on some counting tasks, especially $A^nB^nC^n$ at $2048$, where they achieve $0.983$ accuracy.

**Figure 3.** GIFT-Eval performance of TSFM over five parameter scales. MASE and CRPS scores (lower is better) for matched training recipe. xLSTM architectures provide the best scores, with the gap narrowing as the parameter scale grows.

Notation and Definitions

Defines the symbols used throughout and presents full pretraining and distillation results.

The paper introduces a compact symbol set so that every equation, algorithm, and table can be read without repeatedly spelling out the same objects.

**Table 2.** Notation used in this paper.

**Table 2.** Continued list of symbols, definitions, and types for Mamba-2, Gated DeltaNet, activation functions, operators, architecture mappings, and evaluation metrics.

**Table 4.** HumanEval pass@$k$ ($k \in \{2, 8, 16, 64\}$, %) for 400M-parameter inter-layer hybrid variants pretrained on Nemotron-CC-Code-v1 for 100B tokens. Higher is better; the best result per column is shown in **bold**.


Table 4 compares the xLSTM variants, Mamba-2, and Gated DeltaNet on pass@2, pass@8, pass@16, and pass@64.

**Table 7.** Reasoning and commonsense benchmark results for 400M-parameter inter-layer hybrid variants pretrained on Nemotron-CC-Code-v1 for 100B tokens. Higher is better; the best result per column is shown in bold.

Table 7 compares the xLSTM variants, Mamba-2, and Gated DeltaNet on the reasoning and commonsense benchmarks HellaSwag, PIQA, ARC-Easy, ARC-Challenge, and WinoGrande after pre-training on Nemotron-CC-Code-v1.

**Table 9.** Full pass@k spread on HumanEval and HumanEval+ for the code distilled students. Higher is better; the best student result per column is shown in bold.

**Table 10.** Math distillation results for students distilled from Qwen3-4B-Instruct with the matched recipe of Hauzenberger et al. (2026). GSM8K and MATH-500 report exact match (Cobbe et al., 2021; Hendrycks et al., 2021; Lightman et al., 2024); AIME 2024 reports pass@8. The Avg. column averages GSM8K, MATH-500, and AIME pass@8. Higher is better; the best student result per column is shown in **bold**.

Table 10 compares the student models against the Qwen3-4B-Instruct teacher on GSM8K, MATH-500, and AIME pass@8.
