Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu, Tri Dao

Mamba replaces attention with selective state spaces to achieve linear-time scaling and Transformer-level quality.

How can we make state space models (SSMs) as expressive as Transformers while maintaining linear-time inference and training?

Transformers scale quadratically with sequence length, making them computationally prohibitive for long-context tasks. Existing subquadratic alternatives like linear attention or structured state-space models (SSMs) often fail to match Transformer performance because they lack content-aware reasoning. Mamba introduces a selection mechanism that allows the model to filter inputs based on content, effectively choosing what to remember or forget. To maintain efficiency, the authors implement this as a hardware-aware parallel scan that avoids materializing the full state in GPU memory. Mamba achieves state-of-the-art performance across language, audio, and genomics, matching Transformers twice its size while providing 5× higher generation throughput.

Paper Primer

The core innovation is the "selective SSM," which makes the model's recurrent dynamics input-dependent. By making the SSM parameters ($\Delta$, $B$, $C$) functions of the input, the model gains the ability to reset its state or ignore irrelevant tokens, overcoming the limitations of static, time-invariant models.

Mamba achieves Transformer-quality performance with linear-time scaling.

In language modeling, Mamba-3B matches the performance of Transformers twice its size on common-sense reasoning benchmarks, and it delivers 5× higher generation throughput than Transformers of similar size.

Selective SSMs enable effective long-context modeling.

Mamba maintains performance improvements on real data up to sequence lengths of 1 million tokens, whereas prior LTI models degrade or fail to scale. It also generalizes perfectly on synthetic induction head tasks to sequences up to 4000× longer than those seen in training.

Why does the selection mechanism require a new hardware-aware algorithm?

Prior SSMs relied on global convolutions for efficiency, which require time-invariant parameters. Making parameters input-dependent breaks this convolution equivalence, forcing a return to recurrent computation; the authors use kernel fusion and parallel scan to perform this recurrence efficiently on GPUs without excessive memory I/O.

What is the fundamental trade-off Mamba addresses?

The paper frames the sequence modeling challenge as a compression problem: attention is effective but inefficient because it does not compress context, while standard recurrent models are efficient but limited by their inability to selectively compress relevant information. Mamba uses selectivity to achieve the efficiency of recurrence with the content-awareness of attention.

Researchers can now replace Transformer blocks with Mamba layers to achieve linear-time inference and training without sacrificing performance, particularly for tasks requiring long-range dependencies.

The Case for Selective SSMs

Transformers scale quadratically, prompting a linear‑time alternative that retains their performance.

Transformers compute a full attention matrix for every input, so both the work and the memory grow with the square of the sequence length $N$. This quadratic scaling makes long‑range modeling prohibitively expensive, motivating a linear‑time design that can still match Transformer quality.

Attention evaluates every token against every other token, producing an $N\times N$ similarity matrix; both compute and memory therefore increase as $N^{2}$.

For a sequence of length $N=1{,}024$, the matrix has $1{,}024^{2}=1{,}048{,}576$ entries.

At 4 bytes per entry, the memory required is $1{,}048{,}576 \times 4\text{ B} = 4\text{ MiB}$.

If we double the length to $N=2{,}048$, entries become $4{,}194{,}304$ and memory jumps to $\approx 16\text{ MiB}$ – a four‑fold increase.

The memory growth is quadratic: modest length increases can exhaust GPU memory, which is why pure attention becomes infeasible for very long sequences.
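The arithmetic above can be checked with a short script. This is a minimal sketch that counts only a single $N\times N$ score matrix (one head, one batch element) and assumes 4-byte entries; the function name `attention_matrix_bytes` is illustrative, not taken from any library.

```python
def attention_matrix_bytes(seq_len: int, bytes_per_entry: int = 4) -> int:
    """Bytes needed to store one seq_len x seq_len attention score matrix."""
    return seq_len * seq_len * bytes_per_entry


MIB = 1024 ** 2  # bytes per mebibyte

for n in (1024, 2048, 4096):
    # Doubling the sequence length quadruples the memory.
    print(n, attention_matrix_bytes(n) / MIB, "MiB")
```

Real attention implementations multiply this figure by the number of heads and the batch size, or avoid storing the matrix at all, so the numbers are a lower bound on the naive cost rather than a measurement.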

The Transformer processes an entire token sequence in parallel, using self‑attention so each token can directly gather information from every other token.

The fundamental trade‑off is that longer contexts give richer information but cause quadratic compute and memory growth in Transformers.

Foundations of State Space Models

Structured SSMs compute sequence transformations efficiently by discretizing continuous dynamics and exploiting diagonal state matrices.

Structured SSMs avoid materializing the full latent state $D n$ per token, which would cost $O(B\,L\,D\,n)$ time and memory. This bottleneck motivates the use of structured matrices and discretization to keep computation linear in sequence length.

An SSM treats a sequence as the output of a continuous‑time linear system driven by the input, with a hidden state that evolves over time.

Initialize $h_0=[0,0]$.

Step 1: $h_1 = A h_0 + B x_1 = [1,0]$.

Step 2: $h_2 = A h_1 + B x_2 = [0.5,0]$.

Step 3: $h_3 = A h_2 + B x_3 = [1.25,0]$.

Step 4: $h_4 = A h_3 + B x_4 = [0.625,0]$.

Outputs $y_t = C h_t$ give $[1,0.5,1.25,0.625]$.

This toy run shows how the linear dynamics propagate information forward while the output is a simple sum of the two state components.
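The four steps above can be reproduced with a few lines of Python. The parameter values are assumptions inferred from the listed states rather than values given in the text: a diagonal $A$ with $0.5$ in the first entry, $B=[1,0]^\top$, $C=[1,1]$, and inputs $x=[1,0,1,0]$. The second diagonal entry of $A$ never affects the result because the second component of $B$ is zero, so $0.5$ is chosen arbitrarily.

```python
def run_ssm(A_diag, B, C, xs):
    """Run h_t = A h_{t-1} + B x_t, y_t = C h_t with a diagonal A."""
    h = [0.0] * len(A_diag)
    states, outputs = [], []
    for x in xs:
        # Diagonal A means each state component updates independently.
        h = [a * h_i + b * x for a, h_i, b in zip(A_diag, h, B)]
        states.append(h)
        outputs.append(sum(c * h_i for c, h_i in zip(C, h)))
    return states, outputs


A_diag = [0.5, 0.5]        # assumed; only the first entry matters here
B = [1.0, 0.0]             # assumed input vector
C = [1.0, 1.0]             # output sums the two state components
xs = [1.0, 0.0, 1.0, 0.0]  # assumed input sequence

states, ys = run_ssm(A_diag, B, C, xs)
print(states)  # [[1.0, 0.0], [0.5, 0.0], [1.25, 0.0], [0.625, 0.0]]
print(ys)      # [1.0, 0.5, 1.25, 0.625]
```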

Is this SSM the same as a conventional RNN?

No. A conventional RNN shares one weight matrix across time steps but applies a non‑linear activation at each step, so it cannot be unrolled into a convolution. An LTI SSM uses a fixed linear transition $A$ (often diagonal), which makes the recurrence exactly equivalent to a convolution with a precomputed kernel.

S4 imposes structure on the state matrix $A$ (originally diagonal‑plus‑low‑rank, with diagonal the most common choice in later variants) and applies a closed‑form discretization, turning the continuous dynamics into a convolution kernel that can be evaluated in $O(N\log N)$ time.

Compute $\Delta A = \operatorname{diag}(0.2,0.4,0.6)$.

Exponentiate element‑wise: $\overline{A}= \operatorname{diag}(e^{0.2},e^{0.4},e^{0.6})\approx\operatorname{diag}(1.22,1.49,1.82)$.

Compute $\exp(\Delta A)-I = \operatorname{diag}(0.22,0.49,0.82)$.

Invert $\Delta A$: $(\Delta A)^{-1}= \operatorname{diag}(5,2.5,1.\overline{6})$.

Multiply: $(\Delta A)^{-1}(\exp(\Delta A)-I)\approx\operatorname{diag}(1.11,1.23,1.37)$.

Finally $\overline{B}= \operatorname{diag}(1.11,1.23,1.37)\,\Delta B = \operatorname{diag}(1.11,1.23,1.37)\times0.2 \approx [0.22,0.25,0.27]^\top$.

Because $A$ is diagonal, all operations reduce to independent scalar calculations, avoiding any $O(n^2)$ matrix work.
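The discretization steps can be sketched as a per-entry computation. The values $\Delta=0.2$, $A=\operatorname{diag}(1,2,3)$, and $B=[1,1,1]^\top$ are assumptions chosen so that $\Delta A=\operatorname{diag}(0.2,0.4,0.6)$ and each entry of $\Delta B$ is $0.2$; the text does not state them. Positive entries in $A$ are used only to match the worked numbers, whereas trained SSMs typically keep the real part of $A$ negative for stability.

```python
import math


def discretize_zoh(delta, A_diag, B):
    """Zero-order hold for a diagonal A (all entries nonzero).

    A_bar = exp(delta * A)
    B_bar = (delta * A)^-1 (exp(delta * A) - 1) * delta * B
    """
    A_bar, B_bar = [], []
    for a, b in zip(A_diag, B):
        dA = delta * a
        A_bar.append(math.exp(dA))
        B_bar.append((math.exp(dA) - 1.0) / dA * delta * b)
    return A_bar, B_bar


delta = 0.2               # assumed step size
A_diag = [1.0, 2.0, 3.0]  # assumed diagonal of A
B = [1.0, 1.0, 1.0]       # assumed input vector

A_bar, B_bar = discretize_zoh(delta, A_diag, B)
print([round(v, 2) for v in A_bar])  # [1.22, 1.49, 1.82]
print([round(v, 2) for v in B_bar])  # [0.22, 0.25, 0.27]
```

Each entry is handled by a scalar exponential and a scalar division, which is the sense in which a diagonal $A$ avoids matrix-level work.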

How does S4 differ from a generic structured SSM?

S4 restricts $A$ to a structured form (diagonal in the variant used here) and pairs it with a closed‑form discretization that yields an analytically tractable convolution kernel. Generic structured SSMs may use low‑rank, block‑circulant, or other factorizations, which do not guarantee the same $O(N\log N)$ kernel computation.

Learn per‑channel parameters $(\Delta, A, B, C)$; in S4 these are fixed parameters shared across all time steps rather than functions of the input $x$.

Apply the discretization rule to obtain $\overline{A}$ and $\overline{B}$.

During training, compute the convolution kernel $K$ from $(C,\overline{B},\overline{A})$ and evaluate $y = x * K$ in parallel.

During autoregressive inference, run the linear recurrence $h_t = \overline{A} h_{t-1} + \overline{B} x_t$, $y_t = C h_t$ step‑by‑step.

Return the output tensor $y$.

Training uses the convolution formulation $y = x * K$ for full‑sequence parallelism, while inference switches to the recurrence $h_t = \overline{A} h_{t-1} + \overline{B} x_t$, $y_t = C h_t$ to generate tokens one at a time. Both paths inherit the linear‑time‑invariant property, guaranteeing identical outputs for a given input.
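The equivalence of the two paths can be checked numerically. The sketch below assumes a single channel with a diagonal $\overline{A}$ and uses arbitrary illustrative parameter values. The kernel is $K_k = \sum_n C_n \overline{A}_n^{\,k} \overline{B}_n$, and the convolution is written as a direct causal sum for clarity, which costs $O(L^2)$; an FFT-based version would be needed for the near-linear cost described above.

```python
def ssm_recurrent(A_bar, B_bar, C, xs):
    """Inference path: step the recurrence one token at a time."""
    h = [0.0] * len(A_bar)
    ys = []
    for x in xs:
        h = [a * h_i + b * x for a, h_i, b in zip(A_bar, h, B_bar)]
        ys.append(sum(c * h_i for c, h_i in zip(C, h)))
    return ys


def ssm_kernel(A_bar, B_bar, C, length):
    """Convolution kernel K[k] = sum_n C_n * A_bar_n**k * B_bar_n."""
    return [
        sum(c * (a ** k) * b for a, b, c in zip(A_bar, B_bar, C))
        for k in range(length)
    ]


def ssm_convolutional(A_bar, B_bar, C, xs):
    """Training path: causal convolution y_t = sum_k K[k] * x[t - k]."""
    K = ssm_kernel(A_bar, B_bar, C, len(xs))
    return [sum(K[k] * xs[t - k] for k in range(t + 1)) for t in range(len(xs))]


# Illustrative values, not taken from the paper.
A_bar, B_bar, C = [0.9, 0.5], [0.1, 0.3], [1.0, -2.0]
xs = [1.0, 2.0, -1.0, 0.5, 3.0]

y_rec = ssm_recurrent(A_bar, B_bar, C, xs)
y_conv = ssm_convolutional(A_bar, B_bar, C, xs)
print(max(abs(r - c) for r, c in zip(y_rec, y_conv)))  # tiny floating-point difference
```

The kernel depends only on $(C,\overline{B},\overline{A})$, which is why this trick requires the parameters to be constant across time.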

**Figure 1.** (**Overview.**) Structured SSMs independently map each channel (e.g. $D = 5$) of an input $x$ to output $y$ through a higher dimensional latent state $h$ (e.g. $N = 4$). Prior SSMs avoid materializing this large effective state ($DN$, times batch size $B$ and sequence length $L$) through clever alternate computation paths requiring time-invariance: the ($\Delta, A, B, C$) parameters are constant across time. Our selection mechanism adds back input-dependent dynamics, which also requires a careful hardware-aware algorithm to only materialize the expanded states in more efficient levels of the GPU memory hierarchy.

**Algorithm 1** SSM (S4)
Input: $x : (B, L, D)$
Output: $y : (B, L, D)$
1: $A : (D, N) \leftarrow \text{Parameter}$ $\triangleright$ Represents structured $N \times N$ matrix
2: $B : (D, N) \leftarrow \text{Parameter}$
3: $C : (D, N) \leftarrow \text{Parameter}$
4: $\Delta : (D) \leftarrow \tau_\Delta(\text{Parameter})$
5: $\overline{A}, \overline{B} : (D, N) \leftarrow \text{discretize}(\Delta, A, B)$
6: $y \leftarrow \text{SSM}(\overline{A}, \overline{B}, C)(x)$ $\triangleright$ Time-invariant: recurrence or convolution
7: return $y$

The Mamba Selection Mechanism

Selective SSMs add input‑dependent dynamics that let the model compress context efficiently.

Linear‑time‑invariant SSMs apply the same dynamics to every token, so they cannot decide which inputs to keep, and irrelevant tokens clutter the state. A selection mechanism makes the dynamics input‑dependent, allowing the hidden state to act as a true compression of the context.

A selection mechanism lets the model decide, based on the current input, which tokens are allowed to influence the hidden state and which are ignored, thereby compressing the context into a smaller, content‑aware state.

Compute $B_t = s_B(x_t)$: a linear map that outputs $[1,0]$ for $x_1$, $[0,1]$ for $x_2$, $[1,1]$ for $x_3$, $[0,0]$ for $x_4$.

Compute $Δ_t = τ_Δ(\text{bias}+s_Δ(x_t))$: yields $Δ_1=5.0$ (large), $Δ_2=0.2$ (small), $Δ_3=4.8$, $Δ_4=0.1$.

When $Δ_t$ is large (steps 1 & 3) the hidden state $h_t$ is reset to $x_t$; when $Δ_t$ is small (steps 2 & 4) $h_t$ simply carries over $h_{t-1}$.

Resulting hidden states: $h_1=[1,0]$, $h_2=[1,0]$, $h_3=[1,1]$, $h_4=[1,1]$.

The mechanism lets the model drop irrelevant tokens (here $x_2$ and $x_4$) by keeping $Δ_t$ small, while preserving salient information by resetting on large $Δ_t$.
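A sequential version of this selective recurrence can be sketched as follows. It reuses the $\Delta_t$ and $B_t$ values from the example, but the inputs $x_t=1$, the diagonal $A=\operatorname{diag}(-1,-1)$, and $C_t=[1,1]$ are assumptions added for illustration. With these values the hard "reset" and "carry over" behaviour is only approached in the limits of large and small $\Delta_t$, so the states are close to, not equal to, the idealized ones listed above.

```python
import math


def selective_ssm(xs, deltas, Bs, Cs, A_diag):
    """Sequential selective SSM for one channel.

    xs: scalar inputs; deltas: per-token step sizes (> 0);
    Bs, Cs: per-token vectors of length N; A_diag: diagonal of A (negative).
    """
    h = [0.0] * len(A_diag)
    ys, states = [], []
    for x, delta, B_t, C_t in zip(xs, deltas, Bs, Cs):
        new_h = []
        for a, h_i, b in zip(A_diag, h, B_t):
            a_bar = math.exp(delta * a)    # near 0 for large delta, near 1 for small
            b_bar = (a_bar - 1.0) / a * b  # ZOH: (dA)^-1 (exp(dA) - 1) * delta * b
            new_h.append(a_bar * h_i + b_bar * x)
        h = new_h
        states.append(h)
        ys.append(sum(c * h_i for c, h_i in zip(C_t, h)))
    return ys, states


deltas = [5.0, 0.2, 4.8, 0.1]                          # from the example
Bs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # from the example
xs = [1.0, 1.0, 1.0, 1.0]                              # assumed inputs
Cs = [[1.0, 1.0]] * 4                                  # assumed readout
A_diag = [-1.0, -1.0]                                  # assumed dynamics

ys, states = selective_ssm(xs, deltas, Bs, Cs, A_diag)
for t, h in enumerate(states, start=1):
    print(t, [round(v, 3) for v in h])
```

Every token still passes through the loop; selection changes how much each token alters the state, not how many tokens are processed.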

How does this differ from the attention mask used in Transformers?

Attention masks silence specific token‑pair interactions but still compute a full attention matrix. The selection mechanism instead gates the recurrent update itself: an unwanted token leaves little or no trace in the hidden state, and the state stays a fixed size no matter how long the sequence is.

Mamba stacks a single block that fuses a linear projection, a selective SSM, and a SiLU activation, replacing the separate attention and MLP blocks of a Transformer.

Linear projection expands $x$ to $z\in\mathbb{R}^4$: $z=[0.5, -0.3, 0.5, -0.3]$.

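A short causal convolution followed by the SiLU activation $\sigma(z)=z\cdot\text{sigmoid}(z)$ is applied element‑wise to $z$.

The selective SSM receives the activated features, computes $Δ$ (e.g., $Δ=3.2$), and updates the hidden state $h$ accordingly.

A parallel branch, also passed through SiLU, gates the SSM output by element‑wise multiplication.

A linear output projection maps the 4‑D result back to 2‑D, the residual connection adds the original $x$, and LayerNorm is applied around the block.

The final block output is a 2‑D vector ready for the next Mamba block.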

The example shows that a single Mamba block already performs projection, dynamic selection, non‑linearity, and normalization—all in one homogeneous layer.

Is Mamba just a re‑branded gated attention unit?

No. GAU still computes attention scores between pairs of tokens, whereas Mamba’s core is a time‑varying SSM whose recurrence parameters are recomputed for every token. This gives content‑dependent updates to a fixed‑size state rather than a pairwise attention pattern.

**Figure 3.** (Architecture.) Our simplified block design combines the H3 block, which is the basis of most SSM architectures, with the ubiquitous MLP block of modern neural networks. Instead of interleaving these two blocks, we simply repeat the Mamba block homogenously. Compared to the H3 block, Mamba replaces the first multiplicative gate with an activation function. Compared to the MLP block, Mamba adds an SSM to the main branch. For $\sigma$ we use the SiLU / Swish activation (Hendrycks and Gimpel 2016; Ramachandran, Zoph, and Quoc V Le 2017).

Empirical Performance

Mamba’s selection mechanism enables linear‑time modeling that matches Transformer performance.

Recall that Transformers attend over all tokens without compressing context, while Mamba introduces a selection mechanism that filters information and runs in linear time. The following results quantify how well this design closes the gap to strong Transformer baselines.

Mamba matches the performance of the strong Transformer++ recipe across model sizes, achieving comparable perplexity while using linear‑time computation.

Figure 4 shows Mamba’s perplexity curve lying on top of Transformer++ for both 2048‑ and 8192‑token settings.

**Figure 4.** (Scaling Laws.) Models of size $\approx 125M$ to $\approx 1.3B$ parameters, trained on the Pile. Mamba scales better than all other attention-free models and is the first to match the performance of a very strong "Transformer++" recipe that has now become standard, particularly as the sequence length grows.

**Table 1.** (Selective Copying.) Accuracy for combinations of architectures and inner sequence layers.

Mamba’s ability to match the strong Transformer++ scaling laws shows that linear‑time selection can replace quadratic attention without sacrificing quality.

Genomic Sequence Modeling

Mamba outperforms open‑source baselines on zero‑shot language evaluations, and on DNA modeling it scales better with model size and context length.

Mamba achieves the highest average accuracy at every model size, leading the next‑best open‑source model by about 4 points at the largest size.

On the zero‑shot benchmark suite (Table 3), the Mamba‑2.8B model attains 55.6 % average accuracy, while the next best open‑source model (RWKV‑3B) reaches 51.4 %.

**Table 3.** (Zero-shot Evaluations.) Best results for each size in bold. We compare against open source LMs with various tokenizers, trained for up to 300B tokens. Pile refers to the validation split, comparing only against models trained on the same dataset and tokenizer (GPT-NeoX-20B). For each model size, Mamba is best-in-class on every single evaluation result, and generally matches baselines at twice the model size.

**Figure 5.** (DNA Scaling Laws.) Pretraining on the HG38 (human genome) dataset. (Left) Fixing short context length $2^{10} = 1024$ and increasing size from $\approx 200K$ to $\approx 40M$ parameters, Mamba scales better than baselines. (Right) Fixing model size and increasing sequence lengths while keeping tokens/batch and total training tokens fixed. Unlike baselines, the selection mechanism of Mamba facilitates better performance with increasing context length.

Audio Generation Performance

Mamba outperforms SaShiMi on long‑context audio tasks, scaling to million‑token sequences.

On the induction heads extrapolation task, Mamba remains accurate up to sequence length 2²⁰ (≈10⁶), whereas the attention‑based baselines fail beyond 2¹⁵.

Table 2 shows checkmarks for Mamba at every tested length (2⁶–2²⁰) while MHA‑* models lose marks after 2¹⁵.

The scaling advantage stems from Mamba’s recurrent formulation combined with selection: irrelevant tokens are filtered out of a fixed‑size state, so the per‑step cost stays constant and the total cost grows linearly in sequence length.

**Figure 6.** (Great Apes DNA Classification.) Accuracy after finetuning on sequences of length $2^{10} = 1024$ up to $2^{20} = 1048576$ using pretrained models of the same context length. Numerical results in Table 13.

**Figure 7.** (Audio Pretraining.) Mamba improves performance over prior state-of-the-art (Sashimi) in autoregressive audio modeling, while improving up to minute-long context or million-length sequences (controlling for computation).

Table 4 confirms that Mamba’s UNet‑style architecture yields the best fidelity on SC09, achieving the lowest FID and AM while delivering the highest MIS among all baselines.

Table 5 further shows that swapping S4+MLP blocks for Mamba blocks improves quality across the tested configurations, supporting the benefit of the selection mechanism.

Computational Efficiency

Key efficiency gains of Mamba: massive speedups and linear‑time scaling.

Recall that Mamba is a recurrent model with a fixed‑size state: each token costs a constant amount of work, so the quadratic attention cost of Transformers becomes a linear‑time process, while the selection mechanism filters information dynamically.

Mamba achieves up to 5× higher inference throughput than Transformers of similar size.

Inference throughput measured on batch sizes 1–128 shows the 1.4 B‑parameter Mamba model consistently processing roughly five times more tokens per second than a 1.3 B‑parameter Transformer.

**Figure 8.** (Efficiency Benchmarks.) (Left) Training: our efficient scan is 40× faster than a standard implementation. (Right) Inference: as a recurrent model, Mamba can achieve 5× higher throughput than Transformers.

**Table 2.** (Induction Heads.) Models are trained on sequence length $2^8 = 256$, and tested on increasing sequence lengths of $2^6 = 64$ up to $2^{20} = 1048576$. Full numbers in Table 11.

Activating all three selective parameters ($\Delta$, B, C) lowers perplexity from 10.93 to 8.71.

Table 7 reports perplexity 10.93 with no selective parameters and 8.71 when $\Delta$, B, and C are all enabled.

Table 5 (SC09 Model Ablations) shows that replacing the outer S4+MLP block with a selective SSM (S6) consistently improves NLL and FID across all configurations, confirming the motivation of Section 3.

Table 6 (Architecture and SSM layer) further isolates the effect of the inner SSM: selective SSMs (S6) achieve lower perplexity than any non‑selective variant (for example, 8.95 versus 10.24 for Hyena within the H3 architecture), while the simpler Mamba block performs on par with the H3 architecture.

Ablation Studies

We quantify how each component removal impacts language‑model perplexity.

We evaluate each architectural choice by measuring test‑set perplexity on a language‑model benchmark.

Within the H3 architecture, replacing the Hyena layer with the S6 layer reduces perplexity by 1.29 points.

S6 achieves 8.95 versus 10.24 for Hyena H3.

Within the Mamba architecture, the S6 layer improves perplexity by 2.06 points over the Hyena layer.

8.69 versus 10.75.

Activating all three selective parameters ($\Delta$, B, C) cuts perplexity by 2.22 points relative to the fully static baseline.

8.71 versus 10.93.

Switching from the standard complex‑valued initialization to a simple real‑valued diagonal initialization ($A_n = -(n+1)$) improves perplexity by 0.45 points.

8.71 versus 9.16.

Projecting $\Delta$ to dimension 64 yields the lowest perplexity (8.71) while increasing parameters by only 12.6 M.

8.71 versus 9.12 without the projection, at 371.5 M versus 358.9 M parameters.

Discussion and Limitations

We examine related work, limitations, and future directions for the selection mechanism.

We situate our selection mechanism within prior work, outline its current constraints, and sketch promising avenues.

Projecting the $\Delta$ vector to a single dimension already yields a large perplexity reduction; expanding to higher dimensions brings modest further gains at a small parameter cost.

Table 9 shows perplexity dropping from 9.12 without the projection to 8.71 at dim 64 while parameters increase from 358.9 M to 371.5 M.

While the selection mechanism improves performance on discrete sequences such as text and DNA, it can degrade results on continuous‑time signals where traditional LTI SSMs excel — akin to a traffic light that efficiently directs cars at a busy intersection but creates bottlenecks on a smooth highway.

We also ask whether SSM‑based backbones inherit the rich downstream affordances of Transformers—fine‑tuning, prompting, RLHF, and quantization—observed in large language models.

Our experiments are confined to modest model sizes; we have not evaluated Mamba at the 7 B‑parameter scale where competing recurrent models have been benchmarked, which is like testing only on a city‑block route and assuming the same fuel efficiency on a cross‑country trip.

Extending selective state‑space models to domains requiring very long context—genomics, audio, video—represents a key direction for future research.

Technical Appendices

Additional context, related work, and implementation details for the selection mechanism.

The selection mechanism sits at the intersection of gating, hypernetworks, and data‑dependence, yet it remains a distinct operation that enables input‑dependent filtering for linear‑time sequence modeling.

Historically, gating referred to the multiplicative control gates of LSTM and GRU that regulate signal flow through time; today “gating” is often used loosely for any elementwise multiplication, even when no temporal interaction occurs.

Hypernetworks generate parameters of a target network via a smaller auxiliary network, a pattern first explored for recurrent layers and later generalized across architectures.

Data‑dependence describes any situation where model parameters are functions of the current input, exemplified by a diagonal linear layer $y = D x$ with $D = \sigma(Wx)$, which collapses gating, hypernetworks, and activation into a single GLU‑like operation.
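The diagonal layer in this paragraph can be written out directly. This is a minimal sketch assuming a square weight matrix $W$ stored as a list of rows; the function name `data_dependent_diagonal` is illustrative.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def data_dependent_diagonal(W, x):
    """Compute y = D x with D = diag(sigmoid(W x)).

    The "parameters" D are produced from the input itself, so the layer is
    at once a gate, a tiny hypernetwork, and a GLU-like activation.
    """
    gate = [sigmoid(sum(w * x_j for w, x_j in zip(row, x))) for row in W]
    return [g * x_i for g, x_i in zip(gate, x)]


W = [[0.0, 0.0], [0.0, 0.0]]  # with W = 0 every gate equals sigmoid(0) = 0.5
x = [2.0, -4.0]
y = data_dependent_diagonal(W, x)
print(y)  # [1.0, -2.0]
```

Nothing in this layer mixes information across time steps, which is why the text distinguishes it from selection along the sequence dimension.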

We reserve the term “selection” for mechanisms that explicitly choose to keep or discard inputs along the sequence dimension, a view that aligns with classic RNN gating (Theorem 1) and with input‑dependent discretization of $\Delta$ in SSMs.

Related work spans structured SSMs, linear‑attention models, and long‑context architectures, each addressing the quadratic scaling of vanilla Transformers in different ways.

S4 introduced diagonal‑plus‑low‑rank (DPLR) structure; subsequent variants such as DSS, S4D, and S5 refined initialization and recurrence schemes, while our S6 retains the scan core but adds a learnable selection gate.

Architectural hybrids like GSS, H3, and RetNet combine SSM blocks with attention or MLP pathways; Selective S4 applies a binary mask to inputs but lacks the full input‑dependent state expansion of our approach.

Traditional gated RNNs (e.g., LSTM, GRU) can be viewed as early selective SSMs, differing mainly in that they keep a fixed state dimension ($N=1$) and lack the input‑dependent $\mathbf{B},\mathbf{C}$ parameters that our selection mechanism introduces.

Linear attention reframes softmax attention as a kernel trick, spawning variants such as Random Feature Attention, Performer, TransNormer, cosFormer, and Linear Randomized Attention, all of which trade exactness for $O(L)$ complexity.

Long‑context models like Recurrent Memory Transformer, LongNet, HyenaDNA, and Sparse Transformer push sequence lengths toward $10^6$ tokens, yet many evaluate only synthetic tasks or fail to control for compute and data scaling.

Mechanically, Theorem 1 shows that with $N=1$, $A=-1$, $B=1$, and a softplus‑scaled step size, the discrete recurrence reduces to $g_t=\sigma(\text{Linear}(x_t))$, $h_t=(1-g_t)h_{t-1}+g_t x_t$, i.e., a learned input‑dependent interpolation.
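The reduction can be verified numerically. The sketch below assumes zero-order-hold discretization and treats $z$ as a stand-in for the scalar $\text{Linear}(x_t)$; it relies on the identity $\exp(-\operatorname{softplus}(z)) = 1-\sigma(z)$.

```python
import math


def softplus(z):
    return math.log1p(math.exp(z))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def discretized_step(h_prev, x, z):
    """One ZOH step with N = 1, A = -1, B = 1 and delta = softplus(z)."""
    delta = softplus(z)
    a_bar = math.exp(-delta)       # exp(delta * A) with A = -1
    b_bar = (a_bar - 1.0) / -1.0   # (dA)^-1 (exp(dA) - 1) * delta * B
    return a_bar * h_prev + b_bar * x


def gated_step(h_prev, x, z):
    """The gated form: h_t = (1 - g_t) h_{t-1} + g_t x_t."""
    g = sigmoid(z)
    return (1.0 - g) * h_prev + g * x


for z in (-3.0, -0.5, 0.0, 1.2, 4.0):
    diff = abs(discretized_step(0.7, -1.3, z) - gated_step(0.7, -1.3, z))
    print(z, diff)  # differences at floating-point precision
```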

Our hardware‑aware implementation fuses discretization, parallel associative scan, and output multiplication into a single GPU kernel, cutting memory traffic by a factor of $N$ and achieving 20–40× speedups over naïve scans.
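The reason a recurrence can be parallelized at all is that each step $h_t = a_t h_{t-1} + b_t$ is an affine map, and composing affine maps is associative. The sketch below illustrates that idea only, in plain Python with a simple Hillis-Steele style scan; it is not the fused GPU kernel, and the scalar $a_t$, $b_t$ values are illustrative.

```python
def combine(left, right):
    """Compose two affine updates h -> a*h + b, applying `left` first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential_scan(a, b):
    """Reference: step the recurrence h_t = a_t * h_{t-1} + b_t from h_0 = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def associative_scan(a, b):
    """Inclusive scan in about log2(L) passes.

    Within one pass every position t is independent of the others, so a
    parallel device could compute all of them at once.
    """
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        elems = [
            combine(elems[t - step], elems[t]) if t >= step else elems[t]
            for t in range(len(elems))
        ]
        step *= 2
    # With h_0 = 0, the offset of the composed map is h_t itself.
    return [b_t for _, b_t in elems]


a = [0.9, 0.5, 0.8, 0.7, 0.6]
b = [1.0, 2.0, -1.0, 0.5, 3.0]
print(sequential_scan(a, b))
print(associative_scan(a, b))
```

In a selective SSM, $a_t$ and $b_t$ would come from the discretized, input-dependent $\overline{A}_t$ and $\overline{B}_t x_t$; the scan itself does not care where they come from.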

**Table 12.** (Scaling Law Model Sizes.) Our model sizes and hyperparameters for scaling experiments. (Model dimension and number of heads applies only to Transformer models.)

Synthetic benchmarks include a 4096‑step selective copying task (vocab 16) and an induction‑heads task, both trained with Adam at $2\times10^{-4}$ or $1\times10^{-3}$ learning rates and evaluated after fixed epochs.

Language‑model scaling follows the GPT‑3 recipe on the Pile, comparing Transformers, Transformers++, Hyena, H3++, RWKV, RetNet, and Mamba across four model sizes (125 M–1.3 B parameters) as detailed in Table 12.

DNA modeling adapts the Enformer data splits, training models from 250 K to 40.7 M parameters with sequence‑length warmup and constant token budget, then evaluates species classification accuracy up to $2^{20}$ tokens.

Audio experiments pretrain a 15‑block Mamba stack on YouTubeMix (≈ 1 M‑token batches) and fine‑tune on SC09 speech, noting that larger models overfit the small speech dataset but still improve sample quality over training.

Efficiency benchmarks compare our fused scan against a baseline parallel scan, convolution, and FlashAttention, showing the scan to be up to 7× faster than attention at 32 K tokens, with memory usage comparable to the best Transformer implementation.
