Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

Subspace-Aware Sparse Autoencoders (SASA) replace single-vector decoders with learned subspaces to eliminate feature splitting.

How can we modify Sparse Autoencoders to represent multi-dimensional semantic features as single subspaces rather than splitting them across many redundant one-dimensional vectors?

Standard Sparse Autoencoders (SAEs) assume semantic features are one-dimensional, forcing them to fragment a single multi-dimensional concept across many redundant, near-collinear decoder vectors. SASA replaces these single-vector decoders with learned decoder subspaces and enforces sparsity at the group level, allowing a single block to capture an entire multi-dimensional feature. This consolidation improves monosemanticity and reduces the required training token budget by approximately half compared to standard SAEs.

Paper Primer

The paper identifies a geometric mismatch: while modern Large Language Models (LLMs) encode features as low-dimensional subspaces, standard SAEs use an $\ell_1$-regularized objective that actively drives the dictionary to "tile" these subspaces with many individual vectors. This splitting is not just a failure of training, but a stable optimum of the standard SAE objective, which makes interpretability fragmented and inefficient.

SASA solves this by grouping latents into blocks and applying a nuclear-norm regularizer to the reconstruction map of each group. SASA is like a specialized filing system: instead of storing individual pages in separate folders, it assigns a single, flexible binder to each multi-dimensional topic, using group-level gating to ensure only the most relevant binders are active.

SASA achieves superior feature recovery and monosemanticity while training on significantly fewer tokens.

On GPT-2 and Mistral-7B, SASA matches or exceeds standard SAE performance on explained variance and KL scores using roughly 50% of the token budget. Feature absorption—a key indicator of splitting—was reduced from 37.2% to 6.6% on GPT-2 and from 18.3% to 11.9% on Mistral-7B.

Why does the standard SAE objective actively prefer splitting features?

The coordinate-wise $\ell_1$ penalty creates a residual that no single basis vector can absorb, forcing the model to add more vectors to cover the feature's geometry. SASA’s group-level spectral penalty couples these coordinates, making it mathematically cheaper to represent the feature as a single coherent subspace.

What is the primary bottleneck SASA addresses regarding training efficiency?

Training SAEs requires massive amounts of LLM activations, each costing a full forward pass. By reducing the sample complexity from exponential in the feature dimension to polynomial, SASA requires fewer activations to achieve the same reconstruction error, directly lowering the total compute cost.

Researchers should shift from vector-based dictionary learning to subspace-based methods when interpreting LLMs, as this aligns the interpretability tool with the actual multi-dimensional geometry of model representations.

Introduction

We expose why treating features as single vectors fragments multi‑dimensional concepts and introduce subspace‑aware autoencoders to fix it.

Mechanistic interpretability of large language models relies on disentangling hidden representations, but current sparse autoencoders force every latent feature into a single direction. This one‑dimensional assumption clashes with the inherently multi‑dimensional geometry of many concepts, leading to redundant, fragmented latents that obscure true structure.

A sparse autoencoder is a model that learns a dictionary of decoder vectors to reconstruct high‑dimensional activations, using only a few of them for any given input; each vector is intended to capture a distinct semantic feature.

Standard SAEs treat every semantic feature as a one‑dimensional direction, which forces the model to split multi‑dimensional concepts across many redundant vectors; SASA instead learns low‑dimensional subspaces to capture these features holistically.

The limitation of one-dimensional feature representation in current interpretability tools.

**Figure 8.** Mistral SASA Group 1570 Activation Profiles. The group consistently activates on geographical tokens.

**Figure 10.** **SASA Group 1056 — Sports subspace.** AutoInterp labels this group as *Sports and athletic activity terms*. A 3D PCA view separates combat/action, titles/achievement, and general sports contexts (e.g., sport, athletic).

Problem Formulation

Defines the subspace model of activations and why standard SAEs struggle.

Standard sparse autoencoders treat each semantic feature as a single direction. When a feature actually lives in a low‑dimensional subspace, this forces the model to split the feature across many nearly collinear directions, inflating redundancy and sample complexity.

A subspace is like a flat tabletop where a point can slide anywhere on the surface, whereas a single direction is a narrow hallway that only permits motion along one line.

How does a subspace differ from a single direction in practice?

A single direction can only represent variations along one axis, while a subspace spans multiple independent axes. Consequently, a subspace can capture richer variations of a feature without needing many separate directions.

Compute $\mathbf{V}_1\mathbf{z}_1 = (2,\,-1,\,0,\,0)^\top$.

Compute $\mathbf{V}_2\mathbf{z}_2 = (0,\,0,\,3,\,0)^\top$.

Sum the two projections: $(2,\,-1,\,3,\,0)^\top$.

Add noise $\boldsymbol{\xi}$ to obtain $\mathbf{h}= (2.1,\,-1.2,\,3.05,\,0.0)^\top$.

This toy example shows how a single activation can be expressed as a combination of low‑dimensional subspace components plus a small residual, illustrating the superposition hypothesis without inflating the number of directions.
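The four steps of the toy example can be reproduced numerically. This is an illustrative sketch only: the bases `V1` and `V2`, the coefficients `z1` and `z2`, and the noise vector `xi` are assumed values, chosen so that the intermediate results match the numbers quoted in the steps. The summary itself does not state them.

```python
import numpy as np

# Assumed subspace bases in a 4-dimensional activation space (not given in the text).
V1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [0.0, 0.0],
               [0.0, 0.0]])          # 2-dimensional feature subspace
V2 = np.array([[0.0], [0.0], [1.0], [0.0]])  # 1-dimensional feature subspace

# Assumed coefficients and noise, chosen to reproduce the quoted vectors.
z1 = np.array([2.0, -1.0])
z2 = np.array([3.0])
xi = np.array([0.1, -0.2, 0.05, 0.0])

part1 = V1 @ z1            # contribution of feature 1
part2 = V2 @ z2            # contribution of feature 2
h = part1 + part2 + xi     # activation = sum of subspace components + residual

print(part1, part2, h)
```

Each feature contributes one vector that can move freely inside its own subspace, so two active features account for the whole activation up to the residual `xi`.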

An SAE encodes an activation into a sparse latent vector in which each latent controls a single direction, then decodes by linearly mixing those directions back into the original space.

Why does a single‑direction latent cause redundancy when the true feature lives in a higher‑dimensional subspace?

Each latent can only capture variation along its own direction. To represent a subspace of dimension $d_i>1$, the SAE needs at least $d_i$ separate latents, and to keep each input sparse it typically learns many more, nearly collinear directions, which wastes parameters and hampers sample efficiency.

The Necessity of Subspace Learning

Standard SAEs force multi‑dimensional features to split; learning subspaces avoids this.

Standard SAEs force every semantic feature into a single direction. When a feature truly spans a $d_i$‑dimensional subspace, this single‑direction assumption forces the model to split the feature across many near‑collinear atoms, inflating the dictionary size.

When a decoder can only activate $k$ directions per input but the true feature lives in a $d_i$‑dimensional subspace with $d_i>k$, the model must allocate multiple atoms to cover the subspace, each handling a different slice of the feature.

Compute the lower bound $L_i^{(\varepsilon)} \ge (t/\varepsilon)^{(d_i-k)/k}= (1/0.1)^{1}=10$ atoms.

Place 10 unit vectors evenly around the circle (every $36^\circ$).

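Each atom’s span covers an arc of length roughly $2\varepsilon=0.2$ around the atom (and around its antipode); the covering bound is a lower bound on how many such arcs are needed before every point lies within $\varepsilon$ of some atom’s span.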

Reconstruction proceeds by selecting the nearest atom and scaling it to match the input magnitude.

The decoder thus uses 10 atoms despite the feature being only 2‑dimensional.

Even a modest accuracy requirement forces the dictionary to grow linearly with $t/\varepsilon$ in this two‑dimensional case, and faster in higher dimensions, illustrating why a single direction cannot capture a multi‑dimensional feature.

Why does limiting the co‑activation budget $k$ to 1 force the model to use many atoms?

With $k=1$ the decoder can only span a single direction per input. To approximate every point on a $d_i$‑dimensional manifold, it must place atoms densely enough that each point lies within $\varepsilon$ of some atom’s direction. The covering bound shows the required number grows as $(t/\varepsilon)^{d_i-1}$, which is exponential in the intrinsic dimension $d_i$.
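To make the growth rate concrete, the covering bound $C\,(t/\varepsilon)^{(d_i-k)/k}$ can be evaluated directly. The sketch below is a plain calculator for that expression; the function name `covering_lower_bound` is hypothetical, and the constant `C` is set to 1 as an assumption because its value is not given here.

```python
def covering_lower_bound(t, eps, d_i, k=1, C=1.0):
    """Evaluate the lower bound C * (t / eps) ** ((d_i - k) / k) on the atom count.

    t   : radius of the feature slice
    eps : target reconstruction accuracy
    d_i : intrinsic dimension of the feature
    k   : number of directions the decoder may co-activate per input
    C   : unspecified constant from the bound (assumed to be 1 here)
    """
    if not 0 < eps <= t:
        raise ValueError("need 0 < eps <= t")
    if not 1 <= k <= d_i:
        raise ValueError("need 1 <= k <= d_i")
    return C * (t / eps) ** ((d_i - k) / k)

# The circle example: d_i = 2, k = 1, t = 1, eps = 0.1.
circle = covering_lower_bound(t=1.0, eps=0.1, d_i=2, k=1)

# Same accuracy, increasing intrinsic dimension, still one direction per input.
bounds_by_dim = {d_i: covering_lower_bound(1.0, 0.1, d_i, k=1) for d_i in (2, 3, 4, 5)}
print(circle, bounds_by_dim)
```

With $k=1$ each extra intrinsic dimension multiplies the bound by another factor of $t/\varepsilon$, while setting `k = d_i` makes the exponent zero and the bound collapses to the constant.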

Even though activations live in a high‑dimensional ambient space, most of their variance is captured by a low‑dimensional subspace; the dimension of that subspace is the intrinsic dimensionality.

How does intrinsic dimensionality differ from the ambient dimension of the activation space?

The ambient dimension is the total number of coordinates (e.g., 768 for GPT‑2). Intrinsic dimensionality counts only the directions that actually carry signal; the remaining directions are essentially noise and can be ignored without losing much information.

The redundancy ratio measures how many decoder atoms are needed relative to the intrinsic dimensionality of the subspace they aim to capture; a ratio near 1 indicates an efficient representation.

Why is a redundancy ratio larger than 1 considered undesirable?

A ratio > 1 means the decoder allocates extra atoms that do not increase the captured variance; those atoms are redundant, increasing memory and compute without improving representation quality.
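Both quantities can be estimated with a few lines of linear algebra. The sketch below is one plausible reading, not the paper's exact procedure: `pca_dim` counts the principal components needed to reach a variance threshold (0.90 for a dim90-style estimate, 0.80 for decoder clusters), and `redundancy_ratio` divides the number of atoms in a cluster by that count. Function names, centering before the SVD, and the synthetic data are all assumptions.

```python
import numpy as np

def pca_dim(X, var_threshold=0.90):
    """Number of principal components needed to explain `var_threshold` of the variance."""
    Xc = X - X.mean(axis=0)                       # centre the rows
    s = np.linalg.svd(Xc, compute_uv=False)       # singular values, descending
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)      # cumulative explained-variance ratio
    return int(np.searchsorted(cum, var_threshold) + 1)

def redundancy_ratio(atoms, var_threshold=0.80):
    """Atoms in a cluster divided by the PCA dimension of the subspace they span."""
    return atoms.shape[0] / pca_dim(atoms, var_threshold)

rng = np.random.default_rng(0)

# Synthetic activations: a 3-dimensional signal embedded in 64 dimensions plus small noise.
basis, _ = np.linalg.qr(rng.normal(size=(64, 3)))
X = rng.normal(size=(2000, 3)) @ basis.T + 0.01 * rng.normal(size=(2000, 64))
dim90 = pca_dim(X, 0.90)

# Synthetic decoder cluster: 10 unit atoms tiling a circle that lies in a 2-D plane of R^8.
angles = 2 * np.pi * np.arange(10) / 10
atoms = np.zeros((10, 8))
atoms[:, 0], atoms[:, 1] = np.cos(angles), np.sin(angles)
ratio = redundancy_ratio(atoms)

print(dim90, ratio)
```

In the synthetic cluster ten atoms span a two-dimensional plane, so the ratio is 5: five times more atoms than the geometry requires.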

**Figure 1.** Standard SAEs split a multi-dimensional feature across many near-collinear atoms, while SASA captures it as a single subspace. We embed three ground-truth concept manifolds, a circle ($d_i = 2$), a sphere $S^2$ ($d_i = 3$), and a helix ($d_i = 3$), into an ambient space of dimension $d = 64$ (with 5% noise) and fit six dictionaries of width 256. *First column:* each manifold colored by its underlying concept value. *Next five columns:* standard vector-based SAEs (ReLU, TopK, BatchTopK, JumpReLU, Gated), in which every latent is tied to a single decoder direction. Each point is colored by the decoder atom most aligned with it. Under the vector-based assumption, the feature is *not* captured by one direction but is instead distributed across tens to hundreds of near-duplicate atoms, each explaining only a local slice of the manifold. Hence, interpreting the feature requires aggregating a whole cluster of latents rather than inspecting a single unit. *Last column:* SASA, which learns a decoder *subspace* as the unit of representation. With the same total width, a single active group of effective rank $d_i$ (one latent) covers the entire feature, recovering its intrinsic geometry rather than fragmenting it.

The geometric analysis (Theorem 4) shows that any decoder restricted to $k<d_i$ directions must allocate at least $C (t/\varepsilon)^{(d_i-k)/k}$ atoms, which explodes as $k$ moves away from $d_i$. The subsequent optimization analysis (Theorems 6–8) proves that the SAE loss actively drives the decoder toward this fragmented regime, making splitting both necessary and optimal under the standard objective.

**Figure 4.** Intrinsic dimensionality in raw GPT-2 activations (no SAE involved). PCA on controlled concept prompts confirms compact subspaces within the 768-dimensional activation space. (a) Temporal slice with dim90 = 14. (b) Geography slice with dim90 = 33.

**Figure 6.** **Redundancy Ratio of Mistral-7B SAE Decoder Clusters.** The left panel shows cluster size vs PCA dimension (capturing 80% variance). The right panel shows a histogram of redundancy ratios. The median ratio of 1.67 suggests features are often split across multiple collinear vectors, indicating inefficiency.

Neural activations naturally reside in low‑dimensional subspaces, so forcing a single‑direction decoder leads to unnecessary feature splitting.

Subspace-Aware Sparse Autoencoders

We replace per‑coordinate sparsity with block‑wise sparsity, letting each subspace capture a whole feature.

Standard sparse autoencoders confine each semantic feature to a one‑dimensional direction, so the model must split a multi‑dimensional concept across many redundant vectors. This fragmentation inflates the decoder budget and hurts representation fidelity. SASA avoids it by moving sparsity from individual coordinates to whole blocks.

Instead of sparsifying each scalar in the code, SASA sparsifies whole blocks, letting a single active block span the entire low‑dimensional subspace of a feature.

How does SASA differ from a standard SAE that applies an $\ell_1$ penalty coordinate‑wise?

Standard SAEs penalize each scalar in the code independently, which forces a multi‑dimensional feature to be split across many atoms. SASA instead penalizes the whole block via a nuclear‑norm term, allowing a single block to represent the entire subspace without internal penalty.
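One way to see the difference between the two penalties is to rotate the basis inside a block. The sketch below assumes that a block's "reconstruction map" is the product of its decoder and encoder matrices (`D @ E`); that reading, the matrix shapes, and the random values are assumptions for illustration. Rotating the internal basis leaves the map, and therefore its nuclear norm, unchanged, while the coordinate-wise $\ell_1$ norm of the block's code depends on the basis chosen.

```python
import numpy as np

def nuclear_norm(M):
    """Sum of singular values of M."""
    return float(np.linalg.svd(M, compute_uv=False).sum())

rng = np.random.default_rng(0)
d, r = 6, 2
D = rng.normal(size=(d, r))      # assumed decoder block (d x r)
E = rng.normal(size=(r, d))      # assumed encoder block (r x d)
h = rng.normal(size=d)           # an arbitrary activation

# Rotate the block's internal basis by 45 degrees: D -> D R, E -> R^T E.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
D_rot, E_rot = D @ R, R.T @ E    # D_rot @ E_rot equals D @ E because R @ R.T = I

nuc_before = nuclear_norm(D @ E)
nuc_after = nuclear_norm(D_rot @ E_rot)

l1_before = float(np.abs(E @ h).sum())      # coordinate-wise penalty on the block code
l1_after = float(np.abs(E_rot @ h).sum())   # same feature, rotated coordinates

print(nuc_before, nuc_after, l1_before, l1_after)
```

The spectral penalty treats the block as one geometric object, whereas the $\ell_1$ penalty rewards some internal bases over others and so has a preferred set of individual directions.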

The gate selects block 3 because it has the largest norm $3.16$.

Block‑sparse latent $a_3(h)=p_3(h)=(3,1)$; $a_1(h)=a_2(h)=0$.

Assume decoder matrices $D_1,D_2,D_3$ are identity‑scaled: $D_3=\begin{bmatrix}1&0\\0&1\\0&0\end{bmatrix}$.

Reconstruction $\hat a(h)=D_3 a_3(h)=(3,1,0)^\top$.

The reconstruction error $\|h-\hat a(h)\|_2$ is reduced compared to any split across multiple blocks because the whole feature lives in block 3.

The gate’s scalar norm criterion lets a single block capture the full feature, avoiding the need to distribute the representation across several blocks.

Proposition 10 shows that if a block’s width $r$ is at least the feature’s intrinsic dimension $d_i$, a single block can represent the entire feature slice with zero reconstruction error. Thus the capacity bottleneck that forced standard SAEs to fragment the feature disappears.

Theorem 11 strengthens this by proving that, under the SASA objective, every global minimiser activates exactly one block whose decoder spans the top‑$d_i$ eigenspace of the feature covariance. This contrasts with Section 4.2, where the $\ell_1$ penalty ruled out the compact basis decoder as an optimum and pushed the dictionary toward a fragmented allocation.

Top‑$s$ gate: select the $s$ blocks with largest $\ell_2$ norm.
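A minimal sketch of the gate, applied to the worked example above. The function name `top_s_gate` is hypothetical, and the pre-activations for blocks 1 and 2 are made-up values (the example only states block 3's pre-activation $(3,1)$ and its decoder $D_3$).

```python
import numpy as np

def top_s_gate(pre, s):
    """Keep the s blocks with the largest l2 norm; zero out every other block.

    pre : array of shape (K, r), one row of pre-activations p_k(h) per block
    Returns the block-sparse latents (K, r) and the sorted indices of active blocks.
    """
    norms = np.linalg.norm(pre, axis=1)
    keep = np.argsort(norms)[-s:]          # indices of the s largest norms
    latents = np.zeros_like(pre)
    latents[keep] = pre[keep]
    return latents, np.sort(keep)

# Block 3 has pre-activation (3, 1); blocks 1 and 2 are assumed values.
pre = np.array([[0.5, 0.2],
                [1.0, -0.4],
                [3.0, 1.0]])
latents, active = top_s_gate(pre, s=1)

block3_norm = np.linalg.norm(pre[2])       # sqrt(10), about 3.16
D3 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [0.0, 0.0]])
recon = D3 @ latents[2]                    # reconstruction from the single active block

print(active, round(float(block3_norm), 2), recon)
```

Indices are zero-based, so block 3 appears as index 2. The whole block passes through the gate unchanged, so there is no sparsity pressure on the coordinates inside an active block.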

Sample Complexity Efficiency

SASA dramatically cuts the samples needed for accurate feature learning, speeding up training.

SASA reduces the number of samples required to learn accurate feature representations: in theory from exponential to polynomial in the feature dimension, and in the reported experiments to roughly half the token budget of a standard SAE.

By Theorem 13, SASA needs $\tilde{O}(d^{2}/\varepsilon^{2})$ samples, whereas Proposition 12 combined with the covering bound requires a standard SAE to train $N \ge C\,(t/\varepsilon)^{d_i-1}$ atoms and to see correspondingly many samples, an exponential dependence on the intrinsic dimension.

Standard SAEs must gather enough activations to cover every one‑dimensional direction, which forces $n \ge N\log N$ samples (Proposition 12). In contrast, SASA treats each feature as a low‑dimensional subspace, so the required sample count scales only with $d^{2}$ (Theorem 13). The empirical Mistral‑7B timing illustrates the practical impact: acquiring $1\,$M tokens costs ≈196.8 s, whereas a single SAE forward pass is only 3.3 s and a backward pass 4.6 s.
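The standard-SAE side of this comparison is a coupon-collector style count: every one of the $N$ directions must be seen at least once. The sketch below evaluates the form $N(\ln N + \ln(1/\delta))$ reported for Proposition 12; the function name, the use of natural logarithms throughout, and the default failure probability `delta` are assumptions.

```python
import math

def coverage_samples(N, delta=0.05):
    """Samples needed to observe all N decoder directions with probability >= 1 - delta,
    using the form N * (ln N + ln(1/delta))."""
    if N < 1 or not 0 < delta < 1:
        raise ValueError("need N >= 1 and 0 < delta < 1")
    return N * (math.log(N) + math.log(1.0 / delta))

# N is the atom count from the covering bound, e.g. (t/eps)**(d_i - 1) with t/eps = 10.
samples_by_atoms = {N: coverage_samples(N) for N in (10, 100, 1000)}
print(samples_by_atoms)
```

Because $N$ itself grows exponentially with the intrinsic dimension, the sample count inherits that growth with an extra logarithmic factor.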

SASA’s lower sample complexity translates into substantially faster end‑to‑end training.

Experimental Results

We show that SASA halves the training token budget while preserving accuracy.

Standard SAEs treat each semantic feature as a single direction; SASA instead learns a low‑dimensional subspace for each feature, enabling more compact representations.

SASA matches or exceeds standard SAE performance while using only half the training token budget.

Table 1 shows SASA reaches 98 % KL and CE scores with 150 M tokens versus 300 M for the standard SAE on GPT‑2, and similar gains on Mistral‑7B.

**Figure 9.** Geometry of the Geographical Subspace. A PCA projection of the latent activations in Mistral SASA Group 1570. The subspace organizes geographical concepts into distinct clusters, preserving the hierarchical distinction between cities (blue), countries (orange), and continents (green).

Related Work

We survey prior work on mechanistic interpretability, representation geometry, and sparse autoencoders.

Mechanistic interpretability seeks concrete, causal explanations of neural network behavior, typically validated by targeted interventions such as ablations and activation patching. Wang et al. (2022) showed that complex behaviors can be broken down into compact circuits that remain functional under such controlled edits, and Conmy et al. (2023) began automating this workflow by automatically locating sparse, behavior‑preserving subgraphs in the computational graph.

Despite these advances, two core challenges persist: how to decompose networks into components that are both meaningful and causally relevant, and how to rigorously validate mechanistic hypotheses so they do not become interpretability illusions (Sharkey et al., 2025).

In parallel, a line of work on LLM representations has evolved from a one‑dimensional view—treating concepts as linear directions—to multi‑dimensional hypotheses where concepts occupy low‑dimensional subspaces or structured geometric objects such as simplices. This shift matters because it provides concrete geometric targets for probing, intervening, and steering models.

Modell et al. (2025) conjecture that the phenomenon of feature splitting is a symptom of inherently multi‑dimensional features: when a true feature lives in a low‑dimensional subspace, standard vector SAEs approximate that subspace with multiple dictionary vectors. Engels et al. (2025) and Bhalla et al. (2026) provide empirical and theoretical support for this claim, showing that SAEs fragment a single manifold across many partially redundant atoms.

Bhalla et al. (2026) formalize what it means for an SAE to capture a concept manifold and prove that an ideal sparse decoder over an incoherent dictionary can recover the manifold subspace. In practice, however, trained SAEs settle into a fragmented “dilution” regime, a consequence of the $\ell_1$ SAE objective actively rejecting the subspace basis.

The superposition hypothesis explains why neuron‑level interpretations often fail: many sparse features are packed into a limited activation space, producing polysemantic units. Dictionary‑learning methods—including SAEs—aim to recover a more feature‑aligned basis, but training them is costly because they typically require large amounts of hidden‑state data (Leask et al., 2025).

Standard SAEs implement a vector dictionary with sparsity enforced by mechanisms such as ReLU, JumpReLU, gating, Top‑k, BatchTopK, and Matryoshka. Scaling analyses (Michaud et al., 2025) reveal that when features have manifold structure, many vectors are allocated to a single manifold, yielding far fewer distinct learned features than the model width would suggest. This motivates moving beyond vector dictionaries toward subspace‑level features using structured sparsity and low‑rank control.

Subspace Validation Results

Low‑dimensional subspace validation shows SASA’s assumptions hold and standard SAEs waste capacity.

Standard SAEs treat each semantic feature as a single direction; SASA learns low‑dimensional subspaces to capture features holistically. This appendix reports the ablations that test those claims.

A mixture‑of‑subspaces model captures more held‑out variance than global PCA.

At $K = 256$ clusters with rank‑16 local PCA, it explains 83.95 % of held‑out variance versus 74.46 % for global PCA at rank 256.

These ablations confirm that low‑dimensional subspaces better model activations and that standard SAEs waste capacity via feature splitting.
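The comparison can be sketched on synthetic data: cluster the activations, fit a low-rank PCA inside each cluster, and compare held-out explained variance against a single global PCA. Everything below is an assumption made for illustration (a simple k-means with farthest-point initialisation, two clusters, rank 2, toy data); it is not the paper's pipeline or its GPT-2 setting.

```python
import numpy as np

def farthest_point_init(X, K):
    """Deterministic init: start at X[0], then repeatedly add the farthest point."""
    centers = [X[0]]
    for _ in range(1, K):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    return np.array(centers)

def nearest_center(X, centers):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def fit_mixture(X, K, rank, n_iter=20):
    """k-means centres plus a rank-`rank` PCA basis around each centre."""
    centers = farthest_point_init(X, K)
    for _ in range(n_iter):
        labels = nearest_center(X, centers)
        for k in range(K):
            if np.any(labels == k):          # keep the old centre if a cluster empties
                centers[k] = X[labels == k].mean(axis=0)
    labels = nearest_center(X, centers)
    bases = []
    for k in range(K):
        pts = X[labels == k] - centers[k]
        if len(pts) == 0:
            bases.append(np.zeros((0, X.shape[1])))
            continue
        _, _, Vt = np.linalg.svd(pts, full_matrices=False)
        bases.append(Vt[:rank])
    return centers, bases

def mixture_explained_variance(X_train, X_test, K, rank):
    centers, bases = fit_mixture(X_train, K, rank)
    labels = nearest_center(X_test, centers)
    resid = 0.0
    for k in range(K):
        Z = X_test[labels == k] - centers[k]
        resid += ((Z - Z @ bases[k].T @ bases[k]) ** 2).sum()
    total = ((X_test - X_train.mean(axis=0)) ** 2).sum()
    return 1.0 - resid / total

def global_pca_explained_variance(X_train, X_test, rank):
    mean = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    Z = X_test - mean
    resid = ((Z - Z @ Vt[:rank].T @ Vt[:rank]) ** 2).sum()
    return 1.0 - resid / (Z ** 2).sum()

# Toy data: two clusters, each varying in its own 2-D subspace of R^10.
rng = np.random.default_rng(0)
n, d = 400, 10
A = 0.05 * rng.normal(size=(n, d)); A[:, 0] += 4.0; A[:, 1:3] += rng.normal(size=(n, 2))
B = 0.05 * rng.normal(size=(n, d)); B[:, 0] -= 4.0; B[:, 3:5] += rng.normal(size=(n, 2))
X = rng.permutation(np.vstack([A, B]))
X_train, X_test = X[:600], X[600:]

mix_ev = mixture_explained_variance(X_train, X_test, K=2, rank=2)
pca_ev = global_pca_explained_variance(X_train, X_test, rank=2)
print(mix_ev, pca_ev)
```

In this toy setting a single rank-2 basis has to spend one direction on the gap between clusters, while the mixture gives each cluster its own local basis, which is the same effect the ablation measures at larger scale.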

Additional Experiments

Additional experiments quantify SASA’s subspace benefits and feature‑splitting reductions.

We extend the core SASA evaluation with three families of analyses: (i) absorption of sparse probes on the first‑letter benchmark, (ii) selective group screening for temporal and geographical concepts, and (iii) geometric inspection of the learned subspaces.

Mixture‑of‑subspaces reconstruction with rank‑16 local PCA captures 83.95 % of held‑out variance, surpassing global PCA’s 74.46 % at rank 256.

Figure 5 shows the teal curve (top‑min(16, K) span mixture) overtaking the global‑PCA baseline between K=32 and K=64 and ending higher at K=256.

**Figure 5.** Mixture-of-subspaces reconstruction in raw GPT-2 activations. Rank-16 local PCA at $K = 256$ clusters captures 83.95% of held-out variance, exceeding global PCA at rank 256 (74.46%).

**Figure 7.** Redundancy Ratio of GPT-2 SAE Decoder Clusters. The median redundancy ratio of 2.18 highlights significant feature splitting, where standard SAEs use excess vectors to represent lower-dimensional subspaces, wasting model capacity.

**Table 4.** Activation values and corresponding OpenWebText prompts for GPT-2 SASA Group 1473.

Theoretical Proofs I

Detailed proofs of the geometric lemmas and main theorems.

This appendix gathers the full proofs of every geometric lemma and the main theorems referenced in the body of the paper.

Lemma 14. For any $h\in V_i$ and any linear subspace $W\subset\mathbb{R}^d$, $\operatorname{dist}\bigl(h, P_{V_i}W\bigr)\le \operatorname{dist}(h,W)$.

Lemma 15. Let $W\subset\mathbb{R}^d$ be a $k$‑dimensional linear subspace with $1\le k<d$, and let $\beta\in(0,\pi/2)$. Define the angular tube $\operatorname{Tube}_{\beta}(W)=\{u\in S^{d-1}:\operatorname{dist}(u,W)\le \sin\beta\}$. Then $\sigma_{d-1}\bigl(\operatorname{Tube}_{\beta}(W)\bigr)\le C_T(d,k)\,\beta^{\,d-k}$, where $C_T(d,k)=\frac{\sigma_{k-1}(S^{k-1})\,\sigma_{d-k-1}(S^{d-k-1})}{d-k}$.

Lemma 16. For $d\ge2$, any $u_{0}\in S^{d-1}$ and any $\rho\in(0,\pi/2)$, there exists a constant $c_{\rho}>0$ (depending on $d$ and $\rho$) such that $\sigma_{d-1}\bigl(B_{\gamma}(u_{0},\rho)\bigr)\ge c_{\rho}$.

Proof of Theorem 4. The goal is to lower‑bound the number $N$ of unit decoder directions needed to $\varepsilon$‑cover the single‑feature slice $M_i(t)$ with $k$‑dimensional spans. By Lemma 14 we may project any competing subspace onto $V_i$ without increasing distance, so each admissible $k$‑subset $J\subset[N]$ defines a tube $ \operatorname{Tube}_{\beta}(W_{J})$ with $\beta=\arcsin(\varepsilon/t)$. Lemma 15 bounds each tube’s surface measure, while Lemma 16 guarantees the cap $B_{\gamma}(u_{0},\rho)\cap V_i$ has positive measure $c_{\rho}$. Combining these yields $c_{\rho}\le k\,N^{k}\,C_T(d,k)\,\beta^{\,d-k}$, which rearranges to the claimed lower bound on $N$.

Proof of Proposition 5. Under Hypothesis 1 the feature slice $h=V_i z_i+\delta$ splits into an in‑subspace component $V_i z_i$ and an out‑of‑subspace perturbation $\delta$. Lemma 14 lets us replace the full‑space spans in Theorem 4 by their projections $W_{J}=P_{V_i}\operatorname{span}\{d_{j}\}_{j\in J}$, so the only extra error comes from $P_{V_i}\delta$. Using the $\mu$‑coherence bound $\|V_i^{\top}V_j\|_{2}\le\mu$ and the activation bound $\|z_j\|_{2}\le t$, we obtain $\|P_{V_i}\delta\|_{2}\le\mu s t+\eta$. Adding this to the original $\varepsilon$ gives an effective covering error $\varepsilon'=\varepsilon+\mu s t+\eta$, and the same counting argument as in Theorem 4 applies with $\varepsilon'$.

Lemma 17 (Basis residual under full activation). For the basis decoder $D_{\text{basis}}=[v_{1},\dots,v_{d}]$, the $\ell_{1}$‑regularised reconstruction problem has the closed‑form soft‑thresholding solution $a_{k}^{*}= \operatorname{sign}(u_{k})\,\bigl(t|u_{k}|-\lambda\bigr)_{+}$. On the full‑activation set $A=\{u:|u_{k}|>\lambda/t\ \forall k\}$ every coefficient is shrunk by exactly $\lambda$, so the residual simplifies to $r(h)=\lambda\sum_{k=1}^{d}\operatorname{sign}(u_{k})\,v_{k}$ with norm $\|r(h)\|_{2}= \lambda\sqrt{d}$.
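The residual formula is easy to check numerically. The sketch below assumes $h = t\sum_k u_k v_k$ with an orthonormal basis, for which the $\ell_1$‑regularised least-squares solution is soft‑thresholding of the basis coordinates. The specific dimensions, $t$, $\lambda$, and $u$ are illustrative choices.

```python
import numpy as np

def soft_threshold(x, lam):
    """Closed-form minimiser of 0.5 * (x - a)**2 + lam * |a|, applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(0)
ambient, d, t, lam = 6, 4, 1.0, 0.1

# Orthonormal basis v_1..v_d as the columns of V (assumed setup).
V, _ = np.linalg.qr(rng.normal(size=(ambient, d)))

# A unit vector u on the full-activation set: |u_k| = 0.5 > lam / t for every k.
u = np.array([0.5, -0.5, 0.5, -0.5])
h = t * V @ u

a_star = soft_threshold(V.T @ h, lam)   # l1-regularised codes for the basis decoder
r = h - V @ a_star                      # reconstruction residual

print(np.linalg.norm(r), lam * np.sqrt(d))
```

The residual points along $\sum_k \operatorname{sign}(u_k)\,v_k$, a direction that mixes all basis vectors; this is the leftover that an extra, non-basis column can exploit.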

Proof of Theorem 6. Write the extra decoder column as $d=\sum_{k}c_{k}v_{k}$ with $\|c\|_{2}=1$ and at least two non‑zero entries. On the orthant $U_{c}$ where the signs of $u$ match those of $c$, the inner product $d^{\top}r(h)=\lambda\|c\|_{1}$ exceeds $\lambda$ because $\|c\|_{1}>\|c\|_{2}=1$. Hence the $\ell_{1}$ stationarity condition $|d^{\top}r|\le\lambda$ is violated, showing that the basis decoder cannot be optimal.

Proof of Proposition 7. Starting from $D=[D_{\text{basis}},d]$, keep the basis coefficients $a^{*}$ fixed and optimise the extra coordinate $a_{d+1}$. The resulting cost $L(a_{d+1})$ is strictly convex with minimiser $a_{d+1}^{*}= \lambda(\|c\|_{1}-1)>0$, yielding a lower loss $L_{\text{basis}}(h)-\frac{\lambda^{2}}{2}(\|c\|_{1}-1)^{2}$. Since the global optimum cannot be worse, the risk of the augmented decoder is strictly lower than that of the pure basis decoder.
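The two quantities used in the proofs of Theorem 6 and Proposition 7 can be verified on a small instance. The sketch takes the Lemma 17 residual as given, picks an extra unit column $d=\sum_k c_k v_k$ whose signs match those of $u$, and compares the closed-form minimiser $\lambda(\|c\|_1-1)$ with a brute-force grid search. The basis, sign pattern, and grid are assumptions.

```python
import numpy as np

d_feat, lam = 4, 0.1
V = np.eye(6)[:, :d_feat]                 # assumed orthonormal basis v_1..v_4 in R^6
signs = np.array([1.0, -1.0, 1.0, 1.0])   # sign pattern of u on the full-activation set

r = lam * V @ signs                       # residual from Lemma 17
c = signs / np.sqrt(d_feat)               # unit-norm coefficients with matching signs
d_new = V @ c                             # the extra decoder column
c_l1 = float(np.abs(c).sum())             # ||c||_1 = 2 > ||c||_2 = 1

# Theorem 6: the correlation with the residual exceeds lam, violating stationarity.
corr = float(d_new @ r)

# Proposition 7: with basis codes fixed, the extra coordinate a >= 0 minimises
# 0.5 * ||r - a * d_new||^2 + lam * a.
def extra_loss(a):
    return 0.5 * float(np.sum((r - a * d_new) ** 2)) + lam * abs(a)

a_closed = lam * (c_l1 - 1.0)
grid = np.linspace(0.0, 1.0, 10001)
resid = r[None, :] - grid[:, None] * d_new[None, :]
a_grid = float(grid[np.argmin(0.5 * (resid ** 2).sum(axis=1) + lam * grid)])
drop = extra_loss(0.0) - extra_loss(a_closed)

print(corr, a_closed, a_grid, drop)
```

Here `corr` equals $\lambda\|c\|_1 = 0.2$, twice the threshold $\lambda$, and activating the extra column lowers the per-sample loss by $\tfrac{\lambda^{2}}{2}(\|c\|_1-1)^2 = 0.005$.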

Proof of Theorem 8. Choose $d=\frac{1}{\sqrt{d}}\sum_{k}v_{k}$ and rotate an inactive column $d_{j}$ towards $d$ along the geodesic $d_{j}(\alpha)$. For angles $\alpha\le\alpha_{0}$ the new column remains inactive on the orthant $U_{+}$, so the risk stays flat. Once $\alpha>\alpha_{0}$ the activation condition is breached, the per‑sample loss drops by at least $\frac{\lambda^{2}}{2}(f(\alpha)-1)^{2}$, and integrating over $U_{+}$ (which has probability $2^{-d}$) yields a strictly positive risk reduction. Thus the risk curve is flat up to $\alpha_{0}$ and strictly decreasing thereafter.

Theoretical Proofs II

We prove Proposition 10 and Theorem 11, establishing the optimality of a single active subspace in SASA.

Appendix A.2 supplies the missing proofs for the SASA analysis. First we establish Proposition 10, which shows that a single active subspace can reconstruct any feature in its slice without error.

Let $V_i=[v_1,\dots,v_{d_i}]$ be an orthonormal basis of the subspace $V_i$, and form the block decoder $D_k^\star=[v_1,\dots,v_{d_i},0,\dots,0]\in\mathbb{R}^{d\times r}$ (the remaining $r-d_i$ columns are zero). Because $r\ge d_i$, $V_i\subseteq\operatorname{col}(D_k^\star)$, so every $h\in M_i(t)$ lies in $\operatorname{col}(D_k^\star)$ and the distance $\operatorname{dist}(h,\operatorname{col}(D_k^\star))$ is zero.
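The construction can be checked directly: pad an orthonormal basis of $V_i$ with zero columns and confirm that a point of the feature subspace is reconstructed exactly. The dimensions and the random point below are illustrative assumptions; a vector in $V_i$ with norm $t$ stands in for an element of $M_i(t)$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_i, r, t = 8, 3, 5, 2.0               # ambient dim, intrinsic dim, block width, radius

# Orthonormal basis of the feature subspace V_i (columns).
V, _ = np.linalg.qr(rng.normal(size=(d, d_i)))

# Block decoder D_k* = [v_1, ..., v_{d_i}, 0, ..., 0] with r - d_i zero columns.
D_star = np.hstack([V, np.zeros((d, r - d_i))])

# A point in V_i with norm t.
z = rng.normal(size=d_i)
z *= t / np.linalg.norm(z)
h = V @ z

# Best reconstruction of h from the block's column space.
code, *_ = np.linalg.lstsq(D_star, h, rcond=None)
dist = float(np.linalg.norm(h - D_star @ code))

print(D_star.shape, dist)
```

The distance is zero up to floating-point error for any $r \ge d_i$, which is exactly the capacity condition in the proposition.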

Questions & answers

What is the main contribution of the SASA paper?

SASA introduces a sparse autoencoder architecture that replaces each single-vector decoder latent with a learned low-dimensional subspace block and enforces sparsity at the group level via a nuclear-norm regularizer, allowing one block to capture an entire multi-dimensional semantic feature instead of fragmenting it across many redundant vectors.

What problem does SASA address, and why does it matter?

Standard SAEs assume every semantic feature is one-dimensional, but LLM activations naturally encode features as low-dimensional subspaces; this mismatch forces standard SAEs to split a single multi-dimensional concept across many near-collinear dictionary vectors, inflating redundancy, wasting parameters, and obscuring true structure. SASA corrects this geometric mismatch to improve monosemanticity and interpretability.

Why does the standard SAE objective actively prefer splitting features rather than representing them as subspaces?

The coordinate-wise ℓ₁ penalty creates a residual that no single basis vector can fully absorb, so the optimizer adds more vectors to cover the feature's geometry—a fragmented allocation that is a stable optimum of the standard objective. SASA's group-level spectral (nuclear-norm) penalty couples coordinates, making it mathematically cheaper to represent the feature as one coherent subspace.

How does SASA's architecture differ from a standard SAE?

Instead of a dictionary of individual decoder vectors each penalized by a scalar ℓ₁ term, SASA groups latents into blocks of width r, applies a nuclear-norm regularizer to the reconstruction map of each block, and uses a Top-s group gate so that only the s blocks with the largest ℓ₂ norm are active per input (s = 10 in the reported experiments). This allows a single block to span an entire multi-dimensional feature subspace.

What theoretical guarantees does the paper provide for SASA?

Proposition 10 proves that if a block's width r is at least the feature's intrinsic dimension dᵢ, a single block can reconstruct any point in the feature slice with zero error. Theorem 11 proves that under the SASA objective every global minimizer activates exactly one block whose decoder spans the top-dᵢ eigenspace of the feature covariance, with no spurious local minima in the resulting landscape.

How does SASA improve sample complexity compared to standard SAEs?

Standard SAEs require at least n* = N(ln N + log(1/δ)) samples to cover all N decoder directions (Proposition 12), and N itself grows exponentially in the feature dimension dᵢ. SASA's sample requirement scales only polynomially, as Õ(d²/ε²) (Theorem 13); in practice this corresponds to roughly half the token budget needed to reach the same reconstruction error.

What datasets and models were used in the experiments?

SASA was trained on residual-stream activations from GPT-2 Small (d=768, mid-layer hook blocks.7.hook_resid_pre) using OpenWebText with a 150M token budget, and from Mistral-7B-v0.1 (d=4096, mid-layer hook blocks.8.hook_resid_pre) using The Pile with a 500M token budget; LLM weights were kept frozen throughout.

What are the key architectural hyperparameters used to train SASA?

Hyperparameters are expressed as (K, r, s): for GPT-2, (K=2048, r=6, s=10) giving total width m=12,288 and sparsity ℓ₀=60; for Mistral-7B, (K=4096, r=8, s=10) giving m=32,768 and ℓ₀=80. The auxiliary regularization weight is λ_aux=1, and dead groups are identified using a running frequency threshold ν=10⁻⁴ over a 1,000-token window.
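The reported widths and sparsity levels are consistent with m = K·r and ℓ₀ = s·r. The sketch below encodes that arithmetic; the class and field names are hypothetical, and the two relations are inferred from the quoted numbers rather than stated as formulas in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SasaConfig:
    num_groups: int      # K: number of blocks
    block_width: int     # r: latents per block
    active_groups: int   # s: blocks kept by the gate per input

    @property
    def total_width(self) -> int:
        """Total number of latents, m = K * r (inferred relation)."""
        return self.num_groups * self.block_width

    @property
    def l0(self) -> int:
        """Active latents per input, l0 = s * r (inferred relation)."""
        return self.active_groups * self.block_width

gpt2 = SasaConfig(num_groups=2048, block_width=6, active_groups=10)
mistral = SasaConfig(num_groups=4096, block_width=8, active_groups=10)

print(gpt2.total_width, gpt2.l0, mistral.total_width, mistral.l0)
```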

What concrete efficiency gain does SASA provide in practice?

The paper reports that acquiring 1M tokens of Mistral-7B activations costs approximately 196.8 seconds, compared with 3.3 seconds for an SAE forward pass and 4.6 seconds for a backward pass, so collecting LLM activations is the expensive step. SASA's reduced sample complexity translates into requiring roughly half the training token budget of a standard SAE to reach the same reconstruction error.

What additional experiments does the paper report beyond the core evaluation?

The paper includes three additional analyses: (i) absorption of sparse probes on a first-letter benchmark, (ii) selective group screening for temporal and geographical concepts, and (iii) geometric inspection of the learned subspaces. Ablations also confirm that low-dimensional subspaces better model activations and that standard SAEs waste capacity via feature splitting.

What are the limitations or open questions acknowledged by the paper?

The paper does not explicitly enumerate its own limitations in a dedicated section. It acknowledges that two core challenges in mechanistic interpretability persist broadly: decomposing networks into components that are both meaningful and causally relevant, and rigorously validating mechanistic hypotheses to avoid interpretability illusions (citing Sharkey et al., 2025).

How does SASA relate to prior work on feature splitting and multi-dimensional representations?

Modell et al. (2025) conjectured that feature splitting is a symptom of inherently multi-dimensional features; Engels et al. (2025) and Bhalla et al. (2026) provided empirical and theoretical support showing standard SAEs fragment subspaces. SASA directly addresses this by replacing vector-based dictionary learning with subspace-based learning, aligning the tool with the geometry identified by these prior works.

How does SASA handle dead (inactive) groups during training?

Dead groups are identified by a running frequency estimate πₖ; any group with πₖ ≤ 10⁻⁴ over a 1,000-token window is considered dead. For each input, residual pre-activations are computed for dead groups and the s_aux groups with the largest squared norm are assigned as auxiliary activations to revive them.
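The selection step can be sketched as follows. This covers only the choice of auxiliary groups; how the running frequency πₖ is updated and how the residual pre-activations are computed are not specified here, so both are passed in as inputs. The function name, array layout, and example values are assumptions.

```python
import numpy as np

def select_aux_groups(freq, resid_pre, nu=1e-4, s_aux=2):
    """Pick the dead groups that receive auxiliary activations for one input.

    freq      : (K,) running activation frequency per group
    resid_pre : (K, r) residual pre-activations per group for this input
    nu        : dead threshold; a group is dead when freq <= nu
    s_aux     : number of dead groups to activate
    """
    dead = np.flatnonzero(freq <= nu)
    if dead.size == 0:
        return dead
    energy = (resid_pre[dead] ** 2).sum(axis=1)      # squared norm per dead group
    top = np.argsort(energy)[::-1][:s_aux]           # largest squared norms first
    return np.sort(dead[top])

# Hypothetical example with K = 5 groups of width r = 2.
freq = np.array([0.2, 0.0, 5e-5, 0.01, 0.0])
resid_pre = np.array([[9.0, 9.0],
                      [1.0, 0.0],
                      [0.0, 3.0],
                      [5.0, 5.0],
                      [2.0, 2.0]])
aux = select_aux_groups(freq, resid_pre, nu=1e-4, s_aux=2)
print(aux)
```

Groups 1, 2, and 4 are dead in this example; groups 0 and 3 have large residual responses but are skipped because they are still alive.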

What is the superposition hypothesis and how does it motivate SASA?

The superposition hypothesis holds that many sparse features are packed into a limited activation space, producing polysemantic neurons where a single unit responds to multiple unrelated concepts. This motivates dictionary-learning methods like SAEs to recover a more feature-aligned basis, and SASA extends this by learning subspace-aligned bases that match the actual multi-dimensional geometry of LLM representations.

Who are the authors of the SASA paper, and where was it published?

The paper does not state the authors' names or the publication venue in the provided text. It is available at arxiv.org/abs/2606.06333.

Key terms

Sparse Autoencoder (SAE): A neural network trained to reconstruct hidden-layer activations using a sparse combination of learned dictionary vectors, used in mechanistic interpretability to disentangle features encoded by language models.
SASA (Subspace-Aware Sparse Autoencoder): The method introduced in this paper, which replaces single-vector SAE latents with learned low-dimensional subspace blocks and enforces group-level sparsity via a nuclear-norm penalty.
decoder subspace: A low-dimensional linear subspace spanned by a block of decoder vectors, used in SASA to represent an entire multi-dimensional semantic feature as a single unit.
nuclear norm: The sum of the singular values of a matrix, used in SASA as a group-level regularizer that penalizes the effective rank of each block's reconstruction map rather than individual scalar coefficients.
feature splitting: The phenomenon where a standard SAE represents a single multi-dimensional semantic feature by allocating many near-collinear dictionary vectors instead of one coherent unit.
monosemanticity: The property of a latent unit responding to exactly one semantic concept, as opposed to polysemanticity where a unit responds to multiple unrelated concepts.
intrinsic dimensionality: The number of independent directions that actually carry signal in an activation space, which is typically much smaller than the full ambient dimension of the model's hidden states.
ℓ₁ regularization: A penalty on the sum of absolute values of code coefficients, applied coordinate-wise in standard SAEs to encourage sparse activations.
group-level sparsity: A sparsity constraint applied to entire blocks of latents simultaneously, so that a whole subspace block is either active or inactive rather than individual scalar coordinates.
Top-s gate: A selection mechanism in SASA that activates only the s blocks with the largest ℓ₂ norm for a given input, enforcing block-level sparsity; the Top-1 gate is the special case s = 1.
redundancy ratio: A metric comparing the number of decoder atoms allocated to a feature against the feature's intrinsic dimension; a ratio greater than 1 indicates wasted capacity from feature splitting.
superposition hypothesis: The conjecture that neural networks pack many more sparse features into their activations than there are neurons, causing individual neurons to respond to multiple unrelated concepts.
mechanistic interpretability: A research program that seeks concrete, causal explanations of neural network behavior by identifying the specific circuits and features responsible for model outputs.
residual stream: The sequence of hidden-state vectors in a transformer that accumulates information across layers via residual connections, used here as the activation source for training SAEs.
sample complexity: The number of training examples required for a learning algorithm to achieve a given level of accuracy, which SASA reduces from exponential to polynomial in the feature dimension.
covering bound: A geometric lower bound on the number of dictionary atoms needed to approximate every point in a manifold within a given error tolerance, used to quantify the cost of feature splitting.
matrix Bernstein bound: A probabilistic inequality bounding the spectral norm of a sum of independent random matrices, used in the paper to derive sample complexity guarantees for covariance estimation.
dead groups: Blocks in SASA that are almost never selected by the group gate during training, identified by a running activation frequency falling below a threshold, and revived via auxiliary activations.
