KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

Lorenz K. Muller, Philippe Bich, Chiara Boretti, Hyun-Min Chang, Jiawei Zhuang, Lukas Cavigelli

KVarN mitigates KV-cache quantization error accumulation in long-context reasoning by combining Hadamard rotation with dual-scaling variance normalization.

How can we prevent KV-cache quantization from causing catastrophic error accumulation in long-context reasoning tasks?

Large language models rely on KV-caches for efficient decoding, but quantizing these caches to low bit-widths causes errors to accumulate across timesteps, leading to degraded reasoning performance in long-horizon tasks. KVarN addresses this by applying a Hadamard rotation to redistribute channel-wise outliers, followed by an iterative dual-scaling variance normalization that fixes per-token magnitude errors. This approach establishes a new state-of-the-art for 2-bit KV-cache quantization, achieving near-lossless performance on complex reasoning benchmarks like MATH500 and AIME24 with negligible runtime overhead.

Paper Primer

Standard KV-cache quantization methods often treat the cache as a static block, failing to account for the compounding nature of errors during autoregressive decoding. KVarN identifies that these errors are not merely random noise but are driven by incorrect token magnitudes, which disproportionately degrade attention logits as the sequence length grows.

KVarN is a calibration-free quantizer: it rotates the K and V matrices in the channel dimension using a Hadamard transform to smooth outliers, then applies iterative variance normalization across both token and channel axes. This dual-scaling acts like a precision-preserving filter: it forces the variance of the token and channel dimensions to be uniform before quantization, effectively anchoring the magnitude of each token.

KVarN achieves state-of-the-art reasoning accuracy at 2-bit precision.

Evaluations on MATH500, AIME24, and HumanEval show KVarN consistently outperforms existing methods like KIVI and QuaRot while maintaining a 2.3-bit average per element. KVarN achieves 53.3% accuracy on AIME24/MATH500 compared to 55.5% for FP16, outperforming KIVI and other 2-bit baselines.

Why does KVarN use dual-scaling instead of just standard per-token scaling?

Standard scaling only addresses one dimension, which often increases kurtosis in the other. Dual-scaling iteratively normalizes both token and channel dimensions, which fixes tail-errors caused by incorrect token magnitudes without requiring calibration data.

Is the online normalization step computationally expensive?

No. The normalization adds only 1.9 ms of latency per 128 tokens, representing a 0.18% overhead compared to standard FP16 generation, making it highly efficient for real-time inference.

For researchers building long-context LLM systems, KVarN demonstrates that managing token-magnitude stability is more critical for reasoning performance than minimizing raw mean-squared error during quantization.

The KV-Cache Quantization Bottleneck

Test‑time scaling stresses KV‑cache memory, and quantization errors grow from mis‑scaled tokens.

Test‑time scaling lets large language models reason over ever‑longer generated sequences, but the KV‑cache grows linearly with length, quickly exhausting GPU memory.

Compressing the key and value matrices to a few bits saves memory, but the compression must preserve each token’s contribution to attention.
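To make the memory pressure concrete, here is a minimal sketch of the standard KV-cache size estimate. The model shape in `cfg` is hypothetical and chosen only for illustration; the 2.3 bits per element is the average quoted in this summary.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bits_per_element):
    """Size in bytes of the KV-cache for one sequence (keys plus values)."""
    # Factor 2 accounts for storing both K and V.
    n_elements = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return n_elements * bits_per_element / 8

# Hypothetical model shape, not taken from the paper.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for seq_len in (4_096, 32_768):
    fp16 = kv_cache_bytes(seq_len, bits_per_element=16, **cfg)
    low = kv_cache_bytes(seq_len, bits_per_element=2.3, **cfg)
    print(f"{seq_len:>6} tokens: fp16 {fp16 / 2**20:8.1f} MiB, "
          f"2.3-bit {low / 2**20:8.1f} MiB")
```

The size is linear in `seq_len`, so the saving from 16 bits to 2.3 bits is a constant factor of about 7 at every context length.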

During autoregressive generation, newly produced keys and values are quantized before being written to the cache, and every later token attends to that quantized cache, so any scale error compounds step by step.

**Figure 1.** (a) What fraction of the top k% largest errors is due to magnitude rather than direction (decomposed as in Eq. 3). Large quantization errors in K are mostly due to incorrect token scaling. To fix outlier errors, the token magnitude needs to be better preserved. (b) Difference of per-token magnitude of quantized K to full-precision K matrix on Qwen3-4B under different 2-bit quantization methods (KIVI [19], HK: Hadamard rotated K, VarN(K): Variance-Normalized K, and KVarN: our proposed method). Variance normalization prevents the rounding process from scaling the norm of worst-case tokens; the effect synergizes with scale-error reduction by Hadamard rotation. KVarN effectively suppresses magnitude errors, which leads to better end-to-end test-time scaling.

Quantization errors in KV‑cache are not uniform; they are driven by scaling outliers.
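Eq. 3 of the paper is not reproduced in this summary, so the sketch below uses one standard identity as a stand-in: for a token $k$ and its quantized version $\hat{k}$, $\|\hat{k}-k\|^2 = (\|\hat{k}\|-\|k\|)^2 + 2(\|\hat{k}\|\|k\| - \hat{k}\cdot k)$, where the first term depends only on magnitude and the second only on the angle between the vectors. The function names and the synthetic data are assumptions for illustration and may differ from the paper's decomposition.

```python
import numpy as np

def split_error(K, K_hat):
    """Split per-token squared error into a magnitude and a direction term."""
    n = np.linalg.norm(K, axis=1)
    n_hat = np.linalg.norm(K_hat, axis=1)
    dot = np.sum(K * K_hat, axis=1)
    magnitude = (n_hat - n) ** 2        # error from a wrong token norm
    direction = 2 * (n * n_hat - dot)   # error from a wrong token direction
    return magnitude, direction

def magnitude_share_of_top(K, K_hat, top_percent=5.0):
    """Fraction of the error in the worst top_percent of tokens due to magnitude."""
    mag, direc = split_error(K, K_hat)
    total = mag + direc
    k = max(1, int(len(total) * top_percent / 100))
    worst = np.argsort(total)[-k:]
    return mag[worst].sum() / total[worst].sum()

# Synthetic stand-in for a key matrix and its quantized reconstruction.
rng = np.random.default_rng(0)
K = rng.normal(size=(256, 64))
K_hat = K + 0.3 * rng.normal(size=K.shape)
mag, direc = split_error(K, K_hat)
share = magnitude_share_of_top(K, K_hat)
print(f"magnitude share of the worst 5% of tokens: {share:.2f}")
```

On real caches, `K_hat` would be the dequantized keys; the synthetic noise here only exercises the bookkeeping and says nothing about the paper's measured fractions.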

Modeling Error Accumulation

Defines the KIVI baseline and the pseudo‑decode evaluation setting.

This section introduces the KIVI baseline quantization scheme and the pseudo‑decode evaluation regime that the paper uses to expose long‑horizon error accumulation.

KIVI quantizes the value matrix $V$ per token and the key matrix $K$ per channel, storing a low‑precision matrix plus a per‑channel offset and scale vector.

How does KIVI differ from a naïve per‑tensor quantization of the KV‑cache?

Naïve per‑tensor quantization applies a single scale and zero‑point to the entire $K$ matrix, which cannot correct channel‑wise magnitude drift. KIVI introduces a distinct offset and scale for each channel, allowing the quantizer to match the variance of each channel individually and thus avoid large scaling errors that would otherwise accumulate.
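The difference between per-tensor and per-channel scaling can be sketched with a plain asymmetric round-to-nearest quantizer. This is a simplified illustration, not KIVI's actual implementation: it omits grouping and any residual full-precision window, and the matrix with one outlier channel is synthetic.

```python
import numpy as np

def rtn_quantize(x, axis, bits=2):
    """Asymmetric round-to-nearest; statistics are shared along `axis`.

    Returns the dequantized array so errors can be compared directly.
    """
    levels = 2**bits - 1
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q * scale + lo

def mse(a, b):
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
K = rng.normal(size=(128, 64))   # rows are tokens, columns are channels
K[:, 3] *= 20.0                  # one outlier channel

per_tensor = rtn_quantize(K, axis=None)  # one scale and offset for everything
per_channel = rtn_quantize(K, axis=0)    # one scale and offset per channel

mse_tensor = mse(K, per_tensor)
mse_channel = mse(K, per_channel)
print(f"per-tensor MSE:  {mse_tensor:.3f}")
print(f"per-channel MSE: {mse_channel:.3f}")
```

With a single shared scale, the outlier channel stretches the quantization grid for every other channel; per-channel statistics confine that damage to the outlier channel itself. Per-token quantization of $V$ would use `axis=1` instead.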

The evaluation simulates decoding by processing the sequence in fixed‑size blocks, quantizing the KV‑cache after each block, and feeding the quantized cache back for the next block.

Why not evaluate quantization error in a single parallel prefill pass instead of the block-wise pseudo-decode?

In a single parallel prefill pass, every key and value is computed from full-precision activations and the cache is quantized only once, so no token ever attends to an already-quantized cache and the cumulative effect of repeated quantization is hidden. The block-wise pseudo-decode quantizes the freshly produced K and V after every block and lets subsequent blocks read the quantized cache, exposing how errors grow over time, which is the phenomenon the paper's central claim addresses.
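The pseudo-decode idea can be illustrated with a toy loop. Everything here is an assumption for illustration: a single attention head with random weights stands in for an LLM, the quantizer is plain per-token round-to-nearest, and `decode` is a hypothetical helper rather than the paper's evaluation code.

```python
import numpy as np

def rtn(x, bits=2):
    """Per-token asymmetric round-to-nearest; returns dequantized values."""
    levels = 2**bits - 1
    lo = x.min(axis=1, keepdims=True)
    scale = np.maximum(x.max(axis=1, keepdims=True) - lo, 1e-8) / levels
    return np.clip(np.round((x - lo) / scale), 0, levels) * scale + lo

def attend(q, K, V):
    """Single-head softmax attention for one query vector."""
    logits = K @ q / np.sqrt(q.size)
    w = np.exp(logits - logits.max())
    return (w / w.sum()) @ V

def decode(n_tokens, block, quantize, seed=0):
    """Toy autoregressive loop: each hidden state depends on the cache so far."""
    rng = np.random.default_rng(seed)
    d = 16
    Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    h = rng.normal(size=d)
    K, V = np.empty((0, d)), np.empty((0, d))
    start = 0  # first cache row that is still in full precision
    states = []
    for t in range(n_tokens):
        K = np.vstack([K, h @ Wk])
        V = np.vstack([V, h @ Wv])
        h = np.tanh(attend(h @ Wq, K, V)) + 0.1 * rng.normal(size=d)
        states.append(h)
        if quantize and (t + 1) % block == 0:
            # Quantize the finished block before later tokens read it.
            K[start:] = rtn(K[start:])
            V[start:] = rtn(V[start:])
            start = t + 1
    return np.array(states)

block = 32
ref = decode(256, block, quantize=False)
quant = decode(256, block, quantize=True)
drift = np.linalg.norm(ref - quant, axis=1)
for b in range(0, 256, 64):
    print(f"tokens {b:3d}-{b + 63:3d}: mean drift {drift[b:b + 64].mean():.4f}")
```

The first block matches the full-precision run exactly, because nothing has been quantized yet; drift appears only once later tokens start reading quantized blocks, which is the regime a prefill-only evaluation never enters.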

The key insight is that the largest per‑token magnitude errors dominate end‑to‑end degradation; KVarN combats these errors by combining incoherence processing with dual‑scaling, thereby limiting error accumulation across blocks.

**Figure 4.** Prior KV-Cache quantization papers address the parallel prefill scenario (red dashed arrows). We propose a ‘pseudo-decode’ setting (green solid arrows) to better model decoding errors. We split the sequence into blocks of size b. After every block, the freshly produced K, V are quantized before being written back to the KV-cache. Subsequent blocks access a quantized cache, so quantization error accumulates over time. LLMs operate this way during decoding of long sequences (e.g., in test-time scaling, such as reasoning tasks). KVarN is designed to operate in this regime.

The KVarN Mechanism

Stabilizing KV‑cache scales with rotation and variance‑normalization eliminates error buildup.

Quantizing the KV-cache introduces token-scale errors that compound across decoding steps, causing the attention logits to drift as generation proceeds.

KVarN first scrambles each token’s channel vector with a Hadamard transform, then rescales every row and column to unit variance before rounding to the nearest integer.

Apply a Hadamard matrix $H$ to each token's channel vector (for example, $4\times4$ when there are four channels): with tokens as the rows of $X$, $XH$ yields a scrambled matrix with entries of similar magnitude.

Compute column means and standard deviations, divide each column by its std → column‑wise unit variance.

Compute row means and standard deviations on the column‑normalized matrix, divide each row by its std → row‑wise unit variance.

Round every entry to the nearest integer (RTN) and store the two scale vectors (row‑scale $s_r$, column‑scale $s_c$).

During dequantization, multiply the integer matrix by $s_r$ and $s_c$ to recover a near‑original cache.

The two-stage normalization removes both token-wise and channel-wise magnitude drift, which would otherwise accumulate as later tokens attend to the quantized cache.

How does KVarN differ from the KIVI baseline that also scales the cache?

KIVI scales each matrix along a single axis (per channel for $K$, per token for $V$), leaving variance along the other axis untouched; KVarN adds an orthogonal rotation and a second scale along the remaining axis, equalizing variance in both dimensions and thus suppressing the magnitude-driven error that KIVI cannot fix.

The Hadamard transform is an orthogonal “shuffle” that mixes each token’s channel values without changing their overall energy.

Why not rotate the token dimension as well?

Rotating tokens would need to be undone for each new token position, adding $O(bc\log b)$ operations per step; the channel-only rotation costs only $O(c\log c)$ per token with a fast Hadamard transform, a negligible overhead.
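The $O(c\log c)$ cost comes from the fast Walsh-Hadamard transform, which replaces the dense matrix product with $\log_2 c$ butterfly passes. The sketch below is a generic NumPy version under the assumption that $c$ is a power of two; it is not the paper's kernel.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis.

    The last dimension must be a power of two. Cost is O(c log c) per vector.
    """
    x = np.array(x, dtype=np.float64)
    c = x.shape[-1]
    assert c > 0 and c & (c - 1) == 0, "length must be a power of two"
    h = 1
    while h < c:
        # Butterfly pass: combine neighbouring sub-blocks of length h.
        y = x.reshape(*x.shape[:-1], c // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        x = np.stack([a + b, a - b], axis=-2).reshape(x.shape)
        h *= 2
    return x / np.sqrt(c)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))   # 5 tokens, 8 channels
Y = fwht(X)                   # rotate each token in the channel dimension

# A token with a single outlier channel is spread evenly over all channels.
spike = np.zeros(8)
spike[3] = 8.0
print(np.abs(fwht(spike)))
```

Because the normalized transform is orthogonal and symmetric, each token keeps its norm and applying `fwht` a second time recovers the input, so the rotation itself introduces no error.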

Variance‑normalization rescales each row and each column so that their statistical spread (standard deviation) becomes one, eliminating systematic magnitude differences.

Why normalize both rows and columns instead of just one?

Normalizing only rows would leave channel‑wise scale disparities, and normalizing only columns would leave token‑wise disparities; both axes contribute to the magnitude error that drives accumulation, so we must equalize both.
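A short sketch shows why the normalization is iterated: rescaling rows disturbs the column statistics slightly, so the two steps are alternated until both settle. The function `dual_scale`, the fixed iteration count, and the scale-only treatment (no mean subtraction) are assumptions made for illustration; the paper's exact update rule is not given in this summary.

```python
import numpy as np

def dual_scale(X, n_iter=8, eps=1e-8):
    """Alternately rescale columns and rows towards unit standard deviation.

    Returns X_norm, s_r, s_c with X == s_r[:, None] * X_norm * s_c[None, :].
    """
    X_norm = np.array(X, dtype=np.float64)
    s_r = np.ones(X_norm.shape[0])
    s_c = np.ones(X_norm.shape[1])
    for _ in range(n_iter):
        col_std = X_norm.std(axis=0) + eps
        X_norm /= col_std
        s_c *= col_std
        row_std = X_norm.std(axis=1) + eps
        X_norm /= row_std[:, None]
        s_r *= row_std
    return X_norm, s_r, s_c

# Synthetic block with uneven token (row) and channel (column) magnitudes.
rng = np.random.default_rng(0)
rows = rng.lognormal(sigma=1.0, size=128)
cols = rng.lognormal(sigma=1.0, size=64)
X = rows[:, None] * cols[None, :] * rng.normal(size=(128, 64))

X_norm, s_r, s_c = dual_scale(X)
row_only = X / X.std(axis=1, keepdims=True)

print("row-only, spread of column std:", np.ptp(row_only.std(axis=0)).round(3))
print("dual,     spread of column std:", np.ptp(X_norm.std(axis=0)).round(3))
print("dual,     spread of row std:   ", np.ptp(X_norm.std(axis=1)).round(3))
```

The accumulated scales `s_r` and `s_c` reproduce the original matrix exactly, so the normalization itself is lossless; only the subsequent rounding loses information.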

Collect the $b\times c$ cache matrix $X$ for the current block.

Apply the Hadamard matrix $H$ to each token's channel vector: $X\leftarrow XH$.

Iteratively compute column‑wise $\operatorname{VarN}$, updating $X$.

Iteratively compute row‑wise $\operatorname{VarN}$ on the result.

Quantize the normalized matrix with round‑to‑nearest (RTN), storing the two scale vectors.

During decoding, dequantize by multiplying the integer matrix with the stored scales.
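The steps above can be sketched end to end as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the rotation uses a dense Sylvester-Hadamard matrix, the rounding is per-token asymmetric RTN, and the row scale is folded into the RTN scale and zero-point so that only one extra per-channel scale is stored. All function names are hypothetical.

```python
import numpy as np

def hadamard(c):
    """Orthonormal Sylvester-Hadamard matrix; c must be a power of two."""
    assert c > 0 and c & (c - 1) == 0, "c must be a power of two"
    H = np.ones((1, 1))
    while H.shape[0] < c:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(c)

def var_norm(X, n_iter=4, eps=1e-8):
    """Alternating column/row scaling towards unit standard deviation."""
    X = X.copy()
    s_r, s_c = np.ones(X.shape[0]), np.ones(X.shape[1])
    for _ in range(n_iter):
        col_std = X.std(axis=0) + eps
        X /= col_std
        s_c *= col_std
        row_std = X.std(axis=1) + eps
        X /= row_std[:, None]
        s_r *= row_std
    return X, s_r, s_c

def quantize_block(X, bits=2):
    """Rotate, normalize, and round one b x c block."""
    Xn, s_r, s_c = var_norm(X @ hadamard(X.shape[1]))
    levels = 2**bits - 1
    zero = Xn.min(axis=1, keepdims=True)
    scale = np.maximum(Xn.max(axis=1, keepdims=True) - zero, 1e-8) / levels
    q = np.clip(np.round((Xn - zero) / scale), 0, levels).astype(np.uint8)
    # Fold the row scale into the per-token RTN parameters (an assumption).
    return q, scale * s_r[:, None], zero * s_r[:, None], s_c

def dequantize_block(q, scale_tok, zero_tok, s_c):
    """Undo rounding scales, then undo the rotation (H is orthogonal)."""
    X_rot = (q * scale_tok + zero_tok) * s_c[None, :]
    return X_rot @ hadamard(q.shape[1]).T

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 64))
X[:, 5] *= 20.0  # synthetic outlier channel

def rel_error(bits):
    X_hat = dequantize_block(*quantize_block(X, bits))
    return np.linalg.norm(X_hat - X) / np.linalg.norm(X)

rel2, rel8 = rel_error(2), rel_error(8)
print(f"relative error at 2 bits: {rel2:.3f}, at 8 bits: {rel8:.4f}")
```

Running the same functions at 8 bits is a convenient sanity check: the reconstruction should be nearly exact, confirming that the rotation and the two scales are inverted correctly and that the 2-bit error comes from rounding alone.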

**Figure 2.** Schematic layout of KVarN. Every token is Hadamard-rotated in the channel dimension. After one block of tokens (e.g., 128) of generation, each block is variance-normalized in token and channel dimension, denoted by VarN(·). Finally, it is quantized with round-to-nearest (RTN). We store the standard scale and zero-point of the RTN, plus a second scale. At 2.3 average bits per element even with the second scale, KVarN outperforms or matches prior methods, see e.g. Tab. 3.

To verify that KVarN indeed curtails error buildup, we run the pseudo‑decode protocol: after every $b$ tokens we quantize the cache and continue decoding, measuring attention‑output reconstruction error over long horizons.

Empirical Performance and Analysis

KVarN delivers top reasoning accuracy while cutting KV‑cache bits and incurring negligible overhead.

The core premise is that KV‑cache errors stem mainly from mis‑scaled tokens; KVarN stabilizes those scales via variance‑normalization, preventing error buildup in long‑horizon decoding.

KVarN adds only 0.18% runtime overhead relative to end-to-end generation.

Variance‑normalization costs 1.9 ms while full 128‑token generation costs 1050 ms on a 500 TFLOP fp16 GPU.
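As a quick check, the quoted overhead follows directly from these two timings:

$$\frac{1.9\ \text{ms}}{1050\ \text{ms}} \approx 0.0018 = 0.18\%$$

Spread over the 128 tokens of a block, the normalization costs roughly $1.9 / 128 \approx 0.015$ ms per generated token.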

**Figure 3.** Replacing the 5% worst outlier errors with high precision values improves end-to-end KL-divergence more than fixing the other 95%, even though more MSE lies there (see Fig. 9).

**Figure 6.** Speed measurement on GPU in the fast vLLM framework. The variance-normalization causes a very minor overhead.

**Figure 8.** Joint distribution of $K$ magnitude and quantized $K$ magnitude using different quantization methods. KVarN tightly controls token scales, while the baseline KIVI and ablations of our method have substantial off-diagonals.

KVarN maintains reasoning accuracy on AIME24 and MATH500 while dramatically curbing outlier‑driven degradation.

Robustness and Ablations

Ablation studies isolate each design choice and quantify its impact.

We remove individual components of KVarN and measure the resulting degradation on the same benchmarks used in the main paper. Each ablation isolates a single design choice—Hadamard rotation, variance‑normalization, or dual‑scale dequantization—so we can attribute performance changes directly to that choice.

KVarN adds only a 0.18% quantization-latency overhead compared with the KIVI baseline.

Measured on Qwen3-4B across context lengths from 4k to 32k tokens; wall-clock timing of the dequantization kernel shows the gap never exceeds 0.18%.

**Figure 7.** Arrangement of Hadamard transforms in an attention layer in our method.

**Figure 9.** Plot to complement Fig. 3. It shows how much MSE remains if we replace the top k% largest errors with the high-precision value. Note that the top 5% of entries contribute less to MSE than the bottom 95%, but their end-to-end KL-divergence impact is greater than that of the bottom 95%.

**Figure 10.** Needle-in-a-Haystack on Qwen3-4B: KIVI vs. KVarN under Static and Accumulated prefill. Cells are colored and hatched by retrieval outcome (both keywords / one keyword / neither).

**Figure 11.** Median dequantization time per call for KIVI (single scale) vs. KVarN (dual scale, $s_2$ fused into the kernel) across context lengths. Lower is better.
