DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
DeepSeek-V4 introduces hybrid attention and manifold-constrained connections to enable efficient 1M-token context.
How does DeepSeek-V4 achieve million-token context efficiency through architectural compression and fine-grained communication-computation overlap?
Standard attention mechanisms scale quadratically with sequence length, creating a prohibitive computational bottleneck for ultra-long contexts and complex reasoning tasks. The authors introduce a hybrid attention architecture that interleaves Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to drastically reduce KV cache size and FLOPs, alongside Manifold-Constrained Hyper-Connections (mHC) to stabilize deep signal propagation. In a 1-million-token context, DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache size of its predecessor, DeepSeek-V3.2.
Paper Primer
The core architectural move is the hybrid attention mechanism: it compresses the Key-Value (KV) cache by consolidating blocks of tokens into single entries, then uses a lightning indexer to perform sparse attention on these compressed representations. This is analogous to a library index: instead of reading every page of every book, the model consults a summarized catalog to identify only the relevant sections before performing detailed attention.
To maintain training stability at scale, the authors replace standard residual connections with Manifold-Constrained Hyper-Connections (mHC). This constrains the residual mapping to the manifold of doubly stochastic matrices, ensuring the transformation is non-expansive and preventing signal explosion during deep stacking.
DeepSeek-V4-Pro significantly improves long-context efficiency over DeepSeek-V3.2.
In a 1M-token context, the model requires only 27% of the single-token inference FLOPs and 10% of the KV cache size. That corresponds to a roughly 3.7x reduction in compute and a 10x reduction in memory footprint.
The smaller DeepSeek-V4-Flash model achieves even higher relative efficiency.
At 1M-token context, it requires only 10% of the single-token FLOPs and 7% of the KV cache size compared to DeepSeek-V3.2. That corresponds to a 10x reduction in compute and a roughly 14x reduction in memory footprint.
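The reduction factors quoted for both models are simply the reciprocals of the reported ratios. A quick check, where the dictionary layout and the helper name `reduction_factor` are illustrative only:

```python
# Ratios relative to DeepSeek-V3.2 at 1M-token context, as quoted in the text.
reported = {
    "DeepSeek-V4-Pro": {"flops": 0.27, "kv_cache": 0.10},
    "DeepSeek-V4-Flash": {"flops": 0.10, "kv_cache": 0.07},
}


def reduction_factor(ratio: float) -> float:
    """A model needing `ratio` of the baseline cost is 1/ratio times cheaper."""
    return 1.0 / ratio


for model, ratios in reported.items():
    flops_x = reduction_factor(ratios["flops"])
    kv_x = reduction_factor(ratios["kv_cache"])
    print(f"{model}: {flops_x:.1f}x fewer FLOPs, {kv_x:.1f}x smaller KV cache")
```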
Why is the hybrid attention approach (CSA + HCA) necessary instead of just using one compression method?
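The two mechanisms make different trade-offs. CSA compresses moderately and keeps a top-k selector, so each query still attends to a small set of relevant, relatively fine-grained entries. HCA compresses far more aggressively and attends to all of its compressed entries, which keeps a cheap global view at the 1M-token scale. Interleaving them lets the model maintain performance while drastically reducing the memory overhead of the KV cache.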
What is the primary role of the Muon optimizer in this architecture?
Muon is used for the majority of modules to provide faster convergence and improved training stability compared to standard AdamW, particularly when handling the complex dynamics of the new mHC and hybrid attention layers.
DeepSeek-V4: Efficiency at Scale
We explain why vanilla attention makes million‑token contexts impractical and outline the motivation behind DeepSeek‑V4.
Standard self‑attention scales quadratically with sequence length, making million‑token contexts prohibitively expensive. This computational bottleneck limits test‑time scaling and prevents practical use on long‑horizon tasks such as massive document analysis or multi‑step agentic workflows. DeepSeek‑V4 addresses the gap by replacing the vanilla attention stack with a hybrid of Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) layers, and by overlapping Expert Parallelism (EP) communication with computation.
The shift from standard Transformer scaling to compressed context intelligence unlocks efficient million‑token reasoning.
Architectural Innovations
DeepSeek‑V4 replaces quadratic attention with compressed, sparse mechanisms to enable million‑token contexts.
Long‑range language modeling is crippled by the quadratic memory and compute of vanilla self‑attention; the KV cache alone explodes when the context reaches millions of tokens.
Standard self‑attention builds a full $n\times n$ similarity matrix, so both compute and memory grow with the square of the sequence length.
mHC widens the residual stream like a highway, but forces the traffic‑flow matrix to be doubly stochastic, guaranteeing that no single lane can amplify signals beyond unity.
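To make the doubly stochastic constraint concrete, here is a minimal sketch of one common way to obtain such a matrix, Sinkhorn normalisation. Whether this matches the paper's exact projection is an assumption, and `n_lanes`, `sinkhorn`, and the random inputs are hypothetical names and values chosen for illustration.

```python
import numpy as np


def sinkhorn(logits: np.ndarray, n_iters: int = 50) -> np.ndarray:
    """Alternately normalise rows and columns of exp(logits) so both sum to ~1."""
    mat = np.exp(logits - logits.max())  # strictly positive entries
    for _ in range(n_iters):
        mat = mat / mat.sum(axis=1, keepdims=True)  # rows sum to 1
        mat = mat / mat.sum(axis=0, keepdims=True)  # columns sum to 1
    return mat


rng = np.random.default_rng(0)
n_lanes = 4  # hypothetical width of the widened residual stream
mixing = sinkhorn(rng.normal(size=(n_lanes, n_lanes)))

# A doubly stochastic matrix has spectral norm 1, so mixing the residual
# lanes with it cannot increase the overall norm of the signal.
x = rng.normal(size=(n_lanes, 8))  # n_lanes residual copies, 8 features each
print("column sums:", mixing.sum(axis=0))
print("row sums:   ", mixing.sum(axis=1))
print("non-expansive:", np.linalg.norm(mixing @ x) <= np.linalg.norm(x) + 1e-9)
```

Stacking many such mixing steps multiplies matrices whose norms never exceed one, which is the intuition behind the claim that signals cannot be amplified beyond unity.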
CSA first shrinks every $m$ KV tokens into a single compressed entry (like summarizing a paragraph), then lets each query attend only to the $k$ most promising summaries.
Take 8 tokens with hidden size $d=4$ and compression factor $m=2$. Project the 8 hidden vectors into $C^{a},C^{b}\in\mathbb{R}^{8\times 4}$ and compute weights $Z^{a},Z^{b}$.
Group tokens (1‑2), (3‑4), (5‑6), (7‑8); for each group apply the weighted‑sum‑plus‑bias formula (Eq 11) to obtain 4 compressed entries $C_{\text{Comp}}$.
Run the Lightning Indexer on each query token; suppose the top‑k selector keeps the two highest‑scoring compressed entries.
Concatenate those two compressed entries with a sliding‑window of the last $n_{\text{win}}=2$ original KV tokens.
Perform Multi‑Query Attention over the resulting set (4 vectors total) and project back to $d=4$.
Compressing $8\rightarrow4$ entries cuts the KV cache size by 50 % while the top‑k step ensures each query still sees only the most relevant context.
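The toy walk-through above can be sketched in NumPy. This is a simplified illustration under several assumptions, not the paper's implementation: it uses a single projection pair (`W_c`, `W_z`) instead of the two sets $C^{a},C^{b}$, adds the bias to the compression logits, uses a plain dot product as a stand-in for the Lightning Indexer, and omits the final output projection. All weight names and random values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d, m, top_k, n_win = 8, 4, 2, 2, 2


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


hidden = rng.normal(size=(n_tokens, d))
W_c = rng.normal(size=(d, d))  # hypothetical projection producing entries C
W_z = rng.normal(size=(d, d))  # hypothetical projection producing weights Z
bias = np.zeros((m, d))        # hypothetical per-position bias (zero here)

C = hidden @ W_c               # (8, 4) uncompressed KV entries
Z = hidden @ W_z               # (8, 4) compression logits

# Group every m tokens and take a softmax-weighted sum inside each group.
C_groups = C.reshape(n_tokens // m, m, d)
Z_groups = Z.reshape(n_tokens // m, m, d) + bias
compressed = (softmax(Z_groups, axis=1) * C_groups).sum(axis=1)  # (4, 4)

# Stand-in indexer: score each compressed entry against the query, keep top-k.
query = rng.normal(size=d)
scores = compressed @ query
selected = np.sort(np.argsort(scores)[-top_k:])

# Shared key-value attention over the selected entries plus the sliding window.
kv = np.concatenate([compressed[selected], C[-n_win:]], axis=0)  # (4, 4)
weights = softmax(kv @ query / np.sqrt(d))
output = weights @ kv

print(compressed.shape, kv.shape, output.shape)
```

The cache holds 4 compressed entries instead of 8 tokens, and the query attends to only 4 vectors: 2 selected summaries plus 2 recent tokens.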
How does CSA differ from ordinary sparse attention that simply masks out distant tokens?
Ordinary sparse attention still stores every key/value vector and only reduces the number of dot‑product operations. CSA first collapses $m$ consecutive KV vectors into a single compressed representation, permanently shrinking the cache before any sparsity is applied. The subsequent sparse step operates on far fewer entries, yielding both memory and compute savings.
HCA pushes compression to the extreme: it merges $m'\!\gg\!m$ tokens into one entry (like turning an entire chapter into a headline) and then attends directly to those headlines without any additional sparsity.
Take 12 tokens with hidden size $d=4$ and compression factor $m'=4$. Project the 12 hidden vectors to $C\in\mathbb{R}^{12\times4}$ and compute weights $Z$.
For group 1 (tokens 1‑4) compute $C_{\text{Comp},0}$ by softmax‑weighting the four vectors and adding bias $B$; repeat for groups 2 and 3.
Obtain three heavily compressed entries; concatenate with a sliding‑window of the last $n_{\text{win}}=2$ original KV tokens.
Run Shared Key‑Value MQA over the five vectors and project back to $d=4$.
Even with $m'=4$, the KV cache shrinks from 12 entries to 3, a 75 % reduction, while the sliding window preserves the most recent local information.
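The same toy setting can be sketched for HCA. As before this is only an illustration under stated assumptions: the projections are random and hypothetical, the bias is taken to be zero, and the output projection is omitted. The key difference from CSA is that no top-k step appears.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d, m_prime, n_win = 12, 4, 4, 2


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


hidden = rng.normal(size=(n_tokens, d))
C = hidden @ rng.normal(size=(d, d))  # hypothetical projection to KV entries
Z = hidden @ rng.normal(size=(d, d))  # hypothetical compression logits

# Merge every m' tokens into one entry with a softmax-weighted sum.
n_groups = n_tokens // m_prime
group_weights = softmax(Z.reshape(n_groups, m_prime, d), axis=1)
compressed = (group_weights * C.reshape(n_groups, m_prime, d)).sum(axis=1)  # (3, 4)

# No selector: attend to every compressed entry plus the sliding window.
kv = np.concatenate([compressed, C[-n_win:]], axis=0)  # (5, 4)
query = rng.normal(size=d)
attn = softmax(kv @ query / np.sqrt(d))
output = attn @ kv

cache_reduction = 1 - compressed.shape[0] / n_tokens
print(kv.shape, f"compressed cache is {cache_reduction:.0%} smaller")
```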
Why not use CSA with a much larger $m$ instead of introducing HCA?
CSA relies on a top‑k selector that needs a reasonably sized pool of compressed entries to rank; if $m$ becomes too large the pool becomes too coarse, degrading the quality of the selected context. HCA removes the selector entirely, accepting a coarser representation but gaining a far greater memory reduction, which is essential for the extreme‑length regimes where the selector would otherwise be ineffective.
Compress the KV cache with factor $m$ (CSA) or $m'$ (HCA) depending on the block type.
For CSA blocks, run the Lightning Indexer to score compressed entries.
Apply the top‑k selector (CSA only) to keep the most relevant compressed entries.
Concatenate the selected compressed entries with a sliding‑window of the most recent uncompressed KV tokens.
Feed the combined set into Shared Key‑Value Multi‑Query Attention.
Project grouped head outputs back to the model dimension.
Optionally add an Attention Sink logit to each head to regulate total attention mass.
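The last step in the list, the attention sink, is easy to illustrate in isolation. A common formulation adds one extra logit to the softmax denominator and discards its share, so the weights over real entries can sum to less than one. Treating the paper's sink this way is an assumption, and `softmax_with_sink` is a hypothetical helper name.

```python
import math


def softmax_with_sink(scores, sink_logit):
    """Softmax over scores plus one extra sink logit; the sink's share is dropped."""
    peak = max(max(scores), sink_logit)
    exps = [math.exp(s - peak) for s in scores]
    denom = sum(exps) + math.exp(sink_logit - peak)
    return [e / denom for e in exps]


scores = [2.0, 1.0, 0.5, -1.0]
plain = softmax_with_sink(scores, sink_logit=-math.inf)  # sink contributes nothing
sunk = softmax_with_sink(scores, sink_logit=1.5)

print("attention mass without sink:", sum(plain))
print("attention mass with sink:   ", sum(sunk))
```

With the sink active, a head can assign part of its attention mass to "nothing" instead of being forced to spread it over the available entries.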
**Figure 2.** Overall architecture of DeepSeek-V4 series. We use hybrid CSA (Compressed Sparse Attention) and HCA (Heavily Compressed Attention) for attention layers, DeepSeekMoE for feed-forward layers, and strengthen conventional residual connections with mHC.
**Figure 3.** Core architectures of CSA. It compresses the number of KV entries to $\frac{1}{m}$ times, and then applies DeepSeek Sparse Attention for further acceleration. Additionally, a small set of sliding window KV entries is combined with the selected compressed KV entries to enhance local fine-grained dependencies.
**Figure 4.** Core architectures of HCA. It performs heavier compression, where the KV entries of $m' (\gg m)$ tokens will be consolidated into one. Also, we additionally introduce a small set of sliding window KV entries to enhance local fine-grained dependencies.
Infrastructure and Parallelism
Fine‑grained expert parallelism pipelines communication and computation to hide latency and boost throughput.
Expert Parallelism (EP) accelerates Mixture‑of‑Experts but traditionally forces a heavyweight all‑to‑all communication phase that stalls the pipeline.
Instead of waiting for every expert to finish its dispatch before any computation starts, we split experts into small waves and run dispatch, computation, and result aggregation concurrently.
Step 1: Wave 0 dispatches its inputs to remote GPUs (E₀, E₁). Communication finishes after 2 µs.
Step 2: While Wave 0 runs Linear‑1 and Linear‑2 (≈8 µs), Wave 1 starts its dispatch (≈2 µs).
Step 3: As soon as Wave 0 finishes Linear‑2, it begins Combine; simultaneously Wave 1 begins its computation.
Step 4: After Wave 1 finishes Compute, its Combine overlaps with Wave 0’s next‑layer dispatch, keeping the pipeline full.
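The overlap hides most communication behind computation: in steady state only the first wave's dispatch and the last wave's combine remain exposed, so the wall time approaches the total compute time plus those two transfers rather than the sum of every dispatch, compute, and combine stage.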
How does this differ from the naïve EP approach that batches all experts before any computation?
Naïve EP performs a single all‑to‑all dispatch for every expert, then blocks until every expert finishes before any Linear‑1 runs. Fine‑grained EP interleaves dispatch and compute per wave, so the network never becomes the sole critical path.
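A small timeline model makes the difference between the two schedules concrete. It is a sketch under simplifying assumptions: dispatch and combine share one communication channel, compute runs on a separate resource, and the 2 µs combine time is an assumed value (the walk-through only gives roughly 2 µs for dispatch and 8 µs for compute). Function names are hypothetical.

```python
def naive_ep(n_waves, dispatch, compute, combine):
    """All dispatch, then all compute, then all combine; nothing overlaps."""
    return n_waves * (dispatch + compute + combine)


def wave_pipelined_ep(n_waves, dispatch, compute, combine):
    """Overlap communication with compute; return the finish time of the last combine."""
    comm_free = 0.0
    dispatch_done = []
    for _ in range(n_waves):      # dispatches are issued back to back
        comm_free += dispatch
        dispatch_done.append(comm_free)

    compute_free = 0.0
    compute_done = []
    for ready in dispatch_done:   # a wave computes once its inputs have arrived
        compute_free = max(compute_free, ready) + compute
        compute_done.append(compute_free)

    for ready in compute_done:    # combine starts when compute and channel are free
        comm_free = max(comm_free, ready) + combine
    return comm_free


naive = naive_ep(n_waves=2, dispatch=2.0, compute=8.0, combine=2.0)
pipelined = wave_pipelined_ep(n_waves=2, dispatch=2.0, compute=8.0, combine=2.0)
print(f"naive: {naive} us, wave-pipelined: {pipelined} us")
```

In this model the second wave's dispatch and the first wave's combine are fully hidden behind compute, so only one dispatch and one combine remain on the critical path.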
Three practical observations emerged from the kernel development.
Power headroom becomes a limiting factor once the pipeline fully overlaps; future accelerators should provision extra power budget for such fused kernels.
We adopt a pull‑based dispatch where each GPU actively reads remote activations, avoiding the high latency of push‑based notifications.
Replacing SwiGLU with a cheap element‑wise activation removes exponential/division costs and prevents the GEMM pipeline from stalling.
**Require:** Learning rate $\eta$, momentum $\mu$, weight decay $\lambda$, update rescaling factor $\gamma$
1: **for** each training step $t$ **do**
2:   **for** each logically independent weight $W \in \mathbb{R}^{n \times m}$ **do**
3:     $G_t = \nabla_W \mathcal{L}_t(W_{t-1})$ $\quad \triangleright$ Compute gradients
4:     $M_t = \mu M_{t-1} + G_t$ $\quad \triangleright$ Accumulate momentum buffer
5:     $O'_t = \text{HybridNewtonSchulz}(\mu M_t + G_t)$ $\quad \triangleright$ Nesterov trick and hybrid Newton-Schulz
6:     $O_t = O'_t \cdot \sqrt{\max(n, m)} \cdot \gamma$ $\quad \triangleright$ Rescale the update RMS
7:     $W_t = W_{t-1} \cdot (1 - \eta\lambda) - \eta O_t$ $\quad \triangleright$ Perform weight decay and update
8:   **end for**
9: **end for**
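The update rule listed above can be sketched for a single weight matrix in NumPy. The paper's HybridNewtonSchulz routine is not specified in this summary, so the sketch substitutes the classical cubic Newton-Schulz iteration as a stand-in; the hyperparameter values, iteration count, and function names are likewise assumptions.

```python
import numpy as np


def newton_schulz(mat: np.ndarray, n_iters: int = 20) -> np.ndarray:
    """Classical cubic Newton-Schulz iteration (stand-in for HybridNewtonSchulz)."""
    x = mat / (np.linalg.norm(mat) + 1e-12)  # Frobenius scaling: singular values <= 1
    for _ in range(n_iters):
        x = 1.5 * x - 0.5 * x @ (x.T @ x)    # pushes nonzero singular values toward 1
    return x


def muon_step(W, M, G, lr, mu, wd, gamma):
    """One update for one weight matrix, following lines 4 to 7 of the listing."""
    M = mu * M + G                          # accumulate momentum buffer
    O = newton_schulz(mu * M + G)           # Nesterov trick, then orthogonalise
    O = O * np.sqrt(max(W.shape)) * gamma   # rescale the update RMS
    W = W * (1.0 - lr * wd) - lr * O        # weight decay and update
    return W, M


W = np.zeros((4, 3))
M = np.zeros_like(W)
G = np.array([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.5], [0.0, 0.0, 0.0]])
W, M = muon_step(W, M, G, lr=0.1, mu=0.9, wd=0.0, gamma=0.5)
print(np.round(W, 4))
```

Although the three gradient directions differ in magnitude by a factor of four, the orthogonalised update moves each of them by the same amount, which is the defining behaviour of this family of optimizers.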
**Figure 5.** Illustration of our EP scheme with related works. Comet (Zhang et al., 2025b) overlaps Dispatch with Linear-1, and Linear-2 with Combine, separately. Our EP scheme achieves a finer-grained overlapping by splitting and scheduling experts into waves. The theoretical speedup is evaluated in the configuration of the DeepSeek-V4-Flash architecture.
**Figure 6.** Illustration of the KV cache Layout for DeepSeek-V4. The KV cache is organized into two primary components: a classical KV cache for CSA/HCA, and a state cache for SWA and unready-for-compression tokens in CSA/HCA. In the state cache, each request is assigned a fixed-size cache block. Within this block, the SWA segment stores the KV entries corresponding to the most recent $n_{win}$ tokens, while the CSA/HCA segment stores uncompressed tail states that are not yet ready for compression. In the classical KV cache, we allocate multiple blocks per request. Each cache block covers $lcm(m, m')$ original tokens, producing $k_1 = \frac{lcm(m,m')}{m}$ CSA compressed tokens and $k_2 = \frac{lcm(m,m')}{m'}$ HCA compressed tokens.
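The block arithmetic in the Figure 6 caption can be checked with a few lines of Python. The compression factors passed in below are hypothetical values chosen for illustration, not the model's actual settings, and `block_layout` is an illustrative helper.

```python
from math import gcd


def block_layout(m: int, m_prime: int):
    """Return (tokens per block, CSA entries k1, HCA entries k2) for one cache block."""
    tokens_per_block = m * m_prime // gcd(m, m_prime)  # lcm(m, m')
    k1 = tokens_per_block // m        # CSA compressed tokens per block
    k2 = tokens_per_block // m_prime  # HCA compressed tokens per block
    return tokens_per_block, k1, k2


print(block_layout(4, 128))  # hypothetical factors where m divides m'
print(block_layout(4, 6))    # hypothetical factors with a non-trivial lcm
```

Sizing each block to the least common multiple means both compression schemes produce a whole number of entries per block, so block boundaries never split a compression group.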
TileLang is a domain‑specific language that lets developers write a single high‑level operator, which the compiler expands into a tightly fused GPU kernel.
Batch‑invariant kernels guarantee that a token’s output is bitwise identical regardless of its position in the batch, achieved by a dual‑kernel strategy that preserves accumulation order.
Deterministic MoE backward replaces atomicAdd with per‑SM accumulation buffers followed by a deterministic global reduction, eliminating non‑associative floating‑point nondeterminism.
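The reason atomic accumulation is nondeterministic is that floating-point addition is not associative, so the result depends on arrival order. The toy model below contrasts arrival-order summation with per-SM buffers reduced in a fixed order. It assumes each SM accumulates its own contributions in a fixed order (here each SM contributes once); the function names and event format are hypothetical.

```python
def atomic_style_sum(events):
    """Add contributions in arrival order, as racing atomicAdd calls would."""
    total = 0.0
    for _sm, value in events:
        total += value
    return total


def buffered_sum(events, n_sms):
    """Accumulate into one buffer per SM, then reduce buffers in fixed SM order."""
    buffers = [0.0] * n_sms
    for sm, value in events:
        buffers[sm] += value
    total = 0.0
    for partial in buffers:  # fixed order, independent of arrival order
        total += partial
    return total


# Same contributions as (sm_id, value) pairs, arriving in two different orders.
run_a = [(0, 0.1), (1, 0.2), (2, 0.3)]
run_b = [(1, 0.2), (2, 0.3), (0, 0.1)]

print(atomic_style_sum(run_a) == atomic_style_sum(run_b))  # arrival order matters
print(buffered_sum(run_a, 3) == buffered_sum(run_b, 3))    # fixed order does not
```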
Data and Pre-Training
Pre‑training scales to 32 T tokens while keeping per‑token compute modest.
To push context length toward the million‑token regime, the authors expand the pre‑training corpus and introduce two stability‑focused tricks.
The pre‑training corpus now exceeds 32 T tokens, enabling reliable learning on sequences up to one million tokens.
Data Construction aggregates mathematical, code, web, and long‑document sources, deliberately filtering auto‑generated content.
Anticipatory Routing pre‑computes routing indices a step ahead, then reuses them when the main network processes the current batch, like reserving a train seat before the train departs so passengers board without delay.
How does Anticipatory Routing differ from simply caching routing decisions from the previous batch?
Caching the *previous* batch’s indices would still use stale network parameters for the current forward pass, but Anticipatory Routing deliberately computes indices with a *fixed* lag $\Delta t$ and aligns them with the exact data that will be processed later, ensuring the cached indices are valid for the upcoming step while the main network already sees the freshest weights.
SwiGLU Clamping caps the linear component of the SwiGLU activation to a safe interval, much like a thermostat limits temperature to avoid overheating.
Why not simply increase the gate’s upper bound instead of clamping the linear term?
Raising the gate limit would allow the gating signal to amplify already large linear outputs, worsening outliers. Clamping the linear term directly controls the magnitude that the gate can modulate, guaranteeing that the combined activation stays within a numerically stable range.
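A scalar sketch shows what clamping the linear branch does. The bound of 10 is a hypothetical value, and the formulation `silu(gate) * linear` is the standard SwiGLU form rather than something confirmed by this summary.

```python
import math


def silu(x: float) -> float:
    """Swish / SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_clamped(gate: float, linear: float, bound: float) -> float:
    """SwiGLU output with the linear branch clamped to [-bound, bound]."""
    clamped = max(-bound, min(bound, linear))
    return silu(gate) * clamped


bound = 10.0  # hypothetical clamp value
print(swiglu_clamped(gate=2.0, linear=3.0, bound=bound))     # in range: unchanged
print(swiglu_clamped(gate=2.0, linear=1000.0, bound=bound))  # outlier: capped
```

Because the clamp acts before the multiplication, an outlier in the linear branch can contribute at most `bound` times the gate value, whatever its original magnitude.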
Scaling data diversity and extending context length are the primary levers that let DeepSeek‑V4 reach million‑token intelligence.
Evaluation Benchmarks
Post‑training swaps RL for OPD and yields a clear accuracy jump.
DeepSeek‑V4‑Flash‑Base beats DeepSeek‑V3.2‑Base on the average benchmark suite despite a much smaller parameter budget.
Table 1 shows higher scores on world‑knowledge and long‑context tasks while the total parameter count drops from 671 B to 284 B, roughly 58 % fewer.
**Table 1.** Comparison among DeepSeek-V3.2-Base, DeepSeek-V4-Flash-Base, and DeepSeek-V4-Pro-Base. All models are evaluated in our internal framework and share the same evaluation setting. Scores with a gap not exceeding 0.3 are considered to be at the same level. The highest score in each row is in **bold font**, and the second is <u>underlined</u>.
**Table 2.** Comparison of three reasoning modes
Standard Benchmark Evaluation
Post‑training evaluation shows parity with top models and clear gains on reasoning tasks.
We evaluate the post‑training DeepSeek‑V4 series across a battery of standard benchmarks, long‑context tasks, and real‑world agentic scenarios.
DeepSeek‑V4‑Pro‑Max achieves performance parity with leading closed models while delivering consistent gains on reasoning benchmarks.
Table 6 shows state‑of‑the‑art results on knowledge, reasoning, and agentic tasks; Table 7 confirms that the Max reasoning mode improves the hardest benchmarks. On SimpleQA‑Verified, the margin over open‑source baselines is +20 pp.
**Table 10.** Cost Comparison: Agentic Search vs. Retrieval Augmented Search (Mean) for DeepSeek-V4-Pro. Most of the tool calls are parallel for Agentic Search.
**Figure 10.** HLE and Terminal Bench 2.0 performance by reasoning effort. “None” indicates Non-think mode, and “Speciale” indicates DeepSeek-V3.2-Speciale model.
**Figure 11.** Win-rate comparison across analysis, generation, editing tasks, and the overall performance.
**Figure 12.** Detailed dimension scores including Task Completion, Content Quality, Formatting Aesthetics, and Instruction Following.
**Figure 8.** Formal reasoning under practical and frontier regimes. Left: Putnam-200 Pass@8 evaluates a fixed random subset of PutnamBench (Tsoukalas et al., 2024) following the setup introduced by Seed-Prover; all models are tested on the same problem set. We follow the Seed-Prover protocol but replace proprietary search tools with the open-source LeanExplore (Asher, 2025), yielding a lightweight setting with minimal agent tools and bounded sampling. Right: Putnam-2025 probes the frontier of mathematical reasoning in a scaled hybrid formal-informal regime, where informal reasoning is combined with formal verification to expose gaps and improve rigor; DeepSeek-V4 reaches a proof-perfect 120/120.
**Figure 9.** DeepSeek-V4 series performance on the MRCR task.
**Table 12.** Comparative Analysis of DeepSeek-V4-Pro and Gemini-3.1-Pro in Chinese Functional Writing.
**Figure 13.** Example output of a task which requires drafting a joint marketing proposal for a popular bubble tea brand and the Beijing Subway.
**Table 14.** DeepSeek-V4-Pro vs. Claude-Opus-4.5 on Complex Instruction Following and Multi-Turn Writing.
**Table 11.** Comparative Evaluation of DeepSeek-V4-Pro and DeepSeek-V3.2 on Search Q&A Tasks.
**Table 8.** Comparison on R&D Coding Benchmark (external models included strictly for evaluation purposes).
Across reasoning, knowledge, and agentic tasks, DeepSeek‑V4‑Pro‑Max matches or exceeds the best open models while approaching closed‑source performance.
Appendix and Author List
Appendix provides author list, acknowledgments, and detailed evaluation tables for DeepSeek‑V4‑Pro.
The author list enumerates every contributor, ordered alphabetically by first name; an asterisk marks individuals who have left the team.
The authors thank Dolly Deng and other testers for their valuable suggestions and feedback on the DeepSeek‑V4 series.
This appendix supplies exhaustive quantitative comparisons across multiple tasks, reporting win/tie counts, tool‑call costs, and head‑to‑head performance against prior models and external competitors.
Table 9 shows that Retrieval‑Augmented Search wins a larger share on easy queries, while Agentic Search leads on hard queries (≈68 % vs ≈62 % win rate respectively).
Table 10 reports that Agentic Search issues far more tool calls (13 649) and consumes many more prefill tokens (10 453) than Retrieval‑Augmented Search, which makes only 16.2 calls and processes roughly 1 500 prefill tokens.
Table 11 compares DeepSeek‑V4‑Pro with its predecessor V3.2 across twelve search‑Q&A sub‑categories; V4‑Pro achieves higher win percentages in the majority of tasks.
Table 12 evaluates functional Chinese writing, where DeepSeek‑V4‑Pro wins 162 categories against Gemini‑3.1‑Pro’s 103 wins, with 56 ties.
Table 13 presents creative Chinese writing results; DeepSeek‑V4‑Pro records 836 wins versus Gemini‑3.1‑Pro’s 662, and 410 ties.
Table 14 contrasts DeepSeek‑V4‑Pro with Claude‑Opus‑4.5 on complex instruction following and multi‑turn writing, where Claude‑Opus‑4.5 leads (147 wins vs 49 for DeepSeek‑V4‑Pro).
Questions & answers
What is the main contribution of DeepSeek-V4?
DeepSeek-V4 introduces a hybrid attention architecture that interleaves Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to drastically reduce KV cache size and FLOPs, enabling practical million-token context inference. It also introduces Manifold-Constrained Hyper-Connections (mHC) to stabilize deep signal propagation during training.
What problem does DeepSeek-V4 address?
DeepSeek-V4 addresses the quadratic scaling of standard self-attention with sequence length, which makes million-token contexts computationally prohibitive. This bottleneck limits test-time scaling and prevents practical use of long-horizon tasks such as massive document analysis or multi-step agentic workflows.
How does Compressed Sparse Attention (CSA) work?
CSA first collapses m consecutive KV vectors into a single compressed representation, permanently shrinking the KV cache before any sparsity is applied. A subsequent top-k sparse selection step then operates on the smaller pool of compressed entries, yielding both memory and compute savings compared to ordinary sparse attention.
How does CSA differ from ordinary sparse attention?
Ordinary sparse attention still stores every key/value vector and only reduces the number of dot-product operations, whereas CSA first compresses blocks of tokens into single entries, permanently reducing the cache size before sparsity is applied. This means CSA achieves both memory and compute reductions, not just compute reductions.
What is Heavily Compressed Attention (HCA) and why is it needed alongside CSA?
HCA applies more aggressive compression than CSA, merging m' tokens (with m' much larger than m) into a single entry, and it drops the top-k selector entirely, accepting a coarser representation in exchange for far greater memory reduction. It is needed for extreme-length regimes (up to 1 million tokens), where CSA's selector would become ineffective if its compression factor m were raised enough to handle the scale.
What are Manifold-Constrained Hyper-Connections (mHC) and what problem do they solve?
mHC replaces standard residual connections by constraining the residual mapping to the manifold of doubly stochastic matrices, ensuring the transformation is non-expansive. This prevents signal explosion during deep stacking and stabilizes training at scale, particularly given the complex dynamics introduced by the hybrid attention layers.
How efficient is DeepSeek-V4-Pro compared to DeepSeek-V3.2?
In a 1-million-token context, DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache size of its predecessor DeepSeek-V3.2, roughly a 3.7x reduction in compute and a 10x reduction in memory for ultra-long contexts.
What optimizer does DeepSeek-V4 use and why?
DeepSeek-V4 uses the Muon optimizer for the majority of modules, chosen because it provides faster convergence and improved training stability compared to standard AdamW. This is particularly important for handling the complex dynamics introduced by the new mHC and hybrid attention layers.
What is Anticipatory Routing and how does it differ from caching previous routing decisions?
Anticipatory Routing computes routing indices with a fixed lag Δt and aligns them with the exact data that will be processed in a future step, ensuring the cached indices are valid for the upcoming step while the main network already uses the freshest weights. Simply caching the previous batch's indices would use stale network parameters for the current forward pass, making them misaligned.
How does DeepSeek-V4 handle Expert Parallelism (EP) more efficiently?
DeepSeek-V4 uses fine-grained EP that interleaves dispatch and compute per wave, so the network communication never becomes the sole critical path. This contrasts with naïve EP, which performs a single all-to-all dispatch for every expert and blocks until every expert finishes before any computation begins.
What benchmarks and evaluation tasks were used to assess DeepSeek-V4?
The paper evaluates the DeepSeek-V4 series across standard benchmarks covering reasoning, knowledge, and agentic tasks, as well as long-context tasks and real-world agentic scenarios. Specific evaluations include search Q&A sub-categories, functional and creative Chinese writing, complex instruction following, and multi-turn writing, with comparisons against models including DeepSeek-V3.2, Gemini-3.1-Pro, and Claude-Opus-4.5.
What are the key quantitative results from DeepSeek-V4's evaluations?
DeepSeek-V4-Pro wins 162 functional Chinese writing categories versus Gemini-3.1-Pro's 103 wins, and 836 versus 662 in creative Chinese writing. However, Claude-Opus-4.5 leads on complex instruction following and multi-turn writing with 147 wins versus 49 for DeepSeek-V4-Pro. On search tasks, Agentic Search achieves approximately 68% win rate on hard queries versus approximately 62% for Retrieval-Augmented Search.
What are the limitations or open problems acknowledged by the paper?
The paper notes that power headroom becomes a limiting factor once the pipeline fully overlaps compute and communication, and suggests future accelerators should provision extra power budget for such fused kernels. The paper does not extensively enumerate other limitations or failure modes of the approach.
How does DeepSeek-V4 ensure numerical stability during training?
Two stability-focused techniques are used: Anticipatory Routing aligns cached routing indices with the correct future data to avoid stale routing, and the linear term in the activation function is clamped to prevent large outputs from being amplified by the gate, keeping activations within a numerically stable range. mHC also contributes to stability by constraining residual mappings to be non-expansive.
How does DeepSeek-V4 achieve deterministic training despite parallel computation?
Deterministic MoE backward replaces atomicAdd operations with per-SM accumulation buffers followed by a deterministic global reduction, eliminating non-associative floating-point nondeterminism. Additionally, batch-invariant kernels guarantee that a token's output is bitwise identical regardless of its position in the batch, achieved through a dual-kernel strategy that preserves accumulation order.
Who are the authors and what is the venue or date of DeepSeek-V4?
The paper states that the author list is ordered alphabetically by first name, with an asterisk marking individuals who have left the team, and acknowledges testers including Dolly Deng. The paper does not specify a publication venue or exact release date.
How does DeepSeek-V4 extend context length during pre-training?
The authors expand the pre-training corpus and introduce stability-focused techniques including Anticipatory Routing and activation clamping to support training at million-token context lengths. The paper states that scaling data diversity and extending context length are the primary levers for reaching million-token intelligence.
How does Agentic Search differ from Retrieval-Augmented Search in DeepSeek-V4's evaluation?
Agentic Search issues far more tool calls (13,649) and consumes many more prefill tokens (10,453 on average) compared to Retrieval-Augmented Search, which makes only 16.2 calls and processes roughly 1,500 prefill tokens. Agentic Search leads on hard queries with approximately 68% win rate, while Retrieval-Augmented Search wins a larger share on easy queries.
Key terms
- Compressed Sparse Attention (CSA)
- An attention mechanism that first compresses blocks of m consecutive key-value vectors into single entries to shrink the KV cache, then applies sparse top-k selection over the compressed entries to further reduce computation.
- Heavily Compressed Attention (HCA)
- A more aggressive attention compression method than CSA that merges m' tokens (with m' much larger than m) into one entry and uses no top-k selector, accepting coarser representations in exchange for much greater memory reduction, suited for extreme-length contexts.
- Manifold-Constrained Hyper-Connections (mHC)
- A replacement for standard residual connections that constrains the residual mapping to the manifold of doubly stochastic matrices, ensuring the transformation is non-expansive and preventing signal explosion in deep networks.
- KV cache
- A memory buffer that stores the key and value tensors computed during attention for all tokens in the context, which grows linearly with sequence length and becomes a major memory bottleneck for long contexts.
- Mixture-of-Experts (MoE)
- A neural network architecture where different subsets of parameters (experts) are selectively activated for each input token, allowing a large total parameter count while keeping per-token compute manageable.
- Expert Parallelism (EP)
- A distributed training and inference strategy that places different MoE experts on different devices, requiring all-to-all communication to route tokens to the correct expert.
- Fine-grained Expert Parallelism
- An improved EP approach that interleaves dispatch communication and expert computation per wave, preventing network communication from becoming the sole bottleneck in the pipeline.
- Anticipatory Routing
- A routing strategy that pre-computes expert routing indices with a fixed lag Δt aligned to future data batches, so that cached routing decisions remain valid for the step they will be used in.
- Muon optimizer
- An optimization algorithm used in place of AdamW for most model modules, offering faster convergence and improved training stability according to the paper.
- Doubly stochastic matrix
- A square matrix of non-negative real numbers where every row and every column sums to one, used in mHC to constrain residual transformations to be non-expansive.
- SwiGLU
- A gated linear unit activation function combining a swish activation with a gating mechanism. The paper clamps its linear component for training stability and, in its kernel observations, notes that a cheaper element-wise activation would avoid stalling the GEMM pipeline.
- Agentic Search
- A search strategy in which the model autonomously issues many tool calls and processes large amounts of context to answer queries, as opposed to simpler retrieval-augmented approaches.
- Retrieval-Augmented Search (RAS)
- A search strategy that retrieves a small number of relevant documents and prepends them to the model's context, using far fewer tool calls and prefill tokens than Agentic Search.
- FLOPs
- Floating-point operations, a measure of the computational work required to perform a forward pass or inference step in a neural network.
- Top-k selector (lightning indexer)
- A component in CSA that ranks compressed KV entries and selects the k most relevant ones for attention, analogous to consulting a summarized index rather than reading every entry.
- Deterministic MoE backward
- A training technique that replaces non-deterministic atomic addition operations with per-SM accumulation buffers and a deterministic global reduction, ensuring reproducible gradient computations regardless of parallelism.
- Batch-invariant kernel
- A GPU kernel implementation that produces bitwise-identical outputs for a token regardless of its position within a batch, achieved through a dual-kernel strategy that preserves floating-point accumulation order.