Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-turn LLM Serving

Hyungmin Kim, Minsoo Kim, Hongseok Kim, Jungwook Choi

Tangram enables high-throughput multi-turn LLM serving by making non-uniform KV cache compression systemically practical.

How can we implement non-uniform KV cache compression in production LLM serving systems without incurring the massive memory fragmentation and load-balancing overheads that currently make it impractical?

Multi-turn LLM serving requires maintaining massive dialogue histories in the Key-Value (KV) cache, which grows linearly and quickly exhausts GPU memory. While non-uniform compression can reduce this footprint by pruning less important tokens, it creates heterogeneous cache sizes that break the rigid, uniform memory assumptions of current serving systems like vLLM. Tangram resolves this by profiling each attention head's retention needs offline and using these stable, model-intrinsic patterns to pre-allocate memory and pre-calculate GPU workloads. This eliminates the need for costly runtime page reclamation and dynamic load balancing. On long-context benchmarks, Tangram improves serving throughput by up to 2.6× compared to state-of-the-art baselines while maintaining model accuracy.

Paper Primer

Existing serving systems rely on PagedAttention, which assumes all attention heads have identical KV cache lengths. When non-uniform compression is applied, this assumption fails: memory becomes fragmented because the system cannot reclaim space from individual heads, and GPU Streaming Multiprocessors (SMs) become imbalanced because some heads are much "heavier" than others.

Tangram treats head-wise retention as a static, model-intrinsic property rather than a dynamic runtime variable. It uses three moves: Deterministic Budget Allocation replaces runtime compression with pre-calculated memory footprints; Head Group Page clusters heads with similar budgets into independent page tables to maximize reclamation; and Ahead-of-Time (AOT) Load Balancing pre-computes workload partitions to ensure uniform GPU utilization.

Tangram achieves significant throughput gains over state-of-the-art serving frameworks.

End-to-end throughput measured on A100 GPUs across multi-turn benchmarks (SCBench, LoCoMo, RealTalk, LongMemEval).

Tangram eliminates the runtime overhead of dynamic page reclamation.

By using static budget profiles, the system avoids the "compress-and-reclaim" process that consumes up to 25% of prefill execution time in dynamic systems.

Why does non-uniform KV compression cause "straggler" effects on GPUs?

Because attention kernels parallelize work across heads, heads with larger retained KV caches take longer to process than those with smaller caches. If the workload is partitioned assuming uniform lengths, the GPU SMs assigned to the "heavy" heads become bottlenecks, forcing the entire system to wait for them to finish.

Does the offline profiling approach generalize to different user inputs?

Yes. The authors observe that while absolute retention values may shift slightly across domains, the relative ranking of head importance is stable and model-intrinsic. A small pilot set of samples is sufficient to calibrate budgets that remain robust across diverse multi-turn tasks.

Tangram’s core insight is that the "unpredictability" of non-uniform KV cache is an artifact of trying to manage it at runtime; by shifting these decisions to an offline profiling stage, the system can treat heterogeneous memory as a static, manageable blueprint.

Tangram demonstrates that non-uniform KV compression is not inherently incompatible with high-performance serving; it simply requires moving from dynamic, reactive scheduling to static, profile-aware resource management.

The KV Cache Bottleneck

We expose the KV‑cache bottleneck that limits multi‑turn LLM serving and motivate non‑uniform compression.

Serving multi‑turn LLMs requires persisting the attention states in a KV cache. As each turn adds tokens, the cache grows linearly and soon exceeds the model’s weight size, saturating GPU memory and bandwidth.

The KV cache expands proportionally with dialogue length, quickly dwarfing the model parameters and becoming the dominant memory consumer.
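The linear growth described above follows from the standard size formula for a KV cache. The sketch below is illustrative only: the function name and the configuration values (32 layers, 8 KV heads, head dimension 128, 2-byte values, 1,000 tokens per turn) are assumptions chosen for the example, not figures from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Bytes needed to hold keys and values for `num_tokens` tokens."""
    # The factor 2 accounts for storing both a key and a value vector per token.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Hypothetical configuration, chosen only to illustrate linear growth.
per_turn_tokens = 1_000
sizes = [kv_cache_bytes(32, 8, 128, turns * per_turn_tokens) for turns in (1, 2, 4)]
for turns, size in zip((1, 2, 4), sizes):
    print(f"{turns} turn(s): {size / 2**20:.1f} MiB per request")
```

Doubling the number of retained tokens doubles the footprint, and the total scales again with the number of concurrent requests.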

Non‑uniform KV compression keeps the most important entries per head, preserving accuracy while cutting size. However, the resulting heterogeneity fragments memory, complicates scheduling, and harms kernel efficiency.

Instead of trimming every head equally, the system retains a variable number of tokens per head based on importance scores, so critical heads keep more context.
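To make the contrast concrete, the following sketch compares the two strategies at a 50% compression rate on toy importance scores. It assumes one simple way of producing non-uniform budgets, namely pooling scores across heads and keeping the globally highest entries; actual compression methods such as KVzip may score and select entries differently. All function names and score values are hypothetical.

```python
def uniform_keep(scores, rate):
    """Keep the same number of top-scoring entries in every head."""
    kept = []
    for head_scores in scores:
        k = int(len(head_scores) * rate)
        order = sorted(range(len(head_scores)), key=lambda i: -head_scores[i])
        kept.append(sorted(order[:k]))
    return kept


def non_uniform_keep(scores, rate):
    """Keep the globally top-scoring entries, so per-head counts can differ."""
    flat = [(s, h, i) for h, head_scores in enumerate(scores)
            for i, s in enumerate(head_scores)]
    k = int(len(flat) * rate)
    flat.sort(key=lambda t: -t[0])
    kept = [[] for _ in scores]
    for _, h, i in flat[:k]:
        kept[h].append(i)
    return [sorted(idx) for idx in kept]


scores = [
    [0.9, 0.8, 0.7, 0.2],    # head 0: most entries matter
    [0.6, 0.1, 0.05, 0.01],  # head 1: only the first entry matters
]
uniform = uniform_keep(scores, 0.5)          # two entries per head
non_uniform = non_uniform_keep(scores, 0.5)  # three entries for head 0, one for head 1
```

Both strategies keep four entries in total, but the non-uniform selection retains the 0.7 entry in head 0 instead of the 0.1 entry in head 1.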

These heterogeneities manifest as three systemic limitations: (1) a monolithic page structure prevents independent reclamation of head‑wise memory, (2) on‑the‑fly reclamation of scattered pages creates prohibitive control‑plane overhead, and (3) uneven KV lengths produce straggler effects that under‑utilize GPU SMs.

**Figure 1.** (a) KV cache size growth for Qwen2.5-32B with the number of conversation sessions (top, # requests = 16) or with the # of requests (bottom, session number = 10). The dashed line indicates the model weight size. (b) Comparison of uniform and non-uniform KV compression strategies at a 50% compression rate, where the numbers in each box denote the importance score of each KV entry.

**Figure 2.** (a) Distribution of KV cache entries capturing the top 50% attention scores on Qwen3-4B, averaged over 50 samples from SCBench [23]. (b) Comparative accuracy on long-term conversation QA benchmarks [20] using KVzip [16] with uniform and non-uniform KV compression.

KV cache growth is the primary constraint for multi‑turn LLM serving.

Serving Architecture and Baselines

This section outlines the KV cache challenges and the vLLM serving architecture.

The KV cache stores attention states for every token, and its size grows with layer count, head count, and dialogue length. In multi‑turn serving this memory pressure quickly eclipses model size, making cache capacity the dominant bottleneck.

vLLM orchestrates request scheduling, block allocation, and attention computation so that many user sessions share a single GPU without explicit per‑request memory duplication.

The serving pipeline consists of five tightly coupled components: a scheduler, a block table, KV cache generation, KV cache blocks, and the attention kernel. All of them implicitly assume that each head stores the same number of KV entries.

Continuous Batching drives request admission by inspecting the block pool and allocating pages for each runnable request. The block table records the total pages needed, treating each token as a fixed‑cost memory unit.
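The admission logic can be pictured with a small sketch. This is not vLLM's scheduler: the function name, the first-come-first-served policy, and the page size of 16 tokens are assumptions made for illustration. What it shows is the uniform cost model, where the pages a request needs depend only on its token count.

```python
import math


def admit_requests(pending, free_pages, page_size=16):
    """Admit requests in order while the block pool has enough free pages."""
    admitted = []
    for request_id, num_tokens in pending:
        # Every token is a fixed-cost unit, so pages depend on token count alone.
        pages_needed = math.ceil(num_tokens / page_size)
        if pages_needed > free_pages:
            break  # assumed FIFO policy: stop at the first request that does not fit
        free_pages -= pages_needed
        admitted.append(request_id)
    return admitted, free_pages


admitted, remaining = admit_requests([("a", 100), ("b", 40), ("c", 500)], free_pages=12)
```

Under this model the scheduler has no way to account for heads that retain fewer tokens than others, which is the gap the later sections address.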

In the Generate KV Cache stage, the GPU worker writes KV entries into the pre‑allocated blocks (PagedAttention). Entries are stored non‑contiguously to avoid fragmentation, but the page size remains the smallest allocatable unit.

Attention kernel optimizations such as FlashAttention‑2 fuse query‑key multiplication with value aggregation, while FlashDecoding and FlashInfer add KV‑dimension parallelism. FlashDecoding uses static heuristics for partitioning; FlashInfer performs a runtime planning phase to choose optimal splits.

Non‑uniform KV compression breaks the uniform‑length assumption, leading to three concrete limitations. First, the monolithic page structure cannot reclaim memory from heads that retain fewer tokens, causing page fragmentation.
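A rough way to quantify this fragmentation is sketched below. It assumes a simplified layout in which a page spans all heads, so the number of pages is dictated by the longest head and every shorter head leaves slots unused. The function names, head lengths, and page size are hypothetical.

```python
import math


def monolithic_pages(head_lengths, page_size):
    """Pages held when one page spans all heads: sized by the longest head."""
    return math.ceil(max(head_lengths) / page_size)


def wasted_slots(head_lengths, page_size):
    """Slots allocated but unused across heads under the monolithic layout."""
    pages = monolithic_pages(head_lengths, page_size)
    allocated = pages * page_size * len(head_lengths)
    return allocated - sum(head_lengths)


head_lengths = [64, 16, 8, 40]  # retained KV entries per head after compression
pages = monolithic_pages(head_lengths, 16)
waste = wasted_slots(head_lengths, 16)
```

In this toy case half of the 256 allocated slots hold no KV entries, yet none of the four pages can be returned to the pool.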

Second, the scheduler’s static page‑usage estimate becomes stale because compression decisions are only known after the forward pass, forcing a costly compress‑and‑reclaim step that can consume up to 25 % of prefill time.

Third, GPU kernels suffer workload imbalance: static KV chunk partitioning (as in FlashDecoding) leaves some SMs idle while others process larger chunks, producing a straggler effect that degrades throughput.

System-Level Inefficiencies

Non‑uniform KV compression creates runtime bottlenecks that motivate deterministic budgeting.

While non-uniform KV compression preserves multi-turn accuracy, deploying it on the vLLM serving stack exposes severe runtime inefficiencies. Decode kernels apply a single `num_splits` value to every head, a heuristic that balances work only when all heads share the same context length. When KV lengths diverge, workloads become highly skewed, and the decoding step stalls on the heaviest thread blocks.
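The skew can be illustrated with a toy model of uniform splitting. The sketch assumes each head's KV sequence is cut into the same number of equal chunks, one per thread block, and measures how far the heaviest chunk sits above the average. The function names and lengths are hypothetical, and real kernels involve more factors than chunk length.

```python
import math


def chunk_sizes(head_lengths, num_splits):
    """Per-thread-block KV chunk length when every head gets the same split count."""
    return [math.ceil(n / num_splits) for n in head_lengths for _ in range(num_splits)]


def imbalance(head_lengths, num_splits):
    """Ratio of the heaviest chunk to the mean chunk; 1.0 means perfectly balanced."""
    chunks = chunk_sizes(head_lengths, num_splits)
    return max(chunks) / (sum(chunks) / len(chunks))


uniform_ratio = imbalance([512, 512, 512, 512], num_splits=2)  # identical lengths
skewed_ratio = imbalance([896, 512, 384, 256], num_splits=2)   # same total, uneven
```

Both cases hold 2,048 KV entries in total, but in the skewed case the slowest thread block processes 1.75 times the average chunk, and the kernel finishes only when that block does.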

Our analysis shows that per-head retention profiles are largely input-independent: each head's KV budget is stable across queries, and the relative ranking of heads is preserved, a pattern observed across model families. This static property overturns the assumption of runtime unpredictability and enables deterministic, ahead-of-time budgeting.

**Figure 4.** Challenges posed by non-uniform KV compression. (a) *Monolithic Page Structure*: unified pages span all heads, causing page fragmentation (red dashed line: pages allocated per request). (b) *Dynamic Page Reclamation*: reclaiming scattered pages at runtime incurs severe control-plane overhead. (c) *Workload Imbalance*: uniform KV splits across thread blocks cause stragglers under different per-head KV lengths.

**Figure 5.** Workload imbalance on decode attention. (a) Uniform KV compression, (b) Non-uniform KV compression, (c) attention latency across different numbers of requests on Qwen3-4B. The dashed line indicates the maximum workload among all thread blocks.

Head Group Page Structure

Deterministic budgets replace runtime compression, making memory usage predictable.

Dynamic evict‑and‑reclaim during inference stalls the serving pipeline because the KV cache size fluctuates after each compression step. Tangram solves this by profiling head‑wise importance once offline and fixing a per‑head memory budget that is used at runtime.

Allocate a fixed number of KV slots to each attention head based on offline‑measured importance, so the runtime never needs to decide how much to keep.

Why not allocate budgets using only the mean $\mu_{\ell,h}$ without the safety margin?

Using only the mean would ignore the observed variance; on inputs where a critical head needs a few extra slots, the static budget could truncate essential context, causing a measurable drop in multi‑turn accuracy. The $\alpha\sigma_{\ell,h}$ term cushions those occasional spikes.

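Worked example with two layers and two heads per layer, a per-head cap of 8 slots, and $\alpha=1.5$. Each budget is $\min(8, \mu_{\ell,h}+\alpha\sigma_{\ell,h})$, rounded down to a whole slot. Compute budgets: $B_{1,1}= \min(8, 4+1.5\cdot1)=5.5\rightarrow5$, $B_{1,2}= \min(8, 2+1.5\cdot0.5)=2.75\rightarrow2$, $B_{2,1}= \min(8, 3+1.5\cdot0.8)=4.2\rightarrow4$, $B_{2,2}= \min(8, 1+1.5\cdot0.2)=1.3\rightarrow1$.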

Assume importance scores $s_{1,1}=[0.9,0.8,0.4,0.3,0.2,0.1,0.05,0.01]$; top‑5 indices are $\{0,1,2,3,4\}$, so $I_{1,1}=\{0,1,2,3,4\}$.

For head $(1,2)$, scores $s_{1,2}=[0.6,0.5,0.2,0.1,0.05,0.02,0.01,0.0]$; top‑2 indices are $\{0,1\}$, so $I_{1,2}=\{0,1\}$.

Similarly compute $I_{2,1}$ (top‑4) and $I_{2,2}$ (top‑1) from their respective score vectors.

After selection, each KV tensor is truncated to the sizes $B_{\ell,h}$, yielding a total memory of $5+2+4+1=12$ token slots across all heads.

Static budgets give a predictable total memory (12 slots here) while still allocating enough capacity to the most important heads, and the safety margin prevents occasional under‑allocation.
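The worked example above can be reproduced in a few lines. This is a minimal sketch, assuming the budget is the mean plus $\alpha$ times the standard deviation, rounded down and capped, as the example's arithmetic suggests. The function names are hypothetical, and the rounding rule is inferred from the example rather than quoted from the paper.

```python
import math


def static_budget(mu, sigma, alpha, cap):
    """Budget = min(cap, floor(mu + alpha * sigma)), as in the worked example."""
    return min(cap, math.floor(mu + alpha * sigma))


def top_indices(scores, budget):
    """Indices of the `budget` highest-scoring KV entries, in position order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:budget])


# (layer, head) -> (mean retention, standard deviation) from offline profiling
stats = {(1, 1): (4, 1.0), (1, 2): (2, 0.5), (2, 1): (3, 0.8), (2, 2): (1, 0.2)}
budgets = {head: static_budget(mu, sigma, alpha=1.5, cap=8)
           for head, (mu, sigma) in stats.items()}
kept_1_1 = top_indices([0.9, 0.8, 0.4, 0.3, 0.2, 0.1, 0.05, 0.01], budgets[(1, 1)])
total_slots = sum(budgets.values())
```

Because `budgets` depends only on the profiled statistics, it can be computed once before serving starts and reused for every request.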

Cluster heads that require similar KV lengths into a group and give each group its own memory page, so pages can be reclaimed independently without fragmenting the whole cache.

With deterministic budgets and group‑wise pages, the scheduler can allocate exactly the required memory up front, eliminating any runtime page‑reclamation step and unlocking continuous batching for long‑context workloads.

AOT Workload Partitioning

We eliminate fragmented memory and runtime load‑balancing overhead by clustering heads and pre‑allocating work.

Non‑uniform KV cache compression leaves memory fragmented and forces the GPU scheduler to rebalance work at every decoding step. This per‑step planning dominates the critical path, especially when the number of layers $L$ is large.

Group attention heads whose static budgets are close so that a single page allocation can serve the whole group without over‑provisioning.

Consider a layer with eight heads and a group size of 2. Sort budgets → $[1,2,5,5,6,7,9,10]$ (already ordered).

Form groups: $G_{\ell,1}=\{1,2\}$, $G_{\ell,2}=\{5,5\}$, $G_{\ell,3}=\{6,7\}$, $G_{\ell,4}=\{9,10\}$.

Group 1’s max budget = 2, Group 2’s max = 5, Group 3’s max = 7, Group 4’s max = 10.

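Because each group's allocation is driven by its own maximum, both heads in a group are sized to that maximum, so the total allocated memory is $2\cdot(2+5+7+10)=48$ slots. That is close to the $1+2+5+5+6+7+9+10=45$ slots the heads actually need, and far less than the $8\cdot10=80$ slots a monolithic page would require, since it sizes every head to the largest budget.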
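The grouping arithmetic can be checked with a short sketch. It assumes heads are sorted by budget and cut into consecutive groups of equal size, and that every head in a group is allocated the group's maximum budget. A monolithic layout is modeled as a single group spanning all heads. The function names are hypothetical.

```python
def group_heads(budgets, group_size):
    """Sort budgets and cut them into consecutive groups of `group_size` heads."""
    ordered = sorted(budgets)
    return [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]


def grouped_allocation(budgets, group_size):
    """Slots allocated when every head is padded to its group's maximum budget."""
    return sum(max(g) * len(g) for g in group_heads(budgets, group_size))


budgets = [1, 2, 5, 5, 6, 7, 9, 10]
groups = group_heads(budgets, 2)
needed = sum(budgets)                                    # slots the heads actually use
grouped = grouped_allocation(budgets, 2)                 # one page table per group
monolithic = grouped_allocation(budgets, len(budgets))   # one group spanning all heads
```

Varying `group_size` in this sketch shows the trade-off: smaller groups leave less padding but create more page tables to manage, while larger groups approach the monolithic case.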

After clustering, Tangram gives each head group its own page table, decoupling global memory management from the number of heads.

Maintaining a separate table per group would be $O(H)$ work, so we aggregate the tables into a vectorized block table and process them with OpenMP and AVX‑512 SIMD lanes.

Because each head’s KV budget is fixed offline, we can pre‑compute how many CTA blocks each head group should receive, removing the need for per‑step scheduling.

Algorithm 1 – AOT Workload Partitioning (simplified)

How does AOT Workload Partitioning differ from a dynamic load‑balancer that also spreads work across SMs?

Dynamic schemes recompute a partition at every decoding step, incurring CPU latency proportional to $L$ and the number of groups. AOT uses the static budgets $B$ to compute $S$ once offline, so the kernel can launch the exact number of CTA blocks without any per‑step planning.

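Worked example with $N_{CTA}=2$, group size $G=2$, and budgets $B=\begin{bmatrix}3&1&2&4\\5&2&1&3\end{bmatrix}$ (two layers, four heads). Layer 0: $\Omega_0 = 3+1+2+4 = 10$, $\tau_0 = \lceil10/2\rceil = 5$.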

Group 0 (heads 0‑1): $\Phi_{0,0}=3+1=4$, $S_{0,0}= \lceil4/5\rceil = 1$.

Group 1 (heads 2‑3): $\Phi_{0,1}=2+4=6$, $S_{0,1}= \lceil6/5\rceil = 2$.

Layer 1: $\Omega_1 = 5+2+1+3 = 11$, $\tau_1 = \lceil11/2\rceil = 6$.

Group 0: $\Phi_{1,0}=5+2=7$, $S_{1,0}= \lceil7/6\rceil = 2$.

Group 1: $\Phi_{1,1}=1+3=4$, $S_{1,1}= \lceil4/6\rceil = 1$.

The resulting split map $S=\begin{bmatrix}1&2\\2&1\end{bmatrix}$ balances the CTA load across groups while respecting each group’s memory budget.

**Algorithm 1** AOT Workload Partitioning with Head Group
**Require:** Calibrated budget tensor $B \in \mathbb{N}^{L \times H}$, where $L$ is the number of layers and $H$ is the number of KV heads; available CTAs $N_{CTA}$; head group size $G$
**Ensure:** Static split map $S \in \mathbb{N}^{L \times (H/G)}$
$S \leftarrow 1_{L \times (H/G)}$ $\triangleright$ initialize split factors per head group
**for** $\ell \leftarrow 1$ to $L$ **do**
$\quad \Omega_\ell \leftarrow \sum_{h=1}^H B_{\ell,h}$ $\triangleright$ total KV budget of layer $\ell$
$\quad$ **if** $\Omega_\ell = 0$ **then** continue
$\quad$ **end if**
$\quad \tau_\ell \leftarrow \max(1, \lceil \Omega_\ell / N_{CTA} \rceil)$ $\triangleright$ target per-split budget
$\quad$ **for** $i \leftarrow 0$ to $H/G - 1$ **do** $\triangleright$ iterate over head groups
$\quad \quad G_i \leftarrow \{h \mid h \in [i \cdot G, (i + 1) \cdot G - 1]\}$ $\triangleright$ heads in group $i$
$\quad \quad \Phi_{\ell,i} \leftarrow \sum_{h \in G_i} B_{\ell,h}$ $\triangleright$ aggregated group budget
$\quad \quad S_{\ell,i} \leftarrow \max(1, \lceil \Phi_{\ell,i} / \tau_\ell \rceil)$ $\triangleright$ split factor for group $i$
$\quad$ **end for**
**end for**
**return** $S$
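Algorithm 1 translates almost line for line into Python. The sketch below is a plain-list transcription for illustration, not the authors' implementation. The function and variable names are hypothetical, and the input reuses the two-layer, four-head worked example with $N_{CTA}=2$ and $G=2$.

```python
import math


def aot_split_map(budgets, num_ctas, group_size):
    """Static split map S[layer][group], following Algorithm 1."""
    split_map = []
    for layer_budgets in budgets:
        num_groups = len(layer_budgets) // group_size
        splits = [1] * num_groups          # default split factor per head group
        total = sum(layer_budgets)         # Omega: total KV budget of the layer
        if total > 0:
            target = max(1, math.ceil(total / num_ctas))  # tau: per-split budget
            for i in range(num_groups):
                group = layer_budgets[i * group_size:(i + 1) * group_size]
                # Phi / tau, rounded up, with at least one split per group
                splits[i] = max(1, math.ceil(sum(group) / target))
        split_map.append(splits)
    return split_map


split_map = aot_split_map([[3, 1, 2, 4], [5, 2, 1, 3]], num_ctas=2, group_size=2)
print(split_map)
```

Since the budgets are fixed offline, this function needs to run only once per model and hardware configuration, and the resulting map can be reused at every decoding step.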

**Figure 13.** Attention latency evaluated under the impact of AOT (Ahead-Of-Time) load balancing (fixed batch size of 4).

By clustering heads, giving each cluster its own page table, vectorizing block updates, and pre‑computing CTA splits, Tangram removes both memory fragmentation and runtime scheduling overhead, delivering a uniformly balanced decode pipeline.

Performance Evaluation

We benchmark Tangram’s accuracy and system efficiency on long‑context LLM serving.

We evaluate Tangram on three large language models across four long‑context benchmarks, measuring both accuracy and system efficiency. The evaluation isolates the impact of deterministic budget allocation, head‑group paging, and AOT load balancing.

Tangram delivers up to 2.6× higher throughput than the vLLM baseline, with gains across all models and compression rates.

Figure 10 shows the maximum speedup of 2.6× on Qwen3-4B at 75% KV eviction, with consistent gains on the other models.

**Figure 9.** Accuracy performance across Short, Mid, and Long context scales under varying KV cache retention rates. All results are based on non-uniform compression.

**Figure 10.** Throughput performance breakdown on the multi-turn benchmarks [20, 23, 26, 34] across various compression rates. Results are measured with a head group size of $G = 4$, where the compression rate denotes the percentage of the KV cache removed.

**Figure 11.** Latency breakdown across various compression rates. While dynamic allocation incurs significant page reclamation overhead that scales with the eviction rate, Tangram's deterministic budget allocation completely eliminates this extra cost, operating with zero page reclaim overhead.

**Figure 12.** Throughput across various compression rates by head group size (G). Small G minimizes fragmentation but incurs high management overhead, while large G fails to effectively reclaim memory.

**Figure 14.** TTFT (Time-To-First-Token) under increasing throughput pressure with 30K average request lengths. At a 75% compression rate, TTFT is maintained through deterministic budget allocation and Head Group Page.

Overall, Tangram matches or exceeds the accuracy of prior non‑uniform KV compression methods while delivering substantial system‑level speedups and eliminating load‑balancing overhead.

Related Work and Conclusion

We situate Tangram among recent serving research and recap its three system innovations.

Multi-turn LLM serving keeps a user's conversation history alive across requests, enabling assistants to recall long-horizon details. Benchmarks such as SCBench, RealTalk, and LoCoMo expose how quickly accuracy degrades when context length grows. Existing work focuses on retrieval augmentation but largely ignores the serving-side cost of maintaining the ever-growing KV cache.

KV cache compression reduces the memory footprint of stored attention states. Uniform compression forces every attention head to keep the same fraction of tokens, which is simple but often discards head‑specific context. Non‑uniform compression lets each head retain a tailored budget, preserving accuracy at the cost of system‑level complexity.

Heterogeneous memory management relaxes the assumption that all KV slots are allocated identically, allowing different layers or heads to receive distinct memory shares. Jenga demonstrates this idea for hybrid architectures, showing that per‑layer heterogeneity can boost accuracy while keeping memory use efficient.

Our end‑to‑end evaluation compares Tangram to the vLLM baseline and reports up to 2.6× higher throughput when the full suite of techniques is enabled. Incremental ablations reveal that Deterministic Budget Allocation, Head Group Page, and Ahead‑of‑Time Load Balancing each contribute additive gains, confirming that the system‑level optimizations translate the theoretical KV savings into real performance.

Dynamic compression normally incurs a heavy page‑reclamation cost, consuming up to 25 % of prefill time as scattered pages are tracked and freed. By fixing the exact memory footprint before execution, Deterministic Budget Allocation eliminates any need for reclamation, removing that overhead entirely.

In summary, Tangram makes non‑uniform KV compression practical by (1) converting dynamic budgets into a static allocation, (2) clustering heads with similar retention needs into independent Head Group Pages, and (3) pre‑computing balanced workload partitions Ahead‑of‑Time. The combined effect is a robust serving system that delivers up to 2.6× throughput with minimal accuracy loss.

Questions & answers

What is Tangram's main contribution?

Tangram introduces a serving system that enables efficient non-uniform KV cache compression for multi-turn LLM inference by using three coordinated techniques: Deterministic Budget Allocation, Head Group Page structure, and Ahead-of-Time (AOT) Load Balancing, together achieving up to 2.6× throughput improvement over the vLLM baseline.

What problem does Tangram address?

Tangram addresses the GPU memory bottleneck in multi-turn LLM serving, where KV caches grow linearly with dialogue length and quickly exhaust GPU memory. Specifically, it resolves the system-level inefficiencies—memory fragmentation, costly runtime page reclamation, and GPU straggler effects—that arise when non-uniform KV compression is applied on top of existing serving stacks like vLLM.

Why does non-uniform KV compression cause problems in existing serving systems?

Existing systems like vLLM use PagedAttention, which assumes all attention heads have identical KV cache lengths. Non-uniform compression violates this assumption, causing three concrete problems: memory fragmentation because individual head pages cannot be independently reclaimed, up to 25% of prefill time consumed by compress-and-reclaim overhead, and GPU SM imbalance (straggler effects) because heads with larger retained caches take longer to process.

How does Tangram's Deterministic Budget Allocation work?

Tangram profiles each attention head's KV retention needs offline using a small pilot set of samples, then fixes a per-head memory budget computed as the mean retention plus a safety margin (α times the standard deviation) to handle occasional spikes. This replaces runtime compression decisions with pre-calculated, static memory footprints, eliminating the need for any runtime page reclamation.

What is the Head Group Page structure in Tangram?

Head Group Page clusters attention heads with similar KV budget sizes into groups, each with its own independent page table, so that memory can be reclaimed at the group level rather than being locked in a monolithic structure. To keep management efficient, the per-group tables are aggregated into a vectorized block table processed with OpenMP and AVX-512 SIMD lanes, reducing overhead from O(H) per-head work.

How does Ahead-of-Time (AOT) Load Balancing work and how does it differ from dynamic approaches?

AOT Load Balancing uses the static per-head budgets computed offline to pre-calculate the exact number of GPU Cooperative Thread Array (CTA) blocks needed for each head group, so the attention kernel launches with a balanced workload without any per-step planning. Dynamic schemes, by contrast, recompute partitions at every decoding step, incurring CPU latency proportional to the number of layers L and head groups, which dominates the critical path.

Why does non-uniform KV compression cause GPU straggler effects?

Attention kernels parallelize work across heads, so heads with larger retained KV caches take longer to process than those with smaller caches. When workload partitioning assumes uniform lengths, the GPU SMs assigned to the heavier heads become bottlenecks, forcing the entire system to wait for them to finish before proceeding.

Why does Tangram add a safety margin (α·σ) to the mean retention when setting budgets?

Using only the mean retention μ would ignore observed variance; on inputs where a critical head occasionally needs extra slots, a mean-only budget could truncate essential context and cause a measurable drop in multi-turn accuracy. The α·σ term cushions those occasional spikes without requiring runtime adjustment.

Does Tangram's offline profiling generalize across different user inputs and domains?

Yes, according to the paper. The authors observe that while absolute retention values may shift slightly across domains, the relative ranking of head importance is stable and model-intrinsic, so a small pilot set of samples is sufficient to calibrate budgets that remain robust across diverse multi-turn tasks.

What benchmarks and models were used to evaluate Tangram?

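Tangram is evaluated on three large language models across four long-context benchmarks: SCBench, LoCoMo, RealTalk, and LongMemEval, with end-to-end throughput measured on A100 GPUs. The provided text does not list all three models; it names Qwen3-4B in the latency and throughput results and Qwen2.5-32B in the cache-growth figure.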

What are Tangram's key quantitative results?

Tangram improves serving throughput by up to 2.6× compared to the vLLM baseline while maintaining model accuracy. Deterministic Budget Allocation eliminates the page-reclamation overhead that previously consumed up to 25% of prefill time, and ablations confirm that each of the three components—Deterministic Budget Allocation, Head Group Page, and AOT Load Balancing—contributes additive throughput gains.

How does Tangram compare to prior non-uniform KV compression methods in terms of accuracy?

The paper states that Tangram matches or exceeds the accuracy of prior non-uniform KV compression methods while delivering substantial system-level speedups. The paper does not report specific accuracy numbers for individual baselines in the provided text.

How does Tangram differ from vLLM and its PagedAttention approach?

vLLM's PagedAttention assumes all attention heads have identical KV cache lengths and uses a monolithic page structure that cannot reclaim memory from individual heads. Tangram replaces this with head-group-specific page tables, static pre-allocated budgets, and pre-computed GPU workload partitions, removing the runtime overhead that PagedAttention-based systems incur under non-uniform compression.

How does Tangram relate to prior work on heterogeneous memory management such as Jenga?

The paper cites Jenga as a related approach that demonstrates per-layer heterogeneity for hybrid architectures, showing that distinct memory shares per layer can boost accuracy while keeping memory use efficient. Tangram extends this idea to per-head granularity within a full serving system optimized for multi-turn workloads.

What are the limitations or open questions acknowledged by the paper?

The paper does not explicitly enumerate limitations, but it notes that absolute retention values may shift slightly across domains, requiring a calibration pilot set. The paper also does not discuss how Tangram handles models or workloads where head retention patterns are not stable, nor does it address potential overhead of the offline profiling stage itself.

Who are the authors of Tangram and where was it published?

The authors are Hyungmin Kim, Minsoo Kim, Hongseok Kim, and Jungwook Choi. The provided text does not state the publication venue. The paper is available at arxiv.org/abs/2606.06302.

How would a practitioner reproduce or apply Tangram?

A practitioner would first run an offline profiling pass on a small pilot dataset to measure per-head KV retention statistics (mean and standard deviation per layer and head), then use these to pre-allocate head-group page tables and pre-compute AOT CTA splits before serving begins. The paper does not specify whether code or a reference implementation is publicly released.

Key terms

KV cache: A GPU memory structure that stores the Key and Value attention states for every previously seen token, allowing the model to avoid recomputing them on each new generation step.
Non-uniform KV compression: A technique that allows each attention head to retain a different number of KV entries based on its individual importance, rather than forcing all heads to keep the same fraction of tokens.
PagedAttention: A memory management scheme used in vLLM that stores KV cache entries in fixed-size pages, assuming all attention heads have the same KV cache length.
Deterministic Budget Allocation: Tangram's technique of computing each attention head's memory requirement offline from profiling statistics, so that exact memory can be pre-allocated before inference begins without any runtime adjustment.
Head Group Page: Tangram's memory structure that clusters attention heads with similar KV budget sizes into groups, each with its own independent page table, enabling fine-grained memory reclamation.
Ahead-of-Time (AOT) Load Balancing: Tangram's technique of pre-computing the GPU workload partition (number of CTA blocks per head group) offline using static budgets, eliminating per-decoding-step planning overhead.
Straggler effect: A GPU performance problem where some thread blocks or SMs take much longer than others due to unequal workloads, forcing the entire kernel to wait for the slowest unit to finish.
Streaming Multiprocessor (SM): The fundamental parallel processing unit on an NVIDIA GPU that executes groups of threads; imbalanced workloads leave some SMs idle while others are overloaded.
Cooperative Thread Array (CTA): A group of GPU threads that execute together and can share on-chip memory; the number of CTAs launched determines how work is distributed across SMs.
FlashAttention-2: An optimized attention kernel that fuses query-key multiplication with value aggregation to reduce memory bandwidth usage and improve GPU utilization.
FlashDecoding: An attention kernel optimization that adds parallelism along the KV sequence dimension using static heuristics to partition work across SMs.
FlashInfer: An attention kernel that performs a runtime planning phase to dynamically choose optimal KV-dimension splits for parallelism.
Continuous Batching: A serving technique that dynamically admits new requests into a running batch as soon as GPU memory is available, improving hardware utilization compared to static batching.
Multi-turn LLM serving: The deployment scenario where a language model must maintain and process the full conversation history across multiple user-assistant exchanges, causing KV cache size to grow with each turn.
AVX-512 SIMD: A CPU instruction set extension that processes 512 bits of data in a single instruction, used by Tangram to vectorize block table updates across head groups efficiently.
SCBench / RealTalk / LoCoMo / LongMemEval: Long-context multi-turn benchmarks used in the paper to evaluate how well serving systems and compression methods preserve accuracy as dialogue history grows.
Safety margin (α·σ): A buffer added to the mean KV retention budget, scaled by a factor α times the standard deviation σ, to prevent truncation of context on inputs where a head's retention occasionally exceeds its average.
