EarlyTom: Early Token Compression Completes Fast Video Understanding

EarlyTom accelerates Video-LLMs by compressing visual tokens inside the vision encoder to slash time-to-first-token.

How can we reduce the latency of video LLM inference by compressing visual tokens early in the vision encoder, rather than waiting until the LLM prefill stage?

Video Large Language Models (Video-LLMs) are bottlenecked by the vision encoder, which consumes over half of the time-to-first-token (TTFT) in existing optimized systems. EarlyTom introduces a training-free compression framework that merges redundant frames and selects tokens directly inside the vision encoder, rather than waiting until the prefilling stage. This approach reduces TTFT by up to 2.65× and FLOPs by 61% on LLaVA-OneVision-7B, while maintaining accuracy comparable to full-token baselines.

Paper Primer

Most existing compression methods treat the vision encoder as a black box, compressing tokens only after they are generated. EarlyTom recognizes that the encoder itself is the primary latency bottleneck and moves the compression logic upstream into the encoding process.

EarlyTom is a two-part pipeline: it first merges redundant frames using a streaming similarity-based strategy, then applies a decoupled spatial selection that separates dynamic and static tokens to avoid the bias introduced by "attention sinks."

EarlyTom achieves a 2.65× reduction in time-to-first-token (TTFT) compared to full-token baselines.

Measured on LLaVA-OneVision-7B using a single NVIDIA A100 GPU.

The method maintains high accuracy despite aggressive compression.

Maintains >96% of baseline accuracy across MVBench, EgoSchema, LongVideoBench, and VideoMME. Only a 4% performance drop at 10% retention, compared to nearly 9% for competing methods.

Why does compressing tokens inside the vision encoder matter more than compressing them later?

Vision encoding accounts for 36% to 68% of total TTFT in current models. By compressing early, the system reduces the number of tokens that must be processed by the subsequent LLM prefilling stage, fundamentally lowering the computational load of the entire pipeline.

What is the "attention sink" problem, and how does this method avoid it?

Certain spatial patch locations consistently receive abnormally high attention scores regardless of visual content. EarlyTom uses a decoupled sampling strategy that treats static and dynamic frames differently, preventing these structural attractors from biasing the token selection process.

Researchers and engineers can now treat the vision encoder as a target for optimization rather than a fixed cost, enabling real-time video understanding on compute-constrained hardware.

Abstract

EarlyTom compresses visual tokens inside the encoder, cutting latency and compute dramatically.

Video large language models excel at video understanding, but processing massive visual tokens makes deployment costly. Moreover, the vision encoder dominates the time‑to‑first‑token latency. Existing methods compress tokens only after the encoder, leaving this bottleneck untouched.

EarlyTom addresses this by performing training‑free token merging and spatial token selection early within the vision encoder. Its decoupled spatial token selection improves overall compression effectiveness.

Despite these gains, accuracy remains comparable to the full‑token baseline, making Video‑LLMs far more practical for real‑world deployment.

The Vision Encoder Bottleneck

Vision encoding dominates latency, so early token compression is essential.

Video‑LLMs have shown strong video‑understanding abilities, yet processing thousands of visual tokens per frame makes inference costly and hampers real‑world deployment.

**Figure 1.** **Left:** This paper aims to improve the inference efficiency of video understanding based on video large language models (LLMs). Latency profiling suggests the major speed bottleneck lies in the vision encoder part instead of the LLM. Knowing this, we introduce EarlyTom, a training-free token compression method designed for the early stage (i.e., vision encoder) of video LLMs. EarlyTom features two core components: (1) early-stage visual token compression achieved via inner vision encoder frame merging, and (2) a spatial token selection strategy that further increases compression effectiveness without introducing bias. **Right:** Scatter plot illustrating the relationship between FLOPs and throughput, along with the average performance across four widely used video understanding benchmarks (MVBench, EgoSchema, LongVideoBench, and VideoMME) for several training-free state-of-the-art methods. EarlyTom achieves state-of-the-art performance while maintaining accuracy comparable to full-token methods.

Profiling the latency breakdown (Figure 3) shows that vision encoding alone consumes 36.3 % of the total Time‑to‑First‑Token (TTFT) in a baseline system, and this share grows to 55.8 % for HoliTom and 68.4 % for VisionZip.

**Figure 3. Time-to-first-token (TTFT) latency composition.** We break down TTFT into four parts: vision encoding, visual token processing, LLM prefill, and system overhead. In the baseline, vision encoding takes 323 ms, accounting for 36.3% of the total, indicating that this stage still has substantial room for optimization. For state-of-the-art methods like HoliTom and VisionZip, vision encoding remains the largest component, occupying 55.8% (324 ms) and 68.4% (325 ms), respectively. In addition, HoliTom introduces extra token-processing overhead, increasing this component by 121.9% (+78 ms) compared to the baseline. In contrast, our method reduces vision encoding time directly inside the encoder, achieving a 2.65× TTFT reduction over the baseline while adding almost no additional overhead, evaluated under 10% token retention on an NVIDIA A100 GPU.

TTFT measures the wall‑clock delay from sending a prompt to receiving the first generated token; it is the user‑perceived start‑up latency of a video‑LLM.

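As a back-of-the-envelope illustration, take tokens per frame = 64 and frames = 60 → total tokens = 3 840.

At an assumed encoder cost of 0.3 ms per token, vision-encoder cost = 3 840 × 0.3 ms = 1 152 ms.

TTFT = vision + other stages ≈ 1 152 ms + 800 ms = 1 952 ms, so vision encoding is about 59 % of the total in this illustration.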

Even with modest per‑token cost, the sheer number of visual tokens makes vision encoding the dominant latency contributor.
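The profiled numbers quoted from Figure 3 can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check, not part of the method: it assumes each reported share is simply stage latency divided by total TTFT, and the helper name `implied_total` is made up for illustration.

```python
def implied_total(stage_ms: float, share: float) -> float:
    """Total TTFT implied by one stage's latency and its share of the total."""
    return stage_ms / share


# Vision-encoding latency (ms) and share of TTFT, as quoted from Figure 3.
baseline_total = implied_total(323, 0.363)
holitom_total = implied_total(324, 0.558)
visionzip_total = implied_total(325, 0.684)

# A 2.65x TTFT reduction applied to the implied baseline total.
earlytom_total = baseline_total / 2.65

for name, value in [
    ("baseline", baseline_total),
    ("HoliTom", holitom_total),
    ("VisionZip", visionzip_total),
    ("EarlyTom (implied)", earlytom_total),
]:
    print(f"{name}: {value:.0f} ms")
```

The implied EarlyTom figure of roughly 336 ms lines up with the 336.2 ms TTFT reported for EarlyTom at a 10 % retained-token ratio, which suggests the 2.65× figure is measured against the full-token baseline.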

The vision encoder processes redundant visual information; most of the tokens it emits are highly similar across consecutive frames, yet the encoder still spends full compute on each.

Prior token‑compression work either acts after the vision encoder (e.g., VisionZip, LLaVA-PruMerge) or inside the LLM (FastV, SparseVLM, Pyramid‑Drop). Hybrid schemes such as HoliTom, FastVID, and DyCoke combine both stages but still leave a sizable vision‑encoding cost.

Vision encoding is the primary bottleneck for TTFT in current Video‑LLMs.

Existing Token Compression Methods

We summarize prior token‑compression techniques for vision encoders and LLM pipelines.

Intra-encoder token compression reduces the number of visual tokens before they reach the language model. By merging or discarding redundant tokens early, these methods lower the compute burden of the vision encoder.

Tokens that are highly similar are merged into a single representative, so the encoder processes fewer tokens without losing salient information.

Introduces an energy score that preserves informative tokens while merging large, similar clusters.

Uses attention scores from a [CLS] token to select cluster centers, then merges remaining tokens via K‑nearest‑neighbor clustering.

Tokens with higher attention scores are kept, and the rest are grouped and merged, yielding a compact visual token set.

Combines multi‑dimensional redundancy evaluation, token‑adaptive matching, and weighted fusion in a filtering‑association‑compression pipeline.

Merges similar neighborhood tokens while preserving key visual tokens, using dual‑attention filtering during the prefilling stage.

Provides a coarse‑to‑fine visual projector: low‑resolution point queries are generated first, then refined with high‑resolution features via a region‑to‑point module.

Builds augmented samples and trains with token merging to improve efficiency.

Pre‑LLM token compression operates after the vision encoder, treating compression as a plug‑and‑play module before the language model. These methods aim to prune redundant visual information while preserving the cues needed for downstream reasoning.

Frames are segmented dynamically and pruned based on a local “information density” metric, so only visually informative frames are kept.

Two‑stage pipeline that merges redundant frame tokens temporally and then prunes the KV cache dynamically during decoding.

Progressively encodes each frame and adaptively compresses redundant tokens by exploiting temporal redundancy.

Empirical study of LLM handling of visual tokens, introducing fine‑grained pruning at intermediate model layers.

Performs holistic spatio‑temporal segmentation and merging, using a query‑aware inner‑LLM method to achieve extreme compression.

EarlyTom Framework Overview

EarlyTom compresses video tokens early to cut latency without hurting accuracy.

The vision encoder consumes a large share of the Time‑to‑First‑Token (TTFT) budget (36 % to 68 % in the profiled systems), so shaving work there yields the biggest latency win.

Instead of processing every incoming frame, the encoder merges temporally redundant frames on‑the‑fly, keeping only a compact representation of the video’s motion.

Identify the redundant segment: frames F₂–F₄ all exceed the threshold with their neighbors, so they form a merge group.

Compute a weighted average of their embeddings: w₁·F₂ + w₂·F₃ + w₃·F₄ with weights proportional to similarity (0.92, 0.95, 0.96).

Replace the three frames by a single merged frame M, yielding the shortened sequence [F₁, M, F₅].

Pass the three‑frame sequence to the downstream LLM prefilling stage.

Early merging cuts the token count by 40 % before any LLM work begins, directly shrinking the dominant TTFT component.
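The merge step in this example can be sketched in plain Python. This is a minimal illustration, assuming each frame is a flat feature vector and that the weights are exactly the similarities quoted above (0.92, 0.95, 0.96); the 2-D embeddings and the function name `merge_group` are hypothetical.

```python
from typing import List, Sequence


def merge_group(frames: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Similarity-weighted average of a group of frame embeddings."""
    total = sum(weights)
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) / total for d in range(dim)]


# Hypothetical 2-D embeddings for F1..F5 (F2-F4 are nearly identical).
F1, F2, F3, F4, F5 = [0.0, 1.0], [1.0, 0.0], [1.0, 0.1], [1.0, 0.2], [0.0, -1.0]

M = merge_group([F2, F3, F4], [0.92, 0.95, 0.96])
sequence = [F1, M, F5]          # shortened sequence after merging
reduction = 1 - len(sequence) / 5

print(len(sequence), f"{reduction:.0%}")
```

Because the weights are normalized by their sum, the merged frame stays on the same scale as its inputs, and going from five frames to three is the 40 % reduction stated above.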

How does this differ from naïve frame sampling, which simply drops every other frame?

Sampling discards frames indiscriminately, potentially losing subtle motion cues. EarlyTom’s similarity‑driven merging keeps the most informative content by averaging only truly redundant frames, preserving motion continuity while still reducing token count.

After encoding, tokens are split into dynamic and static groups; each group is pruned with a strategy that respects its role, so we keep motion‑rich tokens densely and preserve spatial coverage for static background.

Dynamic set = {t₁, t₃, t₅, t₈}. Apply global Top‑K = 2 → keep the two highest variances (t₁ = 0.9, t₈ = 0.6).

Static set = {t₂, t₄, t₆, t₇, t₉, t₁₀, t₁₁, t₁₂}. Partition the spatial layout into 2 × 2 windows (each window holds 3 tokens). Within each window pick the token with the largest variance.

Selected static tokens: t₂ (0.2), t₆ (0.15), t₁₁ (0.3), t₁₂ (0.02).

Concatenate selected dynamic and static tokens → final token set {t₁, t₈, t₂, t₆, t₁₁, t₁₂} (6 tokens total).

The scheme halves the token count while guaranteeing that every spatial region contributes at least one token, and that the most motion‑rich tokens survive.

Why not simply apply Top‑K globally to all tokens, as many pruning methods do?

Global Top‑K would discard many low‑variance static tokens that are essential for preserving background layout, leading to spatial holes. Decoupling lets us keep a uniform spatial scaffold via local‑window selection while still focusing the budget on dynamic content.
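The difference in spatial coverage is easy to see on a toy example. The sketch below uses hypothetical scores for 12 tokens and, as a simplification, treats windows as contiguous 1-D chunks rather than 2-D patches; the function names are illustrative and not taken from the paper.

```python
from typing import List


def global_top_k(scores: List[float], k: int) -> List[int]:
    """Indices of the k highest-scoring tokens, ignoring spatial position."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def window_top_1(scores: List[float], window: int) -> List[int]:
    """Index of the highest-scoring token inside each contiguous window."""
    kept = []
    for start in range(0, len(scores), window):
        idx = range(start, min(start + window, len(scores)))
        kept.append(max(idx, key=lambda i: scores[i]))
    return kept


# Hypothetical scores: the first window is far more active than the rest.
scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.15, 0.05, 0.3, 0.1, 0.02, 0.04, 0.01]
window = 3

global_keep = global_top_k(scores, k=4)
local_keep = window_top_1(scores, window)

global_cover = {i // window for i in global_keep}
local_cover = {i // window for i in local_keep}
print(global_keep, sorted(global_cover))
print(local_keep, sorted(local_cover))
```

With the same budget of four tokens, global Top‑K spends three of them in the first window and leaves two windows empty, while per-window selection keeps one token from every region.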

**Figure 4.** Overall pipeline of EarlyTom. Our method consists of two main stages for efficient video token compression. Stage I: Inner-vision encoder frame merging performs temporal compression inside the vision encoder. The video is adaptively segmented based on streaming frame similarity, redundant middle frames are merged using a local-optimal criterion, and merged representations are further refined with weighted fusion to reduce early-stage temporal redundancy. Stage II: Decoupling selection conducts spatial token reduction after vision encoding. Merged frame features are decomposed into dynamic and static token sets: dynamic frames undergo global Top-K selection, while static frames use local-window selection to preserve spatial distribution. The selected tokens from both paths are recombined and fed into the LLM for decoding. Together, these two stages enable early temporal compression and balanced spatial sampling, significantly accelerating Video LLM inference while maintaining semantic fidelity.

Together, these two components cut the vision‑encoding and token‑processing portions of TTFT dramatically while keeping the semantic content needed for high‑quality generation.

Inner-Encoder Frame Merging

Compress redundant visual frames early to cut inference latency.

Redundant video frames inflate the vision encoder’s workload and, with it, Time‑to‑First‑Token latency; eliminating that redundancy before the LLM sees the visual tokens yields a large speedup.

We split the video into homogeneous segments by watching how quickly the visual content changes—when the similarity between consecutive frames drops, a new segment starts.

Why not use the raw similarity $s_t$ directly for segmentation?

Raw $s_t$ fluctuates due to per‑token noise and occasional motion spikes; the EMA $\hat{s}_t$ filters out these spikes, preventing over‑segmentation on spurious changes.
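A minimal sketch of the smoothing step follows, assuming the usual EMA form $\hat{s}_t = \alpha s_t + (1-\alpha)\hat{s}_{t-1}$ with $\hat{s}_0 = s_0$. The value $\alpha = 0.6$ and the raw similarities are those used in the worked example later in this section; the threshold `tau_seg=0.8` is a hypothetical value chosen only for illustration.

```python
from typing import List


def ema_smooth(sims: List[float], alpha: float) -> List[float]:
    """Exponential moving average of consecutive-frame similarities."""
    smoothed: List[float] = []
    for s in sims:
        if not smoothed:
            smoothed.append(s)  # first value initializes the average
        else:
            smoothed.append(alpha * s + (1 - alpha) * smoothed[-1])
    return smoothed


def segment_breaks(smoothed: List[float], tau_seg: float) -> List[int]:
    """Positions where the smoothed similarity falls below the threshold."""
    return [t for t, s in enumerate(smoothed) if s < tau_seg]


raw = [0.92, 0.85, 0.78]                        # similarities between four frames
smoothed = ema_smooth(raw, alpha=0.6)
breaks = segment_breaks(smoothed, tau_seg=0.8)  # tau_seg is an assumed value

print([round(s, 2) for s in smoothed], breaks)
```

Under these assumptions the raw similarity 0.78 alone would fall below the threshold, but the smoothed value stays above it, so no segment break is triggered.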

Within each homogeneous segment we keep only the most representative frames by merging consecutive frames that are more similar to each other than to their neighbors.

Why require $s_i > s_{i+1}$ in addition to $s_i > \tau_{\text{merge}}$?

Without the local‑peak condition, a long stretch of uniformly high similarity could cause every adjacent pair to merge, collapsing the segment to a single frame and losing temporal nuance.

When two frames are merged we keep a weighted average of their features, giving more influence to the frame that is more similar to its neighbor.

Would a simple unweighted average work just as well?

No—unweighted averaging treats a low‑similarity frame as equally important, which can blur motion cues; weighting preserves the dominant visual signal.
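Written out for a merged pair, and consistent with the worked example later in this section, the weighted merge can be sketched as follows (the notation $\hat{F}_{i,i+1}$ is inferred from that example rather than quoted from Equation 3):

$$\hat{F}_{i,i+1} = \frac{s_i F_i + s_{i+1} F_{i+1}}{s_i + s_{i+1}}$$

When $s_i = s_{i+1}$ this reduces to the unweighted mean $(F_i + F_{i+1})/2$; otherwise the frame with the larger similarity weight contributes more to the merged features.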

After frame merging, label each segment as either dynamic (contains motion) or static (mostly background).

For every dynamic frame $F^{d}_i$, compute per‑token attention scores $A_i$ and keep the top‑$ \hat{r}$ fraction of tokens globally (Equation 4).

Rescale the selection ratio $\hat{r}$ using the compression target $r$ and the original frame count $B$ (Equation 5).

For static frames, split tokens into $M$ equal‑size windows of width $w$ (Equation 6) and keep the highest‑scoring token in each window.

Concatenate the retained dynamic tokens $\hat{F}^{d}$ and static tokens $\hat{F}^{s}$ in their original temporal order (Equation 7) to form $\hat{F}$.

Feed $\hat{F}$ to the LLM decoder.

EMA smoothing: $\hat{s}_0=0.92$, $\hat{s}_1=0.6\cdot0.85+0.4\cdot0.92=0.88$, $\hat{s}_2=0.6\cdot0.78+0.4\cdot0.88=0.82$.

Segment break: $\hat{s}_2=0.82$ is still above $\tau_{\text{seg}}$, so all four frames stay in one segment.

Middle‑frame merge: $s_0=0.92>\tau_{\text{merge}}$ and $s_0>s_1$, so merge $F_0$ and $F_1$ into $\hat{F}_{0,1}$.

Weighted merge: $\hat{F}_{0,1}= (0.92F_0+0.85F_1)/(0.92+0.85)$.

Dynamic/static labeling: assume $F_2$ is static, $F_3$ dynamic.

Dynamic top‑K: with $r=0.5$, $B=4$, and three frames left after the merge, $\hat{r}=0.5/((4-1)/4)\approx0.67$ → keep 2 of 3 tokens in $F_3$.

Static window top‑K: $w = L/\hat{r}=3/0.67\approx4.5$, so a single window; keep the highest‑scoring token of $F_2$.

Concatenation: order $\hat{F}_{0,1}$, token from $F_2$, two tokens from $F_3$ → final $\hat{F}$.
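The selection part of this walk-through can be sketched as below. It assumes the rescaled ratio is $\hat{r} = r / (N/B)$, where $N$ is the number of frames remaining after merging; this reading is inferred from the numbers in the example, since Equation 5 is not reproduced here. The per-token scores are hypothetical.

```python
from typing import List


def rescale_ratio(r: float, original_frames: int, remaining_frames: int) -> float:
    """Per-frame keep ratio after some frames were already merged away (assumed form)."""
    return r / (remaining_frames / original_frames)


def top_k_indices(scores: List[float], k: int) -> List[int]:
    """Indices of the k highest scores, returned in their original order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


r_hat = rescale_ratio(0.5, original_frames=4, remaining_frames=3)
tokens_per_frame = 3
k_dynamic = round(r_hat * tokens_per_frame)

f2_scores = [0.3, 0.2, 0.6]   # hypothetical scores for static frame F2
f3_scores = [0.1, 0.7, 0.4]   # hypothetical scores for dynamic frame F3

static_keep = top_k_indices(f2_scores, 1)          # single window: best token only
dynamic_keep = top_k_indices(f3_scores, k_dynamic)

# Recombine in temporal order: merged frame, then F2 tokens, then F3 tokens.
final = ["F01"] + [("F2", i) for i in static_keep] + [("F3", i) for i in dynamic_keep]
print(round(r_hat, 2), k_dynamic, final)
```

Keeping the original temporal order when recombining means the LLM still sees the tokens in the sequence the video produced them.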

The EMA‑driven segmenter avoids over‑splitting on brief motion spikes, while the weighted merge preserves the dominant visual content; together they enable aggressive token pruning without losing temporal coherence.

**Figure 5.** Frames compression and distribution of features. (a) Illustrates the cosine similarity changes across different frame indices for network layers at indices 6 and 20 during frame compression in the vision encoder. (b) The distribution of raw tokens, top-K sampling, and our method. This subfigure shows that our method is closer to vanilla top-K selection.

Video Sink Tokens

We detail the video‑sink‑token problem, our decoupled sampling solution, and experimental setup.

Existing video‑LLM pipelines treat all visual tokens uniformly, so “sink” tokens that constantly dominate the attention distribution waste compute and limit context diversity.

**Figure 2.** The video sink tokens. We visualize videos across datasets to illustrate the video attention sinking phenomenon: certain tokens (specific frames/regions) consistently attract disproportionately high attention (as shown in the attention score heatmaps), revealing that existing top-K-based token compression methods overlook semantic information in other frames and limit video context understanding.

Some visual tokens become permanent attention magnets: they attract a large share of the attention budget in every frame, even though they carry little new information.

Frame 1: token 5 gets $0.9$ of the attention budget; remaining $0.1$ is split among the other seven tokens.

Frame 2: token 5 again captures $0.85$, leaving $0.15$ for the rest.

After four frames, token 5 has consumed roughly $3.5$ of the total $4$ attention units, while each other token has received less than $0.1$ on average.

Aggregating across frames, the model’s effective representation is dominated by the features of token 5, despite its redundancy.

Even a single redundant token can monopolize the attention budget, dramatically inflating compute without adding new visual content.

How do “video sink tokens” differ from simply “high‑attention tokens” that appear in any image?

Sink tokens are *persistent* across time: the same token ID repeatedly dominates in every frame, whereas ordinary high‑attention tokens may be salient only in specific frames. This temporal consistency makes them a systematic inefficiency rather than an occasional focus.
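Persistence across frames is what makes a sink detectable. The sketch below flags tokens whose attention share stays above a threshold in every frame; the threshold of 0.5, the shares for frames 3 and 4, and the helper names are assumptions made for illustration (the first two shares are the ones from the example above).

```python
from typing import List


def make_frame(sink_share: float, n_tokens: int = 8, sink_idx: int = 4) -> List[float]:
    """Attention distribution where one token takes sink_share and the rest split evenly."""
    rest = (1 - sink_share) / (n_tokens - 1)
    return [sink_share if j == sink_idx else rest for j in range(n_tokens)]


def find_sink_tokens(attn: List[List[float]], threshold: float) -> List[int]:
    """Token indices whose attention share is >= threshold in every frame."""
    n_tokens = len(attn[0])
    return [j for j in range(n_tokens) if all(frame[j] >= threshold for frame in attn)]


# Frames 1-2 use the shares from the text; frames 3-4 are hypothetical.
shares = [0.9, 0.85, 0.9, 0.85]
attn = [make_frame(s) for s in shares]

sinks = find_sink_tokens(attn, threshold=0.5)   # zero-based index 4 is "token 5"
sink_total = sum(frame[4] for frame in attn)
avg_other = (len(attn) - sink_total) / 7        # mean total attention per other token

print(sinks, round(sink_total, 2), round(avg_other, 3))
```

A token that is salient in only one frame would fail the `all(...)` test, which is the distinction drawn above between a sink and an ordinary high-attention token.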

To break this pattern we introduce a decoupled sampling strategy: frames are split into a dynamic subset $\hat{F}_d$ (head and tail frames) and a static subset $\hat{F}_s$ (middle frames), each sampled with a different token‑selection scheme.

Benchmarks include MVBench, EgoSchema, LongVideoBench, and VideoMME; we report TTFT, throughput, and FLOPs to capture latency and compute efficiency.

Implementation builds on LLaVA‑OneVision (7B) with a custom Triton kernel; all experiments run on NVIDIA A100 or RTX 4090 GPUs, and TTFT is measured via Nsight Systems.

Table 1 summarizes the accuracy‑efficiency trade‑off of our method against prior token‑compression baselines; Table 2 extends the comparison across different vision backbones.

Performance and Efficiency Results

EarlyTom dramatically cuts latency and FLOPs while preserving accuracy.

EarlyTom lowers Time‑to‑First‑Token by 2.65× compared with the full‑token baseline.

Table 1 shows EarlyTom’s TTFT = 336.2 ms at a 10 % retained‑token ratio, whereas the next‑best method records ≈ 458 ms.

**Table 1.** Performance and accuracy comparison with SoTA methods across benchmarks. Best results are in bold, second-best results are underlined. Time-to-first-token is denoted as TTFT for simplicity. All efficiency results are measured on a single NVIDIA A100 GPU.

**Table 2.** Cross-backbone comparison on performance and accuracy. Best results are in bold, second-best results are underlined. Time-to-first-token is denoted as TTFT for simplicity. All efficiency results are measured on a single NVIDIA A100 GPU.

**Table.** EarlyTom is architecture-agnostic and capable of delivering strong acceleration without sacrificing quality.

**Table.** Comparison of different sampling strategies.

EarlyTom achieves superior throughput and lower TTFT without sacrificing accuracy.

Generalizability Analysis

EarlyTom’s efficiency, accuracy, and component ablations are evaluated on LLaVA‑Video‑7B and Qwen2.5‑VL‑7B.

We evaluate EarlyTom on two video‑LLM backbones, LLaVA‑Video‑7B and Qwen2.5‑VL‑7B, reporting both efficiency (FLOPs, TTFT) and accuracy, and then ablate its two core modules.

**Table.** Hyperparameters for TTFT latency decomposition analysis.

**Table 7.** Experiment results on trivial baselines and ablation studies. All results are obtained on Qwen2.5-VL-7B with a maximum of 768 frames and a retain ratio of 15%. Efficiency metrics are measured under a 23k-token context length on a single NVIDIA A100 GPU.

Removing decoupled spatial token selection drops the average score on Qwen2.5‑VL‑7B from 62.2 % to 61.4 %.

Ablation results in Table 7 show 61.4 % without spatial selection versus 62.2 % with the full system.

Removing weighted frame merging drops the average score on Qwen2.5‑VL‑7B from 62.2 % to 61.3 %.

Ablation results in Table 7 show 61.3 % without weighted merging versus 62.2 % with it.

Supplementary Results

Key ablations reveal which components drive efficiency and accuracy gains.

**Table 4.** Ablation study of different token sampling ways. We report the throughput and accuracy of three video tasks. In all results, we set the retain ratio to 0.2.

EarlyTom achieves a FLOPs‑ratio of 15 % versus 100 % for the vanilla pipeline.

Table 4 reports 15 % FLOPs for EarlyTom and 100 % for the baseline.

EarlyTom reduces Time‑to‑First‑Token to 86.4 ms, a 160 ms improvement over the baseline.

Baseline TTFT is 246.2 ms (Table 4); EarlyTom’s TTFT is 86.4 ms.

FastV lowers TTFT to 158.2 ms, cutting latency by roughly 88 ms.

Baseline TTFT 246.2 ms; FastV TTFT 158.2 ms (Table 4).

PyramidDrop raises throughput to 3 494.8 tokens/s, the highest among the non‑EarlyTom variants.

Throughput values in Table 4 show PyramidDrop at 3 494.8 tokens/s.

VisionZip attains a score % of 56.7 %, matching PyramidDrop while using only 15 % of the FLOPs.

Score % column in Table 4 lists 56.7 % for VisionZip.

HoliTom improves MVBench accuracy to 17.1, the best among all ablations.

MVBench column in Table 4 records 17.1 for HoliTom.

Prefilling ablations (Table 7) show that the variants without decoupled spatial token selection or without weighted frame merging still lower FLOPs and TTFT relative to the baseline, but the full EarlyTom configuration yields the lowest FLOPs (12.2 %) and the fastest TTFT (3 667 ms).

Detailed TTFT Latency Decomposition

We dissect TTFT latency, exposing encoder bottlenecks and EarlyTom’s gains.

Recall that inference latency is dominated by the vision encoder’s processing of redundant visual tokens. This section visualizes how EarlyTom’s early‑stage compression reshapes the TTFT breakdown.

**Figure 7.** Time-to-first-token (TTFT) comparison on the LLaVA-OneVision-7B model. We report the latency breakdown (vision encoding, visual token processing, LLM prefill, and system overhead) across different methods.

**Figure 8.** Time-to-first-token (TTFT) comparison on the LLaVA-OneVision-0.5B model. We report the latency breakdown (vision encoding, token processing, LLM prefill, and system overhead) across different methods.

The attention‑sink phenomenon appears as persistent vertical stripes in encoder heatmaps, indicating static spatial tokens that dominate attention regardless of frame dynamics. EarlyTom’s decoupled spatial token selection treats these sinks separately from dynamic regions, preserving essential structure while still discarding redundant information.

Implementation Pseudocode

Implementation details of EarlyTom’s core algorithms.

EarlyTom is built around two algorithmic stages. The first stage merges redundant video frames, and the second stage selects spatial tokens while preserving dynamic content.

Inner‑Vision Encoder Frame Merging

**Algorithm 1 Inner-Vision Encoder Frame Merging**

**Input:** Frame features $F \in \mathbb{R}^{B \times L \times D}$, hyperparameters $\alpha, \tau_{\text{seg}}, \tau_{\text{merge}}$.

**Output:** Merged frame features $\hat{F}_{\text{out}} \in \mathbb{R}^{N \times L \times D}$.

*Streaming Frame Segmentation in Equation (1)*
$\mathcal{S} \leftarrow \text{SegmentBySimilarity}(F, \alpha, \tau_{\text{seg}})$, $F_{\text{merged\_list}} \leftarrow []$
**for** each segment $S_{\text{seg}} = \{F_0, \dots, F_k\}$ in $\mathcal{S}$ **do**
  $F_{\text{mid}} \leftarrow [], i \leftarrow 1$
  *Iterate over Middle Frames within the Segment*
  **while** $i < k$ **do**
    *Compute Pairwise Frame Similarities*
    $s_i \leftarrow \text{Sim}(F_i, F_{i+1}), s_{i+1} \leftarrow \text{Sim}(F_{i+1}, F_{i+2})$
    *Middle Frame Merge Condition in Equation (2)*
    **if** $s_i > \tau_{\text{merge}}$ and $s_i > s_{i+1}$ **then**
      *Weighted Frame Merge in Equation (3)*
      $\hat{F}_m \leftarrow \text{WeightedMerge}(s_i, F_i, s_{i+1}, F_{i+1})$
      $F_{\text{mid}}.\text{append}(\hat{F}_m); i \leftarrow i + 2$
    **else**
      $F_{\text{mid}}.\text{append}(F_i); i \leftarrow i + 1$
    **end if**
  **end while**
  *Assemble Merged Segment*
  $F_{\text{seg\_out}} \leftarrow \text{Concat}(F_0, F_{\text{mid}}, F_k)$
  $F_{\text{merged\_list}}.\text{append}(F_{\text{seg\_out}})$
**end for**
*Concatenate All Merged Segments*
$\hat{F}_{\text{out}} \leftarrow \text{Concatenate}(F_{\text{merged\_list}})$
**Return** $\hat{F}_{\text{out}}$
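For readers who prefer runnable code, here is a minimal pure-Python sketch of Algorithm 1. It is an illustration under stated assumptions, not the authors' implementation: each frame is flattened to one vector, `Sim` is taken to be cosine similarity, the EMA restarts at every new segment, and a pair is merged only when $F_{i+2}$ still lies inside the segment (the pseudocode does not say how that boundary is handled). The hyperparameter values and test frames are hypothetical.

```python
import math
from typing import List, Optional

Frame = List[float]  # one frame's features flattened to a vector (simplification)


def cosine(a: Frame, b: Frame) -> float:
    """Cosine similarity; stands in for Sim(., .) in Algorithm 1 (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def segment_by_similarity(frames: List[Frame], alpha: float, tau_seg: float) -> List[List[Frame]]:
    """Open a new segment when the EMA-smoothed similarity drops below tau_seg."""
    segments = [[frames[0]]]
    ema: Optional[float] = None
    for prev, cur in zip(frames, frames[1:]):
        s = cosine(prev, cur)
        ema = s if ema is None else alpha * s + (1 - alpha) * ema
        if ema < tau_seg:
            segments.append([cur])
            ema = None  # assumption: smoothing restarts with each segment
        else:
            segments[-1].append(cur)
    return segments


def weighted_merge(s_a: float, f_a: Frame, s_b: float, f_b: Frame) -> Frame:
    """Similarity-weighted average of two frames."""
    return [(s_a * x + s_b * y) / (s_a + s_b) for x, y in zip(f_a, f_b)]


def merge_segment(seg: List[Frame], tau_merge: float) -> List[Frame]:
    """Keep head and tail frames; merge middle pairs at local similarity peaks."""
    k = len(seg) - 1
    mid: List[Frame] = []
    i = 1
    while i < k:
        if i + 1 < k:  # assumption: F_{i+2} must exist inside the segment
            s_i = cosine(seg[i], seg[i + 1])
            s_next = cosine(seg[i + 1], seg[i + 2])
            if s_i > tau_merge and s_i > s_next:
                mid.append(weighted_merge(s_i, seg[i], s_next, seg[i + 1]))
                i += 2
                continue
        mid.append(seg[i])
        i += 1
    tail = [seg[k]] if k > 0 else []
    return [seg[0]] + mid + tail


def merge_frames(frames: List[Frame], alpha: float, tau_seg: float, tau_merge: float) -> List[Frame]:
    out: List[Frame] = []
    for seg in segment_by_similarity(frames, alpha, tau_seg):
        out.extend(merge_segment(seg, tau_merge))
    return out


def unit(deg: float) -> Frame:
    """Hypothetical 2-D frame feature at a given angle, for the demo only."""
    rad = math.radians(deg)
    return [math.cos(rad), math.sin(rad)]


frames = [unit(d) for d in (0, 30, 32, 60, 90, 200, 205)]
segments = segment_by_similarity(frames, alpha=0.6, tau_seg=0.8)
merged = merge_frames(frames, alpha=0.6, tau_seg=0.8, tau_merge=0.95)
print(len(frames), len(segments), len(merged))
```

In this demo the jump from 90° to 200° opens a second segment, and the near-duplicate frames at 30° and 32° are the only pair merged, so seven frames become six.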
