X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

Peiwen Sun, Xudong Lu, Huadai Liu, Yang Bo, Dongming Wu, Huankang Guan, Minghong Cai, Jinpeng Chen, Xintong Guo, Shuhan Li, Fang Liu, Rui Liu, Xiangyu Yue

X-Stream is the first benchmark for multi-stream video understanding, revealing that current MLLMs struggle to integrate concurrent visual inputs.

How can we effectively evaluate and enable MLLMs to process and reason across multiple simultaneous video streams?

Existing video benchmarks focus on single-stream processing, leaving a gap in evaluating models that must reason across multiple simultaneous video feeds like live sports or autonomous driving. The authors introduce X-Stream, a benchmark of 4,220 QA pairs that forces models to synthesize information across multiple streams using a dual-verification pipeline to prevent reliance on single-stream shortcuts. State-of-the-art Multimodal Large Language Models (MLLMs) achieve only ~50% accuracy on this benchmark, demonstrating that current architectures lack the proactive reasoning required for multi-stream environments.

Paper Primer

The authors conceptualize MLLMs as "naive multiplexers" that must compress multiple video streams into a single token sequence. They evaluate three strategies: Spatial Division (stitching frames together), Time Division (interleaving frames), and Semantic Division (pruning tokens based on relevance and diversity).

Current MLLMs fail to perform robust multi-stream reasoning.

Performance on the X-Stream benchmark across 11 subtasks, including causal reasoning and behavior planning.

The effectiveness of multiplexing strategies is highly sensitive to constraints: Spatial Division excels at cross-stream referencing, while Semantic Division is necessary to preserve critical information when scaling to three or more streams under tight token budgets.

Why does this benchmark require a "dual-verification" pipeline?

To prevent "single-stream shortcuts," where models answer questions using only one stream despite the presence of multiple inputs. The pipeline ensures that answers are both sufficient (correct with multi-stream input) and necessary (incorrect if only one stream is provided).

What is the primary bottleneck for models in this multi-stream setting?

Models perform well on foundational perception tasks but struggle significantly with high-level logical cognition, such as causal reasoning and complex decision-making, which require synthesizing fragmented clues across streams.

We introduce X‑Stream, a multi‑stream video QA benchmark, and assess MLLMs as naive multiplexers.

X‑Stream is the first benchmark dedicated to multi‑stream streaming understanding, comprising 4,220 rigorously curated QA pairs drawn from 932 videos.

The dataset is built with a dual‑verification pipeline that prevents over‑reliance on any single stream, ensuring robust cross‑stream reasoning.

We treat multimodal large language models (MLLMs) as naive multiplexers and evaluate them through Signal Multiplexing Theory.

Beyond low accuracy, state‑of‑the‑art MLLMs also show poor proactive ability when processing concurrent streams.

Our analysis exposes a trade‑off in current multiplexing schemes and offers a practical evaluation protocol together with empirical guidance for future multi‑stream agents.

**Fig. B10:** Distribution of questions (after processing).

**Fig. B11:** Word cloud of the free-form answers (after processing).

**Fig. B12:** Examples of the human annotation interfaces in MTurk. (a) If the annotator chooses timestamp correction, the correct time range must also be provided. (b) If the annotator chooses question error, the reason for the error must also be provided.

**Fig. B13:** Error type distribution on the human evaluation.

**Table B12.** Human correction statistics on the evaluation set.

**Table B13.** The licenses for the multi-domain video/data sources used in our study.

This table categorizes mechanisms of pseudo-reference and pseudo-information redundancy in multi-stream video analysis. It includes columns for Category, Mechanism, Description, and Example.

**Table.** Comparison of spatial division strategies for Qwen-3-Omni-32B-A3B and Qwen-3-VL-32B-A3B models.

The Multi-Stream Challenge

Introducing the need to move from single‑video to multi‑stream reasoning.

Large Language Models such as ChatGPT, Gemini, and Claude have progressed from research prototypes to everyday tools, now handling real‑time single‑stream video by incrementally ingesting text and frames.

Yet many practical scenarios—office multi‑screen coordination, live‑sport broadcasting, navigation with maps and smart glasses, or synchronized shoulder‑ and wrist‑camera streams on robotic arms—demand simultaneous processing of dozens of video feeds.

Consider a World Cup broadcast with over 40 cameras; the system must automatically pick the optimal view in real time, a problem that cannot be solved by a single‑stream pipeline.

Existing multi‑video datasets lack true streaming characteristics, long durations, and precise timestamps, and they often let models rely on a single‑stream shortcut, hampering genuine multi‑stream perception.

Multi‑stream understanding is the ability to jointly reason over several concurrent video streams, fusing their visual cues so that answers depend on information spread across streams rather than on any single one.

Single‑stream matrix size: $8\times8=64$ entries; at 4 bytes each → $256$ bytes.

Four‑stream matrix size: $32\times32=1{,}024$ entries; at 4 bytes each → $4{,}096$ bytes.

The memory grows by a factor of $1{,}024/64 = 16$, illustrating how naïve scaling quickly exhausts token bandwidth.

This toy calculation shows why naïve concatenation of streams is infeasible without a clever multiplexing strategy.
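The arithmetic above can be reproduced with a short sketch. It assumes the matrix is a square token-by-token table, with 8 tokens per stream and 4-byte entries; that reading, and the function name `pairwise_matrix_bytes`, are illustrative assumptions rather than details taken from the paper.

```python
def pairwise_matrix_bytes(tokens_per_stream: int, num_streams: int,
                          bytes_per_entry: int = 4) -> int:
    """Bytes needed for a square token-by-token matrix over all streams.

    Assumption: every stream contributes the same number of tokens and all
    streams are naively concatenated into one sequence.
    """
    total_tokens = tokens_per_stream * num_streams
    return total_tokens * total_tokens * bytes_per_entry


single = pairwise_matrix_bytes(8, 1)  # 8 x 8 entries
four = pairwise_matrix_bytes(8, 4)    # 32 x 32 entries
growth = four // single
print(single, four, growth)
```

Quadrupling the number of streams quadruples the sequence length, so the matrix grows with the square of the stream count.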

**Fig. 1:** Our $X$-Stream, as the first multi-stream streaming benchmark, encompasses a diverse range of scenarios featuring multi-angle, multi-view, and multi-device capabilities. Icons in the figure mark balanced vs. imbalanced streams, same-domain vs. different-domain streams, and real-world (🌍) vs. synthesized (🤖) pairs.

**Fig. 2:** Illustration of the multi-streaming task. Panels (a) and (b) show practical examples from daily life. The multi-streaming task involves multiple videos with temporal constraints and alignment, requiring synchronized video timestamps, as shown in panel (c). Unlike multi-view and multi-angle settings, it also requires streaming properties to suit online applications.

The shift from single‑video to multi‑stream reasoning is the central challenge this work tackles.

Related Work

Survey of existing MLLM video work and the gap in multi‑stream streaming.

Recent MLLMs have achieved strong video perception, with closed‑source systems such as GPT‑5, Gemini 3 Pro, and Doubao‑2.0 leading benchmarks, while open‑source alternatives like InternVL 3.5, MiniCPM‑V 4.5, Qwen 3.5, and DeepSeek‑VL2 close the gap.

These models excel across video sub‑domains—general comprehension, spatial reasoning, and temporal reasoning—but remain limited to offline processing of complete videos, lacking online multi‑stream inference.

Streaming video understanding emerged later, building on early audio‑streaming work that enabled real‑time interaction.

Video streaming faces token‑length bottlenecks, prompting two research strands: efficient architectures (VideoLLM‑online, StreamingVLM, Streamo) and interaction‑aware designs such as Dispider’s asynchronous pipeline and MMDuet2’s multi‑turn reinforcement learning.

Large‑scale streaming datasets (HoloAssist, EgoBlind) and benchmarks (StreamingBench, OVO‑Bench, PhoStream, OmniMMI, SVBench) evaluate general streaming and proactive reasoning, while ProactiveVideoQA targets user‑centric interaction; however, all focus on single‑stream scenarios.

The literature on multi‑video and multi‑view understanding organizes tasks in a pyramid: Multi‑Video (fewest constraints), Multi‑Stream (timestamp alignment), Multi‑View (multiple perspectives of the same activity), and Multi‑Angle (different angles of the same subject simultaneously).

Representative works span the hierarchy: MVU‑Bench and video‑differencing for Multi‑Video; EgoLife, Wod‑e2e, Seamless‑interaction, and NuPlanQA for Multi‑View; Assembly101, EgoExo4D, and All‑Angle Bench for Multi‑Angle.

Although Multi‑Angle and Multi‑View constitute a small fraction of the broader Multi‑Stream category, prior approaches have evaluated them on full video files rather than in an online, real‑time setting, leaving multi‑stream streaming understanding largely unexplored.

Benchmark Design

We describe how the X‑Stream benchmark is built from multi‑stream video data and what evaluation tasks it defines.

The lack of large‑scale, multi‑stream video corpora hampers progress on models that must reason across concurrent streams; our construction pipeline fills that gap by curating and annotating diverse video streams into a unified benchmark.

The benchmark treats several simultaneous video streams as a single input, requiring a model to track, align, and combine information across streams to answer questions.

How does the X‑Stream Benchmark differ from conventional video‑QA datasets?

Standard video‑QA benchmarks present a single visual stream and ask questions about that stream alone. X‑Stream stitches together multiple concurrent streams, so a model must simultaneously attend to, filter, and fuse information from several videos to produce an answer—introducing a combinatorial reasoning challenge absent from single‑stream datasets.

At t = 0 s, the model observes the first frames of both streams (no decision yet).

At t = 2 s, Stream A shows the car within 5 m of the intersection; Stream B shows the pedestrian still waiting.

The model flags the forward condition as unmet (car has not stopped, pedestrian not crossing).

At t = 3 s, Stream A displays the car braking; Stream B shows the pedestrian stepping onto the crosswalk.

Now the forward condition is satisfied—both streams provide the required cues—so the model outputs “Yes, the car stops before the pedestrian crosses.”

Forward questions force the model to keep a running hypothesis and only answer when all streams jointly satisfy the condition, exposing any lag in cross‑stream synchronization.
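The walkthrough above can be sketched as a small loop that accumulates cues per stream and answers only once both streams satisfy the condition. The cue names, the per-second dictionaries, and `first_answer_time` are hypothetical and chosen only to mirror the example; they are not the benchmark's actual format.

```python
from typing import Dict, Optional, Set

# Hypothetical per-second cue sets for the two streams in the example.
STREAM_A: Dict[int, Set[str]] = {
    0: set(),
    2: {"car_near_intersection"},
    3: {"car_braking"},
}
STREAM_B: Dict[int, Set[str]] = {
    0: set(),
    2: {"pedestrian_waiting"},
    3: {"pedestrian_stepping_onto_crosswalk"},
}


def first_answer_time(stream_a: Dict[int, Set[str]],
                      stream_b: Dict[int, Set[str]]) -> Optional[int]:
    """Return the first timestamp at which both required cues have been seen."""
    seen_a: Set[str] = set()
    seen_b: Set[str] = set()
    for t in sorted(set(stream_a) | set(stream_b)):
        # Keep a running record of what each stream has shown so far.
        seen_a |= stream_a.get(t, set())
        seen_b |= stream_b.get(t, set())
        if ("car_braking" in seen_a
                and "pedestrian_stepping_onto_crosswalk" in seen_b):
            return t
    return None  # the forward condition was never met


answer_time = first_answer_time(STREAM_A, STREAM_B)
print(answer_time)
```

Answering before the returned timestamp would mean guessing; answering later would expose the cross-stream lag the paragraph describes.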

**Fig. 3:** Illustration of the four multi-stream abilities. To evaluate these abilities, the X-Stream benchmark includes 3 progressive dimensions and 11 subtasks.

Data Generation Pipeline

We detail the end‑to‑end data pipeline that creates high‑quality multi‑stream QA pairs.

The pipeline turns raw video into a curated set of timestamped QA pairs that require genuine cross‑stream reasoning.

Chunk A contains frames 0‑5; Chunk B contains frames 6‑11.

We generate a question “What object appears at second 3?” anchored to frame 6 (the first frame of Chunk B).

Sufficiency test feeds both chunks to the model; it answers correctly because the object is visible in Chunk B.

Necessity test feeds only Chunk A; the model fails because the relevant frame is missing.

This toy example shows how the pipeline enforces that a correct answer truly depends on the designated timestamp and on the presence of all required streams.

Resample raw video to 2 FPS and split into <50 MB segments.

Generate timestamped QA pairs via MLLM generation plus template refinement.

Apply dual verification: sufficiency (the answer is correct when all streams are provided) and necessity (the answer is incorrect when any single stream is provided alone).

Conduct two‑round expert review, editing or discarding flawed samples.
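A minimal sketch of the dual-verification step listed above, assuming the verifier is any callable that maps a question and a set of streams to an answer string. `passes_dual_verification`, `toy_answer_fn`, and the event names are hypothetical stand-ins, not the authors' implementation.

```python
from typing import Callable, Dict, List

Streams = Dict[str, List[str]]  # stream name -> observed events


def passes_dual_verification(answer_fn: Callable[[str, Streams], str],
                             question: str, streams: Streams,
                             gold: str) -> bool:
    """Keep a QA pair only if it is both sufficient and necessary."""
    # Sufficiency: with every stream available, the answer must be correct.
    sufficient = answer_fn(question, streams) == gold
    # Necessity: with any single stream alone, the answer must be wrong.
    necessary = all(
        answer_fn(question, {name: events}) != gold
        for name, events in streams.items()
    )
    return sufficient and necessary


# Hypothetical stand-in model: answers "yes" only if all required cues are visible.
REQUIRED = {
    "Does the car stop before the pedestrian crosses?":
        {"car_brakes", "pedestrian_steps_out"},
    "Is the car braking?": {"car_brakes"},
}


def toy_answer_fn(question: str, streams: Streams) -> str:
    visible = {event for events in streams.values() for event in events}
    return "yes" if REQUIRED[question] <= visible else "unknown"


streams = {"dashcam": ["car_brakes"], "street_cam": ["pedestrian_steps_out"]}
cross_stream_ok = passes_dual_verification(
    toy_answer_fn, "Does the car stop before the pedestrian crosses?",
    streams, "yes")
shortcut_ok = passes_dual_verification(
    toy_answer_fn, "Is the car braking?", streams, "yes")
print(cross_stream_ok, shortcut_ok)
```

The second question is rejected because the dashcam stream alone already yields the gold answer, which is the single-stream shortcut the pipeline is meant to filter out.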

**Algorithm 1:** X‑streams Benchmark Pipeline

**Table 1.** Statistics of multi-domain video/data sources used in our study. “$\sim$” means “approximately”. Note: one “take” consists of multiple videos, while an individual video may be reused across multiple “takes”.

Multiplexing Strategies

We detail three multiplexing strategies that let MLLMs ingest multiple video streams within a single token sequence.

MLLMs accept only a single token stream, so feeding several video feeds requires a trick to merge them without exceeding the model’s context window.

Multiplexing stitches multiple video streams into one token sequence, letting a single‑stream MLLM process them as if they were one continuous signal.

How is this different from simply concatenating all frames into one long video?

Plain concatenation mixes the streams in a single spatial plane, destroying the orthogonal cues that let the model later disentangle them. Multiplexing preserves a structured separation (e.g., distinct spatial layout or temporal tags) while still fitting inside the token budget.

Spatial division lays streams side‑by‑side in the image plane; time division alternates frames from different streams in time, each tagged with a stream identifier.

Downsample $M_t$: keep the top‑left 2×2 block → 4 pixels.

Downsample $N_t$ similarly → 4 pixels.

Concatenate along width → a 2×4 image (8 pixels total).

Tokenize each pixel → 8 tokens, which satisfies $C_{\text{max}}=10$.

Spatial division trades visual fidelity for a single‑frame representation; the model can still attend across streams because the concatenation preserves a clear left/right boundary.
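The spatial-division steps above can be sketched in plain Python. The 4×4 input size, the pixel values, and the one-token-per-pixel tokenizer are assumptions made for illustration; only the 2×2 crop, the width-wise concatenation, and $C_{\text{max}}=10$ come from the example.

```python
from typing import List

Frame = List[List[int]]  # rows of pixel values

C_MAX = 10  # token budget from the example


def crop_top_left(frame: Frame, size: int) -> Frame:
    """Keep the top-left size x size block of a frame."""
    return [row[:size] for row in frame[:size]]


def concat_width(left: Frame, right: Frame) -> Frame:
    """Place two frames of equal height side by side."""
    return [l_row + r_row for l_row, r_row in zip(left, right)]


# Hypothetical 4x4 frames for streams M and N at time t.
m_t = [[r * 4 + c for c in range(4)] for r in range(4)]
n_t = [[100 + r * 4 + c for c in range(4)] for r in range(4)]

composite = concat_width(crop_top_left(m_t, 2), crop_top_left(n_t, 2))
tokens = [pixel for row in composite for pixel in row]  # one token per pixel
print(composite, len(tokens), len(tokens) <= C_MAX)
```

The left two columns of `composite` come from stream M and the right two from stream N, which is the left/right boundary the model can attend across.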

Why does spatial division need video re‑encoding while time division does not?

Spatial division creates a new composite image that the visual encoder must process, incurring an extra encoding pass. Time division merely tags existing frames, so the encoder can reuse the original per‑stream encoding pipeline.

Instead of mixing raw pixels, we prune each stream to its most informative tokens and then interleave those tokens.

Compute pairwise similarities (assume high similarity between the first two M tokens, low elsewhere).

Form DPP kernel $K$ and evaluate marginal gains.

Select M token 1 (relevance 0.9) – highest gain.

Penalize M token 2 due to similarity with token 1; marginal gain drops below M token 3.

Select M token 3 (relevance 0.3) as second M token.

Select N token 1 (relevance 0.8) as the sole N token.

The greedy MAP picks a diverse subset: the two M tokens are not redundant, and the N token adds complementary information.

How does the DPP‑based selector differ from a naïve top‑$k$ ranking?

Top‑$k$ would simply pick the highest relevance scores, possibly choosing several nearly identical tokens. The DPP multiplies relevance by a similarity penalty, so a token that is too similar to an already‑chosen one loses marginal gain, ensuring the final set is both relevant and diverse.
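The contrast between greedy DPP selection and top-$k$ can be sketched as follows. The kernel form $L_{ij} = r_i S_{ij} r_j$, the similarity values, and the relevance of M token 2 (0.85) are assumptions chosen to mirror the worked example; the paper's actual selector may differ.

```python
from typing import List


def det(matrix: List[List[float]]) -> float:
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def greedy_dpp(relevance: List[float], similarity: List[List[float]],
               k: int) -> List[int]:
    """Greedy MAP: repeatedly add the token that maximizes det(L_S)."""
    n = len(relevance)
    kernel = [[relevance[i] * similarity[i][j] * relevance[j]
               for j in range(n)] for i in range(n)]
    selected: List[int] = []
    while len(selected) < min(k, n):
        best, best_det = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            d = det([[kernel[p][q] for q in idx] for p in idx])
            if d > best_det:
                best, best_det = i, d
        if best is None:
            break
        selected.append(best)
    return selected


def top_k(relevance: List[float], k: int) -> List[int]:
    """Naive baseline: highest relevance scores, ignoring redundancy."""
    return sorted(range(len(relevance)), key=lambda i: -relevance[i])[:k]


# Stream M: tokens 1 and 2 are near-duplicates (0-based indices 0 and 1).
m_relevance = [0.9, 0.85, 0.3]  # 0.85 is an assumed value
m_similarity = [[1.0, 0.95, 0.1], [0.95, 1.0, 0.1], [0.1, 0.1, 1.0]]
# Stream N: a budget of one token.
n_relevance = [0.8, 0.2]        # second value is assumed
n_similarity = [[1.0, 0.1], [0.1, 1.0]]

m_dpp = greedy_dpp(m_relevance, m_similarity, 2)
m_topk = top_k(m_relevance, 2)
n_dpp = greedy_dpp(n_relevance, n_similarity, 1)
print(m_dpp, m_topk, n_dpp)
```

With these assumed numbers, top-$k$ keeps both near-duplicate M tokens, while the greedy DPP swaps the second one for the less relevant but dissimilar M token 3.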

**Fig. 5:** MLLMs can only handle one token stream at a time, making a multiplexer essential for integrating multiple video streams into one token stream. To address this, we investigate three multiplexing strategies and uncover their inherent trade-offs. During evaluation, the model sequentially processes continuous video streams in 1-second intervals while maintaining a sliding memory window for context management.

**Fig. D14:** The difference in Grid Raster-scan order causes a performance gap. In general, raster-scanning can be understood as a left-to-right, top-to-bottom process. For horizontal concatenation, scanning typically results in tokens from multiple streams being interleaved within the same frame. Conversely, vertical concatenation generally prevents tokens from multiple streams from interleaving during scanning. Regardless of the method used, however, a certain amount of overlapping tokens is inevitable.

Experimental Results

Key results show MLLMs lag far behind humans on multi‑stream video understanding.

Current MLLMs achieve only around 50 % accuracy on the X‑Stream benchmark, far below the ≈ 90 % human baseline.

Table 2 shows proprietary models at ~70 % overall while the best open‑source model lags at ~55 %; the human baseline rows sit near 90 %.

**Fig. 6:** The case study in our X-Stream Benchmark. We choose a 4-stream, proactive, free-form QA (yellow) and a 2-stream, proactive, multi-choice QA (green) as examples.

Multiplexing scheme analysis reveals that spatial and time division degrade severely as stream count grows, whereas semantic division preserves core meaning under high‑density conditions.

Advantages: Spatial Division excels at temporal modeling and cross‑stream referencing; Time Division shines under relaxed token budgets and in dual‑stream scenarios; Semantic Division dominates when token constraints are tight and stream count ≥ 3.

Audio‑capable multiplexing ablation shows that spatial division causes multi‑channel audio overlap, while time division eliminates overlap but introduces semantic discontinuities in speech.

Single‑stream inference collapses on the X‑Stream benchmark; injecting distracting streams into a single‑stream dataset leads to severe degradation, confirming the necessity of genuine multi‑stream processing.

Discussion: public video datasets lack the precision and synchronization needed for robust multi‑stream training, and existing multiplexing strategies still trade off video comprehension against temporal reasoning.

Conclusion: X‑Stream establishes a rigorous dual‑verification benchmark; current MLLMs reach only ~50 % accuracy and struggle with proactive tasks, while the trade‑offs among spatial, time, and semantic division guide future architecture design.

Supplementary Material

Supplementary code release and detailed appendix organization.

We release a preview version of the evaluation code in the attached compressed file. We also provide additional information for the reader’s reference, including data sources, data analysis, data previews, and empirical observations.

X‑Stream is presented as the first benchmark for multi‑stream understanding, offering a comprehensive framework to measure perception, understanding, and reasoning across multiple streams.

The appendix is organized as follows: A) Multi‑Stream Data Preview and details of the X‑Stream benchmark; B) Data Sources, including a taxonomy of cross‑stream reasoning tasks, scenario‑specific QA tasks, dataset statistics, annotator details, annotation protocol, correction statistics, and licensing information; C) Experiment Detail; D) Additional observations and analyses covering single‑stream shortcut cases, grid layout and spatial division, temporal embedding for time division, the evaluation prompt for LLM‑as‑a‑Judge, and qualitative QA examples.

Benchmark Details

Detailed data sources, taxonomy, and scenario‑specific QA tasks for the X‑Stream benchmark.

Section B.1 enumerates the raw video material used to build X‑Stream. After aggregating roughly 857 hours from 20 distinct sources, a strict screening pipeline retained about 160 hours, as summarized in Tables B8 and B9.

Table B8 lists each domain (e.g., Driving, Sports, Robot) together with the concrete datasets or public collections that supplied the footage.

Table B9 breaks down the retained material by source, reporting the number of takes, shot hours, and total video hours; the grand total is 160.30 hours across 451 takes.

Live‑Streaming Data focuses on reaction videos where top‑10 streamers (iShowSpeed, Kai Cenat, Tyler1, etc.) comment on external video or gameplay content; clips are limited to 5–30 minutes to keep samples comparable.

Multi‑Stream Game Data gathers gameplay‑only recordings for two game families—competitive esports titles (CS 2, Mario Kart 8, League of Legends) and other popular games (A Way Out, It Takes Two, Split Fiction). When public multi‑view footage is unavailable, the authors synthesize additional viewpoints using the Source Engine’s HLAE controller and OBS capture.

Car View with Dashboard constructs a paired video‑telemetry set from the comma2k19 driving dataset. CAN speed timestamps serve as the reference axis; missing control signals (throttle, brake, gear, RPM) are analytically inferred from longitudinal speed and acceleration.

Map and Street View creates synchronized street‑view and map videos by sampling random origin‑destination pairs, retrieving panoramas and static maps for each waypoint, and discarding any mismatched frames to guarantee a strict one‑to‑one alignment.

Section B.2 introduces a taxonomy of cross‑stream reasoning tasks (Table B10) grouped into four families: Cross‑stream Interference, Multi‑stream Cooperation, Cross‑stream Reference, and Single‑stream Understanding, each with two concrete sub‑tasks.

Table B10 details the logical pattern of each sub‑task (e.g., Noise Filtering, Complementary Reasoning, Cross‑view Localization) and how information from multiple streams is combined, contrasted, or linked to produce an answer.

Section B.3 defines scenario‑specific QA tasks (Table B11) that map real‑world applications to the taxonomy. The three scenario groups cover different angles of the same object, different views of the same behavior, and different devices for the same goal.

Each scenario lists a primary task (e.g., manipulation failure diagnosis, causal driving explanation, geo‑localization) together with an illustrative cross‑stream question that demonstrates why multi‑stream reasoning is essential.

Dataset Statistics

Appendix details dataset stats, annotation process, and additional analyses.

The original X‑Stream collection was gathered to maximize distributional breadth and preserve real‑world diversity before any cleaning.

After filtering, the final dataset presents a balanced mix of categories, supporting both multiple‑choice and open‑ended answer formats for comprehensive evaluation.

**Fig. A7:** Data Preview. This preview highlights the main real-world multi-stream applications and offers an overview of the diversity of our X-Stream.

Human annotation was performed by 31 domain experts, each compensated at \$18 per hour.

Annotators followed a verification‑and‑correction pipeline: they inspected the synchronized multi‑stream clip, judged question validity, checked answer correctness, and either marked “No Error” or selected a specific error type for correction.

Overall, 25.6% of evaluation instances required manual correction; after fixing these, the dataset achieved 94.5% accuracy.

Among the corrected cases, question errors dominated (65.2%), followed by answer errors (21.7%) and timestamp errors (13.0%).

**Table B9.** Statistics of video sources in X-Stream.

The table lists various data sources and their corresponding metrics: Takes Count, Shot Hours, and Video Hours.

Licensing review found CC BY 4.0 and Apache 2.0 to be the most common open‑source terms; X‑Stream adopts these licenses per source.

Token budgets differ across models: Gemini uses a fixed 263 tokens per second of video, Qwen3‑VL uses 28×28‑pixel patches with merging, and GPT‑5 allocates 85 tokens per frame plus 170 tokens per 512×512 tile.

To stay within model limits, videos are resized so that GPT respects $C_{\text{max}} = 250$ with a maximum edge of 512 pixels; Qwen caps at $511\times383$, while Gemini’s token count does not depend on resolution.
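A rough sketch of the per-model token accounting described above, using only the constants quoted in the text (85 tokens per frame plus 170 per 512×512 tile; 263 tokens per second). The function names are hypothetical, and the tile count is simplified to a ceiling division without any provider-side rescaling, so real API costs may differ.

```python
import math


def gpt_style_frame_tokens(width: int, height: int, base: int = 85,
                           per_tile: int = 170, tile: int = 512) -> int:
    """Base cost per frame plus a fixed cost per 512x512 tile (simplified)."""
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return base + per_tile * tiles


def gemini_style_video_tokens(seconds: int,
                              tokens_per_second: int = 263) -> int:
    """Fixed token rate per second of video, independent of resolution."""
    return seconds * tokens_per_second


one_tile = gpt_style_frame_tokens(512, 384)    # fits in a single tile
two_tiles = gpt_style_frame_tokens(1024, 512)  # spans two tiles
ten_seconds = gemini_style_video_tokens(10)
print(one_tile, two_tiles, ten_seconds)
```

Under this simplification, the tile-based cost changes only when a frame crosses a 512-pixel boundary, whereas the per-second rate scales purely with clip length.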

Section D aggregates extra observations that did not fit the main experimental narrative.

D.1 identifies “shortcut” cases where a model can answer temporal questions using stable global context or redundant cues, without truly grounding events across streams.

D.2 compares spatial‑division stitching strategies: vertical stitching keeps streams more separable, while horizontal stitching interleaves tokens from different streams at the same timestep, leading to poorer global coherence.

The advantage of vertical stitching is attributed to raster‑order traversal (row‑major flattening) preserving temporal ordering while minimizing cross‑stream token mixing.
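The raster-order argument can be illustrated with a toy grid of patch labels. The 2×2 patch grids and the helper names are assumptions for illustration; the sketch only counts how often the flattened sequence switches from one stream to the other.

```python
from typing import List

Grid = List[List[str]]  # each cell holds the label of the stream it came from


def stitch(a: Grid, b: Grid, axis: str) -> Grid:
    """Stitch two patch grids horizontally (side by side) or vertically (stacked)."""
    if axis == "horizontal":
        return [row_a + row_b for row_a, row_b in zip(a, b)]
    if axis == "vertical":
        return a + b
    raise ValueError("axis must be 'horizontal' or 'vertical'")


def raster_order(grid: Grid) -> List[str]:
    """Row-major flattening: left to right, then top to bottom."""
    return [cell for row in grid for cell in row]


def stream_switches(sequence: List[str]) -> int:
    """Count positions where consecutive tokens come from different streams."""
    return sum(1 for prev, cur in zip(sequence, sequence[1:]) if prev != cur)


stream_a = [["A", "A"], ["A", "A"]]
stream_b = [["B", "B"], ["B", "B"]]

horizontal = raster_order(stitch(stream_a, stream_b, "horizontal"))
vertical = raster_order(stitch(stream_a, stream_b, "vertical"))
print("".join(horizontal), stream_switches(horizontal))
print("".join(vertical), stream_switches(vertical))
```

In this toy case the horizontal layout alternates between streams on every row, while the vertical layout keeps each stream's tokens contiguous.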

D.3 shows that assigning identical timestamps to tokens from different streams collapses temporal distinction and causes up to a 30% performance drop.

Consequently, the authors enforce continuous timestamps across streams, which restores the model’s ability to differentiate moments.
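One way to picture the difference between shared and continuous timestamps in time division is the sketch below. It assumes "continuous" means a single monotonically increasing index over the interleaved frames; the paper's exact temporal-embedding scheme is not given here, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Tagged = Tuple[int, str, str]  # (temporal index, stream id, frame)


def interleave_shared(streams: Dict[str, List[str]]) -> List[Tagged]:
    """Frames captured at the same step share one temporal index."""
    length = min(len(frames) for frames in streams.values())
    return [(step, name, streams[name][step])
            for step in range(length) for name in streams]


def interleave_continuous(streams: Dict[str, List[str]]) -> List[Tagged]:
    """Every interleaved frame receives its own, strictly increasing index."""
    length = min(len(frames) for frames in streams.values())
    sequence: List[Tagged] = []
    for step in range(length):
        for name in streams:
            sequence.append((len(sequence), name, streams[name][step]))
    return sequence


streams = {"A": ["a0", "a1"], "B": ["b0", "b1"]}
shared = interleave_shared(streams)
continuous = interleave_continuous(streams)
print(shared)
print(continuous)
```

With shared indices, frames `a0` and `b0` are indistinguishable in time and differ only by their stream tag; with continuous indices, every frame occupies a unique temporal position.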

**Fig. B8:** Distribution of the original data (before processing).

E.1 presents the LLM‑as‑a‑Judge prompt used for automatic evaluation, with a 0–5 scoring rubric and JSON output requirement.

The rubric rewards coherent, factually grounded explanations and penalizes factual errors or missing causal links.

**Table B10.** Taxonomy of Cross-stream Tasks and Core Logic. We organize task types according to the four core capabilities defined in X-Stream, based on how information from different streams contributes to the final answer.

**Table B11.** Representative settings given to LLMs as few-shot learners.
