Streaming Communication in Multi-Agent Reasoning

Zhen Yang, Xiaogang Xu, Wen Wang, Cong Chen, Xander Xu, Ying-Cong Chen

Streaming reasoning steps between agents improves both latency and accuracy by filtering out error-prone late-stage reasoning.

Can we reduce multi-agent reasoning latency by streaming individual reasoning steps between agents instead of waiting for full agent responses?

Multi-agent reasoning systems typically use a "generate-then-transfer" protocol, forcing downstream agents to wait for the entire upstream response. This serial bottleneck increases latency and forces agents to condition on the full reasoning chain, including error-prone late steps. STREAMMA replaces this with a streaming protocol that forwards each reasoning step as soon as it is generated. This pipelining reduces idle time and allows downstream agents to begin reasoning from a reliable prefix, diluting the impact of degraded tail steps. Across eight benchmarks, this approach consistently outperforms serial baselines, achieving an average accuracy gain of +7.3 percentage points on Claude Opus 4.6.

Paper Primer

The core mechanism hinges on the non-uniform quality of multi-step reasoning: early steps are generally reliable, while later steps often degrade. By streaming steps, the system acts like a filter, letting downstream agents build their own trajectory from the reliable head before the unreliable tail arrives to confuse the context.

Streaming reasoning steps improves task-level accuracy over serial baselines.

Across eight benchmarks, STREAMMA outperformed the serial protocol by an average of +7.3 percentage points on Claude Opus 4.6 and +1.5 percentage points on GPT-5.4. Peak improvement reached +22.4 percentage points on the HMMT 2026 benchmark.

Increasing per-agent steps (S) provides a new, orthogonal scaling dimension.

At a fixed agent count of 64, increasing per-agent steps from the default to 64 improved accuracy by +5.3 percentage points while achieving a 26.9× wall-clock speedup. The speedup reached 83% of the theoretical upper bound derived in the paper.

Why does receiving less information (partial steps) lead to better reasoning than receiving the full response?

In complex reasoning, late-stage steps often degrade in quality. The serial protocol forces downstream agents to condition on these error-prone steps, whereas streaming allows them to prioritize the reliable early steps, effectively diluting the impact of later mistakes.

Is this streaming protocol universally better than the standard serial approach?

No. The paper’s Theorem 1 identifies six regimes; streaming is provably optimal only when the reasoning chain exhibits a "head-strong, tail-weak" correctness pattern. In cases where the entire chain is highly reliable, the serial protocol remains superior.

Practitioners should treat reasoning-step granularity as a tunable design axis. For tasks with declining step-level reliability, streaming is a drop-in replacement that simultaneously accelerates inference and boosts accuracy.

Introduction and Motivation

Multi‑agent pipelines stall on full‑response transfers, inflating latency.

Existing multi‑agent systems adopt a “generate‑then‑transfer” (Serial) protocol: an upstream agent must finish its entire response before any downstream agent can begin, causing latency to grow linearly with pipeline depth.

Agents exchange whole responses only after the sender has completed its reasoning, so the receiver sits idle until the full message arrives.

STREAMMA replaces this Serial protocol with Stream: each reasoning step is transmitted immediately, allowing downstream agents to start work while upstream agents are still generating later steps.

Because early steps tend to be reliable and later steps degrade, Stream lets downstream agents base their reasoning on the high‑quality prefix, avoiding contamination from noisy tail steps; the downstream can still incorporate later steps if they prove useful.

**Figure 1.** Communication protocols. (a) Serial: the downstream agent receives the upstream agent's complete response before execution. (b) Stream: the downstream agent receives each upstream reasoning step as it is generated, enabling pipelined execution.

Our closed‑form analysis (Theorems 1‑3) quantifies these gains: Stream dominates Serial in effectiveness when early steps are trustworthy and later steps are not, achieves a speedup bounded above by $AS/(S+A-1)$ for $A$ agents with $S$ steps each, and incurs a modest cost ratio under typical LLM serving conditions.

Empirically, STREAMMA improves average effectiveness by +7.3 percentage points over Serial across eight benchmarks on Claude Opus 4.6, with a peak gain of +22.4 pp on HMMT 2026, while also cutting wall‑clock latency.

Beyond the protocol, we uncover a step‑level scaling law: increasing the number of reasoning steps per agent consistently boosts both effectiveness and speedup, providing a new axis orthogonal to the classic agent‑count scaling.

Latency bottlenecks in multi‑agent pipelines stem from waiting for full responses; streaming steps eliminates this wait.

Related Work

We situate STREAMMA among prior multi‑agent and streaming approaches, highlighting the novel step‑level design axis.

Multi‑agent reasoning and communication has become a mainstream paradigm for complex LLM tasks, with prior work advancing along three orthogonal axes: communication topology, the content exchanged (e.g., intermediate rationales or KV‑cache representations), and the number of agents.

All of these approaches share a generate‑then‑transfer assumption: an upstream agent must finalize its response before any downstream agent can act, eliminating pipeline parallelism. We instead refine the granularity of communication from full responses to individual reasoning steps, introducing a new design axis—per‑agent step count—that is orthogonal to agent‑count scaling.

Step‑by‑step reasoning is now standard, and recent work shows that step quality is position‑dependent: accuracy peaks at an optimal length and degrades beyond it. Our Theorem 1 leverages this property to derive a closed‑form effectiveness ordering of Single, Serial, and Stream modes, identifying conditions under which streaming is provably optimal.

Pipeline parallelism has long been used for distributed training, and streaming inference for LLMs has been pursued at the intra‑agent level via speculative decoding, Group Think, and Multi‑Stream LLMs. At the inter‑agent level, prior work includes skeleton expansion, speculative agent actions, and staircase streaming, but these methods treat streaming solely as a speedup mechanism, with effectiveness gains only incidental. STREAMMA operates at the reasoning‑step level over arbitrary multi‑agent DAGs—coarser than token‑level streaming yet finer than pre‑decomposed skeletons—and is the first to demonstrate simultaneous improvements in effectiveness and latency.

The STREAMMA Mechanism

We expose the streaming trick that eliminates waiting, then prove why it improves correctness and latency.

The bottleneck in multi‑agent pipelines is the blocking call that forces each agent to wait for the full response of its predecessor before it can start.

Instead of waiting for an entire answer, each agent streams individual reasoning steps as soon as they are produced, letting downstream agents start work immediately.

How does STREAMMA differ from the classic generate‑then‑transfer pipeline?

Generate‑then‑transfer sends a complete LLM output as a single message, incurring a full‑response wait. STREAMMA emits each step as soon as it is ready and forwards it immediately, turning the pipeline into a true producer‑consumer stream.

Agent 1 generates $s_1^{(1)}$ and puts it into queue 2.

Agent 2 reads $s_1^{(1)}$, appends it to its context, and begins generating $s_1^{(2)}$.

While Agent 2 works on $s_1^{(2)}$, Agent 1 produces $s_2^{(1)}$ and pushes it to queue 2.

Agent 2 now has $s_1^{(1)}$ and $s_2^{(1)}$ in its context while still computing $s_1^{(2)}$, illustrating overlapping computation.

After all three steps of Agent 1 are streamed, Agent 2 finishes its three steps. With $A=2$ agents, $S=3$ steps each, and one time unit per step, the chain completes in $S + A - 1 = 4$ time units instead of $A\cdot S = 6$.

The overlap reduces the critical path from $A\cdot S$ to $S + A - 1$ step slots, so the speedup is bounded above by $A S/(S+A-1)$.
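The two-agent walk-through can be reproduced with a small scheduling recurrence. The sketch below is an idealization, not the paper's implementation: it assumes every step takes exactly one time unit, communication is free, and step $j$ of agent $a$ may start once agent $a$ has finished its own step $j-1$ and agent $a-1$ has finished its step $j$. The function names are hypothetical.

```python
def serial_makespan(num_agents: int, num_steps: int) -> int:
    """Serial protocol: agents run back to back, one unit per step."""
    return num_agents * num_steps


def stream_makespan(num_agents: int, num_steps: int) -> int:
    """Stream protocol under the unit-time idealization described above."""
    # finish[a][j] = time at which agent a completes step j (1-indexed);
    # row 0 and column 0 are zero-valued padding.
    finish = [[0] * (num_steps + 1) for _ in range(num_agents + 1)]
    for a in range(1, num_agents + 1):
        for j in range(1, num_steps + 1):
            # Wait for own previous step and for the upstream agent's step j.
            start = max(finish[a][j - 1], finish[a - 1][j])
            finish[a][j] = start + 1
    return finish[num_agents][num_steps]


if __name__ == "__main__":
    print(serial_makespan(2, 3), stream_makespan(2, 3))
```

Under these assumptions the recurrence gives `finish[a][j] = a + j - 1`, so the last step of the last agent completes at $S + A - 1$.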

In the serial baseline each agent blocks until its predecessor returns a full response, so agents run one after another.

Why does the serial baseline still appear in evaluations if it is clearly slower?

It serves as a controlled reference that isolates the effect of streaming; any improvement can be attributed to the removal of the blocking wait rather than to changes in model prompts or architecture.

Each reasoning step $j$ is correct with probability $p_j$; correctness propagates downstream, raising or lowering the quality of later steps.

What does the threshold $p^{*}$ represent in practice?

It is the break‑even correctness probability: above $p^{*}$ a step is more likely to help than hurt downstream reasoning, below it the step is expected to degrade later performance.

Let $\bar p$, $p_{\text{head}}$, $p_{\text{tail}}$ be the uniform, head‑weighted, and tail‑weighted means of $\{p_j\}$, and let $p^{*}$ be the break‑even threshold. The mean step‑level correctness $\text{sCorr}^{\text{mode}}$ for mode $\in\{\text{stream},\text{serial},\text{single}\}$ obeys the six‑case ordering stated in Theorem 1.

**Theorem 1 (Effectiveness Ordering).** Depending on how $\bar{p}$, $p_{\text{head}}$, $p_{\text{tail}}$ (uniform, head-weighted, and tail-weighted step-correctness means) compare to $p^*$, the sCorr ordering among the three modes falls into six cases:

- (I) Stream advantage [$p_{\text{head}} > p^*$ and $p_{\text{tail}} < p^*$]:
  - (a) If $\bar{p} > p^*$: $\text{sCorr}^{\text{stream}} > \text{sCorr}^{\text{serial}} > \text{sCorr}^{\text{single}}$
  - (b) If $\bar{p} < p^*$: $\text{sCorr}^{\text{stream}} > \text{sCorr}^{\text{single}} > \text{sCorr}^{\text{serial}}$
- (II) Serial advantage [$\bar{p} > p^*$ and $p_{\text{tail}} > p^*$]:
  - (a) If $p_{\text{head}} > p^*$: $\text{sCorr}^{\text{serial}} > \text{sCorr}^{\text{stream}} > \text{sCorr}^{\text{single}}$
  - (b) If $p_{\text{head}} < p^*$: $\text{sCorr}^{\text{serial}} > \text{sCorr}^{\text{single}} > \text{sCorr}^{\text{stream}}$
- (III) Single advantage [$p_{\text{head}} < p^*$ and $\bar{p} < p^*$]:
  - (a) If $p_{\text{tail}} < p^*$: $\text{sCorr}^{\text{single}} > \text{sCorr}^{\text{stream}} > \text{sCorr}^{\text{serial}}$
  - (b) If $p_{\text{tail}} > p^*$: $\text{sCorr}^{\text{single}} > \text{sCorr}^{\text{serial}} > \text{sCorr}^{\text{stream}}$

Intuitively, streaming shines when early steps are reliable (high $p_{\text{head}}$) but later steps degrade (low $p_{\text{tail}}$); serial excels when all steps are uniformly reliable; single‑agent wins when every upstream step is harmful.
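The case analysis of Theorem 1 is mechanical enough to write as a lookup. The sketch below takes the three means and the threshold as given inputs (the weighting used to compute $p_{\text{head}}$ and $p_{\text{tail}}$ is not reproduced here) and returns the case label with the predicted ordering from best to worst. The function name and return format are hypothetical. Ties with $p^{*}$ and combinations not listed in the theorem return `None`.

```python
from typing import Optional, Tuple

Ordering = Tuple[str, str, str]  # modes listed from highest to lowest sCorr


def theorem1_case(
    p_bar: float, p_head: float, p_tail: float, p_star: float
) -> Optional[Tuple[str, Ordering]]:
    """Map (p_bar, p_head, p_tail) relative to p_star onto a Theorem 1 case."""
    if p_star in (p_bar, p_head, p_tail):
        return None  # the theorem only states strict inequalities

    head_hi = p_head > p_star
    mean_hi = p_bar > p_star
    tail_hi = p_tail > p_star

    if head_hi and not tail_hi:  # (I) Stream advantage
        if mean_hi:
            return ("I.a", ("stream", "serial", "single"))
        return ("I.b", ("stream", "single", "serial"))
    if mean_hi and tail_hi:  # (II) Serial advantage
        if head_hi:
            return ("II.a", ("serial", "stream", "single"))
        return ("II.b", ("serial", "single", "stream"))
    if not head_hi and not mean_hi:  # (III) Single advantage
        if not tail_hi:
            return ("III.a", ("single", "stream", "serial"))
        return ("III.b", ("single", "serial", "stream"))
    return None  # combination not covered by the six cases as stated


if __name__ == "__main__":
    # Illustrative values only: strong head, weak tail, low overall mean.
    print(theorem1_case(p_bar=0.4, p_head=0.8, p_tail=0.2, p_star=0.5))
```

A practitioner with estimates of the three means could use a check like this to decide which protocol to run for a given workload.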

**Algorithm 1** SERIAL EXECUTION

**Require:** $Q$; $(\text{Agent}^a, \text{ctx}_a, \text{queue}_a)_{a=1}^A$

- $Q$: query
- $\text{ctx}_a$: per-agent context
- $\text{queue}_a$: FIFO queue
- $\text{chain}$: $\text{Agent}^1 \to \dots \to \text{Agent}^A$

$msg \leftarrow Q$
**for** $a = 1$ to $A$ **do**
$\quad \text{ctx}_a.\text{append}(msg)$
$\quad \triangleright$ wait; complete output
$\quad msg \leftarrow \text{LLM}(\text{ctx}_a)$
**end for**

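**Algorithm 2** STREAM EXECUTION

$\text{queue}_1.\text{put}(Q)$
$\triangleright$ all agents concurrent
**for** $a = 1$ to $A$ **in parallel do**
$\quad$ **while** $msg \leftarrow \text{queue}_a.\text{get}()$ **do**
$\quad\quad \text{ctx}_a.\text{append}(msg)$
$\quad\quad \triangleright$ yield step-by-step
$\quad\quad steps \leftarrow \text{LLM}(\text{ctx}_a, \text{stream}=\text{True})$
$\quad\quad$ **for each** $step$ **from** $steps$ **do**
$\quad\quad\quad$ **if** $a < A$ **then**
$\quad\quad\quad\quad \triangleright$ push; no wait
$\quad\quad\quad\quad \text{queue}_{a+1}.\text{put}(step)$
$\quad\quad\quad$ **end if**
$\quad\quad\quad \triangleright$ KV cache reuse
$\quad\quad\quad \text{ctx}_a.\text{append}(step)$
$\quad\quad$ **end for**
$\quad$ **end while**
**end for**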
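The stream protocol maps naturally onto one thread and one FIFO queue per agent. The following is a minimal sketch of that producer-consumer structure, not the authors' code. It rests on several assumptions: the LLM is replaced by a caller-supplied generator `llm_stream`, a `None` sentinel marks the end of an upstream stream (Algorithm 2 leaves termination implicit), and the toy model emits three steps for the query at the first agent and one step per received message at every later agent. All names are hypothetical.

```python
import queue
import threading
from typing import Callable, Iterator, List

_DONE = None  # assumed end-of-stream sentinel; not part of Algorithm 2

StepFn = Callable[[int, List[str]], Iterator[str]]


def run_stream(query: str, num_agents: int, llm_stream: StepFn) -> List[List[str]]:
    """Run a chain of agents that forward each step as soon as it exists."""
    queues: List[queue.Queue] = [queue.Queue() for _ in range(num_agents)]
    contexts: List[List[str]] = [[] for _ in range(num_agents)]
    outputs: List[List[str]] = [[] for _ in range(num_agents)]

    def worker(a: int) -> None:
        while (msg := queues[a].get()) is not _DONE:
            contexts[a].append(msg)
            for step in llm_stream(a, contexts[a]):
                if a + 1 < num_agents:
                    queues[a + 1].put(step)  # push downstream without waiting
                contexts[a].append(step)  # own step stays in context
                outputs[a].append(step)
        if a + 1 < num_agents:
            queues[a + 1].put(_DONE)  # propagate end-of-stream

    threads = [threading.Thread(target=worker, args=(a,)) for a in range(num_agents)]
    for t in threads:
        t.start()
    queues[0].put(query)
    queues[0].put(_DONE)
    for t in threads:
        t.join()
    return outputs


def toy_llm(a: int, ctx: List[str]) -> Iterator[str]:
    """Stand-in for a streaming LLM call (illustrative policy only)."""
    tag = f"^({a + 1})"
    if a == 0:
        for j in range(1, 4):  # first agent: three steps for the query
            yield f"s{j}{tag}"
    else:
        own = sum(1 for c in ctx if c.endswith(tag))
        yield f"s{own + 1}{tag}"  # later agents: one step per message


if __name__ == "__main__":
    print(run_stream("Q", 2, toy_llm))
```

Because each worker blocks only on its own queue, Agent 2 can start as soon as the first upstream step arrives, which is the overlap the protocol relies on. A real deployment would replace `toy_llm` with a streaming model call and would need its own rule for splitting token streams into steps.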

Streaming steps removes the full‑response wait, yielding up to $A S/(S+A-1)$ latency speedup while preserving or improving step‑level correctness.

Effectiveness Analysis

STREAMMA consistently beats Serial in accuracy while keeping prompts unchanged.

STREAMMA outperforms Serial on every average‑accuracy cell across all benchmarks and topologies.

Table 1 shows higher Avg. scores for STREAMMA than for Serial on both the Claude Opus 4.6 and GPT‑5.4 backbones, with a +7.3 pp average gain over Serial on Claude Opus 4.6.

**Table 1.** Effectiveness across eight benchmarks and three topologies. STREAMMA (gray) vs. Single and Serial on Claude Opus 4.6 (high) and GPT-5.4 (medium). Avg. (%): unweighted mean; bold: higher in each {Topology, Method} pair. Single: one row per backbone, no topology axis. Each cell averages 3 runs (8 on AIME 2025/26 and HMMT 2026 due to small test sets). STREAMMA outperforms both baselines in every Avg. cell.

STREAMMA maintains or improves accuracy over Serial across all benchmarks.

Case Studies

We test how removing STREAMMA’s streaming step changes accuracy and error propagation.

Recall that STREAMMA streams each reasoning step to downstream agents, avoiding the latency of waiting for a full answer as Serial does. This section ablates the streaming component to see how accuracy degrades when only full outputs are passed.

**Figure 2.** Case study for Theorem 1. (a) Verdicts of Agent^1 ($\checkmark: p_j=1; \times: p_j=0$). (b) $\bar{p}/p_{head}/p_{tail}$ place this run in Case I.b, the Stream-advantage regime.

When the tail of Agent 1 is erroneous, STREAMMA gains up to +24.0 pp over Serial.

Figure 3 reports a +24.0 pp accuracy increase for the tail‑perturbed mask (Case I.b).

**Figure 3.** Step-Level Perturbation. Fixing Agent$^1$'s output, we perturb its steps and measure Agent$^2$'s accuracy; green / red mark Stream's gain / loss over Serial.

When the head of Agent 1 is corrupted, STREAMMA loses up to 36.0 pp compared to Serial.

Figure 3 shows a 36.0 pp drop for the head‑perturbed mask (Case II/III).

Efficiency and Scaling

Speedup scales with pipeline depth and step count, delivering up to 26.9× acceleration.

Measured speedup reaches 26.9× at the extreme A=64, S=64 configuration.

Four independent runs on HMMT 2026 show an average wall‑clock speedup of 26.9×, which is 83% of the theoretical bound 32.3×.

The empirical curve follows the theoretical AS / (S + A − 1) scaling, confirming Theorem 2; however, auto‑selected step counts fall far short because current LLMs do not increase S without explicit prompting.

When many agents run in a pipeline, each additional step adds work that can be overlapped, so overall latency shrinks by a factor of roughly AS / (S + A − 1) relative to Serial.

Why doesn’t the speedup grow linearly with A × S?

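Because the pipeline has to fill and drain: the last agent cannot start until the first step has passed through the $A - 1$ agents ahead of it, and it then still needs $S$ step slots of its own. The denominator $S + A - 1$ captures this unavoidable sequential portion, so the bound saturates at $A$ as $S$ grows and at $S$ as $A$ grows, and the gain is sub‑linear in $A \times S$.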
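The arithmetic behind these numbers is easy to check. A short sketch, assuming only the bound $AS/(S+A-1)$ from Theorem 2 and the measured 26.9× figure reported above; the helper name is hypothetical:

```python
def speedup_bound(num_agents: int, num_steps: int) -> float:
    """Theoretical maximum Stream-over-Serial speedup, A*S / (S + A - 1)."""
    return num_agents * num_steps / (num_steps + num_agents - 1)


if __name__ == "__main__":
    bound = speedup_bound(64, 64)       # 4096 / 127
    print(round(bound, 1))              # theoretical bound at A = S = 64
    print(round(26.9 / bound, 2))       # fraction of the bound achieved
    # Saturation: with A fixed, the bound approaches A as S grows.
    for steps in (1, 8, 64, 512, 4096):
        print(steps, round(speedup_bound(64, steps), 2))
```

With $S=1$ the bound is exactly 1 (no overlap is possible), and with $A$ fixed it approaches but never reaches $A$, which is why adding steps yields diminishing returns once the pipeline is saturated.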

**Figure 4.** Step-level scaling law. Left: speedup scaling in $S$; measured (solid) vs. theoretical maximum speedup $AS/(S+A-1)$ from Theorem 2 (dashed). Right: accuracy scaling in $S$, with avg marginals (main block boxed).

Stream×4 attains higher accuracy at roughly half the cost of Serial×16.

Cost‑accuracy measurements show Stream×4 ($\$2.75$, 90.9%) versus Serial×16 ($\$5.46$, 89.4%).

The Pareto frontiers in Figure 5 illustrate that streaming not only cuts cost but also preserves or improves accuracy; the shaded region shows how increasing cache hit rate shifts the frontier leftward without sacrificing performance.

**Figure 5.** Cost–accuracy Pareto frontiers. Each frontier tracks accuracy vs. cost as $N \in \{1, 4, 16\}$ chain replicas run in parallel and majority-vote on the final answer; larger $N$ trades higher compute for higher accuracy. Red shaded: KV-cache hit rate $h \in (0, 1)$, bounded by Stream $\times N$ at $h=0$ (solid) and $h=1$ (dashed).
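The replica scheme in Figure 5 amounts to running $N$ independent chains and taking the most frequent final answer. A minimal sketch, assuming answers are plain strings and that ties go to the answer seen first (the source does not state a tie-breaking rule); `majority_vote` is a hypothetical helper name:

```python
from collections import Counter
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent final answer among N chain replicas."""
    if not answers:
        raise ValueError("need at least one replica answer")
    # Counter.most_common keeps first-seen order among equal counts.
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    # Illustrative replica outputs for N = 4.
    print(majority_vote(["B", "D", "B", "C"]))
```

Larger $N$ multiplies the token cost by roughly $N$, which is the compute-for-accuracy trade that each frontier in the figure traces.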

Speedup scales with pipeline depth and step count, enabling orders‑of‑magnitude latency reductions.

Conclusion and Limitations

We outline STREAMMA’s scope, scaling insight, and when it fails.

Streaming reduces latency and improves effectiveness because the quality of early reasoning steps matters more than the sheer amount of context later on. This counter‑intuitive finding is supported by three closed‑form theorems that bound effectiveness, speedup, and cost.

Beyond the protocol, we discovered a step‑level scaling law: as the number of reasoning steps grows, effectiveness and speedup improve monotonically, providing a design dimension orthogonal to simply adding more agents.

STREAMMA assumes a task can be broken into discrete reasoning steps; chain‑of‑thought prompting satisfies this, but tasks that lack a natural step decomposition—such as open‑ended creative writing or single‑token classification—cannot benefit from the streaming protocol.

Theorem 1 identifies six step‑correctness regimes. Streaming is strictly advantageous only when agents exhibit head‑strong, tail‑weak correctness patterns; otherwise serial or single execution may match or exceed its performance. Practitioners can therefore consult the theorem to select the optimal execution protocol for a given workload.

Notation Summary

Provides a concise reference for all symbols used throughout the paper.

Table 2 lists the core symbols used throughout the work. $A$ denotes the number of agents and $S$ the number of reasoning steps each agent performs. $o_a^{s}$ is the number of output tokens produced by agent $a$ at step $s$, and $O_a\triangleq\sum_{s} o_a^{s}$ is the total output tokens of that agent.

$\bar{O}$ is the average per‑agent per‑step output token count, $P_a$ the length of agent $a$’s system prompt, and $\bar{P}$ the average prompt length across agents. $O_{\Sigma}= \sum_{a} O_a$ aggregates all output tokens across the entire multi‑agent system.

The effectiveness analysis defines $\text{sCorr}^{\text{mode}}$ as the mean step‑level correctness for a given execution mode (stream, serial, or single). $p_j$ is the correctness probability of step $j$, while $\delta$ and $\varepsilon$ capture the expected downstream gain and drop respectively when an upstream step is correct or incorrect.

The expected change contributed by step $j$ is $\mu_j = p_j\delta-(1-p_j)\varepsilon$, and $p^{*}= \frac{\varepsilon}{\delta+\varepsilon}$ is the minimum step correctness required for a downstream context to be beneficial. $\bar{p}$ is defined as the overall mean step correctness (set to 1 in the paper), while $p_{\text{tail}}$ and $p_{\text{head}}$ weight correctness toward later and earlier steps respectively.
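The threshold follows directly from asking when the expected per-step contribution is non-negative. The short derivation below uses only the symbols defined above and assumes $\delta + \varepsilon > 0$; the numeric values in the last line are hypothetical and chosen purely for illustration.

```latex
\begin{align*}
\mu_j \ge 0
  &\iff p_j\,\delta - (1 - p_j)\,\varepsilon \ge 0 \\
  &\iff p_j\,(\delta + \varepsilon) \ge \varepsilon \\
  &\iff p_j \ge \frac{\varepsilon}{\delta + \varepsilon} = p^{*}. \\[4pt]
\text{Example: } \delta = 0.1,\ \varepsilon = 0.3
  &\implies p^{*} = \frac{0.3}{0.4} = 0.75 .
\end{align*}
```

In this illustrative setting an upstream step must be correct at least 75% of the time before passing it downstream helps more than it hurts, because an incorrect step costs three times what a correct step gains.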

Speedup analysis introduces $C_a^{s}$ as the total number of context tokens in the $s$‑th call of agent $a$, and $h_a^{s}$ as the KV‑cache hit rate for that call. $\alpha$ and $\beta$ are the average numbers of non‑cached prefill tokens and cache‑hit tokens per output token, respectively.

The token‑processing speeds satisfy $v_c \gg v_p \gg v_d$, where $v_d$, $v_p$, and $v_c$ are the decode, prefill, and cache‑read speeds. Corresponding ratios are $r_{vdp}= \frac{v_d}{v_p}$ (decode‑to‑prefill), $r_{vdc}= \frac{v_d}{v_c}$ (decode‑to‑cache‑read), and $r_{po}= \frac{\bar{P}}{\bar{O}}$ (prompt‑to‑per‑step‑output).

Cost analysis mirrors the speed hierarchy: per‑token prices satisfy $c_d \gg c_p \gg c_c$, with $c_d$, $c_p$, and $c_c$ denoting decode, prefill, and cache costs. Price ratios are $r_{ccp}= \frac{c_c}{c_p}$ (cache‑to‑prefill) and $r_{cpd}= \frac{c_p}{c_d}$ (prefill‑to‑decode).

The overall token‑count ratio of streaming to serial execution is $\rho = \frac{O_{\text{stream}}}{O_{\text{serial}}}$; values below 1 indicate that streaming generates fewer tokens.

Section A.2 contains the detailed derivations underlying the effectiveness analysis introduced above.

Effectiveness Theory Details

Formal ordering of step‑level correctness for Stream, Serial, and Single modes.

The quantity $\text{sCorr}^{\text{mode}}$ measures the average probability that a downstream agent’s reasoning step is correct when operating under a given execution mode (Stream, Serial, or Single).

Theorem 1 states that, depending on how the head‑weighted mean $p_{\text{head}}$, the uniform mean $\bar p$, and the tail‑weighted mean $p_{\text{tail}}$ compare to $p^{*}$, the ordering of $\text{sCorr}$ among Stream, Serial, and Single falls into six distinct cases (see Figure 6).

**Figure 6.** Six canonical step-correctness profiles $p_j$, $1 \leq j \leq S$ (solid lines) relative to the break-even threshold $p^*$ (dashed line), corresponding to the six cases of Theorem 1, organized into three advantage regimes (columns). Left column (Stream-advantage): Case I.a (top), Case I.b (bottom); middle column (Serial-advantage): Case II.a (top), Case II.b (bottom); right column (Single-advantage): Case III.a (top), Case III.b (bottom).

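Reading the six cases of Theorem 1 pairwise, each comparison is decided by the sign of a single quantity: Stream versus Single by $p_{\text{head}} - p^{*}$, Serial versus Single by $\bar p - p^{*}$, and Stream versus Serial by $p^{*} - p_{\text{tail}}$. A positive sign favors the first-named mode (e.g., $p_{\text{head}}>p^{*}$ places Stream above Single).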

Practical scenarios map directly onto the six cases. When early steps are reliable but later steps degrade (Case I), streaming lets the downstream agent act on the strong prefix before errors accumulate, yielding a Stream advantage. When all steps are reliable (Case II), Serial benefits from the full context. When most steps are unreliable (Case III), avoiding upstream context altogether (Single) is optimal.

**Table 2.** Summary of notation used in the paper, categorized into General parameters, Effectiveness analysis, Speedup analysis, and Cost analysis.

Case Study Supplement

We illustrate Theorem 1’s head‑strong/tail‑weak regime with a concrete GPQA‑Diamond example.

We instantiate the head‑strong/tail‑weak sub‑regime of Theorem 1 on a GPQA‑Diamond question (ID #98) to see how streaming versus serial reasoning behaves in practice.

The question presents FTIR data (broad 3000 cm⁻¹ peak, strong 1700 cm⁻¹ peak) and a ¹H NMR spectrum with a doublet‑of‑triplets‑of‑quartets and another doublet‑of‑triplets‑of‑triplets, asking which of four carboxylic‑acid candidates matches.

We run a two‑agent chain (Agent 1 → Agent 2) using GPT‑5.4 with $S=8$ steps; each step of Agent 1 receives a binary correctness score $p_j\in\{0,1\}$ from an LLM‑as‑judge for readability.

In the Stream protocol, Agent 1 splits its answer into eight incremental steps; Agent 2 checks each step as it arrives, marking early steps correct, flagging a vague “ethyl/methylene side” phrasing at step 6, and rejecting a non‑physical long‑range coupling claim at step 7, before finally outputting answer B.

Agent 2’s per‑step quality check confirms steps 1‑5, flags step 6, rejects step 7, and after re‑enumerating proton neighbours for all options correctly selects B as the answer.

In the Serial protocol, Agent 1 emits the full eight‑step reasoning at once; Agent 2 sees the entire chain, flags the same long‑range‑coupling phrasing, but then re‑derives a DTT pattern for candidate D and accepts the earlier elimination of B, ending with the wrong answer D.

Both runs share the same pivotal upstream error: the correct answer B is eliminated without verifying its CH₃ neighbour. Early steps are therefore reliable ($p_{\text{head}}>p^*$) while later steps are harmful ($p_{\text{tail}}<p^*$), and the overall mean falls below the threshold ($\bar p<p^*$), exactly the pattern described in Theorem 1 sub‑regime I.b, where Stream outperforms Serial.

**Table.** Modes of prediction and their corresponding conditional probability conditions, where $\mathbb{P}(A_s^{a+1} | \cdot)$ is the conditional probability that $A_s^{a+1}$ is correct.

Speedup Theory Details

Provides perturbation trajectories and derives the theoretical latency speedup bound for streaming agents.

This appendix supplies two self‑contained pieces: a controlled perturbation experiment used in the case‑study supplement, and a full derivation of the latency speedup bound for the streaming protocol.

Section A.4 constructs two parallel 4‑step trajectories for the Agent1→Agent2 chain: a clean trajectory that yields the gold answer B and a perturbed trajectory that steers toward the distractor C.

A 4‑bit mask $m\in\{0,1\}^4$ selects the clean step when $m_j=1$ and the perturbed step when $m_j=0$; Agent2 then runs under either the Serial or the Stream protocol on this fixed output.

Both trajectories are presented side‑by‑side for each reasoning step, allowing us to inject false premises while preserving internal logical consistency.
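The mask construction can be sketched in a few lines. The example below assumes steps are stored as strings and uses placeholder step texts; the helper name and the specific masks labelled "tail-perturbed" and "head-perturbed" are illustrative assumptions, not taken from the paper.

```python
from itertools import product
from typing import List, Sequence


def apply_mask(
    clean: Sequence[str], perturbed: Sequence[str], mask: Sequence[int]
) -> List[str]:
    """Pick the clean step where mask bit is 1, the perturbed step where it is 0."""
    if not (len(clean) == len(perturbed) == len(mask)):
        raise ValueError("trajectories and mask must have equal length")
    return [c if bit == 1 else p for c, p, bit in zip(clean, perturbed, mask)]


# All 2^4 = 16 masks over a 4-step trajectory.
ALL_MASKS = list(product((0, 1), repeat=4))

if __name__ == "__main__":
    clean = ["clean-1", "clean-2", "clean-3", "clean-4"]        # placeholders
    perturbed = ["pert-1", "pert-2", "pert-3", "pert-4"]        # placeholders
    print(apply_mask(clean, perturbed, (1, 1, 0, 0)))  # e.g. tail-perturbed
    print(apply_mask(clean, perturbed, (0, 0, 1, 1)))  # e.g. head-perturbed
```

Fixing Agent 1's output this way isolates the protocol: the same mixed trajectory is fed to Agent 2 once as a full response (Serial) and once step by step (Stream).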

Step 1 clean: FTIR shows a broad 3000 cm⁻¹ OH band and a strong 1700 cm⁻¹ carbonyl band, which do not discriminate among the four saturated carboxylic acids. The decisive evidence is two complex methine multiplets, DTQ (doublet × triplet × quartet) and DTT (doublet × triplet × triplet), indicating two non‑equivalent CH protons each coupled to three distinct neighbour groups.

Step 1 perturbed: a full 2D‑NMR suite (HSQC, HMBC, NOESY) establishes that the $\alpha$‑carbon bears exactly one hydrogen and is directly bonded to the COOH, eliminating candidates B and D and leaving A and C as survivors.

Step 2 clean: decoding the splitting trees shows DTQ requires a methine bonded to a CH₃, whereas DTT requires a methine bonded to two inequivalent CH₂ groups and no CH₃.

Step 2 perturbed: J‑resolved 2D‑NMR reads a quartet component $J_3=1.14$ Hz, which is characteristic of a long‑range 4J W‑type coupling to a methyl two bonds away, contradicting the naive vicinal‑CH₃ expectation.
