Audio Interaction Model

A unified streaming audio-language model that enables real-time, comprehension-grounded interaction.

How can we unify disparate, offline audio tasks into a single, streaming-native model that processes audio continuously and decides when to respond?

Current Large Audio Language Models (LALMs) operate offline, requiring a complete audio clip before generating a response. This design fails to capture the continuous, interactive nature of real-world audio, where humans expect systems to listen, decide, and respond in real time. The authors introduce Audio-Interaction, a model that consumes audio in fixed-length chunks and uses a comprehension-grounded decision mechanism to determine whether to remain silent or respond at each step. This perceive-decide-respond loop is supported by the SoundFlow framework, which handles streaming-native data construction and asynchronous inference. Across eight benchmarks, the model preserves competitive performance on standard audio tasks while unlocking new capabilities like proactive assistance and real-time instruction following.

Paper Primer

The core mechanism is a chunk-level sequential decision process: the model acts as a gatekeeper that evaluates each 400ms audio chunk to emit a special token—either <silent> or <response>—before proceeding. This is like a traffic controller who monitors a continuous stream of vehicles and only signals for a stop when the pattern demands an intervention, rather than waiting for the entire road to clear.

Audio-Interaction achieves state-of-the-art performance on general audio understanding while enabling streaming interaction.

The model scores 58.15 on the MMAU benchmark, matching or exceeding offline baselines. It retains over 91% of its single-segment accuracy even when concatenating five-turn streams, where offline baselines collapse by more than 30%.

The asynchronous inference scheme significantly reduces latency.

By decoupling encoding from decoding via FIFO scheduling, the system eliminates inference stalling. First-frame latency is reduced by 4.5x compared to synchronous alternatives.

Why is a new framework needed instead of just streaming the input to existing LALMs?

Existing LALMs are designed for offline input-output mappings and lack the ability to decide *when* to respond based on unfolding context. Simply streaming audio to them does not solve the problem of false triggering or the inability to maintain long-range context across chunked inputs.

What is the scope of the Proactive-Sound-Bench?

It evaluates the model's ability to provide proactive assistance without explicit instructions, using 644 human-designed acoustic events that require the model to correctly trigger or abstain based on environmental context.

Introduction: The Need for Streaming Interaction

Audio‑Interaction unifies offline LALMs into a streaming, always‑on model.

Audio is a continuous, always‑on channel, yet today’s Large Audio Language Models (LALMs) operate offline and are tied to a single task such as streaming ASR or voice chatting.

Two fundamental gaps emerge: (C1) an interactive model must decide, at each audio chunk, whether to respond based on semantic understanding—not just acoustic cues; (C2) chunked inference breaks temporal continuity, forcing the model to reconstruct long‑range context without inflating latency.

The target regime is a streaming model that continuously listens, decides whether to stay silent or speak, and produces output, all within a single perceive–decide–respond loop.

We realize this regime with the SOUNDFLOW framework, which (i) synthesizes interaction data via hierarchical event curation and a time‑frequency joint preprocessing module, (ii) trains with a chunk‑level sequential decision objective that mitigates context forgetting, and (iii) deploys FIFO‑scheduled asynchronous inference that decouples encoding from decoding, cutting first‑frame latency by 4.5×.

To train and evaluate this paradigm we build STREAMAUDIO‑2M, a 2.6 M‑item corpus covering seven core abilities and 28 sub‑tasks, and we release PROACTIVE‑SOUND‑BENCH, a benchmark of 644 human‑designed events probing proactive audio assistance.

**Figure 1.** AUDIO-INTERACTION listens to a continuous audio stream and decides at each moment whether to stay silent or speak, unifying conventional capabilities (e.g., dialogue, ASR) and streaming-native (e.g., simultaneous translation, proactive help) capabilities within a single model.

**Figure 2.** Human listening is a continuous activity. We take in sound moment by moment and judge for ourselves when a reaction is called for. Current audio models work the opposite way: they wait for a finished recording, answer once, and handle only one kind of task per system. AUDIO-INTERACTION closes this gap by processing sound as it arrives and judging, step by step, when to speak and when to hold back—letting one model cover what previously took many specialized ones.

The shift from offline, task‑specific audio models to a unified, streaming‑native interaction paradigm enables real‑time, multi‑capability behavior without sacrificing existing performance.

Data Curation and Training Sources

This section details the StreamAudio‑2M dataset construction and defines offline LALM baselines.

We first introduce the StreamAudio‑2M dataset, then describe the offline LALM baselines used for comparison.

StreamAudio‑2M pairs long‑form, real‑world‑simulated audio with token‑level annotations, covering seven core capabilities across 28 sub‑tasks.

**Figure 5.** STREAMAUDIO-2M is a dataset built for streaming audio interaction, pairing long-form, real-world-simulated audio with token-level annotations. It jointly trains the model to interact in real time grounded in context while covering 7 foundational capabilities across 28 sub-tasks.

The offline LALM baselines are standard large audio language models that combine a Whisper‑style encoder, an adapter, and a language‑model backbone, and that require the full audio clip before generating a response.

**Figure 6.** Statistics of StreamAudio-2M. (a) The capability taxonomy spans seven core capabilities of a streaming audio model. (b) Round distribution, average response tokens, and silence proportion across tasks. (c) Statistics of source data.

**Algorithm 2** Streaming Sample Tokenization and Label Construction

**Require:** instruction tokens $\mathcal{A}^{\text{ins}}$, audio chunks $a_{1:T}$, response timeline $\mathcal{R} = [(t_k, r_k)]_{k=1}^K$ sorted by $t_k$

**Ensure:** token sequence $X$, streaming target $y^{\text{stream}}$, LM target $y^{\text{LM}}$

1: $X, y^{\text{stream}}, y^{\text{LM}} \leftarrow [], [], []$

2: Append $\mathcal{A}^{\text{ins}}$ to $X$; extend labels with MASK

3: $k \leftarrow 1$

4: **for** $t = 1$ to $T$ **do**

5: $\quad$ Append encoder features of $a_t$ to $X$; extend labels with MASK

6: $\quad$ **if** $k \leq K \land t_k = t$ **then** $\quad \triangleright$ response triggers at chunk $t$

7: $\quad \quad$ Append $\langle\text{response}\rangle$; $y^{\text{stream}} += \langle\text{response}\rangle$, $y^{\text{LM}} += \text{MASK}$

8: $\quad \quad$ **for** token $w$ in $r_k$ **do**

9: $\quad \quad \quad$ Append $w$; $y^{\text{stream}} += \text{MASK}$, $y^{\text{LM}} += w$

10: $\quad \quad$ **end for**

11: $\quad \quad$ Append $\langle\text{eos}\rangle$; $y^{\text{stream}} += \text{MASK}$, $y^{\text{LM}} += \langle\text{eos}\rangle$

12: $\quad \quad k \leftarrow k + 1$

13: $\quad$ **else** $\quad \triangleright$ remain silent

14: $\quad \quad$ Append $\langle\text{silent}\rangle$; $y^{\text{stream}} += \langle\text{silent}\rangle$, $y^{\text{LM}} += \text{MASK}$

15: $\quad$ **end if**

16: **end for**

17: **return** $X, y^{\text{stream}}, y^{\text{LM}}$
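Algorithm 2 can be sketched in Python as shown here. This is a minimal illustration and not the authors' code: the function name `build_streaming_sample`, the `MASK` value of -100, and the use of one placeholder string per chunk in place of real encoder features are all assumptions made for readability.

```python
from typing import List, Sequence, Tuple

MASK = -100            # assumed ignore-index placeholder for masked label positions
SILENT = "<silent>"
RESPONSE = "<response>"
EOS = "<eos>"


def build_streaming_sample(
    instruction: Sequence[str],
    chunks: Sequence[str],
    timeline: Sequence[Tuple[int, Sequence[str]]],
) -> Tuple[List, List, List]:
    """Interleave chunks, decision tokens, and responses as in Algorithm 2.

    chunks[t-1] stands in for the encoder features of chunk t (one placeholder
    per chunk here), and timeline holds (t_k, r_k) pairs sorted by the 1-based
    chunk index t_k.
    """
    X: List = []
    y_stream: List = []
    y_lm: List = []

    # Instruction tokens are inputs only, so both targets are masked.
    X.extend(instruction)
    y_stream.extend([MASK] * len(instruction))
    y_lm.extend([MASK] * len(instruction))

    k = 0  # index of the next pending response in the timeline
    for t, feat in enumerate(chunks, start=1):
        # Audio features carry no supervision on either target.
        X.append(feat)
        y_stream.append(MASK)
        y_lm.append(MASK)

        if k < len(timeline) and timeline[k][0] == t:
            # A response triggers at chunk t: supervise the decision token
            # on the streaming target and the words on the LM target.
            X.append(RESPONSE)
            y_stream.append(RESPONSE)
            y_lm.append(MASK)
            for w in timeline[k][1]:
                X.append(w)
                y_stream.append(MASK)
                y_lm.append(w)
            X.append(EOS)
            y_stream.append(MASK)
            y_lm.append(EOS)
            k += 1
        else:
            # No response scheduled: the streaming target is <silent>.
            X.append(SILENT)
            y_stream.append(SILENT)
            y_lm.append(MASK)
    return X, y_stream, y_lm
```

Calling `build_streaming_sample(["ins"], ["a1", "a2", "a3"], [(3, ["Hello"])])` yields two silent decisions followed by a response at the third chunk. The three returned lists always have the same length, so each position has exactly one input token and two (possibly masked) labels.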

The Streaming Training Paradigm

How the model learns to listen, decide, and speak continuously.

Training a model that must decide in real time when to speak faces two practical problems: (1) it can lose earlier context in long streams, and (2) it may fire a response to irrelevant sounds.

The model treats the audio stream as an endless sequence and decides at each moment whether to keep listening or to emit a response, like a listener who flips a switch from “watch” to “talk” once enough of the story is heard.

How does the Streaming‑Native Paradigm differ from the offline LALM approach used in prior work?

Offline LALMs wait for a complete audio clip before producing any output, whereas the streaming paradigm makes a binary decision after each fixed‑size chunk, allowing immediate continuation of listening or instant response generation.

Chunk 1 arrives → model predicts <silent> because evidence is insufficient.

Chunk 2 arrives → still not enough context → predicts <silent> again.

Chunk 3 arrives → accumulated cues cross the decision threshold → predicts <response>.

Model switches to generation mode and outputs the token “Hello”.

This walk‑through shows how the model defers speaking until enough acoustic evidence is gathered, avoiding premature interruptions.

At each step the model computes $d_t, r_t = f_{\text{det}}(a_t, C_t)$. If $d_t = \langle\text{silent}\rangle$, $r_t$ is empty and the stream continues; otherwise $r_t = f_{\text{resp}}(a_t, C_t)$ generates the textual response.
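The per-chunk decision rule can be illustrated with a small loop. This is a sketch under stated assumptions: `interaction_loop`, `toy_det`, and `toy_resp` are hypothetical names, the context is a plain Python list, and the toy detector simply counts chunks instead of running a model.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

SILENT, RESPONSE = "<silent>", "<response>"


def interaction_loop(
    chunks: Iterable[str],
    f_det: Callable[[str, List[str]], str],
    f_resp: Callable[[str, List[str]], str],
) -> Iterator[Tuple[int, str, str]]:
    """Yield (t, d_t, r_t) for every incoming chunk a_t."""
    context: List[str] = []  # C_t: everything heard or said so far
    for t, a_t in enumerate(chunks, start=1):
        d_t = f_det(a_t, context)
        # The response is empty whenever the decision is <silent>.
        r_t = "" if d_t == SILENT else f_resp(a_t, context)
        context.append(a_t)
        if r_t:
            context.append(r_t)
        yield t, d_t, r_t


def toy_det(a_t: str, context: List[str]) -> str:
    # Stand-in detector: respond once three chunks have accumulated.
    return RESPONSE if len(context) + 1 >= 3 else SILENT


def toy_resp(a_t: str, context: List[str]) -> str:
    # Stand-in generator: always answers with the same word.
    return "Hello"


steps = list(interaction_loop(["c1", "c2", "c3"], toy_det, toy_resp))
```

With three chunks the loop reproduces the walk-through: two `<silent>` decisions, then `<response>` with the text "Hello".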

Format training: use offline data to teach the model the sequence format and the use of <`Spe_token`> (samples of the form (`A_ins`, `A_in` → T)).

Adapter training: learn a mapping from chunk‑wise acoustic representations into the language‑model space while keeping the same format.

Large‑scale streaming supervised training: jointly optimize adapter and LM on core capabilities (audio understanding, ASR, spoken dialogue) with both instruction‑only and instruction‑plus‑audio inputs.

Instruction‑following fine‑tuning: expose the model to interleaved sequences that require comprehension‑aware intervention and proactive response, e.g., (`A_ins`, `A_in`, T, …) and (`A_in` → T).

**Figure 3.** The training framework of SOUNDFLOW. Audio signals, intermediate representations, and supervision signals are organized into a unified temporal sequence, and a streaming training strategy jointly optimizes language modeling and response triggering, enabling AUDIO-INTERACTION to decide when to respond or remain silent across diverse real-time tasks.

FIFO-Scheduled Asynchronous Inference

How the streaming dataset is assembled and how inference stays stable.

Real‑time audio encoding together with the model’s special‑token silence response can cause waiting conflicts: the decoder may be idle while the encoder keeps feeding data, or it may block because the last token signals “keep listening”. This scheduling tension leads to inference stalls and high first‑frame latency.

The encoder pushes acoustic embeddings into a first‑in‑first‑out queue, and the decoder only pulls from the head of that queue when the previously emitted token was a silence or end‑of‑sentence marker.

How does this differ from a naïve FIFO buffer that the decoder simply reads every step?

In the naïve case the decoder would consume whatever is at the queue head regardless of the semantic state of the conversation, potentially interrupting a spoken response. FIFO‑Scheduled Asynchronous Inference adds a token‑based gate: the decoder only pulls when the last token signals “silence” or “end‑of‑sentence”, preserving turn‑taking semantics while still benefitting from FIFO ordering.

Encoder encodes $x_1\rightarrow a_1$ and appends to $Q$: $Q=[a_1]$.

Encoder encodes $x_2\rightarrow a_2$ and appends: $Q=[a_1,a_2]$.

Decoder sees $r_{t-1}=\langle\text{silent}\rangle$, flushes $Q$, extending cache $\mathcal{C}$ with $[a_1,a_2]$ and consumes them.

Decoder generates token $r_t$ (e.g., “yes”). Queue is now empty.

Encoder processes $x_3\rightarrow a_3$, appends: $Q=[a_3]$.

Decoder’s next token $r_{t+1}$ is a regular word, so it does not flush; $Q$ retains $a_3$ for the next silence turn.

The gate ensures that the model only “listens” when it is not speaking, preventing self‑interruptions and guaranteeing low‑latency resumption after a response.

FIFO‑Scheduled Asynchronous Streaming Inference (simplified).

**Figure 4.** SoundFlow's FIFO-scheduled asynchronous streaming inference. Audio chunks are appended to temporal queue; decoding is triggered when decoder is not speaking.
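The gating rule can be replayed with a single-threaded simulation. This is only a sketch of the token-based gate, not the asynchronous implementation described in the paper: the class name `GatedDecoder` is hypothetical, strings stand in for embeddings, and the decoder's predicted token is passed in by the caller instead of being generated by a model.

```python
from collections import deque
from typing import Deque, List

SILENT, EOS = "<silent>", "<eos>"


class GatedDecoder:
    """Toy model of the queue Q, the cache C, and the last token r*."""

    def __init__(self) -> None:
        self.queue: Deque[str] = deque()  # Q: encoder outputs not yet consumed
        self.cache: List[str] = []        # C: context the decoder has consumed
        self.last_token = SILENT          # r*: assumed to start in listening state

    def push(self, embedding: str) -> None:
        """Encoder side: append to the queue without waiting for the decoder."""
        self.queue.append(embedding)

    def step(self, next_token: str) -> int:
        """Decoder side: flush Q into C only when the gate is open.

        next_token stands in for the model's prediction at this step.
        Returns the number of embeddings flushed.
        """
        flushed = 0
        if self.last_token in (SILENT, EOS):
            while self.queue:
                self.cache.append(self.queue.popleft())
                flushed += 1
        self.cache.append(next_token)
        self.last_token = next_token
        return flushed
```

Pushing `a1` and `a2`, then calling `step("yes")`, flushes both embeddings because the previous token was `<silent>`. A later `push("a3")` stays queued through regular-word steps and is consumed only after an `<eos>` or `<silent>` token has been emitted.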

Data Collection: aggregate ~1.64 M foundational items (~8,900 h) from public corpora (MOSS, CommonVoice, GigaSpeech, LibriSpeech, VoxPopuli, CoVoST2, AISHELL, FMA, AudioSet) and add ~171 k acoustic‑event clips plus diverse noise sources.

Preprocessing: synthesize speech from textual sources using multi‑voice CosyVoice, then verify each utterance via LLM rewriting and ASR checking to ensure fidelity.

Sequence Concatenation: stitch validated utterances into multi‑turn streaming sequences per Section 3.2, overlaying dual‑track environmental noise to simulate real‑world conditions.

Token‑level Annotation: convert each streaming sequence into ⟨input ids, labels⟩ pairs, marking special tokens (

Performance and Benchmarks

Key performance gains of AUDIO‑INTERACTION across streaming and speech tasks.

AUDIO‑INTERACTION preserves audio understanding under streaming training, achieving 58.15 % accuracy on audio‑instruction MMAU and surpassing the Qwen2.5‑Omni‑3B baseline.

Table 1 shows 58.15 % for AUDIO‑INTERACTION versus 57.53 % for Qwen2.5‑Omni‑3B, and AUDIO‑INTERACTION matches several 7 B systems despite its smaller 3 B parameter count.

**Table 1.** Results on the MMAU benchmark under text and audio instructions across three audio domains. Stream. and Multi-turn indicate streaming and multi-turn training support (- indicates not applicable).

**Table 2.** Performance score ($\uparrow$) on four spoken-dialogue benchmarks.

**Table 4.** Results on the Proactive-Sound-Bench. Equip. stands for Equipment. Sin. and Mul. denote Single-round and Multi-round respectively. Best and second-best results are highlighted.

**Figure 9.** Capability stability of AUDIO-INTERACTION as the stream extends from 1 to 5 concatenated segments. We report MMAU average accuracy, dialogue accuracy, and end-to-end latency.

Structural Analysis of Streaming

We quantify how streaming bridges the offline gap and isolate the key attention head driving silent‑vs‑response decisions.

The streaming architecture unifies discrete audio chunks into a continuous representation and routes the silent‑vs‑response decision through a single attention head.

Continuity ratio at the early decoder layer rises to $0.80$, eliminating most of the fragmentation caused by independent chunk embeddings.

Figure 7 shows encoder output continuity at $0.25$ and a projector shift of less than $0.02$, while GPT Layer 0 lifts continuity to $0.80$ in a single step.

**Scenario 1. Home — Weekend Childcare** A continuous 30-second household stream. AUDIO-INTERACTION listens every 0.4 s and decides whether to remain silent or speak; five of the seven StreamAudio-2M task categories are exercised in this single scene.

**Scenario 2. Office — Workday** A continuous 60-second office stream. AUDIO-INTERACTION listens every 0.4 s and decides whether to remain silent or speak; this scene exercises five of the seven StreamAudio-2M task categories.

Ablation Study

Ablations quantify how each design choice impacts latency, accuracy, and overall interaction quality.

We run four ablations, each removing a single component of the AUDIO‑INTERACTION system to measure its impact on latency, stall rate, and downstream metrics.

FIFO scheduling cuts first‑chunk latency by more than half and eliminates stalls.

Table 5 shows the “OURS” configuration (with FIFO) achieves 392 ms latency and 0 % stall versus 831 ms and 5.2 % without FIFO.

**Table 5.** Effect of asynchronous inference.

Removing TFJP preprocessing drops trigger accuracy by 7.1 percentage points.

Table 6 compares variant V2 (full streaming SFT) with V3 (w/o TFJP); trigger accuracy falls from 92.4 % to 85.3 %.

Omitting hierarchical event selection reduces trigger accuracy by 3.9 points relative to the full model.

Table 6 shows variant V4 (w/o event selection) at 88.5 % versus 92.4 % for the full V2.

**Table 6.** Ablation on streaming model training.

A 0.4 s chunk yields 392 ms latency while preserving accuracy.

Table 7 reports 392 ms latency and MMAU ≈ 58.2 for the 0.4 s setting, outperforming both smaller (0.2 s) and larger (0.6–0.8 s) chunks.

**Table.** Configuration and training parameters for the four stages of the model.

Setting $\lambda$ = 1.0 yields the highest trigger accuracy (96.9 %).

Table 8 reports 96.9 % trigger accuracy at $\lambda$ = 1.0, compared with 95.3 % at $\lambda$ = 0.5 and 96.7 % at $\lambda$ = 2.0.

**Table 8.** Effect of dual-loss weight $\lambda$.
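This summary does not spell out the dual-loss objective, so the following is only one plausible reading: a language-modeling term plus a $\lambda$-weighted triggering term, each averaged over its own unmasked positions. The function names, the choice of which term $\lambda$ scales, and the use of `None` for masked labels are all assumptions.

```python
import math
from typing import Optional, Sequence


def masked_nll(
    probs: Sequence[Sequence[float]],
    targets: Sequence[Optional[int]],
) -> float:
    """Mean negative log-likelihood over positions whose target is not None."""
    losses = [-math.log(p[t]) for p, t in zip(probs, targets) if t is not None]
    return sum(losses) / len(losses) if losses else 0.0


def dual_loss(
    probs: Sequence[Sequence[float]],
    y_lm: Sequence[Optional[int]],
    y_stream: Sequence[Optional[int]],
    lam: float = 1.0,
) -> float:
    """Assumed form: L = L_LM + lam * L_stream, using the two label tracks."""
    return masked_nll(probs, y_lm) + lam * masked_nll(probs, y_stream)
```

Because the two label tracks mask complementary positions, each token position contributes to at most one of the two terms, and `lam` only changes the relative weight of the silent-or-respond decisions.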

Case Studies and Conclusion

Real‑world case studies confirm AUDIO‑INTERACTION’s streaming gains and low latency.

AUDIO‑INTERACTION attains a perfect 100 % trigger / response rate on the cat‑meow cue, while the next‑best streaming baselines reach only 66.7 %.

Figure 10 shows the per‑model percentages for the cat‑meow scenario.

**Figure 10.** Case studies show AUDIO-INTERACTION’s gains over SOTA streaming models. In the second case study, other models detect the cat mostly through the transcribed word "meow", while AUDIO-INTERACTION handles the audio cue directly via native streaming training.

On four real‑world recording scenarios, AUDIO‑INTERACTION’s trigger accuracy drops by only 3.1 percentage points compared with synthetic streams (58.9 % vs 62.0 %).

Section A.1 reports average accuracies across Travel, Work, Home, and Commute recordings.

Travel and Commute recordings exhibit the largest accuracy loss, driven by dense crowd ambience and non‑stationary noise that raise ASR word‑error‑rate to roughly 7.9 % and 8.6 % respectively; Work remains closest to synthetic performance, while Home preserves trigger accuracy but incurs a few false positives from benign kitchen sounds.

Appendix: Data Construction Details

Describes the TFJP pipeline that cleans and segments raw audio into streaming‑ready clips.

The TFJP module (Section 3.2) prepares raw recordings for streaming by applying a fixed‑size STFT and then passing the spectrogram through a cascade of six operators.

`SILENCE_CUT` removes silent segments longer than the silence limit $\tau$ by gating at the 10th percentile of frame energy.

`NOISE_PROFILE` computes a stationary noise spectrum from the lowest‑energy 5 % of frames.

`DENOISE` performs spectral subtraction using a gating coefficient $\gamma$ = 1.0.

`CORE_LOCATE` selects the contiguous span that maximizes a normalized energy‑over‑spectral‑entropy score.

`BOUNDARY_NORM` snaps the selected span to the nearest $\delta$ = c/2 = 200 ms boundary.

`SPEC_SMOOTH` applies a Hann taper of length $\omega$ = 20 ms at both ends of the span.

The default silence threshold is $\tau$ = 300 ms, the streaming chunk size is c = 400 ms, and the iteration cap K = 3 (Algorithm 1) is triggered on fewer than 2 % of clips during corpus construction.
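Three of the operators above are simple enough to sketch directly. The following is an illustrative approximation, not the TFJP implementation: the function names are hypothetical, frame energies are plain floats, and the taper is expressed in samples because the sample rate is not stated here.

```python
import math
from typing import List, Sequence, Tuple

CHUNK_MS = 400
DELTA_MS = CHUNK_MS // 2  # delta = c/2 = 200 ms


def boundary_norm(start_ms: float, end_ms: float,
                  delta_ms: int = DELTA_MS) -> Tuple[int, int]:
    """Snap both ends of a span to the nearest multiple of delta_ms."""
    def snap(x: float) -> int:
        return int(math.floor(x / delta_ms + 0.5)) * delta_ms
    return snap(start_ms), snap(end_ms)


def noise_profile_frames(frame_energy: Sequence[float],
                         fraction: float = 0.05) -> List[int]:
    """Indices of the lowest-energy fraction of frames (at least one frame)."""
    n = max(1, int(len(frame_energy) * fraction))
    order = sorted(range(len(frame_energy)), key=lambda i: frame_energy[i])
    return sorted(order[:n])


def hann_fade_in(num_samples: int) -> List[float]:
    """Rising half of a Hann window; reverse it to obtain the fade-out."""
    if num_samples <= 1:
        return [1.0] * num_samples
    return [0.5 * (1.0 - math.cos(math.pi * i / (num_samples - 1)))
            for i in range(num_samples)]
```

For example, `boundary_norm(130, 1490)` snaps the span to 200 ms and 1400 ms. A production version would also need to handle spans that collapse to zero length after snapping, which this sketch does not.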

Appendix: Inference Implementation

Details the streaming inference pipeline and token‑label construction for AUDIO‑INTERACTION.

Algorithm 4 builds a single long‑form streaming waveform by concatenating foreground clips sequentially, re‑applying TFJP at each junction, and mixing the foreground, background, and ambient clips at gains of 0 dB, −6 dB, and −12 dB respectively.

Two independent noise tracks—one event‑like, one ambient—are tiled across the full duration, cross‑faded at boundaries, and mixed at signal‑to‑noise ratios sampled from $P_{\text{snr}}=U(5,20)$ dB; the ambient track is kept 5 dB quieter to emulate real recordings.
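The SNR-controlled mixing can be sketched as follows. This is a simplified illustration under several assumptions: the helper names are hypothetical, tiling and cross-fading are omitted, all tracks are equal-length lists of samples, and "5 dB quieter" is read as giving the ambient track an SNR 5 dB higher than the event track.

```python
import math
import random
from typing import List, Sequence, Tuple


def power(x: Sequence[float]) -> float:
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)


def gain_for_snr(signal: Sequence[float], noise: Sequence[float],
                 snr_db: float) -> float:
    """Gain g such that 10*log10(P(signal) / P(g*noise)) equals snr_db.

    Assumes the noise track has non-zero power.
    """
    return math.sqrt(power(signal) / (power(noise) * 10 ** (snr_db / 10)))


def mix_dual_track(fg: Sequence[float], event_noise: Sequence[float],
                   ambient_noise: Sequence[float], rng: random.Random,
                   ambient_offset_db: float = 5.0) -> Tuple[List[float], float]:
    """Mix a foreground track with event-like and ambient noise tracks."""
    snr_db = rng.uniform(5.0, 20.0)  # P_snr = U(5, 20) dB
    g_event = gain_for_snr(fg, event_noise, snr_db)
    g_ambient = gain_for_snr(fg, ambient_noise, snr_db + ambient_offset_db)
    mixed = [s + g_event * e + g_ambient * a
             for s, e, a in zip(fg, event_noise, ambient_noise)]
    return mixed, snr_db
```

Passing a seeded `random.Random` instance keeps the sampled SNR reproducible, which is convenient when regenerating the same stream for debugging.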

The resulting pair $(y,T)$ is exactly what Algorithm 2 expects: the waveform $y$ is sliced into 400 ms chunks, each chunk is encoded, and the encoded features are merged with the response timeline $T$ (denoted $\mathcal{R}$ in Algorithm 2) to form the $\langle X,\,y^{\text{stream}},\,y^{\text{LM}}\rangle$ training tuple.

The same streaming routine is reused for all seven StreamAudio‑2M task categories; the only variation is which timestamps in $T$ carry a non‑empty response (e.g., ASR writes one entry per chunk, voice‑chatting writes at turn boundaries, proactive response writes only at safety‑critical events).

Algorithm 2 tokenizes each audio chunk $a_t$, appends encoder features to $X$, and constructs parallel streaming and language‑model targets $y^{\text{stream}}$ and $y^{\text{LM}}$ by inserting special tokens (⟨response⟩, ⟨eos⟩, ⟨silent⟩) and masking as dictated by the response schedule $\mathcal{R}$.

Algorithm 3 runs FIFO‑Scheduled Asynchronous Streaming Inference by spawning an encoder loop and a decoder loop that operate concurrently on the incoming audio stream $x_{1:\infty}$, sharing a KV‑cache $C$ and a response queue $Q$ while tracking the last emitted token $r^{*}$, which starts as $\langle\text{silent}\rangle$.

Appendix: Curation Pipeline Details

Describes the three‑stage prompt pipeline and the dual‑track streaming algorithm that assembles curated audio interactions.

The curation pipeline turns a loose bag of audio annotations into a coherent, streaming‑ready interaction. It proceeds in three deterministic stages: planning a plausible scenario, refining each sub‑event into retrieval‑ready queries, and verifying that candidate clips fit acoustically.

Dual‑Track Streaming Sequence Composition (Algorithm 4)

Stage 1 – Scenario Planning asks the model to compose a single real‑world scene that respects temporal ordering, acoustic compatibility, and role consistency. The output is a JSON object containing a one‑sentence description, an ordered list of sub‑events (foreground, background, ambient), and explicit constraints.

Stage 2 – Event Refinement expands each sub‑event into a concrete retrieval query and a fallback caption. Queries are 4–12 words long and include material, surface, and intensity cues so that an AudioSet‑style engine can discriminate near‑confusables.

Stage 3 – Clip Grounding Verification receives a candidate clip and decides whether it can be inserted without breaking acoustic consistency. The verifier checks identity, cleanliness, duration fit, and continuity, preferring “reprocess” over “accept” when uncertain.
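The three stages exchange structured records, which can be sketched as plain data classes. The field names, the `reject` label, and the use of `None` to mark an uncertain check are assumptions for illustration; only the query length bound and the preference for "reprocess" over "accept" under uncertainty come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

TRACKS = ("foreground", "background", "ambient")
CHECKS = ("identity", "cleanliness", "duration_fit", "continuity")


@dataclass
class SubEvent:
    track: str                  # one of TRACKS
    description: str
    query: str = ""             # filled in by Stage 2
    fallback_caption: str = ""  # filled in by Stage 2


@dataclass
class ScenarioPlan:
    description: str            # one-sentence scene description
    sub_events: List[SubEvent]  # ordered in time
    constraints: List[str] = field(default_factory=list)


def query_is_valid(query: str) -> bool:
    """Stage 2 length rule: retrieval queries are 4 to 12 words long."""
    return 4 <= len(query.split()) <= 12


def grounding_decision(checks: Dict[str, Optional[bool]]) -> str:
    """Stage 3: True = pass, False = fail, None = uncertain."""
    results = [checks.get(name) for name in CHECKS]
    if any(r is False for r in results):
        return "reject"         # hypothetical label for a clear failure
    if any(r is None for r in results):
        return "reprocess"      # prefer reprocessing when uncertain
    return "accept"
```

A query such as "heavy rain on a metal roof" passes the length rule, while "dog bark" is too short. A candidate clip with one uncertain check is routed to "reprocess" even if the other three checks pass.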

Comprehension‑Aware Supervision adds two auxiliary prompts. The History‑Review Question Generation prompt creates a follow‑up question that forces the model to retain information from at least three turns earlier. The Silent‑Audio Verification prompt labels a clip as “respond” only for safety‑critical sounds, otherwise defaulting to “silent”.

Appendix: Dataset Sources

Details of the public corpora and auxiliary sources that compose the StreamAudio‑2M dataset.

StreamAudio‑2M is built from a curated pool of publicly available corpora, each chosen to supply a specific capability for streaming interaction. We prefer well‑established sources over scraped or proprietary collections to keep the dataset reproducible and to let the streaming pipeline handle the heavy transformations.

The speech‑centric portion draws from CommonVoice, GigaSpeech, LibriSpeech, VoxPopuli, CoVoST‑2 (both directions), AISHELL, and three variants of AudioSet, contributing tens of thousands of items and hundreds of hours of audio. Table 9 (in the paper) lists each source’s item count and hour contribution, and the roles they play in the streaming regime are described below.

Speech‑centric sources support four offline capabilities that the streaming model must inherit: spoken dialogue, streaming ASR, speech‑to‑text translation, and audio question answering. MOSS provides the largest block of dialogue supervision; LibriSpeech is re‑segmented into 400 ms chunks to align with the AUDIO‑INTERACTION listening phase; CoVoST‑2 supplies bidirectional English‑Chinese translation pairs, both in their original offline form and stitched into continuous streams for simultaneous interpretation.

Acoustic‑event sources combine real recordings from AudioSet with synthetic clips generated by AudioX and ElevenLabs to fill sparsely covered safety‑critical categories. Together they yield 171 k clips that span the full Proactive-Sound-Bench taxonomy, ensuring every evaluation category appears during training.

Background noise is overlaid on every long‑form stream using three established corpora: MUSAN for music, ambient sound, and speech babble; WHAM! for urban and reverberant scenes; and DNS Challenge for diverse environmental noise. This adds 620 hours of background at controlled SNRs, teaching the model to suppress responses to non‑foreground sound.

The assembled corpus also includes auxiliary resources such as UltraChat, Magpie Pro, DU QA, COIG CQIA, Web QA, and BellGroup, which provide additional conversational and factual grounding for the model’s multi‑task abilities.


Appendix: Proactive Sound Bench

Defines the ProactiveSound‑Bench task and its comprehensive sound taxonomy.

ProactiveSound‑Bench is a benchmark that asks a model to listen continuously, decide instantly whether to speak, and, if so, generate a helpful spoken reply.

ProactiveSound‑Bench groups test scenarios into six macro domains: Human Sound Signals, Daily Living Sounds, Nature & Environment, Equipment, Traffic, and Music. Human Sound Signals cover physiological cues such as crying, breathing, or distress vocalizations; Daily Living Sounds capture routine household activities; Nature & Environment includes weather and ecological sounds; Equipment spans household appliances and industrial tools; Traffic comprises vehicle and road noises; Music focuses on instrument performances and out‑of‑tune failures.
