Audio Interaction Model

Zhifei Xie, Zihang Liu, Ze An, Xiaobin Hu, Yue Liao, Ziyang Ma, Dongchao Yang, Mingbao Lin, Deheng Ye, Shuicheng Yan, Chunyan Miao

A unified streaming audio-language model that enables real-time, comprehension-grounded interaction.

How can we unify disparate, offline audio tasks into a single, streaming-native model that processes audio continuously and decides when to respond?

Current Large Audio Language Models (LALMs) operate offline, requiring a complete audio clip before generating a response. This design fails to capture the continuous, interactive nature of real-world audio, where humans expect systems to listen, decide, and respond in real time. The authors introduce Audio-Interaction, a model that consumes audio in fixed-length chunks and uses a comprehension-grounded decision mechanism to determine whether to remain silent or respond at each step. This perceive-decide-respond loop is supported by the SoundFlow framework, which handles streaming-native data construction and asynchronous inference. Across eight benchmarks, the model preserves competitive performance on standard audio tasks while unlocking new capabilities like proactive assistance and real-time instruction following.

Paper Primer

The core mechanism is a chunk-level sequential decision process: the model acts as a gatekeeper that evaluates each 400ms audio chunk to emit a special token—either <silent> or <response>—before proceeding. This is like a traffic controller who monitors a continuous stream of vehicles and only signals for a stop when the pattern demands an intervention, rather than waiting for the entire road to clear.

Audio-Interaction achieves state-of-the-art performance on general audio understanding while enabling streaming interaction.

The model scores 58.15 on the MMAU benchmark, matching or exceeding offline baselines. It retains over 91% of its single-segment accuracy even when concatenating five-turn streams, where offline baselines collapse by more than 30%.

The asynchronous inference scheme significantly reduces latency.

By decoupling encoding from decoding via FIFO scheduling, the system eliminates inference stalling. First-frame latency is reduced by 4.5x compared to synchronous alternatives.

Why is a new framework needed instead of just streaming the input to existing LALMs?

Existing LALMs are designed for offline input-output mappings and lack the ability to decide *when* to respond based on unfolding context. Simply streaming audio to them does not solve the problem of false triggering or the inability to maintain long-range context across chunked inputs.

What is the scope of the Proactive-Sound-Bench?

It evaluates the model's ability to provide proactive assistance without explicit instructions, using 644 human-designed acoustic events that require the model to correctly trigger or abstain based on environmental context.

Introduction: The Need for Streaming Interaction

Audio‑Interaction unifies offline LALMs into a streaming, always‑on model.

Audio is a continuous, always‑on channel, yet today’s Large Audio Language Models (LALMs) operate offline and are tied to a single task such as streaming ASR or voice chatting.

Two fundamental gaps emerge: (C1) an interactive model must decide, at each audio chunk, whether to respond based on semantic understanding—not just acoustic cues; (C2) chunked inference breaks temporal continuity, forcing the model to reconstruct long‑range context without inflating latency.

The target regime is a streaming model that continuously listens, decides whether to stay silent or speak, and produces output, all within a single perceive–decide–respond loop.

We realize this regime with the SOUNDFLOW framework, which (i) synthesizes interaction data via hierarchical event curation and a time‑frequency joint preprocessing module, (ii) trains with a chunk‑level sequential decision objective that mitigates context forgetting, and (iii) deploys FIFO‑scheduled asynchronous inference that decouples encoding from decoding, cutting first‑frame latency by 4.5×.

To train and evaluate this paradigm we build STREAMAUDIO‑2M, a 2.6 M‑item corpus covering seven core abilities and 28 sub‑tasks, and we release PROACTIVE‑SOUND‑BENCH, a benchmark of 644 human‑designed events probing proactive audio assistance.

**Figure 1.** AUDIO-INTERACTION listens to a continuous audio stream and decides at each moment whether to stay silent or speak, unifying conventional capabilities (e.g., dialogue, ASR) and streaming-native (e.g., simultaneous translation, proactive help) capabilities within a single model.

**Figure 2.** Human listening is a continuous activity. We take in sound moment by moment and judge for ourselves when a reaction is called for. Current audio models work the opposite way: they wait for a finished recording, answer once, and handle only one kind of task per system. AUDIO-INTERACTION closes this gap by processing sound as it arrives and judging, step by step, when to speak and when to hold back—letting one model cover what previously took many specialized ones.

The shift from offline, task‑specific audio models to a unified, streaming‑native interaction paradigm enables real‑time, multi‑capability behavior without sacrificing existing performance.

Data Curation and Training Sources

This section details the StreamAudio‑2M dataset construction and defines offline LALM baselines.

We first introduce the StreamAudio‑2M dataset, then describe the offline LALM baselines used for comparison.

StreamAudio‑2M pairs long‑form, real‑world‑simulated audio with token‑level annotations, covering seven core capabilities across 28 sub‑tasks.

**Figure 5.** STREAMAUDIO-2M is a dataset built for streaming audio interaction, pairing long-form, real-world-simulated audio with token-level annotations. It jointly trains the model to interact in real time grounded in context while covering 7 foundational capabilities across 28 sub-tasks.

Offline LALM baselines are standard large audio language models that combine a Whisper‑style encoder, an adapter, and a language‑model backbone, and that require the full audio clip before generating a response.

**Figure 6.** Statistics of StreamAudio-2M. (a) The capability taxonomy spans seven core capabilities of a streaming audio model. (b) Round distribution, average response tokens, and silence proportion across tasks. (c) Statistics of source data.

**Algorithm 2** Streaming Sample Tokenization and Label Construction
**Require:** instruction tokens $\mathcal{A}^{\text{ins}}$, audio chunks $a_{1:T}$, response timeline $\mathcal{R} = [(t_k, r_k)]_{k=1}^K$ sorted by $t_k$
**Ensure:** token sequence $X$, streaming target $y^{\text{stream}}$, LM target $y^{\text{LM}}$
1: $X, y^{\text{stream}}, y^{\text{LM}} \leftarrow [], [], []$
2: Append $\mathcal{A}^{\text{ins}}$ to $X$; extend labels with MASK
3: $k \leftarrow 1$
4: **for** $t = 1$ to $T$ **do**
5: $\quad$ Append encoder features of $a_t$ to $X$; extend labels with MASK
6: $\quad$ **if** $k \leq K \land t_k = t$ **then** $\quad \triangleright$ response triggers at chunk $t$
7: $\quad \quad$ Append $\langle\text{response}\rangle$; $y^{\text{stream}} += \langle\text{response}\rangle$, $y^{\text{LM}} += \text{MASK}$
8: $\quad \quad$ **for** token $w$ in $r_k$ **do**
9: $\quad \quad \quad$ Append $w$; $y^{\text{stream}} += \text{MASK}$, $y^{\text{LM}} += w$
10: $\quad \quad$ **end for**
11: $\quad \quad$ Append $\langle\text{eos}\rangle$; $y^{\text{stream}} += \text{MASK}$, $y^{\text{LM}} += \langle\text{eos}\rangle$
12: $\quad \quad k \leftarrow k + 1$
13: $\quad$ **else** $\quad \triangleright$ remain silent
14: $\quad \quad$ Append $\langle\text{silent}\rangle$; $y^{\text{stream}} += \langle\text{silent}\rangle$, $y^{\text{LM}} += \text{MASK}$
15: $\quad$ **end if**
16: **end for**
17: **return** $X, y^{\text{stream}}, y^{\text{LM}}$
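As a reading aid, Algorithm 2 can be transcribed into a few lines of Python. This is a minimal sketch rather than the authors' code: the function name `build_streaming_sample`, the use of `None` as the MASK value, and the string stand-ins for encoder features are all assumptions made for illustration.

```python
MASK = None               # assumed stand-in for "no supervision at this position"
RESPONSE = "<response>"
SILENT = "<silent>"
EOS = "<eos>"


def build_streaming_sample(instruction, chunk_features, timeline):
    """Return (X, y_stream, y_lm), three lists of equal length.

    chunk_features[t - 1] holds the stand-in encoder features of chunk t.
    timeline is a list of (t_k, r_k) pairs sorted by 1-based chunk index t_k.
    """
    X, y_stream, y_lm = [], [], []

    def emit(token, stream_label, lm_label):
        X.append(token)
        y_stream.append(stream_label)
        y_lm.append(lm_label)

    for tok in instruction:                  # instruction tokens are never supervised
        emit(tok, MASK, MASK)

    k = 0
    for t, feats in enumerate(chunk_features, start=1):
        for f in feats:                      # audio features are inputs only
            emit(f, MASK, MASK)
        if k < len(timeline) and timeline[k][0] == t:
            emit(RESPONSE, RESPONSE, MASK)   # decision token: streaming target only
            for w in timeline[k][1]:
                emit(w, MASK, w)             # response text: LM target only
            emit(EOS, MASK, EOS)
            k += 1
        else:
            emit(SILENT, SILENT, MASK)       # stay silent after this chunk
    return X, y_stream, y_lm


X, y_stream, y_lm = build_streaming_sample(
    instruction=["<ins>"],
    chunk_features=[["a1"], ["a2"], ["a3"]],
    timeline=[(3, ["Hello"])],
)
print(X)
```

In the toy call the response is scheduled at chunk 3, so chunks 1 and 2 are each followed by a supervised `<silent>`, and the response text is supervised only in `y_lm`.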

The Streaming Training Paradigm

How the model learns to listen, decide, and speak continuously.

Training a model that must decide in real time when to speak faces two practical problems: (1) it can lose earlier context in long streams, and (2) it may fire a response to irrelevant sounds.

The model treats the audio stream as an endless sequence and decides at each moment whether to keep listening or to emit a response, like a listener who flips a switch from “watch” to “talk” once enough of the story is heard.

How does the Streaming‑Native Paradigm differ from the offline LALM approach used in prior work?

Offline LALMs wait for a complete audio clip before producing any output, whereas the streaming paradigm makes a binary decision after each fixed‑size chunk, allowing immediate continuation of listening or instant response generation.

Chunk 1 arrives → model predicts <silent> because evidence is insufficient.

Chunk 2 arrives → still not enough context → predicts <silent> again.

Chunk 3 arrives → accumulated cues cross the decision threshold → predicts <response>.

Model switches to generation mode and outputs the token “Hello”.

This walk‑through shows how the model defers speaking until enough acoustic evidence is gathered, avoiding premature interruptions.

At each step the model computes $d_t, r_t = f_{\text{det}}(a_t, C_t)$. If $d_t = \langle\text{silent}\rangle$, $r_t$ is empty and the stream continues; otherwise $r_t = f_{\text{resp}}(a_t, C_t)$ generates the textual response.
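The decision rule above can be sketched as a plain loop. This is an illustrative sketch only: `interaction_loop`, `toy_detector`, and `toy_responder` are hypothetical names, the "audio chunks" are just numbers standing for accumulated evidence, and the threshold of 1.0 is an arbitrary assumption. In the real model both functions are the same network.

```python
SILENT, RESPONSE = "<silent>", "<response>"


def interaction_loop(chunks, f_det, f_resp):
    """Run perceive-decide-respond over a finite list of chunks."""
    context = []                             # C_t: everything heard and said so far
    outputs = []
    for t, a_t in enumerate(chunks, start=1):
        context.append(("audio", a_t))       # perceive
        d_t = f_det(a_t, context)            # decide
        if d_t == SILENT:
            continue                         # r_t is empty, keep listening
        r_t = f_resp(a_t, context)           # respond
        outputs.append((t, r_t))
        context.append(("text", r_t))
    return outputs


def toy_detector(a_t, context):
    # Sum the evidence heard since the most recent response.
    evidence = 0.0
    for kind, value in reversed(context):
        if kind == "text":
            break
        evidence += value
    return RESPONSE if evidence >= 1.0 else SILENT


def toy_responder(a_t, context):
    return "Hello"


outputs = interaction_loop([0.3, 0.4, 0.5, 0.1], toy_detector, toy_responder)
print(outputs)
```

With these toy inputs the loop stays silent for two chunks, responds at the third, and returns to listening at the fourth, mirroring the three-chunk walk-through.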

Format training: use offline data to teach the model the sequence format and the use of `<Spe_token>` (samples of the form (`A_ins`, `A_in` → T)).

Adapter training: learn a mapping from chunk‑wise acoustic representations into the language‑model space while keeping the same format.

Large‑scale streaming supervised training: jointly optimize adapter and LM on core capabilities (audio understanding, ASR, spoken dialogue) with both instruction‑only and instruction‑plus‑audio inputs.

Instruction‑following fine‑tuning: expose the model to interleaved sequences that require comprehension‑aware intervention and proactive response, e.g., (`A_ins`, `A_in`, T, …) and (`A_in` → T).

**Figure 3.** The training framework of SOUNDFLOW. Audio signals, intermediate representations, and supervision signals are organized into a unified temporal sequence, and a streaming training strategy jointly optimizes language modeling and response triggering, enabling AUDIO-INTERACTION to decide when to respond or remain silent across diverse real-time tasks.

FIFO-Scheduled Asynchronous Inference

How the streaming dataset is assembled and how inference stays stable.

Real‑time audio encoding together with the model’s special‑token silence response can cause waiting conflicts: the decoder may be idle while the encoder keeps feeding data, or it may block because the last token signals “keep listening”. This scheduling tension leads to inference stalls and high first‑frame latency.

The encoder pushes acoustic embeddings into a first‑in‑first‑out queue, and the decoder only pulls from the head of that queue when the previously emitted token was a silence or end‑of‑sentence marker.

How does this differ from a naïve FIFO buffer that the decoder simply reads every step?

In the naïve case the decoder would consume whatever is at the queue head regardless of the semantic state of the conversation, potentially interrupting a spoken response. FIFO‑Scheduled Asynchronous Inference adds a token‑based gate: the decoder only pulls when the last token signals “silence” or “end‑of‑sentence”, preserving turn‑taking semantics while still benefitting from FIFO ordering.

Encoder encodes $x_1\rightarrow a_1$ and appends to $Q$: $Q=[a_1]$.

Encoder encodes $x_2\rightarrow a_2$ and appends: $Q=[a_1,a_2]$.

Decoder sees $r_{t-1}=\langle\text{silent}\rangle$, flushes $Q$, extends cache $\mathcal{C}$ with $[a_1,a_2]$, and consumes them.

Decoder generates token $r_t$ (e.g., “yes”). Queue is now empty.

Encoder processes $x_3\rightarrow a_3$, appends: $Q=[a_3]$.

Decoder’s next token $r_{t+1}$ is a regular word, so it does not flush; $Q$ retains $a_3$ for the next silence turn.

The gate ensures that the model only “listens” when it is not speaking, preventing self‑interruptions and guaranteeing low‑latency resumption after a response.
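The gate can be illustrated with a small single-threaded simulation. This is a sketch under stated assumptions, not the paper's Algorithm 3: the real system runs the encoder and decoder concurrently, whereas here `push` and `step` are called by hand, and the class name `GatedDecoder` and the scripted `policy` are hypothetical.

```python
from collections import deque

SILENT, EOS = "<silent>", "<eos>"


class GatedDecoder:
    def __init__(self, policy):
        self.queue = deque()        # Q: encoded chunks waiting to be consumed
        self.cache = []             # C: everything the decoder has attended to
        self.last_token = SILENT    # r*: start in the listening state
        self.policy = policy        # maps the cache to the next token

    def push(self, embedding):
        """Encoder side: append to Q without waiting for the decoder."""
        self.queue.append(embedding)

    def step(self):
        """Decoder side: flush Q into C only when not in the middle of a response."""
        if self.last_token in (SILENT, EOS):
            if not self.queue:
                return None         # listening, but no new audio yet
            while self.queue:       # FIFO order is preserved
                self.cache.append(self.queue.popleft())
        self.last_token = self.policy(self.cache)
        self.cache.append(self.last_token)
        return self.last_token


script = iter(["yes", "please", EOS, SILENT])
dec = GatedDecoder(policy=lambda cache: next(script))

dec.push("a1")
dec.push("a2")
first = dec.step()                  # gate open: flush a1, a2, then speak
dec.push("a3")
second = dec.step()                 # mid-response: a3 stays queued
queued_while_speaking = list(dec.queue)
third = dec.step()                  # emits <eos>
fourth = dec.step()                 # gate open again: a3 is consumed
idle = dec.step()                   # nothing queued, nothing decoded
print(dec.cache)
```

The design point the sketch tries to show is that the encoder never blocks: chunks that arrive during a response simply wait in the queue until the decoder emits an end-of-sentence or silence token.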

FIFO‑Scheduled Asynchronous Streaming Inference (simplified).

**Figure 4.** SoundFlow's FIFO-scheduled asynchronous streaming inference. Audio chunks are appended to a temporal queue; decoding is triggered when the decoder is not speaking.

Data Collection: aggregate ~1.64 M foundational items (~8,900 h) from public corpora (MOSS, CommonVoice, GigaSpeech, LibriSpeech, VoxPopuli, CoVoST2, AISHELL, FMA, AudioSet) and add ~171 k acoustic‑event clips plus diverse noise sources.

Preprocessing: synthesize speech from textual sources using multi‑voice CosyVoice, then verify each utterance via LLM rewriting and ASR checking to ensure fidelity.

Sequence Concatenation: stitch validated utterances into multi‑turn streaming sequences per Section 3.2, overlaying dual‑track environmental noise to simulate real‑world conditions.

Token‑level Annotation: convert each streaming sequence into ⟨input ids, labels⟩ pairs, marking special tokens (

Performance and Benchmarks

Key performance gains of AUDIO‑INTERACTION across streaming and speech tasks.

AUDIO‑INTERACTION preserves audio understanding under streaming training, achieving 58.15 % accuracy on audio‑instruction MMAU and surpassing the Qwen2.5‑Omni‑3B baseline.

Table 1 shows 58.15 % for AUDIO‑INTERACTION versus 57.53 % for Qwen2.5‑Omni‑3B, while matching several 7 B systems despite a smaller 3 B parameter count.

**Table 1.** Results on the MMAU benchmark under text and audio instructions across three audio domains. Stream. and Multi-turn indicate streaming and multi-turn training support (- indicates not applicable).

**Table 2.** Performance score ($\uparrow$) on four spoken-dialogue benchmarks.

**Table 4.** Results on the Proactive-Sound-Bench. Equip. stands for Equipment. Sin. and Mul. denote Single-round and Multi-round respectively. Best and second-best results are highlighted.

**Figure 9.** Capability stability of AUDIO-INTERACTION as the stream extends from 1 to 5 concatenated segments. We report MMAU average accuracy, dialogue accuracy, and end-to-end latency.

Structural Analysis of Streaming

We quantify how streaming bridges the offline gap and isolate the key attention head driving silent‑vs‑response decisions.

The streaming architecture unifies discrete audio chunks into a continuous representation and routes the silent‑vs‑response decision through a single attention head.

Continuity ratio at the early decoder layer rises to $0.80$, eliminating most of the fragmentation caused by independent chunk embeddings.

Figure 7 shows encoder output continuity at $0.25$ and a projector shift of less than $0.02$, while GPT Layer 0 lifts continuity to $0.80$ in a single step.

**Scenario 1. Home: Weekend Childcare** A continuous 30-second household stream. AUDIO-INTERACTION listens every 0.4 s and decides whether to remain silent or speak; five of the seven StreamAudio-2M task categories are exercised in this single scene.

**Scenario 2. Office: Workday** A continuous 60-second office stream. AUDIO-INTERACTION listens every 0.4 s and decides whether to remain silent or speak; this scene exercises five of the seven StreamAudio-2M task categories.

Ablation Study

Ablations quantify how each design choice impacts latency, accuracy, and overall interaction quality.

We run four ablations, each removing a single component of the AUDIO‑INTERACTION system to measure its impact on latency, stall rate, and downstream metrics.

FIFO scheduling cuts first‑chunk latency by more than half and eliminates stalls.

Table 5 shows the “OURS” configuration (with FIFO) achieves 392 ms latency and 0 % stall versus 831 ms and 5.2 % without FIFO.

**Table 5.** Effect of asynchronous inference.

Removing TFJP preprocessing drops trigger accuracy by 7.1 percentage points.

Table 6 compares variant V2 (full streaming SFT) with V3 (w/o TFJP); trigger accuracy falls from 92.4 % to 85.3 %.

Omitting hierarchical event selection reduces trigger accuracy by 3.9 points relative to the full model.

Table 6 shows variant V4 (w/o event selection) at 88.5 % versus 92.4 % for the full V2.

**Table 6.** Ablation on streaming model training.

A 0.4 s chunk yields 392 ms latency while preserving accuracy.

Table 7 reports 392 ms latency and MMAU ≈ 58.2 for the 0.4 s setting, outperforming both smaller (0.2 s) and larger (0.6–0.8 s) chunks.

**Table.** Configuration and training parameters for the four stages of the model.

Setting $\lambda$ = 1.0 yields the highest trigger accuracy (96.9 %).

Table 8 reports 96.9 % trigger accuracy at $\lambda$ = 1.0, compared with 95.3 % at $\lambda$ = 0.5 and 96.7 % at $\lambda$ = 2.0.

**Table 8.** Effect of dual-loss weight $\lambda$.
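The text reports the effect of $\lambda$ but does not spell out the loss. One plausible reading, assumed here and not confirmed by the source, is $\mathcal{L} = \mathcal{L}_{\text{LM}} + \lambda\,\mathcal{L}_{\text{stream}}$, where each term is a cross-entropy averaged over the positions that are not masked in the corresponding target. The sketch below uses hypothetical helper names and hand-written probabilities in place of model outputs.

```python
import math

MASK = None   # assumed marker for unsupervised positions


def masked_nll(probs, labels):
    """Mean negative log-likelihood over positions whose label is not MASK."""
    terms = [-math.log(p[y]) for p, y in zip(probs, labels) if y is not MASK]
    return sum(terms) / len(terms) if terms else 0.0


def dual_loss(probs, y_stream, y_lm, lam=1.0):
    """Assumed form: LM loss plus lambda times the trigger (streaming) loss."""
    return masked_nll(probs, y_lm) + lam * masked_nll(probs, y_stream)


# Four positions: two decision tokens followed by two response tokens.
probs = [
    {"<silent>": 0.5, "<response>": 0.5},
    {"<silent>": 0.5, "<response>": 0.5},
    {"Hello": 0.25, "<eos>": 0.75},
    {"Hello": 0.75, "<eos>": 0.25},
]
y_stream = ["<silent>", "<response>", MASK, MASK]
y_lm = [MASK, MASK, "Hello", "<eos>"]

for lam in (0.5, 1.0, 2.0):
    print(lam, round(dual_loss(probs, y_stream, y_lm, lam), 4))
```

Under this reading, a larger $\lambda$ shifts optimization pressure toward the silent-versus-response decision and away from the response text.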

Case Studies and Conclusion

Real‑world case studies confirm AUDIO‑INTERACTION’s streaming gains and low latency.

AUDIO‑INTERACTION attains a perfect 100 % trigger / response rate on the cat‑meow cue, while the next‑best streaming baselines reach only 66.7 %.

Figure 10 shows the per‑model percentages for the cat‑meow scenario.

**Figure 10.** Case studies show AUDIO-INTERACTION’s gains over SOTA streaming models. In the second case, other models detect the cat mostly through the transcribed word "meow", while AUDIO-INTERACTION handles the audio cue directly via native streaming training.

On four real‑world recording scenarios, AUDIO‑INTERACTION’s trigger accuracy drops by only 3.1 percentage points compared with synthetic streams (58.9 % vs 62.0 %).

Section A.1 reports average accuracies across Travel, Work, Home, and Commute recordings.

Travel and Commute recordings exhibit the largest accuracy loss, driven by dense crowd ambience and non‑stationary noise that raise ASR word‑error‑rate to roughly 7.9 % and 8.6 % respectively; Work remains closest to synthetic performance, while Home preserves trigger accuracy but incurs a few false positives from benign kitchen sounds.

Appendix: Data Construction Details

Describes the TFJP pipeline that cleans and segments raw audio into streaming‑ready clips.

The TFJP module (Section 3.2) prepares raw recordings for streaming by applying a fixed‑size STFT and then passing the spectrogram through a cascade of six operators.

`SILENCE_CUT` removes silent segments longer than the silence limit $\tau$ by gating at the 10th percentile of frame energy.

`NOISE_PROFILE` computes a stationary noise spectrum from the lowest‑energy 5 % of frames.

`DENOISE` performs spectral subtraction using a gating coefficient $\gamma$ = 1.0.

`CORE_LOCATE` selects the contiguous span that maximizes a normalized energy‑over‑spectral‑entropy score.

`BOUNDARY_NORM` snaps the selected span to the nearest $\delta$ = c/2 = 200 ms boundary.

`SPEC_SMOOTH` applies a Hann taper of length $\omega$ = 20 ms at both ends of the span.

The default silence threshold is $\tau$ = 300 ms, the streaming chunk size is c = 400 ms, and the iteration cap K = 3 (Algorithm 1) is triggered on fewer than 2 % of clips during corpus construction.
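Two of the operators reduce to simple arithmetic and can be sketched directly. The code below is an illustration under assumptions, not the paper's implementation: the function names are hypothetical, ties in the snapping rule are rounded up, the span is kept at least one $\delta$ long, and the taper is written as a time-domain gain envelope even though the operator is described as acting on the spectrogram.

```python
import math

DELTA_MS = 200    # delta = c / 2 for chunk size c = 400 ms
OMEGA_MS = 20     # omega: taper length at each end of the span


def snap_to_boundary(t_ms, delta_ms=DELTA_MS):
    """Round a timestamp to the nearest multiple of delta_ms (ties round up)."""
    return int(math.floor(t_ms / delta_ms + 0.5)) * delta_ms


def boundary_norm(start_ms, end_ms, delta_ms=DELTA_MS):
    """Snap both ends of a span; keep at least one delta of audio (assumption)."""
    start = snap_to_boundary(start_ms, delta_ms)
    end = snap_to_boundary(end_ms, delta_ms)
    if end <= start:
        end = start + delta_ms
    return start, end


def hann_taper(num_samples, taper_samples):
    """Gain envelope: half-Hann fade-in, flat middle, half-Hann fade-out."""
    gains = [1.0] * num_samples
    n = min(taper_samples, num_samples // 2)
    for i in range(n):
        g = 0.5 * (1.0 - math.cos(math.pi * i / n))
        gains[i] = g
        gains[num_samples - 1 - i] = g
    return gains


span = boundary_norm(730, 1290)
envelope = hann_taper(10, 3)
print(span, [round(g, 2) for g in envelope])
```

At an assumed 16 kHz sample rate, $\omega$ = 20 ms corresponds to `taper_samples = 320`; the tiny values in the example are only there to keep the output readable.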

Appendix: Inference Implementation

Details the streaming inference pipeline and token‑label construction for AUDIO‑INTERACTION.

Algorithm 4 builds a single long‑form streaming waveform by concatenating foreground clips sequentially, re‑applying TFJP at each junction, and mixing the foreground, background, and ambient tracks at gains of 0 dB, −6 dB, and −12 dB respectively.

Two independent noise tracks—one event‑like, one ambient—are tiled across the full duration, cross‑faded at boundaries, and mixed at signal‑to‑noise ratios sampled from $P_{\text{snr}}=U(5,20)$ dB; the ambient track is kept 5 dB quieter to emulate real recordings.
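Mixing at a target signal-to-noise ratio comes down to one gain formula: to reach an SNR of $s$ dB, the noise is scaled by $g = \sqrt{P_{\text{signal}} / (P_{\text{noise}} \cdot 10^{s/10})}$. The sketch below applies that formula to two noise tracks. It makes assumptions the source does not state: a single SNR draw is shared by both tracks, "5 dB quieter" is implemented as an SNR 5 dB higher for the ambient track, and tiling and cross-fading are omitted. All function names are hypothetical.

```python
import math
import random


def power(x):
    """Mean squared amplitude of a list of samples."""
    return sum(v * v for v in x) / len(x)


def noise_gain_for_snr(signal, noise, snr_db):
    """Gain g such that 10 * log10(P_signal / P_(g * noise)) equals snr_db."""
    return math.sqrt(power(signal) / (power(noise) * 10 ** (snr_db / 10)))


def mix_dual_track(signal, event_noise, ambient_noise, rng, ambient_offset_db=5.0):
    snr_db = rng.uniform(5.0, 20.0)    # P_snr = U(5, 20) dB
    g_event = noise_gain_for_snr(signal, event_noise, snr_db)
    g_ambient = noise_gain_for_snr(signal, ambient_noise, snr_db + ambient_offset_db)
    mixed = [
        s + g_event * e + g_ambient * a
        for s, e, a in zip(signal, event_noise, ambient_noise)
    ]
    return mixed, snr_db


signal = [1.0, -1.0, 1.0, -1.0]
event = [2.0, -2.0, 2.0, -2.0]
ambient = [0.5, 0.5, -0.5, -0.5]

g = noise_gain_for_snr(signal, event, 10.0)
achieved = 10 * math.log10(power(signal) / power([g * v for v in event]))
mixed, snr_db = mix_dual_track(signal, event, ambient, random.Random(0))
print(round(achieved, 6), round(snr_db, 2))
```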

The resulting pair $(y,T)$ is exactly what Algorithm 2 expects: the waveform $y$ is sliced into 400 ms chunks, each chunk is encoded, and the encoded features are merged with the response timeline $T$ (written $\mathcal{R}$ in Algorithm 2, where $T$ instead counts chunks) to form the $\langle X,\,y^{\text{stream}},\,y^{\text{LM}}\rangle$ training tuple.

The same streaming routine is reused for all seven StreamAudio‑2M task categories; the only variation is which timestamps in $T$ carry a non‑empty response (e.g., ASR writes one entry per chunk, voice‑chatting writes at turn boundaries, proactive response writes only at safety‑critical events).

Algorithm 2 tokenizes each audio chunk $a_t$, appends encoder features to $X$, and constructs parallel streaming and language‑model targets $y^{\text{stream}}$ and $y^{\text{LM}}$ by inserting special tokens (⟨response⟩, ⟨eos⟩, ⟨silent⟩) and masking as dictated by the response schedule $\mathcal{R}$.

Algorithm 3 runs FIFO‑Scheduled Asynchronous Streaming Inference by spawning an encoder loop and a decoder loop that operate concurrently on the incoming audio stream $x_{1:\infty}$, sharing a KV‑cache $C$ and a response queue $Q$ while tracking the last emitted token $r^{*}$, initialized to $\langle\text{silent}\rangle$.

Appendix: Curation Pipeline Details

Describes the three‑stage prompt pipeline and the dual‑track streaming algorithm that assembles curated audio interactions.

The curation pipeline turns a loose bag of audio annotations into a coherent, streaming‑ready interaction. It proceeds in three deterministic stages: planning a plausible scenario, refining each sub‑event into retrieval‑ready queries, and verifying that candidate clips fit acoustically.

Dual‑Track Streaming Sequence Composition (Algorithm 4)

Stage 1 – Scenario Planning asks the model to compose a single real‑world scene that respects temporal ordering, acoustic compatibility, and role consistency. The output is a JSON object containing a one‑sentence description, an ordered list of sub‑events (foreground, background, ambient), and explicit constraints.

Stage 2 – Event Refinement expands each sub‑event into a concrete retrieval query and a fallback caption. Queries are 4–12 words long and include material, surface, and intensity cues so that an AudioSet‑style engine can discriminate near‑confusables.

Stage 3 – Clip Grounding Verification receives a candidate clip and decides whether it can be inserted without breaking acoustic consistency. The verifier checks identity, cleanliness, duration fit, and continuity, preferring “reprocess” over “accept” when uncertain.
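To make the Stage 1 and Stage 2 outputs concrete, the sketch below shows what a planned scenario with refined queries might look like, plus a check of the 4–12 word query rule. Everything here is hypothetical: the JSON keys, the example scene, and the helper names are invented for illustration and are not the paper's schema, and the two stages are merged into one object for brevity.

```python
import json

plan_json = """
{
  "description": "A parent cooks dinner while a toddler plays in the next room.",
  "sub_events": [
    {"track": "foreground", "event": "toddler starts crying",
     "query": "toddler crying loudly in a carpeted room"},
    {"track": "background", "event": "pan sizzling",
     "query": "oil sizzling in a metal pan, moderate"},
    {"track": "ambient", "event": "kitchen hum",
     "query": "quiet refrigerator hum in a small kitchen"}
  ],
  "constraints": ["crying begins after the sizzling has started"]
}
"""

TRACKS = {"foreground", "background", "ambient"}


def query_ok(query):
    """Stage 2 rule from the text: queries are 4 to 12 words long."""
    return 4 <= len(query.split()) <= 12


def check_plan(plan):
    """Return a list of human-readable problems; an empty list means the plan passes."""
    problems = []
    for i, ev in enumerate(plan["sub_events"]):
        if ev["track"] not in TRACKS:
            problems.append(f"sub_event {i}: unknown track {ev['track']!r}")
        if not query_ok(ev["query"]):
            problems.append(f"sub_event {i}: query length out of range")
    return problems


plan = json.loads(plan_json)
problems = check_plan(plan)
print(problems)
```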

Comprehension‑Aware Supervision adds two auxiliary prompts. The History‑Review Question Generation prompt creates a follow‑up question that forces the model to retain information from at least three turns earlier. The Silent‑Audio Verification prompt labels a clip as “respond” only for safety‑critical sounds, otherwise defaulting to “silent”.

Appendix: Dataset Sources

Details of the public corpora and auxiliary sources that compose the StreamAudio‑2M dataset.

StreamAudio‑2M is built from a curated pool of publicly available corpora, each chosen to supply a specific capability for streaming interaction. We prefer well‑established sources over scraped or proprietary collections to keep the dataset reproducible and to let the streaming pipeline handle the heavy transformations.

The speech‑centric portion draws from CommonVoice, GigaSpeech, LibriSpeech, VoxPopuli, CoVoST‑2 (both directions), AISHELL, and three variants of AudioSet, contributing tens of thousands of items and hundreds of hours of audio. Table 9 (in the paper) lists each source’s item count and hour contribution, and the roles they play in the streaming regime are described below.

Speech‑centric sources support four offline capabilities that the streaming model must inherit: spoken dialogue, streaming ASR, speech‑to‑text translation, and audio question answering. MOSS provides the largest block of dialogue supervision; LibriSpeech is re‑segmented into 400 ms chunks to align with the AUDIO‑INTERACTION listening phase; CoVoST‑2 supplies bidirectional English‑Chinese translation pairs, both in their original offline form and stitched into continuous streams for simultaneous interpretation.

Acoustic‑event sources combine real recordings from AudioSet with synthetic clips generated by AudioX and ElevenLabs to fill sparsely covered safety‑critical categories. Together they yield 171 k clips that span the full Proactive‑Sound‑Bench taxonomy, ensuring every evaluation category appears during training.

Background noise is overlaid on every long‑form stream using three established corpora: MUSAN for music, ambient and speech babble; WHAM! for urban and reverberant scenes; and DNS Challenge for diverse environmental noise. This adds 620 hours of background at controlled SNRs, teaching the model to suppress responses to non‑foreground sound.

The assembled corpus also includes auxiliary resources such as UltraChat, Magpie Pro, DU QA, COIG CQIA, Web QA, and BellGroup, which provide additional conversational and factual grounding for the model’s multi‑task abilities.

Appendix: Proactive Sound Bench

Defines the Proactive‑Sound‑Bench task and its comprehensive sound taxonomy.

Proactive‑Sound‑Bench is a benchmark that asks a model to listen continuously, decide instantly whether to speak, and, if so, generate a helpful spoken reply.

Proactive‑Sound‑Bench groups test scenarios into six macro domains: Human Sound Signals, Daily Living Sounds, Nature & Environment, Equipment, Traffic, and Music. Human Sound Signals cover physiological cues such as crying, breathing, or distress vocalizations; Daily Living Sounds capture routine household activities; Nature & Environment includes weather and ecological sounds; Equipment spans household appliances and industrial tools; Traffic comprises vehicle and road noises; Music focuses on instrument performances and out‑of‑tune failures.

Questions & answers

What is the main contribution of the Audio-Interaction paper?

The paper introduces Audio-Interaction, a model that processes audio in fixed-length 400ms chunks and emits either a <silent> or <response> token at each step to decide when to speak, enabling real-time streaming interaction rather than offline batch processing. It is supported by the SOUNDFLOW framework, the STREAMAUDIO-2M training corpus, and the PROACTIVE-SOUND-BENCH evaluation benchmark.

What problem does Audio-Interaction address?

Current Large Audio Language Models (LALMs) operate offline, requiring a complete audio clip before generating any response, which prevents them from handling the continuous, interactive nature of real-world audio. The paper identifies two specific gaps: the inability to decide semantically when to respond at each chunk (C1), and the loss of long-range temporal context across chunked inputs (C2).

Why is a new framework needed instead of simply streaming audio to existing LALMs?

Existing LALMs are designed for offline input-output mappings and lack the ability to decide when to respond based on unfolding context. Simply streaming audio to them does not solve the problems of false triggering or the inability to maintain long-range context across chunked inputs.

How does the core perceive-decide-respond mechanism work?

At each 400ms chunk, the model computes a decision token dt and a response rt via a detection function over the current audio chunk and accumulated context. If dt equals <silent>, the response is empty and listening continues; if dt equals <response>, the model generates a textual reply using a response function over the same inputs.

What is the SOUNDFLOW framework and what does it do?

SOUNDFLOW is the training and deployment framework for Audio-Interaction that performs three functions: it synthesizes interaction data via hierarchical event curation and a time-frequency joint preprocessing (TFJP) module, trains the model with a chunk-level sequential decision objective to mitigate context forgetting, and deploys FIFO-scheduled asynchronous inference that decouples encoding from decoding.

How does FIFO-Scheduled Asynchronous Inference work and why is it needed?

FIFO-Scheduled Asynchronous Inference spawns concurrent encoder and decoder loops sharing a KV-cache and response queue, but adds a token-based gate so the decoder only pulls from the queue when the last token signals <silent> or end-of-sentence. This prevents the decoder from interrupting an ongoing spoken response and cuts first-frame latency by 4.5× compared to a naïve approach.

What is the STREAMAUDIO-2M dataset?

STREAMAUDIO-2M is a 2.6 million-item training corpus covering seven core streaming audio abilities and 28 sub-tasks, built from publicly available sources including CommonVoice, GigaSpeech, LibriSpeech, VoxPopuli, CoVoST-2, AISHELL, AudioSet, MUSAN, WHAM!, and DNS Challenge, among others. It is designed to be reproducible by relying on established rather than scraped or proprietary collections.

What is PROACTIVE-SOUND-BENCH and what does it evaluate?

PROACTIVE-SOUND-BENCH is a benchmark of 644 human-designed acoustic events that evaluates a model's ability to provide proactive assistance without explicit instructions, requiring the model to correctly trigger a response or abstain based on environmental context. It spans six macro domains: Human Sound Signals, Daily Living Sounds, Nature & Environment, Equipment, Traffic, and Music.

What are the key experimental results reported in the paper?

Across eight benchmarks, Audio-Interaction preserves competitive performance on standard audio tasks while enabling new capabilities such as proactive assistance and real-time instruction following. In error analysis, Travel and Commute recordings show the largest accuracy loss with ASR word-error-rates of roughly 7.9% and 8.6% respectively, while Work environments remain closest to synthetic performance.

What are the limitations of Audio-Interaction acknowledged in the paper?

The paper notes that Travel and Commute acoustic environments cause the largest performance degradation due to dense crowd ambience and non-stationary noise, and that Home environments incur false positives from benign kitchen sounds. The paper does not claim to fully solve context forgetting or false triggering in all conditions.

How does Audio-Interaction differ from prior streaming or duplex audio models?

Audio-Interaction shares chunk-wise sequential ingestion with full-duplex spoken dialogue models but must additionally reason over full-audio context, environmental sounds, paralinguistic cues, and explicit user instructions, making its intervention policy substantially richer. The paper states that despite rapid progress, existing audio large models are all offline and none can continuously understand sound, environment, and instructions in real time.

What are the training infrastructure and computational requirements?

Training runs on 32 NVIDIA H100 80GB GPUs using bf16 mixed precision and DeepSpeed ZeRO-2 sharding, and the full four-stage training recipe consumes roughly ten days of wall-clock time. The four stages vary in which parameters are trained and in batch size, gradient accumulation, learning rate, and number of steps.

What are the key hyperparameters and constants used in the system?

Fixed method-level constants include chunk size c = 400ms, fade window ω = 20ms, half-chunk alignment δ = 200ms, dual-loss weight λ = 1.0, and maximum stream length L_max = 60 chunks (24 seconds). Data-level constants include WER threshold τ_wer = 0.10, ASR retries R = 2, and SNR distribution P_snr = U(5, 20) dB.

How is training data constructed for the streaming paradigm?

Long-form streaming waveforms are built by concatenating foreground clips, re-applying the TFJP module at each junction, and mixing in background and ambient clips at gains of 0 dB, −6 dB, and −12 dB. Two independent noise tracks are mixed at SNRs sampled from U(5, 20) dB, with the ambient track kept 5 dB quieter to emulate real recordings, and the resulting waveform is sliced into 400ms chunks for training.

What is the curation pipeline for building training scenarios?

The curation pipeline proceeds in three deterministic stages: Stage 1 plans a plausible real-world scenario with temporal ordering and acoustic compatibility constraints; Stage 2 refines each sub-event into a 4–12 word retrieval query with material, surface, and intensity cues; Stage 3 verifies candidate clips for acoustic consistency, preferring reprocessing over acceptance when uncertain.

What is the Time-Frequency Joint Preprocessing (TFJP) module?

The TFJP module prepares raw recordings for streaming by applying a fixed-size Short-Time Fourier Transform (STFT) and then passing the resulting spectrogram through a cascade of six operators. It is re-applied at every junction when concatenating clips to build long-form streaming waveforms.

Where was the paper published and who are the authors?

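The paper is available on arXiv at arxiv.org/abs/2606.05121. Its authors are Zhifei Xie, Zihang Liu, Ze An, Xiaobin Hu, Yue Liao, Ziyang Ma, Dongchao Yang, Mingbao Lin, Deheng Ye, Shuicheng Yan, and Chunyan Miao. The provided text does not specify a conference venue.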

Key terms

LALM (Large Audio Language Model): A large language model extended to process audio inputs, typically designed for tasks like speech recognition, audio question answering, or voice chat, but conventionally operating on complete audio clips rather than streams.
SOUNDFLOW: The end-to-end framework introduced in the paper that handles streaming-native data construction, chunk-level sequential decision training, and asynchronous inference for Audio-Interaction.
STREAMAUDIO-2M: A 2.6 million-item training corpus covering seven core streaming audio abilities and 28 sub-tasks, built from publicly available audio datasets to train the Audio-Interaction model.
PROACTIVE-SOUND-BENCH: A benchmark of 644 human-designed acoustic events used to evaluate a model's ability to proactively respond to or abstain from reacting to environmental sounds without explicit user instructions.
<silent> token: A special output token the model emits when it decides not to respond after processing a 400ms audio chunk, allowing listening to continue uninterrupted.
<response> token: A special output token the model emits when it decides to generate a textual reply after processing a 400ms audio chunk.
FIFO-Scheduled Asynchronous Inference: An inference strategy that runs audio encoding and language model decoding concurrently in separate loops, using a token-based gate to ensure the decoder only processes new input when the conversation state permits, reducing first-frame latency.
TFJP (Time-Frequency Joint Preprocessing): A preprocessing module that converts raw audio into a spectrogram via STFT and applies a cascade of six operators to prepare audio chunks for streaming ingestion.
Chunk-level sequential decision: A training objective in which the model makes a binary silent-or-respond decision after each fixed-size audio chunk, treating the stream as a sequence of discrete decision points rather than a single offline input.
Comprehension-grounded decision mechanism: A method for deciding when to respond that is based on semantic understanding of the audio content and accumulated context, rather than purely acoustic or energy-based triggers.
KV-cache (Key-Value cache): A memory structure in transformer-based models that stores previously computed attention keys and values so they do not need to be recomputed for each new token, enabling efficient sequential generation.
DeepSpeed ZeRO-2: A distributed training optimization technique that shards optimizer states and gradients across GPUs to reduce memory usage and enable training of large models on multi-GPU clusters.
SNR (Signal-to-Noise Ratio): A measure of the relative level of a desired audio signal compared to background noise, expressed in decibels (dB), used here to control how loudly noise is mixed into training audio.
WER (Word Error Rate): A metric for automatic speech recognition accuracy that measures the fraction of words in a transcript that are incorrectly recognized, with lower values indicating better performance.
Proactive assistance: The ability of a model to autonomously decide to respond to an audio event based on its content and context, without waiting for an explicit user instruction or query.
Full-duplex spoken dialogue: A conversational audio system in which both the user and the model can speak and listen simultaneously, as opposed to turn-based systems where only one party is active at a time.
Comprehension-Aware Supervision: A training technique that adds auxiliary prompts requiring the model to answer questions about earlier conversation history and to correctly label safety-critical sounds, reinforcing long-range context retention and precise trigger behavior.
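
The chunk-level sequential decision and the <silent> / <response> tokens defined above can be pictured as a simple loop. The sketch below is a toy illustration, not the model's implementation: `run_stream`, `decide`, and `respond` are hypothetical names, strings stand in for encoded 400ms audio chunks, and the keyword rule replaces the learned, comprehension-grounded policy. In the actual model the decision token and the reply come from the same language model.

```python
from typing import Callable, Iterable, List, Tuple

SILENT = "<silent>"
RESPONSE = "<response>"


def run_stream(
    chunks: Iterable[str],
    decide: Callable[[List[str]], str],
    respond: Callable[[List[str]], str],
) -> List[Tuple[int, str]]:
    """Feed chunks one at a time; after each, either stay silent or reply.

    Returns (chunk_index, reply) pairs for every chunk that triggered a response.
    """
    context: List[str] = []  # everything heard so far
    replies: List[Tuple[int, str]] = []
    for i, chunk in enumerate(chunks):
        context.append(chunk)    # perceive
        token = decide(context)  # decide
        if token == RESPONSE:
            replies.append((i, respond(context)))  # respond
        elif token != SILENT:
            raise ValueError(f"unexpected decision token: {token!r}")
    return replies


# Toy stand-ins for the learned policy and decoder (assumptions for illustration).
def toy_decide(context: List[str]) -> str:
    return RESPONSE if "alarm" in context[-1] else SILENT


def toy_respond(context: List[str]) -> str:
    return f"heard {context[-1]} after {len(context) - 1} earlier chunks"


if __name__ == "__main__":
    stream = ["hum", "hum", "alarm", "hum"]
    print(run_stream(stream, toy_decide, toy_respond))
```

Passing the whole accumulated context to `decide` reflects the idea that the decision depends on everything heard so far, not only on the latest chunk.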
