Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer

Ke Lei, Yu Zhang, Changhao Pan, Xueyi Pu, Wenxiang Guo, Ruiqi Li, Zhou Zhao

SwanSphere generates streaming spatial audio from panoramic video and text using a decoupled autoregressive diffusion transformer.

How can we generate high-fidelity, synchronized spatial audio in real time from video and text without the latency bottlenecks of traditional diffusion models?

Existing spatial audio models force a choice between high-fidelity generation and low-latency streaming, often failing to maintain precise spatial alignment with panoramic visual inputs. SwanSphere decouples global semantic planning from local audio rendering: a causal language model predicts semantic conditions, while a localized diffusion transformer synthesizes high-fidelity spatial audio patches. This architecture achieves a first-chunk latency of 0.21 seconds, significantly outperforming previous autoregressive and global diffusion baselines in both spatial accuracy and responsiveness.

Paper Primer

The core mechanism hinges on a divide-and-conquer paradigm: the model uses a causal language model to capture global temporal and spatial structures, which then conditions a Local Diffusion Transformer (LocDiT) to synthesize continuous spatial audio. This LocDiT acts like a high-speed rendering engine—it takes the semantic "blueprint" from the language model and fills in the high-frequency acoustic details for the current patch while referencing previous audio context to ensure seamless transitions.

SwanSphere achieves superior spatial localization and semantic consistency compared to state-of-the-art baselines.

The model reduces the aggregate angular error to 1.03 (vs. 1.27 for OmniAudio) and achieves an FD of 120.28 (vs. 157.67 for OmniAudio) on the hybrid test set. It also delivers a roughly 30× speedup in initial response time over global diffusion baselines, with a first-chunk latency of 0.21 seconds.

To ensure the model understands the physical geometry of 360° video, the authors introduced Spatial Video-Audio Contrastive (SVAC) learning. This strategy uses four types of negative samples—including spatially rotated audio and horizontally rotated video—to force the encoders to learn orientation-sensitive features rather than just generic semantic labels.

Why is a two-stage "semantic planning + local rendering" approach necessary here?

Global diffusion models suffer from high first-frame latency due to multi-step denoising across the entire sequence. By decoupling the process, SwanSphere can generate a semantic condition for a local patch and render it immediately, enabling streaming performance without sacrificing global coherence.

What is the role of the automated annotation pipeline in this framework?

Existing datasets lack precise spatial labels for FOA audio. The pipeline uses acoustic intensity vectors to estimate sound source azimuth and elevation, which are then fed into an MLLM to generate natural-language spatial captions, allowing the model to learn from both visual and textual spatial cues.

SwanSphere demonstrates that streaming spatial audio is viable by separating high-level semantic planning from low-level acoustic synthesis. Researchers can now prioritize low-latency, spatially-aware audio generation for VR/AR applications without relying on computationally expensive global diffusion.

Abstract

SwanSphere streams high‑fidelity spatial audio from video and text via a diffusion transformer.

Real‑time and accurate spatial audio generation is pivotal for delivering an immersive experience. However, existing spatial audio synthesis technologies suffer from a quality‑latency trade‑off and struggle to capture precise spatial cues from multimodal inputs.

We propose SwanSphere, a unified streaming framework that generates high‑fidelity spatial audio from panoramic video and text prompts. Its core contributions are: (1) a causal autoregressive diffusion transformer enabling streaming high‑quality audio; (2) a Spatial Video‑Audio Contrastive (SVAC) learning strategy aligning video encoders with the acoustic domain, combined with a multi‑objective online direct preference optimization (ODPO) scheme for strong spatial perception; (3) an automated annotation pipeline that produces detailed spatial captions to mitigate dataset scarcity.

Experiments show SwanSphere outperforms prior methods on both video‑to‑spatial and text‑to‑spatial audio generation tasks. Demos are available at the provided URL.

Introduction

We expose the latency‑quality trade‑off that hampers streaming spatial audio generation.

Immersive VR/AR experiences require not only high‑fidelity sound but also tight audio‑visual spatial coherence. Existing video‑to‑stereophonic systems struggle to keep both quality and speed when modeling sound directionality in the ambisonic domain.

Generating spatial audio with full‑sequence attention yields excellent sound quality but forces a large reconstruction error budget and a long first‑frame latency, which breaks real‑time streaming.

Full‑sequence attention requires an $N \times N$ matrix per channel: $96{,}000^2 \times 4 \approx 3.7 \times 10^{10}$ entries.

Storing this matrix in 32‑bit float consumes $3.7 \times 10^{10} \times 4\text{ B} \approx 148\text{ GB}$ of GPU memory.

Even if the model could fit, each diffusion step must scan the entire matrix; with 50 denoising steps the compute cost exceeds real‑time constraints.

This concrete scale shows why naïve full‑attention diffusion is infeasible for streaming: the attention matrices alone need roughly 148 GB of memory, and latency grows far beyond a single video frame.
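The arithmetic above can be checked in a few lines. This is only a sanity check of the figures quoted in the text: the sequence length of 96,000, the factor of four channels, and the 4-byte float size come from the passage, while the function name is illustrative.

```python
def full_attention_footprint(seq_len: int, channels: int = 4, bytes_per_entry: int = 4):
    """Return (entries, gigabytes) for one dense seq_len x seq_len matrix per channel."""
    entries = seq_len * seq_len * channels
    gigabytes = entries * bytes_per_entry / 1e9  # decimal gigabytes, as in the text
    return entries, gigabytes


entries, gigabytes = full_attention_footprint(96_000)
print(f"{entries:.2e} entries, {gigabytes:.1f} GB")
```

The exact product is about 147.5 GB; the 148 GB figure in the text follows from rounding the entry count to 3.7 × 10^10 first.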

**Figure 1. Overview.** Left: The pipeline of audio caption generation. Middle: The streaming inference diagram of SwanSphere, which simultaneously supports panoramic video and textual descriptions as inputs. Right: Example results generated by SwanSphere. As shown above, our model accurately captures the spatial audio variation as the marching band moves from the front to the right side of the scene, manifested by a gradual decrease in audio intensity along the X-axis (front-back) and a gradual increase along the Y-axis (left-right). The example below also faithfully reproduces the immersive and enveloping sensation of a live musical performance in a concert hall.

Decoupling high‑level planning from low‑level acoustic rendering is essential for streaming‑ready spatial audio generation.

Related Work

Survey of prior video‑to‑audio methods and spatial audio generation approaches.

Video‑to‑audio generation has rapidly expanded, yet most works target mono or stereo output and ignore explicit spatial cues.

Current pipelines either use a latent diffusion model to produce a global audio sequence from video features, or they predict audio tokens autoregressively, but both treat the audio as a flat channel without spatial structure.

Standard latent diffusion model conditioned on video features for V2A synthesis.

Integrates contrastive audio‑visual pretraining with a diffusion V2A model to boost semantic consistency.

Uses mel‑spectrogram codebooks for autoregressive token generation in V2A.

Next‑token prediction paradigm guided by visual cues for V2A.

High‑frame‑rate video feature extractor aligns visual features with audio tokens temporally.

Causal decoder‑only transformer with a diffusion head for online low‑latency stereo audio generation.

Flow‑matching framework conditioned on multimodal inputs for audio synthesis.

Rectified flow matching with one‑step distillation to improve efficiency.

Joint modeling of audio, visual, and text modalities for video‑guided audio synthesis.

Generates binaural audio from natural‑language and image inputs.

Video‑guided binaural speech synthesis incorporating trajectory information.

Extracts acoustic field and semantic features from panoramic videos for audio synthesis.

Generates First‑Order Ambisonics by integrating camera parameters with visual cues.

End‑to‑end V2A framework that spatializes sound via object‑aware dynamic panning.

Audio language model combined with a latent diffusion editor for declarative stereo editing.

Collectively, these works demonstrate strong semantic alignment but leave a gap in streaming, high‑fidelity FOA generation from panoramic video.

The SwanSphere Framework

This section details the FOA foundation, contrastive learning, and the streaming SwanSphere generation pipeline.

High‑quality spatial audio generation traditionally suffers from a latency‑quality trade‑off: iterative diffusion yields fidelity but stalls streaming, while fast autoregressive models lack spatial coherence.

FOA packs a full 3‑D sound field into four channels: an omnidirectional pressure (W) and three directional velocity components (X, Y, Z).

Form a 4 × 2 matrix $a$ from the channels.

Apply the encoder $E$: each column is multiplied by a learned weight matrix $W_E\in\mathbb{R}^{128\times4}$ and passed through a non‑linearity, yielding $z\in\mathbb{R}^{128\times2}$.

Resulting latent frame 1 ≈ [0.12,…, 0.03] and frame 2 ≈ [0.15,…, ‑0.01] (values shown for illustration).

Continuous latents preserve the phase relationships across channels, which discrete codebooks would destroy.
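The toy encoding step above can be sketched as follows. This mirrors only the illustration in the text (a 128 × 4 weight matrix applied column by column, followed by a non-linearity); it is not the actual FOA-VAE. The random weights, the choice of `tanh`, and the input values are all assumptions made for the example.

```python
import math
import random


def encode_frames(a, w_e):
    """Compute z[:, t] = tanh(W_E @ a[:, t]) for each column t of a (channels x frames)."""
    n_frames = len(a[0])
    z = []
    for row in w_e:  # one row of W_E per latent dimension
        z.append([
            math.tanh(sum(weight * a[c][t] for c, weight in enumerate(row)))
            for t in range(n_frames)
        ])
    return z


rng = random.Random(0)
# Hypothetical weights: a trained encoder would learn these.
W_E = [[rng.gauss(0.0, 0.1) for _ in range(4)] for _ in range(128)]

# Rows are the W, X, Y, Z channels; columns are two consecutive frames (made-up values).
a = [
    [0.50, 0.48],
    [0.30, 0.25],
    [0.10, 0.15],
    [0.00, 0.02],
]

z = encode_frames(a, W_E)
print(len(z), len(z[0]))  # 128 2
```

Because every latent value is a continuous function of all four channels, small inter-channel differences survive the mapping instead of being snapped to a codebook entry.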

SVAC forces the video encoder to attend to regions that actually produce sound, while the audio encoder learns to align its directional cues with those visual regions.

How does SVAC differ from a standard CLIP‑based video‑audio contrastive loss?

CLIP aligns whole‑frame embeddings without preserving geometric layout; SVAC explicitly injects spatial negatives (audio rotation, video rotation) and aligns video frames to audio timesteps via nearest‑neighbor replication, forcing the model to respect direction and timing.
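A minimal sketch of how such spatial negatives and the frame-to-timestep alignment could be built is given below. It assumes FOA channels stored as plain lists, an equirectangular frame stored as a list of pixel rows, and the azimuth convention used elsewhere in this summary (positive = left). The function names and the direction of the column shift are illustrative choices, not taken from the paper.

```python
import math


def rotate_foa_yaw(w, x, y, z, theta):
    """Rotate an FOA sound field about the vertical axis by theta radians.

    W (omnidirectional) and Z (vertical) are unchanged; X and Y mix like a 2-D rotation.
    """
    c, s = math.cos(theta), math.sin(theta)
    x_rot = [c * xi - s * yi for xi, yi in zip(x, y)]
    y_rot = [s * xi + c * yi for xi, yi in zip(x, y)]
    return list(w), x_rot, y_rot, list(z)


def roll_equirect_columns(frame, shift):
    """Horizontally rotate a panoramic frame by circularly shifting its columns."""
    if not frame:
        return []
    k = shift % len(frame[0])
    return [row[-k:] + row[:-k] for row in frame]


def replicate_to_audio_steps(video_feats, n_audio_steps):
    """Nearest-neighbor replication of per-frame video features onto audio timesteps."""
    n_video = len(video_feats)
    return [
        video_feats[min(n_video - 1, t * n_video // n_audio_steps)]
        for t in range(n_audio_steps)
    ]


# A source straight ahead (X = 1, Y = 0) rotated by 90 degrees ends up on the Y axis.
w, x, y, z = rotate_foa_yaw([1.0], [1.0], [0.0], [0.0], math.pi / 2)
print(round(x[0], 6), round(y[0], 6))
print(replicate_to_audio_steps(["f0", "f1"], 4))  # ['f0', 'f0', 'f1', 'f1']
```

Pairing a video clip with its rotated audio (or the audio with a rotated video) yields a negative that is semantically identical but spatially wrong, which is the property the contrastive loss is meant to penalize.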

The framework splits generation into a fast semantic planner that predicts a high‑level condition for each time patch, followed by a lightweight local diffusion transformer that fills in the fine‑grained spatial audio.

Patch 1: language model emits $h_1$; LocDiT receives only $h_1$ (no history) and generates latent $z_1$.

Patch 2: $h_2$ arrives; LocDiT now sees $z_1$ as left boundary and $h_2$, producing $z_2$.

Patch 3: $h_3$ arrives; LocDiT uses $z_2$ (right boundary) and $z_1$ (left boundary) together with $h_3$ to generate $z_3$.

Patch 4: final $h_4$ is processed with $z_3$ and $z_2$ as context, yielding $z_4$.

Only two previously generated patches are ever needed as context, so LocDiT's context memory stays O(1) in the sequence length.

Streaming inference loop for SwanSphere
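The patch-by-patch walkthrough above can be sketched as a small generator. This is a structural sketch only: `render_patch` stands in for LocDiT and `conditions` for the stream of $h_t$ emitted by the language model, and both are hypothetical placeholders rather than the authors' interfaces. The two-patch context window follows the description in the text.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List, Sequence

Latent = List[float]


def stream_patches(
    conditions: Iterable[object],
    render_patch: Callable[[object, Sequence[Latent]], Latent],
    context_size: int = 2,
) -> Iterator[Latent]:
    """Yield one latent patch per semantic condition, keeping a bounded history."""
    history = deque(maxlen=context_size)  # oldest patch is dropped automatically
    for h_t in conditions:
        z_t = render_patch(h_t, tuple(history))
        history.append(z_t)
        yield z_t  # the patch can be decoded and played before the next one exists


# Stand-in renderer that records how much context it was given.
seen_context_sizes = []


def fake_locdit(h_t, context):
    seen_context_sizes.append(len(context))
    return [float(h_t)]


patches = list(stream_patches([1, 2, 3, 4], fake_locdit))
print(seen_context_sizes)  # [0, 1, 2, 2]
```

Because the history is a fixed-length deque, the renderer's input size does not depend on how many patches have already been produced.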

Why not generate the entire FOA sequence with a single diffusion model?

A monolithic diffusion model must run every denoising step over the full sequence before any audio can be emitted, so first‑chunk latency scales with the total length; the two‑stage autoregressive design keeps the denoising cost per patch constant while still capturing global semantics via $h_t$.

After generation, eight candidate audios are scored on spatial accuracy, semantic alignment, and perceptual fidelity; the best‑vs‑worst pair forms a preference that fine‑tunes the model toward the weighted reward.
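The pair-selection step can be sketched as below. The three score names follow the text, but the reward weights, the assumption that every score is "higher is better", and the data layout are hypothetical choices made for illustration; the summary does not state the actual weighting.

```python
def weighted_reward(scores, weights):
    """Combine per-objective scores into one scalar reward."""
    return sum(weights[name] * scores[name] for name in weights)


def best_worst_pair(candidates, weights):
    """Return (chosen, rejected): the highest- and lowest-reward candidates."""
    ranked = sorted(candidates, key=lambda c: weighted_reward(c["scores"], weights))
    return ranked[-1], ranked[0]


# Hypothetical weights; the real values are not given in this summary.
WEIGHTS = {"spatial": 0.4, "semantic": 0.3, "fidelity": 0.3}

# Eight made-up candidates that differ only in spatial accuracy.
candidates = [
    {"id": i, "scores": {"spatial": i / 7, "semantic": 0.5, "fidelity": 0.5}}
    for i in range(8)
]

chosen, rejected = best_worst_pair(candidates, WEIGHTS)
print(chosen["id"], rejected["id"])  # 7 0
```

The resulting (chosen, rejected) pair is what a preference-optimization loss would consume; the loss itself is outside the scope of this sketch.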

**Figure 2.** Overview of the SwanSphere framework. The left side illustrates the training pipeline based on the teacher forcing strategy, which supports both video and textual modalities during training. The upper-right section details our SVAC (Spatial Video-Audio Contrastive Learning) strategy for enhancing the Video Encoder's alignment capability. The lower-right section introduces the Multi-Objective Preference Alignment post-training pipeline of SwanSphere.

SwanSphere achieves low‑latency streaming spatial audio by decoupling semantic planning from local diffusion and by training with SVAC and ODPO.

Experimental Results

SwanSphere decouples semantic planning from acoustic rendering to enable low‑latency streaming generation.

We evaluate SwanSphere with a dual‑evaluation strategy that combines objective metrics (FD, KL, directional errors) and subjective human studies (MOS‑SQ, MOS‑AF) to assess both audio fidelity and cross‑modal alignment.

SwanSphere generates the first audio chunk in 0.21 s, about 96× faster than the 20.19 s reported for the nearest baseline, ViSAGe.

Table 1 reports ViSAGe’s inference time as 20.19 s while our model requires only 0.21 s for the first chunk.

**Figure 3.** Qualitative Comparison. The left column depicts sea waves positioned directly in front; our model generates distinct and rhythmic wave sounds. In the right column, featuring a marching band moving from the front toward the right side, the signal intensity of the X channel gradually decreases while the intensity of the Y channel increases accordingly.

Ablations and Dataset Details

Appendix details dataset construction, implementation specifics, and extra experimental results.

This appendix records the data pipeline, model‑training choices, and the extra experiments that support the main paper.

The automated spatial captioning pipeline extracts acoustic intensity vectors from the four FOA channels $W$, $X$, $Y$, $Z$, smooths the resulting azimuth–elevation trajectory, and feeds the JSON‑encoded path to Gemini 2.5 Pro together with the panoramic video.

DoA estimation computes $I_x=\operatorname{Re}\{W^{*}\!\cdot X\}$, $I_y=\operatorname{Re}\{W^{*}\!\cdot Y\}$, $I_z=\operatorname{Re}\{W^{*}\!\cdot Z\}$, averages them over 500 Hz–8 kHz to obtain the intensity vector $V=(V_x,V_y,V_z)$, then derives azimuth $=\operatorname{atan2}(V_y,V_x)$ and elevation $=\arcsin\!\bigl(V_z/\|V\|\bigr)$ for each 1‑s segment.
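A minimal sketch of this estimate follows, writing V for the band-averaged intensity vector. It assumes the STFT of one segment is already available as nested lists of complex bins (`channel[frame][bin]`) together with the bin centre frequencies; how the STFT is computed, and the function name, are not specified in the text and are assumptions here.

```python
import math


def estimate_doa(W, X, Y, Z, freqs, f_lo=500.0, f_hi=8000.0):
    """Return (azimuth_deg, elevation_deg) from FOA STFT bins of one segment."""
    band = [k for k, f in enumerate(freqs) if f_lo <= f <= f_hi]
    vx = vy = vz = 0.0
    count = 0
    for t in range(len(W)):
        for k in band:
            w_conj = W[t][k].conjugate()
            vx += (w_conj * X[t][k]).real  # I_x = Re{W* . X}
            vy += (w_conj * Y[t][k]).real  # I_y = Re{W* . Y}
            vz += (w_conj * Z[t][k]).real  # I_z = Re{W* . Z}
            count += 1
    if count == 0:
        raise ValueError("no STFT bins fall inside the analysis band")
    vx, vy, vz = vx / count, vy / count, vz / count
    norm = math.sqrt(vx * vx + vy * vy + vz * vz)
    if norm == 0.0:
        raise ValueError("intensity vector is zero; direction is undefined")
    azimuth = math.degrees(math.atan2(vy, vx))
    elevation = math.degrees(math.asin(max(-1.0, min(1.0, vz / norm))))
    return azimuth, elevation


# Synthetic plane wave from azimuth 90 degrees (left), elevation 30 degrees.
s = 1 + 1j
az, el = math.radians(90.0), math.radians(30.0)
W = [[s]]
X = [[s * math.cos(az) * math.cos(el)]]
Y = [[s * math.sin(az) * math.cos(el)]]
Z = [[s * math.sin(el)]]
azimuth_deg, elevation_deg = estimate_doa(W, X, Y, Z, freqs=[1000.0])
print(f"{azimuth_deg:.1f} {elevation_deg:.1f}")
```

The clamp before `asin` guards against floating-point values that land marginally outside [-1, 1].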

Trajectory smoothing converts each azimuth–elevation pair to a unit 3‑D vector, applies a moving‑average window of three samples, and re‑projects to angles to avoid discontinuities at ±180°.
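The smoothing step can be sketched as follows. The three-sample window and the vector round trip come from the text; the handling of the first and last samples (a shortened window) and the fallback when the averaged vector vanishes are assumptions.

```python
import math


def to_unit_vector(azimuth_deg, elevation_deg):
    """Convert an azimuth/elevation pair in degrees to a unit 3-D vector."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (math.cos(az) * math.cos(el), math.sin(az) * math.cos(el), math.sin(el))


def smooth_trajectory(angles_deg, window=3):
    """Moving-average a list of (azimuth, elevation) pairs in vector space."""
    vectors = [to_unit_vector(az, el) for az, el in angles_deg]
    half = window // 2
    smoothed = []
    for i, original in enumerate(angles_deg):
        lo, hi = max(0, i - half), min(len(vectors), i + half + 1)
        n = hi - lo
        mx = sum(v[0] for v in vectors[lo:hi]) / n
        my = sum(v[1] for v in vectors[lo:hi]) / n
        mz = sum(v[2] for v in vectors[lo:hi]) / n
        norm = math.sqrt(mx * mx + my * my + mz * mz)
        if norm == 0.0:
            smoothed.append(original)  # directions cancel out: keep the raw sample
            continue
        azimuth = math.degrees(math.atan2(my, mx))
        elevation = math.degrees(math.asin(max(-1.0, min(1.0, mz / norm))))
        smoothed.append((azimuth, elevation))
    return smoothed


# A source hovering around the rear, where azimuth flips between +179 and -179.
track = [(179.0, 0.0), (-179.0, 0.0), (179.0, 0.0)]
result = smooth_trajectory(track)
print([round(az, 2) for az, _ in result])
```

Averaging the raw azimuths of this track would give a value near 60°, which points in an entirely wrong direction; averaging the vectors keeps the estimate close to ±180°.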

Multimodal fusion prompts Gemini 2.5 Pro with a role‑playing instruction that enforces the azimuth convention (0° = front, + = left, – = right) and asks for a concise 150‑word spatial description.

The FOA‑VAE re‑uses Stable Audio VAE weights, drops the MS‑STFT stage, and assigns equal loss weight (¼) to each of the four channels.

Curriculum learning augments the spatial training set with 1 M non‑spatial audio samples converted to pseudo‑FOA by summing stereo to $W$ and placing a random channel difference into one of $X$, $Y$, $Z$.
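One possible reading of this conversion is sketched below. The text only says that stereo is summed into $W$ and a random channel difference goes into one of $X$, $Y$, $Z$; interpreting "random channel difference" as a randomly signed L − R signal, leaving the other two directional channels silent, and applying no gain normalization are assumptions of this sketch.

```python
import random


def stereo_to_pseudo_foa(left, right, rng=None):
    """Build a pseudo-FOA dict with keys W, X, Y, Z from a stereo pair."""
    rng = rng or random.Random()
    w = [l + r for l, r in zip(left, right)]     # omnidirectional sum
    diff = [l - r for l, r in zip(left, right)]  # inter-channel difference
    if rng.random() < 0.5:                       # assumed: random sign (L-R or R-L)
        diff = [-d for d in diff]
    target = rng.choice("XYZ")                   # which directional channel carries it
    silence = [0.0] * len(w)
    foa = {"W": w, "X": list(silence), "Y": list(silence), "Z": list(silence)}
    foa[target] = diff
    return foa


foa = stereo_to_pseudo_foa([1.0, 0.5], [0.0, 0.25], random.Random(0))
print(foa["W"], [name for name in "XYZ" if any(foa[name])])
```

Passing a seeded `random.Random` makes the channel assignment reproducible, which is convenient when the augmented set needs to be regenerated.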

**Table 6.** Out-of-distribution evaluation on YT360-Test.

The SELD‑based weighted cosine similarity (wCS) metric, reported in Table 5, further validates that our generated audio aligns with ground‑truth spatial cues.

Conclusion

Summarises SwanSphere’s contributions, limitations, and broader impact.

SwanSphere is introduced as a multimodal streaming framework that generates high‑fidelity First‑Order Ambisonics from panoramic videos and text prompts. By separating high‑level semantic planning from low‑level acoustic rendering through a divide‑and‑conquer approach, it achieves both detailed reconstruction and low‑latency inference.

The paper adds Spatial Video‑Audio Contrastive Learning (SVAC) and a multi‑objective ODPO fine‑tuning stage, which together boost orientation awareness and remove neural artifacts. These components are shown to significantly improve spatial accuracy and auditory fidelity beyond what global semantic alignment alone can achieve.

Extensive experiments demonstrate state‑of‑the‑art results on both video‑to‑spatial‑audio and text‑to‑spatial‑audio benchmarks, outperforming prior cascaded and unified models in semantic fidelity, spatial precision, and inference efficiency. Moreover, the streaming architecture enables real‑time immersive environments, delivering a Time‑to‑First‑Chunk latency of only 0.21 s.

Despite these gains, the current system has limitations: spatial captions focus on dominant sound sources, so complex multi‑source scenarios such as concerts are not fully captured, limiting fine‑grained spatial disentanglement. Future work will expand the dataset to include richer multi‑source scenes and explore robust generalisation to unseen recording setups.

Ablation studies reveal that removing the history encoder raises $FD$ from 120.28 to 128.15 and $KL$ from 1.31 to 1.42, confirming the importance of temporal conditioning. Reducing model capacity (SwanSphere‑M, SwanSphere‑S) consistently degrades semantic and spatial metrics, validating the necessity of the large 1.09 B‑parameter SwanSphere‑L. Compared with a DiT model of similar size, SwanSphere achieves roughly a 30× speed‑up while delivering better $FD$ and angular error.

The authors acknowledge the dual‑use nature of high‑fidelity spatial audio generation: while it can enrich VR/AR and multimedia creation, it also enables hyper‑realistic spatial deepfakes. To mitigate misuse, they recommend embedding watermarks in generated audio and restricting the open‑source license to non‑commercial research use. The released dataset has been carefully screened to remove personally identifiable information and harmful content.

This work was supported by the National Natural Science Foundation of China under Grant No. U25B2064.
