SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue

Ruiqi Li, Yu Zhang, Changhao Pan, Ke Lei, Xiang Yin, Cheng Yang

SwanVoice uses a flow-matching DiT and a pause-aware data pipeline to synthesize expressive, long-form multi-speaker dialogue.

How can we achieve high-fidelity, zero-shot speech synthesis that handles both long-form monologues and complex, multi-speaker dialogues without relying on manual phoneme conversion?

Standard text-to-speech models struggle with long-form dialogue, often producing disjointed turns that fail to maintain consistent acoustic environments or affective continuity. SwanVoice treats dialogue as a single generation problem, using a flow-matching Diffusion Transformer (DiT) conditioned on speaker-turn IDs and pause-aware text tokens. On the SwanBench-Speech benchmark, the model outperforms open-source baselines in expressive richness and hierarchical coherence for both monologue and multi-speaker dialogue.

Paper Primer

The core mechanism is a three-stage curriculum that transitions from monologue pretraining to mixed-speaker conversational data, followed by post-training with DiffusionNFT rewards. This approach uses a 25 Hz VAE to compress speech latents and a flow-matching DiT that estimates the velocity field between noise and the target latent, conditioned on speaker-turn embeddings.

SwanVoice achieves superior expressive richness and hierarchy in long-form dialogue compared to existing open-source models.

Evaluation on SwanBench-Speech shows gains of 0.53 and 0.56 points in richness and hierarchy, respectively, over the strongest evaluated baselines, with SwanVoice scoring 3.62 (Richness) and 3.71 (Hierarchy).

The data pipeline, SwanData-Speech, is critical to this performance. It uses a custom forced aligner to replace semantic punctuation with pause-aware tokens, ensuring the model learns natural prosody and turn-boundary behavior rather than relying on written-style punctuation.

Why is a dedicated forced aligner necessary for this system?

ASR-generated punctuation is optimized for readability, not acoustic pause structure. Using it for training leads to inconsistent pause control, whereas the Swan Forced Aligner grounds textual units in actual acoustic pauses to improve prosodic coherence.

How does the model handle the transition between different speakers in a single dialogue?

The model uses speaker-turn labels wrapped around text content and a flow-matching DiT that conditions on these turn-specific embeddings, allowing it to maintain speaker identity and acoustic scene consistency across turns.

Introduction and Motivation

Long‑form zero‑shot TTS struggles with multi‑speaker dialogue, prompting a new data pipeline and flow‑matching model.

Zero‑shot TTS has become reliable for single‑speaker narration, yet extending it to expressive long‑form multi‑speaker dialogue remains difficult. The naïve fix—synthesising each turn with a monologue model and stitching the results—adds inference cost and often breaks acoustic consistency, conversational coherence, and affective continuity across turns. This section frames the gap that SwanVoice closes by replacing manual phoneme front‑ends with a robust forced‑alignment pipeline and by training a flow‑matching model with online reinforcement learning.

The pipeline turns raw, in‑the‑wild recordings into clean monologue and dialogue training sets by aligning words, separating speakers, and filtering for expressive quality.

30 s ÷ 0.375 s per word = 80 words → attention matrix size 80 × 80.

80 × 80 × 4 B = 25 600 B ≈ 25 KB of memory for the toy case.

10 min ÷ 0.375 s per word ≈ 1 600 words → matrix size 1 600 × 1 600.

1 600 × 1 600 × 4 B ≈ 10 MB, a 400× increase for a 20× longer recording, and that is for a single attention map in a single layer.

This scaling shows why end‑to‑end flow‑matching, which avoids materialising the full attention map, is essential for expressive long‑form synthesis.
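The arithmetic above can be reproduced in a few lines. This is a minimal sketch: the function name `attention_map_bytes` and its defaults (0.375 s per word, 4-byte entries) simply mirror the toy numbers and are not taken from the SwanVoice implementation.

```python
def attention_map_bytes(duration_s, sec_per_word=0.375, bytes_per_entry=4):
    """Return (num_words, bytes) for a dense words-by-words attention map."""
    num_words = round(duration_s / sec_per_word)
    return num_words, num_words * num_words * bytes_per_entry


toy = attention_map_bytes(30)         # 30 s clip
long_form = attention_map_bytes(600)  # 10 min recording
print(toy, long_form)
print("growth factor:", long_form[1] // toy[1])
```

The quadratic term is what matters: a 20× longer input costs 400× the memory for one dense map.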

SwanVoice builds on this data: a 25 Hz VAE compresses the speech sequence, raw text is enriched with pause symbols and pinyin variants for Chinese, and a flow‑matching DiT conditioned on speaker‑turn IDs generates the speech latents that the VAE decoder turns back into a waveform. Training proceeds from monologue data to mixed and real dialogue, followed by DiffusionNFT post‑training with rewards that penalise pronunciation errors and encourage speaker similarity.

The shift from G2P‑dependent pipelines to an end‑to‑end flow‑matching model eliminates brittle phoneme front‑ends and enables expressive, speaker‑consistent long‑form synthesis.

Data Processing Pipeline

How the SwanData‑Speech pipeline turns raw recordings into speaker‑aware training corpora.

Raw audio—2.59 M h from internal and public corpora—contains long recordings, multiple speakers, and noisy environments, which makes direct use for TTS training infeasible.

The pipeline first strips away background noise, then splits each long recording into speaker‑ordered chunks, and finally routes monologue and dialogue streams through parallel transcription‑and‑filtering tracks.

Speech enhancement removes background chatter from all three recordings.

Diarization splits recording 2 into six speaker‑ordered utterances (A, B, C, A, B, C) and recording 3 into four short segments (Speaker X, silence, Speaker Y, silence).

Segments shorter than 0.1 s are dropped; the remaining utterances are merged respecting the 2‑s silence rule, yielding monologue chunks (≤ 60 s) and dialogue chunks (2–4 speakers, ≤ 120 s).

Monologue chunks go through ASR → punctuation → quality filter → expressiveness filter, producing the “One Speaker Corpus”.

Dialogue chunks follow the same chain but retain multiple speaker labels, producing the “Dialogue Corpus”.

The hierarchical split lets us apply stricter quality thresholds to dialogue (where speaker turn‑taking errors are more harmful) while keeping monologue data abundant for single‑speaker TTS.
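The drop, merge, and route steps above can be sketched as follows. This is an illustration under stated assumptions, not the SwanData‑Speech code: the `Segment` type, the greedy merge, and the reading of the "2‑s silence rule" as "merge neighbours separated by at most 2 s of silence" are all hypothetical. Only the thresholds (0.1 s, 2 s, 60 s, 120 s, 2–4 speakers) come from the text.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def build_chunks(segments, min_len=0.1, max_gap=2.0, max_chunk=120.0):
    """Drop very short segments, then greedily merge neighbours into chunks."""
    kept = sorted((s for s in segments if s.end - s.start >= min_len),
                  key=lambda s: s.start)
    chunks = []
    for seg in kept:
        if chunks:
            last = chunks[-1]
            gap = seg.start - last[-1].end    # silence before this segment
            span = seg.end - last[0].start    # chunk length if we merge
            if gap <= max_gap and span <= max_chunk:
                last.append(seg)
                continue
        chunks.append([seg])
    return chunks


def route(chunk, mono_max=60.0, dialogue_max=120.0):
    """Return 'monologue', 'dialogue', or None (rejected) for one chunk."""
    speakers = {s.speaker for s in chunk}
    span = chunk[-1].end - chunk[0].start
    if len(speakers) == 1 and span <= mono_max:
        return "monologue"
    if 2 <= len(speakers) <= 4 and span <= dialogue_max:
        return "dialogue"
    return None


segments = [
    Segment("A", 0.0, 3.0),
    Segment("B", 3.5, 6.0),
    Segment("A", 6.02, 6.05),  # 0.03 s long: dropped as too short
    Segment("C", 10.0, 12.0),  # 4 s after B ends: starts a new chunk
]
chunks = build_chunks(segments)
routes = [route(c) for c in chunks]
print(routes)
```

Each routed chunk would then enter its own transcription and filtering track, which is where the separate quality thresholds apply.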

**Figure 1.** Hierarchical data processing pipeline

Beyond raw processing, the pipeline also generates hard‑case synthetic speech (RobustMegaTTS3) using an LLM to cover rare pronunciations and code‑switching scenarios, ensuring the downstream TTS model sees the full linguistic diversity of the target domains.

How does this pipeline differ from a naïve “enhance → ASR → filter” workflow?

The key difference is the speaker‑aware branching: after diarization, monologue and dialogue streams are treated separately, allowing length caps, speaker‑count constraints, and distinct quality filters that a flat pipeline would apply uniformly, which would either discard useful dialogue or retain noisy monologue segments.

Data Filtering and Transcription

We align transcription pauses to insert punctuation and filter audio for high-quality training.

Prosody in synthesized dialogue collapses when punctuation in the transcript does not line up with actual acoustic pauses, especially at speaker turn boundaries.

The forced aligner timestamps each character; punctuation symbols are then inserted so that they directly reflect the measured pause lengths.

Gap between “i” (0.05 s) and the space (0.07 s) = 0.02 s → ignored.

Gap between the space after “you” (1.05 s) and end‑of‑utterance is > 0.45 s → insert a period.

Gap between “you” (1.05 s) and the preceding space (0.72 s) = 0.33 s → insert a comma.

Resulting punctuation: “hi how are you,.” (comma followed by period).

The short‑pause token <|sp|> preserves natural breath groups, while the comma/period insertion aligns textual breaks with audible pauses, directly fixing the prosody mismatch.
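A pause‑to‑symbol mapping of this kind might look like the sketch below. The 0.45 s period threshold and the 0.33 s comma case come from the example above. The 0.3 s comma cutoff, the 0.1 s cutoff for `<|sp|>`, the word‑level (rather than character‑level) timestamps, and the function names are assumptions made purely for illustration.

```python
def pause_symbol(gap_s, sp_min=0.1, comma_min=0.3, period_min=0.45):
    """Map a measured silence (seconds) to a pause symbol, or '' if negligible."""
    if gap_s > period_min:
        return "."
    if gap_s >= comma_min:
        return ","
    if gap_s >= sp_min:
        return "<|sp|>"
    return ""


def annotate(words, utterance_end):
    """words: list of (text, start_s, end_s) tuples from a forced aligner."""
    out = []
    for i, (text, _start, end) in enumerate(words):
        next_start = words[i + 1][1] if i + 1 < len(words) else utterance_end
        symbol = pause_symbol(next_start - end)
        if symbol in {",", "."}:
            out.append(text + symbol)     # punctuation attaches to the word
        elif symbol:
            out.extend([text, symbol])    # short-pause token stands alone
        else:
            out.append(text)
    return " ".join(out)


# Hypothetical timestamps, not the ones from the example above.
words = [("hi", 0.00, 0.20), ("how", 0.22, 0.40),
         ("are", 0.55, 0.70), ("you", 1.03, 1.30)]
annotated = annotate(words, utterance_end=1.80)
print(annotated)
```

Because every symbol is derived from a measured gap, the same sentence read with different phrasing yields different annotations, which is the property syntactic punctuation lacks.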

How does this differ from a standard text‑normalization step that simply inserts commas after conjunctions?

Standard text‑normalization relies on syntactic cues and ignores the actual timing of speech; our method bases punctuation on measured silence durations, so the inserted commas and periods reflect real breathing patterns rather than grammatical guesses.

Each audio sample is scored by non‑intrusive perceptual metrics; low‑scoring samples are discarded, and high‑confidence emotional samples are earmarked for expressive training.

Clip B fails the DNSMOS threshold (set at 2.5) and is discarded.

Clip A passes all quality thresholds and is kept for the base training set.

Clip C also passes quality thresholds and, because its emotion confidence exceeds 0.9, it is added to the high‑expressiveness subset.

The pipeline removes acoustically poor samples while automatically curating a diverse expressive subset, which speeds up later model convergence on expressive speech.
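The triage of clips A, B, and C can be sketched as a simple rule. The thresholds (DNSMOS 2.5, emotion confidence 0.9) are the ones quoted above; the clip scores, the dictionary layout, and the use of DNSMOS as the only quality metric are simplifying assumptions.

```python
def triage(clip, dnsmos_min=2.5, emotion_conf_min=0.9):
    """Return the list of training subsets a clip belongs to (empty = discarded)."""
    subsets = []
    if clip["dnsmos"] < dnsmos_min:
        return subsets                    # fails the quality gate
    subsets.append("base")
    if clip["emotion_conf"] > emotion_conf_min:
        subsets.append("expressive")      # high-confidence emotional sample
    return subsets


# Hypothetical scores chosen to reproduce the three outcomes described above.
clips = {
    "A": {"dnsmos": 3.4, "emotion_conf": 0.60},
    "B": {"dnsmos": 2.1, "emotion_conf": 0.95},
    "C": {"dnsmos": 3.6, "emotion_conf": 0.93},
}
assignment = {name: triage(scores) for name, scores in clips.items()}
print(assignment)
```

Note that clip B is discarded despite its high emotion confidence: the quality gate runs first, so expressive but noisy audio never reaches the expressive subset.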

Why not simply keep all samples and let the model learn to ignore low‑quality audio?

Training on noisy or unintelligible audio corrupts the latent space and forces the model to allocate capacity to modeling artifacts, which degrades overall synthesis quality and slows convergence.

The encoder compresses a waveform into a low‑dimensional latent vector, and the decoder reconstructs the waveform, enabling controllable generation and efficient alignment.

How is this VAE different from a plain autoencoder that directly maps waveform to waveform?

Because the VAE samples $z$ from a distribution, it learns a smooth latent space that supports stochastic generation and interpolation, whereas a deterministic autoencoder has no prior to sample from and offers no guarantee that points between training encodings decode to plausible speech.
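Two small pieces make the VAE description concrete: how many latent frames a clip occupies at 25 Hz, and how a latent is sampled. The sketch assumes a diagonal Gaussian posterior with the standard reparameterization trick, which is the usual VAE construction but is not confirmed by the text; the function names are hypothetical.

```python
import math
import random


def latent_frames(duration_s, frame_rate_hz=25):
    """Number of latent frames for a clip at the given latent frame rate."""
    return math.ceil(duration_s * frame_rate_hz)


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


rng = random.Random(0)
print(latent_frames(10))                          # frames for a 10 s clip
print(reparameterize([0.0, 1.0], [0.0, 0.0], rng))
```

As the variance shrinks toward zero the sample collapses onto the mean, which recovers the deterministic autoencoder as a limiting case.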

Model Architecture and Tokenization

This section details the end-to-end tokenization pipeline and the flow-based Transformer architecture powering SwanVoice.

SwanVoice replaces manual phoneme frontends with an end-to-end approach, simplifying preprocessing while improving pronunciation robustness. The core mechanism relies on a flow-based Transformer that integrates text, speaker identity, and latent audio features to synthesize speech.

The tokenizer maps raw text directly to model inputs using Byte Pair Encoding (BPE), eliminating the need for a separate grapheme-to-phoneme (G2P) frontend. This allows the model to learn context-dependent pronunciations end-to-end.

**Figure 2.** Overall training and inference procedure of SwanVoice.

The model estimates the vector field over a latent trajectory, transforming Gaussian noise into a clean target latent. By processing text and speaker conditions through a lightweight Transformer stack before interacting with speech latents, the model improves in-context conditioning.
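Sampling from a flow‑matching model amounts to integrating the learned velocity field from noise to data. The sketch below uses plain Euler steps and a toy closed‑form velocity in place of the DiT. The convention (noise at $t=0$, data at $t=1$, straight‑line path) is a common choice and is assumed here, not taken from the paper; `euler_sample` and `make_toy_velocity` are hypothetical names.

```python
import random


def euler_sample(velocity, x0, num_steps=8):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = x0, 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity(x, t)
    return x


def make_toy_velocity(target):
    """Exact velocity of the straight path x_t = (1 - t) * noise + t * target.

    Stands in for the trained network, which would also receive text and
    speaker-turn conditioning.
    """
    return lambda x, t: (target - x) / (1.0 - t)


noise = random.Random(0).gauss(0.0, 1.0)   # starting point: Gaussian noise
sample = euler_sample(make_toy_velocity(target=2.0), noise)
print(noise, "->", sample)
```

With the exact straight‑path velocity, Euler integration lands on the target regardless of the starting noise; a learned field only approximates this, which is why the number of steps affects quality in practice.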

Monologue pretraining: Train on 2 million hours of monologue speech to establish high-fidelity acoustic modeling and alignment.

Mixed conversational training: Introduce concatenated 2–4-speaker conversational data to learn speaker-turn assignment.

SFT training: Fine-tune on real conversational data to capture emotional coherence and recording-environment consistency.

Flow-GRPO Reward Mechanism

Defines the reward model and introduces Flow‑GRPO for online RL.

Deterministic ODE sampling yields low‑variance speech but offers no stochastic exploration, making online reinforcement learning impractical. Flow‑GRPO converts the ODE into an equivalent SDE to enable exploration, yet its denoising‑reduction strategy still incurs high training cost. DiffusionNFT sidesteps these issues by optimizing the forward flow directly with a simple reward‑driven RL loop.

Flow‑GRPO injects stochastic noise into the otherwise deterministic sampling ODE, turning the trajectory into a stochastic differential equation so the policy can explore alternative speech paths during training.

Step 0: start from latent $z_0=0$.

Step 1: apply drift $d=0.5$ and sample noise $\epsilon_1=0.08$, yielding $z_1 = z_0 + d + \epsilon_1 = 0.58$.

Step 2: apply drift $d=0.5$ and sample noise $\epsilon_2=-0.04$, yielding $z_2 = 0.58 + 0.5 - 0.04 = 1.04$.

Decode $z_2$ to a speech waveform $\hat{x}$ and compute its reward (e.g., $r=0.73$).

Update the policy parameters toward actions that increased $r$, using the single final reward.

Only the final sample matters for the RL signal, so the intermediate noise injections provide exploration without extra memory overhead.
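The two‑step rollout above can be replayed in code. This is a toy scalar sketch: the constant drift, the `rollout` helper, and the noise scale used for the sampled variants are illustrative assumptions, and a real system would decode the final latent and score it with a reward model.

```python
import random


def rollout(z0, drift, noises):
    """Apply z <- z + drift + eps for each noise value; return every state."""
    trajectory = [z0]
    for eps in noises:
        trajectory.append(trajectory[-1] + drift + eps)
    return trajectory


# Replay the worked example with its fixed noise draws.
trajectory = rollout(0.0, 0.5, [0.08, -0.04])
print([round(z, 2) for z in trajectory])

# Exploration: different noise draws give different final latents for one prompt.
finals = []
for seed in range(3):
    rng = random.Random(seed)
    noises = [rng.gauss(0.0, 0.05) for _ in range(2)]
    finals.append(rollout(0.0, 0.5, noises)[-1])
print(finals)
```

Setting every noise value to zero recovers the deterministic ODE path, which is exactly the case that offers no exploration.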

How does Flow‑GRPO differ from standard diffusion training that also adds noise?

Standard diffusion trains a denoiser to reconstruct every intermediate noisy state, requiring many forward passes per sample. Flow‑GRPO injects noise solely to create a stochastic trajectory and discards all intermediate latents, using only the final clean sample for reward evaluation, which dramatically reduces compute.

DiffusionNFT further simplifies the pipeline: it treats the forward flow as a policy, optimizes directly with the flow‑matching objective, and needs only the final clean sample and its scalar reward.

Policy Optimization Strategy

Policy optimization reshapes noisy rewards into stable, preference‑driven updates.

Standard online updates for flow‑matching suffer from high variance because each prompt yields many stochastic candidates. Without a principled way to turn those noisy rewards into a coherent learning signal, the policy drifts or collapses.

The method treats each sampled candidate as a diffusion step, computes a prompt‑wise advantage, and converts that advantage into a soft preference weight that steers the online policy toward the “positive” denoising direction when the weight is high and toward an implicit negative direction when it is low.

Mean reward $\bar{r}= (0.8+0.5+0.2)/3 = 0.5$.

Advantages $A = [0.3,\,0.0,\,-0.3]$.

Clipped advantages $\tilde{A}= [0.3,\,0.0,\,-0.3]$ (already within $[-0.5,0.5]$).

Weights $w = \operatorname{clip}(\tilde{A}/1.0,0,1) = [0.3,\,0.0,\,0.0]$.

For candidate 1 ($w=0.3$) the positive branch receives weight 0.3; for candidates 2‑3 ($w=0$) the positive branch is switched off and the negative branch dominates.

Even a modest reward gap yields a sharply different training signal because the weight mapping compresses the advantage range into $[0,1]$.
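The reward‑to‑weight mapping in the worked example can be written out directly. The sketch follows the numbers above (mean baseline, clip to $[-0.5, 0.5]$, scale 1.0, clip to $[0, 1]$); it is not claimed to be the exact mapping used in DiffusionNFT or SwanVoice, and `preference_weights` is a hypothetical name.

```python
def preference_weights(rewards, clip_range=0.5, scale=1.0):
    """Turn per-candidate rewards for one prompt into soft preference weights."""
    mean = sum(rewards) / len(rewards)
    advantages = [r - mean for r in rewards]           # prompt-wise baseline
    clipped = [max(-clip_range, min(clip_range, a)) for a in advantages]
    weights = [max(0.0, min(1.0, a / scale)) for a in clipped]
    return advantages, clipped, weights


advantages, clipped, weights = preference_weights([0.8, 0.5, 0.2])
print([round(a, 2) for a in advantages])
print([round(w, 2) for w in weights])
```

Subtracting the per‑prompt mean is what keeps the signal comparable across prompts of different difficulty: only a candidate's standing relative to its siblings affects its weight.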

How does this differ from a standard REINFORCE policy‑gradient update?

REINFORCE would multiply the raw reward (or advantage) by the gradient of the log‑policy, leading to high variance and no explicit bias toward “good” denoising directions. DiffusionNFT‑style optimization first normalizes advantages, clips them, and then blends a stable old policy into two deterministic branches, turning the stochastic reward signal into a low‑variance, direction‑aware loss.

For post‑training we gathered 3 K real‑world conversation recordings, corrected pause annotations, and optimized only phone‑level WER and speaker similarity. The resulting model inherits the pretrained speech quality while gaining better environment consistency and expressiveness.

The inference stage separates content and speaker/style guidance into two independent scales, allowing the user to increase reference influence without altering textual fidelity.
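One common way to implement two independent guidance scales is to chain classifier‑free‑guidance terms over three model evaluations. The decomposition below is an assumption for illustration; the text does not specify SwanVoice's exact formula, and `guided_velocity` and its argument names are hypothetical.

```python
def guided_velocity(v_uncond, v_text, v_full, text_scale, speaker_scale):
    """Combine three velocity estimates with separate guidance scales.

    v_uncond: no conditioning
    v_text:   text only
    v_full:   text plus speaker/style reference
    """
    return [
        u + text_scale * (t - u) + speaker_scale * (f - t)
        for u, t, f in zip(v_uncond, v_text, v_full)
    ]


v_uncond, v_text, v_full = [0.0, 1.0], [1.0, 2.0], [1.5, 4.0]
print(guided_velocity(v_uncond, v_text, v_full, 1.0, 1.0))  # equals v_full
print(guided_velocity(v_uncond, v_text, v_full, 2.0, 1.0))  # stronger text guidance
```

In this form, raising `speaker_scale` only amplifies the difference that the reference adds on top of the text‑conditioned estimate, so the text term is left unchanged.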

Experimental Setup and Results

SwanVoice swaps manual phoneme front‑ends for forced alignment and trains flow‑matching with online RL.

Recall that SwanVoice replaces a hand‑crafted phoneme frontend with a forced‑alignment data pipeline and optimizes its flow‑matching model via online reinforcement learning.

SwanVoice sets a new state of the art on expressive richness and hierarchy, beating every baseline in Table 1.

Table 1 shows SwanVoice achieving Richness 2.94 and Hierarchy 3.01, the highest values among the 11 evaluated models.

**Table 1.** Evaluation results of various TTS models across three dimensions: Acoustics, Semantics, and Expressiveness.

**Table 2.** Results of dialogue generation models across SwanBench-Speech metrics. The best and second-best results are marked in bold and underlined, respectively, for each metric.

SwanVoice delivers the strongest expressive performance on both monologue and dialogue TTS tasks while staying on par with baselines in acoustic and semantic quality.

Benchmark Evaluation

Key zero‑shot results show SwanVoice outperforms all open‑source baselines.

SwanBench‑Speech is a curated benchmark that measures how well a TTS system expresses emotion, controls vocal dynamics, and fits the spoken content to a given scene.

SwanVoice achieves higher Richness (3.81) and Hierarchy (3.62) than any open‑source baseline in zero‑shot monologue synthesis.

Table 1 shows SwanVoice’s Richness of 3.81 and Hierarchy of 3.62, beating the strongest baseline VibeVoice by 0.39 and 0.56 points respectively.

Across both monologue and dialogue settings, SwanVoice also leads on Hierarchy (3.71) and Prosody (3.70), while maintaining competitive Sound Fidelity and low Content Error. Baselines such as VibeVoice and ZipVoice‑Dialog lag behind by 0.3–0.6 points on these expressive dimensions, confirming the advantage of the forced‑alignment pipeline and online RL optimization.

The evaluation also highlights remaining gaps: SwanVoice’s Content Error (0.145) is slightly higher than the best baseline, and speaker‑turn conditioning occasionally fails when speakers sound similar.

Future work should target pronunciation control, finer‑grained alignment, and more robust speaker‑turn modeling to close these gaps.

Swan Forced Aligner

Introduces the Swan Forced Aligner, its topology, scoring, training, and inference.

The paper’s premise is that zero‑shot TTS needs reliable word‑level timing; the Swan Forced Aligner supplies that timing by grounding text directly in the audio signal.

ASR transcripts carry punctuation that reflects readability, not the actual pauses in speech. When such transcripts supervise a TTS model, the model learns mismatched pause cues, leading to erratic prosody.

The Swan Forced Aligner solves this mismatch by aligning a given transcript to the waveform with an explicit word‑blank lattice, producing accurate start/end times and confidence scores for each word.

It treats alignment as a structured path through alternating word and blank states, letting the model decide where pauses belong while staying monotonic with the transcript.

Compute unary similarities $u_{t,s}$ for each frame $t\in\{1,2,3,4\}$ against the five states $S=\{b_0,w_1,b_1,w_2,b_2\}$ (e.g., $u_{1,b_0}=0.9$, $u_{2,w_1}=0.8$, …).

Apply per‑frame canonicalization to obtain $\tilde{u}_{t,s}$ (subtract frame‑wise mean, divide by std).

Compute transition scores $\tau(s,r)$ for stay, adv₁, adv₂ (e.g., $\tau(w_1,\text{adv}_2)=0.6$ for skipping $b_1$).

Scale with gains $\gamma_u=1.2$, $\gamma_\tau=0.9$ to get $u^{*}_{t,s}$ and $\tau^{*}(s,r)$.

Run Viterbi decoding on the lattice; the highest‑scoring monotonic path assigns frames 1–4 to $z^{*}= (b_0, w_1, b_1, w_2)$, leaving the trailing blank $b_2$ without a frame.

Recover word intervals: $w_1$ occupies frames 2–2 (start = frame 2, end = frame 2), $w_2$ occupies frames 4–4.

This toy example shows how the interleaved topology allows at most one blank segment between consecutive words (adv₂ skips the blank when there is no pause), and how calibrated scores make the Viterbi path robust to small score variations.
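The decoding step can be sketched as a small Viterbi pass over the interleaved lattice. States are indexed 0..4 for $(b_0, w_1, b_1, w_2, b_2)$. The unary scores, the transition scores, and the start/end rules are invented for illustration, and canonicalization and gain scaling are omitted; this is not the Swan Forced Aligner's actual scoring.

```python
NEG_INF = float("-inf")


def viterbi_word_blank(unary, stay=0.0, adv1=0.0, adv2=-0.5):
    """Best monotonic path over states (b0, w1, b1, w2, ...), one per frame.

    Even indices are blanks, odd indices are words. adv2 jumps from one word
    directly to the next word, skipping the blank between them.
    """
    n_frames, n_states = len(unary), len(unary[0])
    score = [[NEG_INF] * n_states for _ in range(n_frames)]
    back = [[-1] * n_states for _ in range(n_frames)]
    score[0][0] = unary[0][0]           # start in the leading blank ...
    score[0][1] = unary[0][1]           # ... or directly in the first word
    for t in range(1, n_frames):
        for s in range(n_states):
            candidates = [(score[t - 1][s] + stay, s)]
            if s >= 1:
                candidates.append((score[t - 1][s - 1] + adv1, s - 1))
            if s >= 3 and s % 2 == 1:   # word -> next word
                candidates.append((score[t - 1][s - 2] + adv2, s - 2))
            best, prev = max(candidates)
            if best > NEG_INF:
                score[t][s] = best + unary[t][s]
                back[t][s] = prev
    # The path must end in the last word or the trailing blank.
    end = max((n_states - 1, n_states - 2), key=lambda s: score[-1][s])
    path = [end]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


def word_intervals(path):
    """Map word number -> (first_frame, last_frame), frames counted from 1."""
    intervals = {}
    for frame, state in enumerate(path, start=1):
        if state % 2 == 1:
            word = (state + 1) // 2
            first, _ = intervals.get(word, (frame, frame))
            intervals[word] = (first, frame)
    return intervals


unary = [  # rows: frames 1-4, columns: b0, w1, b1, w2, b2 (made-up scores)
    [0.9, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.8, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.7, 0.2, 0.0],
    [0.0, 0.0, 0.1, 0.8, 0.1],
]
path = viterbi_word_blank(unary)
intervals = word_intervals(path)
print(path, intervals)
```

With these scores the decoder keeps the blank between the two words because frame 3 favours $b_1$; lowering that score or raising the adv₂ score would let the path jump straight from $w_1$ to $w_2$.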

How does the interleaved word–blank topology differ from the blank token used in standard CTC?

CTC treats blanks as a single global symbol that can appear anywhere, which makes it impossible to model variable‑length pauses between specific words. The Swan Forced Aligner introduces a distinct blank state for each gap (including start and end) and conditions internal blanks on the neighboring word embeddings, so the model can represent short coarticulations, long pauses, and phrase‑level boundaries explicitly.

**Figure 3.** Overview of Swan Forced Aligner.

Training combines frame‑level cross‑entropy, a CRF loss over the whole lattice, duration supervision for words and blanks, and a monotonicity regularizer; all terms are summed with unit weight unless otherwise noted.

At inference the model first computes calibrated unary and transition scores, then either runs Viterbi to obtain the highest‑scoring path or performs forward–backward to get posterior scores and decodes a path constrained by the same topology; word‑level confidences are averaged over the assigned frames.

Additional Experimental Analysis

Ablation results show Swan Forced Aligner’s AAS improvements across Chinese and English benchmarks.

We evaluate the Swan Forced Aligner on three public alignment benchmarks using a dedicated 80 K‑hour Chinese‑English training set drawn from audiobooks, podcasts, meetings and live streams.

The model builds on a pretrained WavLM encoder and two bidirectional Transformers (4‑layer text‑side, 16‑layer audio‑side) totaling roughly 400 M parameters, trained on 24 A100 GPUs with a batch of 4 hours for 80 K steps.

We compare against five mainstream forced aligners: Monotonic‑Aligner (non‑autoregressive, Chinese‑only), NeMo Forced Aligner (CTC‑based ASR), WhisperX (VAD + phoneme alignment), Qwen3 Forced Aligner (parallel slot‑filling), and the proprietary LattifAI Aligner.

**Table 3.** Comparison of AAS (ms) on different datasets.

Swan Forced Aligner reduces AAS by 16.79 ms relative to Monotonic‑Aligner on the Chinese GTSinger‑Speech benchmark.

Table 3 shows 45.19 ms for Swan versus 61.98 ms for Monotonic‑Aligner.

On LibriSpeech‑Clean, Swan Forced Aligner improves AAS by 59.38 ms over the NeMo baseline.

Table 3 lists 27.67 ms for Swan versus 87.05 ms for NeMo.

For LibriSpeech‑Others, Swan Forced Aligner is within 0.18 ms of the Qwen3 Forced Aligner.

Table 3 records 29.92 ms for Swan and 29.74 ms for Qwen3.

On the same dataset, Swan Forced Aligner outperforms the proprietary LattifAI Aligner by 6.08 ms.

Table 3 shows 29.92 ms for Swan versus 36.00 ms for LattifAI; lower is better, so the 6.08 ms gap is in Swan's favour.
