Lip Forcing: Few-Step Autoregressive Diffusion for Real-Time Lip Synchronization

Paul Hyunbin Cho, Jinhyuk Jang, SeokYoung Lee, Joungbin Lee, Siyoon Jin, Heeseong Shin, Jung Yi, Yunjin Park, Chulmin Park, Seungryong Kim

Lip Forcing distills bidirectional diffusion teachers into two-step causal students for real-time lip synchronization.

How can we distill high-fidelity, multi-step diffusion models into a two-step causal student for real-time, streaming lip synchronization?

Diffusion-based lip synchronization models produce high-quality results, but their reliance on full-sequence bidirectional attention and dozens of denoising steps makes them too slow for real-time streaming. Lip Forcing distills these heavy bidirectional teachers into causal, two-step students by using a trajectory-analysis-derived recipe that gates guidance and rewards synchronization. This approach enables real-time performance at 31 FPS for a 1.3B model, while the 14B student achieves the best visual fidelity in its class at nearly 40× the speed of its teacher.

Paper Primer

Lip Forcing is a distillation framework that compresses a 50-step bidirectional teacher into a two-step streaming student. It hinges on a trajectory-level analysis of the teacher, which reveals that different denoising timesteps respond differently to audio conditioning: early steps favor reference fidelity, while mid-trajectory steps drive lip articulation.

The method implements this insight through three components: Sync-Window DMD (SW-DMD), which applies classifier-free guidance (CFG) only within the sync-favoring mid-trajectory band; a two-step inference schedule whose second step lands at an analysis-derived point; and a SyncNet-based reward that explicitly penalizes lip-sync errors. The student acts as a mail sorter: it skips the redundant bidirectional context and uses the analysis-derived window to route audio-visual information into the final output in just two steps.

Lip Forcing enables real-time streaming lip synchronization with low latency.

The 1.3B student reaches 31 FPS on an NVIDIA H100, crossing the 25 FPS real-time threshold, while the 14B student runs 39.8× faster than its teacher. Overall, the reported speedups over bidirectional baselines range from 17.6× to 39.8×.

The method achieves superior visual fidelity compared to existing diffusion baselines.

The 14B student achieves the best FVD (Fréchet Video Distance) among all tested models, outperforming LatentSync and other multi-step diffusion baselines. Lip Forcing is also reported as 4.7× faster than LatentSync at comparable reference fidelity.

Why is a two-step inference schedule sufficient for lip synchronization?

Trajectory analysis shows that the teacher model does not require dense traversal for coarse structure; a two-step schedule captures the necessary lip-sync articulation when the second step is strategically placed at the mid-trajectory landing point identified by the analysis.

How does this approach differ from standard distillation methods?

Standard distillation often uses fixed guidance scales, which forces a suboptimal compromise between identity preservation and lip movement. Lip Forcing uses a windowed guidance schedule (SW-DMD) to apply CFG only where it improves sync, while keeping other steps CFG-free to preserve reference fidelity.

Lip Forcing demonstrates that trajectory-aware diagnostics can replace generic distillation recipes, allowing high-fidelity diffusion models to be deployed in latency-sensitive applications like live translation and interactive agents.

Introduction

Diffusion lip‑sync models are too slow; Lip Forcing enables real‑time streaming.

Diffusion‑based lip‑sync models achieve high visual quality but require full‑sequence bidirectional attention and dozens of denoising steps, making them impractical for real‑time inference.

Lip Forcing compresses a high‑fidelity bidirectional diffusion teacher into a causal student that generates video chunks in just two denoising steps, eliminating inference‑time CFG.

How does Lip Forcing differ from standard few‑step diffusion distillation?

Standard distillation simply reduces the number of denoising steps, which often collapses the audio‑visual alignment because the model no longer has enough intermediate guidance. Lip Forcing, by contrast, inserts a sync‑window where CFG is active and schedules the two inference steps to land exactly at the trajectory point that maximises synchronization, preserving both speed and fidelity.

**Figure 1.** **Lip Forcing.** A streaming model for real-time lip synchronization that produces photorealistic, accurately lip-synced video at up to 31 FPS with low latency and memory. *Right:* both student scales lie on the throughput–FVD Pareto frontier, ahead of prior diffusion lip-sync methods.

The latency bottleneck in diffusion‑based lip‑sync stems from full‑sequence attention and many denoising steps; Lip Forcing removes this bottleneck with a two‑step causal pipeline.

Related Work

Introduces prior lip‑sync models and the deterministic rectified‑flow foundation for our approach.

This section surveys prior lip‑sync models and introduces the deterministic rectified‑flow framework that underpins our approach.

Rectified flow is a deterministic transport that moves a noisy sample toward the data point along a straight line, making sampling a simple velocity integration.
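The straight-line transport can be sketched in a few lines. This is a minimal illustration, assuming the interpolation convention used later in this summary, $x_\tau=(1-\tau)x_0+\tau\epsilon$, so that $\tau=1$ is pure noise and the velocity along the path is the constant $\epsilon-x_0$. The function names are illustrative, and the oracle velocity stands in for a learned network.

```python
def interpolate(x0, eps, tau):
    """Point on the straight path between data x0 (tau=0) and noise eps (tau=1)."""
    return [(1.0 - tau) * a + tau * b for a, b in zip(x0, eps)]


def euler_sample(x_start, velocity_fn, taus):
    """Integrate dx/dtau = v(x, tau) along a decreasing list of noise levels."""
    x = list(x_start)
    for tau, tau_next in zip(taus[:-1], taus[1:]):
        v = velocity_fn(x, tau)
        x = [xi + (tau_next - tau) * vi for xi, vi in zip(x, v)]
    return x


# Toy check with an oracle velocity (a trained model would predict this instead).
x0 = [0.5, -1.0, 2.0]
eps = [1.5, 0.25, -0.75]
oracle = lambda x, tau: [e - d for d, e in zip(x0, eps)]
recovered = euler_sample(interpolate(x0, eps, 1.0), oracle, [1.0, 0.5, 0.0])
```

Because the oracle velocity is constant along the path, the Euler steps land back on `x0` regardless of how many steps are taken. A learned velocity is only approximately constant, which is why the number and placement of steps matters.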

Self Forcing: during training the student feeds its own predictions back as context, aligning training and inference dynamics.

A GAN‑based audio‑driven lip‑sync system that takes a static face image and an audio waveform to generate a talking video.

A video‑to‑video approach that edits an existing video to match a target audio while preserving identity, pose, and background.

A multimodal talking‑head system that combines audio, text, and facial expression cues to drive a 3D avatar.

The Lip Forcing Framework

Method details the two‑step causal student, schedule, and reward that enable real‑time lip sync.

Diffusion‑based lip‑sync models require dozens of denoising steps, making real‑time streaming infeasible. The method therefore isolates the minimal trajectory portions that matter for timing and fidelity, and compresses them into a two‑step causal student.

By inspecting the teacher’s full denoising path we discover where guidance helps sync and where it harms visual fidelity.

How does the teacher’s trajectory guide the student’s schedule?

The analysis tells us that early steps already give a usable mouth shape, so the student can skip most intermediate denoising. The middle band (steps 20‑40) is where CFG is most beneficial for sync, so we place the second student call there to recover the missing articulation.

Apply CFG only during the middle part of the denoising trajectory where it boosts sync, and keep the rest CFG‑free to preserve visual fidelity.

Step 0 (pure noise) → no‑CFG ($s=1.0$) → coarse face.

Step 3 enters the window → CFG ($s=4.5$) → sync improves.

Step 5 exits the window → back to no‑CFG, preserving the fidelity gained earlier.

Step 9 reaches the final clean frame.

The window concentrates guidance where it matters most, avoiding the fidelity loss that a constant high CFG scale would cause.

Why not apply CFG at every step if it improves sync?

Constant high CFG amplifies the teacher’s prior, which blurs fine visual details (higher LPIPS). By restricting CFG to the sync‑favoring window we keep the early‑step fidelity and still gain the sync boost where it is most effective.
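The windowed guidance rule can be sketched as a lookup on the teacher's step index. This is a minimal sketch, not the authors' code: the window $[20,40]$ and the scales $4.5$ and $1.0$ are taken from elsewhere in this summary, while the inclusive bounds, the function names, and the standard CFG combination $v_u + s\,(v_c - v_u)$ are assumptions.

```python
def sync_window_scale(j, window=(20, 40), s_on=4.5, s_off=1.0):
    """Guidance scale for teacher step j: CFG inside the window, none outside."""
    lo, hi = window
    return s_on if lo <= j <= hi else s_off


def guided_velocity(v_cond, v_uncond, scale):
    """Standard classifier-free guidance mix; scale 1.0 returns v_cond unchanged."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


# Placeholder velocities standing in for the teacher's two forward passes.
v_cond, v_uncond = [1.0, 2.0], [0.0, 1.0]
early = guided_velocity(v_cond, v_uncond, sync_window_scale(5))   # outside window
mid = guided_velocity(v_cond, v_uncond, sync_window_scale(30))    # inside window
```

With a scale of 1.0 the unconditional branch cancels out, so steps outside the window need only the conditional pass.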

The causal student denoises a chunk twice: first from pure noise, then from a mid‑trajectory landing that balances fidelity and sync.

Why is step 30 chosen as the landing point?

Step 30 lies in the middle of the sync‑favoring band identified in the trajectory analysis, giving the best compromise: enough denoising to retain visual fidelity while still leaving a manageable sync gap that the reward can close.

Weight the DMD loss by a sync confidence score so the student is explicitly encouraged to align mouth movements with the audio.

Does the reward change the teacher’s guidance?

No. The reward is applied after the DMD loss as a multiplicative factor with stop‑gradient, so only the student’s parameters are updated; the teacher, CFG schedule, and SyncNet remain frozen.

Sample a supervision call index $j^*\in\{0,30\}$.

For each chunk $i=1\ldots N$, initialise $x_{\tau_0}^i\sim\mathcal{N}(0,I)$.

Iterate over the two student calls $j\in J_{\text{LF}}$ in order.

If $j=j^*$, enable gradients, generate $\hat{x}_0^i=G_{\theta}(x_{\tau_j}^i;\tau_j,\text{KV},c_i)$, cache the clean latent’s KV via $G_{\theta}^{KV}$, and store $\hat{x}_0^i$.

Otherwise, run the student without gradients, add Gaussian noise $\epsilon$, and compute the next noisy state $x_{\tau_{j'}}^i=(1-\tau_{j'})\hat{x}_0^i+\tau_{j'}\epsilon$.

After processing all chunks, re‑noise the collected clean latents for DMD, look up the sync‑window CFG scale $s_{\text{SW}}$, compute teacher and fake scores, and form the reward weight $w$.

Update the student parameters with $w\cdot\nabla_{\theta}\mathcal{L}_{\text{DMD}}$.
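The control flow of the rollout in the steps above can be sketched for a single chunk. This is a toy illustration under several assumptions: `toy_student` is a placeholder for $G_\theta$, the supervised call is identified by its position `k_star` in the schedule rather than by the step index $j^*$, the noise levels are assumed values, and the KV cache, DMD scores, and reward weight are omitted.

```python
import random


def rollout_chunk(student, taus, k_star, rng, dim=4):
    """Run the few-step schedule for one chunk, stopping at the supervised call.

    taus   -- noise levels of the student calls, e.g. [tau_0, tau_30]
    k_star -- position in `taus` of the call that would receive gradients
    Returns the clean prediction and a log of (tau, supervised) pairs.
    """
    if not 0 <= k_star < len(taus):
        raise ValueError("k_star must index into taus")
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from pure noise
    log = []
    for k, tau in enumerate(taus):
        supervised = k == k_star
        x0_hat = student(x, tau)  # a real loop enables autograd only when supervised
        log.append((tau, supervised))
        if supervised:
            return x0_hat, log
        # Unsupervised call: re-noise the clean estimate to the next noise level.
        tau_next = taus[k + 1]
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x = [(1.0 - tau_next) * a + tau_next * e for a, e in zip(x0_hat, eps)]


def toy_student(x, tau):
    """Placeholder for G_theta: shrinks the input instead of denoising it."""
    return [(1.0 - tau) * xi for xi in x]


rng = random.Random(0)
taus = [1.0, 0.769]                 # assumed values for tau_0 and tau_30
k_star = rng.randrange(len(taus))   # uniform choice of the supervised call
x0_hat, log = rollout_chunk(toy_student, taus, k_star, rng)
```

Only one call per chunk is supervised, so backpropagation through the whole rollout is avoided.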

**Figure 2.** **Trajectory analysis of the 14B teacher.** Bands are $\pm 1$ SE. (a) **CFG fidelity–sync tradeoff:** CFG ($s=4.5$, red) improves Sync-C but worsens reference fidelity (LPIPS), while no-CFG ($s=1.0$, navy) shows the opposite trend. (b) **Euler-step $2 \times 2$ factorial** over schedules ($s_0, s_1$), plotted against the second-step landing $j_1$: mixed schedules recover most of the sync gap of the CFG-guided ceiling at landings near step 30. Full 4-metric versions in App. C.2.

**Figure 3.** Why few-step distillation needs trajectory-level care. Two HDTF [52] samples, each showing the 1-step prediction from pure noise, 50-step ODE final output, and ground truth, respectively. Even a one-step prediction preserves coarse facial structure and approximate mouth timing, but it loses the fine articulation and audio-visual synchronization recovered by the full 50-step teacher. Lip Forcing compresses this gap with a two-step student via the trajectory analysis of Sec. 4.2.

**Figure 5.** **Architecture of Lip Forcing.** The causal student denoises Gaussian noise with lip-sync conditions, producing a chunk-wise causal rollout via the two-step schedule (Sec. 4.4). The clean prediction $\hat{x}_0$ is supervised by the DMD [48, 47] gradient (Eq. 4) between a frozen 14B teacher and a trainable fake-score critic, with the teacher's CFG gated by the windowed schedule $s_{SW}$ of Eq. 6. The same $\hat{x}_0$ is decoded by the frozen Tiny AutoEncoder (TAE) [2] and scored by frozen SyncNet [6] against the conditioning audio to form the reward weight $\exp(\beta R)$ on the generator gradient (Eq. 7).

Experimental Results

Lip Forcing delivers real‑time quality while beating prior diffusion baselines on fidelity.

We evaluate Lip Forcing on the HDTF benchmark, measuring visual fidelity, identity preservation, lip‑sync quality, and streaming performance.

Table 1 shows that Lip Forcing (14 B) runs 39.8× faster than the same-scale OmniAvatar-LS teacher while preserving high visual fidelity.

Table 2 isolates the contribution of the windowed CFG schedule and the SyncNet reward. Switching from a static CFG (4.5) to the windowed schedule improves FVD while incurring a modest Sync‑C drop; adding the reward restores Sync‑C without harming FVD.

Table 4 explores step‑count variants under the windowed CFG schedule (no reward). The 4‑step configuration yields the best FVD (118.86) at the cost of additional inference steps, while the 2‑step Lip Forcing (j₁ = 30) closes most of the gap with half the compute.

Table 5 sweeps the second‑step landing index j₁. Early landings (j₁ = 13) are sub‑optimal on both sync and fidelity; j₁ = 25 maximizes Sync‑C at a slight FVD increase, while j₁ = 37 minimizes FVD but hurts sync. The chosen j₁ = 30 offers a balanced trade‑off.

The user study (Table 6) confirms that, despite a modest Sync-C gap, Lip Forcing (14 B) is rated higher than all baselines on video quality, identity preservation, and naturalness, while matching the best sync rating (X-Dub).

**Table 1.** Main comparison on HDTF [52]. Quality, sync, and identity metrics across baselines and Lip Forcing at two scales; TTFF in milliseconds. Best values **bold**; second-best <u>underlined</u>.

**Figure 6.** Qualitative comparison on HDTF. Each row shows the same source frame rendered by our method, six lip-sync baselines, and the ground truth (GT) at the moment of articulating the bracketed English phoneme. Best viewed zoomed in and in color.

**Table 6.** **User study.** Mean Opinion Score (1–5 Likert, higher is better) on four axes: video–audio synchronization (Sync), video quality (Qual.), identity preservation (ID), and naturalness (Nat.). Best per column bold; second-best underlined.

Teacher Model Details

Details the teacher diffusion model that guides the distilled student.

This appendix details the teacher diffusion model that supervises the distilled student introduced earlier.

OmniAvatar is an audio‑driven portrait animation system that turns a single reference face and an audio clip into a short speaking video.

OmniAvatar‑LS repurposes OmniAvatar for lip‑sync by treating the mouth region as an inpainting target while keeping the rest of the video untouched.

The teacher and student share an identical preprocessing pipeline that extracts high‑quality, face‑aligned video‑audio pairs for inpainting‑based lip‑sync.

Finetuning OmniAvatar‑LS uses LoRA adapters, a constant learning rate, and a multi‑component loss that emphasizes accurate mouth motion.

Trajectory analysis (Section 4.2) runs the 14 B OmniAvatar‑LS teacher on ten held‑out Hallo3 clips, using identical noise seeds, reference frames, and CFG variants so that per‑step trajectories can be compared as paired samples.

Trajectory Analysis Details

We examine how the shifted‑ODE schedule and related ablations affect lip‑sync fidelity and synchronization.

The ablation study isolates the impact of the shifted‑ODE schedule and related design choices on the lip‑sync quality metrics.

The schedule concentrates inference steps in the high‑noise region, where the lip‑sync trajectory is most volatile, thereby giving the student model more opportunity to correct errors.

How does the shifted schedule differ from a simple uniform schedule?

In a uniform schedule each step advances the noise level linearly, giving equal resolution across the whole trajectory. The shifted schedule warps this progression so that early, high‑noise steps are denser (more updates where the lip‑sync signal changes rapidly) while later, low‑noise steps are sparser, which improves the student’s ability to track the fast‑changing part of the trajectory.
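The warp can be made concrete with a small sketch. The formula below is the common time-shift $\tau = s\,u / (1 + (s-1)\,u)$ applied to a uniform level $u = 1 - j/N$. The values $s=5$ and $N=50$ are assumptions rather than settings stated in this summary; they are chosen because they reproduce the shifted-$t$ band $[0.555, 0.882]$ quoted for steps 40 and 20 in the training details.

```python
def shifted_tau(j, num_steps=50, shift=5.0):
    """Noise level at step j under a shifted schedule (1.0 = pure noise)."""
    u = 1.0 - j / num_steps  # uniform level: 1 at j=0, 0 at j=num_steps
    return shift * u / (1.0 + (shift - 1.0) * u)


tau_20 = shifted_tau(20)  # about 0.882
tau_30 = shifted_tau(30)  # about 0.769
tau_40 = shifted_tau(40)  # about 0.556

# Spacing between neighbouring steps at the two ends of the trajectory.
first_gap = shifted_tau(0) - shifted_tau(1)
last_gap = shifted_tau(49) - shifted_tau(50)
```

Under these assumed values the first gap is much smaller than the last, which is the "denser at high noise" behaviour described above.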

**Figure 7.** The CFG fidelity–sync tradeoff (full 4-metric). Per-step mean across $n=10$ samples; shaded bands are $\pm 1$ standard error. Red: CFG-guided teacher ($s=4.5$); navy: no-CFG teacher ($s=1.0$). SSIM (mouth) tracks LPIPS, and Sync-D mirrors Sync-C: the same separation between the two trajectories observed in the main figure is reproduced on these additional metrics.

**Figure 8.** **Euler-step 2 × 2 factorial (full 4-metric).** Per-step mean across n=10 samples; shaded bands are ±1 standard error. Each trace is one cell of ($s_0$, $s_1$). The reference-axis pattern (cells sharing $s_0$ converge by mid-trajectory) holds on SSIM as well as LPIPS; the sync-axis pattern (single-CFG cells close most of the gap to CFG→CFG around step 30, then diverge outside the mid-trajectory window) holds on Sync-D as well as Sync-C.

**Figure 9.** Fixed-CFG endpoints vs. schedule operating point (full 4-panel). Step-49 endpoints of fixed-CFG sweeps at $s \in \{1.0, 3.0, 4.5, 6.0\}$ as open circles; the no-CFG$\rightarrow$CFG Euler-step operating point at $j=30$ as a filled green diamond. Both axes carry $\pm 1$ SE error bars on $n=10$ samples. Axes are oriented so up-right is favorable (LPIPS, Sync-D inverted). The Sync-D panels (the right column) tell the same story as the Sync-C panels reproduced in the main paper.

**Figure 10.** **CFG fidelity–sync tradeoff, audio-only drop mode.** Audio-only counterpart of main Fig. 2(a) (full 4-metric counterpart in Fig. 7). Per-step means on the mouth region across $n=10$ samples; shaded bands are $\pm 1$ standard error. Red: $s=4.5$ with audio-only drop. Navy: $s=1.0$ (drop mode irrelevant when guidance scale is 1.0). The same direction of separation holds across all four metrics under audio-only drop.

**Figure 11.** Euler-step CFG factorial, audio-only drop mode. Audio-only counterpart of main Fig. 2(b) (full 4-metric counterpart in Fig. 8). Per-step means on the mouth region across $n=10$ samples; shaded bands are $\pm 1$ standard error. Same four cells $(s_0, s_1)$ as the main paper: $s_0$ drives the velocity from noise; $s_1$ is used at the re-evaluated landing. Both axes of separation persist: cells sharing $s_0$ converge on SSIM and LPIPS by mid-trajectory; both single-CFG cells (green, purple) close most of the sync gap to the CFG $\rightarrow$ CFG ceiling around landings near step 30, and diverge outside this window.

**Figure 12.** **Trajectory plateau zoom around the joint reference-sync optimum.** Per-step means on the mouth region across $n=10$ samples; shaded bands are $\pm 1$ standard error. Same four Euler-step cells $(s_0, s_1)$ as main Fig. 2(b), restricted to landing steps $j_1 \in [15, 45]$. The plateau region $j_1 \in [25, 32]$ is shaded gold. Within the plateau, the no-CFG $\rightarrow$ CFG cell (green) achieves Sync-C and Sync-D close to the CFG $\rightarrow$ CFG ceiling (red) while keeping mouth-LPIPS close to the no-CFG $\rightarrow$ no-CFG floor (navy) — the recipe's joint optimum.

Training and Inference Details

This section details the training hyperparameters, schedules, architecture, and rollout procedures underlying Lip Forcing.

Both training stages use AdamW (weight decay $0.01$) with gradient-norm clipping at $10.0$ and run in bf16 mixed precision on four NVIDIA H200 GPUs, with an effective batch size of $64$ via gradient accumulation.

Stage 1 (Diffusion Forcing pretraining) runs $5\text{K}$ steps at learning rate $10^{-5}$ with a $1\text{K}$‑step linear warm‑up; Stage 2 (Self Forcing DMD distillation) runs $600$ steps at learning rate $2\times10^{-6}$ for both student and fake‑score critic, using $\beta_{1}=0$ to disable momentum while keeping $\beta_{2}=0.999$.

During Stage 1 the student is supervised with a block‑wise inhomogeneous timestep schedule: each three‑latent‑frame chunk is noised at a single timestep $\tau_{j}$ drawn uniformly from the shifted‑ODE grid $\{\tau_{j}\}_{j\in J_{LF}}$, e.g., $\{ \tau_{0},\tau_{30}\}$ for $J_{LF}=(0,30)$, so chunks are independent while frames inside a chunk share the same noise level.
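The block-wise sampling can be sketched as follows. This is a minimal sketch: the grid values are assumed numbers for $\tau_0$ and $\tau_{30}$, and the function name and return layout are illustrative.

```python
import random


def sample_blockwise_taus(num_chunks, grid, frames_per_chunk=3, rng=None):
    """Draw one noise level per chunk and share it across the chunk's frames."""
    rng = rng or random.Random()
    per_chunk = [rng.choice(grid) for _ in range(num_chunks)]
    per_frame = [tau for tau in per_chunk for _ in range(frames_per_chunk)]
    return per_chunk, per_frame


GRID = (1.0, 0.769)  # assumed values for tau_0 and tau_30
per_chunk, per_frame = sample_blockwise_taus(4, GRID, rng=random.Random(0))
```

Each chunk draws its level independently, so neighbouring chunks can sit at different noise levels while the three frames inside a chunk always agree.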

Stage 2 operates on the student’s self‑causal rollout; at each iteration the student produces a clean prediction $\hat{X}_{\theta}=\{\hat{x}_{i}\}_{i=1}^{N}$ via a $K=2$‑call few‑step schedule $J_{LF}=(0,30)$. A continuous timestep $t\sim q(t)$ is drawn from $[0.001,0.999]$, and the latent is re‑noised as $x_{t}=(1-t)\hat{X}_{\theta}+t\,\epsilon$ with fresh $\epsilon\sim\mathcal{N}(0,I)$. The teacher is queried with CFG when $j(t)\in[20,40]$ (shifted‑t band $[0.555,0.882]$) and without CFG otherwise.

The fake‑score critic is initialized from OmniAvatar release weights at the matching student scale and is trained without classifier‑free guidance at any timestep.

The 1.3 B student is fully fine‑tuned; the 14 B student uses LoRA (rank $128$, $\alpha=64$) on attention and feed‑forward layers while audio‑conditioning projections and patch embeddings are fully fine‑tuned. Inference proceeds chunk‑wise (three latent frames per chunk) with the first frame fixed as an attention sink and a six‑frame rolling context window; each chunk takes $K=2$ denoising steps at $J_{LF}=(0,30)$ and no CFG.

Training compute totals roughly $1{,}900$ H200‑GPU‑hours for the reported runs; accounting for exploratory experiments roughly doubles the project‑level compute to about $3{,}800$ H200‑GPU‑hours.

The SyncNet reward multiplies the generator gradient by $w(\hat{x}_{0})=\exp(\beta\cdot R(D(\hat{x}_{0}),a))$ with $\beta=2$, where $R$ is the raw SyncNet confidence between conditioning audio $a$ and the decoded visual frame $D(\hat{x}_{0})$; the Tiny AutoEncoder decoder is used for $D$ to keep latency low.
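The reward weighting amounts to scaling the generator gradient by a constant. A minimal sketch, assuming the confidence has already been computed by a frozen scorer and that the gradient is a plain list of numbers (both placeholders for illustration):

```python
import math


def reward_weight(sync_confidence, beta=2.0):
    """w = exp(beta * R); treated as a constant, so no gradient flows through it."""
    return math.exp(beta * sync_confidence)


def weighted_gradient(grad, sync_confidence, beta=2.0):
    """Scale every component of the generator gradient by the reward weight."""
    w = reward_weight(sync_confidence, beta)
    return [w * g for g in grad]


neutral = reward_weight(0.0)   # exp(0) leaves the gradient unchanged
boosted = weighted_gradient([0.1, -0.2], sync_confidence=0.5)
```

Because the weight is always positive, it changes the magnitude of the update but never the direction of the DMD gradient.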

During streaming rollout each chunk denoises three latent frames in parallel, writes its KV cache entries, and evicts the oldest non‑sink entries; temporal RoPE is applied based on each entry’s position within the rolling cache, preserving the training‑time sink‑to‑window gap.
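The eviction policy can be sketched with a fixed sink slot plus a bounded queue. This is an illustrative sketch only: integers stand in for per-frame KV entries, the first entry ever written is assumed to be the sink, and RoPE position assignment is not modelled.

```python
from collections import deque


class RollingKVCache:
    """One permanent sink entry plus a rolling window of recent entries."""

    def __init__(self, window=6):
        self.sink = None
        self.window = deque(maxlen=window)

    def write_chunk(self, entries):
        for entry in entries:
            if self.sink is None:
                self.sink = entry          # first frame is kept for the whole stream
            else:
                self.window.append(entry)  # deque drops the oldest entry when full

    def context(self):
        """Entries visible to the next chunk: the sink, then the window."""
        head = [] if self.sink is None else [self.sink]
        return head + list(self.window)


cache = RollingKVCache(window=6)
for start in (0, 3, 6):                    # three chunks of three frames each
    cache.write_chunk(range(start, start + 3))
visible = cache.context()
```

After three chunks the context holds frame 0 as the sink and frames 3 to 8 in the window, so memory stays constant however long the stream runs.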

Algorithm 1 (not reproduced here) outlines the full Lip Forcing training iteration, extending the Self Forcing algorithm with the windowed teacher schedule $s_{SW}$, the SyncNet reward weighting $w$, and the analysis‑derived two‑step student schedule $J_{LF}=(0,30)$; fake‑score updates follow the standard DMD pipeline at a $5\!:\!1$ student‑to‑critic ratio.

Efficiency and Compute Methodology

Concrete details of hardware, training loop, benchmarks, and user study.

All efficiency numbers are obtained on a single NVIDIA H100 80 GB GPU. Throughput and TTFF are measured from the first VAE encode to the end of the first chunk’s last VAE decode, excluding audio preprocessing, face detection, and any post‑decode compositing.
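A timing harness with these measurement boundaries might look like the sketch below. It is a hypothetical helper, not the authors' benchmark script: `fake_pipeline` stands in for the encode, denoise, and decode stages, and the 12 frames per chunk is a placeholder value.

```python
import time


def benchmark_stream(chunk_iter):
    """Time a streaming pipeline that yields the frame count of each decoded chunk."""
    start = time.perf_counter()  # clock starts just before the pipeline begins
    ttff = None
    frames = 0
    for frames_in_chunk in chunk_iter:
        frames += frames_in_chunk
        if ttff is None:
            ttff = time.perf_counter() - start  # first chunk fully decoded
    total = time.perf_counter() - start
    return {
        "ttff_ms": None if ttff is None else 1000.0 * ttff,
        "fps": frames / total if total > 0 else float("inf"),
        "frames": frames,
    }


def fake_pipeline(num_chunks=5, frames_per_chunk=12):
    """Placeholder workload: sleeps briefly instead of running a model."""
    for _ in range(num_chunks):
        time.sleep(0.001)
        yield frames_per_chunk


stats = benchmark_stream(fake_pipeline())
```

Keeping preprocessing and compositing outside `chunk_iter` matches the exclusions listed above.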

Sample a supervision call index $j^*$ uniformly from $J_{LF} = (0,30)$.

For each chunk $i=1\ldots N$, initialise a latent $x_i \sim \mathcal{N}(0,I)$ and run the two‑step schedule up to $j^*$, caching KV pairs from the KV‑returning generator $G_{KV}$.

At the selected step $j^*$, enable gradients, generate $\hat{x}_{i,0}=G_\theta(x_i;\tau_{j^*},KV,c_i)$, store it, then disable gradients and update the KV cache.

For earlier steps ($j<j^*$) run the generator without gradients, add Gaussian noise $\epsilon$, and compute the interpolated latent $x_{i,\tau'_j}=(1-\tau'_j)\hat{x}_{i,0}+\tau'_j\epsilon$.

After processing all chunks, sample a DMD timestep $t$, re‑noise the rollout, look up the CFG scale $s_t$, compute teacher and fake scores, weight the reward, and update $\theta$ with the weighted gradient of $L_{DMD}$.

All baselines (Wav2Lip [33], Video‑ReTalking [5], Diff2Lip [31], MuseTalk [51], LatentSync [25], X‑Dub [14]) are run from their publicly released code and checkpoints using default inference settings; no architectural or weight modifications are applied.

**Figure 14.** Throughput–FVD Pareto frontier across all baselines on HDTF. Companion to the diffusion-only chart in the main paper (Fig. 1): adds the single-pass methods Wav2Lip [33], Video-ReTalking [5], and MuseTalk [51] that are excluded from the main-body diffusion-only comparison. Self Forcing and the ground-truth row are still omitted; the FVD axis is inverted so the up-right corner is the best Pareto position. Vertical dotted line: 25-FPS playback rate; dashed line: Pareto frontier. Lip Forcing (14B) achieves the lowest FVD on the chart (107.88), while Wav2Lip's frontier position is throughput-only: its FVD (384.82) is about 3.6× that of Lip Forcing (14B), which is why the main-paper figure restricts the comparison to diffusion-based peers.

The user study follows a self‑hosted MOS protocol on 30 HDTF [52] and TalkVid [4] clips. Each rater sees ten pages (five HDTF, five TalkVid) with the ground‑truth video and three anonymised model outputs (A, B, C) randomly assigned to methods.

Raters provide four 5-point Likert scores per output (Video-Audio Synchronization, Video Quality, Identity Preservation, and Naturalness), so each rater evaluates 30 model outputs (ten pages × three outputs) and gives 120 individual scores.

Beyond the main `HDTF_short` benchmark, we evaluate on `HDTF_long` (full-length videos up to 6 min), Hallo3 (30 out-of-domain clips), and TalkVid (30 self-driven clips). All clips undergo the same alignment pipeline (face detection, InsightFace [9] affine alignment, 512×512 crop) and are re-inserted into the original frames before metric computation.

Table 8 (Hallo3) shows Lip Forcing (14 B) achieving the best FVD, SSIM, and CSIM scores, while LatentSync leads on FID and Diff2Lip on Sync‑D. Table 9 (TalkVid) reports Lip Forcing (14 B) as best on FVD, FID, SSIM, and CSIM, with Diff2Lip and LatentSync still excelling on the sync metrics.

Section E.4 extends the evaluation to long‑video rollouts, confirming that the two‑step causal student maintains temporal stability over minutes‑long sequences.

Long-Video Evaluation

In long-video and cross-identity tests, Lip Forcing (14B) achieves a much lower FVD than Wav2Lip, while LatentSync and MuseTalk lead on several other metrics.

Lip Forcing (14B) reduces FVD by 253.5 points compared to Wav2Lip on HDTF long videos.

Wav2Lip records 372.47 FVD while Lip Forcing (14B) records 118.97 FVD.

Beyond the headline FVD, Lip Forcing (14B) also ranks near‑top on identity and sync metrics, while LatentSync dominates FID, SSIM, Sync‑D and Sync‑C, and MuseTalk leads CSIM.

**Table 10.** Long-video evaluation on HDTF. Quality, identity, and sync metrics on `HDTF_long` clips up to 6 minutes in duration. Best values bold; second-best underlined.

Qualitative long‑video samples (Fig. 15) demonstrate that Lip Forcing maintains visual fidelity, identity, and background consistency across a 3‑minute rollout, whereas the X‑Dub baseline quickly shows over‑saturation and identity drift.

**Figure 15.** Long-video qualitative results on `HDTF_long`. Two identities, each rolled out to t=180 s and sampled every 30 s, comparing ground truth, Lip Forcing, and the strongest baseline X-Dub at consistent timestamps. Frame quality, identity, and background remain stable across the full 3-minute rollout under Lip Forcing’s causal AR streaming, well beyond the 81-frame (~3.24 s) training chunk.
