Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach

Stable Video Diffusion (SVD) uses systematic data curation and a three-stage training pipeline to achieve state-of-the-art video generation.

How can we systematically curate large-scale video datasets to effectively train high-resolution latent video diffusion models?

Generative video models lack a unified training strategy, and the impact of data selection on video synthesis remains largely unexplored compared to image modeling. The authors introduce a three-stage training pipeline—image pretraining, large-scale video pretraining, and high-quality video finetuning—supported by a systematic data curation workflow that filters for aesthetic quality, motion, and caption alignment. This approach produces a base model that serves as a powerful motion prior, outperforming existing methods in text-to-video, image-to-video, and multi-view 3D generation.

Paper Primer

The core mechanism hinges on separating video training into distinct stages: initializing from a 2D image model, pretraining on a large, curated video dataset to learn general motion, and finetuning on a smaller, high-fidelity dataset. The authors demonstrate that curation—filtering out static scenes, watermarks, and low-quality clips—is essential for performance gains that persist through the final finetuning stage.

SVD significantly outperforms prior state-of-the-art video generation models on zero-shot text-to-video benchmarks.

Evaluation on the UCF-101 dataset shows an FVD score of 242.02, compared to 355.20 for the next best baseline (PYoCo). This is a substantial improvement in temporal consistency and visual quality as measured by Fréchet Video Distance (FVD), where lower is better.

The model acts as a strong 3D prior, enabling efficient multi-view generation from a single image.

Finetuning SVD on multi-view datasets (Objaverse, MVImgNet) achieves competitive results in 16 hours of training, whereas specialized models like SyncDreamer require four days. The finetuned model outperforms image-based methods at a fraction of the compute budget.

Why is a three-stage training process necessary instead of just training on high-quality data from the start?

The authors find that large-scale video pretraining is required to learn a general motion representation, while high-quality finetuning is necessary to achieve high visual fidelity; skipping the pretraining stage results in lower-ranked performance in human preference studies.

What is the scope of this model—can it be adapted for specific types of motion?

Yes, the model supports explicit motion control by training Low-Rank Adaptation (LoRA) modules on specific subsets of data, such as those categorized as "zooming" or "horizontally moving," which can be plugged into the temporal attention blocks.

Researchers should prioritize systematic data curation and multi-stage training pipelines over architectural tweaks alone. SVD provides a robust, reusable motion prior that drastically lowers the compute barrier for downstream tasks like multi-view 3D synthesis.

Introduction to Stable Video Diffusion

We introduce Stable Video Diffusion, a high‑resolution video generator built on latent diffusion and systematic data curation.

Stable Video Diffusion (SVD) is presented as a latent video diffusion model that produces high‑resolution, state‑of‑the‑art text‑to‑video and image‑to‑video results. Recent work has repurposed image diffusion models for video by adding temporal layers and fine‑tuning on small video sets, but training recipes remain fragmented.

The primary bottleneck for training high‑quality video diffusion models is the raw video data: noisy, static, or caption‑less clips degrade generation quality. Systematic filtering of such content—removing static scenes, cuts, and overlaid text—yields a markedly stronger base model.

A latent video diffusion model treats a video as a sequence of latent frames and learns to denoise them jointly, extending the image‑level diffusion process into the temporal dimension.

We identify three crucial training stages: (i) text‑to‑image pre‑training to acquire strong spatial priors, (ii) video pre‑training on a massive low‑resolution dataset to learn motion, and (iii) high‑resolution finetuning on a curated high‑quality set to sharpen details.

A systematic data‑curation pipeline—referred to as Cascaded Curation—filters out static shots, abrupt cuts, and overlaid subtitles before captioning, producing a clean video corpus for pre‑training.

The resulting base model encodes a general motion and multi‑view prior, which can be leveraged for downstream tasks such as image‑to‑video generation and camera‑motion‑specific LoRA adaptation.

Fine‑tuning this base on a smaller, high‑quality video set yields a multi‑view diffusion model that generates consistent object views in a single forward pass, surpassing specialized novel‑view synthesis methods while using far less compute.

Additionally, lightweight LoRA modules trained on motion‑specific datasets can be plugged into the temporal layers, granting explicit control over motion cues without retraining the full model.

SVD achieves state‑of‑the‑art video synthesis through systematic data curation.

**Figure 1.** Stable Video Diffusion samples. Top: Text-to-Video generation. Middle: (Text-to-)Image-to-Video generation. Bottom: Multi-view synthesis via Image-to-Video finetuning.

Related Work

Training latent video diffusion models requires large amounts of clean data; we apply a curation pipeline that filters out static scenes, cuts, and overlaid text.

This section surveys prior work on latent video diffusion and the data‑curation practices that underpin modern video generators.

Generative models that operate in a compressed latent space, reducing computation compared with pixel‑space diffusion.

Layers that inject temporal information into a spatially pretrained model, typically via convolutions or attention.

Noise that varies smoothly over time, encouraging temporally consistent generations.

Full‑model fine‑tuning where every spatial convolution and attention layer is followed by a temporal counterpart.

Two extremes: updating only temporal modules versus using a frozen diffusion backbone.

Conditioning the diffusion model either directly on a text prompt or via an auxiliary text‑to‑image prior.

Injecting the desired output frame rate as an additional conditioning signal.

SVD adopts the EDM diffusion formulation and shifts the noise schedule toward higher noise levels during high‑resolution finetuning.

Datasets containing billions of image‑text pairs that enable strong representations for downstream tasks.

Using CLIP similarity scores to filter noisy image‑text pairs before training generative models.

A publicly available 10‑million‑clip video collection widely used despite watermarks and suboptimal size.

A systematic pipeline: (1) pretrain on large curated video data, (2) fine‑tune with high‑noise EDM schedule, (3) adapt to downstream tasks via micro‑conditioning.

Cascaded Data Curation Pipeline

How we filter raw video data into a high‑quality training set

Raw video collections are riddled with abrupt cuts, static shots, and on‑screen text, all of which corrupt the learning signal for diffusion models.

We run a sequence of cheap, coarse filters followed by finer, more expensive ones. Each stage removes a specific class of undesirable clips, so the final corpus is cleaner than the raw input, while cut‑aware splitting yields more clips per source video.

How does Cascaded Data Curation differ from a single‑stage cut detector?

A single‑stage detector runs at one temporal resolution, so it either misses gradual transitions (if the rate is low) or wastes compute on the whole video (if the rate is high). The cascade first removes obvious cuts cheaply, then re‑examines the survivors at higher rates, catching subtle edits without processing every frame of every clip.

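As a toy illustration, take an 8‑frame clip (frames 0 to 7) containing two abrupt cuts and one slow fade. 1 fps scan: compare consecutive frame pairs (0‑1, 1‑2, ..., 6‑7). Only the abrupt cuts at 2‑3 and 5‑6 exceed the threshold, yielding two detected cuts.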

2 fps scan on the remaining segments: compare frames (0‑2, 1‑3, 4‑6, 5‑7). The slow fade between 4‑5 now spans two frames (4‑6) and exceeds the higher‑resolution threshold, adding one more cut.

4 fps scan on the sub‑segment 4‑5: compare frames (4‑5) directly; the fade is confirmed, but no new cuts appear.

Union of detections: three cuts total (two abrupt, one gradual). The cascade thus recovers a transition missed by the 1 fps pass.

Resulting clip count: original 1 clip → 4 clips after splitting at the three cut points, a four‑fold increase matching the paper’s reported gain.

The cascade recovers subtle transitions without incurring the full cost of high‑rate scanning on the entire dataset, turning a noisy video pool into a richer set of training clips.
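The multi‑rate idea can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's detector: it assumes each frame has already been reduced to one scalar (a hypothetical per‑frame feature such as mean brightness), and the names `detect_cuts`, `cascaded_cuts`, and `split_at`, along with the strides and thresholds, are invented for the example.

```python
def detect_cuts(signal, stride, threshold):
    """Flag index i when frames i - stride and i differ by more than threshold."""
    return {
        i
        for i in range(stride, len(signal), stride)
        if abs(signal[i] - signal[i - stride]) > threshold
    }


def cascaded_cuts(signal, passes):
    """Union the detections of several (stride, threshold) passes."""
    cuts = set()
    for stride, threshold in passes:
        cuts |= detect_cuts(signal, stride, threshold)
    return sorted(cuts)


def split_at(n_frames, cuts):
    """Turn sorted cut indices into half-open (start, end) clip ranges."""
    bounds = [0] + [c for c in cuts if 0 < c < n_frames] + [n_frames]
    return list(zip(bounds, bounds[1:]))


# Toy signal (assumed data): abrupt jump at frame 4, slow ramp over frames 8 to 12.
signal = [0, 0, 0, 0, 6, 6, 6, 6, 8, 10, 12, 14, 16, 16, 16, 16]
cuts = cascaded_cuts(signal, passes=[(1, 5), (4, 7)])
clips = split_at(len(signal), cuts)
```

On this toy signal one pass flags the abrupt jump at frame 4 and the other flags the slow ramp ending at frame 12; neither pass alone finds both, and the union splits the video into three clips.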

Stage I – Image pretraining: initialize the video diffusion backbone with a 2D text‑to‑image diffusion model (Stable Diffusion 2.1) to obtain strong spatial features.

Stage II – Video pretraining: train on the curated dataset (LVD‑F) produced by Cascaded Data Curation, using the same diffusion objective as the image model.

Stage III – High‑quality finetuning: fine‑tune the pretrained video model on a small, high‑resolution subset (≈250 K clips) that passed all curation filters, increasing resolution and training steps.

**Figure 2.** Our initial dataset contains many static scenes and cuts which hurts training of generative video models. *Left:* Average number of clips per video before and after our processing, revealing that our pipeline detects lots of additional cuts. *Right:* We show the distribution of average optical flow score for one of these subsets before our processing, which contains many static clips.

By chaining these inexpensive filters we obtain a curated video corpus that is cleaner and split into more clips per video, enabling the subsequent training stages to learn richer motion dynamics without being distracted by artefacts.

Scaling and Model Capabilities

Training at scale shows curated data dramatically improves video diffusion performance.

Our base model achieves state‑of‑the‑art zero‑shot text‑to‑video performance on UCF‑101, with an FVD of 242.02, beating prior methods by a large margin.

Table 2 shows 242.02 FVD versus 355.20 for the next best baseline.

4.1 Pretrained Base Model – We fine‑tune a Stable Diffusion 2.1 backbone with a shifted noise schedule, first on 256 × 384 frames (150 k steps, batch 1536) and then on 320 × 576 frames (100 k steps, batch 768). The resulting model learns a powerful motion representation and dominates baselines on zero‑shot UCF‑101.

4.2 High‑Resolution Text‑to‑Video – Using a curated 1 M‑sample video set (rich motion, steady camera, aligned captions) we fine‑tune the base model for 50 k steps at 576 × 1024 resolution, again shifting the noise schedule toward more noise.

4.3 High‑Resolution Image‑to‑Video – We replace the text encoder with a CLIP image encoder and concatenate a noise‑augmented copy of the conditioning frame channel‑wise to the UNet input. Two variants are trained (14‑frame and 25‑frame outputs) with linearly increasing classifier‑free guidance across the temporal axis.

Instead of random weights, we initialise the spatial convolutional layers from a pretrained image diffusion model, giving the video network a strong visual prior from the start.

How does spatial layer initialization differ from the usual random init used in video models?

Random init forces the network to learn low‑level visual features from scratch, whereas copying pretrained image weights supplies a ready‑made representation of textures, edges, and colors, letting the training budget be spent on temporal dynamics.

Low‑rank adapters (LoRAs) are inserted into the temporal attention blocks; each LoRA is trained on a small subset of videos annotated with a specific camera motion type (horizontal, zooming, static).

Why not just train separate video models for each camera motion instead of using LoRAs?

Separate models would multiply training cost and storage; LoRAs share the bulk of the network and only add a small set of motion‑specific parameters, enabling on‑the‑fly switching without retraining.
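The parameter sharing behind this argument can be illustrated with a small sketch. It assumes the standard LoRA form, in which a frozen weight matrix is augmented by a low‑rank product; the matrices, the `scale` argument, and the adapter dictionary below are hypothetical toy values, not the paper's configuration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(weight, lora_b, lora_a, scale=1.0):
    """Return weight + scale * (lora_b @ lora_a) without modifying weight."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# One shared (frozen) weight matrix, stand-in for a temporal attention projection.
base = [[1.0, 0.0], [0.0, 1.0]]

# Rank-1 adapters per camera motion (toy numbers, purely illustrative).
adapters = {
    "zooming": ([[1.0], [2.0]], [[3.0, 4.0]]),
    "static": ([[0.0], [0.0]], [[1.0, 1.0]]),
}

zoom_weight = apply_lora(base, *adapters["zooming"])
static_weight = apply_lora(base, *adapters["static"])
```

Switching motion types only swaps the small `(lora_b, lora_a)` pair; the shared `base` matrix is never copied or retrained.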

We fine‑tune the video diffusion model on multi‑view datasets so that, given a single input image, it learns to generate a sequence of novel viewpoints that remain geometrically consistent.

How does multi‑view finetuning differ from training a dedicated multi‑view generator from scratch?

Starting from a video‑diffusion prior means the model already knows how to synthesize temporally coherent frames; finetuning merely teaches it to respect camera pose changes, which converges in far fewer steps than training a fresh model.

4.4 Frame Interpolation – By concatenating left and right frames as masked inputs to the UNet, the model learns to predict three intermediate frames, effectively quadrupling the frame rate after only ~10 k fine‑tuning steps.

4.5 Multi‑View Generation – We train SVD‑MV on Objaverse (150 K curated objects) and MVImgNet (≈ 200 K videos). Evaluation on 50 unseen GSO objects shows superior PSNR, LPIPS, and CLIP‑S scores compared to image‑prior (SD 2.1‑MV) and scratch baselines, while requiring far less compute than SyncDreamer.

**Figure 3.** (a) Initializing spatial layers from pretrained image models greatly improves performance. (b) Video data curation boosts performance after video pretraining.

**Figure 4.** Summarized findings of Sections 3.3 and 3.4: Pretraining on curated datasets consistently boosts performance of generative video models during video pretraining at small (Figures 4a and 4b) and larger scales (Figures 4c and 4d). Remarkably, this performance improvement persists even after 50k steps of video finetuning on high quality data (Figure 4e).

**Figure 5.** Samples at 576 × 1024. Top: Image-to-video samples (conditioned on leftmost frame). Bottom: Text-to-video samples.

**Figure 6.** Our 25 frame Image-to-Video model is preferred by human voters over GEN-2 [74] and PikaLabs [54].

**Figure 7.** Applying three camera motion LoRAs (horizontal, zooming, static) to the same conditioning frame (on the left).

**Figure 8.** Generated multi-view frames of a GSO test object using our SVD-MV model (i.e. SVD finetuned for Multi-View generation), SD2.1-MV [72], Scratch-MV, SyncDreamer [58], and Zero123XL [14].

**Figure 9.** (a) Multi-view generation metrics on Google Scanned Objects (GSO) test dataset. SVD-MV outperforms image-prior (SD2.1-MV) and no-prior (Scratch-MV) variants, as well as other state-of-the-art techniques. (b) Training progress of multi-view generation models with CLIP-S (solid, left axis) and PSNR (dotted, right axis) computed on GSO test dataset. SVD-MV shows better metrics consistently from the start of finetuning.

**Figure 10.** Generated novel multi-view frames for MVImgNet dataset using our SVD-MV model, SD2.1-MV [72], Scratch-MV.

**Figure 17.** Results of the dedicated experiments conducted to identify the most useful filtering thresholds for each ablation axis. For each of these ablation studies we train four identical models using the architecture detailed in App. E.2.2 on different subsets of LVD-10M, which we create by systematically increasing the thresholds, which corresponds to filtering out more and more examples.

**Figure 18.** Additional Text-to-Video samples. Captions from top to bottom: “A hiker is reaching the summit of a mountain, taking in the breathtaking panoramic view of nature.”, “A unicorn in a magical grove, extremely detailed.”, “Shoveling snow”, “A beautiful fluffy domestic hen sitting on white eggs in a brown nest, eggs are under the hen.”, and “A boat sailing leisurely along the Seine River with the Eiffel Tower in background by Vincent van Gogh”.

**Figure 19.** Additional Image-to-Video samples. Leftmost frame is used for conditioning.

**Figure 20.** Additional Image-to-Video samples with camera motion LoRAs (conditioned on leftmost frame). The first, second, and third rows correspond to horizontal, static, zooming, respectively.

**Figure 21.** Text-to-video samples using the prompt “Flowers in a pot in front of a mountainside” (for spatial cross-attention). We adjust the camera control by replacing the prompt in the temporal attention using “”, “panning”, “rotating”, and “zooming” (from top to bottom). While not being trained for this inference task, the model performs surprisingly well.

**Figure 22.** Additional image-to-multi-view generation samples from GSO test dataset, using our SVD-MV model trained on Objaverse, and comparison with other methods.

**Figure 23.** Additional image-to-multi-view generation samples from GSO test dataset, using our SVD-MV model trained on Objaverse.

**Figure 24.** Text-to-image-to-multi-view generation samples: text to image using SDXL with the prompt "Centered 3D model of a cute anthropomorphic sunflower figure (plain background, unreal engine render 4k)", and image-to-multi-view using our SVD-MV model trained on Objaverse.

**Figure 25.** Additional multi-view generation samples from MVI dataset, using our SVD-MV model trained on MVImgNet, and comparison with other methods. Top row is ground truth frames, second row is sample frames from SVD-MV (ours), third row is from SD2.1-MV, bottom row is from Scratch-MV.

Curated data consistently boosts performance across both small and large model scales.

Ablation Studies on Data Filtering

We filter raw video clips to remove cuts, static scenes, and text, then assess motion and aesthetics to improve training data quality.

Building on the large‑scale training pipeline, we now describe the data‑processing steps that turn raw video archives into a high‑quality training set.

A three‑stage detector runs at different frame rates and thresholds to catch both abrupt cuts and gradual fades.

Why not just use a single cut detector with a low threshold?

A low threshold would flag ordinary motion as cuts, producing many spurious clips and wasting storage; the cascade separates fast and slow changes to maintain precision.

Clips are cut at the nearest keyframe timestamps, ensuring no cut falls inside a clip and allowing fast seeking.

Would clipping at arbitrary frames cause visual artifacts?

Yes—cutting mid‑GOP can corrupt macroblock boundaries, leading to decoding errors; keyframe alignment prevents such artifacts.

We compute dense flow at $2$ fps, downscale to a $16$ px shortest side, and average over time and space to obtain a single motion score per clip.

Why not compute flow at full resolution for the whole dataset?

Full‑resolution flow would explode storage and compute costs; the coarse representation provides a good trade‑off for the massive pre‑training corpus.
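The aggregation step (average flow magnitude over time and space) is simple enough to sketch directly. The example assumes the flow maps have already been extracted and downscaled by some optical flow routine; the nested-list layout, the function names, and the `min_score` threshold are hypothetical choices for illustration.

```python
import math


def motion_score(flow_maps):
    """Mean flow magnitude over all flow maps and all pixels.

    flow_maps: list (time) of 2D grids whose cells are (dx, dy) tuples.
    """
    total, count = 0.0, 0
    for flow in flow_maps:
        for row in flow:
            for dx, dy in row:
                total += math.hypot(dx, dy)
                count += 1
    return total / count if count else 0.0


def filter_static(clips, min_score):
    """Keep the names of clips whose motion score reaches min_score."""
    return [name for name, flows in clips.items() if motion_score(flows) >= min_score]


# Three 2x2 flow maps per clip (toy data).
static_clip = [[[(0.0, 0.0)] * 2] * 2] * 3
moving_clip = [[[(3.0, 4.0)] * 2] * 2] * 3
kept = filter_static({"static": static_clip, "moving": moving_clip}, min_score=1.0)
```

Because each clip collapses to one number, thresholds can be swept cheaply over the whole corpus without touching the videos again.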

For the finetuning stage we recompute flow with RAFT at $800\times450$ resolution, then aggregate as before to obtain precise motion scores.

Could we use the coarse flow for finetuning as well?

Coarse flow may overlook subtle motions, leading to residual static clips in the finetuning set and degrading final video quality.

Three captions per clip are generated: a spatial caption from CoCa, a temporal caption from VideoBLIP, and a merged caption from a lightweight LLM.

Why not rely on a single captioner?

A single model either ignores temporal cues (image‑only) or lacks fine‑grained spatial detail (video‑only); combining them yields richer, more accurate descriptions.

We compute CLIP embeddings for the first, center, and last frames of each clip and their captions; cosine similarity filters misaligned pairs, while an aesthetics predictor discards visually low‑quality frames.

Could we skip the aesthetics score and rely only on similarity?

Similarity alone does not guarantee sharp, well‑exposed frames; aesthetic scoring removes blurry or low‑contrast clips that would otherwise degrade generation quality.
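A minimal sketch of the similarity part of this filter follows. It takes precomputed embedding vectors as plain lists (no real CLIP model is called), and averaging the three per-frame similarities into a single clip score is an assumption made here for illustration, as are the function names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def caption_alignment(frame_embeddings, caption_embedding):
    """Average frame-to-caption similarity over the sampled frames (assumed aggregation)."""
    sims = [cosine(f, caption_embedding) for f in frame_embeddings]
    return sum(sims) / len(sims)


# Toy embeddings for the first, center, and last frame of one clip.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
caption = [1.0, 0.0]
score = caption_alignment(frames, caption)
```

Clips whose score falls below a chosen threshold would be treated as misaligned; the aesthetics predictor is a separate model and is not sketched here.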

CRAFT detects on‑screen text; clips whose cumulative text bounding‑box area exceeds $7\%$ of the frame are discarded to prevent the model from learning to reproduce text.

Why not discard any clip with detected text?

Minor text (e.g., logos) is often harmless; a strict filter would unnecessarily shrink the dataset, whereas the $7\%$ rule targets only heavily text‑laden videos.
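The $7\%$ rule reduces to a short area computation, sketched below. The box format `(x0, y0, x1, y1)` in pixels and the function names are assumptions; the text detector itself is not invoked, and overlapping boxes are simply summed, which may overcount.

```python
def text_area_fraction(boxes, frame_width, frame_height):
    """Summed bounding-box area divided by frame area (overlaps are counted twice)."""
    area = sum(max(0, x1 - x0) * max(0, y1 - y0) for x0, y0, x1, y1 in boxes)
    return area / (frame_width * frame_height)


def keep_clip(boxes, frame_width, frame_height, max_fraction=0.07):
    """Keep the clip unless detected text covers more than max_fraction of the frame."""
    return text_area_fraction(boxes, frame_width, frame_height) <= max_fraction


logo_only = [(0, 0, 10, 10)]          # 1% of a 100x100 frame
subtitle_heavy = [(0, 80, 100, 90)]   # 10% of a 100x100 frame
```

With these toy boxes, `keep_clip(logo_only, 100, 100)` keeps the clip and `keep_clip(subtitle_heavy, 100, 100)` discards it.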

The curated set, called LVD‑F, consists of clips that have passed all filters: cut‑aware clipping, motion scoring, synthetic captioning, CLIP similarity & aesthetics, and text detection.

Diffusion Model Framework

We detail the diffusion framework, modified preconditioning, and a linearly increasing guidance scheme.

Diffusion models treat generation as a reverse‑time stochastic process. The continuous‑time diffusion objective defines a probability‑flow ODE that gradually denoises a noisy latent $x_M\sim\mathcal{N}(0,\sigma_{\max}^2)$ toward the data distribution $p_{\text{data}}(x_0)$.

Training reduces to learning a score model $s_\theta(x;\sigma)$, typically instantiated as a denoiser $D_\theta$ that predicts the clean latent $x_0$ from a noisy input. Denoising score matching (DSM) samples a noise level $\sigma$ and isotropic noise $n\sim\mathcal{N}(0,\sigma^2)$, then minimizes a weighted $\ell_2$ loss $\lambda_\sigma\|D_\theta(x_0+n;\sigma)-x_0\|^2$.

Classifier‑free guidance blends a conditional denoiser $D(x;\sigma,c)$ with an unconditional counterpart $D(x;\sigma)$ using a scalar weight $w\ge0$, yielding $D_w(x;\sigma,c)=w\,D(x;\sigma,c)-(w-1)D(x;\sigma)$.

We replace the original discrete‑noise preconditioning of Stable Diffusion 2.1 with EDM‑style preconditioning functions: $c_{\text{skip}}(\sigma)=(\sigma^2+1)^{-1}$, $c_{\text{out}}(\sigma)=-\sigma/\sqrt{\sigma^2+1}$, $c_{\text{in}}(\sigma)=1/\sqrt{\sigma^2+1}$, and $c_{\text{noise}}(\sigma)=0.25\log\sigma$. The noise schedule follows $\log\sigma\sim\mathcal{N}(P_{\text{mean}},P_{\text{std}}^2)$ with $P_{\text{mean}}=-1.2$, $P_{\text{std}}=1$, and weighting $\lambda(\sigma)=(1+\sigma^2)\sigma^{-2}$, which equals $1/c_{\text{out}}(\sigma)^2$.
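How these pieces fit together can be sketched as follows. The sketch assumes the usual EDM wrapper $D_\theta(x;\sigma)=c_{\text{skip}}x+c_{\text{out}}F_\theta(c_{\text{in}}x;c_{\text{noise}})$ and the square‑root forms $c_{\text{out}}=-\sigma/\sqrt{\sigma^2+1}$ and $c_{\text{in}}=1/\sqrt{\sigma^2+1}$, under which the weighting $\lambda(\sigma)$ equals $1/c_{\text{out}}^2$. The `network` callable and all function names are hypothetical stand‑ins, and latents are flat lists for brevity.

```python
import math
import random


def c_skip(sigma):
    return 1.0 / (sigma ** 2 + 1.0)


def c_out(sigma):
    return -sigma / math.sqrt(sigma ** 2 + 1.0)


def c_in(sigma):
    return 1.0 / math.sqrt(sigma ** 2 + 1.0)


def c_noise(sigma):
    return 0.25 * math.log(sigma)


def loss_weight(sigma):
    """lambda(sigma) = (1 + sigma^2) / sigma^2."""
    return (1.0 + sigma ** 2) / sigma ** 2


def sample_sigma(rng, p_mean=-1.2, p_std=1.0):
    """Draw sigma with log(sigma) ~ N(p_mean, p_std^2)."""
    return math.exp(rng.gauss(p_mean, p_std))


def denoise(network, x, sigma):
    """Preconditioned denoiser: c_skip * x + c_out * F(c_in * x, c_noise)."""
    scaled = [c_in(sigma) * xi for xi in x]
    raw = network(scaled, c_noise(sigma))
    return [c_skip(sigma) * xi + c_out(sigma) * ri for xi, ri in zip(x, raw)]


def zero_network(x, noise_cond):
    """Placeholder network that predicts zeros (assumption for the demo)."""
    return [0.0] * len(x)


sigma = sample_sigma(random.Random(0))
out = denoise(zero_network, [2.0, 4.0], 1.0)
```

With the placeholder network the output reduces to the skip path, `c_skip(1.0) * x`, which makes the role of each coefficient easy to inspect in isolation.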

Instead of a fixed guidance weight, we linearly ramp $w$ from a low value at the first frame to a high value at the last frame, preventing both under‑guidance (inconsistent frames) and over‑guidance (oversaturation).

Compute the scale tensor of shape $(2,4)$ by linearly interpolating between 1.0 and 2.0.

Reshape the latent tensor $x$ from shape $(8,\dots)$ to $(2,4,\dots)$.

For each frame $t$, combine the conditional and unconditional predictions with that frame's scale $w_t$: $D_{w_t}(x;\sigma,c)=w_t\,D(x;\sigma,c)-(w_t-1)\,D(x;\sigma)$.

Reshape back to $(8,\dots)$ for the next diffusion step.

The linear schedule preserves the relative ordering of frames while gradually strengthening the conditioning signal, which a constant $w$ cannot achieve.

How does this linear schedule differ from the standard constant‑weight classifier‑free guidance?

Standard guidance applies the same $w$ to every frame, so one value has to cover both failure modes: too small a weight leaves frames inconsistent with the conditioning signal, while too large a weight causes oversaturation. The linear schedule assigns a low $w$ to early frames and a high $w$ to later frames, strengthening the conditioning signal gradually along the frame axis and balancing consistency against oversaturation.

LinearPredictionGuider – implements the linearly increasing guidance schedule.
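A dependency‑free sketch of the schedule is given below. It is not the `LinearPredictionGuider` implementation itself: tensors are replaced by nested lists, the function names are hypothetical, and the combination uses $D_u + w_t\,(D_c - D_u)$, which is algebraically the same as $w_t D_c-(w_t-1)D_u$.

```python
def linear_scales(min_scale, max_scale, num_frames):
    """Evenly spaced guidance weights from min_scale (first frame) to max_scale (last)."""
    if num_frames == 1:
        return [min_scale]
    step = (max_scale - min_scale) / (num_frames - 1)
    return [min_scale + i * step for i in range(num_frames)]


def guide(uncond, cond, scales):
    """Per-frame classifier-free guidance: u + w_t * (c - u).

    uncond, cond: lists (frames) of flat lists holding the two denoiser outputs.
    """
    return [
        [u + w * (c - u) for u, c in zip(frame_u, frame_c)]
        for frame_u, frame_c, w in zip(uncond, cond, scales)
    ]


scales = linear_scales(1.0, 2.0, 5)
guided = guide(
    uncond=[[0.0], [0.0]],
    cond=[[1.0], [1.0]],
    scales=[1.0, 3.0],
)
```

A weight of 1 returns the conditional prediction unchanged, while larger weights extrapolate away from the unconditional one, so later frames are pushed harder toward the conditioning signal.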

Interpolation models reduce the output frame count to five, insert a learned mask embedding $z_m$, and concatenate conditioning frames $z_s$, $z_e$ with $z_m$ to form a latent sequence of shape $(5,c,h,w)$. The UNet receives this sequence together with a binary mask indicating which slots contain real frames.

For multi‑view generation we condition on a single image and on the camera elevation. The elevation angle is encoded by a sinusoidal timestep embedding, concatenated to the UNet’s global conditioning vector, and the model is trained on 21‑frame orbits rendered from the Objaverse dataset.

Evaluation Methodology

We describe the human preference study, model training details, and extra evaluation experiments.

Human Preference Assessment gathers pairwise judgments on visual quality and prompt fidelity to rank models. It underpins most of the paper’s evaluation.

Human annotators compare two videos per prompt, voting on which looks better and which follows the text more faithfully.

In the experimental setup each model pair is evaluated 1 v 1 on every prompt, and we collect three votes per task from distinct annotators. The order of prompts and model ordering is fully randomized, and attention‑check items are interleaved to guarantee data quality.

Elo scores are computed by treating each model as a player in a zero‑sum game. For a match between players 1 and 2 the expected outcomes are $E_1 = \frac{1}{1 + 10^{(R_2 - R_1)/400}}$ and $E_2 = \frac{1}{1 + 10^{(R_1 - R_2)/400}}$. After observing the result $S_i$ (1 for win, 0 for loss) the ratings update as $R_i \leftarrow R_i + K \cdot (S_i - E_i)$ with $K=1$ and all models initialized at $R_{\text{init}} = 1000$.
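The update rule translates directly into code. The sketch below follows the formulas above with $K=1$ and $R_{\text{init}}=1000$; the function names and the `(winner, loser)` vote format are assumptions for illustration.

```python
def expected_score(r_a, r_b):
    """Expected outcome for the player rated r_a against the player rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def elo_update(r1, r2, s1, k=1.0):
    """Return updated ratings after one match; s1 is 1 if player 1 won, else 0."""
    e1 = expected_score(r1, r2)
    e2 = expected_score(r2, r1)
    return r1 + k * (s1 - e1), r2 + k * ((1 - s1) - e2)


def rate(votes, models, r_init=1000.0, k=1.0):
    """Process (winner, loser) votes in order and return the final ratings."""
    ratings = {m: r_init for m in models}
    for winner, loser in votes:
        ratings[winner], ratings[loser] = elo_update(
            ratings[winner], ratings[loser], 1, k
        )
    return ratings


ratings = rate([("model_a", "model_b")], ["model_a", "model_b"])
```

Because each update depends on the current ratings, the final scores depend on the order in which votes are processed; the total rating mass stays constant since the two expected scores sum to one.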

All models share a temporal UNet built on the Stable Diffusion 2.1 2D‑UNet backbone. Temporal convolutions and (cross‑)attention layers are added after each spatial block, and the spatial weights are initialized from the pretrained 2D‑UNet except for the ablation shown in Figure 3a.

Filtering thresholds are calibrated per annotation type: we keep the top 75 % of motion examples, discard the bottom 25 % of aesthetic scores, and drop the 25 % of frames with the largest text area. These cuts shrink the LVD dataset by more than three‑fold while improving downstream quality.
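Percentile‑style thresholds of this kind can be sketched as a rank‑and‑drop helper. The clip records, field names, and the choice to apply the three filters one after another are assumptions made for the example; the source does not state how the filters are combined.

```python
def drop_fraction(clips, key, fraction, drop_highest=False):
    """Drop the lowest (or highest) `fraction` of clips ranked by `key`."""
    ranked = sorted(clips, key=key, reverse=drop_highest)
    n_drop = int(len(ranked) * fraction)
    return ranked[n_drop:]


# Toy annotations for eight clips (made-up numbers).
clips = [
    {"id": i, "motion": i, "aesthetic": 7 - i, "text_area": (i * 3) % 8}
    for i in range(8)
]

kept = drop_fraction(clips, key=lambda c: c["motion"], fraction=0.25)
kept = drop_fraction(kept, key=lambda c: c["aesthetic"], fraction=0.25)
kept = drop_fraction(kept, key=lambda c: c["text_area"], fraction=0.25, drop_highest=True)
kept_ids = sorted(c["id"] for c in kept)
```

Applied in sequence, each filter removes a quarter of what the previous one left, which is one way a set of 25 % cuts can compound into a much smaller final dataset.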
