Scalable Diffusion Models with Transformers

William Peebles, Saining Xie

Diffusion Transformers (DiTs) replace the standard U-Net backbone with a transformer to enable scalable image generation.

Can we replace the standard U-Net backbone in diffusion models with a Transformer to enable better scaling and performance?

Diffusion models have relied on convolutional U-Net architectures as their de-facto backbone, limiting their ability to benefit from the scaling trends seen in other domains. The authors introduce Diffusion Transformers (DiTs), which replace the U-Net with a standard Vision Transformer (ViT) architecture that operates on latent patches. By scaling model depth, width, and token count, DiTs achieve state-of-the-art FID scores on ImageNet benchmarks while remaining more compute-efficient than prior U-Net-based models.

Paper Primer

The core move is to treat the diffusion process as a sequence-modeling task by "patchifying" latent representations into tokens. The authors then apply a transformer backbone with an adaptive layer normalization (adaLN-Zero) mechanism, which initializes blocks as identity functions to stabilize training at scale.

DiTs demonstrate a strong positive correlation between model compute (Gflops) and sample quality (FID).

Scaling model size (S to XL) and decreasing patch size (8 to 2) consistently lowers FID across all training stages. The DiT-XL/2 model achieves a state-of-the-art FID of 2.27 on ImageNet 256×256, outperforming the previous best LDM result of 3.60.

DiT architectures are more compute-efficient than U-Net backbones.

DiT-XL/2 (118.6 Gflops) outperforms U-Net models such as ADM (1120 Gflops) and LDM-4 (103.6 Gflops): it needs roughly a tenth of the compute of the pixel-space ADM and is comparable in cost to the latent-space LDM-4, while achieving better FID than both. At 512×512 resolution, DiT-XL/2 achieves an FID of 3.04, improving upon the previous best of 3.85 while using roughly 1/4 the Gflops of ADM-U.

Why replace the U-Net with a transformer if U-Nets are already successful?

The authors aim to demystify architectural choices and unify diffusion models with the broader transformer ecosystem. Transformers offer proven scaling properties, robustness, and efficiency that allow diffusion models to benefit from standard training recipes used in other domains.

Does increasing sampling steps compensate for a smaller, less compute-intensive model?

No. The authors find that scaling up the model's internal compute (Gflops) is the critical ingredient for quality; increasing sampling steps cannot compensate for a lack of model capacity.

Introduction and Motivation

Replacing the U‑Net backbone with a transformer unlocks better scaling for diffusion models.

Diffusion models have set the benchmark for image generation, yet nearly all of them rely on a convolutional U-Net backbone, which has limited how well they scale. Replacing this backbone with a transformer architecture promises better scalability and performance.

A U‑Net is a hierarchical encoder‑decoder that first compresses an image with convolutions, then reconstructs it while preserving spatial detail via skip connections.

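As a rough illustration of attention cost, consider $N=64$ tokens (an $8\times8$ grid) and assume one float32 $N \times N$ attention matrix per layer, for a single head and a single sample: $N^2 = 4{,}096$ entries require $16\text{ KB}$ per layer. With $L=12$ layers, the total attention memory is $12 \times 16\text{ KB} = 192\text{ KB}$.

If we double the token count to $N=128$, the matrix grows to $N^2 = 16{,}384$ entries, requiring $64\text{ KB}$ per layer and $768\text{ KB}$ overall.

Under the same assumptions, $N=512$ tokens give $N^2 = 262{,}144$ entries, or $1\text{ MB}$ per layer and $12\text{ MB}$ overall; the total then grows further in proportion to the number of attention heads and the batch size.

The memory cost of full attention scales quadratically with the number of tokens, so token count is the main driver of cost when a transformer backbone is applied at higher resolutions.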
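
The arithmetic above can be reproduced with a short sketch. It assumes one float32 $N \times N$ attention matrix per layer, for a single head and a single sample; the function name and defaults are illustrative and do not come from the paper.

```python
def attention_matrix_bytes(num_tokens: int, num_layers: int = 12,
                           bytes_per_entry: int = 4) -> tuple[int, int]:
    """Return (bytes per layer, bytes over all layers).

    Assumption: one N x N attention matrix per layer, stored as float32,
    for a single head and a single sample.
    """
    per_layer = num_tokens * num_tokens * bytes_per_entry
    return per_layer, per_layer * num_layers


for n in (64, 128):
    per_layer, total = attention_matrix_bytes(n)
    print(f"N={n}: {per_layer // 1024} KB per layer, {total // 1024} KB over 12 layers")
```

Doubling `num_tokens` multiplies both figures by four, which is the quadratic growth described above.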

**Figure 1.** Diffusion models with transformer backbones achieve state-of-the-art image quality. We show selected samples from two of our class-conditional DiT-XL/2 models trained on ImageNet at $512 \times 512$ and $256 \times 256$ resolution, respectively.

Switching from a U‑Net to a Transformer backbone unlocks better scaling and state‑of‑the‑art image quality in diffusion models.

Related Work

We survey transformer use in generative modeling and discuss complexity metrics.

Transformers have replaced domain‑specific architectures across language, vision, reinforcement learning, and meta‑learning, showing strong scaling with model size, compute, and data.

They serve as generic autoregressive models and Vision Transformers, and have been trained to predict pixels directly.

Transformers have also been applied to discrete codebooks, both as autoregressive and masked generative models, with the autoregressive variant scaling to 20 B parameters.

In diffusion models, transformers have been used to synthesize non‑spatial data such as CLIP image embeddings in DALL·E 2.

This work investigates how transformers scale when used as the backbone of image diffusion models.

Denoising diffusion probabilistic models (DDPMs) and score‑based generative models have become the leading approach for image synthesis, often surpassing GANs.

Recent DDPM improvements focus on sampling techniques—classifier‑free guidance, predicting noise instead of pixels, and cascaded pipelines that train low‑resolution base models alongside upsamplers.

Across these works, convolutional U‑Nets remain the default backbone, though concurrent attention‑based DDPM architectures have been proposed.

Evaluating model complexity by parameter count alone is unreliable because it ignores image resolution; the community therefore often uses theoretical GFLOPs as a proxy.

Nichol and Dhariwal’s seminal work analyzed the scalability and GFLOP properties of U‑Net‑based diffusion models; our study shifts the focus to the transformer class.

Diffusion Transformer Architecture

DiT replaces the U‑Net backbone with a token‑based transformer, enabling scalable diffusion modeling.

U‑Nets become FLOP‑heavy when image resolution grows, limiting diffusion scaling; the paper therefore swaps the convolutional backbone for a transformer that processes a sequence of latent patches.

DiT treats a latent image as a flat token sequence and runs a standard Vision‑Transformer stack over it, so the model inherits the transformer’s favorable scaling while still handling diffusion‑specific conditioning.

How does DiT differ from a vanilla Vision‑Transformer that processes image patches?

A vanilla ViT only sees the image tokens; DiT additionally injects diffusion-specific conditioning (timestep, class) via extra input tokens, cross-attention, or adaptive layer-norm parameters, so that the model can predict the noise to remove at a given timestep.

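As a small example, take an $8\times8\times4$ latent with patch size $p=2$ and hidden dimension $d=4$. Patchify produces $T=(8/2)^2=16$ tokens $t_1,\dots,t_{16}$, each a vector of dimension $4$.

Sine-cosine positional embeddings, one per token position (16 in total), are added to the tokens.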

The token sequence (shape $16\times4$) enters the first DiT block.

Self‑attention mixes information across all 16 tokens.

After $N=2$ blocks, a linear decoder reshapes each token back to a $2\times2\times4$ patch.

The 16 patches are re‑assembled into the original $8\times8\times4$ latent.

Patch size directly controls token count $T$; halving $p$ quadruples $T$ and thus at least quadruples total transformer Gflops, while the parameter count stays essentially unchanged.

Patchification flattens a 2‑D latent map into a 1‑D token stream, letting a transformer operate on spatial data without convolutions.
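
A minimal sketch of the patchify and un-patchify steps, written with plain nested lists so it runs without dependencies. It only covers the reshaping: the learned linear projection to hidden dimension $d$ and the positional embeddings are omitted, and the function names are illustrative rather than taken from the authors' code.

```python
import math


def patchify(latent, p):
    """Split an I x I x C latent (nested lists) into (I/p)^2 flattened patches.

    Each token holds the p*p*C raw values of one non-overlapping patch;
    a real model would then project each token to the hidden dimension.
    """
    size = len(latent)
    assert size % p == 0, "patch size must divide the latent side length"
    tokens = []
    for top in range(0, size, p):
        for left in range(0, size, p):
            tokens.append([value
                           for row in range(top, top + p)
                           for col in range(left, left + p)
                           for value in latent[row][col]])
    return tokens


def unpatchify(tokens, p, channels):
    """Inverse of patchify: reassemble flattened patches into an I x I x C latent."""
    per_side = math.isqrt(len(tokens))
    size = per_side * p
    latent = [[None] * size for _ in range(size)]
    for index, patch in enumerate(tokens):
        patch_row, patch_col = divmod(index, per_side)
        for offset in range(p * p):
            row, col = divmod(offset, p)
            start = offset * channels
            latent[patch_row * p + row][patch_col * p + col] = patch[start:start + channels]
    return latent


# Toy 8 x 8 x 4 latent whose values encode their own position.
latent = [[[r * 100 + c * 10 + ch for ch in range(4)] for c in range(8)] for r in range(8)]
for p in (4, 2, 1):
    print(f"p={p}: {len(patchify(latent, p))} tokens")
```

Running the loop shows the token count quadrupling each time `p` is halved, and `unpatchify(patchify(latent, 2), 2, 4)` recovers the original latent.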

Is patchify just a strided convolution?

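In effect, yes. Patchify extracts non-overlapping $p\times p$ blocks, flattens them, and applies a learned linear projection, which is mathematically equivalent to a convolution whose kernel size and stride both equal $p$. A strided convolution with a kernel larger than its stride would instead have overlapping receptive fields and produce a different representation.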

All variants keep the core ViT sub‑layer stack but differ in how they feed diffusion conditioning ($t$, $c$) into the block.

Why does adaLN‑Zero initialize the scaling $\alpha$ to zero?

Zero-initializing $\alpha$ means each residual branch contributes nothing at initialization, so the whole block starts as the identity function. This stabilizes early training and mirrors the successful "zero-init" trick used in ResNets and diffusion U-Nets.

In‑context conditioning: the sequence becomes $[t,c,e_1,\dots,e_4]$ (length 6).

Self-attention mixes all six tokens; each output token is a weighted sum of the value vectors of all six tokens, itself included.

With adaLN-Zero instead, $\gamma,\beta,\alpha$ are regressed from $t+c$; $\alpha$ is zero at initialization, so the residual branch adds nothing initially.

The residual connection adds the (still-zero) scaled branch output to the block input, leaving the input unchanged.

With in-context conditioning, the two conditioning tokens are removed after the final block, leaving the four image tokens for the decoder.

Because $\alpha$ starts at zero, the block behaves like a pure identity at the start of training, preventing destabilizing updates while still allowing the model to learn useful transformations later.
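
The identity-at-initialization property can be checked with a small sketch of one residual branch, $x + \alpha \odot f(\gamma \odot \mathrm{LN}(x) + \beta)$, applied to a single token vector. This is a simplified form under stated assumptions: `sublayer` is a hypothetical stand-in for attention or the MLP, and $\gamma,\beta,\alpha$ are passed in directly rather than regressed from the conditioning embedding.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln_zero_branch(x, gamma, beta, alpha, sublayer):
    """Compute x + alpha * sublayer(gamma * LN(x) + beta), elementwise."""
    modulated = [g * n + b for g, n, b in zip(gamma, layer_norm(x), beta)]
    update = sublayer(modulated)
    return [xi + a * u for xi, a, u in zip(x, alpha, update)]


def toy_sublayer(values):
    """Hypothetical stand-in for attention or the MLP (assumption of this sketch)."""
    return [math.tanh(v) for v in values]


x = [0.5, -1.0, 2.0, 0.0]
gamma, beta = [1.0] * 4, [0.0] * 4

at_init = adaln_zero_branch(x, gamma, beta, alpha=[0.0] * 4, sublayer=toy_sublayer)
after_training = adaln_zero_branch(x, gamma, beta, alpha=[0.1] * 4, sublayer=toy_sublayer)
print(at_init == x)         # the branch is gated off, so the block is the identity
print(after_training == x)  # a nonzero alpha lets the branch contribute
```

With `alpha` at zero the output equals the input regardless of what the sublayer computes, which is why the block cannot disturb the signal at the start of training.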

**Figure 3.** The Diffusion Transformer (DiT) architecture. Left: We train conditional latent DiT models. The input latent is decomposed into patches and processed by several DiT blocks. Right: Details of our DiT blocks. We experiment with variants of standard transformer blocks that incorporate conditioning via adaptive layer norm, cross-attention and extra input tokens. Adaptive layer norm works best.

**Figure 4.** Input specifications for DiT. Given patch size $p \times p$, a spatial representation (the noised latent from the VAE) of shape $I \times I \times C$ is "patchified" into a sequence of length $T = (I/p)^2$ with hidden dimension $d$. A smaller patch size $p$ results in a longer sequence length and thus more Gflops.

**Table 1.** Details of DiT models. We follow ViT [10] model configurations for the Small (S), Base (B) and Large (L) variants; we also introduce an XLarge (XL) config as our largest model.

Experimental Setup

We detail the training, diffusion pipeline, and evaluation protocol for DiT models.

DiT-XL/2, the most compute-intensive model, trains at roughly $5.7$ iterations per second on a TPU v3-256 pod with a global batch size of 256, using the same hyperparameters as the smaller models.

This throughput shows that even the largest DiT configuration can be trained efficiently on modern accelerator hardware, keeping scaling studies practical.

All DiT variants share a single, simple set of hyperparameters, making scaling comparisons fair and reproducible.

DiT models generate images by operating in a compressed latent space produced by a pretrained VAE, then decoding back to pixel space.
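
A short sketch of how image resolution and patch size translate into transformer sequence length. The VAE downsampling factor of 8 comes from the paper's setup (a $256\times256\times3$ image becomes a $32\times32\times4$ latent) and is not stated elsewhere in this summary; the function name is illustrative.

```python
def dit_token_count(image_size: int, patch_size: int, vae_downsample: int = 8) -> int:
    """Sequence length T = (I / p)^2, where I is the latent side length.

    Assumption: the pretrained VAE downsamples each spatial side by a factor of 8.
    """
    latent_size = image_size // vae_downsample
    assert latent_size % patch_size == 0, "patch size must divide the latent side length"
    return (latent_size // patch_size) ** 2


for image_size in (256, 512):
    counts = {p: dit_token_count(image_size, p) for p in (8, 4, 2)}
    print(image_size, counts)
```

At $256\times256$ the three patch sizes give 16, 64, and 256 tokens; at $512\times512$ each count is four times larger.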

Model quality is primarily measured with Fréchet Inception Distance, complemented by secondary scores to capture different aspects of generation.

Results and Ablations

We quantify how conditioning, compute, and scaling drive diffusion quality.

Diffusion models have traditionally used U‑Net backbones; replacing them with a Transformer backbone enables better scaling and higher quality.

DiT‑XL/2 attains a new state‑of‑the‑art FID of $2.27$ on ImageNet 256×256, surpassing the previous best of $3.60$.

Table 2 shows DiT‑XL/2 achieving $2.27$ FID, the lowest among all listed models.

adaLN-Zero initializes each DiT block as an identity mapping and regresses the layer-norm scale and shift, plus a zero-initialized residual scale, from the conditioning embedding; this avoids the costlier cross-attention path while preserving conditioning flexibility.

How does adaLN‑Zero differ from the standard adaLN conditioning used in earlier diffusion models?

Standard adaLN regresses the layer-norm scale $\gamma$ and shift $\beta$ from the conditioning embedding. adaLN-Zero additionally regresses a dimension-wise scale $\alpha$ that is applied just before each residual connection, and initializes it to zero so that every block starts as the identity function. The extra output adds negligible compute (the regression layer grows from $4\times$ to $6\times$ the hidden size) and yields more stable training dynamics.

Increasing the transformer's Gflops, whether by enlarging the model or by increasing the number of input tokens (shrinking the patch size), yields a predictable drop in FID; the relationship is roughly linear on a log-log plot.

Why does shrinking the patch size improve quality if the total number of parameters stays the same?

Smaller patches increase the token count, so the self-attention layers attend over a finer spatial grid. This raises the model's compute (more Gflops) while leaving the parameter count essentially unchanged, allowing the network to capture finer detail without adding parameters.

**Figure 2.** ImageNet generation with Diffusion Transformers (DiTs). Bubble area indicates the flops of the diffusion model. Left: FID-50K (lower is better) of our DiT models at 400K training iterations. Performance steadily improves in FID as model flops increase. Right: Our best model, DiT-XL/2, is compute-efficient and outperforms all prior U-Net-based diffusion models, like ADM and LDM.

**Figure 5.** Comparing different conditioning strategies. adaLN-Zero outperforms cross-attention and in-context conditioning at all stages of training.

**Figure 7.** Increasing transformer forward pass Gflops increases sample quality. Best viewed zoomed-in. We sample from all 12 of our DiT models after 400K training steps using the same input latent noise and class label. Increasing the Gflops in the model—either by increasing transformer depth/width or increasing the number of input tokens—yields significant improvements in visual fidelity.

**Figure 9.** Larger DiT models use large compute more efficiently. We plot FID as a function of total training compute.

**Table 2.** Benchmarking class-conditional image generation on ImageNet 256x256. DiT-XL/2 achieves state-of-the-art FID.

**Table 3.** Benchmarking class-conditional image generation on ImageNet 512x512. Note that prior work [9] measures Precision and Recall using 1000 real samples for 512 x 512 resolution; for consistency, we do the same.

Additional Scaling Analysis

Scaling the Diffusion Transformer improves multiple quality metrics and training efficiency.

We first examine how increasing the size of the Diffusion Transformer (DiT) affects a broad suite of evaluation metrics.

**Figure 12.** DiT scaling behavior on several generative modeling metrics. Left: We plot model performance as a function of total training compute for FID, sFID, Inception Score, Precision and Recall. Right: We plot model performance at 400K training steps for all 12 DiT variants against transformer Gflops, finding strong correlations across metrics. All values were computed using the ft-MSE VAE decoder.

Scaling also accelerates convergence of the training loss. Larger DiT models reach lower loss values earlier and saturate at a smaller final loss, mirroring the behavior reported for language‑model scaling.

Next we assess whether the choice of VAE decoder influences the scaling trends.

All three decoders yield comparable results, but the fine‑tuned ft‑EMA variant consistently attains the best scores.

The Gflop counts of the baseline U-Net models (Table 6) illustrate why the Transformer backbone offers a more favorable compute-to-performance trade-off than traditional U-Net diffusion models.

Qualitative Samples

A gallery of uncurated DiT‑XL/2 generations illustrating visual quality across diverse classes.

**Figure 16.** Uncurated $512 \times 512$ DiT-XL/2 samples. Classifier-free guidance scale = 4.0. Class label = "husky" (250)

**Figure 17.** Uncurated $512 \times 512$ DiT-XL/2 samples. Classifier-free guidance scale = 4.0. Class label = "sulphur-crested cockatoo" (89)

**Figure 18.** Uncurated $512 \times 512$ DiT-XL/2 samples. Classifier-free guidance scale = 4.0. Class label = "cliff drop-off" (972)

**Figure 19.** Uncurated $512 \times 512$ DiT-XL/2 samples. Classifier-free guidance scale = 4.0. Class label = "balloon" (417)

**Figure 20.** Uncurated $512 \times 512$ DiT-XL/2 samples. Classifier-free guidance scale = 4.0. Class label = "lion" (291)

**Figure 21.** Uncurated $512 \times 512$ DiT-XL/2 samples. Classifier-free guidance scale = 4.0. Class label = "otter" (360)

**Figure 22.** Uncurated $512 \times 512$ DiT-XL/2 samples. Classifier-free guidance scale = 2.0. Class label = "red panda" (387)

**Figure 23.** Uncurated $512 \times 512$ DiT-XL/2 samples. Classifier-free guidance scale = 2.0. Class label = "panda" (388)

**Figure 24.** Uncurated $512 \times 512$ DiT-XL/2 samples. Classifier-free guidance scale = 1.5. Class label = "coral reef" (973)

**Figure 25.** Uncurated $512 \times 512$ DiT-XL/2 samples. Classifier-free guidance scale = 1.5. Class label = "macaw" (88)

**Figure 26.** Uncurated $256 \times 256$ DiT-XL/2 samples. Classifier-free guidance scale = 4.0. Class label = "macaw" (88)

**Figure 27.** Uncurated $256 \times 256$ DiT-XL/2 samples. Classifier-free guidance scale = 4.0. Class label = "dog sled" (537)

**Figure 28.** Uncurated $256 \times 256$ DiT-XL/2 samples. Classifier-free guidance scale = 4.0. Class label = "arctic fox" (279)

**Figure 29.** Uncurated $256 \times 256$ DiT-XL/2 samples. Classifier-free guidance scale = 4.0. Class label = "loggerhead sea turtle" (33)

**Figure 30.** Uncurated $256 \times 256$ DiT-XL/2 samples. Classifier-free guidance scale = 2.0. Class label = "golden retriever" (207)

**Figure 31.** Uncurated $256 \times 256$ DiT-XL/2 samples. Classifier-free guidance scale = 2.0. Class label = "lake shore" (975)

**Figure 32.** Uncurated $256 \times 256$ DiT-XL/2 samples. Classifier-free guidance scale = 1.5. Class label = "space shuttle" (812)

**Figure 33.** Uncurated $256 \times 256$ DiT-XL/2 samples. Classifier-free guidance scale = 1.5. Class label = "ice cream" (928)

Implementation Details

Implementation specifics: model embeddings, adaLN‑Zero details, guidance scaling, and sample generation.

Section A provides the low‑level implementation choices for the Diffusion Transformer (DiT) models used throughout the paper.

Timesteps are embedded with a $256$-dimensional sinusoidal frequency vector [9] followed by a two-layer MLP whose dimensionality matches the transformer's hidden size, with SiLU activations.
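
A minimal sketch of a 256-dimensional sinusoidal timestep embedding. The geometric frequency schedule with `max_period = 10000` and the cosine-then-sine ordering are common conventions assumed here, not details given in this summary; the two-layer MLP that follows is omitted.

```python
import math


def timestep_embedding(t: float, dim: int = 256, max_period: float = 10000.0) -> list[float]:
    """Return a sinusoidal embedding of timestep t with `dim` entries.

    Assumptions: frequencies decay geometrically from 1 to roughly 1/max_period,
    and the cosine half is placed before the sine half.
    """
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]


embedding = timestep_embedding(500)
print(len(embedding))
```

Nearby timesteps get similar vectors at low frequencies and distinct ones at high frequencies, which gives the MLP a smooth but discriminative input.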

Each adaLN layer takes the sum of the timestep and class embeddings, passes it through a SiLU nonlinearity, then applies a linear projection whose output dimension is either $4\times$ (adaLN) or $6\times$ (adaLN-Zero) the hidden size.

The core transformer blocks employ GELU nonlinearities, approximated with tanh for computational efficiency.

Classifier-free guidance is applied only to the first three latent channels instead of all four; empirically, three-channel guidance with scale $(1\!+\!x)$ is reasonably well approximated by four-channel guidance with scale $(1\!+\!\tfrac{3}{4}x)$ (e.g., $1.5$ vs. $1.375$ yields comparable FID-50K).
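
A sketch of channel-restricted guidance and of the scale conversion quoted above. It uses the standard classifier-free guidance combination, $\epsilon_{\text{uncond}} + s\,(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$, on a single four-channel value; leaving the unguided channel at its conditional prediction is an assumption of this sketch, and the function names are illustrative.

```python
def guide_channels(cond, uncond, scale, guided_channels=3):
    """Apply classifier-free guidance to the first `guided_channels` entries only.

    Assumption: the remaining channels keep the conditional prediction unchanged.
    """
    guided = [u + scale * (c - u)
              for c, u in zip(cond[:guided_channels], uncond[:guided_channels])]
    return guided + list(cond[guided_channels:])


def equivalent_four_channel_scale(three_channel_scale: float) -> float:
    """Map a three-channel scale (1 + x) to the four-channel scale (1 + 3x/4)."""
    x = three_channel_scale - 1.0
    return 1.0 + 0.75 * x


cond = [1.0, 2.0, 3.0, 4.0]    # toy conditional noise prediction, one value per channel
uncond = [0.0, 0.0, 0.0, 0.0]  # toy unconditional noise prediction
print(guide_channels(cond, uncond, scale=1.5))
print(equivalent_four_channel_scale(1.5))
```

The conversion reproduces the pairing in the text: a three-channel scale of $1.5$ corresponds to a four-channel scale of $1.375$.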

**Figure 11.** Additional selected samples from our 512x512 and 256x256 resolution DiT-XL/2 models. We use a classifier-free guidance scale of 6.0 for the 512x512 model and 4.0 for the 256x256 model. Both models use the ft-EMA VAE decoder.

Section B reports additional model samples: the $512\times512$ DiT-XL/2 model was trained for $3\,$M steps and the $256\times256$ version for $7\,$M steps; they were sampled with classifier-free guidance scales of $6.0$ and $4.0$, respectively, using the ft-EMA VAE decoder.

**Figure 13.** Training loss curves for all DiT models. We plot the loss over training for all DiT models (the sum of the noise prediction mean-squared error and $\mathcal{D}_{KL}$). We also highlight early training behavior. Note that scaled-up DiT models exhibit lower training losses.

Table 4 enumerates all DiT model variants (sizes S, B, L, XL) and their hyperparameters; Table 6 reports GFLOP counts for baseline U‑Net diffusion models (ADM, LDM) for comparison.

**Figure 14.** Uncurated $512 \times 512$ DiT-XL/2 samples. Classifier-free guidance scale = 4.0. Class label = "arctic wolf" (270)
