High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer

Latent Diffusion Models (LDMs) perform image synthesis in a compressed latent space to reduce compute costs.

How can we perform high-resolution image synthesis using diffusion models without the prohibitive computational cost of operating directly in pixel space?

Diffusion models produce high-quality images but are computationally expensive because they operate directly on high-dimensional pixel data. The authors introduce Latent Diffusion Models (LDMs), which perform the diffusion process in a lower-dimensional latent space learned by a pretrained autoencoder. This separation of perceptual compression from generative modeling allows the model to focus on semantic structure rather than imperceptible pixel-level details. LDMs achieve state-of-the-art performance on image inpainting and class-conditional synthesis while significantly reducing training and inference costs compared to pixel-based diffusion models.

Paper Primer

The core move is to train a perceptual autoencoder once, then train the diffusion model to generate latent representations rather than raw pixels. By using a convolutional UNet backbone in this latent space, the model retains spatial inductive biases that purely transformer-based approaches lose when flattening data.

LDMs significantly reduce computational requirements for high-resolution synthesis.

Training and inference throughput improve by at least 2.7× over pixel-based diffusion models on inpainting tasks. LDMs achieve a state-of-the-art FID on CelebA-HQ (5.11) and competitive results on text-to-image synthesis with fewer parameters.

To enable flexible conditioning (e.g., text, bounding boxes), the authors augment the UNet with cross-attention layers. This allows the model to map external inputs into the intermediate layers of the diffusion process, turning the LDM into a general-purpose conditional generator.

Why is this approach more efficient than previous two-stage methods?

Previous methods often relied on autoregressive transformers that required aggressive, quality-reducing compression to manage sequence length. LDMs use a convolutional backbone that scales more gracefully to higher-dimensional latent spaces, allowing for less aggressive compression and better detail preservation.

What is the primary trade-off when choosing the latent space compression factor?

Too little compression (e.g., pixel-space) results in slow training progress, while too much compression causes information loss that limits the final image fidelity. The authors find that a downsampling factor of 4 to 8 provides the optimal balance between efficiency and reconstruction quality.
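To make the trade-off concrete, the sketch below counts the spatial positions a denoising network must process for several downsampling factors $f$. It assumes a square 256 × 256 input and ignores the latent channel count, which the text here does not specify; the helper name `latent_positions` is hypothetical.

```python
def latent_positions(image_size: int, f: int) -> int:
    """Number of spatial positions in the latent grid for downsampling factor f."""
    if image_size % f != 0:
        raise ValueError("image_size must be divisible by f")
    side = image_size // f
    return side * side


# f = 1 corresponds to pixel space; larger f means stronger compression.
for f in (1, 2, 4, 8, 16, 32):
    side = 256 // f
    n = latent_positions(256, f)
    print(f"f={f:>2}: {side}x{side} grid, {n} positions ({f * f}x fewer than pixel space)")
```

The spatial cost falls with $f^2$, which is why moderate factors already save most of the compute while leaving enough resolution for faithful reconstruction.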

Researchers can now train high-fidelity generative models on consumer-grade hardware by leveraging a fixed, pretrained autoencoder, shifting the focus from raw pixel modeling to semantic latent synthesis.

Abstract

Latent diffusion models cut compute dramatically while keeping image quality high.

Diffusion models excel at image synthesis but operate in pixel space, making training and inference costly.

We move the diffusion process into the latent space of a pretrained autoencoder, preserving perceptual detail while drastically lowering compute.

Adding cross‑attention layers lets the model condition on arbitrary inputs such as text or bounding boxes, enabling high‑resolution generation in a convolutional fashion.

Our latent diffusion models set new state‑of‑the‑art scores on image inpainting and class‑conditional synthesis, and remain competitive on text‑to‑image, unconditional generation, and super‑resolution, all with far less computation.

The Computational Bottleneck of Diffusion

We expose why pixel‑space diffusion is costly and motivate moving generation to a compact latent space.

High‑resolution image synthesis demands enormous compute: pixel‑space diffusion models must repeatedly process full‑resolution RGB tensors, training the most powerful variants can consume hundreds of GPU‑days, and sampling 50 k images can itself take days.

Diffusion models that operate directly on full‑resolution images must allocate compute to every pixel, even to tiny variations that are invisible to human observers, so most of the budget is spent on modeling noise rather than semantic content.

Compute the number of pixels: $256\times256\times3 = 196{,}608$.

Square this count to obtain the attention matrix size: $196{,}608^2 \approx 3.86\times10^{10}$ entries.

At 4 bytes per float, memory needed ≈ $3.86\times10^{10}\times4\text{ B} \approx 154\text{ GB}$, far beyond a typical GPU’s capacity.

Dense attention memory grows quadratically with the number of pixel values, and therefore with the fourth power of the image side length, so pixel‑space diffusion quickly exceeds hardware limits for megapixel generation.
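The arithmetic above can be reproduced in a few lines. This is only a back-of-the-envelope estimate: it assumes every RGB value is treated as its own attention token and that the score matrix is stored densely in float32, as in the worked example.

```python
# Memory estimate for a dense self-attention score matrix over raw pixel values.
height, width, channels = 256, 256, 3
bytes_per_float = 4  # float32

num_values = height * width * channels   # one token per pixel value
attention_entries = num_values ** 2      # one score per token pair
attention_gb = attention_entries * bytes_per_float / 1e9

print(f"values: {num_values:,}")
print(f"attention entries: {attention_entries:.3e}")
print(f"memory: {attention_gb:.1f} GB")
```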

Consequently, a method that moves the diffusion process into a compact latent space—where the dimensionality scales far more gently with image size—can dramatically cut both training and inference costs while preserving perceptual quality.

The key trade‑off is achieving high‑resolution synthesis without the prohibitive compute that pixel‑space diffusion demands.

Generative Modeling Landscape

We position our approach among prior generative models and two‑stage pipelines.

This section surveys the main families of image synthesis models and highlights two‑stage pipelines that inspire our latent diffusion approach.

GANs generate high‑resolution images by training a generator‑discriminator pair in an adversarial game.

VAEs learn a latent distribution by maximizing a variational lower bound on the data likelihood.

Flow models construct an invertible mapping between data and a simple prior, enabling exact likelihood computation.

ARMs factorize the joint distribution into a product of conditionals, generating pixels sequentially.

Two‑stage methods first compress images into a lower‑dimensional latent space, then model that space with a generative prior.

DMs iteratively denoise a random tensor, learning a reverse diffusion process that yields high‑quality samples.

UNet architectures provide multi‑scale context for the denoising network in diffusion models.

The reweighted training objective emphasizes later diffusion steps, improving sample quality at the cost of a modest bias.

VQ‑VAEs learn a discrete latent codebook, enabling an autoregressive prior over compressed image tokens.

Follow‑up works extend VQ‑VAEs to jointly model image and text tokens, enabling cross‑modal generation.

Invertible architectures map between arbitrary latent spaces while preserving tractable Jacobians.

VQ‑GANs combine an adversarial loss with a perceptual objective to train a high‑capacity encoder‑decoder pair.

Latent Diffusion Models

We replace costly pixel‑space diffusion with a low‑dimensional latent diffusion pipeline.

Pixel‑space diffusion models still evaluate every pixel at every denoising step, making high‑resolution synthesis prohibitively expensive.

We first learn a compact latent representation whose decoded reconstruction looks like the original image to a human observer, so the later diffusion only has to model the “important” visual content.

Encoder $E$ averages each non‑overlapping $2\times2$ patch, yielding four latent values $z_{11},z_{12},z_{21},z_{22}$.

KL‑reg computes $\text{KL}\bigl(\mathcal{N}(z,\mathbf{I})\;\|\;\mathcal{N}(0,\mathbf{I})\bigr)$, encouraging the four values to stay near zero mean.

Decoder $D$ upsamples each latent value back to a $2\times2$ patch by nearest‑neighbor replication, then applies a small convolution to refine colors.

The reconstructed image $\tilde{x}$ matches the original up to minor color shifts, demonstrating that $f=2$ preserves perceptual quality.

Even a modest factor $f=2$ cuts the pixel count from 16 to 4, a 4× reduction in computation for every diffusion step.
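The toy $f=2$ example can be sketched in NumPy as follows. It is an illustration only: the averaging encoder and replication decoder are stand-ins for the learned $E$ and $D$, the refining convolution is omitted, and the 4 × 4 input values are hypothetical.

```python
import numpy as np


def encode(x):
    """Average each non-overlapping 2x2 patch (toy stand-in for E with f=2)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def kl_to_standard_normal(z):
    """KL(N(z, I) || N(0, I)), which reduces to 0.5 * ||z||^2."""
    return 0.5 * float(np.sum(z ** 2))


def decode(z):
    """Nearest-neighbor replication of each latent value to a 2x2 patch (toy D)."""
    return np.repeat(np.repeat(z, 2, axis=0), 2, axis=1)


x = np.arange(16, dtype=np.float64).reshape(4, 4) / 16.0  # hypothetical 4x4 image
z = encode(x)          # four latent values
x_rec = decode(z)      # back to 4x4

print("latent shape:", z.shape)
print("reconstruction shape:", x_rec.shape)
print("KL penalty:", kl_to_standard_normal(z))
print("max reconstruction error:", np.abs(x - x_rec).max())
```

Because the KL term equals half the squared norm of $z$, minimizing it pulls the four latent values toward zero, which matches the description above.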

Why not use a plain VAE without the adversarial patch loss?

Pure pixel‑wise losses (L2/L1) tend to produce blurry reconstructions because they penalize high‑frequency errors uniformly. The adversarial patch loss forces local realism, preserving texture that the VAE’s KL term alone would smooth away.

Sample a batch of images $x$.

Encode to latents $z=E(x)$.

Decode to reconstructions $\tilde{x}=D(z)$.

Compute perceptual loss between $x$ and $\tilde{x}$ using a pretrained VGG feature extractor.

Compute the patch‑based adversarial loss with a discriminator that judges $2\times2$ patches of $\tilde{x}$.

Optionally add KL or VQ regularization on $z$.

Back‑propagate the weighted sum of all losses and update $E$ and $D$.

After compressing images into a low‑dimensional latent, we run the diffusion process directly on that latent, so each denoising step works on far fewer values.

Step 1 ($t=3$): UNet predicts noise $\hat\epsilon_3 = (0.48,\,-0.28,\,0.12,\,0.22)$; subtract to obtain $z_2 = z_3 - \hat\epsilon_3 = (0.02,\,-0.02,\,-0.02,\, -0.02)$.

Step 2 ($t=2$): Predict $\hat\epsilon_2 \approx (0.01,\,-0.01,\,0.00,\,-0.01)$; update $z_1 = z_2 - \hat\epsilon_2 \approx (0.01,\,-0.01,\,-0.02,\,-0.01)$.

Step 3 ($t=1$): Predict $\hat\epsilon_1 \approx (0.01,\,-0.01,\,0.00,\,-0.01)$; obtain clean latent $z_0 = z_1 - \hat\epsilon_1 \approx (0.00,\,0.00,\, -0.02,\,0.00)$.

Decode $z_0$ with $D$ to get a $4\times4$ image that closely resembles the original.

The three denoising steps operate on only four numbers instead of 16 pixels, illustrating the massive compute saving.
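The three steps can be replayed numerically. The starting latent $z_3$ is not stated in the example, so the value below is an assumption, chosen so that $z_3 - \hat\epsilon_3$ reproduces the stated $z_2$. Plain subtraction is the toy's simplification; practical samplers such as DDPM or DDIM also rescale the latent and the noise with schedule-dependent coefficients.

```python
import numpy as np

# Assumed starting latent z_3 (inferred from the stated z_2 and eps_hat_3).
z = np.array([0.50, -0.30, 0.10, 0.20])

# Noise predictions from the worked example, for t = 3, 2, 1.
predicted_noise = [
    np.array([0.48, -0.28, 0.12, 0.22]),
    np.array([0.01, -0.01, 0.00, -0.01]),
    np.array([0.01, -0.01, 0.00, -0.01]),
]

trajectory = [z]
for eps_hat in predicted_noise:
    z = z - eps_hat            # toy update: subtract the predicted noise
    trajectory.append(z)

z0 = trajectory[-1]
for t, state in zip((3, 2, 1, 0), trajectory):
    print(f"z_{t} =", np.round(state, 2))
```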

Why not run diffusion directly on the VQ‑quantized codes instead of the continuous latent $z$?

Quantized codes are discrete, which forces the diffusion model to treat them as categorical variables and requires a much larger model to capture the combinatorial space. Continuous latents keep the diffusion dynamics smooth and allow the same Gaussian noise schedule used in pixel‑space diffusion.

We inject external information (e.g., text) into the diffusion UNet by letting the latent query attend to a learned representation of the condition.

Project queries: $Q = W_{Q}\phi_{i}(z_t)$ yields a $2\times4$ matrix.

Project keys and values: $K = W_{K}\tau_{\theta}(y)$ and $V = W_{V}\tau_{\theta}(y)$ each yield a $2\times4$ matrix.

Compute scaled similarity $QK^{\top}/\sqrt{d}$ (here $d=4$) → a $2\times2$ score matrix.

Apply softmax row‑wise to obtain attention weights, e.g. $\begin{bmatrix}0.7&0.3\\0.4&0.6\end{bmatrix}$.

Multiply weights by $V$ to get attended condition vectors for each spatial position.

Even with only two tokens, the attention distributes semantic information across the latent map, showing how a short prompt can influence every patch of the generated image.
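A minimal NumPy sketch of this cross-attention step is shown below. The shapes follow the toy example (two spatial positions, two condition tokens, four features), but the feature values and projection matrices are random placeholders, and the code uses the row-vector convention `x @ W` rather than $W\,x$. Separate key and value projections and the $1/\sqrt{d}$ scaling are assumed here.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(latent_feats, cond_feats, W_q, W_k, W_v):
    """Queries come from the latent, keys and values from the condition."""
    Q = latent_feats @ W_q
    K = cond_feats @ W_k
    V = cond_feats @ W_v
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # one row per spatial position
    return weights @ V, weights


rng = np.random.default_rng(0)
latent_feats = rng.normal(size=(2, 4))  # stands in for phi_i(z_t)
cond_feats = rng.normal(size=(2, 4))    # stands in for tau_theta(y)
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))

attended, weights = cross_attention(latent_feats, cond_feats, W_q, W_k, W_v)
print("attention weights:\n", np.round(weights, 3))
print("attended output shape:", attended.shape)
```

Each row of `weights` sums to one, so every spatial position receives its own convex combination of the condition tokens.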

How does this cross‑attention differ from the simple concatenation of $y$ to the latent?

Concatenation forces the UNet to treat the condition as additional channels, mixing it uniformly across all locations. Cross‑attention lets each spatial query select which parts of $y$ are relevant, yielding a more expressive and location‑specific conditioning.

**Figure 1.** Boosting the upper bound on achievable quality with less aggressive downsampling. Since diffusion models offer excellent inductive biases for spatial data, we do not need the heavy spatial downsampling of related generative models in latent space, but can still greatly reduce the dimensionality of the data via suitable autoencoding models, see Sec. 3. Images are from the DIV2K [1] validation set, evaluated at $512^2$ px. We denote the spatial downsampling factor by $f$. Reconstruction FIDs [29] and PSNR are calculated on ImageNet-val. [12]; see also Tab. 8.

**Figure 2.** Illustrating perceptual and semantic compression: Most bits of a digital image correspond to imperceptible details. While DMs allow to suppress this semantically meaningless information by minimizing the responsible loss term, gradients (during training) and the neural network backbone (training and inference) still need to be evaluated on all pixels, leading to superfluous computations and unnecessarily expensive optimization and inference. We propose latent diffusion models (LDMs) as an effective generative model and a separate mild compression stage that only eliminates imperceptible details. Data and images from [30].

**Figure 3.** We condition LDMs either via concatenation or by a more general cross-attention mechanism. See Sec. 3.3

Empirical Evaluation

Key quantitative gains of latent diffusion over pixel‑space diffusion.

LDM‑8 attains a 38‑point lower FID than the pixel‑based diffusion baseline (LDM‑1) after 2 M training steps.

Figure 6 shows the training curves; the gap between the LDM‑1 and LDM‑8 curves is 38 FID points at the 2 M‑step mark.

**Figure 4.** Samples from LDMs trained on CelebAHQ [39], FFHQ [41], LSUN-Churches [102], LSUN-Bedrooms [102] and class-conditional ImageNet [12], each with a resolution of 256 × 256. Best viewed when zoomed in. For more samples cf. the supplement.

**Figure 5.** Samples for user-defined text prompts from our model for text-to-image synthesis, LDM-8 (KL), which was trained on the LAION [78] database. Samples generated with 200 DDIM steps and $\eta = 1.0$. We use unconditional guidance [32] with $s = 10.0$.

**Figure 6.** Analyzing the training of class-conditional LDMs with different downsampling factors $f$ over 2M train steps on the ImageNet dataset. Pixel-based LDM-1 requires substantially larger train times compared to models with larger downsampling factors (LDM-{4-16}). Too much perceptual compression as in LDM-32 limits the overall sample quality. All models are trained on a single NVIDIA A100 with the same computational budget. Results obtained with 100 DDIM steps [84] and $\kappa = 0$.

**Figure 7.** Comparing LDMs with varying compression on the CelebA-HQ (left) and ImageNet (right) datasets. Different markers indicate {10, 20, 50, 100, 200} sampling steps using DDIM, from right to left along each line. The dashed line shows the FID scores for 200 steps, indicating the strong performance of LDM-{4-8}. FID scores assessed on 5000 samples. All models were trained for 500k (CelebA) / 2M (ImageNet) steps on an A100.

Super-Resolution and Conditioning

Latent Diffusion delivers faster training and higher‑quality super‑resolution and inpainting across benchmarks.

Latent Diffusion models train over $2.7\times$ faster than pixel‑based diffusion while delivering $1.6\times$ better FID scores.

Table 6 reports throughput of $0.97$ vs $0.26$ samples/sec (≈$3.7\times$) and FID values roughly $1.6\times$ lower for LDM‑4 compared to LDM‑1.

Conditioning the diffusion model on a low‑resolution image by concatenating it to the latent input lets the model focus on generating missing high‑frequency details.

How does this differ from naïve bicubic upsampling?

Bicubic upsampling merely interpolates pixel values, whereas Super‑Resolution Conditioning supplies a learned latent prior that guides the generation of realistic high‑frequency textures.
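Conditioning by concatenation amounts to stacking the low-resolution image with the noisy latent along the channel axis before the UNet sees it. The sketch below shows only that stacking step; the shapes are hypothetical and assume the conditioning image already matches the latent's spatial size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes in (channels, height, width) layout.
z_t = rng.standard_normal((3, 64, 64))   # noisy latent at step t
low_res = rng.random((3, 64, 64))        # low-resolution conditioning image

# The denoiser receives both, stacked along the channel axis.
unet_input = np.concatenate([z_t, low_res], axis=0)
print("UNet input shape:", unet_input.shape)
```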

**Table 2.** Evaluation of text-conditional image synthesis on the 256 × 256-sized MS-COCO dataset: with 250 DDIM steps our model is on par with the most recent diffusion and autoregressive methods despite using significantly fewer parameters. †/*: Numbers from [109]/[26]

**Figure 8.** Layout-to-image synthesis with an LDM on COCO [4], see Sec. 4.3.1. Quantitative evaluation in the supplement D.3.

**Figure 10.** ImageNet 64→256 super-resolution on ImageNet-Val. LDM-SR has advantages at rendering realistic textures but SR3 can synthesize more coherent fine structures. See appendix for additional samples and cropouts. SR3 results from [72].

**Table 4.** Task 1: Subjects were shown ground truth and generated image and asked for preference. Task 2: Subjects had to decide between two generated images. More details in E.3.6

**Figure 11.** Qualitative results on object removal with our big, w/ ft inpainting model. For more results, see Fig. 22.

**Figure 21.** Qualitative results on image inpainting. In contrast to [88], our generative approach enables generation of multiple diverse samples for a given input.

Limitations and Impact

Latent diffusion cuts compute but still faces speed and fidelity limits.

Diffusion models are expensive because they operate on high‑dimensional pixels; moving the diffusion to a latent space via a pre‑trained autoencoder cuts compute while preserving perceptual quality. Even so, latent diffusion models still sample sequentially, making them slower than GANs, and their reconstruction step can become a bottleneck when fine‑grained pixel accuracy is required. Our super‑resolution variants inherit this limitation, as they rely on the same latent‑space pipeline.

Generative image models broaden creative possibilities and, by lowering training and inference costs, democratize access to powerful synthesis tools. Conversely, the reduced barrier also eases the production of manipulated media—deep‑fakes that disproportionately target women—and facilitates the spread of misinformation. These models can inadvertently expose training data, raising privacy concerns, and may amplify existing societal biases, especially given our two‑stage approach that combines adversarial training with a likelihood objective.

For a comprehensive treatment of the ethical issues surrounding deep generative models, see the survey in [13].

**Table 6.** Assessing inpainting efficiency. †: Deviations from Fig. 7 due to varying GPU settings/batch sizes cf. the supplement.

Conclusion

Conclusion, acknowledgments, and a key visual illustration.

We presented latent diffusion models that cut training and sampling costs dramatically while preserving image quality, and we showed that cross‑attention conditioning extends these gains across many conditional synthesis tasks.

This work was supported by the German Federal Ministry for Economic Affairs and Energy (project “KI‑Absicherung – Safe AI for automated driving”) and the German Research Foundation (DFG) project 421703927.

**Figure 13.** Combining classifier-free diffusion guidance with the convolutional sampling strategy from Sec. 4.3.2, our 1.45B parameter text-to-image model can be used for rendering images larger than the native $256^2$ resolution the model was trained on.


Diffusion Model Details

Hyperparameter details and equations governing the diffusion process and its training loss.

The diffusion schedule is expressed through a per‑step signal‑to‑noise ratio, which controls how much of the original signal survives relative to added noise.
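As an illustration, the per-step signal-to-noise ratio of a variance-preserving process with a linear schedule can be computed as below. The schedule endpoints are the common DDPM defaults and are an assumption, not values taken from this text; `alphas_cumprod` plays the role of the squared signal coefficient and `1 - alphas_cumprod` the noise variance.

```python
import numpy as np

T = 1000
# Assumed linear beta schedule (DDPM-style defaults, not from this document).
betas = np.linspace(1e-4, 2e-2, T)

alphas_cumprod = np.cumprod(1.0 - betas)         # fraction of signal variance kept
snr = alphas_cumprod / (1.0 - alphas_cumprod)    # signal variance / noise variance

for t in (0, 249, 499, 999):
    print(f"t={t + 1:>4}: SNR = {snr[t]:.4g}")
```

The ratio decreases monotonically with $t$: early steps are almost pure signal, late steps almost pure noise.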

Guidance can be injected at sampling time by modifying the predicted noise with a classifier gradient, allowing unconditional models to be steered toward a target condition.

Choosing a high signal‑to‑noise ratio in the latent space allocates too much semantic detail early in the reverse process, which harms convolutional sampling; rescaling the latent by its component‑wise standard deviation lowers the SNR and improves synthesis quality.

The appendix contains a complete table of all first‑stage auto‑encoding models trained on OpenImages (see Table 8).

Layout‑to‑image synthesis trained on COCO and fine‑tuned from OpenImages achieves state‑of‑the‑art performance, improving FID by roughly 11 points over prior work.

Class‑conditional ImageNet LDMs (e.g., LDM‑8) use far fewer parameters yet reach competitive FID and Inception Score; training a per‑noise‑scale classifier for guidance yields further gains.

Implementation and Metrics

Implementation details cover model hyperparameters, training schedules, and evaluation protocols.

This section records the concrete settings used to train and evaluate the Latent Diffusion Models. It lists model sizes, diffusion schedules, and compute budgets, then points to the tables that detail each configuration.

$\tau_{\theta}$ is an unmasked transformer that turns a tokenized conditioning input (e.g., text or layout) into a sequence of embeddings $\zeta$, which the UNet consumes via cross‑attention.

Table 9 compares six layout‑to‑image methods on COCO and OpenImages. The authors’ LDM‑4 model attains the lowest FID in all three columns (40.91, 32.02, 35.80), outperforming prior GAN‑based approaches.

Table 11 reports 4× upscaling on ImageNet‑Val. The LDM‑4 variant with 100 steps plus 15 extra epochs achieves the best FID (2.6/4.6) and matches the highest PSNR (27.9) and SSIM (0.801) among all methods.

**Figure 18.** LDM-BSR generalizes to arbitrary inputs and can be used as a general-purpose upsampler, upscaling samples from a class-conditional LDM (image cf. Fig. 4) to $1024^2$ resolution. In contrast, using a fixed degradation process (see Sec. 4.4) hinders generalization.

Table 12 lists the hyperparameters for unconditional LDMs (CelebA‑HQ, FFHQ, LSUN‑Churches, LSUN‑Bedrooms). All models share a linear noise schedule, 1000 diffusion steps, and a batch size of 48 (except LSUN‑Churches, which uses 96). The total parameter count is around 274 M.

Table 13 details the conditional ImageNet LDMs (LDM‑1 to LDM‑32). Model size stays nearly constant (396 M for LDM‑1, 395 M for LDM‑32), while batch sizes increase (7 → 112) and learning rates are tuned per scale. All share the same 1000‑step diffusion schedule and 512‑dimensional conditioning embeddings.

Table 14 provides the CelebA unconditional LDM hyperparameters. All variants use 1000 diffusion steps, a linear schedule, and depth 2. The channel multiplier and attention resolutions vary to balance compute and quality across model sizes.

Table 15 aggregates hyperparameters for all conditional tasks (text‑to‑image, layout‑to‑image, super‑resolution, etc.). Notable differences include the number of heads (1–8), dropout rates, and whether cross‑attention or concatenation is used for conditioning.

Table 16 describes the transformer block that replaces the self‑attention layer in the UNet. It shows the sequence of operations (LayerNorm → Conv1×1 → Reshape → Self‑Attention → MLP → Cross‑Attention) and the resulting tensor shapes in terms of height h, width w, channels c, head dimension d, and head count $n_h$.

Section D.5 evaluates sample quality versus V100‑day compute. Using 100 DDIM steps, the authors report FID trends that mirror those observed over training steps, confirming that resource‑based curves are consistent with step‑based curves.

Section D.6 introduces LDM‑BSR, a general‑purpose super‑resolution model that replaces the fixed bicubic degradation with a diverse pipeline (JPEG noise, sensor noise, random blur, etc.). This yields sharper upscaled images on LSUN‑cows and improves real‑world applicability.

Section E.3 outlines the evaluation pipeline. FID, Precision, and Recall are computed on 50 k generated samples; IS and PSNR are also reported where appropriate. All metrics use the torch‑fidelity implementation for consistency.

Sections E.3.1–E.3.6 detail task‑specific protocols: unconditional synthesis (Tab 1, 10), text‑to‑image (Tab 2), layout‑to‑image (Tab 9), super‑resolution (Tab 5, 11), efficiency analysis (Fig. 6, 17, 7), and a 2‑alternative forced‑choice user study (Tab 4).

Computational Requirements

We compare compute and throughput of our Latent Diffusion models against prior generative methods.

Table 18 aggregates training‑compute (in V100‑days), inference throughput (samples / sec), parameter count, and FID scores for a range of generative models, including our Latent Diffusion variants and baselines such as StyleGAN2 and ADM.

To compare fairly we convert our A100‑day measurements to V100‑days by assuming a 2.2× speedup of the A100 over the V100, as reported in prior hardware studies.
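The conversion is a single multiplication. In the sketch below the 2.2× factor is the assumption stated above, and the 10 A100-days input is a hypothetical figure used only for illustration.

```python
A100_TO_V100_SPEEDUP = 2.2  # assumed A100-over-V100 speedup, as stated in the text


def a100_days_to_v100_days(a100_days: float) -> float:
    """Express a compute budget measured on A100s in equivalent V100-days."""
    return a100_days * A100_TO_V100_SPEEDUP


# Hypothetical budget of 10 A100-days.
print(f"{a100_days_to_v100_days(10.0):.1f} V100-days")
```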

**Table 18.** Comparing compute requirements during training and inference throughput with state-of-the-art generative models. Compute during training in V100-days, numbers of competing methods taken from [15] unless stated differently;*: Throughput measured in samples/sec on a single NVIDIA A100;†: Numbers taken from [15];‡: Assumed to be trained on 25M train examples;††: R-FID vs. ImageNet validation set

Autoencoder Architecture

Details of the autoencoder architecture, training objectives, and latent regularization strategies.

We train each autoencoder adversarially: a patch‑based discriminator $D_{\psi}$ learns to distinguish real images from reconstructions $D(E(x))$, while the encoder–decoder pair $(E,D)$ tries to fool it.

To prevent arbitrary scaling of the latent space, we add a regularizing loss $L_{\text{reg}}$ that forces $z$ to be zero‑centered with small variance.

We explore two regularization schemes: (i) a KL term (weight ≈ 10⁻⁶) between the encoder posterior $q_{E}(z|x)$ and a standard normal, and (ii) a vector‑quantization layer that learns a codebook of $|Z|$ exemplars; the high dimensionality of the codebook likewise keeps the regularization effect weak.

The complete training objective combines reconstruction, adversarial, log‑discriminator, and regularization terms (see equation (25)).
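Written out, an objective of this shape would read as below. This is a reconstruction from the four terms named above, using the symbols already introduced ($E$, $D$, $D_{\psi}$, $L_{\text{reg}}$); the signs and argument lists are assumptions and should be checked against equation (25) in the paper.

$$
L_{\text{Autoencoder}} = \min_{E,D}\,\max_{\psi}\Big( L_{\text{rec}}\big(x, D(E(x))\big) - L_{\text{adv}}\big(D(E(x))\big) + \log D_{\psi}(x) + L_{\text{reg}}(x; E, D) \Big)
$$

The encoder and decoder minimize reconstruction and regularization costs while the discriminator parameters $\psi$ are trained in the opposite direction, which is what makes the training adversarial.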

When training diffusion models in the learned latent space we distinguish two cases. For KL‑regularized latents we sample $z = E_{\mu}(x) + E_{\sigma}(x)\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, estimate the component‑wise variance $\hat{\sigma}^{2}$ from the first batch, and rescale $z$ by $1/\hat{\sigma}$ so that it has unit standard deviation. For VQ‑regularized latents we take the encoder output $z=E(x)$ before the quantization layer and absorb the quantization step into the decoder.
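The KL-regularized case can be sketched in NumPy as follows. The encoder outputs are random placeholders, and the scale is estimated as one standard deviation pooled over batch, channel, height and width; that pooling choice is an assumption about how the variance estimate is taken.

```python
import numpy as np


def sample_kl_latent(mu, sigma, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps


def estimate_scale(z_first_batch):
    """Standard deviation pooled over all entries of the first batch."""
    return float(z_first_batch.std())


rng = np.random.default_rng(0)
mu = 5.0 * rng.standard_normal((8, 4, 32, 32))  # hypothetical encoder means (b, c, h, w)
sigma = np.full_like(mu, 0.1)                   # hypothetical encoder standard deviations

z = sample_kl_latent(mu, sigma, rng)
sigma_hat = estimate_scale(z)   # estimated once, from the first batch
z_scaled = z / sigma_hat        # rescaled latent fed to the diffusion model

print("std before rescaling:", round(float(z.std()), 3))
print("std after rescaling: ", round(float(z_scaled.std()), 3))
```

In practice the scale would be computed once and then reused for every later batch, and samples would be multiplied by it again before decoding.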
