Denoising Diffusion Probabilistic Models

Jonathan Ho, Ajay Jain, Pieter Abbeel

Diffusion models generate high-quality images by learning to reverse a fixed process that gradually adds noise to data.

How can we train a diffusion-based generative model to produce high-quality images by simplifying the training objective to a denoising task?

Generative models like GANs and VAEs produce high-quality samples, but often struggle with training stability or architectural complexity. Diffusion models address this by defining a Markov chain that gradually destroys data with Gaussian noise, then learning a neural network to reverse this process step-by-step. This approach achieves state-of-the-art image synthesis quality, producing samples that outperform many existing generative architectures on standard benchmarks.

Paper Primer

The core mechanism is a reverse Markov chain that learns to denoise data. By parameterizing the model to predict the noise added at each step, the training objective simplifies to a weighted mean-squared error that is mathematically equivalent to denoising score matching.

Diffusion models achieve state-of-the-art sample quality on unconditional image generation.

On the CIFAR-10 dataset, the model achieves an Inception score of 9.46 and an FID score of 3.17. The FID score of 3.17 represents a significant improvement over many established generative models, including class-conditional architectures.

While the models excel at synthesis, they are less competitive in lossless log-likelihood compared to other likelihood-based models. The authors demonstrate that this is because the majority of the model's capacity is spent describing imperceptible image details, effectively acting as a progressive lossy compressor.

Why use a fixed diffusion process instead of learning the forward noise schedule?

Fixing the forward process to a predetermined, non-learned variance schedule simplifies the training objective and removes the need for learnable parameters in the forward chain, allowing the model to focus entirely on learning the reverse denoising transitions.

How does this approach compare to existing score-based generative models?

The authors establish an explicit equivalence between their diffusion model parameterization and denoising score matching over multiple noise levels, effectively training a Langevin-like sampler using variational inference.

Researchers can now treat diffusion as a robust, high-quality alternative to GANs for image synthesis, with the added benefit of a clear, progressive decoding interpretation for lossy compression.

Introduction

We introduce diffusion probabilistic models that achieve high‑quality image synthesis by directly predicting noise.

Deep generative models such as GANs, autoregressive networks, normalizing flows, and VAEs have recently produced striking samples across image and audio domains. Diffusion probabilistic models, by contrast, had not yet been shown to produce samples of comparable quality.

We show that diffusion models can achieve high‑quality image synthesis by training a neural network to predict the added noise at each timestep, turning the reverse diffusion into a simple denoising task.

A diffusion model defines a forward noising chain that gradually corrupts data, and a learned reverse chain that denoises step by step; training asks the network to predict the noise component, which is easier than predicting the clean data directly.

Consider a toy $4\times4$ RGB image ($4\times4\times3=48$ scalars) diffused for 5 steps. Step 1 adds Gaussian noise to the clean image, producing 48 noisy values.

Step 2 adds additional noise, again yielding 48 values (the previous noise is retained in the chain).

Repeating for steps 3–5 results in $5\times48=240$ stored scalars total.

The reverse network predicts the noise $\epsilon$ at each step, allowing reconstruction of the original image after 5 denoising passes.

Even with a tiny $4\times4$ toy example, the diffusion chain requires only a few hundred scalars, illustrating that the memory burden scales linearly with image size and number of steps.

**Figure 1.** Generated samples on CelebA-HQ 256 × 256 (left) and unconditional CIFAR10 (right)

Diffusion probabilistic models enable high‑quality image synthesis by directly predicting noise, offering a scalable alternative to GANs and autoregressive decoders.

Diffusion Model Foundations

Defines the reverse generative chain and the fixed forward noise schedule.

Directly modelling $p(x_0)$ is intractable, so diffusion models replace it with a tractable Markov chain that walks from pure noise back to data.

Generation is a stepwise denoising walk: each step predicts how to undo a tiny slice of noise, much like rewinding a video frame‑by‑frame.

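For a chain with $T=3$, generation starts from $x_3\sim\mathcal N(0,I)$.

Step 3→2: sample $x_2\sim\mathcal N\!\bigl(\mu_\theta(x_3,3),\;\sigma_3^2 I\bigr)$.

Step 2→1: sample $x_1\sim\mathcal N\!\bigl(\mu_\theta(x_2,2),\;\sigma_2^2 I\bigr)$.

Step 1→0: sample $x_0\sim\mathcal N\!\bigl(\mu_\theta(x_1,1),\;\sigma_1^2 I\bigr)$, which is the generated image.

The reverse chain approximately inverts the forward noise additions: each Gaussian has a learned mean $\mu_\theta$ and a fixed variance $\sigma_t^2$ tied to the schedule (for example $\sigma_t^2=\beta_t$), and “undoes” one scheduled noise slice.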

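The forward diffusion process is a *fixed* Markov chain that adds Gaussian noise according to the schedule $\beta_t$: $q(x_t\mid x_{t-1})=\mathcal N\!\bigl(x_t;\sqrt{1-\beta_t}\,x_{t-1},\beta_t I\bigr)$. The factor $\sqrt{1-\beta_t}$ shrinks the signal so that the total variance stays bounded as noise accumulates.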
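For reference, composing these Gaussian steps gives a closed form for sampling $x_t$ directly from $x_0$, which is the form the worked examples in this summary rely on. The shorthand $\alpha_t$ and $\bar{\alpha}_t$ below is the usual DDPM notation and is stated here because the summary uses it without defining it:

$$
\alpha_t = 1-\beta_t,\qquad \bar{\alpha}_t=\prod_{s=1}^{t}\alpha_s,\qquad
q(x_t\mid x_0)=\mathcal N\!\bigl(x_t;\sqrt{\bar{\alpha}_t}\,x_0,\,(1-\bar{\alpha}_t)I\bigr)
$$

Equivalently, $x_t=\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\epsilon\sim\mathcal N(0,I)$.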

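Training minimises a variational upper bound on the negative log‑likelihood $-\log p_\theta(x_0)$ (equivalently, it maximises a lower bound on $\log p_\theta(x_0)$). After expanding the bound it becomes a sum of KL divergences between forward‑process posteriors and reverse Gaussians (Eq 5), which can be evaluated analytically.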
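Written out, the decomposition referred to as Eq 5 has the following shape; the labels $L_T$, $L_{t-1}$, and $L_0$ follow the paper's naming:

$$
\mathbb E_q\Bigl[\underbrace{D_{\mathrm{KL}}\bigl(q(x_T\mid x_0)\,\|\,p(x_T)\bigr)}_{L_T}
+\sum_{t>1}\underbrace{D_{\mathrm{KL}}\bigl(q(x_{t-1}\mid x_t,x_0)\,\|\,p_\theta(x_{t-1}\mid x_t)\bigr)}_{L_{t-1}}
\underbrace{-\log p_\theta(x_0\mid x_1)}_{L_0}\Bigr]
$$

Because the forward process is fixed, $L_T$ has no trainable parameters, and each $L_{t-1}$ compares two Gaussians, so it has a closed form.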

**Figure 2.** The directed graphical model considered in this work.

How does this reverse Gaussian chain differ from a standard denoising autoencoder?

In a denoising autoencoder the corruption distribution is fixed but the reconstruction map is deterministic; here each reverse step is a *probabilistic* Gaussian whose mean is predicted by a neural network, and the chain is explicitly trained to match the forward posterior at every timestep.

Theoretical Framework

We formalize forward diffusion and the reverse denoising process that powers sampling.

Diffusion models define a stochastic forward diffusion that gradually corrupts data, then learn a reverse denoising process that reconstructs the original sample.

The forward diffusion repeatedly adds isotropic Gaussian noise, turning a clean image into pure noise after many steps.

How does this forward diffusion differ from simply adding Gaussian blur to an image?

Blur mixes neighboring pixels spatially, whereas the forward diffusion adds independent noise to every pixel, destroying all structure gradually but uniformly across the image.

Take a scalar $x_{0}=1.0$ and a single step with $\beta_{1}=0.2$, so $\bar{\alpha}_{1}=0.8$.

Compute $\sqrt{\bar{\alpha}_{1}} = \sqrt{0.8}\approx 0.894$ and $\sqrt{1-\bar{\alpha}_{1}} = \sqrt{0.2}\approx 0.447$.

Sample $\epsilon\sim\mathcal{N}(0,1)$; suppose $\epsilon=0.5$.

Form $x_{1}=0.894\cdot 1.0 + 0.447\cdot 0.5 \approx 1.118$.

The forward step scales the original signal down and injects a modest amount of noise; repeating with the same $\beta$ quickly drives any signal toward a standard normal.
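The arithmetic above can be reproduced with a few lines of Python. This is a minimal scalar sketch, not the authors' code; the helper name `q_sample` and its `eps` override are illustrative assumptions, and the inputs ($x_0=1.0$, $\bar{\alpha}_1=0.8$, $\epsilon=0.5$) are the ones used in the example.

```python
import math
import random

def q_sample(x0, alpha_bar_t, eps=None, rng=random):
    """Draw x_t ~ q(x_t | x_0) for a scalar x0 via the closed form
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    if eps is None:
        eps = rng.gauss(0.0, 1.0)  # standard normal noise
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps

# Numbers from the worked example: x0 = 1.0, alpha_bar_1 = 0.8, eps = 0.5.
x1 = q_sample(1.0, 0.8, eps=0.5)
print(round(x1, 3))  # 1.118
```

Passing `eps` explicitly makes the example deterministic; leaving it out draws fresh noise on each call.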

Think of the reverse process as rewinding a noisy photograph frame by frame, each step subtracting just enough noise to make the picture clearer.

Why not train the network to predict the original image $x_{0}$ directly?

Predicting $x_{0}$ forces the model to invert the entire forward diffusion in one shot, which is a highly nonlinear mapping. Predicting $\tilde{\mu}_{t}$ only requires undoing the noise added at the current step, a much simpler, locally linear task that aligns with denoising score matching.

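Take a scalar $x_{0}=2.0$ and a constant $\beta_{t}=0.25$, so that $\alpha_{t}=0.75$, $\bar{\alpha}_{1}=0.75$, and $\bar{\alpha}_{2}=0.5625$.

Step 1: Sample $\epsilon_{1}=0.4$. Compute $x_{1}= \sqrt{0.75}\cdot2.0 + \sqrt{0.25}\cdot0.4 \approx 1.732 + 0.200 = 1.932$.

Step 2: Sample $\epsilon_{2}=-0.3$ and draw $x_{2}$ directly from $q(x_{2}\mid x_{0})$. Compute $x_{2}= \sqrt{0.5625}\cdot2.0 + \sqrt{0.4375}\cdot(-0.3) \approx 1.5 - 0.198 = 1.302$.

Posterior mean at $t=2$: $\tilde{\mu}_{2}= \frac{1}{\sqrt{\alpha_{2}}}\bigl(x_{2} - \frac{\beta_{2}}{\sqrt{1-\bar{\alpha}_{2}}}\,\epsilon_{2}\bigr) = \frac{1}{\sqrt{0.75}}\bigl(1.302 - \frac{0.25}{\sqrt{0.4375}}(-0.3)\bigr) \approx 1.634$.

Network target: $\mu_{\theta}(x_{2},2)$ should match $\tilde{\mu}_{2}\approx1.634$; the training loss is their squared distance, up to a constant $C$ that does not depend on $\theta$.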

The network never needs to reconstruct the full $x_{0}=2.0$ in one step; it only learns to pull the noisy latent toward the intermediate posterior mean, which is much easier.
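As a sanity check on the posterior-mean step, the sketch below evaluates $\tilde{\mu}_2$ in two equivalent ways: once from $(x_t,\epsilon)$ and once from $(x_0,x_t)$. It assumes the example's inputs ($x_0=2.0$, constant $\beta_t=0.25$, $\epsilon_2=-0.3$); the variable names are illustrative. With these inputs both forms evaluate to about 1.634.

```python
import math

beta = 0.25                      # per-step variance implied by the example
alpha = 1.0 - beta               # alpha_t = 0.75 for t = 1, 2
alpha_bar_1 = alpha              # 0.75
alpha_bar_2 = alpha * alpha      # 0.5625
x0, eps = 2.0, -0.3

# Closed-form forward sample at t = 2.
x2 = math.sqrt(alpha_bar_2) * x0 + math.sqrt(1 - alpha_bar_2) * eps

# Posterior mean written in terms of (x_t, eps).
mu_eps = (x2 - beta / math.sqrt(1 - alpha_bar_2) * eps) / math.sqrt(alpha)

# Same quantity written in terms of (x_0, x_t).
coef_x0 = math.sqrt(alpha_bar_1) * beta / (1 - alpha_bar_2)
coef_xt = math.sqrt(alpha) * (1 - alpha_bar_1) / (1 - alpha_bar_2)
mu_x0 = coef_x0 * x0 + coef_xt * x2

print(round(x2, 3), round(mu_eps, 3), round(mu_x0, 3))  # 1.302 1.634 1.634
```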

Training proceeds by sampling a random timestep, corrupting a clean sample, and minimizing the squared error between the network’s mean prediction and the true posterior mean.

Sample a clean datum $x_{0}\sim q(x_{0})$.

Draw a timestep $t\sim\text{Uniform}\{1,\dots,T\}$.

Sample noise $\epsilon\sim\mathcal{N}(0,I)$ and form the noisy latent $x_{t}= \sqrt{\bar{\alpha}_{t}}\,x_{0}+ \sqrt{1-\bar{\alpha}_{t}}\,\epsilon$.

Compute the posterior mean $\tilde{\mu}_{t}(x_{t},x_{0})$ using the closed‑form expression.

Take a gradient‑descent step on $\bigl\|\mu_{\theta}(x_{t},t)-\tilde{\mu}_{t}(x_{t},x_{0})\bigr\|^{2}$.

Repeat until convergence.
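The mean-regression loss in the steps above is closely tied to noise prediction. As a hedged scalar sketch: if the network's mean is parameterized through a noise estimate $\hat{\epsilon}$, the squared error on the mean equals the squared error on the noise times $\beta_t^2/(\alpha_t(1-\bar{\alpha}_t))$ (the $1/(2\sigma_t^2)$ factor from the bound is left out here). The function names and the numeric inputs, including `eps_hat` as a stand-in for a network output, are assumptions for illustration.

```python
import math

def posterior_mean(x0, xt, alpha_t, alpha_bar_t, alpha_bar_prev):
    """Mean of q(x_{t-1} | x_t, x_0) for scalars."""
    beta_t = 1 - alpha_t
    c0 = math.sqrt(alpha_bar_prev) * beta_t / (1 - alpha_bar_t)
    ct = math.sqrt(alpha_t) * (1 - alpha_bar_prev) / (1 - alpha_bar_t)
    return c0 * x0 + ct * xt

def mean_from_eps(xt, eps_hat, alpha_t, alpha_bar_t):
    """Reverse-process mean implied by a noise estimate eps_hat."""
    beta_t = 1 - alpha_t
    return (xt - beta_t / math.sqrt(1 - alpha_bar_t) * eps_hat) / math.sqrt(alpha_t)

alpha_t, alpha_bar_prev = 0.9, 0.6          # arbitrary illustrative values
alpha_bar_t = alpha_t * alpha_bar_prev
x0, eps, eps_hat = 0.7, 0.4, 0.1            # eps_hat: hypothetical network output

xt = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1 - alpha_bar_t) * eps
mu_loss = (posterior_mean(x0, xt, alpha_t, alpha_bar_t, alpha_bar_prev)
           - mean_from_eps(xt, eps_hat, alpha_t, alpha_bar_t)) ** 2
weight = (1 - alpha_t) ** 2 / (alpha_t * (1 - alpha_bar_t))
eps_loss = weight * (eps - eps_hat) ** 2

print(abs(mu_loss - eps_loss) < 1e-12)  # True
```

Dropping `weight` from `eps_loss` gives the unweighted noise-prediction loss that the summary calls the simplified objective.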

Sampling starts from pure Gaussian noise and iteratively applies the learned reverse Gaussian, gradually removing noise until a clean sample emerges.

Sample $x_{T}\sim\mathcal{N}(0,I)$.

For $t = T, T-1, \dots, 1$:

Sample $z\sim\mathcal{N}(0,I)$ if $t>1$; otherwise set $z=0$.

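Compute $x_{t-1}=\mu_{\theta}(x_{t},t)+\sigma_{t}z$; under the noise‑prediction parameterization, $\mu_{\theta}(x_{t},t)=\frac{1}{\sqrt{\alpha_{t}}}\Bigl(x_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\,\epsilon_{\theta}(x_{t},t)\Bigr)$.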

Return $x_{0}$ as the generated sample.
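The sampling loop can be sketched for scalar data as follows. This is an illustration under stated assumptions rather than the authors' implementation: it uses the noise-prediction form of the mean, picks $\sigma_t^2=\beta_t$ (one of the fixed choices the paper considers), and replaces the trained network with a hypothetical `oracle_eps` that is exact when all data equal a constant `C`.

```python
import math
import random

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and the cumulative products alpha_bar_t."""
    betas = [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]
    alpha_bars, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        alpha_bars.append(running)
    return betas, alpha_bars

def sample(eps_model, betas, alpha_bars, rng):
    """Ancestral sampling for a scalar; eps_model(x_t, t) predicts the noise."""
    T = len(betas)
    x = rng.gauss(0.0, 1.0)                        # x_T ~ N(0, 1)
    for t in range(T, 0, -1):                      # t = T, ..., 1
        beta_t, abar_t = betas[t - 1], alpha_bars[t - 1]
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0  # no noise on the last step
        eps_hat = eps_model(x, t)
        mean = (x - beta_t / math.sqrt(1.0 - abar_t) * eps_hat) / math.sqrt(1.0 - beta_t)
        x = mean + math.sqrt(beta_t) * z           # assumes sigma_t^2 = beta_t
    return x

betas, alpha_bars = make_schedule()
C = 0.5  # toy "dataset": every data point equals C

def oracle_eps(x_t, t):
    """Exact noise for point-mass data at C (stand-in for a trained network)."""
    abar = alpha_bars[t - 1]
    return (x_t - math.sqrt(abar) * C) / math.sqrt(1.0 - abar)

x0_hat = sample(oracle_eps, betas, alpha_bars, random.Random(0))
print(abs(x0_hat - C) < 1e-6)  # True
```

With the oracle, the final step lands on `C` regardless of the random path, which is a convenient check that the coefficients are wired correctly.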

Model Design and Training

Design choices for data scaling, a discrete decoder, and a simplified noise‑prediction loss.

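We treat each image as a tensor of integer values $x\in\{0,\dots,255\}^D$ scaled linearly to $[-1,1]$; this guarantees that the reverse‑process network always receives inputs on the same scale, starting from the standard normal prior $p(x_T)$. The final reverse step $p_\theta(x_0\mid x_1)$ is an independent discrete decoder obtained by integrating the Gaussian $\mathcal N\bigl(x_0;\mu_\theta(x_1,1),\sigma_1^2 I\bigr)$ over the bin around each pixel value.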

More expressive decoders such as conditional autoregressive models could replace this Gaussian factorization, but we leave those extensions to future work.

The network is trained to predict the exact Gaussian noise that was added to an image at a randomly chosen diffusion step.

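Take a two‑pixel image $x_0=[120,200]$ (left unscaled here for readability), noise $\epsilon=[0.3,-0.5]$, and timestep $t=2$ with $\bar{\alpha}_2=0.81$.

Compute $\sqrt{0.81}=0.9$ and $\sqrt{0.19}\approx0.436$.

Scale the image: $0.9\cdot[120,200]=[108,180]$.

Scale the noise: $0.436\cdot[0.3,-0.5]=[0.1308,-0.2180]$.

Add them: $\tilde{x}=[108.1308,179.7820]$.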

The network receives $\tilde{x}$ and timestep $2$ and outputs $\hat{\epsilon}$.

The loss for this sample is $\|\epsilon-\hat{\epsilon}\|_2^{2}$; if $\hat{\epsilon}=[0.28,-0.48]$ the loss equals $(0.02)^2+(0.02)^2=8\times10^{-4}$.

The MSE directly measures how accurately the model can invert the forward diffusion at a specific noise level, and because the target $\epsilon$ is known analytically, no additional supervision is required.

How does this noise‑prediction loss differ from the score‑matching loss used in NCSN?

Both losses ask the network to predict a quantity derived from the forward diffusion, but NCSN predicts the *score*, whose denoising score matching target is $\nabla_{x_t}\log q(x_t\mid x_0) = -\epsilon/\sqrt{1-\bar{\alpha}_t}$. Our loss predicts the raw noise $\epsilon$ itself, avoiding the division by $\sqrt{1-\bar{\alpha}_t}$ and thus eliminating a timestep‑dependent scaling factor.
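The relation between the score and the noise can be checked numerically for a scalar. The sketch below compares a finite-difference derivative of $\log q(x_t\mid x_0)$ with $-\epsilon/\sqrt{1-\bar{\alpha}_t}$; note that it checks the conditional score given $x_0$, which is the regression target in denoising score matching. All numeric values and the helper name `log_q` are illustrative assumptions.

```python
import math

def log_q(xt, x0, alpha_bar_t):
    """Log density of q(x_t | x_0) = N(sqrt(abar) * x0, 1 - abar) for scalars."""
    var = 1.0 - alpha_bar_t
    mean = math.sqrt(alpha_bar_t) * x0
    return -0.5 * math.log(2 * math.pi * var) - (xt - mean) ** 2 / (2 * var)

x0, eps, alpha_bar_t = 0.3, -1.2, 0.64
xt = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1 - alpha_bar_t) * eps

h = 1e-5  # central finite difference for d/dx_t log q
score_fd = (log_q(xt + h, x0, alpha_bar_t) - log_q(xt - h, x0, alpha_bar_t)) / (2 * h)
score_eps = -eps / math.sqrt(1 - alpha_bar_t)

print(abs(score_fd - score_eps) < 1e-6)  # True
```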

Because the simplified loss drops the timestep-dependent weights of the variational bound, it down-weights the terms at small $t$ relative to that bound. Those terms train the network to denoise barely perturbed images, which is almost trivial, so de-emphasizing them lets the model focus on the harder, high-noise regimes at larger $t$. The intuition is similar to “spotting a typo is easier than writing a typo-free paragraph”.

**Algorithm 1 Training**

1: **repeat**

2: $\mathbf{x}_0 \sim q(\mathbf{x}_0)$

3: $t \sim \text{Uniform}(\{1, \dots, T\})$

4: $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$

5: Take gradient descent step on $\nabla_\theta \|\epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\epsilon, t)\|^2$

6: **until** converged
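The quantity differentiated in line 5 of Algorithm 1 can be sketched as a small function. This is a minimal illustration, not the authors' code: `simple_loss` and `toy_model` are hypothetical names, the data are plain Python lists, and the gradient step itself is left to whatever autodiff framework is in use. The call at the bottom reuses the two-pixel numbers from the worked example.

```python
import math
import random

def simple_loss(eps_model, x0, t, alpha_bar_t, eps=None, rng=random):
    """Squared error between the true noise and the model's prediction
    for one training example (x0 is a list of floats)."""
    if eps is None:
        eps = [rng.gauss(0.0, 1.0) for _ in x0]      # eps ~ N(0, I)
    a, s = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    x_t = [a * xi + s * ei for xi, ei in zip(x0, eps)]  # noisy input
    eps_hat = eps_model(x_t, t)
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))

# Stand-in for a trained network: returns the prediction from the worked example.
def toy_model(x_t, t):
    return [0.28, -0.48]

loss = simple_loss(toy_model, [120.0, 200.0], t=2, alpha_bar_t=0.81, eps=[0.3, -0.5])
print(round(loss, 6))  # 0.0008
```

In a real training loop, `t` would be drawn uniformly from $\{1,\dots,T\}$ and `eps` left unset so that fresh noise is sampled for every example.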

Table 2 reports ablations of the reverse‑process variance (learned diagonal $\Sigma$ vs. fixed isotropic) and of the prediction target ($\tilde{\mu}$ vs. $\epsilon$). Predicting $\epsilon$ with a fixed isotropic variance and the simplified objective gives the best sample quality (lowest FID), while learning a diagonal covariance leads to unstable training and poorer samples. Predicting $\tilde{\mu}$ is competitive only when trained on the true variational bound.

Experimental Results

We report state‑of‑the‑art FID scores and efficient coding for our DDPM.

The paper’s core idea is to train a network to predict the added noise at each step of the forward diffusion, then run the learned reverse process to generate images.

Our unconditional DDPM attains an FID of 3.17 on CIFAR‑10, surpassing most prior unconditional models.

Table 1 reports an FID = 3.17 for the “Ours (`L_simple`)” entry, which is lower than all other unconditional baselines.

**Table 1.** CIFAR10 results. NLL measured in bits/dim.

**Figure 5.** Unconditional CIFAR10 test set rate-distortion vs. time. Distortion is measured in root mean squared error on a [0, 255] scale. See Table 4 for details.

**Figure 3.** LSUN Church samples. FID=7.89

**Figure 4.** LSUN Bedroom samples. FID=4.90

**Figure 9.** Coarse-to-fine interpolations that vary the number of diffusion steps prior to latent mixing.

DDPM achieves competitive FID scores while offering an efficient, progressive coding scheme.

Extended Results and Visualizations

Supplementary details, broader impact, and acknowledgments for the diffusion model study.

We have demonstrated that diffusion models can produce high‑quality images and reveal links to variational inference, score‑matching, autoregressive models, and progressive lossy compression.

Beyond technical progress, the work raises societal concerns: generative models can be weaponized for misinformation, yet they also enable compression, representation learning, and creative applications.

This research was funded by ONR PECASE, the NSF Graduate Research Fellowship (grant DGE‑1752814), and Google’s TensorFlow Research Cloud.

**Figure 10.** Unconditional CIFAR10 progressive sampling quality over time

**Figure 11.** CelebA-HQ 256 × 256 generated samples

**Figure 12.** (a) Pixel space nearest neighbors (b) Inception feature space nearest neighbors

**Figure 13.** Unconditional CIFAR10 generated samples

**Figure 14.** Unconditional CIFAR10 progressive generation

**Figure 15.** Unconditional CIFAR10 nearest neighbors. Generated samples are in the leftmost column, and training set nearest neighbors are in the remaining columns.

**Figure 16.** LSUN Church generated samples. FID=7.89

**Figure 17.** LSUN Bedroom generated samples, large model. FID=4.90

**Figure 18.** LSUN Bedroom generated samples, small model. FID=6.36

**Figure 19.** LSUN Cat generated samples. FID=19.75

Mathematical Derivations

Extended derivations of the reduced‑variance variational bound for diffusion models.

This appendix reproduces the step‑by‑step algebra that leads to the reduced‑variance variational bound originally presented by Sohl‑Dickstein et al. [53]. The goal is to make explicit how the bound decomposes into Kullback–Leibler (DKL) terms and an expectation that later motivates the simplified training objective.

All of the displayed identities are algebraic rearrangements of the same variational objective; they make explicit which terms are tractable (the per‑step KLs) and which are not (the expectations over $q$). These forms underpin the simplified noise‑prediction loss used throughout the main text.

Implementation Details

Implementation and training specifics for the diffusion models.

Our models adopt the backbone of PixelCNN++ [52], a U‑Net based on a Wide ResNet. We replace the original weight normalization with group normalization to simplify implementation. Our 32×32 models use four feature‑map resolutions (32×32 down to 4×4), and our 256×256 models use six.

Every resolution level includes two convolutional residual blocks, and we insert self‑attention blocks at the 16×16 resolution between those convolutional blocks. Diffusion time $t$ is specified by adding the Transformer sinusoidal position embedding into each residual block.

The CIFAR‑10 model has 35.7 million parameters, while the LSUN and CelebA‑HQ models have 114 million; a larger LSUN Bedroom variant reaches roughly 256 million parameters by increasing the filter count.
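The sinusoidal embedding of the diffusion time mentioned above can be sketched as follows. The layout (all sines followed by all cosines), the `max_period` of 10000, and the function name are assumptions based on the common Transformer-style recipe; the released code may order or scale the components differently.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of an integer timestep t into `dim` floats.
    Layout (an assumption): first half sines, second half cosines.
    `dim` is assumed to be even."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to about 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = timestep_embedding(t=10, dim=8)
print(len(emb))  # 8
```

In a network, this vector would typically pass through a small MLP before being added to the feature maps of each residual block.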

Experiments run on TPU v3‑8 devices (approximately eight V100 GPUs). The CIFAR‑10 model trains at 21 steps per second with batch size 128, completing 800 k steps in 10.6 hours; sampling a batch of 256 images takes 17 seconds. LSUN and CelebA‑HQ models train at 2.2 steps per second with batch size 64, and sampling 128 images requires 300 seconds.

Training budgets differ per dataset: CelebA‑HQ 0.5 M steps, LSUN Bedroom 2.4 M, LSUN Cat 1.8 M, LSUN Church 1.2 M, and the larger LSUN Bedroom model 1.15 M steps.

We adopt a linear $\beta_t$ schedule ranging from $10^{-4}$ to $0.02$ with $T=1000$, after evaluating constant and quadratic alternatives. Dropout is set to 0.1 for CIFAR‑10 (selected after a sweep over $\{0.1,0.2,0.3,0.4\}$); all other datasets use zero dropout.
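The linear schedule can be written down directly. The sketch below builds $\beta_t$ for $T=1000$ and accumulates $\bar{\alpha}_t=\prod_s(1-\beta_s)$, which shows how little signal remains at the final step; the function names are illustrative, and the endpoints are the ones quoted above.

```python
def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced betas from beta_start to beta_end (inclusive)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + step * i for i in range(T)]

def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t."""
    out, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        out.append(running)
    return out

betas = linear_beta_schedule()
alpha_bars = cumulative_alphas(betas)
print(f"alpha_bar_T = {alpha_bars[-1]:.1e}")  # small: x_T is almost pure noise
```

A tiny final $\bar{\alpha}_T$ means $q(x_T\mid x_0)$ is very close to a standard normal, which is what lets sampling start from $\mathcal N(0,I)$.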

Random horizontal flips are applied during training for all datasets except LSUN Bedroom. After early experiments with Adam and RMSProp, we keep Adam with its default hyper‑parameters; the learning rate is $2\times10^{-4}$ for CIFAR‑10 and is lowered to $2\times10^{-5}$ for the 256×256 models, which were unstable to train at the larger rate. Batch sizes are fixed at 128 for CIFAR‑10 and 64 for larger images, and we maintain an EMA of the model parameters with decay $0.9999$.

Model quality is reported using the minimum FID observed during training. For CIFAR‑10 we compute Inception and FID scores on 50 k samples with the OpenAI [51] and TTUR [21] implementations; for LSUN we use the StyleGAN2 [30] codebase. All datasets are loaded via TensorFlow Datasets, and LSUN preprocessing follows the StyleGAN pipeline.

Complete experimental configurations, scripts, and logs are released alongside the source code.

PixelCNN++ is a hierarchical convolutional model that stacks residual blocks in a U‑Net shape, using a Wide ResNet as the encoder‑decoder core to capture multi‑scale image structure.

Related Work Discussion

Comparing our architecture and training to NCSN, highlighting key design differences.

We discuss how our model’s architecture, forward process, and training differ from the Noise Conditional Score Network (NCSN) and why these changes improve sample quality.

NCSN learns a score function for denoising by conditioning on noise level, using a RefineNet architecture with dilated convolutions.

1️⃣ Our U‑Net with self‑attention conditions all layers on $t$ via sinusoidal embeddings, unlike NCSN’s RefineNet which only conditions normalization layers (v1) or the output (v2).

2️⃣ The forward process in diffusion models scales the data by $\sqrt{1-\beta_t}$ each step to keep variance stable; NCSN omits this scaling, which can cause input variance to grow.

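3️⃣ Our forward process destroys signal, with $D_{\mathrm{KL}}\bigl(q(x_T\mid x_0)\,\|\,\mathcal N(0,I)\bigr)\approx 0$, so the prior closely matches the aggregate posterior of $x_T$; we also keep $\beta_t$ very small so that the forward process can be reversed by a Markov chain with conditional Gaussians.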

4️⃣ Our Langevin‑like sampler’s learning‑rate and noise‑scale coefficients are derived analytically from $\beta_t$, and the sampler is trained jointly as a latent variable model via variational inference, unlike NCSN’s hand‑crafted post‑hoc coefficients.

Latent structure: sampling splits after the prior draw at $x_{T=1000}$ produce diverse images, but splits later (e.g., after $x_{750}$) preserve high‑level attributes such as gender and hair color, showing that intermediate latents encode semantic information.

Coarse‑to‑fine interpolation: fewer diffusion steps retain more source structure, enabling fine‑grained interpolation; many steps erase source details, so interpolation yields novel samples rather than pixel‑space mixes.
