Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen

A two-stage generative model that produces images by first predicting CLIP image embeddings from text.

How can we use CLIP's joint image-text embedding space to build a hierarchical, text-conditional image generation system that outperforms monolithic models?

Text-to-image models often struggle to balance high-fidelity photorealism with the diversity of generated scenes. The authors introduce unCLIP, a two-stage stack that first uses a "prior" model to generate a CLIP image embedding from a caption, then uses a diffusion decoder to invert that embedding into an image. This hierarchical approach decouples semantic generation from pixel-level synthesis, resulting in significantly higher sample diversity than previous state-of-the-art models while maintaining comparable photorealism.

Paper Primer

The core move is the introduction of a prior model that maps text captions into the continuous CLIP image embedding space. By generating this intermediate representation first, the subsequent diffusion decoder can focus on rendering the visual details, allowing for semantic manipulations like interpolations and text-guided edits by traversing the CLIP latent space.

unCLIP achieves superior sample diversity compared to GLIDE.

In pairwise human evaluations focused on diversity, unCLIP with a diffusion prior was preferred over GLIDE 70.5% ± 2.8% of the time.

Why use a two-stage hierarchical approach instead of training a single model to map text directly to pixels?

The authors find that explicitly generating CLIP image embeddings first allows the model to "freeze" the semantic content of the scene, which prevents the semantic collapse often seen in other models when increasing guidance scales to improve photorealism.

What is the primary trade-off or limitation of this architecture?

The model struggles with attribute binding (e.g., assigning specific colors to specific objects) and rendering coherent text, likely because the CLIP embedding does not explicitly encode spelling or precise spatial relationships between objects and their attributes.

Researchers can now treat CLIP latent space as a generative target, enabling a new class of image manipulation tools that leverage the joint text-image embedding space for zero-shot semantic editing.

Introduction and Motivation

We expose limits of monolithic text‑to‑image models and motivate a hierarchical latent‑based approach.

Current text‑to‑image systems are monolithic: a single network must simultaneously learn to understand a caption, locate visual concepts, and render a photorealistic picture, which forces a trade‑off between semantic fidelity and visual diversity.

CLIP maps an image and its caption into a common vector where distance reflects semantic similarity, so the same vector can stand in for the image during generation.

Treat each pixel of a 1024 × 1024 image as a token and compute the dot-product between every pair of tokens → a matrix with $1{,}048{,}576^2 \approx 1.1 \times 10^{12}$ entries.

Each entry occupies 4 bytes (float32), so memory = $1.1 \times 10^{12} \times 4 \approx 4.4 \times 10^{12}$ bytes ≈ $4.4$ TB.

Even with 16-bit precision the requirement is still about $2.2$ TB, far beyond typical GPU capacity.

This illustrates why generating images directly in pixel space is infeasible at high resolution, motivating a compact latent target such as a CLIP embedding.
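The arithmetic above can be checked with a few lines of Python. This is only a sketch of the back-of-the-envelope estimate: the helper name `pairwise_matrix_bytes` is hypothetical, and the one-token-per-pixel setup is the illustrative assumption used in the example, not a description of how any real model is implemented.

```python
def pairwise_matrix_bytes(num_tokens: int, bytes_per_entry: int) -> int:
    """Bytes needed to store one dense num_tokens x num_tokens matrix."""
    return num_tokens * num_tokens * bytes_per_entry


# Assumption: one token per pixel of a 1024 x 1024 image.
tokens = 1024 * 1024
fp32_bytes = pairwise_matrix_bytes(tokens, 4)  # float32 entries
fp16_bytes = pairwise_matrix_bytes(tokens, 2)  # float16 entries

print(f"entries: {tokens ** 2:.2e}")
print(f"float32: {fp32_bytes / 1e12:.2f} TB")
print(f"float16: {fp16_bytes / 1e12:.2f} TB")
```

Halving the precision only halves the footprint, so the quadratic growth in the number of tokens remains the dominant cost.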

**Figure 1.** Selected 1024 × 1024 samples from a production version of our model.

The key shift is moving from a monolithic generator to a hierarchical pipeline that first predicts a compact CLIP embedding and then decodes it into an image.

The unCLIP Architecture

Balancing fidelity and diversity requires a two‑stage generative stack.

Single‑stage text‑to‑image models either miss fine‑grained semantics or collapse into low‑variance samples. The core difficulty is that the same network must learn a joint distribution over language and pixels, which forces a trade‑off between fidelity and diversity.

Sample a CLIP image embedding $z_i$ from the Prior conditioned on caption $y$.

Feed $z_i$ (and optionally $y$) to the Decoder diffusion model.

Run the diffusion sampler to obtain an image $x$.

If higher resolution is required, pass $x$ through the two upsampler stages.
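The four sampling steps above can be sketched as a small driver function. Everything here is illustrative: `sample_unclip` and the `toy_*` callables are hypothetical stand-ins that only mirror the control flow, not the trained prior, decoder, or upsamplers.

```python
from typing import Callable, List, Optional, Sequence

Embedding = List[float]
Image = List[List[float]]


def sample_unclip(
    caption: str,
    prior: Callable[[str], Embedding],
    decoder: Callable[[Embedding, Optional[str]], Image],
    upsamplers: Sequence[Callable[[Image], Image]] = (),
    condition_decoder_on_caption: bool = True,
) -> Image:
    """Caption -> CLIP image embedding -> base image -> upsampled image."""
    z_i = prior(caption)                                   # step 1: sample z_i given y
    y = caption if condition_decoder_on_caption else None  # step 2: caption is optional
    x = decoder(z_i, y)                                    # step 3: run the decoder
    for upsample in upsamplers:                            # step 4: optional upsampling
        x = upsample(x)
    return x


# Toy stand-ins so the sketch runs; none of these are real models.
def toy_prior(caption: str) -> Embedding:
    return [float(len(caption)), 1.0]


def toy_decoder(z_i: Embedding, caption: Optional[str]) -> Image:
    return [[z_i[0], z_i[1]], [z_i[1], z_i[0]]]


def toy_upsample_2x(x: Image) -> Image:
    # Nearest-neighbour doubling of both axes.
    return [[v for v in row for _ in range(2)] for row in x for _ in range(2)]


image = sample_unclip("a corgi", toy_prior, toy_decoder, [toy_upsample_2x, toy_upsample_2x])
print(len(image), len(image[0]))
```

Passing the stages in as callables keeps the two-stage structure explicit: the prior never sees pixels, and the decoder never needs to interpret the caption on its own.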

The Prior learns to map a caption directly into the CLIP image space, so the downstream image generator can work with a compact, semantically rich representation.

Original embedding $z_i = (0.9, -0.2, 0.4, 0.1, -0.5, 0.3, 0.0, 0.2)$.

PCA projects to $v = (1.2, -0.8, 0.5)$ (the coordinates along the three principal components with the largest eigenvalues).

Quantization maps $1.2\to 3$, $-0.8\to 0$, $0.5\to 2$ (bucket indices 0–3).

The resulting token sequence is $[3,0,2]$ plus a dot‑product token $z_i\!\cdot\!z_t = 0.42$.

The AR prior predicts this sequence autoregressively; the diffusion prior would predict $v$ directly.

Reducing dimensionality and discretizing the embedding shrinks the autoregressive token length by a factor of ≈3, speeding inference while preserving >99 % of the original information.
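The quantization step of the worked example can be reproduced with uniform buckets. This is a minimal sketch: the bucket range `[-1.0, 1.5]` is an assumption chosen so the toy values land in the buckets quoted above, and the helper names `quantize` and `dequantize` are hypothetical.

```python
from typing import List


def quantize(values: List[float], lo: float, hi: float, num_buckets: int) -> List[int]:
    """Map each value to the index of the uniform bucket over [lo, hi] containing it."""
    width = (hi - lo) / num_buckets
    indices = []
    for v in values:
        idx = int((v - lo) // width)
        indices.append(min(max(idx, 0), num_buckets - 1))  # clamp out-of-range values
    return indices


def dequantize(indices: List[int], lo: float, hi: float, num_buckets: int) -> List[float]:
    """Approximate each original value by the centre of its bucket."""
    width = (hi - lo) / num_buckets
    return [lo + (i + 0.5) * width for i in indices]


v = [1.2, -0.8, 0.5]            # PCA-reduced embedding from the example
LO, HI, BUCKETS = -1.0, 1.5, 4  # assumed range; the example only fixes 4 buckets

tokens = quantize(v, LO, HI, BUCKETS)
print(tokens)
print(dequantize(tokens, LO, HI, BUCKETS))
```

With only four buckets the reconstruction is coarse; using more buckets per dimension reduces the rounding error at the cost of a larger token vocabulary.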

How does the AR prior differ from a standard language model that predicts word tokens?

A language model predicts discrete vocabulary tokens; the AR prior predicts quantized components of a continuous image embedding. The token alphabet is numeric (bucket indices) rather than lexical, and the conditioning includes a dot‑product token that has no analogue in ordinary text generation.

The Decoder turns a CLIP image embedding into a pixel image by running a diffusion process that is guided by that embedding.

Initialize a $4\times4$ noise tensor $N_0$.

At each timestep $t$, compute a timestep embedding $e_t$ and add $z_i$ (projected to the same dimension) to obtain $e'_t$.

Run the UNet denoiser with $e'_t$ to produce $N_{t-1}$.

After 10 steps, the noise collapses to a coherent image $x$.

Upsample $x$ to $8\times8$ using the first upsampler (Gaussian blur applied to the conditioning image).

Upsample again to $16\times16$ with the second upsampler (BSR degradation applied).

The conditioning embedding steers the denoising trajectory, so even a low-dimensional vector can control the final output when it is injected at every diffusion step.

Why add the CLIP embedding to the timestep embedding instead of using it only as a static conditioning vector?

Injecting $z_i$ into the timestep embedding lets the diffusion model modulate its denoising dynamics at each step, effectively shaping the entire trajectory rather than providing a single static bias. This yields finer control over the generated image than a simple concatenation would.
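A minimal sketch of this additive conditioning follows. The sinusoidal form of the timestep embedding, the random matrix `W` standing in for a learned projection, and all function names are assumptions made for illustration; the text only states that the projected CLIP embedding is added to the timestep embedding.

```python
import math
import random
from typing import List


def timestep_embedding(t: int, dim: int) -> List[float]:
    """Sinusoidal embedding of timestep t (dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * k / half) for k in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def project(z: List[float], weight: List[List[float]]) -> List[float]:
    """Linear map weight @ z, taking the CLIP embedding to the timestep dimension."""
    return [sum(w * x for w, x in zip(row, z)) for row in weight]


def conditioned_embedding(t: int, z_i: List[float], weight: List[List[float]]) -> List[float]:
    """e'_t = e_t + weight @ z_i, the vector handed to the denoiser at step t."""
    e_t = timestep_embedding(t, len(weight))
    return [a + b for a, b in zip(e_t, project(z_i, weight))]


rng = random.Random(0)
DIM = 8
z_i = [0.9, -0.2, 0.4, 0.1]
# Stand-in for a learned projection matrix of shape (DIM, len(z_i)).
W = [[rng.gauss(0.0, 0.1) for _ in z_i] for _ in range(DIM)]

for t in (0, 5, 9):
    print(t, [round(v, 3) for v in conditioned_embedding(t, z_i, W)])
```

Because the combined vector is recomputed for every timestep, the denoiser sees the embedding alongside the current noise level at each step of the trajectory.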

Latent Space Manipulations

Split an image into a CLIP‑recognized part and a residual, then manipulate each independently.

Monolithic text‑to‑image pipelines cannot cleanly separate what CLIP already knows about an image from the extra detail a decoder needs. By representing an image as a pair $(z_i, x_T)$ we isolate the CLIP‑recognizable content ($z_i$) and the residual information ($x_T$) required for faithful reconstruction.

The image is split into a CLIP‑derived embedding that captures recognizable semantics, and a separate DDIM‑inverted latent that stores everything else needed to rebuild the exact pixel‑level picture.

Step 1: Store $z_i$ as the semantic part.

Step 2: Store $x_T$ as the residual part.

Step 3: To reconstruct, feed both $z_i$ and $x_T$ into the decoder, which outputs the original pixel matrix.

Step 4: If we replace $x_T$ with a new residual $x_T' = (0.6, -0.2)$ while keeping $z_i$, the decoder produces a visually similar image with slightly altered colors.

Step 5: If we replace $z_i$ with $z_i' = (0.0, 0.0, 0.0, 0.0)$ and keep $x_T$, the decoder outputs a blurry version that lacks recognizable objects.

The pair $(z_i, x_T)$ cleanly isolates semantic content from fine‑grained detail, making it possible to vary one without disturbing the other.

With the bipartite representation in hand, we can generate controlled variations by fixing the CLIP part and sampling stochastic DDIM noise.

Keep the CLIP embedding $z_i$ fixed and inject randomness through the DDIM noise scale $\eta$; the decoder then produces a family of images that share the same high‑level semantics but differ in low‑level details.
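The role of $\eta$ can be illustrated with a single DDIM update. This sketch uses the standard DDIM noise-scale formula; the fixed `eps` list is a placeholder for the network's noise prediction given $z_i$, and the `alpha` values are arbitrary cumulative signal levels chosen for the demo.

```python
import math
import random
from typing import List


def ddim_sigma(alpha_t: float, alpha_prev: float, eta: float) -> float:
    """Noise scale of one DDIM step; alpha_* are cumulative signal levels in (0, 1)."""
    return eta * math.sqrt((1 - alpha_prev) / (1 - alpha_t)) * math.sqrt(1 - alpha_t / alpha_prev)


def ddim_step(x_t: List[float], eps: List[float], alpha_t: float, alpha_prev: float,
              eta: float, rng: random.Random) -> List[float]:
    """Move x_t one step toward the data given the predicted noise eps."""
    sigma = ddim_sigma(alpha_t, alpha_prev, eta)
    out = []
    for x, e in zip(x_t, eps):
        x0 = (x - math.sqrt(1 - alpha_t) * e) / math.sqrt(alpha_t)  # predicted clean value
        direction = math.sqrt(1 - alpha_prev - sigma ** 2) * e      # deterministic part
        out.append(math.sqrt(alpha_prev) * x0 + direction + sigma * rng.gauss(0.0, 1.0))
    return out


x_t = [0.3, -1.1]
eps = [0.5, -0.4]  # placeholder for the decoder's noise prediction conditioned on z_i

a = ddim_step(x_t, eps, 0.5, 0.8, eta=0.0, rng=random.Random(1))
b = ddim_step(x_t, eps, 0.5, 0.8, eta=0.0, rng=random.Random(2))
c = ddim_step(x_t, eps, 0.5, 0.8, eta=1.0, rng=random.Random(1))
d = ddim_step(x_t, eps, 0.5, 0.8, eta=1.0, rng=random.Random(2))
print(a == b, c == d)
```

With $\eta = 0$ the random seed has no effect, so the step is deterministic; with $\eta > 0$ each seed yields a different sample for the same conditioning.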

**Figure 3.** Variations of an input image by encoding with CLIP and then decoding with a diffusion model. The variations preserve both semantic information like presence of a clock in the painting and the overlapping strokes in the logo, as well as stylistic elements like the surrealism in the painting and the color gradients in the logo, while varying the non-essential details.

Beyond pure variations, we can blend two images by interpolating their CLIP embeddings, optionally also interpolating their DDIM latents.

Rotate smoothly between the CLIP embeddings of two images; the decoder then materializes the intermediate semantics while optionally keeping the DDIM noise fixed or also interpolated.

**Figure 4.** Variations between two images by interpolating their CLIP image embedding and then decoding with a diffusion model. We fix the decoder seed across each row. The intermediate variations naturally blend the content and style from both input images.
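The rotation between two embeddings can be written as spherical interpolation (slerp). The function below is a plain-Python sketch with toy 2-D vectors standing in for CLIP embeddings; it does not handle exactly opposite vectors, where the rotation path is not unique.

```python
import math
from typing import List


def slerp(a: List[float], b: List[float], theta: float) -> List[float]:
    """Spherical interpolation from a (theta=0) to b (theta=1)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))  # angle between the vectors
    if omega < 1e-8:                                    # nearly parallel: use plain lerp
        return [(1 - theta) * x + theta * y for x, y in zip(a, b)]
    wa = math.sin((1 - theta) * omega) / math.sin(omega)
    wb = math.sin(theta * omega) / math.sin(omega)
    return [wa * x + wb * y for x, y in zip(a, b)]


z_1 = [1.0, 0.0]  # toy stand-in for the first image's CLIP embedding
z_2 = [0.0, 1.0]  # toy stand-in for the second image's CLIP embedding

for theta in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(theta, [round(v, 4) for v in slerp(z_1, z_2, theta)])
```

Unlike linear interpolation, the intermediate points stay on the same sphere as the endpoints when both inputs have equal norm, which keeps them in the region of embedding space the decoder was trained on.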

Because CLIP embeds text and images in the same space, we can steer an image toward a new textual description by computing a normalized text difference and interpolating toward it.

Subtract the CLIP embedding of the current caption from that of a target caption, normalize the difference, and rotate the image’s CLIP embedding toward this direction while keeping the residual DDIM noise fixed.

**Figure 5.** Image-to-image translation results showing progressive transformations between two concepts: (1) a cat to a super saiyan cat, (2) a Victorian house to a modern house, (3) an adult lion to a lion cub, and (4) a winter landscape to a fall landscape.
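The text-guided edit can be sketched in the same spirit. The 3-D vectors below are toy stand-ins for CLIP embeddings, the helper names are hypothetical, normalizing $z_i$ before rotating is an assumption, and the sweep of $\theta$ up to 0.5 is only an illustrative range.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def slerp(a: List[float], b: List[float], theta: float) -> List[float]:
    """Spherical interpolation between unit vectors a (theta=0) and b (theta=1)."""
    cos_omega = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(cos_omega)
    if omega < 1e-8:  # already aligned with the target direction
        return list(a)
    wa = math.sin((1 - theta) * omega) / math.sin(omega)
    wb = math.sin(theta * omega) / math.sin(omega)
    return [wa * x + wb * y for x, y in zip(a, b)]


def text_diff_edit(z_i: List[float], z_t_current: List[float],
                   z_t_target: List[float], theta: float) -> List[float]:
    """Rotate the image embedding toward the normalized caption difference."""
    z_d = normalize([t - c for t, c in zip(z_t_target, z_t_current)])
    return slerp(normalize(z_i), z_d, theta)


z_i = [1.0, 0.0, 0.0]   # toy image embedding
z_t0 = [0.6, 0.8, 0.0]  # toy embedding of the current caption
z_t = [0.6, 0.0, 0.8]   # toy embedding of the target caption

for theta in (0.0, 0.25, 0.5):
    print(theta, [round(v, 3) for v in text_diff_edit(z_i, z_t0, z_t, theta)])
```

Each edited embedding would then be decoded with the same DDIM latent $x_T$, so only the semantics implied by the caption change moves. The sketch assumes the two captions differ; identical captions give a zero difference that cannot be normalized.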

Beyond manipulation, the bipartite pipeline lets us inspect what CLIP actually encodes by visualizing reconstructions from its latent space.

**Figure 6.** Variations of images featuring typographic attacks [20] paired with the CLIP model's predicted probabilities across three labels. Surprisingly, the decoder still recovers Granny Smith apples even when the predicted probability for this label is near 0%. We also find that our CLIP model is slightly less susceptible to the "pizza" attack than the models investigated in [20].

**Figure 7.** Visualization of reconstructions of CLIP latents from progressively more PCA dimensions (20, 30, 40, 80, 120, 160, 200, 320 dimensions), with the original source image on the far right. The lower dimensions preserve coarse-grained semantic information, whereas the higher dimensions encode finer-grained details about the exact form of the objects in the scene.

The Role of the Prior

Assessing how the prior influences image quality and alignment.

Because the CLIP image embedding is dropped 5 % of the time during training to enable classifier-free guidance, the decoder can also be sampled from the caption alone, without the prior, but this leads to noticeably poorer caption alignment.
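The two ingredients of classifier-free guidance mentioned here, conditioning dropout at training time and extrapolation at sampling time, can be sketched as follows. Representing a dropped embedding as `None` is a simplification for illustration, and both function names are hypothetical.

```python
import random
from typing import List, Optional


def maybe_drop(embedding: List[float], p_drop: float, rng: random.Random) -> Optional[List[float]]:
    """Training-time dropout: return None (unconditional) with probability p_drop."""
    return None if rng.random() < p_drop else embedding


def guided_prediction(eps_cond: List[float], eps_uncond: List[float], scale: float) -> List[float]:
    """Extrapolate from the unconditional toward the conditional noise prediction."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


rng = random.Random(0)
z_i = [0.9, -0.2, 0.4]
trials = 20000
dropped = sum(maybe_drop(z_i, 0.05, rng) is None for _ in range(trials))
print(f"dropped fraction: {dropped / trials:.3f}")

# scale = 1 recovers the conditional prediction; larger scales push past it.
print(guided_prediction([1.0, 2.0], [0.0, 1.0], scale=1.25))
```

Training on both conditional and unconditional inputs is what lets a single network supply both predictions at sampling time.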

Human evaluators prefer the full unCLIP stack for photorealism.

Evaluators preferred the full unCLIP stack over the alternative 57.0 % ± 3.1 % of the time for photorealism.

Human evaluators also favor the full unCLIP stack for caption similarity.

Evaluators preferred the full unCLIP stack 53.1 % ± 3.1 % of the time for caption similarity.

Across all ablations the diffusion‑based prior consistently beats the autoregressive prior for comparable model size and training compute.

The prior is optional for generation but critical for text‑conditional alignment.

Human Evaluation Results

Human studies show unCLIP matches GLIDE in photorealism while surpassing it in diversity.

The prior maps text to CLIP image embeddings, and the decoder turns those embeddings into images; this two‑stage stack lets us assess generation quality with human judges.

unCLIP’s diffusion prior is preferred over GLIDE for diversity with probability 70.5 % (± 2.8 %).

Human judges chose the diffusion‑prior samples as more diverse in 70.5 % of pairwise comparisons (95 % CI ± 2.8 %).

Across all three metrics the diffusion prior consistently outperforms the AR prior, while GLIDE remains marginally ahead on photorealism.

**Table 1.** Human evaluations comparing unCLIP to GLIDE. We compare to both the AR and diffusion prior for unCLIP. Reported figures are 95% confidence intervals of the probability that the unCLIP model specified by the row beats GLIDE. Sampling hyperparameters for all models were swept to optimize an automated proxy for human photorealism evaluations.

Diversity and Fidelity

unCLIP attains a lower zero-shot FID than GLIDE while preserving diversity under guidance.

unCLIP (diffusion prior) achieves a zero-shot FID of 10.39 on MS-COCO 256×256, compared with 12.24 for GLIDE.

Table 2 reports FID 10.39 for unCLIP (diffusion prior), sampled with decoder guidance scale 1.25, versus 12.24 for GLIDE.

Benchmark Comparisons

We report unCLIP’s zero‑shot FID and aesthetic quality on MS‑COCO.

We evaluate unCLIP on the MS‑COCO validation set using the standard FID metric.
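For reference, FID is commonly defined as the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. The symbols below are introduced here for illustration and do not appear in the original text: $\mu_r, \Sigma_r$ are the mean and covariance of the real-image features, and $\mu_g, \Sigma_g$ those of the generated-image features.

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
$$

Lower values indicate that the generated feature distribution is closer to the real one, so both poor fidelity and poor diversity increase the score.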

unCLIP (diffusion prior) achieves a new state-of-the-art zero-shot FID of 10.39 on MS-COCO 256 × 256, surpassing DALL-E and GLIDE.

Table 2 shows unCLIP at 10.39, while DALL-E records approximately 28 and GLIDE 12.24.

**Table 2.** Comparison of FID on MS-COCO 256 × 256. We use guidance scale 1.25 for the decoder for both the AR and diffusion prior, and achieve the best results using the diffusion prior.

**Figure 12.** Random image samples on MS-COCO prompts.

For aesthetic quality we train a CLIP linear probe on AVA and score 2048 generated images per model.

Guidance raises the predicted aesthetic score for both GLIDE and unCLIP, and for unCLIP the improvement comes solely from guiding the decoder—guiding the prior harms quality.

Related Work and Limitations

unCLIP splits text‑to‑image generation into a CLIP‑based prior and a decoder, trading off fidelity for diversity.

Early text‑conditional generators adapted unconditional techniques: GANs trained on captioned datasets, VQ‑VAEs paired with autoregressive transformers, and diffusion models equipped with auxiliary text encoders.

Hierarchical pipelines later emerged, first sampling a coarse latent (Razavi et al.) or a low‑resolution VAE code (Child et al.) and then conditioning higher‑resolution stages on that representation (Gafni et al.).

Modeling in a latent space also yields compute savings: Preechakul et al. use an autoencoder whose latents are rendered by a diffusion model, Vahdat et al. apply a score-based model to VAE latents, and Rombach et al. apply diffusion to the latents of a VQ-GAN-like autoencoder.

Guidance mechanisms exploit CLIP or classifier signals. Galatolo et al. and Patashnik et al. back‑propagate CLIP gradients into GANs; Dhariwal & Nichol introduce classifier guidance for diffusion; Ho & Salimans replace the external classifier with classifier‑free guidance.

Several works condition generation directly on CLIP embeddings—Zhou et al. perturb image embeddings for GANs, Crowson conditions diffusion on CLIP text embeddings, and Wang et al. feed CLIP image embeddings to an autoregressive model.

Bordes et al. demonstrate a two‑stage scheme that samples image representations via kernel‑density estimation and feeds them to a diffusion decoder, a strategy reminiscent of our prior‑decoder stack but limited to image‑only representations.

GLIDE is a diffusion model that combines classifier‑free guidance with a text encoder, allowing it to steer image synthesis toward a prompt while preserving diversity.

unCLIP’s reliance on CLIP embeddings introduces three concrete weaknesses: (1) it often swaps colors or shapes when binding attributes to objects; (2) it fails to render exact text strings because the embedding does not preserve spelling; (3) its decoder starts from a 64 × 64 canvas, limiting fine‑grained detail in crowded scenes.

Beyond these shortcomings, the model carries downstream risks: more realistic outputs make it easier to pass off synthetic images as real, and the model can inherit and amplify biases present in the training data.

Assessing these risks requires considering the deployment context—training corpus, safety guardrails, user access policies, and the specific application domain. Mishkin et al. provide an early risk analysis for the DALL·E 2 Preview, the first platform to host an unCLIP model.

Appendix C

Training the unCLIP stack pushes memory and compute to the limits of current hardware.

At the scales reported in the paper, a single forward pass of the 3.5 B-parameter decoder alone consumes several gigabytes of GPU memory, and the full training pipeline requires thousands of GPU-hours.

The full unCLIP training consists of three stages—learning a CLIP image encoder, training a diffusion decoder, and fitting a prior that maps text embeddings to image embeddings—each with its own data source and schedule.

For a toy model with $P = 10^{7}$ parameters, forward-pass memory for weights: $P$ × 4 bytes ≈ 40 MB.

Activations per layer: $B \times S \times H$ = 2 × 256 × 64 = 32 k floats ≈ 128 KB; across 12 diffusion steps this totals ≈ 1.5 MB.

Total GPU memory per step ≈ 40 MB + 1.5 MB ≈ 41.5 MB, well within a single GPU.

Scaling to the paper’s $P_{\text{real}}\approx3.5\times10^{9}$ parameters multiplies the weight memory by 350, yielding ≈ 14 GB, which dominates the memory budget.

The example shows why the authors emphasize memory‑efficient schedules and why a production version must modify architecture or training length to stay within hardware limits.
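The memory estimate above can be reproduced directly. This is a sketch of the toy arithmetic only: the toy parameter count of $10^{7}$ is the value implied by the 40 MB figure, and the helper names are hypothetical. It ignores gradients, optimizer state, and framework overhead.

```python
BYTES_PER_FLOAT32 = 4


def weight_bytes(num_params: int) -> int:
    """Memory needed to hold the weights in float32."""
    return num_params * BYTES_PER_FLOAT32


def activation_bytes(batch: int, seq_len: int, hidden: int, repeats: int) -> int:
    """Memory for one B x S x H float32 activation tensor, kept `repeats` times."""
    return batch * seq_len * hidden * BYTES_PER_FLOAT32 * repeats


toy_params = 10_000_000       # implied by the 40 MB weight figure
real_params = 3_500_000_000   # decoder size quoted in the text

toy_weights = weight_bytes(toy_params)
toy_activations = activation_bytes(batch=2, seq_len=256, hidden=64, repeats=12)
real_weights = weight_bytes(real_params)

print(f"toy weights:     {toy_weights / 1e6:.1f} MB")
print(f"toy activations: {toy_activations / 2**20:.2f} MiB")
print(f"real weights:    {real_weights / 1e9:.1f} GB")
print(f"scale factor:    {real_params // toy_params}x")
```

Weight memory scales linearly with parameter count, which is why it dominates the budget once the model grows from the toy size to billions of parameters.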

**Table 3.** Hyperparameters for the models
