Representation Forcing for Bottleneck-Free Unified Multimodal Models

Yuqing Wang, Zhijie Lin, Ceyuan Yang, Yang Zhao, Fei Xiao, Hao He, Qi Zhao, Zihan Ding, Fuyun Wang, Shuai Wang, Youliang Zhang, Haoqi Fan, Xihui Liu

Representation Forcing eliminates external VAEs in multimodal models by training the decoder to predict its own visual representations.

How can we eliminate the structural bottleneck of frozen VAEs in unified multimodal models without sacrificing image generation quality?

Unified multimodal models rely on frozen, separately pretrained VAEs for image generation, creating a structural bottleneck that limits quality and prevents end-to-end learning. Representation Forcing (RF) solves this by training the model to autoregressively predict its own internal visual representations as intermediate tokens, which then serve as structural scaffolding for pixel-space diffusion. This approach closes the quality gap between pixel-space and VAE-based generation, allowing a fully end-to-end model to match state-of-the-art performance while improving visual understanding.

Paper Primer

RF turns the model's understanding encoder into a generator's guide. By discretizing the encoder's high-level features into tokens, the decoder learns to predict the "blueprint" of an image before rendering the pixels themselves.

The mechanism acts like a structural architect: the representation tokens define the scene's layout and object identity, while the diffusion process acts as the painter that fills in the fine-grained pixel details based on that scaffold.

RF-Pixel generation matches state-of-the-art VAE-based unified models.

On the GenEval benchmark, the RF-Pixel model achieves a score of 0.84 without a rewriter, matching BLIP3-o and outperforming the BAGEL baseline.

RF improves visual understanding performance.

In a controlled study across eight benchmarks, pixel-space models with RF outperformed their VAE-based counterparts on 6 out of 8 tasks, with significant gains in general visual understanding (e.g., +4.3 on MMMU).

Why is an external VAE considered a "bottleneck" in current unified models?

The VAE is optimized for reconstruction rather than the unified model's objectives, and its lossy compression imposes a hard upper bound on generation quality that further training cannot overcome.

Why not just use continuous regression to predict the visual features instead of discrete tokens?

Continuous regression suffers from error accumulation during sequential prediction, whereas discrete tokens reduce each position to a categorical choice, which is more robust and naturally encourages the model to prioritize high-level structure over fine-grained noise.

Researchers can now move toward fully end-to-end multimodal architectures by replacing frozen, external generative components with native, learned representation prediction.

The VAE Bottleneck in Multimodal Models

We expose the VAE bottleneck in unified multimodal models and introduce Representation Forcing to eliminate it.

Current unified multimodal models (UMMs) achieve joint perception and generation by sharing a transformer backbone, yet they still depend on a frozen VAE to decode images. This external component imposes a hard structural bottleneck: the latent space is optimized for reconstruction, not for the unified model’s objectives, limiting generation quality. Simply discarding the VAE forces the model to learn both high‑level semantics and low‑level pixel details from raw data, which creates a noticeable quality gap.

A UMM is a single neural architecture that processes text and images together, using one transformer to both understand inputs and generate outputs.

The VAE bottleneck is the fixed, pretrained encoder–decoder pair that compresses images into a latent code before diffusion, limiting the model’s ability to learn end‑to‑end visual generation.

As a toy illustration, suppose the VAE encoder reduces 1,024 pixel values to 512 latent values (compression factor 2×).

The diffusion model treats each latent value as a token, so N = 512.

One fp32 attention map (a single head in a single layer) takes N² × 4 bytes = 512² × 4 bytes ≈ 1 MB, which must be allocated for every training step.

This simple calculation shows how a frozen VAE forces the model to allocate a large attention matrix even for modest image sizes, illustrating the structural inefficiency the paper aims to remove.
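The arithmetic in this toy example can be checked with a few lines of Python. This is only a sketch: the function name `attention_map_bytes` is hypothetical, and it assumes a dense fp32 attention map for a single head in a single layer.

```python
def attention_map_bytes(num_tokens: int, bytes_per_value: int = 4) -> int:
    """Size in bytes of one dense num_tokens x num_tokens attention map."""
    return num_tokens * num_tokens * bytes_per_value


pixel_values = 1024   # values entering the VAE encoder in the toy example
latent_values = 512   # values leaving it, each treated as one token

compression = pixel_values / latent_values        # 2.0
map_bytes = attention_map_bytes(latent_values)    # 512 * 512 * 4
map_mib = map_bytes / (1024 * 1024)               # 1.0

print(compression, map_bytes, map_mib)
```

Because the cost grows with the square of the token count, halving the number of tokens cuts this figure by a factor of four.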

**Figure 1.** Architectural comparison. (a) Prevailing UMMs rely on a frozen VAE encoder and decoder for image generation, creating a structural bottleneck. (b) Naively removing the VAE and generating directly in pixel space eliminates this bottleneck but loses structural guidance, leading to a quality gap. (c) Representation Forcing closes this gap by training the transformer decoder to autoregressively predict visual representations (Rep head) before pixel generation. These representations are trained to match features from the model's own understanding encoder and remain in context within the shared transformer, providing structural guidance for pixel-space diffusion without any external latent space.

The structural limitation of frozen VAEs in current UMMs prevents end‑to‑end multimodal learning.

The Representation Forcing Mechanism

Align encoder representations with decoder generation to guide pixel‑space diffusion.

The encoder already knows how to map an image to high‑level visual tokens, but the decoder has no way to reproduce that structure when generating from text alone.

Force the decoder to predict the same high‑level visual tokens that the encoder extracts, so generation starts from a structural scaffold before rendering pixels.

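As a worked example, suppose the encoder produces four features $f_1,\dots,f_4$ and the codebook holds four prototypes $c_1,\dots,c_4$. Compute cosine similarity between each $f_i$ and all $c_j$; assign $f_1\to c_3$, $f_2\to c_1$, $f_3\to c_4$, $f_4\to c_2$.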

Replace each feature by the index of its nearest prototype, producing the token sequence $[3,1,4,2]$.

During training the decoder sees the sequence $[ \text{<text>}, 3,1,4,2, \text{<pixel patches>} ]$ and learns to predict the four indices after the text.

At inference the decoder first generates $[3,1,4,2]$ from the prompt, then feeds them to the diffusion module.

The discrete tokens act as a compact, spatially ordered blueprint; the diffusion process only needs to fill in low‑level pixel details consistent with that blueprint.
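The nearest-prototype step in the worked example can be sketched as follows. The 2-D feature and prototype values are made up for illustration (they are not from the paper) and were chosen so that the assignments reproduce the token sequence $[3,1,4,2]$; the helper names `cosine` and `quantize` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def quantize(features, prototypes):
    """Replace each feature by the 1-based index of its most similar prototype."""
    tokens = []
    for f in features:
        sims = [cosine(f, c) for c in prototypes]
        tokens.append(sims.index(max(sims)) + 1)
    return tokens


# Illustrative values only: c_1..c_4 point along the four axis directions.
prototypes = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
features = [(-0.9, 0.1), (0.8, 0.2), (0.1, -0.7), (0.2, 0.9)]  # f_1..f_4

tokens = quantize(features, prototypes)
print(tokens)  # [3, 1, 4, 2]
```

The decoder only ever sees the integer indices, so predicting a representation token is an ordinary classification over the codebook.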

How does Representation Forcing differ from ordinary teacher forcing used in language models?

Ordinary teacher forcing feeds the ground‑truth previous token at training time, but the model never learns to generate a separate visual scaffold. Representation Forcing introduces a dedicated visual token stream that the decoder must predict from text, and those tokens become part of the model’s own context at inference, directly shaping the subsequent pixel diffusion.

After the decoder has produced the high‑level representation tokens, it fills in the actual image pixels by running a diffusion process that is conditioned on those tokens.

Why use flow‑matching instead of the more common denoising‑diffusion objective?

Flow‑matching directly predicts the velocity that would move a noisy sample toward the clean image, yielding a simple $L_2$ loss with no need to predict the noise itself. This avoids the variance issues of noise‑prediction losses at early timesteps and fits naturally with the unified autoregressive token stream.
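A minimal sketch of the flow-matching target may help here. It assumes a linear interpolation path with $t=0$ at pure noise and $t=1$ at the clean image, which is one common convention and not necessarily the one used in the paper; all function names are hypothetical.

```python
import random


def interpolate(noise, image, t):
    """Point on the straight path from noise (t=0) to image (t=1)."""
    return [(1.0 - t) * n + t * x for n, x in zip(noise, image)]


def target_velocity(noise, image):
    """Constant velocity that carries the noise sample to the image."""
    return [x - n for n, x in zip(noise, image)]


def flow_matching_loss(pred_velocity, noise, image):
    """Mean squared error between predicted and target velocity."""
    target = target_velocity(noise, image)
    return sum((p - v) ** 2 for p, v in zip(pred_velocity, target)) / len(target)


rng = random.Random(0)
image = [0.2, -0.5, 0.9]                      # toy "clean" sample
noise = [rng.gauss(0.0, 1.0) for _ in image]  # Gaussian starting point
x_t = interpolate(noise, image, 0.3)          # noisy input the model would see

# A model that predicts the exact velocity incurs zero loss.
loss_perfect = flow_matching_loss(target_velocity(noise, image), noise, image)
print(loss_perfect)  # 0.0
```

The regression target is the same at every timestep along the path, which is what keeps the loss a plain $L_2$ term.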

**Figure 3.** Training pipeline of Representation Forcing. Left: The decoder processes a unified sequence of text tokens (T), representation tokens (R), and pixel patches (P) within a shared transformer. Text and representation tokens are predicted autoregressively under next-token prediction ($\mathcal{L}_{LM}$ and $\mathcal{L}_{Rep}$), while pixel patches are generated via bidirectional diffusion from noise ($\mathcal{L}_{FM}$). The image encoder provides continuous visual features to the transformer for understanding tasks. Right: For generation training, an EMA copy of the image encoder extracts features from the ground-truth image, which are discretized via online quantization into representation tokens. These tokens provide both the training targets for $\mathcal{L}_{Rep}$ and the teacher-forcing inputs at R positions. At inference, the right panel is bypassed entirely: the decoder predicts representation tokens from the text prompt alone, and these tokens remain in context to guide pixel-space diffusion.
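One way to picture the mixed causal and bidirectional attention described in the caption is as a boolean mask over the unified sequence. The sketch below is an assumption about how such a mask could be built, not the paper's implementation: text (T) and representation (R) positions are causal, while pixel patches (P) see every earlier token and every other pixel patch.

```python
def build_attention_mask(kinds):
    """Return mask[i][j] == True when position i may attend to position j.

    kinds is the sequence layout, e.g. ["T", "T", "R", "R", "P", "P"].
    """
    n = len(kinds)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            causal = j <= i                                   # next-token prediction
            both_pixel = kinds[i] == "P" and kinds[j] == "P"  # bidirectional diffusion
            mask[i][j] = causal or both_pixel
    return mask


layout = ["T", "T", "R", "R", "P", "P"]
mask = build_attention_mask(layout)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Under this layout the representation tokens stay visible to every pixel patch, which is how they can act as context for the diffusion stage.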

**Figure 4.** Qualitative comparison of pixel-space generation with and without RF. Without RF, the model tends to produce images with poor structure, such as distorted object shapes and incoherent compositions. With RF, the model generates more coherent structures by first predicting high-level visual representations before pixel rendering, which provides explicit structural guidance for the diffusion process.

Experimental Setup

We detail the architecture, data, training schedule, and baselines used in experiments.

Our system builds on Qwen3-A3B, a Mixture‑of‑Experts language model that activates roughly 3 B parameters per token, and adopts the Mixture‑of‑Transformers (MoT) design where all tokens share self‑attention but are routed to one of three modality‑specific feed‑forward expert pools: understanding, representation prediction, and pixel generation.
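The routing idea can be sketched in a few lines. Everything here is illustrative: the toy feed-forward functions, the pool names, and the mapping from token kind (T, R, P) to pool are assumptions made for the example, and the shared self-attention step is omitted.

```python
def make_toy_ffn(scale):
    """Stand-in for a feed-forward expert pool; it just scales its input."""
    def ffn(hidden):
        return [scale * value for value in hidden]
    return ffn


# Hypothetical pools, one per modality role named in the text.
EXPERT_POOLS = {
    "understanding": make_toy_ffn(1.0),
    "representation": make_toy_ffn(2.0),
    "pixel": make_toy_ffn(3.0),
}

# Assumed mapping from token kind to pool.
KIND_TO_POOL = {"T": "understanding", "R": "representation", "P": "pixel"}


def route(hidden_states, kinds):
    """Send each token's hidden state through the pool for its kind."""
    return [EXPERT_POOLS[KIND_TO_POOL[kind]](hidden)
            for hidden, kind in zip(hidden_states, kinds)]


routed = route([[1.0], [1.0], [1.0]], ["T", "R", "P"])
print(routed)  # [[1.0], [2.0], [3.0]]
```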

The image encoder is DINOv3 ViT‑H+/16 equipped with NaViT‑style variable‑resolution support, and we train it jointly with the language backbone.

For pixel‑space generation we quantize visual features with a codebook of $K = 16{,}384$ prototypes, use $16\times16$ patches, and predict pixel values via x‑prediction with a velocity loss.

A $2\times2$ pooling factor reduces the token sequence length, yielding $N$ representation tokens for every $4N$ pixel patches while preserving a shared spatial layout.
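The token counts implied by these settings can be worked out directly. The sketch assumes the representation grid is obtained by pooling the $16\times16$ patch grid by $2\times2$, which matches the stated $N$ to $4N$ ratio; the function name `token_counts` is hypothetical.

```python
def token_counts(height, width, patch=16, pool=2):
    """Return (representation_tokens, pixel_patches) for one image."""
    if height % (patch * pool) or width % (patch * pool):
        raise ValueError("image size must be a multiple of patch * pool")
    patches_h, patches_w = height // patch, width // patch
    pixel_patches = patches_h * patches_w
    rep_tokens = (patches_h // pool) * (patches_w // pool)
    return rep_tokens, pixel_patches


print(token_counts(256, 256))    # (64, 256)
print(token_counts(1024, 1024))  # (1024, 4096)
```

At $1024\times1024$ the decoder therefore predicts 1,024 representation tokens before denoising 4,096 pixel patches.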

Training data follows the BAGEL pipeline, mixing pure text corpora with large‑scale text–image pairs that span image‑to‑text understanding (VQA, document comprehension, spatial reasoning) and text‑to‑image generation.

We employ a three‑stage training schedule: (i) alignment – with the backbone and encoder frozen, only the MLP connector is trained for $10{,}000$ iterations; (ii) joint pre‑training – all components are unfrozen and optimized on text and text‑image pairs up to $256\times256$ resolution for $50{,}000$ iterations; (iii) continued training – resolution is increased to $1024\times1024$ for a further $20{,}000$ iterations, with image sizes sampled dynamically per stage and batched using NaViT‑style variable‑resolution packing.
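For reference, the three stages can be written down as a small configuration table. The component labels (`connector`, `backbone`, `encoder`) are shorthand chosen for this sketch, and the Stage 1 resolution is left as `None` because it is not stated above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    iterations: int
    trainable: Tuple[str, ...]
    max_resolution: Optional[int]  # None where the text gives no value


SCHEDULE = (
    Stage("alignment", 10_000, ("connector",), None),
    Stage("joint_pretraining", 50_000, ("backbone", "encoder", "connector"), 256),
    Stage("continued_training", 20_000, ("backbone", "encoder", "connector"), 1024),
)

total_iterations = sum(stage.iterations for stage in SCHEDULE)
print(total_iterations)  # 80000
```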

For controlled comparison we train VAE‑based baselines using the WanX‑2.1 VAE, swapping pixel inputs/outputs for VAE latents while keeping architecture, data, and optimization identical; the four variants are Pixel, Pixel+RF, VAE, and VAE+RF.

Qwen3-A3B is a Mixture‑of‑Experts language model with roughly 3 billion activated parameters: it routes each token to a small subset of expert feed‑forward layers, enabling high capacity without proportional compute.

Generation Performance

RF‑Pixel matches SOTA unified models on text‑to‑image benchmarks while staying fully pixel‑space.

RF‑Pixel matches state‑of‑the‑art unified models on GenEval and DPG‑Bench while using no pretrained VAE.

Table 1 shows RF‑Pixel scoring 0.88 on GenEval (with LLM rewriter), comparable to the best VAE‑based unified models.

The improvement stems from aligning encoder and decoder representations, which eliminates the VAE bottleneck that limits conventional unified models.

GenEval measures how well a model composes multiple objects from a textual prompt, while DPG‑Bench tests dense, fine‑grained prompt following.

**Figure 2.** Text-to-image generation results at 1024 × 1024 resolution from our pixel-space unified model with Representation Forcing.

Ablations and Understanding

We assess how removing each component impacts image understanding and generation performance.

The paper’s core premise is that aligning encoder representations with the decoder (Representation Forcing) lets a unified model generate directly in pixel space, bypassing the VAE bottleneck.

RF boosts MMMU by +4.3 points.

Table 2 shows a 4.3‑point gain for the pixel‑space model with RF.

RF lifts both MME and BLINK by +3.6 points each.

Table 2 reports identical 3.6‑point improvements for these two benchmarks.

RF improves AI2D by +4.5 points.

Table 2 records a 4.5‑point jump when RF is added to pixel‑space generation.

RF raises RealWorldQA by +2.7 points.

Table 2 lists a 2.7‑point increase for the pixel‑space model with RF.

RF slightly hurts DocVQA (‑2.0 points).

Table 2 shows a 2‑point drop for the pixel‑space model with RF.

RF slightly hurts ChartQA (‑0.4 points).

Table 2 records a 0.4‑point decline for the pixel‑space model with RF.

RF adds +5.6 points on HalluBench for VAE‑based generation.

Table 2 reports the VAE + RF row beating VAE‑only by 5.6 points.

RF adds +8.0 points on MME for VAE‑based generation.

Table 2 shows an 8‑point gain when RF is applied to the VAE pipeline.

Pixel + RF outperforms VAE + RF on six of eight benchmarks.

Table 2’s per‑benchmark comparison shows six wins for Pixel + RF.

RF raises pixel‑space generation score from 0.25 to 0.76 (+0.51).

Table 3a records 0.25 without RF and 0.76 with RF.

RF raises VAE generation score from 0.52 to 0.77 (+0.25).

Table 3a shows 0.52 without RF and 0.77 with RF.

Figure 4 visualizes the structural collapse of pixel‑space outputs when RF is omitted, confirming the quantitative gap.

RF outperforms REPA (0.76 vs 0.43, +0.33).

Table 3b lists 0.43 for REPA and 0.76 for RF under identical settings.

Discrete token formulation reaches 0.76, far above continuous regression’s 0.26 (+0.50).

Table 3c reports 0.76 for discrete and 0.26 for continuous.

Increasing codebook size to $K\!=\!32{,}768$ yields 0.77 vs 0.76 (+0.01).

Table 3d shows 0.76 for $K\!=\!16{,}384$ and 0.77 for $K\!=\!32{,}768$.

DINOv3 beats SigLIP2 on four of five understanding benchmarks.

Table 3e lists DINOv3 as best in four columns.

Limitations and Discussion

Discussion of limits, implementation specifics, and broader impact of Representation Forcing.

Because of limited compute, we start from a large pretrained language model instead of training a multimodal model from scratch; this gives a strong language‑grounded base but may miss richer joint representations that full‑scale multimodal pretraining could provide.

Our experiments focus exclusively on still‑image generation, leaving video and other temporal modalities to future work.

We train with AdamW ($\beta_{1}=0.9$, $\beta_{2}=0.95$, $\epsilon=10^{-8}$, weight decay $0.1$, gradient clipping $1.0$). The learning rate warms up linearly then stays constant at $5\times10^{-5}$ for Stages 1–2 and $2.5\times10^{-5}$ for Stage 3; generation‑related parameters receive a $4\times$ multiplier while the LLM backbone keeps the base rate.
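The learning-rate rule can be sketched as a small function. The warmup length is not given above, so `warmup_steps` is a free parameter here, and restarting the warmup within each stage is likewise an assumption of this sketch.

```python
BASE_LR = {1: 5e-5, 2: 5e-5, 3: 2.5e-5}  # per-stage constant rate from the text
GENERATION_MULTIPLIER = 4.0              # applied to generation-related parameters


def learning_rate(step, stage, warmup_steps, is_generation_param):
    """Linear warmup to the stage's base rate, then constant."""
    if warmup_steps > 0:
        warmup = min(1.0, (step + 1) / warmup_steps)
    else:
        warmup = 1.0
    multiplier = GENERATION_MULTIPLIER if is_generation_param else 1.0
    return BASE_LR[stage] * warmup * multiplier


# warmup_steps=1000 is an illustrative value, not one taken from the paper.
backbone_lr = learning_rate(step=5_000, stage=2, warmup_steps=1000, is_generation_param=False)
generation_lr = learning_rate(step=5_000, stage=3, warmup_steps=1000, is_generation_param=True)
print(backbone_lr, generation_lr)
```

In a real optimizer this would typically be expressed as two parameter groups sharing one schedule.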

Sequences of $32{,}768$ tokens are packed using NaViT‑style variable‑resolution batching, and the online vector‑quantization codebook is updated with momentum $0.9999$, one Sinkhorn‑Knopp iteration, and temperature $0.5$; classifier‑free guidance drops the text condition and the full representation token sequence independently with probability $0.1$.
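The independent condition dropout used for classifier-free guidance can be sketched as below. How a dropped condition is represented is not specified above, so the empty-tuple placeholders are an assumption, as is the function name `drop_conditions`.

```python
import random


def drop_conditions(text_tokens, rep_tokens, rng, p_drop=0.1,
                    null_text=(), null_rep=()):
    """Independently replace each condition with a null placeholder."""
    text = null_text if rng.random() < p_drop else text_tokens
    rep = null_rep if rng.random() < p_drop else rep_tokens  # whole sequence at once
    return text, rep


rng = random.Random(0)
text, rep = drop_conditions(("a", "red", "cube"), (3, 1, 4, 2), rng)
print(text, rep)
```

Because the two draws are independent, training sees all four combinations: both conditions kept, only text dropped, only representation tokens dropped, and both dropped.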

At inference we use an exponential moving average of the parameters (decay $0.9999$) and generate in two stages: first the decoder autoregressively emits the full representation token sequence from the text prompt using top‑k sampling, then a flow‑matching denoiser runs $25$ steps with dynamic timestep shifting to turn Gaussian noise into pixel patches.
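The second stage can be illustrated with a plain Euler integrator over 25 uniform steps. This is a simplified sketch: it omits the dynamic timestep shifting, assumes time runs from $t=0$ (noise) to $t=1$ (image), and replaces the learned denoiser with a toy oracle velocity field that points at a known target.

```python
def euler_sample(velocity_fn, noise, num_steps=25):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


TARGET = [0.2, -0.5, 0.9]  # toy stand-in for clean pixel values


def toy_velocity(x, t):
    """Oracle velocity on a straight noise-to-target path (not a trained model)."""
    return [(target - xi) / (1.0 - t) for target, xi in zip(TARGET, x)]


sample = euler_sample(toy_velocity, noise=[1.3, 0.4, -2.1])
print(sample)
```

In the real system the velocity would come from the shared transformer, conditioned on the text and the previously generated representation tokens.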

Two‑condition classifier‑free guidance is applied with $w_{\text{rep}}=2.0$ for the representation sampling step and $w_{\text{pix}}=3.0$ for the pixel‑patch denoising step.
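The exact two-condition formula is not spelled out above, so the sketch below only shows the standard classifier-free guidance combination applied once per stage with the stated weights. Treat it as an assumption about the general shape, not as the paper's rule; the input values are made up.

```python
W_REP = 2.0  # guidance weight for representation-token sampling
W_PIX = 3.0  # guidance weight for pixel-patch denoising


def guide(conditional, unconditional, weight):
    """Standard CFG: move from the unconditional output toward the conditional one."""
    return [u + weight * (c - u) for c, u in zip(conditional, unconditional)]


# Stage 1: guided logits over the codebook for the next representation token.
rep_logits = guide([2.0, 0.5, -1.0], [1.0, 0.5, 0.0], W_REP)

# Stage 2: guided velocity for one pixel patch at one denoising step.
pix_velocity = guide([0.3, -0.2], [0.1, 0.0], W_PIX)

print(rep_logits)
print(pix_velocity)
```

A weight of 1 recovers the conditional output unchanged, and larger weights push further away from the unconditional prediction.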

Algorithm 1 – Online Vector Quantization (PyTorch‑like pseudocode)
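The algorithm listing itself is not reproduced here. As a stand-in, the following plain-Python sketch combines the ingredients named above (cosine similarity, temperature $0.5$, one Sinkhorn-Knopp iteration, momentum $0.9999$). The order of the normalization steps, the per-prototype mean update, and all function names are assumptions, so this should not be read as the paper's Algorithm 1.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def online_vq_step(features, codebook, momentum=0.9999, temperature=0.5,
                   sinkhorn_iters=1):
    """Assign features to prototypes and update the codebook with an EMA."""
    feats = [normalize(f) for f in features]
    protos = [normalize(c) for c in codebook]
    k = len(protos)
    # Temperature-scaled cosine similarities as unnormalized assignment scores.
    q = [[math.exp(sum(a * b for a, b in zip(f, c)) / temperature) for c in protos]
         for f in feats]
    for _ in range(sinkhorn_iters):
        # Balance prototype usage (columns), then renormalize each feature (rows).
        col_sums = [sum(row[j] for row in q) for j in range(k)]
        q = [[row[j] / col_sums[j] for j in range(k)] for row in q]
        q = [[value / sum(row) for value in row] for row in q]
    tokens = [row.index(max(row)) for row in q]  # 0-based prototype indices
    new_codebook = []
    for j, c in enumerate(protos):
        assigned = [f for f, t in zip(feats, tokens) if t == j]
        if assigned:
            mean = [sum(vals) / len(assigned) for vals in zip(*assigned)]
            c = normalize([momentum * a + (1.0 - momentum) * b
                           for a, b in zip(c, mean)])
        new_codebook.append(c)
    return tokens, new_codebook


# Illustrative 2-D data only.
codebook = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
features = [(-0.9, 0.1), (0.8, 0.2), (0.1, -0.7), (0.2, 0.9)]
tokens, codebook = online_vq_step(features, codebook)
print(tokens)  # [2, 0, 3, 1] (0-based indices)
```

With momentum this close to 1 the prototypes drift very slowly, so the token targets stay nearly stable from one training step to the next.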

Like other text‑to‑image systems, Representation Forcing can be misused to create misleading or harmful visual content; standard safeguards such as safety filters, output watermarking, and controlled access remain applicable.
