Qwen-Image-Flash: Beyond Objective Design

Tianhe Wu, Kun Yan, Zikai Zhou, Lihan Jiang, Jiahao Li, Jie Zhang, Kaiyuan Gao, Ningyuan Tang, Shengming Yin, Xiaoyue Chen, Xiao Xu, Yilei Chen, Yuxiang Chen, Yan Shu, Yixian Xu, Yanran Zhang, Zihao Liu, Zhendong Wang, Zekai Zhang, Deqing Li, Liang Peng, Yi Wang, Jingren Zhou, Chenfei Wu

Qwen-Image-Flash optimizes few-step distillation by balancing training data, multi-teacher guidance, and task mixtures.

How can we improve few-step image generation by optimizing data composition and multi-teacher guidance rather than just distillation objectives?

Few-step distillation for visual generative models often fails when applied to complex, heterogeneous tasks, as standard training recipes struggle to transfer capabilities from multi-step teachers to efficient students. The authors identify that distillation performance depends on the broader training pipeline rather than just the objective, and they introduce Qwen-Image-Flash: a 4-NFE model trained using coherent data selection, step-wise multi-teacher guidance, and balanced task mixtures. This approach allows the student to inherit complementary strengths from specialized teachers while maintaining stable optimization, achieving competitive performance against 80-NFE teacher models.

Paper Primer

The core move is a systems-level redesign of the distillation pipeline. The method uses step-wise multi-teacher guidance: it anchors the student to a stable base teacher while selectively incorporating task-specific score fields from specialized teachers to avoid the instability of naive specialized-teacher distillation.

Qwen-Image-Flash achieves competitive performance against 80-NFE teachers using only 4 function evaluations (NFEs).

Quantitative evaluation on T2I-Bench and Editing-Bench shows the student model surpasses the base teacher in overall ranking while maintaining high visual fidelity. The model achieves an average Gemini 3.1 Pro score of 3.56 on T2I tasks, outperforming the 80-NFE Qwen-Image-2.0-Base teacher.

Data composition is non-intuitive: increasing diversity or adding target-specific data (like text-centric samples) can degrade performance. Instead, coherent single-category data provides a more favorable interface for transferring general synthesis ability, which then generalizes to challenging downstream tasks.

Why does adding more diverse data or specialized text-centric data hurt the student model?

The authors observe that heterogeneous data can dilute or destabilize the transfer process. In few-step distillation, training data determines how the teacher's distributional guidance is exposed to a capacity-limited student; adding irrelevant or conflicting data introduces optimization difficulties that outweigh the benefits of broader coverage.

How does the joint distillation of T2I generation and image editing affect the model's performance?

Joint distillation requires a balanced T2I-to-editing data ratio (e.g., 5:5). Surprisingly, editing supervision provides complementary visual-textual signals that improve the student's T2I generation capability compared to a T2I-only baseline.

Introduction and Motivation

We expose why few‑step distillation stalls without a well‑crafted training recipe.

Few‑step distillation compresses a multi‑step teacher into a fast student, but prior work treats the training recipe as an afterthought. In practice, the quality of the training data and the stability of multi‑teacher guidance dominate the student’s performance. This section isolates those recipe factors and shows how they reshape the distillation problem.

Distilling a large visual generator into a few‑step student succeeds only when the training pipeline is organized around data composition, teacher guidance, and balanced task mixing, rather than relying solely on a clever loss.

**Figure 1.** Qwen-Image-Flash examples. T2I and instruction-guided editing results with only 4 NFEs, showing unified few-step generation-editing capability.

Shift the focus of few‑step distillation from objective design to a data‑ and guidance‑centric training recipe.

Foundations: Flow Matching and DMD

Introduce Flow Matching and DMD, the two mechanisms enabling few‑step distillation.

The method builds on a continuous‑time transport view and a distribution‑matching distillation to compress a multi‑step teacher into a few‑step student. We first describe the flow‑matching framework that defines a velocity field, then introduce DMD that aligns noisy score fields.

Flow Matching treats generation as moving data points along a prescribed continuous path toward noise, learning the velocity that drives that motion.

How does Flow Matching differ from score‑matching approaches?

Flow Matching learns a vector field that directly predicts the instantaneous displacement between data and noise, whereas score matching learns the gradient of the log‑density; the former integrates an ODE to generate samples, the latter typically uses Langevin dynamics.

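Take a scalar example with data $x = 1$, noise $\epsilon = 3$, and $t = 0.5$, using the linear path $z_t = (1-t)\,x + t\,\epsilon$. Then $z_{0.5} = (1-0.5)\cdot1 + 0.5\cdot3 = 2$.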

True velocity is $\epsilon - x = 2$.

Assume the model predicts $v_\theta(z_{0.5},0.5,c)=1.8$.

Loss contribution: $(1.8 - 2)^2 = 0.04$.

Gradient update reduces the error, pushing $v_\theta$ toward the constant displacement $2$.

For a fixed pair $(x, \epsilon)$, the target velocity $\epsilon - x$ is the same at every $t$ along the path. The network, however, sees only $z_t$, $t$, and $c$, so what it learns is the average displacement over all pairs whose paths pass through that point.
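The arithmetic of the worked example above can be checked with a few lines of Python. This is a minimal sketch, not the paper's training code: the values $x = 1$, $\epsilon = 3$, $t = 0.5$ are read off the example, the function names are illustrative, and the "oracle" velocity used for sampling is an assumption that stands in for a trained network.

```python
def interpolate(x, eps, t):
    """Point on the linear path z_t = (1 - t) * x + t * eps (t=0 is data, t=1 is noise)."""
    return (1.0 - t) * x + t * eps


def target_velocity(x, eps):
    """Time derivative of the linear path; constant for a fixed (x, eps) pair."""
    return eps - x


def squared_error(v_pred, v_target):
    """Per-sample flow-matching loss contribution."""
    return (v_pred - v_target) ** 2


def euler_sample(z_noise, velocity_fn, num_steps):
    """Integrate dz/dt = v from t=1 (noise) back to t=0 with fixed-size Euler steps."""
    dt = 1.0 / num_steps
    z, t = z_noise, 1.0
    for _ in range(num_steps):
        z = z - dt * velocity_fn(z, t)
        t -= dt
    return z


x, eps, t = 1.0, 3.0, 0.5
z_half = interpolate(x, eps, t)          # location on the path at t = 0.5
v_star = target_velocity(x, eps)         # regression target
loss = squared_error(1.8, v_star)        # loss for the assumed prediction 1.8

# Toy oracle: returns the exact velocity for this single pair (not a trained network).
x_recovered = euler_sample(eps, lambda z, t: v_star, num_steps=4)

print(z_half, v_star, round(loss, 4), x_recovered)
```

With the oracle velocity, four Euler steps walk from the noise value back to the data value, which is the sense in which a flow-matching model generates samples by integrating an ODE; a real model replaces the lambda with a learned $v_\theta$.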

DMD compresses a multi‑step teacher into a few‑step student by aligning their noisy marginal score fields across intermediate noise levels.

Why does DMD use a score‑difference estimator instead of directly minimizing the KL divergence?

The KL involves the noisy densities of the teacher and the student, which are intractable to evaluate. Its gradient, however, depends only on the difference between their scores, and that difference can be estimated from samples using the teacher and a learned score network for the student, avoiding the need to compute either density.

Take a scalar example with $t = 0.5$ and noise $\epsilon = 0.5$. Student output: $x_\theta = 2$.

Noisy intermediate: $x_t = (1-0.5)\cdot2 + 0.5\cdot0.5 = 1.25$.

Assume teacher score (Gaussian $\mathcal{N}(1,1)$) is $s_{\text{real}} = -(x_t-1) = -0.25$.

Assume student score (standard normal) is $s_{\text{stu}} = -x_t = -1.25$.

Score difference: $s_{\text{stu}} - s_{\text{real}} = -1.0$.

Jacobian $\nabla_\theta x_\theta = 1$, so gradient update = $1 \times (-1.0) = -1.0$.

Matching scores at the noisy intermediate pushes the student’s output distribution toward the teacher’s, even though the KL is never computed directly.
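The same numbers can be reproduced in code. The sketch below is only an illustration of the score-difference arithmetic: it assumes $t = 0.5$ and noise $\epsilon = 0.5$ (read off the example), uses closed-form unit-variance Gaussian scores in place of the teacher and student score networks, and treats the Jacobian as the scalar 1 as in the text. Function names are hypothetical.

```python
def noisy_intermediate(x0, eps, t):
    """Re-noise a student sample along the linear path x_t = (1 - t) * x0 + t * eps."""
    return (1.0 - t) * x0 + t * eps


def gaussian_score(x, mean, var=1.0):
    """Score of N(mean, var): d/dx log p(x) = -(x - mean) / var."""
    return -(x - mean) / var


x_theta = 2.0   # student output
eps = 0.5       # assumed noise draw
t = 0.5         # assumed noise level

x_t = noisy_intermediate(x_theta, eps, t)

# Stand-ins for the two score networks (assumptions, not learned models).
s_real = gaussian_score(x_t, mean=1.0)  # teacher: N(1, 1)
s_stu = gaussian_score(x_t, mean=0.0)   # student: standard normal

score_diff = s_stu - s_real
jacobian = 1.0                          # d x_theta / d theta in the toy example
grad = jacobian * score_diff

print(x_t, s_real, s_stu, score_diff, grad)
```

In an actual DMD setup both scores come from networks evaluated at the re-noised sample, and the gradient is averaged over many samples and noise levels rather than computed for one point.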

Together, Flow Matching provides a continuous‑time generative backbone, while DMD supplies a tractable way to distill a multi‑step teacher into a few‑step student. These mechanisms form the foundation for the subsequent data‑composition optimizations.

Optimizing Data Composition

Data composition determines T2I distillation performance, with single‑category training often outperforming larger mixes.

We investigate how the composition of distillation data shapes T2I student performance, focusing on both general image generation and challenging text‑centric synthesis.

Choosing which prompt categories to use for distillation directly controls how well the student learns to generate images, especially when rendering text.

Why does adding text‑centric data to a mixed‑category set hurt text rendering instead of helping?

Because the student’s limited capacity and short training horizon cannot simultaneously absorb diverse visual styles and the specialized patterns needed for reliable text synthesis; the heterogeneous signal interferes with the teacher’s guidance, causing optimization instability.

Portrait‑only distillation attains the top overall rank (1) despite using only 20k samples.

Table 1 shows the portrait‑only experiment (E2) achieving the highest rank among all five configurations.

**Table 1.** Quantitative comparison of T2I distillation under different training data compositions. We evaluate 4-NFE students distilled with different category-specific and mixed-category training sets on landscape, portrait, and text-centric splits of T2I-Bench.

**Figure 2.** Qualitative comparison of T2I distillation under different training data compositions. We compare students distilled with text-centric, mixed-category, landscape-only, landscape-portrait, and portrait-only training data across representative evaluation scenarios. The results show that text-centric or more diverse mixed-category data does not necessarily improve text rendering or overall visual quality. In contrast, students trained on coherent single-category data, such as landscape or portrait prompts, produce more faithful and visually stable results, suggesting stronger cross-category transfer and underscoring the importance of data composition in few-step distillation.

Data composition is critical for text‑centric synthesis; adding diversity or text‑centric samples indiscriminately can degrade performance.

Stabilizing Multi-Teacher Guidance

Step‑wise multi‑teacher guidance stabilizes few‑step distillation while adding complementary expertise.

Few‑step distillation must stay stable across diverse downstream tasks. Directly using a task‑specialized teacher often breaks this stability.

We keep a pretrained base teacher as a stable anchor and gradually blend in task‑specific teachers at selected steps, so the student receives both general and specialized signals without a sudden distribution shift.

Step 1: combined score $s^{(1)}_{\text{real}} = 0.8\,s_0 + 0.2\,s_1$ – the base teacher dominates.

Step 2: combined score $s^{(2)}_{\text{real}} = 0.5\,s_0 + 0.5\,s_1$ – equal contribution, letting the specialized signal shape the student.

Because the weights change smoothly, the student never sees a sudden jump from $s_0$ to $s_1$.

Gradual weight shifting preserves gradient stability while still delivering the specialized teacher’s advantages later in training.
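The two-step example above amounts to a convex combination of teacher scores with a step-dependent weight. The following is a minimal sketch under that reading; the weights 0.2 and 0.5 come from the example, while the score vectors, the function name, and the dictionary layout are illustrative assumptions rather than the authors' implementation.

```python
def blend_scores(s_base, s_spec, w_spec):
    """Convex combination (1 - w) * s_base + w * s_spec, applied element-wise."""
    if not 0.0 <= w_spec <= 1.0:
        raise ValueError("specialist weight must lie in [0, 1]")
    if len(s_base) != len(s_spec):
        raise ValueError("score vectors must have the same length")
    return [(1.0 - w_spec) * b + w_spec * s for b, s in zip(s_base, s_spec)]


# Specialist weight per step, taken from the two-step example in the text.
SPECIALIST_WEIGHT = {1: 0.2, 2: 0.5}

s0 = [1.0, -2.0]  # toy base-teacher score (assumed values)
s1 = [3.0, 0.0]   # toy specialized-teacher score (assumed values)

guidance = {
    step: blend_scores(s0, s1, w_spec)
    for step, w_spec in SPECIALIST_WEIGHT.items()
}

for step, score in guidance.items():
    print(step, score)
```

Because the weight on the base teacher never drops to zero in this sketch, the blended score always stays between the two teachers' scores, which is the property the text relies on for stability.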

How does this differ from simply switching to the specialized teacher after a few steps?

We never replace the base teacher; instead we blend its score with the specialized teacher’s score at every step. The convex combination guarantees a smooth transition, whereas a hard switch would create a discontinuous change in the guidance distribution, leading to the instability observed in the direct‑specialized experiments.

**Figure 3.** Qualitative comparison of teacher guidance strategies during distillation. (a) Direct guidance from a task-specialized teacher can destabilize training, leading to progressive degradation in alignment and visual quality. (b) Step-wise multi-teacher guidance maintains sample fidelity and layout consistency throughout distillation, yielding better-aligned generations.

Joint Generation and Editing

Joint distillation balances generation and editing, achieving strong performance on both tasks.

A 5:5 T2I:Edit mixture yields the highest Editing‑Bench average score, surpassing the T2I‑only student.

Table 3 shows the 5:5 model attains an average score of 3.44 versus 2.87 for the zero‑shot baseline.

The student learns to generate images from text and to edit them under instructions within a few inference steps, by jointly distilling both tasks from a multi‑teacher ensemble.

How does joint generation‑editing distillation differ from training separate generation and editing models?

In joint distillation the student shares a single set of parameters and learns both capabilities simultaneously from a mixed data stream, whereas separate models keep distinct weights and must each be trained and served on their own; the joint approach forces one model to reconcile the two objectives, which is why the T2I:Edit ratio matters for how well each capability is retained.
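One simple way to realize a mixed data stream with a fixed T2I:Edit ratio is to pick the task for each sample with the corresponding probability. The sketch below assumes that per-sample sampling scheme, which the source does not specify; the pool contents and all names are hypothetical.

```python
import random


def sample_mixed_batch(t2i_pool, edit_pool, batch_size, t2i_ratio, edit_ratio, rng):
    """Draw a batch whose expected T2I:Edit composition is t2i_ratio:edit_ratio."""
    total = t2i_ratio + edit_ratio
    if total <= 0:
        raise ValueError("at least one ratio must be positive")
    p_t2i = t2i_ratio / total
    batch = []
    for _ in range(batch_size):
        if rng.random() < p_t2i:
            batch.append(("t2i", rng.choice(t2i_pool)))
        else:
            batch.append(("edit", rng.choice(edit_pool)))
    return batch


# Hypothetical placeholder pools; real pools would hold prompts and image pairs.
t2i_pool = ["t2i_prompt_%d" % i for i in range(100)]
edit_pool = ["edit_sample_%d" % i for i in range(100)]

rng = random.Random(0)
batch = sample_mixed_batch(t2i_pool, edit_pool, 10000, 5, 5, rng)
t2i_share = sum(1 for task, _ in batch if task == "t2i") / len(batch)
print(round(t2i_share, 2))
```

Changing the ratio arguments to 9 and 1 or 7 and 3 gives the other mixtures compared in the text. A deterministic interleaving would also satisfy the ratio; random sampling is used here only because it is short.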

**Figure 4.** Qualitative comparison of joint T2I-editing distillation under different task-mixture ratios. We compare editing results from the task-specialized teacher, the T2I-only zero-shot student, and jointly distilled students trained with T2I:Edit ratios of 9:1, 7:3, and 5:5 across six editing categories. The balanced 5:5 mixture consistently achieves better instruction following while preserving image fidelity, identity consistency, and stylistic quality, demonstrating the importance of task-ratio selection for unified few-step generation-editing distillation.

**Table.** Performance comparison of different Qwen-Image-Flash ratios on T2I-Bench using Gemini 3.1 Pro and GPT 5.5 metrics.

**Table 4.** Quantitative analysis of T2I performance retention under different T2I-to-edit data mixtures. We evaluate jointly distilled student models trained with varying T2I-to-edit data ratios on T2I-Bench, measuring how well their T2I generation capability is preserved after incorporating editing supervision.

Discussion and Limitations

We examine failed stabilizations, current limits, and future directions for few-step visual distillation.

Directly using a task‑specialized teacher as real‑distribution guidance can misalign structure during few‑step distillation. Adding a flow‑matching loss at the first generation step improves layout consistency but slightly degrades visual quality, revealing a trade‑off between structural stability and fidelity.

The student still struggles with highly detailed text rendering, especially tiny characters and complex poster layouts, where minor shape or spacing errors are perceptible. Incorporating editing data also leaves slight residual noise in some T2I outputs, most visible on clean backgrounds, which can be undesirable for graphic‑design applications.

Few‑step visual generation methods fall into two families: trajectory‑level approaches that compress the teacher’s sampling chain, and distribution‑level approaches that align the generated distribution more directly. Trajectory methods risk propagating solver errors, while distribution methods aim for broader diversity but can be harder to stabilize.

Standard T2I evaluation relies on broad prompt sets and metrics like FID, which miss failure modes that appear at very low NFEs. To address this, we introduce two targeted benchmarks—T2I‑Bench and Editing‑Bench—that stress dense text, structured layouts, and clean‑background fidelity for few‑step generators.

Evaluation Details

Evaluation prompts and scoring tables for T2I‑Bench and Editing‑Bench.

We evaluate both text‑to‑image (T2I) generation and image editing using large vision‑language models (Gemini 3.1 Pro and GPT 5.5) as expert judges, following the exact system prompts listed in the tables below.

For T2I‑Bench, the evaluator checks two equally critical dimensions: (1) whether the image faithfully follows every object, attribute, and relationship described in the caption, and (2) whether the output is free of perceptual defects such as geometric distortion, texture melting, or bad anatomy. The judge returns a JSON object with a holistic score and a one‑sentence rationale.
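Consuming such a judge reply might look like the sketch below. The key names `score` and `rationale` are assumptions made for illustration; the source states only that the JSON contains a holistic score and a one-sentence rationale, and the sample replies are invented placeholders, not real judge outputs.

```python
import json


def parse_judge_reply(reply_text):
    """Extract (score, rationale) from a judge reply; key names are hypothetical."""
    payload = json.loads(reply_text)
    score = float(payload["score"])
    rationale = str(payload["rationale"]).strip()
    return score, rationale


def mean_score(reply_texts):
    """Average the holistic scores over a list of judge replies."""
    scores = [parse_judge_reply(text)[0] for text in reply_texts]
    if not scores:
        raise ValueError("no replies to average")
    return sum(scores) / len(scores)


# Invented example replies, used only to exercise the parser.
replies = [
    '{"score": 4, "rationale": "All objects present; minor texture artifacts."}',
    '{"score": 3, "rationale": "Caption followed, but the hand anatomy is off."}',
]

first_score, first_rationale = parse_judge_reply(replies[0])
average = mean_score(replies)
print(first_score, average)
```

A production pipeline would also need to handle malformed JSON and out-of-range scores, which this sketch leaves to the exceptions raised by `json.loads` and `float`.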

Editing‑Bench requires a more flexible prompt because editing tasks vary widely. We therefore define a meta‑prompt template (Table B) with three placeholders – <category‑title>, <category‑rubric>, and <sub‑score‑criteria> – that are filled in per‑task using the mappings in Tables C and D.
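Filling the three placeholders can be sketched with plain string replacement. The template wording below is a made-up stand-in for Table B, which is not reproduced here, and the example rubric and sub-score text are illustrative assumptions; only the three placeholder names come from the text (written with ASCII hyphens).

```python
PLACEHOLDERS = ("<category-title>", "<category-rubric>", "<sub-score-criteria>")

# Hypothetical template text; the real meta-prompt lives in Table B.
META_PROMPT = (
    "You are judging an edit in the category: <category-title>.\n"
    "Rubric: <category-rubric>\n"
    "Return JSON with these sub-scores: <sub-score-criteria>\n"
)


def fill_meta_prompt(template, values):
    """Substitute every placeholder; fail loudly if a value is missing."""
    missing = [name for name in PLACEHOLDERS if name not in values]
    if missing:
        raise KeyError("missing placeholder values: " + ", ".join(missing))
    prompt = template
    for name in PLACEHOLDERS:
        prompt = prompt.replace(name, values[name])
    return prompt


# Illustrative per-task values (assumed wording).
values = {
    "<category-title>": "Textual content editing",
    "<category-rubric>": "Edited text must be correct and readable.",
    "<sub-score-criteria>": "instruction following, source subject preservation",
}

prompt = fill_meta_prompt(META_PROMPT, values)
print(prompt)
```

Keeping the template and the per-category mappings separate means a new editing category only requires a new entry in the mapping, not a new prompt.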

Table C lists the concrete values for each <category‑title> and its associated rubric. For example, “Perceptual enhancement” forbids hallucinating new content, while “Identity‑preserving editing” penalizes any drift in facial features.

Table D defines the sub‑score keys that appear in the JSON output (e.g., “instruction following”, “source subject preservation”). The overall score is derived from holistic reasoning rather than a simple arithmetic mean, with severe failures capping the final value.

To demonstrate the difficulty of the T2I‑Bench, Table E presents two randomly selected hard‑case prompts. Successful generation on these prompts indicates strong compositional control and fine‑grained text rendering.

Finally, Table 5 in the appendix provides the qualitative evaluation criteria for each image‑editing category, ranging from semantic correctness to identity preservation.

The table defines evaluation criteria for various image editing categories. It consists of two columns: **category-title** and **category-rubric**. The rows cover:

- **Scene-level semantic transformation**: Focuses on semantic correctness and physical coherence.
- **Perceptual image enhancement**: Focuses on clarity without hallucinating new content.
- **Object-centric manipulation**: Focuses on plausible filling for deletions and natural integration for additions/replacements.
- **Textual content editing**: Focuses on text correctness and readability.
- **Identity-preserving editing**: Focuses on maintaining identity and avoiding unnatural anatomy or over-editing.
- **Stylistic transfer**: Focuses on changing appearance while maintaining structural consistency.
