InterleaveThinker: Reinforcing Agentic Interleaved Generation
Dian Zheng, Harry Lee, Manyuan Zhang, Kaituo Feng, Zoey Guo, Ray Zhang, Hongsheng Li
A multi-agent framework that decouples planning from generation to enable coherent, long-horizon interleaved text-image sequences.
How can we enable image generation models to perform multi-step, interleaved generation (text-image-text sequences) rather than just single-shot image creation?
Existing image generators are built for single-shot tasks, and Unified Multimodal Models (UMMs) struggle with long sequences because they rely too heavily on immediate visual feedback, causing them to stall or accumulate errors. InterleaveThinker solves this by decoupling the process into a Planner that maps out the entire sequence upfront and a Critic that iteratively evaluates and refines each step's output without updating the generator itself. This framework allows off-the-shelf image models to perform complex interleaved generation, significantly outperforming existing open-source models and rivaling proprietary frontier systems.
Paper Primer
The system operates as a closed-loop pipeline: the Planner translates user requests into a global execution plan, the Generator executes each step, and the Critic evaluates the output against the plan. If the Critic detects a deviation or quality issue, it generates a refined prompt for the Generator to retry that specific step before the pipeline proceeds.
InterleaveThinker enables robust interleaved generation on frozen, off-the-shelf image generators.
When integrated with FLUX.2-klein, the framework surpasses existing open-source UMMs on the UEval and CoMM benchmarks and achieves performance comparable to proprietary models like Nano Banana. It also improves scores on reasoning-based benchmarks, raising WISE from 0.47 to 0.73 and RISE from 13.3 to 28.9.
To train the agents, the authors constructed three datasets (Interleave-Planner-SFT-80k, Interleave-Critic-SFT-112k, and Interleave-Critic-RL-13k) and trained the Critic with a dual-reward strategy. This strategy uses single-step Reinforcement Learning (RL) via Group Relative Policy Optimization (GRPO) to align the Critic's interventions with trajectory-level goals, avoiding the computational cost of end-to-end trajectory optimization.
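For intuition about the GRPO step, the sketch below computes group-relative advantages, the quantity GRPO uses in place of a learned value baseline: several candidate Critic responses are sampled for the same single-step input, each is scored, and the rewards are standardized within that group. The function name, the example rewards, and the `eps` constant are illustrative assumptions, not values taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (GRPO-style baseline)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps (an assumption) avoids division by zero when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four Critic responses to the same step
advantages = group_relative_advantages([0.22, 0.05, 0.40, 0.13])
```

Responses that score above the group mean receive a positive advantage and are reinforced; those below the mean are discouraged, so no separate value network is needed.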
Why is a multi-agent approach necessary instead of just training a single model to handle interleaved sequences?
A single model tends to suffer from "visual over-reliance," where it myopically reacts to intermediate visual states and loses sight of the global objective. By decoupling planning from execution, the Planner blocks intermediate visual feedback, forcing the system to adhere to the pre-defined trajectory.
Does this framework require fine-tuning the underlying image generator?
No. The framework is model-agnostic and designed to retrofit frozen, off-the-shelf image generators, meaning the base model's weights remain unchanged while the Planner and Critic agents manage the generation workflow.
The maximum refinement iteration count ($T_{\max}$) is a key lever: increasing it improves performance by allowing more correction attempts, but it also increases inference latency, especially on long-horizon tasks.
InterleaveThinker demonstrates that complex, long-horizon visual reasoning can be achieved by wrapping frozen generation models in an agentic loop, effectively bypassing the need for end-to-end training of massive multimodal architectures.
The Challenge of Interleaved Generation
Interleaved generation bridges the gap between single‑shot synthesis and complex sequential tasks.
Current image generators excel at producing a single photorealistic image from a prompt, but their architectures force a one‑shot output. This design blocks any workflow that requires a back‑and‑forth of text and images, such as visual storytelling or robotic instruction.
Interleaved generation is a sequence where text and images alternate, each step conditioned on the previous output, enabling coherent multi‑step narratives and actions.
How does Interleaved Generation differ from simply chaining separate text‑to‑image calls?
Chaining treats each call as independent, discarding the semantic link between steps. Interleaved generation preserves a shared trajectory: the planner’s instruction sequence defines a global plan, and the critic enforces adherence, preventing drift that accumulates when calls are isolated.
**Figure 1.** Capabilities of InterleaveThinker, including interleaved generation with various types of inputs, real-world action interaction, and robotic manipulation.
**Figure 2.** Problems of image generators and UMMs in interleaved generation, highlighted in red boxes.
The architectural gap between single‑shot synthesis and interleaved generation is closed by the Planner‑Critic loop, which supplies global planning and step‑wise correction.
The Planner-Critic Architecture
The method decouples planning from generation via a Planner‑Critic loop that iteratively refines interleaved outputs.
Unified multimodal models (UMMs) suffer from visual over‑reliance: they react to the most recent image instead of the overall goal, and small early errors cascade into large final failures. This makes long‑horizon interleaved generation brittle.
Recent diffusion and autoregressive advances have produced unified image generation and editing models that excel at single‑shot text‑to‑image synthesis, but their architectures remain locked to a one‑shot output regime.
Agentic reinforcement‑learning research shows that multi‑agent loops can improve long‑horizon reasoning, yet no prior work applies such loops to interleaved visual generation.
The system separates “what to do” (Planner) from “how well it was done” (Critic), letting a frozen generator focus on image creation while the loop refines prompts until the Critic approves.
Step 1: Generator receives $r_1^{0}=p_1$ and $I_0$ (blank) → produces $I_1^{1}$ (apple image).
Critic evaluates $I_1^{1}$ vs $p_1$, returns $j_1^{1}= \text{true}$ (accept).
Step 2: Generator receives $r_2^{0}=p_2$ and $I_1^{1}$ → produces $I_2^{1}$ (apple with no bite).
Critic finds mismatch, returns $j_2^{1}= \text{false}$ and refined prompt $r_2^{1}=$ “add a bite to the right side of the apple”.
Generator runs again with $r_2^{1}$ → $I_2^{2}$ (apple with bite).
Critic now returns $j_2^{2}= \text{true}$; loop terminates.
The loop isolates failure to the offending step, allowing targeted prompt refinement without re‑planning the entire sequence.
How does this Planner‑Critic loop differ from a standard reinforcement‑learning agent that repeatedly samples actions?
In RL the policy itself is updated in response to rewards, which requires policy-gradient updates. Here the Planner emits its plan in a single pass and is not revised; at inference time the Critic adjusts only the prompt (a short text string), so the loop requires no weight updates and works with any frozen generator.
Planner parses the interleaved input $S$ and emits the step plan $\{(u_i, p_i, a_i)\}_{i=1}^{N}$.
For each step $i$, initialize $r_i^{0}=p_i$ and set $t=0$.
Generator produces $I_i^{t+1}$ from $r_i^{t}$ and $I_{i-1}$.
Critic evaluates $I_i^{t+1}$; if $j_i^{t+1}$ is true, accept the image as $I_i$ and move to step $i+1$; otherwise take the Critic's refined prompt $r_i^{t+1}$, increment $t$, and repeat for up to $T_{\max}$ iterations.
After the final step, concatenate all images $I_i$ and auxiliary texts $a_i$ to form the output sequence.
Planner‑Critic loop – concise pseudocode.
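The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Step`, `run_plan`, `toy_generate`, and `toy_critique` are hypothetical names, images are represented as plain strings, and keeping the last attempt when the retry budget is exhausted is an assumed fallback.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Step:
    instruction: str  # u_i
    prompt: str       # p_i, the initial generator prompt
    aux_text: str     # a_i

def run_plan(plan, generate, critique, t_max=3):
    """Execute a fixed plan step by step, refining prompts on rejection."""
    if t_max < 1:
        raise ValueError("t_max must be at least 1")
    outputs: List[Tuple[str, str]] = []
    prev_image: Optional[str] = None      # I_0: blank starting state
    for step in plan:
        prompt = step.prompt              # r_i^0 = p_i
        for _ in range(t_max):
            image = generate(prompt, prev_image)
            accepted, refined = critique(image, step)
            if accepted:
                break
            prompt = refined              # retry this step only
        # Assumption: keep the last attempt if the budget runs out.
        outputs.append((image, step.aux_text))
        prev_image = image
    return outputs

# Toy stand-ins for the frozen generator and the Critic (hypothetical).
calls = []

def toy_generate(prompt, prev_image):
    calls.append(prompt)
    return f"image<{prompt}>"

def toy_critique(image, step):
    # Toy rule: accept when the instruction keyword appears in the image string.
    accepted = step.instruction in image
    return accepted, f"{step.prompt}, {step.instruction}"

plan = [
    Step("apple", "a red apple on a table", "Start with a whole apple."),
    Step("bite", "the same apple, partly eaten", "Then take a bite."),
]
result = run_plan(plan, toy_generate, toy_critique)
```

In this toy run the first step is accepted on the first attempt, while the second step is rejected once and succeeds after the refined prompt, mirroring the apple example: three generator calls for a two-step plan, with the plan itself never revised.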
Emu3.5 is a unified multimodal model that serves as a comparison baseline (Figures 6 and 7); it is not a component of the InterleaveThinker pipeline.
Nano Banana Pro is a proprietary frontier image model that likewise serves as a comparison baseline (Figures 6 and 7).
**Figure 3.** Overview of InterleaveThinker. $t$ denotes the refinement iteration. See Figure 4 for an inference example.
**Figure 4.** The workflow of InterleaveThinker.
Curating Data for Agentic Planning
How we build the Interleave‑Planner‑SFT‑80k dataset and its filtering pipeline.
Training agents for long-horizon planning requires aligned triplets of instruction, intermediate visual state, and critic judgment. Without such triplets the planner cannot learn to interleave text and images or to refine steps based on feedback.
We first assemble a diverse set of text prompts covering eight high‑level categories and 75 fine‑grained sub‑categories. Gemini 2.5 Pro expands each sub‑category into domain‑specific vocabularies, and 100 instructional templates are populated to yield roughly 40 000 prompts for interleaved generation.
For each prompt, the Planner agent emits a global step‑by‑step instruction sequence, which an image generator executes. The Critic then evaluates each generated image against the instruction, providing a critique that refines the prompt for the next iteration, producing a full trajectory of plans, images, and critiques.
We filter the Critic’s trajectories to discard steps with decreasing scores or low visual quality, retaining only upward or stable high‑score steps. The remaining steps are split by score variance: high‑variance steps form the RL subset, low‑variance steps form the SFT subset, and iteration‑wise judgments are balanced to avoid bias.
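The filtering and splitting rule can be illustrated with a short sketch. It assumes that each step carries one Critic score per refinement iteration and that "score variance" means variance across those iterations; the thresholds `min_final` and `var_threshold` and the function names are hypothetical. The visual-quality filter and the balancing of judgments are omitted.

```python
from statistics import pvariance

def keep_step(scores, min_final=7.0):
    """Keep a step whose scores never decrease and end at a high value."""
    non_decreasing = all(b >= a for a, b in zip(scores, scores[1:]))
    return non_decreasing and scores[-1] >= min_final

def split_by_variance(steps, var_threshold=1.0):
    """Route high-variance steps to the RL subset, low-variance ones to SFT."""
    rl, sft = [], []
    for step_id, scores in steps.items():
        if not keep_step(scores):
            continue  # discarded by the filter
        (rl if pvariance(scores) > var_threshold else sft).append(step_id)
    return rl, sft

steps = {
    "a": [4, 6, 9],  # improving with a large spread -> RL
    "b": [8, 8, 9],  # stable with a small spread -> SFT
    "c": [8, 5],     # score dropped -> discarded
    "d": [3, 4],     # never reaches a high score -> discarded
}
rl_ids, sft_ids = split_by_variance(steps)
```

One plausible reading of the split is that steps whose scores change substantially carry more learning signal for RL, while stable high-score steps serve as clean supervised targets.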
Since the Planner’s training data lacks multimodal context, we synthesize interleaved trajectories by truncating a full plan and pairing the preceding multimodal inputs with the subsequent text plan. We augment these self‑synthesized examples with existing open‑source interleaved datasets, yielding a combined corpus of 80 000 Planner‑SFT examples.
**Figure 5.** Illustration of Our Data Construction Pipeline.
Think of the dataset as a cookbook where each recipe step is paired with a taste test: every instruction is aligned with the visual state it produces and the critic’s judgment of that state.
Step 1: Planner emits “show frogspawn” → image generator produces egg cluster → Critic scores 9/10.
Step 2: Planner emits “show tadpole hatching” → image generator produces tadpole → Critic scores 8/10.
Truncate after step 2; the first two multimodal pairs become the input.
Remaining text plan “show froglet then adult frog” is stored as the target output.
Resulting entry: (input = [prompt + egg image + tadpole image], target = “show froglet then adult frog”, critic judgments = [9, 8]).
This tiny construction shows why aligning each instruction with its visual state and critic score is crucial: the Planner learns to continue a sequence from partial multimodal context, something a plain caption‑image pair cannot provide.
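The truncation scheme in the frog example can be sketched as below. The function name, the dictionary layout, and the use of file names as stand-ins for images are illustrative assumptions; the actual storage format of the dataset is not specified here.

```python
def make_planner_examples(user_prompt, steps):
    """Build (multimodal prefix -> remaining text plan) training pairs.

    steps: list of (instruction_text, image) pairs from one full trajectory.
    """
    examples = []
    for k in range(1, len(steps)):
        context = [user_prompt]
        for text, image in steps[:k]:
            context += [text, image]          # preceding multimodal pairs
        target = [text for text, _ in steps[k:]]  # remaining text-only plan
        examples.append({"input": context, "target": target})
    return examples

frog = [
    ("show frogspawn", "egg.png"),
    ("show tadpole hatching", "tadpole.png"),
    ("show froglet", "froglet.png"),
    ("show adult frog", "frog.png"),
]
examples = make_planner_examples("Illustrate the frog life cycle", frog)
```

Each truncation point yields one example, so a four-step trajectory produces three training pairs; the second one corresponds to the entry described above, with two image-bearing steps as context and the froglet and adult-frog instructions as the target.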
How does the Interleave‑Planner‑SFT dataset differ from a standard text‑image pair dataset?
Standard pairs contain only a caption and a single image, whereas our dataset aligns a sequence of instructions, intermediate images, and critic judgments, enabling the planner to learn stepwise refinement rather than one‑shot generation.
Aligned instruction‑state‑critic triplets are essential for training the planner to interleave generation and refine steps.
Training the Planner
We train a planner via supervised fine‑tuning and a critic with a dual‑reward RL loop.
Unified multimodal models lean heavily on the most recent visual state, which causes the visual over-reliance described earlier. By training a planner to emit the full sequence of step instructions up front, we remove that dependence and give the system a coherent global plan.
Supervised fine‑tuning teaches the planner to turn a complex user request into a complete, ordered list of text‑image steps before any generation happens.
How does Planner‑SFT differ from ordinary supervised fine‑tuning of a language model?
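The difference lies in the training target rather than in the loss. Ordinary SFT typically teaches a model to produce a direct response, whereas Planner-SFT teaches it to emit an entire *plan* of interleaved text-image actions before any generation happens. The model therefore learns a global structure rather than a step-by-step reaction to intermediate outputs.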
Critic‑SFT starts from the pretrained Qwen3‑VL‑8B‑Instruct model and learns a lightweight evaluation format: given the previous visual state, the current image, the planned instruction, and the generator’s response, it outputs a judgment and a refined prompt for the next generator call.
Long interleaved trajectories (often > 25 generator calls) make end‑to‑end RL prohibitively expensive and create severe credit‑assignment problems. To keep training tractable we split the problem into single‑step RL episodes, each guided by a dual‑reward signal.
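Take $\alpha = 0.2$, $R_{\text{acc}} = -0.3$, $R_{\text{step}} = 0.5$, and a format reward of $0.1$, with the format reward and the blended accuracy/step term each weighted by $0.5$.
Compute the blended inner term: $\alpha R_{\text{acc}} + (1-\alpha) R_{\text{step}} = 0.2 \times (-0.3) + 0.8 \times 0.5 = -0.06 + 0.40 = 0.34$.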
Weight the inner term by $0.5$: $0.5 \times 0.34 = 0.17$.
Weight the format reward by $0.5$: $0.5 \times 0.1 = 0.05$.
Sum the two contributions: $0.05 + 0.17 = 0.22$.
The final reward $R$ is $0.22$, which the RL optimizer uses to update the Critic’s policy.
This toy calculation shows how a modest accuracy penalty can be outweighed by a strong step improvement, thanks to the $\alpha$ weighting.
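The same arithmetic can be checked in a few lines of Python. The function and parameter names are hypothetical, and the equal $0.5$ weighting between the format reward and the blended term is taken from the worked numbers above rather than from a stated formula.

```python
def critic_reward(r_format, r_acc, r_step, alpha=0.2, w_format=0.5):
    """Combine format, accuracy, and step rewards as in the toy calculation."""
    blended = alpha * r_acc + (1 - alpha) * r_step
    return w_format * r_format + (1 - w_format) * blended

reward = critic_reward(r_format=0.1, r_acc=-0.3, r_step=0.5)
print(round(reward, 2))  # 0.22
```

With $\alpha = 0.2$ the step reward carries four times the weight of the accuracy reward inside the blended term, which is why the positive step improvement dominates here.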
**Table 1.** Comparison on UEval [46]. We evaluate open-source and proprietary frontier models on 8 tasks in UEval. Bold indicates the best result among each group.
Performance Benchmarks
InterleaveThinker sets new records on interleaved benchmarks, especially CoMM.
InterleaveThinker leads the CoMM benchmark with a style score 4.0 points higher than the nearest open‑source competitor.
Table 2 shows the runner‑up style score at 5.6 (MiniGPT‑5), while InterleaveThinker + Qwen‑Image‑Edit‑2511 attains 9.6.
We first evaluate on the UEval benchmark, which tests a model’s ability to generate interleaved text‑image sequences from a pure‑text prompt.
UEval measures how accurately a system can produce a mixed sequence of text and images when only a textual description is given.
The CoMM benchmark assesses interleaved generation when the input already contains images, focusing on style, entity, trend, completeness, image quality, and text‑image alignment.
CoMM evaluates a model’s ability to maintain visual style and semantic consistency across a sequence that mixes images and text.
On UEval, InterleaveThinker + FLUX.2‑klein‑9B outperforms all open‑source baselines and reaches parity with the proprietary Nano Banana.
On CoMM, the InterleaveThinker + Qwen‑Image‑Edit‑2511 configuration dominates every competitor, achieving the highest scores across all six metrics.
WISE, a reasoning‑based image‑generation benchmark, shows a similar uplift: InterleaveThinker improves the base FLUX.2‑klein‑9B model on cultural, time, and space reasoning tasks.
RISE, which tests image‑editing reasoning, mirrors the WISE gains, confirming that the planner‑critic loop benefits both generation and editing pipelines.
**Figure 6.** Comparison with Emu3.5 and Nano Banana Pro in pure-text input interleaved generation.
**Figure 7.** Comparison with Emu3.5 and Nano Banana Pro in multi-modal input interleaved generation.
**Table 2.** Comparison on CoMM [43]. Sty. and Enti. denote the style and entity consistency among generated images. Tren. denotes the trend alignment between the image and text sequences. Comp. denotes completeness, and ImgQ denotes image quality. IRS is the text-image alignment score. x/x reports the model’s performance on interleaved (Task 3) and pure-text (Task 4) inputs.
**Table.** Performance comparison of various proprietary and open-source models across Temporal, Causal, Spatial, Logical, and Overall metrics.
InterleaveThinker’s planner‑critic loop delivers markedly better multi‑modal consistency, setting new standards on both UEval and CoMM.
Ablation and Failure Analysis
Ablation experiments isolate the contribution of each InterleaveThinker component.
We evaluate the impact of each design choice on the UEval benchmark, using FLUX.2‑klein‑9B as the base image generator and reporting oracle upper bounds from Gemini‑2.5‑Pro and GPT‑4.1.
Planner‑SFT raises the Text score from 33.5 to 58.5, demonstrating the planner’s essential role.
Table 5 shows a jump to 58.5 when the Planner‑SFT module is added.
Full‑SFT provides a marginal further gain, reaching 58.6.
Full‑SFT (both Planner and Critic fine‑tuned) records 58.6 in Table 5.
Omitting the step‑wise reward (`R_step`) reduces the average score to 58.2, indicating its importance for prompt refinement.
Table 5 lists “RL w/o step reward” at 58.2.
Removing the accuracy reward (`R_acc`) lowers the score to 58.4, showing that precise evaluation is also critical.
Table 5 lists “RL w/o acc reward” at 58.4.
When the planner’s capabilities are merged into the critic (the “One-Agent” configuration), performance collapses, confirming that keeping planning and critique in separate agents is necessary.
Training the critic on unfiltered data causes it to output trivial constant predictions (e.g., always “True”), which dramatically hurts the overall scores.
**Figure 8.** Failure case of InterleaveThinker+FLUX.2-klein.
Implementation Details and Prompts
The appendix details the system prompts that drive planning, critique, and evaluation of multimodal generation.
The System Prompt defines the overall role of the Task Planner, Orchestrator, and Prompt Engineer, emphasizing analysis, execution planning, and optimization of every step into a concise T2I prompt or editing instruction.
The Planner Pure-Text System Prompt specifies how to translate a user request into a stepwise execution plan, enforcing dynamic step counts, complete final outputs, and strict separation of image generation versus auxiliary text.
The Task Planner, Orchestrator, and Prompt Engineer System Prompt expands on input handling, distinguishing Task A (text-only) from Task B (sequential editing), and mandates prefix-free fields for instructions, prompts, and auxiliary text.
The Planner Interleaved System Prompt adds the InterleaveThinker agentic loop, directing the system to generate or edit images step by step while preserving the “no prefix” rule.
The Critic System Prompt outlines a binary evaluation of edited images, checking intent matching and anomaly detection, and requires a refined prompt when the result is deemed insufficient.
The Refined VIEScore System Prompt formalizes scoring from 0 to 10 based on intent alignment and collateral‑damage avoidance, providing a structured rubric for final quality assessment.