InterleaveThinker: Reinforcing Agentic Interleaved Generation

Dian Zheng, Harry Lee, Manyuan Zhang, Kaituo Feng, Zoey Guo, Ray Zhang, Hongsheng Li

A multi-agent framework that decouples planning from generation to enable coherent, long-horizon interleaved text-image sequences.

How can we enable image generation models to perform multi-step, interleaved generation (text-image-text sequences) rather than just single-shot image creation?

Existing image generators are built for single-shot tasks, and Unified Multimodal Models (UMMs) struggle with long sequences because they rely too heavily on immediate visual feedback, causing them to stall or accumulate errors. InterleaveThinker solves this by decoupling the process into a Planner that maps out the entire sequence upfront and a Critic that iteratively evaluates and refines each step's output without updating the generator itself. This framework allows off-the-shelf image models to perform complex interleaved generation, significantly outperforming existing open-source models and rivaling proprietary frontier systems.

Paper Primer

The system operates as a closed-loop pipeline: the Planner translates user requests into a global execution plan, the Generator executes each step, and the Critic evaluates the output against the plan. If the Critic detects a deviation or quality issue, it generates a refined prompt for the Generator to retry that specific step before the pipeline proceeds.

InterleaveThinker enables robust interleaved generation on frozen, off-the-shelf image generators.

When integrated with FLUX.2-klein, the framework surpasses existing open-source UMMs on the UEval and CoMM benchmarks, achieving performance comparable to proprietary models like Nano Banana. It also improves results on reasoning-based benchmarks substantially, raising WISE scores from 0.47 to 0.73 and RISE scores from 13.3 to 28.9.

To train the agents, the authors constructed three datasets (Interleave-Planner-SFT-80k, Interleave-Critic-SFT-112k, and Interleave-Critic-RL-13k). The Critic is additionally trained with a dual-reward strategy: single-step Reinforcement Learning (RL) via Group Relative Policy Optimization (GRPO) aligns the Critic's interventions with trajectory-level goals while avoiding the computational cost of end-to-end trajectory optimization.
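
For context, GRPO scores each sampled response relative to the other responses drawn for the same input. The sketch below shows that group-relative normalization under the commonly used formulation (reward minus the group mean, divided by the group standard deviation). The function name, the `eps` term, and the sample rewards are illustrative assumptions, and the paper's exact variant may differ.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps (an assumption) avoids division by zero when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four hypothetical Critic samples for one generation step, already scored.
advantages = group_relative_advantages([0.22, 0.05, 0.40, 0.13])
```

Samples that beat the group average receive a positive advantage and the rest a negative one, so no separate value network is needed to obtain a baseline.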

Why is a multi-agent approach necessary instead of just training a single model to handle interleaved sequences?

A single model tends to suffer from "visual over-reliance," where it myopically reacts to intermediate visual states and loses sight of the global objective. By decoupling planning from execution, the Planner blocks intermediate visual feedback, forcing the system to adhere to the pre-defined trajectory.

Does this framework require fine-tuning the underlying image generator?

No. The framework is model-agnostic and designed to retrofit frozen, off-the-shelf image generators, meaning the base model's weights remain unchanged while the Planner and Critic agents manage the generation workflow.

The maximum refinement iteration count ($T_{\max}$) is a key lever: increasing $T_{\max}$ improves performance by allowing more correction attempts, but it also increases inference latency on long-horizon tasks.
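
The latency side of this trade-off can be made concrete with a back-of-the-envelope model. The sketch below treats $T_{\max}$ as a cap on attempts per step and assumes, purely for illustration, that every attempt is accepted independently with the same probability `p_accept`. Neither assumption, nor the function names, comes from the paper.

```python
def worst_case_calls(num_steps: int, t_max: int) -> int:
    """Upper bound on generator calls: every step uses all t_max attempts."""
    return num_steps * t_max

def expected_calls(num_steps: int, t_max: int, p_accept: float) -> float:
    """Expected generator calls when each attempt is accepted with probability
    p_accept (illustrative assumption), capped at t_max attempts per step."""
    # Attempt k+1 only happens if the first k attempts were all rejected.
    per_step = sum((1.0 - p_accept) ** k for k in range(t_max))
    return num_steps * per_step
```

Under this toy model the worst case grows linearly in $T_{\max}$, while the expected cost flattens out once most steps are accepted within the first few attempts.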

InterleaveThinker demonstrates that complex, long-horizon visual reasoning can be achieved by wrapping frozen generation models in an agentic loop, effectively bypassing the need for end-to-end training of massive multimodal architectures.

The Challenge of Interleaved Generation

Interleaved generation bridges the gap between single‑shot synthesis and complex sequential tasks.

Current image generators excel at producing a single photorealistic image from a prompt, but their architectures force a one‑shot output. This design blocks any workflow that requires a back‑and‑forth of text and images, such as visual storytelling or robotic instruction.

Interleaved generation is a sequence where text and images alternate, each step conditioned on the previous output, enabling coherent multi‑step narratives and actions.

How does Interleaved Generation differ from simply chaining separate text‑to‑image calls?

Chaining treats each call as independent, discarding the semantic link between steps. Interleaved generation preserves a shared trajectory: the planner’s instruction sequence defines a global plan, and the critic enforces adherence, preventing drift that accumulates when calls are isolated.

**Figure 1.** Capabilities of InterleaveThinker, consisting of interleaved generation with various types of inputs, real-world action interaction, and robotic manipulation.

**Figure 2.** Problems in image generators and UMMs for interleaved generation. Problems are highlighted in red boxes.

The architectural gap between single‑shot synthesis and interleaved generation is closed by the Planner‑Critic loop, which supplies global planning and step‑wise correction.

The Planner-Critic Architecture

The method decouples planning from generation via a Planner‑Critic loop that iteratively refines interleaved outputs.

Unified multimodal models (UMMs) suffer from visual over‑reliance: they react to the most recent image instead of the overall goal, and small early errors cascade into large final failures. This makes long‑horizon interleaved generation brittle.

Recent diffusion and autoregressive advances have produced unified image generation and editing models that excel at single‑shot text‑to‑image synthesis, but their architectures remain locked to a one‑shot output regime.

Agentic reinforcement‑learning research shows that multi‑agent loops can improve long‑horizon reasoning, yet no prior work applies such loops to interleaved visual generation.

The system separates “what to do” (Planner) from “how well it was done” (Critic), letting a frozen generator focus on image creation while the loop refines prompts until the Critic approves.

Step 1: Generator receives $r_1^{0}=p_1$ and $I_0$ (blank) → produces $I_1^{1}$ (apple image).

Critic evaluates $I_1^{1}$ vs $p_1$, returns $j_1^{1}= \text{true}$ (accept).

Step 2: Generator receives $r_2^{0}=p_2$ and $I_1^{1}$ → produces $I_2^{1}$ (apple with no bite).

Critic finds mismatch, returns $j_2^{1}= \text{false}$ and refined prompt $r_2^{1}=$ “add a bite to the right side of the apple”.

Generator runs again with $r_2^{1}$ → $I_2^{2}$ (apple with bite).

Critic now returns $j_2^{2}= \text{true}$; loop terminates.

The loop isolates failure to the offending step, allowing targeted prompt refinement without re‑planning the entire sequence.

How does this Planner‑Critic loop differ from a standard reinforcement‑learning agent that repeatedly samples actions?

In RL the policy itself changes after each reward, requiring policy‑gradient updates. Here the Planner's plan is fixed after a single pass; only the prompt (a short piece of text) is adjusted by the Critic, with no weight updates inside the loop, so it incurs negligible training overhead and works with any frozen generator.

Planner parses the interleaved input $S$ and emits the step plan $\{(u_i, p_i, a_i)\}_{i=1}^{N}$.

For each step $i$, initialize $r_i^{0}=p_i$ and set $t=1$.

Generator produces $I_i^{t}$ from $r_i^{t-1}$ and $I_{i-1}$.

Critic evaluates $I_i^{t}$; if $j_i^{t}$ is true, accept $I_i = I_i^{t}$ and move to $i+1$; otherwise it returns a refined prompt $r_i^{t}$, and the loop repeats with $t \leftarrow t+1$ for up to $T_{\max}$ iterations.

After the final step, concatenate all images $I_i$ and auxiliary texts $a_i$ to form the output sequence.

Planner‑Critic loop – concise pseudocode.
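
The steps above can be sketched as a short Python loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Step` fields, the callable signatures, the choice to keep the last attempt when none is accepted, and the toy stand-ins for the Generator and Critic (including every prompt string) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Image = str  # stand-in: a real pipeline would pass image data, not labels

@dataclass
class Step:
    instruction: str  # u_i
    prompt: str       # p_i, the initial generator prompt
    aux_text: str     # a_i

# generate(prompt, previous_image) -> image
Generator = Callable[[str, Optional[Image]], Image]
# critique(previous_image, image, step) -> (accepted, refined_prompt)
Critic = Callable[[Optional[Image], Image, Step], Tuple[bool, str]]

def run_loop(plan: List[Step], generate: Generator, critique: Critic,
             t_max: int = 3) -> List[Tuple[Image, str]]:
    if t_max < 1:
        raise ValueError("t_max must be at least 1")
    outputs: List[Tuple[Image, str]] = []
    prev: Optional[Image] = None          # I_0 is blank
    for step in plan:
        prompt = step.prompt              # r_i^0 = p_i
        for _ in range(t_max):
            image = generate(prompt, prev)
            accepted, refined = critique(prev, image, step)
            if accepted:
                break
            prompt = refined              # retry this step only
        # Assumption: if no attempt is accepted, keep the last one.
        outputs.append((image, step.aux_text))
        prev = image
    return outputs

# Toy stand-ins reproducing the apple trace (all strings are hypothetical).
calls: List[str] = []

def fake_generate(prompt: str, prev: Optional[Image]) -> Image:
    calls.append(prompt)
    return "apple with bite" if "add a bite" in prompt else "apple"

def fake_critique(prev: Optional[Image], image: Image,
                  step: Step) -> Tuple[bool, str]:
    wants_bite = "bite" in step.prompt
    ok = (image == "apple with bite") == wants_bite
    return ok, "add a bite to the right side of the apple"

plan = [Step("draw the apple", "a red apple", "Here is an apple."),
        Step("bite the apple", "the apple after a bite", "Someone took a bite.")]
result = run_loop(plan, fake_generate, fake_critique)
```

In this toy run the first step is accepted immediately and the second needs one retry, so the generator is called three times in total. The plan itself is never regenerated, which is the property the loop is meant to illustrate.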

Emu3.5 is one of the baseline systems that InterleaveThinker is compared against in Figures 6 and 7; it is not a component of the Planner‑Critic loop.

Nano Banana Pro is the other baseline in those comparisons, a proprietary frontier model; it is likewise not part of the pipeline.

**Figure 3.** Overview of InterleaveThinker. $t$ denotes the refinement iteration. See Figure 4 for an inference example.

**Figure 4.** The working flow of InterleaveThinker.

Curating Data for Agentic Planning

How we build the Interleave‑Planner‑SFT‑80k dataset and its filtering pipeline.

Training agents for long‑horizon planning requires aligned triplets of instruction, intermediate visual state, and critic judgment. Without such triplets the planner cannot learn to interleave text and images or to refine steps based on feedback.

We first assemble a diverse set of text prompts covering eight high‑level categories and 75 fine‑grained sub‑categories. Gemini 2.5 Pro expands each sub‑category into domain‑specific vocabularies, and 100 instructional templates are populated to yield roughly 40 000 prompts for interleaved generation.
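
Populating templates with vocabulary terms might look like the sketch below. The `{term}` slot syntax, the function name, the optional subsampling, and the example strings are assumptions made for illustration; the paper's actual templates and vocabularies are not reproduced here.

```python
import random
from itertools import product

def populate_templates(templates, vocabulary, max_prompts=None, seed=0):
    """Fill every template's {term} slot with every vocabulary term."""
    prompts = [t.format(term=term) for t, term in product(templates, vocabulary)]
    # Optionally subsample to a target size (hypothetical knob).
    if max_prompts is not None and len(prompts) > max_prompts:
        prompts = random.Random(seed).sample(prompts, max_prompts)
    return prompts

# Hypothetical templates and terms, not taken from the dataset.
templates = ["Show step by step how {term} develops.",
             "Illustrate the stages of {term}."]
vocabulary = ["a frog's life cycle", "bread dough rising"]
prompts = populate_templates(templates, vocabulary)
```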

For each prompt, the Planner agent emits a global step‑by‑step instruction sequence, which an image generator executes. The Critic then evaluates each generated image against the instruction, providing a critique that refines the prompt for the next iteration, producing a full trajectory of plans, images, and critiques.

We filter the Critic’s trajectories to discard steps with decreasing scores or low visual quality, retaining only upward or stable high‑score steps. The remaining steps are split by score variance: high‑variance steps form the RL subset, low‑variance steps form the SFT subset, and iteration‑wise judgments are balanced to avoid bias.
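
A rough sketch of this filter-and-split logic follows. It assumes that each step carries the list of Critic scores from its successive refinement iterations and that variance is measured over that list. The thresholds `min_final` and `var_threshold`, like all names here, are hypothetical, and the balancing of iteration-wise judgments is not shown.

```python
from statistics import pvariance

def keep_step(scores, min_final=7):
    """Keep a step only if its scores never decrease and it ends on a high score."""
    non_decreasing = all(b >= a for a, b in zip(scores, scores[1:]))
    return non_decreasing and scores[-1] >= min_final

def split_by_variance(steps, var_threshold=1.0):
    """Route retained steps to the RL subset (high variance) or SFT subset."""
    rl, sft = [], []
    for step_id, scores in steps.items():
        if not keep_step(scores):
            continue
        (rl if pvariance(scores) > var_threshold else sft).append(step_id)
    return rl, sft

steps = {
    "a": [4, 6, 9],  # rising with a large spread: RL subset
    "b": [8, 8, 9],  # stable and high: SFT subset
    "c": [9, 6, 7],  # score decreased: dropped
    "d": [2, 3, 3],  # never reaches a high score: dropped
}
rl_ids, sft_ids = split_by_variance(steps)
```

One plausible reading of this split is that steps whose scores move a lot give RL a stronger learning signal, while stable steps serve as clean supervised targets.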

Since the Planner’s training data lacks multimodal context, we synthesize interleaved trajectories by truncating a full plan and pairing the preceding multimodal inputs with the subsequent text plan. We augment these self‑synthesized examples with existing open‑source interleaved datasets, yielding a combined corpus of 80 000 Planner‑SFT examples.

**Figure 5.** Illustration of Our Data Construction Pipeline.

Think of the dataset as a cookbook where each recipe step is paired with a taste test: every instruction is aligned with the visual state it produces and the critic’s judgment of that state.

Step 1: Planner emits “show frogspawn” → image generator produces egg cluster → Critic scores 9/10.

Step 2: Planner emits “show tadpole hatching” → image generator produces tadpole → Critic scores 8/10.

Truncate after step 2; the first two multimodal pairs become the input.

Remaining text plan “show froglet then adult frog” is stored as the target output.

Resulting entry: (input = [prompt + egg image + tadpole image], target = “show froglet then adult frog”, critic judgments = [9, 8]).

This tiny construction shows why aligning each instruction with its visual state and critic score is crucial: the Planner learns to continue a sequence from partial multimodal context, something a plain caption‑image pair cannot provide.
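
The truncation step in this example can be sketched in a few lines. The function name, the tuple layout, the prompt string, and the image file names are illustrative assumptions; images are represented by placeholder strings.

```python
def truncate_trajectory(prompt, steps, cut):
    """Split a full trajectory into (multimodal input, remaining text plan).

    steps is a list of (instruction, image) pairs; cut is the number of
    already-executed steps whose images become part of the input.
    """
    if not 0 < cut < len(steps):
        raise ValueError("cut must leave at least one step on each side")
    context = [prompt] + [image for _, image in steps[:cut]]
    target = [instruction for instruction, _ in steps[cut:]]
    return context, target

# The frog example, with hypothetical file names standing in for images.
steps = [("show frogspawn", "egg_cluster.png"),
         ("show tadpole hatching", "tadpole.png"),
         ("show froglet", "froglet.png"),
         ("show adult frog", "adult_frog.png")]
context, target = truncate_trajectory("frog life cycle", steps, cut=2)
```

Varying `cut` over one trajectory yields several training examples, each asking the Planner to continue from a different amount of multimodal context.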

How does the Interleave‑Planner‑SFT dataset differ from a standard text‑image pair dataset?

Standard pairs contain only a caption and a single image, whereas our dataset aligns a sequence of instructions, intermediate images, and critic judgments, enabling the planner to learn stepwise refinement rather than one‑shot generation.

Aligned instruction‑state‑critic triplets are essential for training the planner to interleave generation and refine steps.

Training the Planner

We train a planner via supervised fine‑tuning and a critic with a dual‑reward RL loop.

Unified multimodal models react to each intermediate image as it arrives, which leads to over‑reliance on visual feedback. By training a planner to emit a full sequence of text‑image instructions up front, we eliminate that bottleneck and give the system a coherent global plan.

Supervised fine‑tuning teaches the planner to turn a complex user request into a complete, ordered list of text‑image steps before any generation happens.

How does Planner‑SFT differ from ordinary supervised fine‑tuning of a language model?

Ordinary SFT predicts the next token given the previous ones, but Planner‑SFT predicts an entire *plan* of interleaved text‑image actions in a single shot. The model therefore learns a global structure rather than a step‑by‑step continuation.

Critic‑SFT starts from the pretrained Qwen3‑VL‑8B‑Instruct model and learns a lightweight evaluation format: given the previous visual state, the current image, the planned instruction, and the generator’s response, it outputs a judgment and a refined prompt for the next generator call.
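
To make the Critic's output contract concrete, here is a small parser sketch. The JSON encoding and the key names `judgment` and `refined_prompt` are assumptions for illustration; the source only states that the Critic returns a judgment and, when the result is insufficient, a refined prompt.

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class CriticVerdict:
    accepted: bool
    refined_prompt: Optional[str]

def parse_verdict(raw: str) -> Optional[CriticVerdict]:
    """Return a verdict, or None if the text breaks the assumed format."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not isinstance(data.get("judgment"), bool):
        return None
    if data["judgment"]:
        return CriticVerdict(True, None)
    refined = data.get("refined_prompt")
    # A rejection must come with a non-empty refined prompt.
    if not (isinstance(refined, str) and refined.strip()):
        return None
    return CriticVerdict(False, refined)
```

A check of this kind is also one simple way a format reward could be computed during RL, although the paper's actual format criterion is not specified in this summary.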

Long interleaved trajectories (often > 25 generator calls) make end‑to‑end RL prohibitively expensive and create severe credit‑assignment problems. To keep training tractable we split the problem into single‑step RL episodes, each guided by a dual‑reward signal.

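The total reward blends a format reward with an accuracy reward and a step-improvement reward. As a toy example, take $\alpha = 0.2$, $R_{\text{acc}} = -0.3$, and $R_{\text{step}} = 0.5$, and compute the blended inner term: $\alpha R_{\text{acc}} + (1-\alpha) R_{\text{step}} = 0.2 \times (-0.3) + 0.8 \times 0.5 = -0.06 + 0.40 = 0.34$.

Weight the inner term by $0.5$: $0.5 \times 0.34 = 0.17$.

Weight the format reward, here $0.1$, by $0.5$: $0.5 \times 0.1 = 0.05$.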

Sum the two contributions: $0.05 + 0.17 = 0.22$.

The final reward $R$ is $0.22$, which the RL optimizer uses to update the Critic’s policy.

This toy calculation shows how a modest accuracy penalty can be outweighed by a strong step improvement, thanks to the $\alpha$ weighting.
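
The same arithmetic as a small function. The formula is reconstructed from the worked numbers above rather than quoted from the paper, and the parameter names (`r_format`, `w_format`) and the default weights are assumptions.

```python
def dual_reward(r_format, r_acc, r_step, alpha=0.2, w_format=0.5):
    """Blend accuracy and step rewards, then mix with the format reward."""
    inner = alpha * r_acc + (1.0 - alpha) * r_step
    return w_format * r_format + (1.0 - w_format) * inner

# Values from the toy calculation: format 0.1, accuracy -0.3, step 0.5.
reward = dual_reward(r_format=0.1, r_acc=-0.3, r_step=0.5)
```

With $\alpha = 0.2$, the step reward carries four times the weight of the accuracy reward inside the blended term, which is why the negative accuracy signal does not dominate here.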

**Table 1.** Comparison on UEval [46]. We evaluate open-source and proprietary frontier models on 8 tasks in UEval. Bold indicates the best result among each group.

Performance Benchmarks

InterleaveThinker sets new records on interleaved benchmarks, especially CoMM.

InterleaveThinker leads the CoMM benchmark with a style score 4.0 points higher than the nearest open‑source competitor.

Table 2 shows the runner‑up style score at 5.6 (MiniGPT‑5), while InterleaveThinker + Qwen‑Image‑Edit‑2511 attains 9.6.

We first evaluate on the UEval benchmark, which tests a model’s ability to generate interleaved text‑image sequences from a pure‑text prompt.

UEval measures how accurately a system can produce a mixed sequence of text and images when only a textual description is given.

The CoMM benchmark assesses interleaved generation when the input already contains images, focusing on style, entity, trend, completeness, image quality, and text‑image alignment.

CoMM evaluates a model’s ability to maintain visual style and semantic consistency across a sequence that mixes images and text.

On UEval, InterleaveThinker + FLUX.2‑klein‑9B outperforms all open‑source baselines and reaches parity with the proprietary Nano Banana.

On CoMM, the InterleaveThinker + Qwen‑Image‑Edit‑2511 configuration dominates every competitor, achieving the highest scores across all six metrics.

WISE, a reasoning‑based image‑generation benchmark, shows a similar uplift: InterleaveThinker improves the base FLUX.2‑klein‑9B model on cultural, time, and space reasoning tasks.

RISE, which tests image‑editing reasoning, mirrors the WISE gains, confirming that the planner‑critic loop benefits both generation and editing pipelines.

**Figure 6.** Comparison with Emu3.5 and Nano Banana Pro in pure-text input interleaved generation.

**Figure 7.** Comparison with Emu3.5 and Nano Banana Pro in multi-modal input interleaved generation.

**Table 2.** Comparison on CoMM [43]. Sty. and Enti. denote the style and entity consistency among generated images. Tren. denotes the trend alignment between the image and text sequences. Comp. denotes completeness, and ImgQ is the image quality. IRS is the text-image alignment score. x/x reflects the model’s performance on interleaved (Task 3) and pure-text (Task 4) inputs.

**Table.** Performance comparison of various proprietary and open-source models across Temporal, Causal, Spatial, Logical, and Overall metrics.

InterleaveThinker’s planner‑critic loop delivers markedly better multi‑modal consistency, setting new standards on both UEval and CoMM.

Ablation and Failure Analysis

Ablation experiments isolate the contribution of each InterleaveThinker component.

We evaluate the impact of each design choice on the UEval benchmark, using FLUX.2‑klein‑9B as the base image generator and reporting oracle upper bounds from Gemini‑2.5‑Pro and GPT‑4.1.

Planner‑SFT raises the Text score from 33.5 to 58.5, demonstrating the planner’s essential role.

Table 5 shows a jump to 58.5 when the Planner‑SFT module is added.

Full‑SFT provides a marginal further gain, reaching 58.6.

Full‑SFT (both Planner and Critic fine‑tuned) records 58.6 in Table 5.

Omitting the step‑wise reward ($R_{\text{step}}$) reduces the average score to 58.2, indicating its importance for prompt refinement.

Table 5 lists “RL w/o step reward” at 58.2.

Removing the accuracy reward ($R_{\text{acc}}$) lowers the score to 58.4, showing that precise evaluation is also critical.

Table 5 lists “RL w/o acc reward” at 58.4.

When the planner’s capabilities are merged into the critic (the “One‑Agent” configuration), performance collapses, confirming that a dedicated critic is necessary for visual quality.

Training the critic on unfiltered data causes it to output trivial constant predictions (e.g., always “True”), which dramatically hurts the overall scores.

**Figure 8.** Failing case of InterleaveThinker+FLUX.2-klein.

Implementation Details and Prompts

The appendix details the system prompts that drive planning, critique, and evaluation of multimodal generation.

The System Prompt defines the overall role of the Task Planner, Orchestrator, and Prompt Engineer, emphasizing analysis, execution planning, and optimization of every step into a concise T2I prompt or editing instruction.

The Planner Pure-Text System Prompt specifies how to translate a user request into a stepwise execution plan, enforcing dynamic step counts, complete final outputs, and strict separation of image generation versus auxiliary text.

The Task Planner, Orchestrator, and Prompt Engineer System expands on input handling, distinguishing Task A (text‑only) from Task B (sequential editing), and mandates prefix‑free fields for instructions, prompts, and auxiliary text.

The Planner Interleaved System Prompt adds the InterleaveThinker agentic loop, directing the system to generate or edit images step by step while preserving the “no prefix” rule.

The Critic System Prompt outlines a binary evaluation of edited images, checking intent matching and anomaly detection, and requires a refined prompt when the result is deemed insufficient.

The Refined VIEScore System Prompt formalizes scoring from 0 to 10 based on intent alignment and collateral‑damage avoidance, providing a structured rubric for final quality assessment.

Questions & answers

What is InterleaveThinker and what does it contribute?

InterleaveThinker is a multi-agent framework for interleaved text-image generation that introduces a Planner agent to map out a full generation sequence upfront and a Critic agent to iteratively evaluate and refine each step's output. It enables frozen, off-the-shelf image generators to handle complex long-horizon tasks without end-to-end retraining of the underlying models.

What problem does InterleaveThinker address?

It addresses the failure of Unified Multimodal Models (UMMs) to perform long-horizon interleaved generation, a problem caused by 'visual over-reliance,' where models myopically react to intermediate visual states and lose sight of the global objective, causing errors to accumulate across steps.

Why is a multi-agent approach necessary instead of training a single model?

A single model suffers from visual over-reliance, reacting to the most recent image rather than the overall goal, which causes small early errors to cascade into large final failures. Decoupling planning from execution forces the system to adhere to a pre-defined global trajectory by blocking intermediate visual feedback from the Planner.

How does the Planner-Critic pipeline work?

The Planner translates a user request into a global step-by-step execution plan, the Generator executes each step, and the Critic evaluates each output against the plan. If the Critic detects a deviation or quality issue, it generates a refined prompt for the Generator to retry that specific step before the pipeline proceeds.

Does InterleaveThinker require fine-tuning the underlying image generator?

No. The framework is model-agnostic and designed to retrofit frozen, off-the-shelf image generators; the base generator's weights remain unchanged while the Planner and Critic agents manage the generation workflow.

How does interleaved generation differ from simply chaining separate text-to-image calls?

Chaining treats each call as independent and discards the semantic link between steps, whereas interleaved generation preserves a shared trajectory in which the Planner's instruction sequence defines a global plan and the Critic enforces adherence, preventing drift that accumulates when calls are isolated.

What datasets were constructed to train InterleaveThinker?

Three datasets were constructed: Interleave-Planner-SFT-80k (80,000 examples of interleaved planning trajectories), Interleave-Critic-SFT-112k (112,000 supervised critic training examples), and Interleave-Critic-RL-13k (13,000 high-variance examples used for reinforcement learning). The prompts span eight high-level categories and 75 fine-grained sub-categories, expanded using Gemini 2.5 Pro.

How is the Critic trained, and what is the dual-reward strategy?

The Critic is initialized from Qwen3-VL-8B-Instruct and trained first with supervised fine-tuning (Critic-SFT) and then with single-step reinforcement learning via Group Relative Policy Optimization (GRPO) using a dual-reward signal. This approach aligns the Critic's interventions with trajectory-level goals while avoiding the computational cost of end-to-end trajectory RL.

What benchmarks were used to evaluate InterleaveThinker?

The paper evaluates on four benchmarks: UEval (interleaved text-image generation from pure-text prompts), CoMM (interleaved generation from image-containing inputs, covering style, entity, trend, completeness, image quality, and text-image alignment), WISE (reasoning-based image generation covering cultural, time, and space tasks), and RISE (image-editing reasoning).

What are the key quantitative results on UEval and CoMM?

On UEval, InterleaveThinker paired with FLUX.2-klein-9B outperforms all open-source baselines and reaches parity with the proprietary Nano Banana system. On CoMM, the InterleaveThinker + Qwen-Image-Edit-2511 configuration achieves the highest scores across all six evaluation metrics, dominating every competitor.

What do the WISE and RISE benchmark results show?

On WISE, InterleaveThinker improves the base FLUX.2-klein-9B model on cultural, time, and space reasoning tasks. On RISE, which tests image-editing reasoning, similar gains are observed, confirming that the Planner-Critic loop benefits both generation and editing pipelines.

What limitations does InterleaveThinker have?

The framework's reliance on a maximum refinement iteration count ($T_{\max}$) is a key trade-off: increasing $T_{\max}$ improves performance by allowing more correction attempts but also increases inference latency for long-horizon tasks. The paper does not report other explicit limitations beyond this latency-quality trade-off.

What do the ablation studies reveal about the framework's design choices?

Merging the Planner's capabilities into the Critic (the 'One-Agent' configuration) causes performance to collapse, confirming that a dedicated Critic is necessary for visual quality. Additionally, training the Critic on unfiltered data causes it to output trivial constant predictions (e.g., always 'True'), dramatically hurting overall scores.

How does InterleaveThinker differ from standard reinforcement learning agents?

In standard RL, the policy itself changes after each reward via policy-gradient updates. In InterleaveThinker, the Planner's plan is fixed after a single pass and only the prompt (a short piece of text) is adjusted by the Critic, with no weight updates inside the loop, so it incurs negligible training overhead and works with any frozen generator.

How does the Planner-SFT training differ from ordinary supervised fine-tuning of a language model?

Ordinary SFT predicts the next token given previous ones, but Planner-SFT trains the model to predict an entire plan of interleaved text-image actions in a single shot, so the model learns a global structure rather than a step-by-step continuation.

How was the training data filtered and split between SFT and RL subsets?

Critic trajectories with decreasing scores or low visual quality were discarded, retaining only upward or stable high-score steps. The remaining steps were split by score variance: high-variance steps formed the RL subset (Interleave-Critic-RL-13k) and low-variance steps formed the SFT subset, with iteration-wise judgments balanced to avoid bias.

What base models are used within the InterleaveThinker framework?

The Critic agent is initialized from Qwen3-VL-8B-Instruct. Image generators used in experiments include FLUX.2-klein-9B and Qwen-Image-Edit-2511. Oracle upper bounds in ablations are reported from Gemini-2.5-Pro and GPT-4.1.

Where was InterleaveThinker published and by whom?

The paper is available on arXiv at https://arxiv.org/abs/2606.13679. Its authors are Dian Zheng, Harry Lee, Manyuan Zhang, Kaituo Feng, Zoey Guo, Ray Zhang, and Hongsheng Li. No conference or journal venue is stated in the provided text.

Key terms

Interleaved Generation: A generation paradigm in which text and images are produced in an alternating, interdependent sequence rather than as a single isolated output.
Unified Multimodal Model (UMM): A single neural model trained to process and generate both text and images within a unified architecture.
Visual Over-Reliance: A failure mode in which a model myopically reacts to the most recent intermediate image rather than maintaining awareness of the overall generation goal.
Planner: The agent in InterleaveThinker responsible for translating a user request into a complete, global step-by-step execution plan before any image is generated.
Critic: The agent in InterleaveThinker that evaluates each generated image against the plan and produces a refined prompt if the output is deemed insufficient, without modifying the generator's weights.
Group Relative Policy Optimization (GRPO): A reinforcement learning algorithm used to train the Critic by optimizing its policy relative to a group of sampled outputs, used here in a single-step episode formulation.
Dual-Reward Strategy: A training signal that combines two complementary reward components to align the Critic's step-level interventions with trajectory-level generation quality goals.
$T_{\max}$: The maximum number of refinement iterations the Critic is allowed to perform per generation step, controlling the trade-off between output quality and inference latency.
Planner-SFT: A supervised fine-tuning procedure that trains the Planner to emit an entire interleaved text-image action plan upfront, before any generation happens, rather than continuing one step at a time.
Critic-SFT: A supervised fine-tuning stage that teaches the Critic to evaluate a generated image against a planned instruction and output a binary judgment plus a refined prompt.
UEval: A benchmark that tests a model's ability to generate interleaved text-image sequences starting from a pure-text prompt.
CoMM: A benchmark that evaluates interleaved generation when the input already contains images, assessing style, entity, trend, completeness, image quality, and text-image alignment.
WISE: A reasoning-based image-generation benchmark that tests performance on cultural, temporal, and spatial reasoning tasks.
RISE: A benchmark that evaluates reasoning capabilities in image-editing pipelines.
VIEScore: A scoring rubric used in InterleaveThinker that rates outputs from 0 to 10 based on intent alignment and avoidance of collateral damage to unintended image regions.
Interleave-Planner-SFT-80k: A dataset of 80,000 interleaved planning trajectories used to train the Planner agent via supervised fine-tuning.
Interleave-Critic-SFT-112k: A dataset of 112,000 examples used to train the Critic agent via supervised fine-tuning on step-level evaluation and prompt refinement.
Interleave-Critic-RL-13k: A dataset of 13,000 high-variance examples used to further train the Critic agent via reinforcement learning.
Qwen3-VL-8B-Instruct: The pretrained vision-language model used as the base for initializing the Critic agent in InterleaveThinker.
FLUX.2-klein-9B: An off-the-shelf image generation model used as a frozen base generator in InterleaveThinker experiments.
