LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

Jianzong Wu, Hao Lian, Jiongfan Yang, Dachao Hao, Ye Tian, Yunhai Tong, Jingyuan Zhu, Biaolong Chen, Qiaosong Qi, Aixi Zhang, Wanggui He, Mushui Liu, Jinlong Liu, Pipei Huang, Hao Jiang

LoomVideo uses a zero-overhead conditioning mechanism to unify video generation and editing at 5B parameters.

How can we unify video generation and editing into a single, efficient model that handles interleaved multimodal inputs without relying on massive, multi-billion parameter architectures?

Unified video editing models typically concatenate source and target video tokens, which doubles sequence length and quadruples the computational cost of self-attention. LoomVideo replaces token concatenation with a Scale-and-Add mechanism: it scales the clean source video latent by the current timestep and adds it directly to the noised target latent, introducing zero additional tokens. This 5B-parameter architecture achieves state-of-the-art performance on e-commerce benchmarks while delivering at least a 5.41× inference speedup over concatenation-based models.

Paper Primer

The core mechanism hinges on two architectural shifts: Deepstack injection and Scale-and-Add conditioning. Deepstack injection extracts hidden states from every layer of the Multimodal Large Language Model (MLLM) and injects them into the corresponding layers of the Diffusion Transformer (DiT) via cross-attention, ensuring deep semantic alignment. Scale-and-Add conditioning is the primary efficiency move: it treats the source video as a latent-space bias rather than an input sequence, effectively bypassing the quadratic complexity of attention-based token concatenation.

LoomVideo achieves significantly faster inference speeds than existing unified models.

Compared against the fastest baseline, OmniWeaving, LoomVideo achieves a 6.24× speedup on T2V generation and a 5.41× speedup on video editing.

The model maintains competitive or superior performance on complex editing tasks despite its smaller parameter count.

Outperforms the second-place VINO by 7% on RefVIE-Bench and secures top scores on the custom FashionVideoBench.

Why does the paper replace the standard T5 text encoder with an MLLM?

The authors argue that relying on a standard text encoder limits the generative prior's ability to utilize rich, hierarchical semantic information. Using an MLLM allows for deeper multimodal interaction, which is essential for complex, instruction-following editing tasks.

What is the role of the Negative Temporal RoPE index?

It allows the model to distinguish between target video frames and reference images by assigning them distinct, non-overlapping positional indices. This enables robust multi-image guidance without disrupting the spatiotemporal dynamics of the video generation process.

Motivation and Problem Framing

Introducing LoomVideo, a compact unified video model that overcomes the cost of massive‑parameter designs.

Current unified video models depend on huge backbones (≈13B parameters) and concatenate source-video tokens with target tokens for editing. This doubles the sequence length and roughly quadruples the cost of self-attention, making training and inference prohibitively expensive.
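
The fourfold figure follows from the quadratic scaling of self-attention in sequence length. A minimal sketch of that arithmetic, counting only the two attention matmuls; the token count and head dimension below are hypothetical and not taken from the paper:

```python
def self_attention_cost(num_tokens: int, dim: int) -> int:
    """Approximate multiply-accumulate count for the two attention matmuls:
    Q @ K^T (n x n x d) and the attention-weighted sum over V (n x n x d)."""
    return 2 * num_tokens * num_tokens * dim


# Hypothetical sizes, chosen only for illustration.
target_tokens = 1024
head_dim = 64

cost_target_only = self_attention_cost(target_tokens, head_dim)
# Concatenating source tokens doubles the sequence length.
cost_concatenated = self_attention_cost(2 * target_tokens, head_dim)

cost_ratio = cost_concatenated / cost_target_only
print(cost_ratio)  # 4.0
```

The ratio is independent of the sizes chosen, since (2n)² / n² = 4. Linear-cost parts of the network (projections, feed-forward layers) only double, so the end-to-end slowdown from concatenation depends on how much of the runtime attention accounts for.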

LoomVideo keeps the expressive power of large video foundations while cutting compute by integrating a Multimodal Large Language Model (MLLM) and lightweight conditioning tricks.

**Figure 1.** Showcase of LoomVideo across diverse video generation and editing scenarios, including foundational video generation and editing as well as reference-image-guided video generation and editing.

The Deepstack Injection Mechanism

This section details the core Deepstack injection mechanism and the supporting designs that enable efficient multimodal video generation.

Using only the final‑layer embedding of the MLLM leaves most of the hierarchical semantic information unused, which limits the model’s ability to fuse text, images, and video cues.

Instead of a single “summary” vector, the model injects a stack of layer‑wise embeddings from the MLLM into the corresponding layers of the DiT, letting fine‑grained multimodal cues influence generation at every depth.

Layer 1 hidden state $h_1^{\text{mllm}} = (1,0,0,1)$ → MLP output $c_1 = (0.5,\,0.5)$.

Cross‑attention uses $c_1$ as key/value, updating DiT layer 1 feature $d_1^{\text{dit}}$.

Layer 2 hidden state $h_2^{\text{mllm}} = (0,1,1,0)$ → $c_2 = (0.5,\,0.5)$ (same projection).

Cross‑attention injects $c_2$ into DiT layer 2, producing $d_2^{\text{dit}}$.

Layer 3 hidden state $h_3^{\text{mllm}} = (1,1,0,0)$ → $c_3 = (0.5,\,0.5)$.

Final DiT layer 3 receives $c_3$, completing the deep stack injection.

Even with a tiny MLP, injecting layer‑wise embeddings lets the diffusion backbone attend to nuanced multimodal signals that a single final vector would miss.
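
The toy walkthrough above can be sketched in NumPy. This is an illustrative simplification, not the paper's implementation: the projection weights (all 0.25) are hypothetical and chosen only so that each hidden state maps to (0.5, 0.5); the cross-attention uses identity Q/K/V projections; and the rest of each DiT layer (self-attention, feed-forward) is omitted.

```python
import numpy as np

# Hypothetical shared projection from a 4-dim MLLM hidden state to a
# 2-dim condition. The 0.25 weights exist only to reproduce the toy numbers.
W_proj = np.full((4, 2), 0.25)


def project(hidden):
    """Stand-in for the MLP: a single linear map."""
    return hidden @ W_proj


def cross_attention(query, context):
    """Single-head cross-attention with identity Q/K/V projections."""
    scores = query @ context.T / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ context


# One hidden state per MLLM layer, as in the example above.
mllm_hidden = [
    np.array([[1.0, 0.0, 0.0, 1.0]]),
    np.array([[0.0, 1.0, 1.0, 0.0]]),
    np.array([[1.0, 1.0, 0.0, 0.0]]),
]

dit_feature = np.zeros((3, 2))  # 3 hypothetical video tokens, 2 channels
conditions = []
for hidden in mllm_hidden:
    c = project(hidden)  # layer-wise condition, shape (1, 2)
    conditions.append(c)
    # Inject the matching-depth condition with a residual connection.
    dit_feature = dit_feature + cross_attention(dit_feature, c)

print(conditions[0])  # [[0.5 0.5]]
print(dit_feature)    # every token ends at [1.5 1.5]
```

Because each layer here has a single context token, the softmax weight is 1 and every video token simply receives that layer's condition. With real MLLM sequences, each DiT layer would attend over many context tokens drawn from the matching MLLM depth.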

How does Deepstack injection differ from the simpler “final‑layer only” conditioning used in earlier video models?

The earlier approach feeds only $h_{L}^{\text{mllm}}$ (the top‑most hidden state) into the diffusion model, discarding intermediate representations that capture lower‑level visual patterns. Deepstack injection preserves those intermediate signals by projecting each $h_l^{\text{mllm}}$ and injecting them at the matching DiT depth, which yields richer cross‑modal alignment without adding per‑layer adapters.

Scale‑and‑Add conditioning replaces token concatenation with a lightweight latent‑mixing operation: the clean source video latent is scaled by the current timestep and added to the noised target latent.
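
A minimal sketch of the idea, under stated assumptions: the noising rule below is a rectified-flow style interpolation, which the summary does not confirm; the scale factor is taken to be the timestep `t` itself; and the latent shape is hypothetical. The function names `noise_target` and `scale_and_add` are illustrative, not from the paper.

```python
import numpy as np


def noise_target(target_latent, noise, t):
    # Assumed rectified-flow style interpolation; the actual schedule may differ.
    return (1.0 - t) * target_latent + t * noise


def scale_and_add(noised_target, source_latent, t):
    # Scale the clean source latent by the timestep and add it element-wise.
    # The output keeps the target's shape, so no extra tokens are created.
    return noised_target + t * source_latent


rng = np.random.default_rng(0)
shape = (4, 2, 8, 8)  # hypothetical (channels, frames, height, width)
source_latent = rng.standard_normal(shape)
target_latent = rng.standard_normal(shape)
noise = rng.standard_normal(shape)

t = 0.7
noised = noise_target(target_latent, noise, t)
dit_input = scale_and_add(noised, source_latent, t)

# For contrast: concatenation along the frame axis doubles the latent length.
concat_input = np.concatenate([noised, source_latent], axis=1)

print(dit_input.shape)     # (4, 2, 8, 8)
print(concat_input.shape)  # (4, 4, 8, 8)
```

Under these assumptions the source contributes most strongly at high noise levels, where the target latent carries little structure of its own, and fades as `t` approaches 0.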

Negative Temporal RoPE indices assign reference images negative timestamps (e.g., $- \tau, -2\tau$), letting the model distinguish them from target frames that use positive indices.
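
The index assignment can be sketched as follows. The helper name, the choice to start target frames at index 0, and the value of the spacing `tau` are assumptions for illustration; the source only states that reference images receive negative indices such as $-\tau, -2\tau$.

```python
def temporal_rope_indices(num_target_frames, num_reference_images, tau=1):
    """Return temporal position indices for target frames and reference images.

    Target frames get 0, 1, 2, ... (assumed starting point);
    the k-th reference image gets -(k + 1) * tau.
    """
    target = list(range(num_target_frames))
    reference = [-(k + 1) * tau for k in range(num_reference_images)]
    return target, reference


target_idx, reference_idx = temporal_rope_indices(
    num_target_frames=4, num_reference_images=2, tau=3
)
print(target_idx)     # [0, 1, 2, 3]
print(reference_idx)  # [-3, -6]
```

Since the two index sets never overlap, the rotary embedding gives every reference image a temporal position distinct from any target frame, and adding more reference images only extends the negative range.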

**Figure 2.** **Overall Architecture of LoomVideo.** LoomVideo seamlessly processes interleaved multimodal inputs using an MLLM. It employs three key designs: (1) a Deepstack injection mechanism, which extracts feature embeddings from every layer of the MLLM and injects them into the corresponding layers of the DiT via cross-attention; (2) a zero-overhead Scale-and-Add conditioning approach for video editing, which scales the clean source video latent by the current timestep and directly adds it to the noised target latent, completely bypassing the severe inefficiency of token concatenation; and (3) a Negative Temporal RoPE index for multiple reference images.

Stage 1 – MLLM alignment: replace the T5 encoder with the MLLM, train on text-to-image/video data at 256p resolution, using a large batch (≈640) and a 4:1 image-to-video token ratio.

Stage 2 – Reconstruction and editing: increase resolution to 480p, add instruction-based editing data, and introduce a reconstruction task to sharpen visual fidelity.

Stage 3 – Multi‑task fine‑tuning: expose the model to the full mixture of datasets (reference‑guided, multi‑reference, etc.), skewing sampling toward the hardest tasks.

Post‑training reinforcement learning: apply DiffusionNFT with PickScore as the reward model to improve aesthetic quality.

Table 1 (referenced) lists the exact hyper‑parameters used in each stage, including learning rate $2.0\times10^{-5}$, batch sizes, and dynamic timestep‑shift scales.

Implementation and Benchmarking

We benchmark LoomVideo on generation, editing, and efficiency across multiple suites.

We evaluate LoomVideo on a suite of generation, editing, and efficiency benchmarks to validate its performance and speed.

A 5‑billion‑parameter Text‑to‑Video model that serves as a strong baseline for video generation quality.

An 8‑billion‑parameter multimodal LLM that supplies rich cross‑modal features for downstream video tasks.

LoomVideo runs $6.24\times$ faster than the fastest baseline on Text‑to‑Video generation.

Table 8 reports a 6.24× speedup over OmniWeaving for T2V generation.

On VBench, LoomVideo surpasses Wan 2.2 in average score and leads in Imaging Quality and Overall Consistency (see Table 2).

On OpenVE‑Bench, the Stage 2 version of LoomVideo attains the highest overall score, excelling particularly in the Creative Edit metric (Table 3).

RefVIE-Bench evaluation shows LoomVideo outperforms all open-source baselines by 7% in overall score, with a clear margin over VINO (Table 4).

For IntelligentVBench, LoomVideo leads on the TIV2V sub-task with an 8% gain over OmniWeaving (Table 5) and matches UniVideo on the more demanding MI2V sub-task (Table 6).

FashionVideoBench, our e‑commerce‑focused benchmark, records LoomVideo as the top performer across all six sub‑tasks (Table 7).

**Prompt:** Apply the Impressionist aesthetic to this video...The result should emulate the fluid brushstroke techniques and atmospheric focus of 19th-century Impressionist art...

**Figure 4.** Low-quality generation cases of LoomVideo.

**Figure 5.** Qualitative results for LoomVideo on Text-to-Video and Instruction Editing tasks.

LoomVideo delivers up to $6.24\times$ faster inference than larger baselines while preserving state‑of‑the‑art quality.

Quantitative Performance Analysis

LoomVideo (Stage 3) leads the benchmarks, achieving the top overall score on FashionVideoBench.

LoomVideo (Stage 3) attains the highest overall score of 4.60 on FashionVideoBench, beating the next-best model, UniVideo, by 0.34 points.

Table 7 reports an overall score of 4.60 for LoomVideo (Stage 3) versus 4.26 for UniVideo.

IntelligentVBench is a benchmark suite that evaluates video‑generation models on compositional multi‑input‑to‑video (MI2V) tasks, measuring both fidelity and compositionality across several metrics.

Extended Qualitative Results

We recap LoomVideo’s efficiency and then detail the ablation evidence.

LoomVideo keeps the unified video generation pipeline lightweight (5B parameters) while still handling many tasks. Below we examine how each component contributes to that balance.

FashionVideoBench measures a model’s ability to edit videos of fashion items across several fine‑grained tasks, using three human‑rated dimensions (subject consistency, prompt following, visual quality).

How does FashionVideoBench differ from generic video generation benchmarks?

Generic benchmarks usually evaluate a single task (e.g., unconditional generation) with a single quality metric. FashionVideoBench explicitly isolates editing subtasks, requires a reference image for some tasks, and scores three orthogonal aspects, so it reveals where a model excels or fails in fine‑grained editing.

**Table 7.** Results on FashionVideoBench.

Figures 6–18 illustrate qualitative edits such as background replacement, object insertion, and motion‑transfer, confirming that the model’s versatility extends beyond the numeric scores.

**Figure.** The image displays a sequence of six frames showing a man cooking. The first three frames show the man in a dark grey shirt, and the last three frames show the man in a light blue shirt, separated by a grey arrow. Below the frames, a text prompt describes a request to replace the background with a desert highway scene featuring heat waves and drifting dust, while keeping the man and SUV still.

**Figure.** A sequence of video frames showing a man standing next to a blue SUV. The left group of three frames shows the original background (a parking lot/building), while the right group of three frames shows the background replaced with a desert landscape, separated by a grey arrow indicating a transformation process.

**Figure.** The image displays a sequence of frames showing a white grand piano on a circular stage against a dark, abstract background with blue floating particles. The sequence is divided into two groups of three frames, separated by a grey arrow pointing to the right. Below the visual sequence, a text prompt is provided: "Prompt: Overlay an animated colorful kite flying in the upper right quadrant of the sky. The kite should move gently with the wind, swaying and fluttering its tail realistically..."

**Figure.** A visual demonstration of video editing/transformation. The left side shows three frames of a field with a running track. A large grey arrow points to the right side, which shows the same three frames but with a colorful kite added to the sky in each frame. Below this, a text prompt describes a different task: "Given the video of a man's natural hands typing on a black laptop keyboard, transform the hands instantly into sleek, metallic robotic hands with articulated joints and glowing blue lights."

**Figure.** A sequence of six images showing a person's hands typing on a laptop keyboard. The left three images show natural human hands, while the right three images show the same hands replaced with metallic, robotic prosthetic hands. A grey arrow between the two sets of images indicates a transformation process.

**Figure 6.** Qualitative results for LoomVideo on the Instruction Editing task.

**Figure 7.** Qualitative results for LoomVideo on the Instruction-Image Editing task.

**Figure.** The image displays a sequence of photographs showing a woman in a floral dress on the left, transitioning via an arrow to a sequence of photographs showing the same woman in a black blazer dress on the right.

**Figure.** The image displays a sequence of video frames before and after an editing process. The left side shows a series of six frames of a woman in a purple shirt, and the right side shows a corresponding series of six frames where the woman is now holding a black handbag. Below the frames, a text prompt reads: "Edit this video with reference images: Replace the floral embroidered boots with classic black leather ankle boots."

**Figure.** Prompt: Generate a video that follows the movement of people in the reference video and the people and background in the reference image:

**Figure.** A visual demonstration of image-to-image generation or editing, showing a sequence of input images on the left and a sequence of generated output images on the right, separated by an arrow. Below the images, a text prompt specifies the desired attributes and actions for the subject, referencing specific source images.

**Figure.** The image illustrates a multi-modal image generation process. On the left, two source images are provided: "Image 1" (a street background) and "Image 2" (a pair of sunglasses). A large arrow points to a sequence of six generated images showing a man in a black coat, white shirt, and blue tie, wearing sunglasses, standing in the street background. Below the images, a text prompt specifies the desired composition: "Position the man (@Image 2) in the street background (@Image 1) wearing the sunglasses, turtleneck, leather jacket (@Image 3), and trousers, with the watch on his left wrist. Have the man (@Image 2) pull and adjust the hem of the leather jacket (@Image 3) with both hands."

**Figure 8.** Qualitative results for LoomVideo on our benchmark FashionVideoBench.

Conclusion and Future Work

We close with LoomVideo’s impact, future directions, and author credits.

We presented LoomVideo, a compact 5B‑parameter model that delivers over 5.4× faster inference while matching state‑of‑the‑art performance on open‑domain benchmarks. The system combines three novel components—Deepstack injection for deep multimodal alignment, zero‑overhead Scale‑and‑Add conditioning to replace token concatenation, and Negative Temporal RoPE for multi‑reference guidance—enabling robust, non‑rigid edits across diverse video tasks. Looking ahead, we plan to enlarge the diffusion transformer and extend our multi‑resolution pipeline to support higher‑definition and longer‑duration video generation.

Authors are listed by affiliation, with * denoting equal contribution, § indicating corresponding authors, and † marking the project leader. Peking University: Jianzong Wu*†, Hao Lian*, Jiongfan Yang, Dachao Hao, Ye Tian, Yunhai Tong§. Alibaba Group: Jingyuan Zhu, Biaolong Chen, Qiaosong Qi, Aixi Zhang, Wanggui He, Mushui Liu, Jinlong Liu, Hao Jiang§. We thank MSALab at Peking University and Alibaba Group for their support and discussions throughout this project.
