LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation across T2AV, I2AV, and V2AV

Tengfei Liu, Yang Shi, Xuanyu Zhu, Jiafu Tang, Liu Yang, Qixun Wang, Zhuoran Zhang, Yuqi Tang, Fengxiang Wang, Yuhao Dong, Xinlong Chen, Bozhou Li, Bohan Zeng, Yue Ding, Xiaohan Zhang, Jialu Chen, Haotian Wang, Yuanxing Zhang, Pengfei Wan, Leye Wang

A unified benchmark for evaluating minute-long audio-visual generation across text, image, and video inputs.

How can we systematically evaluate minute-scale audio-visual generation across different input modalities (text, image, video) to identify long-range consistency and event-level failures?

Current video generation benchmarks focus on short, 5–10 second clips, leaving models untested on the narrative coherence and temporal consistency required for minute-long content. LongAV-Compass provides a diagnostic framework that evaluates generation across 284 test cases, using a two-dimensional taxonomy of application scenarios and structural complexity to measure performance beyond simple visual quality. The benchmark reveals that leading proprietary models like Seedance 2.0 maintain superior long-form consistency, while most models struggle significantly with the causal and narrative demands of complex, minute-scale generation.

Paper Primer

LongAV-Compass addresses the "short-form bias" in current evaluation by introducing a unified protocol for Text-to-Audio-Video (T2AV), Image-to-Audio-Video (I2AV), and Video-to-Audio-Video (V2AV) tasks. The core move is a dual-representation annotation schema: each test case includes both a global narrative description and a temporally aligned event sequence, allowing for diagnostic assessment of both high-level story coherence and event-level execution.

Performance drops as structural complexity increases, most sharply for open-source and agent-based models.

Composite scores for proprietary models remain relatively stable (75.0 to 73.9) across difficulty levels, while open-source models drop from 57.9 to 51.4, and agent-based methods fall from 47.3 to 41.2. The gap between proprietary and open-source models widens from 17.1 to 22.5 points as task complexity scales from L1 to L4.
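The family-level figures quoted above can be turned into absolute and relative drops with a few lines of arithmetic. This is a minimal sketch that uses only the L1 and L4 composite scores reported in this summary; the dictionary layout and the `drop` helper are illustrative names, not part of the benchmark.

```python
# Composite scores at the easiest (L1) and hardest (L4) levels, as quoted above.
scores = {
    "proprietary": (75.0, 73.9),
    "open-source": (57.9, 51.4),
    "agent-based": (47.3, 41.2),
}

def drop(l1: float, l4: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop in percent of the L1 score)."""
    absolute = l1 - l4
    return round(absolute, 1), round(100 * absolute / l1, 1)

for family, (l1, l4) in scores.items():
    abs_drop, rel_drop = drop(l1, l4)
    print(f"{family}: -{abs_drop} points ({rel_drop}% of the L1 score)")
```

Read this way, the proprietary family loses about one point, while the open-source and agent-based families each lose more than six.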

Why is a unified benchmark necessary for these three different input modalities?

Existing benchmarks are fragmented, making it impossible to compare how different conditioning inputs (text, image, or video) affect long-range stability. A unified framework allows researchers to isolate whether a failure is due to the conditioning interface or a fundamental inability to maintain narrative structure over time.

Does native audio support guarantee better audio-visual synchronization?

No. The results show that models with native audio capabilities still differ substantially in long-form audio quality, indicating that audio-visual synchronization remains a distinct bottleneck that does not automatically improve with the ability to generate sound.

Researchers should shift focus from short-clip visual fidelity to event-level fulfillment and long-range narrative continuity. Future model development must prioritize causal progression and cross-event consistency to succeed in minute-scale applications like commercial advertising.

Introduction

We expose why short-form benchmarks miss long-range failures and introduce the unified LongAV‑Compass evaluation.

Current video generation benchmarks are limited to short clips, which fail to expose the accumulation of errors and transition failures inherent in minute‑scale audio‑visual generation. Consequently, models are not pressured to maintain identity consistency, narrative coherence, or audio‑visual alignment over longer horizons.

The three conditioning types (text for T2AV, image for I2AV, and video for V2AV) define what the model receives as input before generating a minute‑long audio‑visual sequence.

Step 1 – The model produces the initial 5 seconds showing the chef chopping vegetables, with correct kitchen tools and matching sizzling sounds.

Step 2 – At 30 seconds the chef’s apron color changes unexpectedly, indicating a loss of identity consistency.

Step 3 – After 40 seconds the audio track continues the sizzling sound while the visual shows a silent plating scene, revealing a misalignment between audio and video.

Evaluating only the first few seconds hides these later failures; minute‑scale benchmarks are needed to surface identity drift and audio‑visual desynchronization.

The shift from short-form to minute‑scale generation requires new evaluation paradigms.

The Evaluation Gap

We identify the gap in long-form evaluation and introduce the LongAV-Compass benchmark.

Current evaluation pipelines remain anchored to short‑form clips, where a single segment suffices to judge visual fidelity or coarse semantic alignment. Benchmarks such as VBench [8] and EvalCrafter [14] have standardized short‑video assessment, while VABench [7] and T2AV‑Compass [2] extend evaluation to synchronized audio‑visual output. However, these designs do not capture the failures that emerge only over minute‑scale generation.

**Figure 1.** Overview of LongAV-Compass. The benchmark unifies T2AV, I2AV, and V2AV under shared taxonomy, event-level annotation and a hierarchical evaluation framework, enabling diagnosis of long-range audio-visual failures beyond flat leaderboard comparison.

Three concrete limitations follow from this short‑form focus. First, benchmarks operate on a temporal scale that cannot reveal whether models stay coherent over minute‑long generation. Second, coverage is fragmented across input conditions, making it hard to compare Text‑to‑Audio‑Video (T2AV), Image‑to‑Audio‑Video (I2AV), and Video‑to‑Audio‑Video (V2AV) systems under a single protocol. Third, existing evaluations lack diagnostic visibility into long‑range degradation such as cross‑event identity drift, unstable transitions, and audio‑visual sync decay.

LongAV‑Compass provides a unified, minute‑scale benchmark that decomposes long videos into event‑aligned segments and evaluates them across a rich set of dimensions.

The model produces Event 1: sunrise video (frames 0‑120) with ambient birdsong.

Event 2 generation starts at frame 121; the model reuses the sunrise lighting parameters, yielding a daylight‑inconsistent visual.

Audio for Event 2 is a copy of the sunrise birdsong, failing to reflect urban traffic sounds.

This failure illustrates that short‑form benchmarks miss cross‑event consistency errors, which only become visible when evaluating minute‑scale, multi‑event generation.

Prior Benchmarks

We situate LongAV‑Compass among existing video and audio‑visual benchmarks.

Prior work has focused on short‑form video suites and emerging audio‑visual benchmarks, each addressing a slice of the multimodal generation problem.

VBench is a short‑form video benchmark that measures visual fidelity, motion realism, semantic alignment, and prompt adherence for text‑conditioned clips.

Compared with earlier suites, LongAV‑Compass uniquely offers unified coverage of T2AV, I2AV, and V2AV (collectively X2AV) and evaluates generations longer than one minute, directly targeting the long‑range failures that short‑form suites miss.

Benchmark Design

Design of the LongAV‑Compass benchmark, covering tasks, taxonomy, data pipelines, and annotation.

Define three long‑form generation tasks (T2AV, I2AV, V2AV) and their input modalities.

Establish a two‑dimensional taxonomy: four scenarios (Personal Vlog, Brand AD, Performance AD, Content Creator) and four complexity levels (L1–L4).

Specify prompt detail levels (short, medium, long) as an orthogonal variable.

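Build data for each task: T2AV and I2AV cases come from two complementary routes, scenario‑template‑based LLM generation and real‑video‑based transcription or adaptation, while V2AV cases are built from real videos by extracting reference clips and generating continuation scripts.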

Convert every case to a unified annotation format: a global description plus an event‑level sequence.

Run dual quality control—automatic MLLM review followed by human validation—to filter out implausible or low‑quality samples.

**Figure 3.** Data construction pipeline of LongAV-Compass. LongAV-Compass builds its benchmark data for three task types: T2AV, I2AV, and V2AV. T2AV and I2AV cases are obtained through two complementary routes: scenario-template-based LLM generation and real-video-based transcription or adaptation. V2AV cases are constructed from real videos by extracting reference clips and generating continuation scripts. After task-specific construction, all cases are converted into a shared event-level annotation format and filtered through dual quality control with MLLM review and human validation.

**Figure 2.** Scenario and difficulty distribution in LongAV-Compass. The benchmark spans four application scenarios and multiple complexity levels (L1–L4), supporting analysis by both content domain and generation difficulty.

**Table 2.** Task coverage in LongAV-Compass. S, RI, and RV denote script, reference image, and reference video, respectively.

Event 1 spans seconds 0–20, specifies a camera pan and background music.

Event 2 spans seconds 20–80, defines actor A holding the product and a voice‑over line.

Event 3 spans seconds 80–120, requires a fade‑out and a logo animation.

This concrete breakdown shows how a high‑level script is decomposed into temporally aligned events that later serve as fine‑grained evaluation points.
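The event breakdown above can be represented as a small data structure. The sketch below is one possible encoding, assuming each event carries an identifier, a start and end time in seconds, and a free-text description; the `Event` class and `is_contiguous` helper are hypothetical names, not the benchmark's own code.

```python
from dataclasses import dataclass

@dataclass
class Event:
    event_id: str
    start_s: float      # start time in seconds
    end_s: float        # end time in seconds
    description: str

def is_contiguous(events: list[Event]) -> bool:
    """True if the list is non-empty, every event has positive length,
    and each event starts exactly where the previous one ends."""
    if not events:
        return False
    for prev, cur in zip(events, events[1:]):
        if cur.start_s != prev.end_s:
            return False
    return all(e.end_s > e.start_s for e in events)

# The three events from the example above.
script = [
    Event("event1", 0, 20, "camera pan with background music"),
    Event("event2", 20, 80, "actor A holds the product; voice-over line"),
    Event("event3", 80, 120, "fade-out and logo animation"),
]
total_duration = script[-1].end_s - script[0].start_s
```

A contiguity check of this kind is a cheap sanity test before any per-event scoring, since gaps or overlaps would leave parts of the video unscored or scored twice.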

Evaluation Metrics

We detail the metrics and scoring procedure used to assess long-form video and audio generation.

LongAV‑Compass evaluates each generated sample by first aligning it with the event annotations, then measuring video and audio quality across a suite of complementary metrics. The pipeline is deterministic and applies the same set of scores to every model, enabling fair comparison.

Metrics are computed on segments that correspond exactly to the annotated events, so each score reflects performance on the intended semantic unit rather than on arbitrary frame windows.

How does event‑aligned evaluation differ from the frame‑wise scoring used in earlier video benchmarks?

Earlier benchmarks aggregate scores over uniformly sampled frames, which can dilute errors that occur only on specific events. Event‑aligned evaluation restricts each metric to the exact temporal span of an annotated event, so a failure on any event directly lowers the overall score.
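The dilution effect can be seen on a toy signal. The sketch below compares a plain frame average with a per-event average; it assumes per-frame scores in the 0 to 1 range and events given as frame-index ranges, and it uses a simple mean across events, which may differ from the benchmark's actual aggregation.

```python
def framewise_score(frame_scores: list[float]) -> float:
    """Average over all sampled frames, ignoring event boundaries."""
    return sum(frame_scores) / len(frame_scores)

def event_aligned_score(frame_scores: list[float],
                        events: list[tuple[int, int]]) -> float:
    """Average within each event first, then across events.
    Each event is a (start, end) pair of frame indices, end exclusive."""
    per_event = [sum(frame_scores[s:e]) / (e - s) for s, e in events]
    return sum(per_event) / len(per_event)

# Toy input: a long event that succeeds and a short event that fails completely.
frame_scores = [1.0] * 90 + [0.0] * 10
events = [(0, 90), (90, 100)]

print(framewise_score(frame_scores))              # the short failure is diluted
print(event_aligned_score(frame_scores, events))  # the failed event counts as one of two
```

In this toy case the frame average stays near 0.9 while the event-aligned score falls to 0.5, because the failed event weighs as much as the successful one.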

The Balanced Score is a single scalar that combines the six video and three audio metrics using a weighted average, providing an overall performance indicator.

Why are the metric weights not learned automatically from data?

The paper treats the weights as design choices that encode the evaluation priorities of the benchmark (e.g., emphasizing continuity over visual fidelity). Learning them would entangle the benchmark’s intent with model‑specific performance patterns, defeating the purpose of a fixed evaluation protocol.

Generate the full‑length video (and audio, if supported) from the text prompt using the target model.

Align the generated video with the event annotations to obtain per‑event clips.

For each event clip, run the MLLM to compute VQA, VQ, Cont., Trans., Hol., and TVAlign scores.

If audio is present, compute AVS, AudQ, and AudL on the synchronized audio‑video stream.

Bring all scores onto a common $0$ to $1$ scale: VQA and TVAlign are already reported in that range, while the remaining metrics are rated on a $1$ to $5$ scale and must be rescaled.

Apply the Balanced Score weighting scheme and aggregate across events to obtain video‑level results.

Normalize VQA and TVAlign to $0\!-\!1$ (already in that range); normalize VQ, Cont., Trans., Hol. from $1\!-\!5$ to $0\!-\!1$ by $(x-1)/4$.

Compute per‑event averages: Event 1 video average = $(0.8 + 0.8 + 0.8 + 0.8 + 0.8)/5 = 0.8$; Event 2 video average = $(0.6 + 0.7 + 0.7 + 0.7 + 0.7)/5 = 0.68$.

Apply the Balanced Score weights (e.g., equal weight for each video metric, zero for audio) and average across the two events, yielding a final Balanced Score ≈ 0.74.

This toy example shows how a single low VQA score can drag down the overall Balanced Score, even when other metrics are high, illustrating the sensitivity of the aggregate to event‑level failures.
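The arithmetic of the toy example can be reproduced directly. This is a minimal sketch assuming equal weights over the listed video metrics and zero weight for audio, as in the example; the first entry of each list is taken to be VQA, and the function names are illustrative.

```python
def rescale_1_to_5(x: float) -> float:
    """Map a 1-5 rating onto the 0-1 range via (x - 1) / 4."""
    return (x - 1.0) / 4.0

def event_average(scores: list[float]) -> float:
    """Equal-weight mean of the already-normalized metric scores of one event."""
    return sum(scores) / len(scores)

# Normalized per-metric scores from the toy example (first entry assumed to be VQA).
event1 = [0.8, 0.8, 0.8, 0.8, 0.8]
event2 = [0.6, 0.7, 0.7, 0.7, 0.7]

per_event = [event_average(event1), event_average(event2)]
balanced = sum(per_event) / len(per_event)  # equal weight per event, audio weight zero
print([round(v, 2) for v in per_event], round(balanced, 2))
```

Swapping the equal weights for a different weighting scheme only changes `event_average`; the per-event then cross-event structure stays the same.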

T2AV Performance

Proprietary models set the top event‑fulfillment scores on the T2AV benchmark.

Proprietary models achieve higher event‑level fulfillment than open‑source models on the T2AV task.

Best proprietary VQA score = 0.9274 (Seedance 2.0) versus best open‑source VQA score = 0.5994 (Open‑Sora) in Table 3.

All models were run on the identical LongAV‑Compass benchmark, sharing the same prompts, evaluation scripts, and scenario splits. The only variable was the underlying generation architecture (proprietary, open‑source, or agent‑based), ensuring a fair comparison of event fulfillment.

**Figure 4.** Scenario-level balanced scores on T2AV task. For each scenario, each bar reports the mean balanced score of one model over all available samples in that scenario.

Proprietary models generally outperform open‑source models in event fulfillment.

I2AV Performance

I2AV evaluation shows Seedance 2.0 leading on most metrics.

Seedance 2.0 outperforms all competitors on I2AV, leading in five of seven evaluation dimensions.

Table 4 shows Seedance 2.0 has the highest scores in VQ, Hol., AVS, AudQ, and AudL; Figure 5 shows it achieves the highest scenario‑level balanced scores across all four scenarios.

All models were evaluated under the same LongAV‑Compass protocol, using identical video and audio diagnostics and the same set of four application scenarios.

**Table 4.** Main results on I2AV task. In addition to shared video and audio diagnostics, we report image alignment through first-frame anchoring and CLIP-based event-level image-video alignment. The highest score in each dimension is boldfaced and highlighted in green.

**Figure 5.** Scenario-level balanced scores on I2AV task. For each scenario, each bar reports the mean balanced score of one model over all available samples in that scenario.

V2AV Performance

Proprietary models dominate V2AV performance across the full metric suite.

Seedance 2.0 outperforms every open‑source model on V2AV by a clear margin across eight evaluation dimensions.

Table 5 shows Seedance 2.0 leading on all Event, Consistency, Global‑Pres, and Text‑Align scores; the next‑best open‑source model (Helios‑Distilled) trails by 0.15–0.30 points on average.

All models were evaluated under the same LongAV‑Compass protocol: identical text prompts, the same event‑aligned segmentation, and uniform inference settings (batch size, temperature, and frame rate).

**Table 5.** Main results on V2AV task. We report event-level fulfillment and quality, long-form consistency, global presentation, text-video alignment, and audio diagnostics for video continuation. The highest score in each dimension is boldfaced and highlighted in green.

Analysis and Findings

LongAV-Compass evaluates minute‑scale generation by aligning evaluation to event boundaries, exposing errors that short‑clip benchmarks miss.

Segments are defined by the script’s annotated events rather than fixed time windows, letting the evaluator focus on where the story actually changes.

How does task‑aligned segmentation differ from a simple fixed‑length window?

Fixed windows cut the video at arbitrary timestamps, potentially breaking an event in half; task‑aligned segmentation respects the script’s event annotations, so each segment corresponds to a complete semantic unit and transition clips are evaluated separately.
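The contrast between the two segmentation strategies can be sketched as follows. The code assumes event boundaries are given in seconds and that each transition clip spans one second on either side of a boundary; that half-width is an illustrative choice, not a value taken from the paper, and all function names are hypothetical.

```python
def event_segments(boundaries: list[float]) -> list[tuple[float, float]]:
    """Turn ordered boundaries [t0, t1, ..., tn] into (start, end) event segments."""
    return list(zip(boundaries, boundaries[1:]))

def transition_clips(boundaries: list[float],
                     half_width: float = 1.0) -> list[tuple[float, float]]:
    """One short clip centred on every interior boundary, clamped to the video span."""
    start, end = boundaries[0], boundaries[-1]
    return [(max(start, t - half_width), min(end, t + half_width))
            for t in boundaries[1:-1]]

def fixed_windows(duration: float, window: float) -> list[tuple[float, float]]:
    """Baseline: cut at fixed timestamps regardless of content."""
    cuts = []
    t = 0.0
    while t < duration:
        cuts.append((t, min(t + window, duration)))
        t += window
    return cuts

# Event times of a four-event, 60-second script (0-18, 18-35, 35-50, 50-60 s).
boundaries = [0, 18, 35, 50, 60]
print(event_segments(boundaries))
print(transition_clips(boundaries))
print(fixed_windows(60, 15))
```

With 15-second fixed windows, the cuts at 15, 30, and 45 seconds all fall inside an event, whereas the event-aligned segments keep each event whole and isolate the boundaries for separate transition scoring.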

Metrics are grouped by diagnostic dimension and reported per task, keeping comparisons fair while exposing task‑specific failure modes.

Proprietary models achieve the highest balanced scores across all difficulty levels, peaking at 75.2 in difficulty L2.

Table 6 shows proprietary families scoring 70.6, 75.2, 47.3, and 57.9 across L1‑L4 respectively, outpacing open‑source and agent‑based families.

**Table 6.** Per-difficulty analysis. Each entry reports the average balanced score for a model family under one difficulty level.

Across the four application scenarios, proprietary models retain a roughly 15‑point advantage over open‑source models.

Scenario‑level analysis shows proprietary scores consistently higher, e.g., a 14.8‑point lead in Brand Ads and similar gaps in Performance Ads, Content‑Creator, and Vlog.

**Figure 6.** Case study of event-aligned evaluation in LongAV-Compass. Using a Brand Ads case as an example, the upper row decomposes the generated video into ordered events and boundary clips for transition-stability assessment. The middle row illustrates event-level QA for measuring event fulfillment, and the bottom row summarizes full-video quality signals, including holistic presentation, video quality, and text-/image-video alignment.

**Figure 9. Event-count analysis.** Samples are grouped into short event chains ($\le 4$ events) and longer event chains ($>4$ events), and each bar reports the average balanced score for one model family.

Input Format Sensitivity

We evaluate how different conditioning formats affect long‑video generation across model families.

Across the four evaluated models, performance varies markedly with the conditioning format, and no single format dominates for all systems. Consequently, selecting the appropriate input formulation must be tailored to each model’s interface to maximize quality and temporal stability.

**Figure 7.** Capability profiles of proprietary models. Scores are min-max normalized per metric across the displayed models to highlight relative capability differences.

**Figure 8.** Capability profiles of open-source models. Scores are min-max normalized per metric across the displayed models to highlight relative capability differences.
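Per-metric min-max normalization, as used for the capability profiles, can be sketched in a few lines. The example assumes scores are stored as `scores[model][metric]`; the model names and numbers are placeholders for illustration only and are not results from the paper.

```python
def min_max_per_metric(
    scores: dict[str, dict[str, float]]
) -> dict[str, dict[str, float]]:
    """Rescale each metric so that it spans 0..1 across the displayed models."""
    metrics = next(iter(scores.values())).keys()
    out: dict[str, dict[str, float]] = {model: {} for model in scores}
    for metric in metrics:
        values = [scores[model][metric] for model in scores]
        lo, hi = min(values), max(values)
        for model in scores:
            # If every model ties on a metric the range is zero; map all to 0.0.
            out[model][metric] = (
                0.0 if hi == lo else (scores[model][metric] - lo) / (hi - lo)
            )
    return out

# Placeholder numbers, not benchmark results.
raw = {
    "model_a": {"VQ": 4.0, "AVS": 2.0},
    "model_b": {"VQ": 3.0, "AVS": 3.0},
    "model_c": {"VQ": 2.0, "AVS": 2.5},
}
profile = min_max_per_metric(raw)
```

Because the rescaling depends on which models are displayed, a normalized value of 0 marks the weakest model in that figure, not a zero raw score.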

Human Alignment

Human judgments closely match benchmark scores, confirming the metrics’ relevance.

Human judgments correlate strongly with benchmark scores across the three evaluated dimensions.

Figure 10 shows Spearman $\rho$ up to 0.935 for visual quality, with similarly high correlations for content fidelity and long‑video stability.

All three dimensions were aggregated using the same event‑level scoring pipeline as the benchmark, and the full set of task annotations, raw model JSON outputs, and evaluation traces will be released to enable exact replication.

**Figure 10. Human-alignment validation.** Each point denotes one model, with proprietary models shown as circles and open-source models shown as triangles. The three panels compare human-derived and benchmark-derived pairwise win rates for content fidelity, visual quality, and long-video stability.
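For readers who want to reproduce this kind of agreement check, Spearman's $\rho$ is the Pearson correlation of the two rank vectors. The sketch below implements it in plain Python with average ranks for ties; the win-rate lists are placeholders for five hypothetical models and are not data from the paper.

```python
def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied entries the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x: list[float], y: list[float]) -> float:
    """Pearson correlation computed on the rank vectors of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Placeholder pairwise win rates (human-derived vs. benchmark-derived).
human = [0.82, 0.64, 0.55, 0.31, 0.18]
bench = [0.70, 0.79, 0.48, 0.35, 0.20]
rho = spearman_rho(human, bench)
```

In this placeholder data the benchmark swaps only the top two models relative to the human ranking, so the rank correlation remains high.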

Case Studies

LongAV-Compass exposes errors that short benchmarks miss, illustrated by detailed case studies.

LongAV‑Compass evaluates generation over minute‑scale videos, revealing failure modes that short clips cannot surface. The following case studies exemplify the annotation depth, generation difficulty, and evaluation behavior of the benchmark.

**Figure 13.** Prompt template for I2AV image prior extraction.

C.1 T2AV case (Performance Ads Product Demonstration) targets a 60‑second skincare ad with two actors, four events, and a “water‑burst” product effect, classified as complexity L4.

The global description places Xiao Lin (the confident expert) and Xiao Ya (the troubled user) in a bright indoor setting, detailing a sequence of close‑ups, product handling, and a mist‑like visual effect that triggers a surprised smile.

Event 1 (0–18 s, 3 shots) shows Xiao Lin recommending the serum; Event 2 (18–35 s, 3 shots) demonstrates the “water‑burst” texture on her hand; Event 3 (35–50 s, 3 shots) captures Xiao Ya’s first application and reaction; Event 4 (50–60 s, 2 shots) ends with a product hero shot.

Identity tracking must keep Xiao Lin’s confident demeanor and Xiao Ya’s gentle personality consistent across all shots, while the product’s visual appearance remains unchanged.

Generation challenges include coordinating two distinct roles, rendering a realistic water‑burst effect, managing facial‑expression transitions, and preserving brand packaging throughout the sequence.

The QA checklist for Event 1 asks whether both women and the product are visible, whether the hand‑off occurs, and whether the bright indoor window setting is evident.

**Figure 15.** Dataset statistics of LongAV-Compass. (a) Sample count per task with language distribution below. (b) Category distribution across tasks. (c) Complexity level distribution (L1–L4) per task. (d) Events per sample (bar = mean, error bar = min/max range). (e) Shots per sample.

C.2 V2AV case (Content‑Creator Short Film Continuation) requires extending a reference video of a man running through a train station into a dramatic, fantasy‑laden narrative, also classified as L4.

The global description outlines a collision that scatters papers, an eye‑contact moment with a dark‑haired woman, a dream montage of romantic scenes, and a final departure on the platform.

Event 2 (8–13 s) captures the collision and paper scatter; Event 3 (13–17 s) records the eye contact; Event 4 (17–32 s) presents six rapid romantic sub‑scenes; Event 5 (32–40 s) shows the woman boarding the train; Event 6 (40–52 s) leaves the man alone; Event 7 (52–65 s) displays the title card.

Identity tracking must preserve facial and physical features of both subjects across all scenes, including the montage.

Generation challenges stem from maintaining continuity with the reference video, handling rapid fantasy cuts while preserving identity, and ensuring emotional transitions from surprise to love to loss.

C.3 I2AV case (Product Lifestyle Image to Video) starts from a flat‑lay photo of an Apple Watch on a wooden desk and asks the model to produce a 60‑second performance ad that respects the original composition.

The extracted image prior lists key objects (watch, band, notebook, pen, iPod), composition (top‑down angle, shallow depth of field), and lighting (warm left‑side illumination).

Physical constraints enforce consistent watch, band, and desk appearance; consistency constraints require the silver watch, gray‑white band, and warm lighting to persist throughout.

Generation challenges include exact visual style preservation, smooth transition from still to motion, fine‑grained UI animation on the watch screen, and a final return to the original composition.

C.4 Challenging cases highlight two especially difficult categories: a high‑event‑count product review (18 events) demanding consistent hand appearance, product colors, and background across rapid procedural steps; and a multi‑actor drama (13 events) requiring distinct costumes, facial features, and emotional arcs across varied camera angles.

Model Evaluation Details

We detail the model set, protocol, and diagnostic metrics used for V2AV evaluation.

We evaluate eleven video generation systems—grouped into proprietary, open‑source, and agent‑based categories—under a unified protocol that preserves each model’s native settings and enforces a 60‑to‑120 second output window.

Data Construction Details

Details of prompt templates and dataset statistics used to build LongAV‑Compass.

Section A.1 enumerates the four prompt pipelines—T2AV real‑video, T2AV LLM‑template, I2AV image‑conditioned, and V2AV continuation—that drive the benchmark construction.

The T2AV real‑video transcription prompt asks Gemini 3.1 Pro to watch a source video and output a JSON script that splits the clip into 2–4‑shot events, each with visual description, audio expectation, and a completion flag, while tracking recurring subjects and enforcing physical constraints.

Example JSON schema: { "language": "zh|en", "global_description": "<one‑sentence summary>", "events": [ { "event_id": "<id>", "time_range": "<start‑end>", "action": "<description>", "completion_flag": "<true/false>", "visual_description": "<detailed description>", "audio_expectation": "<expected audio content>", "shots": [ { "shot_id": "shot1", "time_range": "<start‑end>", "description": "<shot description>" } // additional shots ] } // additional events ] }
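A script in this format can be checked mechanically before it enters the benchmark. The sketch below is a hypothetical validator, not the authors' tooling: it uses the field names from the schema above and assumes that `time_range` is written as `start-end` in seconds with an ASCII hyphen, which the source does not specify.

```python
import json

REQUIRED_EVENT_KEYS = {
    "event_id", "time_range", "action", "completion_flag",
    "visual_description", "audio_expectation", "shots",
}

def parse_range(text: str) -> tuple[float, float]:
    """Parse 'start-end' in seconds (assumed format), e.g. '0-18'."""
    start, end = text.split("-")
    return float(start), float(end)

def validate_script(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the script passed."""
    script = json.loads(raw)
    problems = []
    if script.get("language") not in ("zh", "en"):
        problems.append("language must be 'zh' or 'en'")
    if not script.get("global_description"):
        problems.append("missing global_description")
    if not script.get("events"):
        problems.append("no events")
    for event in script.get("events", []):
        missing = REQUIRED_EVENT_KEYS - event.keys()
        if missing:
            problems.append(f"{event.get('event_id', '?')}: missing {sorted(missing)}")
            continue
        start, end = parse_range(event["time_range"])
        if end <= start:
            problems.append(f"{event['event_id']}: empty time range")
        if not 2 <= len(event["shots"]) <= 4:  # events are described as 2-4 shots
            problems.append(f"{event['event_id']}: expected 2-4 shots")
    return problems

# A made-up one-event script used only to exercise the validator.
sample = json.dumps({
    "language": "en",
    "global_description": "A chef prepares a salad.",
    "events": [{
        "event_id": "event1", "time_range": "0-18", "action": "chop vegetables",
        "completion_flag": True, "visual_description": "close-up of chopping",
        "audio_expectation": "knife on board",
        "shots": [
            {"shot_id": "shot1", "time_range": "0-9", "description": "wide shot"},
            {"shot_id": "shot2", "time_range": "9-18", "description": "close-up"},
        ],
    }],
})
print(validate_script(sample))
```

Returning a list of problems rather than raising on the first one makes it easier to review a whole generated script in a single pass.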

The T2AV LLM‑template generation prompt lets human designers specify scenario, complexity level (L1–L4), and language; Gemini 3.1 Pro then produces a full script following the same JSON format, inserting three fixed QA questions per event to verify subject presence, core action occurrence, and key visual detail correctness.

Its JSON skeleton mirrors the real‑video schema, differing only in the inclusion of a “complexity level” field and the QA list for each event.

The I2AV image‑conditioned generation pipeline first extracts a structured image prior (subjects, composition, lighting, motion potential) from a reference image, then feeds that prior into a second prompt that generates an event script anchored to the visual content.

Example prior JSON: { "image_summary": "<one‑sentence summary>", "detailed_visual_description": "<full visual description>", "subjects": [ ... ] }

For V2AV continuation construction, a 10–15 s reference video is supplied; Gemini 3.1 Pro watches it and emits a continuation script covering the remaining 45–50 s, providing separate video‑prompt and audio‑prompt fields that concatenate all event descriptions for downstream generation models.
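The concatenation step described above can be sketched as follows. This is an assumption-laden illustration: the output keys `video_prompt` and `audio_prompt` and the choice of joining descriptions with a single space are hypothetical, while `visual_description` and `audio_expectation` are the per-event fields from the script schema.

```python
def build_prompts(events: list[dict]) -> dict[str, str]:
    """Join per-event descriptions, in order, into one video prompt and one audio prompt."""
    video_prompt = " ".join(e["visual_description"] for e in events)
    audio_prompt = " ".join(e["audio_expectation"] for e in events)
    return {"video_prompt": video_prompt, "audio_prompt": audio_prompt}

# Made-up continuation events used only to show the shape of the output.
continuation = [
    {"visual_description": "A man collides with a stranger and papers scatter.",
     "audio_expectation": "Station ambience and a sudden rustle of paper."},
    {"visual_description": "The two lock eyes for a moment.",
     "audio_expectation": "Ambient noise fades under a soft musical swell."},
]
prompts = build_prompts(continuation)
```

Keeping the two prompts separate lets a downstream model that lacks native audio consume only the video prompt.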
