MMAE: A Massive Multitask Audio Editing Benchmark

Ziyang Ma, Ruiqi Yan, Ruiyang Xu, Jie Fang, Zhikang Niu, Yi-Wen Chao, Wenming Tu, Tianrui Wang, Auden, Qi Chen, Wenxi Chen, Jiaying Chi, Yanru Huo, Zixuan Jiang, Xiquan Li, Yalin Li, Junxi Liu, Minghao Liu, Binghao Qiang, Yijia Shan, Zheshu Song, Tian Tan, Zixiang Wang, Zeyu Xie, Zhifei Xie, Xiaoyu Xing, Qixiang Xu, Chen Yang, Guanrou Yang, Shan Yang, Yifan Yang, Steve Yves, Haotian Zhang, Haina Zhu, Kai Yu, Liefeng Bo, Eng-Siong Chng, Xie Chen

MMAE is a comprehensive benchmark for instruction-based audio editing that exposes the failure of current models to achieve reliable, high-fidelity edits.

How can we systematically evaluate general-purpose, instruction-based audio editing models across diverse modalities and task complexities?

Current audio editing models are evaluated using fragmented, domain-specific metrics that fail to capture the nuance of open-ended, instruction-based manipulation. This lack of a standardized testbed makes it impossible to diagnose why models struggle with complex tasks like multi-hop reasoning or mixed-modality editing. The authors introduce the Massive Multitask Audio Editing (MMAE) benchmark, which uses a rubric-based evaluation paradigm to decompose free-form editing tasks into thousands of verifiable, atomic criteria. This framework assesses both instruction following and context consistency across seven audio modalities and six levels of task complexity. Evaluation of leading models reveals that current systems are far from reliable, with Exact Match Rates consistently falling below 5% and dropping to 0% in complex, mixed-modality scenarios.

Paper Primer

MMAE addresses the "evaluation gap" by moving away from coarse signal-level metrics (like FAD or CLAP similarity) toward a structured, rubric-based approach. By treating each editing task as a series of independent, verifiable checkpoints, the benchmark provides an interpretable diagnostic tool for identifying where a model's cognitive pipeline—perception, reasoning, or generation—breaks down.

Current audio editing models lack structural robustness and fail to achieve flawless execution.

Across all evaluated models, the Exact Match Rate (EMR) remains below 5%, indicating that models rarely satisfy all criteria for a successful edit. EMR plummets to an absolute 0% in complex, mixed-modality scenarios.

There is a fundamental decoupling between average competency and perfect editing reliability.

Models that perform well on average metrics (Instruction Following Rate and Consistency Rate) do not necessarily achieve higher EMR, suggesting that current systems act as "mean-seeking" generalists that frequently introduce minor errors. Top-performing models achieve only ~45–50% on average instruction following, far below the threshold for production-ready editing.

Why is a rubric-based paradigm necessary for audio editing compared to traditional automated metrics?

Traditional metrics like signal-to-noise ratios or generic similarity scores cannot distinguish between successful instruction execution and the preservation of unrelated audio context. Rubrics decompose these multifaceted tasks into atomic, verifiable properties, allowing for objective assessment of both accuracy and fidelity.
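To make the three reported metrics concrete, here is a minimal sketch of how per-rubric verdicts could be rolled up into Instruction Following Rate (IFR), Consistency Rate (CR), and Exact Match Rate (EMR). The `RubricResult` type, the function names, and the choice to micro-average over rubric items are assumptions for illustration; the source only states that EMR requires a sample to satisfy all of its criteria.

```python
from dataclasses import dataclass

@dataclass
class RubricResult:
    category: str   # "IF" (instruction following) or "Con" (consistency)
    passed: bool    # True if the judge picked the annotated correct option

def rate(results, category):
    """Fraction of rubric items in `category` that passed (None if no items)."""
    items = [r for r in results if r.category == category]
    return sum(r.passed for r in items) / len(items) if items else None

def exact_match(results):
    """A sample is an exact match only if every rubric item passed."""
    return all(r.passed for r in results)

def summarize(samples):
    """Aggregate over samples; micro-averaging over items is an assumption."""
    flat = [r for s in samples for r in s]
    return {
        "IFR": rate(flat, "IF"),
        "CR": rate(flat, "Con"),
        "EMR": sum(exact_match(s) for s in samples) / len(samples),
    }

# Two toy samples: the second fails a single IF item.
samples = [
    [RubricResult("IF", True), RubricResult("Con", True)],
    [RubricResult("IF", True), RubricResult("IF", False), RubricResult("Con", True)],
]
print(summarize(samples))
```

In this toy input, one failed item out of five leaves IFR and CR fairly high while halving EMR, which is the kind of gap between average competency and flawless execution that the benchmark reports.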

Does using an external agentic planner improve model performance on these complex tasks?

No; the authors found that incorporating external planners yields no consistent improvement. The bottleneck lies in the base model's inability to perform precise multimodal perception and the accumulation of artifacts during iterative generation steps.

The MMAE Benchmark Overview

We introduce MMAE, a unified benchmark that standardizes instruction‑based audio editing evaluation.

Current evaluation for instruction‑based audio editing is fragmented, with each sub‑domain using its own small test set. This hampers progress because improvements cannot be compared across tasks. MMAE fills the gap by offering a single, massive benchmark that covers a wide range of editing scenarios.

MMAE is a massive multitask benchmark that aggregates 2,000 high-fidelity audio clips across seven modalities and defines a six-level complexity taxonomy for systematic evaluation.

How does MMAE differ from earlier audio‑editing benchmarks?

Earlier benchmarks typically target a single modality or a narrow set of operations, and they provide only a binary pass/fail score. MMAE expands coverage to seven modalities, defines a rich taxonomy of complexity, granularity, and operation types, and uses a rubric‑based scoring system that yields multiple diagnostic metrics rather than a single accuracy number.

**Figure 1.** Examples from the MMAE benchmark, illustrating the overall taxonomy and the proposed rubric-based evaluation paradigm. These examples span diverse modalities, complexity, and operation types, with a subset of associated rubrics shown for clarity.

The examples in Figure 1 demonstrate the benchmark’s breadth: from modifying the resonant frequency of a single glass hit, to enhancing speech over background music, to swapping lyrics and vocal timbre, to removing audience cheers, to reordering spoken author names across multiple rounds, and finally to adding seagull sounds with a soothing ocean‑wave backdrop.

MMAE provides a standardized testbed for instruction‑based audio editing.

The Shift to Interactive Audio Editing

Audio editing now demands unified, rubric‑driven evaluation to match its rapid model advances.

Instruction-based audio editing lets a user give a natural-language command to modify an audio clip (changing speech, music, or sound effects), while the model must understand the instruction and produce the edited audio.

Recent breakthroughs in image and video editing have turned generative models into interactive tools, and the audio community is now witnessing a surge of instruction‑based editing models that promise similar flexibility for speech, music, and sound effects.

Despite this progress, the evaluation infrastructure for audio editing lags severely; a robust framework must improve on two fronts: broader data coverage and a more expressive evaluation paradigm.

Existing benchmarks are highly fragmented, often confined to a single subdomain (e.g., speech‑only) or to basic operations such as addition or removal, leaving a gap in comprehensive assessment across seven modalities, six complexity levels, two granularity scales, and eight operation categories.

Empirical evaluation on the MMAE benchmark shows that current models achieve an Exact Match Rate below 5% overall and drop to 0% on complex, mixed‑modality tasks, revealing critical shortcomings in instruction execution, context preservation, and structural robustness.

Prior Approaches to Audio Editing

Survey of prior audio editing models and fragmented evaluation practices.

Early work such as AUDIT introduced latent diffusion for text‑guided sound‑effect editing, while later systems like AudioEditor, AudioMorphix, MMEdit, SmartDJ, VoiceCraft, CosyEdit, Step‑Audio‑EditX, Ming‑UniAudio, Audio‑Omni, AudioChat, InstructAV2AV, and SpongeBob expanded capabilities across modalities and tasks. Evaluation has remained fragmented, with speech‑focused test sets like RealEdit using WER and speaker similarity, and domain‑specific benchmarks such as Ming‑Freeform‑Audio‑Edit, Step‑Audio‑Edit‑Benchmark, and StoryGen‑Eval relying on signal‑level metrics (FAD, LSD, CLAP) or MOS ratings. MMAE addresses this gap by providing a unified, rubric‑based benchmark covering sound, music, speech, and mixed audio, enabling consistent measurement of instruction following and content consistency.

Benchmark Composition and Statistics

The benchmark presents a balanced mix of sound, music, and speech modalities.

MMAE allocates a substantially larger share to mixed‑modality audio than to any single‑modality class.

Mix accounts for 36.2% of samples, while the largest single‑modality class (Sound, Music, or Speech) is 21.3%.

**Figure 2.** Distribution of the MMAE benchmark across three taxonomy dimensions.

The benchmark covers a balanced mix of sound, music, and speech modalities.

The Rubric-Based Evaluation Paradigm

Defines the rubric‑based evaluation framework for instruction‑based audio editing.

Evaluating instruction‑based audio editing requires a framework that measures both edit correctness and generation quality.

The framework scores each edited sample by posing a set of independent multiple-choice rubric questions and letting a high-performance multimodal large language model (MLLM) select the answer to each.

How does this rubric‑based approach differ from using a standard automatic metric such as BLEU?

BLEU aggregates token‑level n‑gram overlap into a single score and cannot isolate specific editing failures. The rubric paradigm asks targeted multiple‑choice questions, so each failure (e.g., wrong duration or unintended timbre change) is scored separately, giving a diagnostic view that BLEU cannot provide.
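The per-question scoring described above can be sketched as a simple loop. The `Judge` interface, the dictionary layout, and the example questions are hypothetical; a real judge would also receive the input and output audio, which is omitted here.

```python
from typing import Callable, Dict, List

# Hypothetical judge interface: takes (question, options) and returns the
# index of the option it selects. Audio inputs are omitted in this sketch.
Judge = Callable[[str, List[str]], int]

def score_sample(rubrics: List[Dict], judge: Judge) -> List[bool]:
    """Ask every rubric independently and record whether the judge's choice
    matches the annotated correct option."""
    verdicts = []
    for item in rubrics:
        choice = judge(item["question"], item["options"])
        verdicts.append(choice == item["answer"])
    return verdicts

# Illustrative rubric items (not taken from the benchmark).
rubrics = [
    {"question": "Was the target sentence removed from the output?",
     "options": ["Yes", "Partially", "No", "None of the above"],
     "answer": 0},
    {"question": "How does the background timbre compare with the input?",
     "options": ["Unchanged", "Slightly altered", "Clearly different",
                 "None of the above"],
     "answer": 0},
]

def stub_judge(question, options):
    # Stand-in for the MLLM: pretends the timbre was slightly altered.
    return 1 if "timbre" in question else 0

verdicts = score_sample(rubrics, stub_judge)
print(verdicts)  # [True, False]
```

Because each item is judged on its own, the failed timbre check is visible separately from the successful removal, rather than being folded into a single score.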

**Table 1.** Statistics of the dataset.

Constructing the MMAE Benchmark

We describe the five‑stage pipeline that builds the MMAE benchmark.

The pipeline converts raw audio editing ideas into a vetted benchmark by iterating through idea generation, taxonomy design, instruction‑driven collection, rubric creation, and strict quality checks.

How does the “dynamic balancing” in data collection differ from simply sampling tasks at random?

Dynamic balancing tracks the current counts of examples per modality, operation, and complexity; when a dimension is under‑represented the system preferentially selects new clips that fill that gap. Random sampling would ignore these counts, leading to skewed distributions where common tasks dominate.

Brainstorming – expert annotators run multiple rounds to propose diverse audio editing scenarios, covering many modalities and difficulty levels.

Taxonomy & Paradigm Construction – the proposals are organized into an orthogonal taxonomy (modality, operation, complexity) and a rubric‑based evaluation framework is defined.

Instruction‑Centric Data Collection – annotators retrieve raw audio from online videos, trim clips, write natural‑language instructions, and label metadata; a dynamic balancing algorithm ensures even coverage across the three taxonomy axes.

Rubrics Annotation – an Omni‑Detective agent extracts detailed captions, an LLM drafts rubric items, and human annotators refine them; a second LLM normalizes the final rubric expressions.

Quality Inspection – blind inspectors independently review each item; failing items are sent back for correction or discarded, guaranteeing high‑fidelity benchmark data.

As a worked example, an annotator searches online, downloads a 10-second speech clip, and trims it to a 5-second input segment.

The annotator writes the instruction “Remove the spoken sentence about weather” and tags the item with modality=speech, operation=removal, complexity=medium.

The dynamic balancer flags this request as high priority because the “speech‑removal‑medium” count is low, so the system queues the item for immediate rubric generation.

Omni‑Detective produces a detailed caption; the LLM drafts a rubric with items “Did the model correctly delete the target sentence?” and “Is the remaining audio natural‑sounding?”.

Human annotators refine the rubric, then a blind inspector reviews the completed item; it passes and is added to the benchmark.

Balancing ensures that even niche combinations receive sufficient examples, preventing the benchmark from being dominated by easy or popular tasks.
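One way to picture dynamic balancing is as a running count per taxonomy cell, with the least-filled cell chosen as the next collection target. This is a sketch under stated assumptions: the function name, the axis values, and the "pick the minimum" rule are illustrative, and the source does not specify the actual selection algorithm.

```python
from collections import Counter
from itertools import product

def next_target(counts, modalities, operations, complexities):
    """Return the (modality, operation, complexity) cell with the fewest
    examples so far; ties go to the first cell in iteration order."""
    cells = product(modalities, operations, complexities)
    return min(cells, key=lambda cell: counts[cell])

# Illustrative axis values only; the benchmark's full taxonomy is larger.
modalities = ["speech", "music"]
operations = ["addition", "removal"]
complexities = ["low", "medium"]

counts = Counter()
for cell in product(modalities, operations, complexities):
    counts[cell] = 3
counts[("speech", "removal", "medium")] = 1  # under-represented cell

target = next_target(counts, modalities, operations, complexities)
print(target)  # ('speech', 'removal', 'medium')
counts[target] += 1  # record the newly collected example
```

Random sampling would ignore `counts` entirely, so nothing would steer collection toward the sparse cell.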

**Figure 3.** A comprehensive data curation pipeline of the MMAE benchmark. The process includes: (1) expert-driven brainstorming to collect diverse audio editing scenarios; (2) taxonomy and paradigm construction, establishing the multi-dimensional task taxonomy and the rubric-based evaluation framework; (3) instruction-centric data collection with dynamic balancing across taxonomy dimensions; (4) human-agent collaborative annotation with automated rubric generation and human refinement; and (5) iterative quality inspection with revision and filtering to ensure data quality.

**Figure 4.** A snapshot of the platform used for data annotation and quality inspection.

Benchmarking Audio Editing Models

We report MMAE benchmark results for five audio editing models, exposing large modality and complexity gaps.

We evaluate five recent end‑to‑end audio editing systems on the MMAE benchmark and compare them against two simple baselines. The evaluation uses a rubric‑based MLLM judge to measure instruction following, consistency, and exact match.

Step‑Audio‑EditX is a single‑pass model that maps an instruction directly to an edited audio waveform.

How does Step‑Audio‑EditX differ from pipelines that decompose instructions?

It treats the entire command as a single transformation, so it cannot isolate atomic operations; this makes complex multi‑category edits harder to handle.

Ming‑UniAudio is a unified model that jointly learns audio representation and editing across multiple modalities.

Why does Ming‑UniAudio underperform on multi‑category edits?

Its training emphasized single‑category examples, so the model has limited capacity to capture the interactions required for mixed‑modality instructions.

MMEdit is a multitask audio editor constrained to short inputs (≤10 s) for efficiency.

What limitation arises from MMEdit’s ≤10 s input restriction?

It cannot attend to longer temporal structures, so its scores on the full benchmark are not directly comparable to models that process the entire audio.

Audio‑Omni is a modality‑agnostic encoder that excels on short clips, achieving the highest instruction‑following rates among the restricted models.

Why does Audio‑Omni achieve the highest IFR despite the short‑duration constraint?

Its modality‑agnostic design efficiently extracts relevant cues from brief audio, allowing it to follow instructions more accurately within the limited context.

SmartDJ is an audio editing system that can be paired with an external planner to decompose complex commands into sequential atomic edits.

What benefit does the external planner provide for SmartDJ?

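The planner is intended to break a complex instruction into a chain of atomic edits that the base model handles one step at a time. In practice, the authors report no consistent improvement from external planners, and SmartDJ without the planner attains the highest EMR in Table 2.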
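A planner-driven pipeline of this kind can be sketched as a loop over atomic steps. The `plan` and `edit` callables are hypothetical stand-ins for a planner and an editing model, not SmartDJ's actual interface. The loop also shows why iterative generation can accumulate artifacts: each step consumes the previous step's output.

```python
from typing import Callable, List

def edit_with_planner(audio, instruction: str,
                      plan: Callable[[str], List[str]],
                      edit: Callable[[object, str], object]):
    """Apply planner-produced atomic steps one after another.
    Each step consumes the previous step's output, so anything
    introduced early is carried into every later step."""
    for step in plan(instruction):
        audio = edit(audio, step)
    return audio

# Stubs standing in for a real planner and editing model (both hypothetical).
def toy_plan(instruction):
    return [part.strip() for part in instruction.split(" and ")]

def toy_edit(audio, step):
    return audio + [step]  # record the applied step instead of editing audio

result = edit_with_planner(
    [], "remove the cheers and add seagull sounds", toy_plan, toy_edit
)
print(result)  # ['remove the cheers', 'add seagull sounds']
```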

Even the strongest model (SmartDJ w/o planner) attains only 5.56% Exact Match Rate on the ≤10-second subset it is evaluated on, underscoring how difficult the MMAE tasks are.

Table 2 shows SmartDJ w/o planner achieving 5.56% EMR, the highest among all evaluated systems.

**Table 2.** Main results on the MMAE benchmark. (a) Performance grouped by complexity, reporting scores for single and multiple categories, along with the overall score. (b)(c) Performance breakdown across different modalities, with scores reported within each category. IFR = Instruction Following Rate, CR = Consistency Rate, EMR = Exact Match Rate. The best results are presented in bold. "Identity" denotes directly returning the input without modification, while "Noise" denotes generating pure noise. Results under these settings are reported as baselines for reference. *MMEdit, Audio-Omni, and SmartDJ are either limited to inputs of at most 10 seconds or trained solely on data with durations ≤ 10 seconds. Accordingly, we evaluate these models only on samples with duration ≤ 10 seconds (801 samples).
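The duration-restricted evaluation and the two reference baselines from the Table 2 caption can be sketched as follows. The sample dictionary layout and function names are assumptions, and the Gaussian distribution is a guess, since the source only says "pure noise".

```python
import random

MAX_SECONDS = 10.0  # duration limit stated in the Table 2 footnote

def short_subset(samples):
    """Keep only samples whose duration is at most MAX_SECONDS."""
    return [s for s in samples if s["duration"] <= MAX_SECONDS]

def identity_baseline(waveform):
    """Return the input unchanged (a copy, so callers cannot alias it)."""
    return list(waveform)

def noise_baseline(waveform, seed=0):
    """Return noise with the same number of samples as the input.
    Gaussian noise is an assumption, not the paper's stated choice."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in waveform]

# Toy samples with a hypothetical layout.
samples = [
    {"id": "a", "duration": 4.2, "waveform": [0.0, 0.1, -0.1]},
    {"id": "b", "duration": 12.5, "waveform": [0.2, 0.0, 0.3, 0.1]},
]
subset = short_subset(samples)
print([s["id"] for s in subset])  # ['a']
```

Intuitively, returning the input should preserve context while following no instruction, and noise should do neither; the sketch makes no claim about the scores these baselines actually obtain.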

Current models show significant performance gaps across different modalities and task complexities.

Performance Analysis and Insights

Discussion reveals performance drops with task complexity and a persistent IFR‑CR trade‑off.

Task complexity sharply reduces instruction‑following performance.

Audio‑Omni’s IFR falls from 58.43% on single‑modality tasks to 41.70% on mixed‑modality tasks, a drop of 16.73 percentage points.
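For scale, the same two figures can also be read as a relative drop. The calculation below uses only the IFR values quoted above; the symbols are introduced here for illustration.

```latex
\[
\Delta_{\text{abs}} = 58.43 - 41.70 = 16.73 \text{ percentage points},
\qquad
\Delta_{\text{rel}} = \frac{16.73}{58.43} \approx 28.6\%
\]
```

In other words, the model loses a little over a quarter of its single-modality instruction-following rate when moving to mixed-modality tasks.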

**Table 1.** Evaluation results of different models on Sound, Music, and Speech datasets using IFR, CR, and EMR metrics.

Representative Benchmark Samples

Representative MMAE cases illustrate task diversity and rubric granularity.

The Massive Multitask Audio Editing (MMAE) benchmark showcases four distinct instruction‑based audio editing scenarios, each paired with a fine‑grained rubric that probes content, timbre, and quality.

Cases 1‑4 span music, speech, and sound domains, and each rubric contains multiple IF/Con. questions targeting specific audio attributes.

Across all cases the evaluation protocol remains constant: the same Multimodal Large Language Model (MLLM) judge applies the rubric, and the instruction‑based edits are performed on the MMAE benchmark inputs.

**Table.** Evaluation of audio output consistency and content accuracy.

The table presents a rubric for evaluating audio output quality and content consistency across six specific test cases (numbered 8 through 13). Each row includes a category ("Con."), a specific question comparing audio output segments to input segments, and a set of response options marked with green checkmarks for positive outcomes and red crosses for negative or alternative outcomes.

**Table.** Rubric with two evaluation criteria for audio output, focusing on the detection of dog barking sounds. Columns: "#", "Category", and "Rubric".

Consistency Rating Rubrics

We expand the MMAE benchmark with dozens of new rubric items covering consistency and instruction‑following.

The MMAE benchmark now defines 42 detailed rubric items that span both Instruction‑Following (IF) and Consistency (Con) evaluation.

Counts of individual questions across case descriptions and tables 9‑14 sum to 42.

All items share a uniform multiple‑choice format (one correct answer, two distractors, and a “None of the above” option), ensuring that the evaluation protocol remains constant across the expanded set.
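The uniform four-option format lends itself to a small validated data structure. The `RubricItem` class, its field names, and the example question are hypothetical; only the option layout (one correct answer, two distractors, "None of the above") comes from the description above.

```python
from dataclasses import dataclass
from typing import List

NONE_OPTION = "None of the above"

@dataclass
class RubricItem:
    category: str            # "IF" or "Con"
    question: str
    correct: str
    distractors: List[str]

    def __post_init__(self):
        if self.category not in ("IF", "Con"):
            raise ValueError("category must be 'IF' or 'Con'")
        if len(self.distractors) != 2:
            raise ValueError("expected exactly two distractors")

    def options(self) -> List[str]:
        """Correct answer, two distractors, then the catch-all option.
        A real setup would likely shuffle the first three; kept fixed here."""
        return [self.correct, *self.distractors, NONE_OPTION]

    def render(self) -> str:
        """Format the item as a lettered multiple-choice question."""
        lines = [f"Q: {self.question}"]
        lines += [f"{chr(ord('A') + i)}. {opt}"
                  for i, opt in enumerate(self.options())]
        return "\n".join(lines)

# Illustrative item, not taken from the benchmark.
item = RubricItem(
    category="Con",
    question="Compared with the input, is the accompaniment preserved?",
    correct="Yes, it is essentially unchanged",
    distractors=["It is partly replaced", "It is entirely different"],
)
print(item.render())
```

Validating the shape at construction time keeps malformed items from reaching the judge, which matters when the same protocol is applied across every rubric.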

**Table.** Evaluation rubric for audio output quality and content consistency.

**Table 1.** Evaluation rubric for comparing audio segments regarding Standard Mandarin pronunciation.

**Table.** Evaluation Rubric for Audio Output Comparison.

**Table.** Evaluation rubric for audio output comparison.

**Table.** Rubric items 1 and 2 for audio comparison tasks, both categorized as "IF". Columns: "#", "Category", and "Rubric". Each row presents a question (Q) followed by multiple-choice options, with the correct or selected answer marked by a green checkmark and the incorrect options by red crosses.

**Table.** Evaluation Rubric for Audio Comparison Tasks. The table lists evaluation criteria for comparing audio inputs and outputs, categorized by "#" (3-9) and "Category" (IF or Con.). Each entry includes a specific question (Q) regarding audio characteristics such as pitch, speech recognition, ending structure, beat transitions, build-up approaches, and timing, followed by a set of multiple-choice options marked with checkmarks (correct) or crosses (incorrect).

System and User Evaluation Prompts

Provides the exact prompts used to query the MLLM judge for rubric‑based audio evaluation.

This section spells out the exact prompt sent to the external MLLM judge (Qwen3‑Omni) that scores audio edits against the rubric.

System prompt for the audio‑analysis assistant.

User prompt template sent to the MLLM judge.

**Table.** Rubric for evaluating audio output quality and content consistency. Columns: "#" (index), "Category" (IF or Con.), and "Rubric" (specific questions and multiple-choice options).

- **Row 1:** Category IF, question regarding lyrics sung by the lead vocalist.
- **Row 2:** Category IF, question regarding the timbre of the human voice.
- **Row 3:** Category Con., question regarding the similarity of the overall accompaniment.
- **Row 4:** Category Con., question regarding the consistency of melody and rhythm.
- **Row 5:** Category Con., question regarding audio quality degradation.

The table presents a rubric for evaluating audio output quality across three criteria (items 10, 11, and 12), categorized under "Con." Each item includes a specific question comparing an input and output audio file, followed by a set of response options marked with checkmarks or crosses.

**Table.** Rubric with four rows, each a question (Q) about the audio output, all categorized as "IF". Columns: "#", "Category", and "Rubric". Each question is followed by multiple-choice options, with the selected answer marked by a green checkmark and the unselected options by a red cross.

The data curation platform recorded all rubric annotations, performed multi‑stage review, and ensured versioned provenance for every judgment.
