Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

Jiaqi Tang, Jianmin Chen, Youyang Zhai, Wei Wei, Runtao Liu, Mengjie Zhao, Xiangyu Wu, Qingfa Xiao, Qifeng Chen

Robust-U1 equips MLLMs with explicit visual self-recovery to restore corrupted images before reasoning.

Can MLLMs be trained to self-recover corrupted visual inputs into clean representations to improve downstream reasoning robustness?

Multimodal Large Language Models (MLLMs) struggle with real-world visual corruptions like noise or compression, as existing methods either rely on opaque feature alignment or text-only reasoning that fails to restore lost pixel-level details. Robust-U1 introduces an explicit self-recovery module that reconstructs a clean version of the input image before performing multimodal reasoning. The model learns to generate a restored image through a three-stage pipeline: supervised fine-tuning, reinforcement learning with dual pixel-semantic rewards, and joint reasoning over both the corrupted and recovered visuals. On the R-Bench benchmark, Robust-U1 achieves state-of-the-art robustness, significantly outperforming both general-purpose MLLMs and prior robust-specific architectures across all corruption intensities.

Paper Primer

The core mechanism hinges on a three-stage training pipeline that transforms a unified MLLM into a self-correcting system. The model first learns to invert the corruption process via supervised fine-tuning, then refines the output using reinforcement learning with dual rewards: a structural similarity index (SSIM) for pixel-level fidelity and a CLIP-based similarity score for semantic consistency.

The final reasoning stage is the critical move: the model is trained to process an interleaved sequence of the original corrupted image and the newly generated recovered image. This joint input allows the model to leverage the restored visual content for primary understanding while retaining the original context to resolve ambiguities.

Robust-U1 achieves state-of-the-art robustness on real-world corruption benchmarks.

On the R-Bench benchmark, the model consistently outperforms both general MLLMs and specialized robust baselines across MCQ, VQA, and captioning tasks. The performance advantage increases with corruption severity, with the model maintaining superior accuracy where prior methods fail.

Visual self-recovery is the primary driver of performance gains.

Ablation studies show that adding self-recovery supervision yields a 0.0853 improvement in R-Bench scores, significantly higher than the 0.0429 gain from adding chain-of-thought supervision alone. The full pipeline achieves an overall R-Bench score of 0.7398, compared to 0.5770 for the base BAGEL model.

Why is this approach superior to simply using an external image restoration tool?

External restoration models are often optimized for perceptual quality rather than downstream reasoning, and they struggle with unknown or compound corruptions. Robust-U1’s internal recovery is task-aligned, meaning the model learns to reconstruct exactly the visual features necessary for accurate multimodal reasoning.

Does the model always need to perform recovery, even on clean images?

The current framework performs recovery as a standard part of the pipeline. While this incurs higher latency, the authors note that a "detect-then-recover" variant is a viable alternative for production environments where latency budgets are tight.

Robust-U1 demonstrates that explicit visual reconstruction is a more effective strategy for MLLM robustness than implicit feature alignment or text-based compensation. Researchers should prioritize task-aligned generative recovery when building models for high-stakes visual understanding.

Abstract

Robust‑U1 adds visual self‑recovery to MLLMs for stronger robustness.

Robust‑U1 introduces a three‑stage pipeline that equips MLLMs with explicit visual self‑recovery. It first learns reconstruction via supervised fine‑tuning, then refines it with reinforcement learning that balances pixel‑level SSIM and semantic‑level CLIP similarity, and finally performs multimodal reasoning over both the corrupted and recovered views.

The Robustness Gap in MLLMs

We expose MLLMs' vulnerability to visual corruptions and motivate a self‑recovery solution.

Multimodal Large Language Models (MLLMs) excel at aligning visual and textual cues, yet they collapse when faced with real‑world visual corruptions such as sensor noise, compression artifacts, or adverse weather. Existing robustness strategies either hide the corruption behind black‑box feature alignment or rely on white‑box textual reasoning, both of which ignore the pixel‑level information needed for faithful perception. This motivates the central question: can an MLLM internally recover corrupted visual content before reasoning?

When an image is degraded by noise, compression, or weather effects, the visual encoder extracts distorted features, causing the downstream language model to misinterpret the scene.

**Figure 1.** Comparison of robustness enhancement paradigms. (A) **Implicit Adaptation**: Black-box feature alignment within the visual encoder. (B) **Text-based Reasoning**: White-box textual chain describing corruption impacts. (C) **Our Robust-U1 (Self-Recovering)**: Direct visual self-recovery and multimodal reasoning over both corrupted and recovered images.

MLLMs fail under real‑world visual noise, necessitating a recovery‑first approach.

Related Work

Survey of prior approaches to visual robustness in multimodal models.

Multimodal large language models (MLLMs) remain vulnerable to visual corruptions, which often cause severe drops in perception performance. This has made robustness a central research focus, and existing work falls into two broad strategies: implicit alignment and text‑based reasoning.

Implicit alignment methods such as TeCoA, Robust LLaVA, and Robust CLIP fine‑tune the visual encoder with adversarial examples to resist localized distortions, but their reliance on limited adversarial datasets hampers generalization.

Text‑based reasoning approaches, exemplified by Robust‑R1, augment interpretability by appending explicit textual descriptions of corruption types. More recent efforts (Zheng et al., 2025) incorporate visual cues into the reasoning chain, and multi‑agent extensions further coordinate specialized agents for long‑form video tasks; Thinking with Generated Images explores generating auxiliary visual representations to aid inference.

Self‑Recovery denotes a model’s ability to reconstruct a corrupted image internally before answering questions, thereby providing a cleaner visual signal for downstream reasoning.

The Robust-U1 Framework

Three-stage Robust‑U1 pipeline equips an MLLM with visual self‑recovery and dual‑reward refinement.

The standard multimodal pipeline feeds a clean image $I_o$ and a textual query $Q$ into a pretrained MLLM, producing answer $A_o$.

In practice images arrive corrupted, which we denote $I_c = D(I_o)$; the degradation function $D$ captures blur, noise, compression, etc., and degrades the baseline performance.
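As a rough illustration of what a degradation function $D$ might look like, the sketch below applies additive Gaussian noise to a flat list of pixel intensities. The name `degrade`, the `severity` parameter, and the noise scale of 50 intensity units are assumptions made for this example only; the corruptions discussed in the paper (blur, compression, weather) are more varied than this.

```python
import random

def degrade(pixels, severity, seed=0):
    """Toy stand-in for D: add zero-mean Gaussian noise, then clip to [0, 255].

    severity in [0, 1] scales the noise standard deviation (assumed scale: 50).
    """
    rng = random.Random(seed)          # seeded so the example is repeatable
    sigma = 50.0 * severity
    noisy = (p + rng.gauss(0.0, sigma) for p in pixels)
    return [min(255.0, max(0.0, v)) for v in noisy]

clean_image = [120.0, 130.0, 125.0, 135.0]      # I_o as a flat 2 x 2 image
corrupted_image = degrade(clean_image, 0.5)     # I_c = D(I_o)
unchanged_image = degrade(clean_image, 0.0)     # zero severity leaves I_o intact
```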

We split robustness into three explicit stages: (1) supervised fine‑tuning teaches the model to invert the corruption, (2) reinforcement learning refines the inversion with pixel‑ and semantic‑level rewards, and (3) multimodal reasoning consumes both the corrupted and recovered views to answer questions.

**Figure 2.** Overview of the three-stage Robust-U1 framework. **Stage I:** Supervised Fine-Tuning adapts the unified MLLM to recover clean images from corrupted inputs using a rectified-flow loss. **Stage II:** Reinforcement Learning with dual rewards further enhances the quality of the recovered images via Flow-GRPO (Liu et al., 2025b). **Stage III:** Multimodal Reasoning trains the model to answer questions by jointly analyzing both the corrupted and the recovered images, leading to robust understanding.

Why not train the model end‑to‑end on corrupted inputs and answers directly?

End‑to‑end training would conflate two distinct objectives, image restoration and multimodal reasoning, making it hard for the optimizer to balance structural fidelity against downstream answer quality. By separating them, each stage can use a loss tailored to its goal (rectified‑flow for inversion, dual‑reward for visual fidelity, language modeling for reasoning).

We reward the recovered image both for low‑level structural similarity (SSIM) and for high‑level semantic alignment (CLIP cosine similarity), ensuring that restoration preserves fine details and overall meaning.

Why combine SSIM and CLIP similarity instead of using only one?

SSIM captures low‑level texture and edge fidelity but is blind to semantic content; CLIP similarity ensures the recovered image conveys the same high‑level concepts. Using both prevents a model from over‑optimizing pixel statistics at the expense of meaning.

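Worked example (pixel‑level reward). Take a 2 × 2 recovered image and its 2 × 2 clean reference, and treat each pixel as its own patch. The patch means are then $\mu_r = (118, 132, 124, 138)$ and $\mu_o = (120, 130, 125, 135)$.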

Since each patch is a single pixel, $\sigma_r = \sigma_o = 0$ and the contrast/structure terms reduce to 1; SSIM per patch simplifies to $l = \frac{2\mu_r\mu_o + C_1}{\mu_r^2 + \mu_o^2 + C_1}$ with $C_1=0.01$.
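For reference, this reduction follows from the standard two-factor form of SSIM, written here for a recovered patch $r$ and a clean patch $o$. The constant $C_2$ is the usual contrast stabilizer; the text above does not give it a value, and none is needed for this example:

$$
\text{SSIM}(r, o) = \underbrace{\frac{2\mu_r\mu_o + C_1}{\mu_r^2 + \mu_o^2 + C_1}}_{l(r,\,o)} \cdot \frac{2\sigma_{ro} + C_2}{\sigma_r^2 + \sigma_o^2 + C_2}
$$

With single-pixel patches, $\sigma_r = \sigma_o = \sigma_{ro} = 0$, so the second factor becomes $C_2 / C_2 = 1$ and only the luminance term $l$ remains.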

Plugging in the numbers yields $l_1 \approx 0.99986$, $l_2 \approx 0.99988$, $l_3 \approx 0.99997$, $l_4 \approx 0.99976$. Averaging gives $R_{\text{pix}} \approx 0.9999$.

Pass the two 2 × 2 images through a frozen CLIP encoder (treated as a black box that returns 3‑dim vectors). Assume the embeddings are $[0.2,0.1,0.4]$ for the clean image and $[0.19,0.11,0.39]$ for the recovered image. Their cosine similarity is ≈ 0.9995.

With $\alpha=5$, $R_{\text{sem}} = \exp(-5\,(1-0.9995)) \approx 0.998$.

Even tiny pixel deviations can keep SSIM near 1, but the exponential semantic reward penalizes any semantic drift more sharply, ensuring both low‑level fidelity and high‑level meaning are preserved.
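The arithmetic of this worked example can be checked with a few lines of plain Python. This is a minimal sketch that takes the pixel values, embeddings, $C_1 = 0.01$, and $\alpha = 5$ from the example above; the function names are illustrative and are not taken from the paper's code.

```python
import math

def luminance_ssim(mu_r, mu_o, c1=0.01):
    """Luminance-only SSIM term, valid when the patch variances are zero."""
    return (2 * mu_r * mu_o + c1) / (mu_r ** 2 + mu_o ** 2 + c1)

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_reward(sim, alpha=5.0):
    """Exponential semantic reward exp(-alpha * (1 - sim))."""
    return math.exp(-alpha * (1.0 - sim))

recovered = [118, 132, 124, 138]   # single-pixel patches of the recovered image
clean = [120, 130, 125, 135]       # single-pixel patches of the clean image

per_patch = [luminance_ssim(r, o) for r, o in zip(recovered, clean)]
r_pix = sum(per_patch) / len(per_patch)          # mean over the four patches

sim = cosine_similarity([0.2, 0.1, 0.4], [0.19, 0.11, 0.39])
r_sem = semantic_reward(sim)

print(f"R_pix = {r_pix:.5f}")
print(f"Sim   = {sim:.5f}")
print(f"R_sem = {r_sem:.5f}")
```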

We refine the recovery module with Flow‑GRPO: trajectories are sampled as stochastic denoising paths, advantages are computed from the composite reward, and a KL‑penalized policy update prevents reward hacking while preserving generation quality.
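The advantage step can be sketched as follows, under two assumptions that the text does not confirm: that the composite reward is a weighted sum of $R_{\text{pix}}$ and $R_{\text{sem}}$, and that advantages are normalized within each sampled group, as in GRPO-style methods. The reward values and weights below are made up for illustration, and the KL-penalized policy update itself is not shown.

```python
import statistics

def composite_reward(r_pix, r_sem, w_pix=1.0, w_sem=1.0):
    """Assumed combination rule: a weighted sum of the two rewards."""
    return w_pix * r_pix + w_sem * r_sem

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled trajectories."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical (R_pix, R_sem) pairs for four denoising paths of one input.
group = [(0.91, 0.88), (0.95, 0.93), (0.89, 0.97), (0.97, 0.96)]
rewards = [composite_reward(p, s) for p, s in group]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward = {reward:.2f}  advantage = {advantage:+.3f}")
```

Trajectories that score above the group mean receive positive advantages and are reinforced; those below the mean are discouraged.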

**Figure 3.** Schematic of the dual-reward mechanism used in the reinforcement learning stage. (A) **Pixel-Level Structural Reward:** Computes the SSIM index by comparing local patches (luminance, contrast, structure) between the recovered image $I_r$ and the ground-truth clean image $I_o$. (B) **Semantic Consistency Reward:** Utilizes a frozen TinyCLIP (Wu et al., 2023) model to extract image embeddings. The reward is derived from the cosine similarity between the embeddings of $I_r$ and $I_o$, encouraging semantic alignment in the vision-language feature space.

Experimental Setup and Performance

Robust‑U1 beats prior models on real‑world and adversarial visual corruptions.

Robust‑U1 outperforms all baselines on R‑Bench across MCQ, VQA, and CAP tasks.

Table 1 shows overall MCQ score 0.7353 versus the best baseline 0.4872, VQA 0.8272 versus 0.3704, and CAP 0.8059 versus 0.7164.

BAGEL is a unified multimodal model that handles both image understanding and image generation, serving as the starting point for our experiments.

R‑Bench is a benchmark that measures multimodal understanding under realistic visual degradations of varying intensity.

**Table 1.** Quantitative evaluation on R-Bench (Li et al., 2024) for MCQ, VQA, and CAP tasks under three degradation levels (low to high). The best/second best results are shown in Red/Blue, respectively.

**Figure.** A visual question-answering example showing an input image and a ground truth image of a street intersection with a car moving across the frame, accompanied by a multiple-choice question and the correct answer.

Robust‑U1 consistently outperforms baselines across all corruption levels on R‑Bench.

Visual Recovery Quality

Robust‑U1’s ablations show each component’s impact on corrupted visual reasoning.

We assess Robust‑U1 under three synthetic corruption intensities (25 %, 50 %, 100 %) on standard VQA benchmarks. The tables below quantify both reasoning accuracy and visual‑recovery fidelity.

Robust‑U1 reaches 83.18 % accuracy on MMMB with 100 % corruption, beating the strongest baseline by +4.7 points.

Baseline BAGEL scores 78.48 %; prior robust SOTA Robust‑R1 scores 75.35 %.

Robust‑U1’s accuracy drops only 1.57 points from clean to 100 % corruption, versus 3.44 points for BAGEL and 6.06 points for Robust‑R1.

Robust‑U1 attains the highest PSNR (21.49 dB), surpassing the next best 21.45 dB.

Table 5: PSNR values – BAGEL 14.37, +SFT 20.88, +RL Rpix 21.45, +RL Rsem 21.33, Ours 21.49.
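For readers unfamiliar with the metric, PSNR is derived from the mean squared error between the recovered and clean images. The sketch below uses the standard definition with an assumed 8-bit peak value of 255; the toy pixel values are illustrative and are unrelated to the Table 5 results.

```python
import math

def psnr(recovered, clean, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - c) ** 2 for r, c in zip(recovered, clean)) / len(clean)
    if mse == 0:
        return math.inf                      # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

toy_recovered = [118, 132, 124, 138]
toy_clean = [120, 130, 125, 135]
toy_psnr = psnr(toy_recovered, toy_clean)
print(f"PSNR = {toy_psnr:.2f} dB")
```

Because the scale is logarithmic, small differences in dB correspond to small relative changes in mean squared error.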

**Table 4.** Ablation study on R-Bench (Li et al., 2024) for MCQ, VQA, and CAP tasks with three degradation levels (from low to high). The best/second best results are shown in Red/Blue, respectively.

**Table 5.** Quantitative evaluation of visual recovery quality on Robust-R1 (Tang et al., 2026a) (validation set). The best/second best results are shown in Red/Blue respectively.

**Figure.** A visual question-answering example showing an input street scene and a ground truth reference image, with the question "How many buses are there?" and the correct answer "B. 2".


**Figure 5.** Visual validation of $R_{pix}$. Compared with our full model (green), the variant without $R_{pix}$ may produce more pixel‑level artifacts (red).

Extended Analysis and Limitations

Robust‑U1’s self‑recovery consistently lifts performance and stays stable under design variations.

Table 12 shows that inserting the recovery module yields a large boost of +0.1793 on corrupted inputs and a modest but consistent gain of +0.0044 on clean inputs. The latter indicates that the module can still clean up residual mild artifacts even when the image appears nominally clean.

Because always‑on recovery does not hurt clean inputs (the change is small but positive), keeping the recovery module always active is a safe deployment default. A learned gating mechanism could further cut inference cost by skipping recovery when the input is confidently clean.
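The two deployment options can be contrasted in a short control-flow sketch. Everything here is hypothetical: `recover`, `reason`, and `looks_clean` stand in for the model's recovery branch, its reasoning branch, and a gate that the text only proposes, and the toy implementations exist solely to make the example runnable.

```python
from typing import Callable, List, Sequence

Image = Sequence[float]  # stand-in type: a flat list of pixel intensities

def answer(
    image: Image,
    question: str,
    recover: Callable[[Image], Image],
    reason: Callable[[List[Image], str], str],
    looks_clean: Callable[[Image], bool] = lambda _img: False,
) -> str:
    """Reason over the input, plus its recovered view unless the gate skips it."""
    views: List[Image] = [image]              # the original input always comes first
    if not looks_clean(image):                # default gate never skips: always-on
        views.append(recover(image))
    return reason(views, question)

# Toy stand-ins, for illustration only.
def toy_recover(img: Image) -> Image:
    return [min(255.0, p + 2.0) for p in img]

def toy_reason(views: List[Image], question: str) -> str:
    return f"{len(views)} view(s) used for: {question}"

def toy_gate(img: Image) -> bool:
    return max(img) - min(img) > 50.0         # arbitrary "looks clean" rule

always_on = answer([10.0, 12.0], "How many buses are there?", toy_recover, toy_reason)
gated = answer([10.0, 200.0], "How many buses are there?", toy_recover, toy_reason, toy_gate)
print(always_on)
print(gated)
```

Skipping the recovery call saves the cost of generating an image, at the risk of the gate misjudging a corrupted input as clean.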

We now examine how robust Robust‑U1 is to three training‑time design choices: (i) replacing the paired semantic reward with a reference‑free alternative, (ii) varying the semantic‑reward scaling factor $\alpha$, and (iii) swapping the frozen semantic encoder.

Reference‑free semantic reward. Instead of a clean target image, we compute CLIP text–image similarity between the corrupted caption $T$ and the recovered image $I_r$. This removes the need for a paired clean reference during RL.

The reference‑free variant already lifts the SFT‑only baseline by +0.0463, showing that self‑recovery does not strictly require paired supervision. Nevertheless, the full method with paired structural and semantic rewards still outperforms it by a large margin (+0.1165 overall), confirming paired data remains the most effective recipe.

Reward‑scaling factor $\alpha$ study. The semantic reward is $R_{\text{sem}}(I_r, I_o) = \exp(-\alpha\,(1 - \text{Sim}))$, where $\text{Sim}$ is the CLIP cosine similarity between $I_r$ and $I_o$. We sweep $\alpha \in \{1, 2, 5, 8, 10, 15\}$ on the validation set.

Performance varies by less than 0.6 % across $\alpha \in [2, 8]$; the default $\alpha = 5$ gives the highest R‑Bench score (0.7398). Smaller $\alpha$ values make the reward too smooth, while larger $\alpha$ values produce an overly sharp exponential that harms pixel‑level quality (PSNR drops from 21.49 dB at $\alpha = 5$ to 20.87 dB at $\alpha = 15$).
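To get a feel for why $\alpha$ controls the sharpness of the reward, the sketch below tabulates $\exp(-\alpha\,(1 - \text{Sim}))$, the form used in the worked example with $\alpha = 5$, over the swept $\alpha$ values. The three similarity levels are arbitrary choices for illustration, not values reported in the paper.

```python
import math

def semantic_reward(sim, alpha):
    """Exponential semantic reward exp(-alpha * (1 - sim))."""
    return math.exp(-alpha * (1.0 - sim))

alphas = [1, 2, 5, 8, 10, 15]        # the sweep described above
sims = [0.99, 0.95, 0.90]            # illustrative similarity levels

# Map each alpha to its rewards at the three similarity levels.
table = {a: [semantic_reward(s, a) for s in sims] for a in alphas}

print("alpha  " + "  ".join(f"Sim={s:.2f}" for s in sims))
for a in alphas:
    print(f"{a:>5}  " + "  ".join(f"{v:8.3f}" for v in table[a]))
```

A small $\alpha$ keeps the reward nearly flat across similarity levels, while a large $\alpha$ makes it fall off steeply as similarity drops.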

Semantic‑encoder ablation. We replace TinyCLIP (39 M parameters) with three alternatives: CLIP‑B/16 (149 M), SigLIP‑B/16 (150 M), and a distilled weak encoder (9 M). All encoders yield comparable R‑Bench scores, with TinyCLIP still marginally best.

These results demonstrate that Robust‑U1’s gains are not tied to a specific encoder; even a heavily distilled model retains 99.3 % of the default performance.

Reliability of recovery and evaluation. We first assess whether the recovery branch introduces hallucinations that could mislead downstream reasoning.

Table 16 defines three hallucination types: structural (spurious objects), semantic (incorrect attributes), and over‑sharpened (fabricated fine details).

Table 17 reports that Robust‑U1 achieves 92.3 % answer consistency and only 4.1 % harmful flips, far better than SFT‑only (7.2 % harmful) and the BAGEL baseline (15.6 % harmful). This confirms that semantic supervision effectively suppresses hallucinations in the recovery branch.

Evaluator sensitivity. We re‑score Robust‑U1 on R‑Bench using three LLM‑based judges: GPT‑3.5‑turbo (default), Qwen3‑Max, and GPT‑4o. All judges preserve the overall trend, with scores ranging from 0.7121 to 0.7398, and Robust‑U1 remaining substantially above all baselines.

These findings show that the main conclusions are robust to the choice of evaluator.

Qualitative results. Figure 11 presents side‑by‑side visual comparisons across five methods (corrupted input, BAGEL, + SFT, Robust‑U1, ground truth). Robust‑U1 consistently restores sharpness, correct colors, and semantic details, matching the ground‑truth appearance.

**Figure 6.** More visual comparison of recovered images across different baselines.

User‑study preferences (Table 19) reveal a strong bias toward Robust‑U1: participants preferred its outputs for semantic faithfulness in 92.3 % of comparisons and for overall visual quality in 85.7 % of cases, versus only 5.6 % and 10.1 % for the BAGEL baseline.

Limitations. Recovery quality is bounded by the underlying MLLM’s generative capacity; extreme corruptions may still leave critical details unrecoverable. Moreover, the method relies on paired corrupted–clean data, which can be costly to obtain for specialized domains.

Future work directions: (1) design lightweight recovery modules to reduce inference cost, (2) incorporate explicit corruption priors for better reconstruction, (3) extend self‑recovery to video and temporal streams, and (4) build broader benchmarks covering real‑world corruptions.

Implementation Details

The appendix outlines the implementation, comparisons, analyses, and limitations of Robust‑U1.

The appendix is divided into eight parts, covering implementation details, extended quantitative comparisons, mechanistic analyses, sensitivity studies, reliability assessments, qualitative results, a user study, and limitations.

Appendix A presents the training cost breakdown (Table 7) and the evaluation protocols used on R‑Bench and three anti‑degradation benchmarks.

Training proceeds in three stages: Stage I (SFT) consumes 1920 GPU‑hours on ImageNet‑C, Stage II (RL) adds 160 GPU‑hours, and Stage III (joint reasoning) adds 64 GPU‑hours; only the final stage updates both understanding and generation modules.

For R‑Bench, MCQ accuracy is computed as the fraction of correct answers, while VQA and captioning are scored by GPT‑3.5‑turbo on completeness, accuracy, and relevance, averaged over the test set.

On the anti‑degradation benchmarks (MMMB, MMStar, RealWorldQA) we use the same MCQ accuracy metric, with GPT‑3.5‑turbo parsing model outputs via VLMEvalKit.

Appendix B reports two extended comparisons: (i) external restoration modules applied before a strong discriminative MLLM, and (ii) a detect‑then‑recover variant evaluated at inference time.

**Table 8.** Comparison with external restoration modules on R-Bench (Li et al., 2024). Restoration baselines are applied as a preprocessing step before Qwen2.5-VL-7B (Bai et al., 2025). The best results are shown in Red.

**Table 9.** Inference-time comparison on R-Bench (Li et al., 2024). “Rec. Mem.” / “Und. Mem.” denote peak GPU memory for the recovery and understanding stages, respectively.

Appendix C investigates why internal self‑recovery improves reasoning, separating the effects of reconstruction supervision from chain‑of‑thought (CoT) supervision.

**Table 10.** Component isolation on R-Bench. “Recon.” indicates whether reconstruction-based recovery supervision (SFT only or SFT+RL) is used, and “CoT” indicates whether reasoning-chain supervision is used.

**Table 11.** Recovery quality (PSNR) vs. downstream reasoning performance (R-Bench)

**Table 12.** Effect of always-on recovery on clean and corrupted inputs (R-Bench Overall).
