Geometry-Aware Representation Denoising for Robust Multi-View 3D Reconstruction

Jin Hyeon Kim, Jaeeun Lee, Claire Kim, Kyoungjin Oh, Paul Hyunbin Cho, Jaewon Min, Yeji Choi, Jihye Park, Hyunhee Park, Minkyu Park, Seungryong Kim

GARD restores degraded multi-view images by denoising directly within the geometry-aware feature space of a 3D reconstructor.

How can we improve 3D reconstruction from degraded images by denoising latent representations instead of raw pixels?

Feed-forward 3D reconstruction models excel on clean images but fail when inputs are degraded by motion blur, as errors propagate through the network and destroy geometric consistency. The authors propose Geometry-Aware Representation Denoising (GARD), a diffusion-based denoiser that operates directly on the intermediate feature representations of a frozen 3D reconstructor. By denoising in this geometry-aware space rather than pixel or VAE-latent space, the model preserves structural fidelity and cross-view consistency. This approach enables simultaneous recovery of high-quality RGB images and accurate 3D scene geometry in a single forward pass, consistently outperforming existing restoration pipelines on the Depth Anything 3 benchmark.

Paper Primer

Existing "restore-then-reconstruct" pipelines suffer because single-view restoration ignores multi-view geometry, while VAE-based latent restoration discards the fine-grained details necessary for accurate 3D correspondence. GARD bypasses these bottlenecks by inserting a diffusion-based denoiser into the encoder of a pretrained 3D reconstructor, effectively treating the reconstructor's own internal features as the domain for restoration.

GARD significantly improves camera pose estimation and 3D reconstruction accuracy under severe motion blur.

Quantitative evaluation on the Depth Anything 3 benchmark shows GARD consistently outperforms both single-view restoration models (e.g., Restormer, HI-Diff) and multi-view VAE-based baselines across all datasets. GARD achieves the highest AUC scores for pose estimation and the best F-scores for 3D reconstruction across five diverse benchmarks.

Operating in geometry-aware feature space preserves structural fidelity better than compressed latent spaces.

Ablation studies and feature similarity analysis show that GARD maintains higher similarity to clean representations than VAE-based baselines, which suffer from scattered and ambiguous correspondences. GARD consistently yields higher PSNR and lower LPIPS scores for image restoration compared to all baseline methods.

Why is the reconstructor's internal feature space better for denoising than standard image space?

Standard image-space restoration lacks multi-view context and geometric constraints. By denoising in the reconstructor's feature space, GARD leverages representations already optimized for cross-view reasoning and scene geometry, ensuring that the restoration process respects the underlying 3D structure.

Does this method require retraining the entire 3D reconstruction backbone?

No. The GARD denoiser is inserted into a frozen feed-forward reconstructor. The framework fine-tunes an auxiliary RGB decoder for image recovery while the geometry decoder remains part of the original, fixed backbone.

Researchers can now improve the robustness of 3D reconstruction systems by treating the model's internal feature representations as a target for restoration, rather than attempting to clean inputs before they enter the pipeline.

The Challenge of Degraded Inputs

We examine why feed-forward 3D reconstruction models break down under degraded inputs and motivate denoising in geometry-aware feature space.

Feed-forward 3D reconstruction models infer scene geometry from geometry-aware features in a single forward pass. In real-world multi-view capture, degradations such as motion blur obscure fine textures and break cross-view matching, so the learned features lose discriminability. Because the models lack an explicit correction step, these errors propagate and the final reconstruction collapses.

As a worked example, take $N = 256$ views and a feature dimension of $d = 512$. The number of pairwise view interactions is $N^2 = 256^2 = 65{,}536$.

Each interaction stores a $d$-dimensional vector: $65{,}536 \times 512 = 33{,}554{,}432$ float entries.

At 4 bytes per float, the total memory is $33{,}554{,}432 \times 4 = 134{,}217{,}728$ bytes, which is exactly 128 MiB (about 134 MB).

This memory footprint grows quadratically with the number of views, quickly becoming prohibitive for high‑resolution multi‑view setups.
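The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `pairwise_feature_bytes` is hypothetical, and the values 256, 512, and 4 bytes are the ones used in the worked example.

```python
def pairwise_feature_bytes(num_views: int, feature_dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store one feature vector for every ordered pair of views."""
    num_pairs = num_views * num_views  # quadratic in the number of views
    return num_pairs * feature_dim * bytes_per_value


total_bytes = pairwise_feature_bytes(256, 512)
print(total_bytes, total_bytes / 2**20)  # 134217728 128.0

# Doubling the number of views quadruples the footprint.
print(pairwise_feature_bytes(512, 512) // total_bytes)  # 4
```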

The evaluation benchmark is a curated suite that measures multi-view 3D reconstruction quality when inputs are deliberately degraded (e.g., by motion blur).

Feed‑forward models fail under degradation because they lack a dedicated denoising stage to correct corrupted geometry‑aware features.

Prior Approaches to Robust Reconstruction

We review prior approaches to robust multi-view reconstruction, image restoration, and representation learning.

Feed‑forward 3D reconstruction models have replaced traditional multi‑stage pipelines by producing scene geometry in a single forward pass. This design yields fast inference but assumes clean inputs, making the models brittle when faced with real‑world degradations.

Earlier robust reconstruction works target distractor views. RobustVGGT introduces an outlier rejection step, while VGTW learns a dedicated head to predict and suppress distractor objects. Our approach is orthogonal, concentrating on image degradations rather than view‑level outliers.

Camera motion blur is especially problematic because it blurs fine textures, edges, and structural details, which in turn corrupts geometric correspondence estimation and hampers accurate 3D reconstruction.

Image restoration aims to recover clean images from degraded observations such as blur, noise, or low resolution. Early CNN-based methods were superseded by transformer-based models like Restormer and HI-Diff, and by InstructIR, which adds language conditioning. All of these operate on single images and cannot exploit complementary information across multiple views.

Video restoration models (e.g., VRT, FMA‑Net) process temporally adjacent frames and rely heavily on temporal coherence, limiting their applicability to multi‑view setups with large viewpoint changes. SIR‑Diff attempts multi‑view diffusion but works in compressed VAE latents that may discard fine‑grained visual structures.

Representation space learning leverages large‑scale pretrained encoders to obtain semantically rich, high‑dimensional feature spaces that generalize across downstream tasks. While latent diffusion models improve efficiency by operating in compact latent spaces, they often lack geometric consistency. Representation Autoencoders replace the VAE encoder with a frozen pretrained network, yielding richer latents. Building on these ideas, geometry‑aware representation spaces preserve cross‑view consistency and detailed scene structure, providing a natural substrate for diffusion‑based feature restoration.

The GARD Framework

This section introduces GARD, which denoises in geometry-aware feature space to jointly restore images and 3D geometry.

Pixel‑space restoration pipelines cannot exploit multi‑view complementarity and introduce bottlenecks that corrupt geometric consistency, which directly harms downstream 3D reconstruction.

Instead of cleaning images first, we denoise the intermediate geometry‑aware features inside the frozen reconstructor, so the same cleaned representation feeds both the geometry decoder and the RGB decoder.

Encoder $E$ produces $z_1^{\text{deg}} \in \mathbb{R}^{2\times4\times2}$ and $z_2^{\text{deg}}$ (the target layer).

Apply $S_{\theta}$: $\hat{z}_2 = S_{\theta}(z_2^{\text{deg}})$, yielding a cleaned tensor of the same shape.

Pass $\hat{z}_2$ through layers $3\ldots L$, obtaining restored features $z_3^{\text{res}},\dots,z_L^{\text{res}}$.

Select four levels $M=\{L-3, L-2, L-1, L\}$ to form $Z_{\text{res}}$.

Decode geometry: $G = D(Z_{\text{res}})$ and RGB: $I_{\text{res}} = D_{\text{rgb}}(Z_{\text{res}})$.

The denoiser repairs only the corrupted layer, leaving the rest of the encoder untouched, which keeps the original multi‑view context intact.

How does GARD differ from a conventional pixel‑space denoiser that runs before the reconstructor?

Pixel‑space denoisers operate on raw images, discarding the multi‑view feature correlations that the encoder already encodes. GARD instead cleans the intermediate representation, preserving those correlations and allowing a single forward pass to produce both geometry and RGB outputs.

Features produced by the multi‑view encoder already encode depth cues and cross‑view correspondences, making them a natural substrate for denoising.

Why are geometry‑aware features more suitable for denoising than a compressed VAE latent?

Geometry‑aware features retain spatial detail and explicit multi‑view correspondence, whereas a VAE latent compresses away high‑frequency structure, which is essential for accurate depth and pose estimation.

Pixel‑space pipelines first clean images then reconstruct geometry, while feature‑space pipelines denoise inside the encoder so both tasks share a single cleaned representation.

We treat the degraded feature as the source distribution and learn a velocity field that transports it to the clean feature, avoiding a pure Gaussian start.

Why does the loss use the degraded latent as the source instead of pure Gaussian noise?

Because $z_K^{\text{deg}}$ already contains useful geometric structure; starting from it preserves that information and speeds up convergence compared to diffusing from an uninformative Gaussian.
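To make the idea of a degraded-to-clean velocity field concrete, here is a minimal NumPy sketch of a flow-matching objective whose source sample is the degraded feature. It assumes a straight-line interpolation path, which is a common choice but is not confirmed by the text above; `interpolate`, `flow_matching_loss`, and `velocity_fn` are hypothetical names, and the arrays stand in for $z_K^{\text{deg}}$ and its clean counterpart.

```python
import numpy as np


def interpolate(z_deg, z_clean, t):
    """Point on the straight path from the degraded (t=0) to the clean (t=1) feature."""
    return (1.0 - t) * z_deg + t * z_clean


def flow_matching_loss(velocity_fn, z_deg, z_clean, t):
    """Mean squared error between the predicted and the straight-path velocity."""
    z_t = interpolate(z_deg, z_clean, t)
    target_velocity = z_clean - z_deg  # constant along a straight path (assumption)
    predicted = velocity_fn(z_t, t)
    return float(np.mean((predicted - target_velocity) ** 2))


rng = np.random.default_rng(0)
z_clean = rng.normal(size=(2, 4, 2))
z_deg = z_clean + 0.5 * rng.normal(size=(2, 4, 2))  # toy "degraded" feature

# An oracle that already knows the displacement has zero loss;
# a model that predicts no motion is penalised.
oracle_loss = flow_matching_loss(lambda z, t: z_clean - z_deg, z_deg, z_clean, t=0.3)
static_loss = flow_matching_loss(lambda z, t: np.zeros_like(z), z_deg, z_clean, t=0.3)
print(oracle_loss, static_loss > 0.0)
```

Because the path starts at the degraded feature rather than at Gaussian noise, the displacement the model has to learn is only the gap between the degraded and clean features.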

We explicitly force the denoiser’s global attention maps to match geometrically correct correspondences, ensuring that cross‑view interactions remain faithful.

What would happen if the attention alignment loss were omitted?

The denoiser could learn spurious cross‑view attention patterns that ignore true geometric correspondences, leading to degraded depth accuracy and inconsistent multi‑view reconstructions.
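The exact form of the attention alignment loss is not given here. One plausible form, shown purely as an assumption, is a cross-entropy between each row of the denoiser's attention map and a target correspondence distribution. The names `softmax`, `attention_alignment_loss`, `attn_logits`, and `target_corr` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def attention_alignment_loss(attn_logits, target_corr, eps=1e-8):
    """Cross-entropy between target correspondence rows and attention rows.

    attn_logits: (num_tokens, num_tokens) raw attention scores.
    target_corr: (num_tokens, num_tokens) rows that sum to 1 and mark the
                 geometrically correct match for each query token.
    """
    attn = softmax(attn_logits, axis=-1)
    return float(-np.mean(np.sum(target_corr * np.log(attn + eps), axis=-1)))


# Toy case: 4 tokens, each token's correct match is the next token (cyclic).
target = np.roll(np.eye(4), shift=1, axis=1)
aligned_loss = attention_alignment_loss(20.0 * target, target)
uniform_loss = attention_alignment_loss(np.zeros((4, 4)), target)
print(aligned_loss < uniform_loss)  # True
```

Attention that concentrates on the correct matches drives this loss toward zero, while uniform attention sits at roughly $\log N$ for $N$ tokens.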

Encode degraded images $I_{\text{deg}}$ with the frozen multi‑view encoder $E(\cdot)$ to obtain layer‑wise features $\{z_l\}_{l=1}^{L}$.

Apply the GARD denoiser $S_{\theta}(\cdot)$ at layer $K$ to produce $\hat{z}_K$.

Propagate $\hat{z}_K$ through the remaining encoder layers, yielding restored features $\{z_l^{\text{res}}\}_{l=K}^{L}$.

Select four feature levels $Z_{\text{res}}$ and feed them to the geometry decoder $D(\cdot)$ and the RGB decoder $D_{\text{rgb}}(\cdot)$.

Obtain final geometry $G = D(Z_{\text{res}})$ and restored images $I_{\text{res}} = D_{\text{rgb}}(Z_{\text{res}})$.

GARD denoiser insertion in the encoder forward pass.
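The forward pass described above can be sketched as a short loop. This is a toy stand-in, not the actual model: the encoder layers and the denoiser are arbitrary callables, `forward_with_denoiser` is a hypothetical name, and the decoders are omitted (they would simply consume the returned list).

```python
import numpy as np


def forward_with_denoiser(z_input, encoder_layers, denoiser, k, selected_levels):
    """Run encoder layers 1..L, replacing the layer-k output with its denoised version.

    Returns the features at `selected_levels` (1-indexed), which play the
    role of Z_res for the geometry and RGB decoders.
    """
    features = {}
    z = z_input
    for level, layer in enumerate(encoder_layers, start=1):
        z = layer(z)
        if level == k:
            z = denoiser(z)  # the cleaned feature feeds every later layer
        features[level] = z
    return [features[m] for m in selected_levels]


# Toy setup: 4 layers that each add 1, and a "denoiser" that resets to zero.
layers = [lambda z: z + 1.0 for _ in range(4)]
z0 = np.zeros((2, 4, 2))

restored = forward_with_denoiser(z0, layers, lambda z: np.zeros_like(z), k=2, selected_levels=[3, 4])
untouched = forward_with_denoiser(z0, layers, lambda z: z, k=2, selected_levels=[3, 4])
print(restored[1][0, 0, 0], untouched[1][0, 0, 0])  # 2.0 4.0
```

The toy run shows the point of the design: a change made at layer $K$ is carried through every later layer, so a single correction reaches both decoders.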

Pose and Reconstruction Performance

GARD sets a new pose‑estimation accuracy record on the DA3 benchmark.

According to the paper, GARD attains the highest pose-estimation AUC on all five DA3 benchmark datasets.

Table 1 is the source of the values quoted here: (19.83, 8.18, 67.80, 85.91, 67.01) for GARD, (87.20, 96.65, 53.45, 84.68, 92.44) for the single-view method Restormer, and (3.69, 8.18, …) for the multi-view method VRT. These figures appear to mix datasets and AUC thresholds, so they do not by themselves give a like-for-like gap between methods.

**Table 1.** Quantitative comparison of different restoration methods across five datasets (HiRoom, ETH3D, DTU, 7Scenes, ScanNet++). The table evaluates methods based on Overall error (lower is better) and F-score (higher is better).

**Table 1.** Quantitative comparison of depth estimation performance across different datasets and methods.

**Figure 5.** Qualitative results for camera trajectory prediction. We visualize the top-down camera trajectories for degraded multi-view inputs. Compared to the baselines, the proposed GARD produces more accurate and geometrically consistent camera pose trajectories. The black dot indicates the starting camera point. Please zoom in for clearer visualization.

**Figure 6.** Qualitative 3D reconstruction results. We visualize the reconstructed 3D point clouds from degraded multi-view inputs. Compared with baseline approaches, the proposed GARD produces more accurate and geometrically consistent 3D reconstructions. Please zoom in for clearer visualization.

GARD delivers the strongest pose estimation among the compared methods on DA3 under heavy blur.

Image Restoration Quality

GARD sets the new PSNR record on DA3 image restoration.

GARD achieves the highest PSNR on the DA3 benchmark, surpassing the next-best multi-view model by 1.37 dB.

Table 3 shows GARD (Ours) reaches 22.67 PSNR for multi‑view restoration, while VRT attains 21.30 PSNR.

Single-view restoration models cannot exploit complementary information across views, limiting their performance under severe blur. Multi-view approaches such as VRT and FMA-Net are designed for temporally adjacent frames and therefore struggle with the large viewpoint changes between views in the DA3 setup. Only the VAEMVD baseline shows a modest gain, but GARD's feature-space denoising yields a clear advantage.

**Table 3.** Quantitative restoration results. We report PSNR↑ and LPIPS↓ for image restoration on the DA3 benchmark [29], comparing single-view and multi-view restoration approaches. The best result is highlighted in **bold** and the second best is <u>underlined</u>.

**Figure 7.** Qualitative image restoration results. We visualize restored RGB images from degraded multi-view inputs. Compared with baseline approaches, the proposed GARD effectively recovers high-fidelity multi-view images while preserving fine-grained details. Please zoom in for clearer visualization.

Component Analysis

Removing each training component or reducing views degrades performance, revealing their impact.

The full GARD configuration (both interpolated flow and attention alignment) attains the highest reconstruction F‑score.

Model D (Full) reaches 45.79 F‑score versus 44.65 for the next‑best Model C.

Increasing the number of input views raises pose‑estimation AUC30 on ETH3D by over fifteen points.

ETH3D AUC30 grows from 67.22 with 4 views to 82.35 with 10 views (Table 5).

Extended Results and Analysis

To recap, GARD denoises geometry-aware features so that 3D reconstruction stays robust to degraded inputs.

This appendix supplies additional evidence for GARD’s effectiveness. We first describe a similarity analysis, then show cost‑volume visualizations, and finally report depth‑estimation numbers. Throughout we compare against the SIR‑Diff baseline.

SIR-Diff is a diffusion-based multi-view image-restoration pipeline that iteratively denoises degraded inputs in a compressed VAE latent space.

**Figure 8.** Feature similarity analysis across layers. We evaluate the cosine similarity between the restored feature representations and the corresponding clean HQ representations across the multi-view encoder layers of the feed-forward 3D reconstruction model [29]. The GARD denoiser is applied at layer $K = 18$. Across all DA3 benchmark [29] datasets, the feature similarity of the degraded LQ representations (red) progressively decreases in deeper layers due to the accumulation of degradation-induced feature distortion. In contrast, the restored representations produced by the GARD denoiser (blue) recover feature representations that remain substantially closer to the clean HQ representations throughout the deeper layers. This demonstrates that the GARD denoiser effectively restores geometry-aware features and suppresses degradation propagation, resulting in substantially improved consistency with the clean HQ representations. Please zoom in for clearer visualization.

**Figure 9.** Correspondence visualization of feature cost volumes. We visualize cross-view correspondences from cost volumes constructed with VAE [23], DINOv2 [40], and DA3 [29] features. Please zoom in for clearer visualization.

**Figure 10.** Multi-view image restoration results of SIR-Diff [34]. We show qualitative results on the DA3 benchmark [29] based on our reproduction of SIR-Diff.

**Figure 11.** Visualization of target correspondence maps. We visualize the effect of attention alignment training which augments the learning of the attention of the GARD denoiser.

**Figure 12.** Visualization of attention alignment. We visualize the effect of attention alignment training which augments the learning of the attention of the GARD denoiser.

**Figure 13.** Qualitative camera pose estimation on the DA3 benchmark [29]. We visualize the top-down camera trajectory results for ten input views. The black dot indicates the starting camera point. Please zoom in for clearer visualization.

**Figure 14.** Qualitative 3D reconstruction results on the DA3 benchmark [29]. We visualize the 3D reconstruction point cloud results for ten input views. Please zoom in for clearer visualization.

**Figure 15.** Qualitative image restoration results on the DA3 benchmark [29]. We visualize three selected views out of ten input views for each dataset. Please zoom in for clearer visualization.

**Figure 16.** Qualitative depth estimation results on the DA3 benchmark [29]. We visualize three selected views out of ten input views for each dataset. Please zoom in for clearer visualization.

**Table 6.** Quantitative depth estimation results. We report AbsRel↓ and $\delta$1 ↑ across the DA3 benchmarks [29]. The best result is highlighted in bold and the second best is underlined.

Backbone Capacity Analysis

Implementation details and evaluation metrics for the GARD experiments.

We assess GARD’s denoiser on a smaller multi‑view encoder (DA3‑BASE) that has 12 transformer layers and 0.11 B parameters, inserting the denoiser at encoder layer K = 4 (named GARDBASE).

Evaluation follows the DA3 protocol: pose estimation uses AUC of Relative Rotation and Translation Accuracy at 5°/30° thresholds; 3D reconstruction reports accuracy, completeness, and their average Chamfer Distance, plus an F1‑score; depth estimation uses standard depth metrics; image restoration is measured with PSNR (higher is better) and LPIPS (lower is better).
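For readers unfamiliar with these metrics, the sketch below shows common formulations of pose AUC, point-cloud F-score, and PSNR in NumPy. These are assumptions about how such metrics are typically computed, not the benchmark's reference code: the 1-degree binning, the use of the larger of the rotation and translation errors, the distance threshold `tau`, and all function names are illustrative choices.

```python
import numpy as np


def pose_auc(rot_err_deg, trans_err_deg, max_threshold):
    """Area under the accuracy-vs-threshold curve, using 1-degree bins."""
    rot = np.asarray(rot_err_deg, dtype=float)
    trans = np.asarray(trans_err_deg, dtype=float)
    max_err = np.maximum(rot, trans)        # a pair counts only if both errors are small
    bins = np.arange(max_threshold + 1)     # edges 0, 1, ..., max_threshold
    hist, _ = np.histogram(max_err, bins=bins)
    accuracy = np.cumsum(hist / len(max_err))
    return float(np.mean(accuracy))


def f_score(pred_points, gt_points, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    pred = np.asarray(pred_points, dtype=float)
    gt = np.asarray(gt_points, dtype=float)
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    precision = np.mean(dists.min(axis=1) < tau)  # predicted points near the ground truth
    recall = np.mean(dists.min(axis=0) < tau)     # ground-truth points that are covered
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))


def psnr(restored, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_value]."""
    diff = np.asarray(restored, dtype=float) - np.asarray(reference, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))


print(pose_auc([0.5, 0.5], [0.2, 0.9], max_threshold=5))     # 1.0
print(f_score([[0, 0, 0]], [[0, 0, 0]], tau=0.1))            # 1.0
print(round(psnr(np.full((4, 4), 0.1), np.zeros((4, 4))), 3))  # 20.0
```

The brute-force distance matrix in `f_score` is fine for small clouds. A real evaluation would normally use a spatial index such as a KD-tree.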

**Table 9.** Quantitative 3D reconstruction results. We report Overall↓ and F-Score↑ for 3D reconstruction on the DA3 benchmark [29] using the DA3-BASE [29] checkpoint as the feed-forward 3D reconstruction model. The best result is highlighted in **bold** and the second best is <u>underlined</u>.

Additional Quantitative Results

Detailed quantitative results and implementation specifics for GARD’s restoration pipeline.

Table 9 reports overall and F‑score metrics for 3D reconstruction on the DA3 benchmark, comparing several single‑view and multi‑view restoration baselines against our GARDBASE model.

Section B.2 details how the GARD framework restores RGB images by re‑using the ViT‑based decoder from GLD, initializing it from the DA3‑BASE checkpoint and adding a linear projection adapter to match the DA3‑GIANT feature dimension.

Four multi‑scale feature levels $M=\{20,28,34,40\}$ are concatenated and fed to the RGB decoder, enabling joint reconstruction of high‑quality images alongside geometry.

Section B.3 describes training on the synthetic Hypersim and TartanAir datasets, which together provide indoor‑high‑fidelity and outdoor‑dynamic scenes for robust geometric learning.

Training runs for 10 epochs with a global batch size of 8 on NVIDIA H100 GPUs, using AdamW ($\text{lr}=2\!\times\!10^{-4}$, $\beta_1=0.9$, $\beta_2=0.95$) and EMA decay $0.9995$; gradients are clipped to norm 1.0.
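Two of the training details above, gradient clipping to norm 1.0 and an EMA decay of 0.9995, can be illustrated with a small NumPy sketch. It shows only the update rules under simple assumptions (parameters and gradients as lists of arrays). The function names are hypothetical, and the small constant in the denominator is a common numerical safeguard rather than something stated in the text.

```python
import numpy as np


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads]


def ema_update(ema_params, params, decay=0.9995):
    """Move each EMA parameter a small step toward the current parameter."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


grads = [np.array([3.0, 4.0])]              # joint norm 5.0
clipped = clip_by_global_norm(grads, max_norm=1.0)
print(round(float(np.linalg.norm(clipped[0])), 4))  # 1.0

ema = ema_update([np.array([0.0])], [np.array([1.0])])
print(round(float(ema[0][0]), 6))  # 0.0005
```

With a decay of 0.9995, each step moves the averaged weights only 0.05% of the way toward the current weights, so the EMA copy changes slowly and smooths out step-to-step noise.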

During inference (B.4), degraded multi‑view inputs are encoded by DA3‑GIANT‑1.1, the $K\!=\!18$‑th encoder layer’s geometry‑aware representation is denoised via a 50‑step Euler ODE solver, and the restored features are passed through the geometry and RGB decoders.
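The 50-step Euler solve mentioned above can be sketched as follows. This is a generic fixed-step Euler integrator over $t \in [0, 1]$, not the authors' code: `euler_sample` and `velocity_fn` are hypothetical names, and a constant toy velocity takes the place of the learned denoiser.

```python
import numpy as np


def euler_sample(velocity_fn, z_start, num_steps=50):
    """Integrate dz/dt = velocity_fn(z, t) from t=0 to t=1 with fixed Euler steps."""
    z = np.array(z_start, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        z = z + dt * velocity_fn(z, t)  # one explicit Euler step
    return z


# Toy check: a constant velocity equal to (clean - degraded) carries the
# degraded feature exactly onto the clean one.
z_deg = np.zeros((2, 4, 2))
z_clean = np.ones((2, 4, 2))
z_hat = euler_sample(lambda z, t: z_clean - z_deg, z_deg, num_steps=50)
print(np.allclose(z_hat, z_clean))  # True
```

In practice the velocity comes from the trained denoiser and is not constant, so the number of steps trades accuracy against inference cost.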

**Table 7.** Quantitative evaluation of SIR-Diff [34]. We report the pose estimation and 3D reconstruction results on the DA3 benchmark [29], comparing SIR-Diff [34] with our GARD framework.
