Direct 3D-Aware Object Insertion via Decomposed Visual Proxies

DIRECT enables precise 3D pose-controllable object insertion by decomposing guidance into geometry, appearance, and context pathways.

How can we insert an object into a background image while maintaining precise 3D pose control and high-fidelity appearance?

Current generative models treat object insertion as a 2D inpainting task, making it impossible to precisely align an object’s 3D pose with a target scene. DIRECT solves this by lifting a reference image into a 3D proxy, which is then rendered to provide explicit geometric guidance that the model follows alongside the original appearance and scene context. This approach outperforms existing methods in both geometric accuracy and visual fidelity, allowing for complex pose transformations that were previously prone to distortion or identity loss.

Paper Primer

The core mechanism is a Decomposed Injection Strategy that prevents feature entanglement. By assigning independent LoRA adapters and positional embeddings to geometry, appearance, and context tokens, the model learns to extract structural pose information from the 3D proxy without sacrificing the high-fidelity texture of the reference object.
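One way to picture independent adapters per pathway is a shared frozen linear layer with a separate low-rank update for each token group. The sketch below is illustrative only: the class name `PathwayLoRALinear`, the zero initialization of the up-projection, and the use of integer segment ids to pick an adapter are assumptions, not details taken from the paper.

```python
import numpy as np

class PathwayLoRALinear:
    """Frozen linear layer plus one low-rank (LoRA) update per pathway (hypothetical)."""

    def __init__(self, dim_in, dim_out, num_pathways, rank, rng, scale=1.0):
        # Base weight stands in for the frozen backbone projection.
        self.weight = rng.standard_normal((dim_in, dim_out)) / np.sqrt(dim_in)
        # One (down, up) pair per pathway; "up" starts at zero so the layer
        # initially behaves exactly like the frozen base layer.
        self.down = rng.standard_normal((num_pathways, dim_in, rank)) * 0.01
        self.up = np.zeros((num_pathways, rank, dim_out))
        self.scale = scale

    def __call__(self, tokens, segment_ids):
        out = tokens @ self.weight
        for p in range(self.down.shape[0]):
            sel = segment_ids == p  # tokens that belong to pathway p
            out[sel] += self.scale * (tokens[sel] @ self.down[p] @ self.up[p])
        return out

rng = np.random.default_rng(0)
layer = PathwayLoRALinear(dim_in=8, dim_out=8, num_pathways=3, rank=2, rng=rng)
tokens = rng.standard_normal((6, 8))
segment_ids = np.array([0, 0, 1, 1, 2, 2])  # e.g. appearance, geometry, context
base_out = layer(tokens, segment_ids)
```

Because each adapter only touches its own tokens, updating the geometry adapter leaves the projections of appearance and context tokens unchanged, which is the isolation property the paragraph describes.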

DIRECT achieves superior pose accuracy and identity preservation compared to cascaded 3D-aware editing baselines.

Trained on a 160k-pair hybrid dataset and evaluated on a 200-pair test set, DIRECT shows consistent improvements in CLIP-I, DINO, and a new dense Matching Error metric. The method reduces Matching Error from 26.9 to 22.7 and increases CLIP-I identity scores from 0.904 to 0.943.

To support this, the authors built an automated data pipeline that synthesizes novel-view references from real-world images. This hybrid dataset of 160k pairs allows the model to generalize to complex, in-the-wild scenes that standard object-centric datasets lack.

Why is a 3D proxy necessary if modern diffusion models are already excellent at inpainting?

Text-based or 2D-only models lack explicit spatial correspondence, leading to "hallucinated" poses when users request specific orientations. The 3D proxy provides a dense geometric condition that forces the model to adhere to a specific 6-DoF pose.

What is the primary failure mode of this framework?

The model’s performance is strictly bounded by the quality of the upstream 3D reconstruction. If the initial 3D proxy suffers from severe geometric distortion, such as incorrect aspect ratios, those errors propagate directly into the final synthesized output.

Researchers can now achieve precise 3D-aware object insertion by treating geometry as a separate, explicit conditioning signal rather than an abstract text prompt or a 2D spatial constraint.

Introduction and Motivation

We expose why 2‑D inpainting fails at pose‑controlled insertion and outline our 3‑D‑aware solution.

Standard 2‑D inpainting models excel at filling missing pixels but have no notion of 3‑D geometry. Consequently they cannot honor user‑specified object poses, leading to misaligned or implausible insertions. This limitation is the core problem we address.

Recent diffusion pipelines such as Nano Banana Pro treat insertion as a pure 2‑D fill‑in, while parametric 3‑D‑aware approaches like Object3DIT rely on sparse rotation parameters that are hard to translate into dense pixel‑level deformations. Both strategies leave a gap between the desired 6‑DoF pose and the generated image.

Our solution, DIRECT (Direct 3D‑Aware Object Insertion), lifts the reference object to a coarse 3‑D proxy, renders it under the target pose, and injects three complementary signals—appearance, geometry, and context—through independent pathways. This decomposition prevents feature entanglement, preserves the object’s visual identity, and enforces the exact pose dictated by the user.

**Figure 1.** **Pose-controllable object insertion.** (a) Existing pipelines have difficulty placing the reference object in a reasonable and user-specified pose within the background image, even when using a strong 2D generative model such as Nano Banana Pro (Google, 2025) or a 3D-aware editing model such as Object3DIT (Michel et al., 2023). In contrast, our framework inserts the object with precise pose control and better scene alignment. (b) Additional results show that our method achieves high-fidelity insertion with precise pose control in complex real-world scenes. We report the prompts used for competing methods in Appendix A.

The shift from 2‑D inpainting to 3‑D‑aware insertion unlocks reliable pose control while retaining high‑fidelity synthesis.

Related Work

A survey of prior insertion, editing, and 3D generation methods, emphasizing their geometric limitations.

Related work spans object insertion, 3D‑aware editing, and image‑to‑3D generation. We highlight how prior methods handle geometry and where they fall short of our pose‑controllable insertion.

Injects feature maps into a diffusion model to improve fidelity of inserted objects.

Uses a learned door‑like gating mechanism to blend inserted objects with the background.

Adopts a copy‑paste‑harmonize pipeline for object insertion.

Reformulates object insertion as a unified inpainting task using the FLUX backbone and a “diptych” design.

Fine‑tunes a generative model with encoded camera parameters or bounding boxes.

Training‑free method that manipulates diffusion features via inversion.

Guides diffusion with geometric constraints derived from a proxy.

Manipulates diffusion latent space to sculpt image content.

Leverages intrinsic 3D cues from a single view to guide composition.

Copies a 3D asset into a scene and renders it for compositing.

Uses a rendered proxy as a guidance signal for in‑place editing.

Employs a 3D proxy to steer diffusion for editing tasks.

Encodes explicit geometry signals (e.g., camera pose, bounding box) into a diffusion model to steer object insertion.

Predicts a dense 3D representation from a single image using a diffusion backbone, enabling high‑fidelity geometry synthesis.

Optimizes a NeRF representation via Score Distillation Sampling, producing 3D assets from text.

Uses SDS to generate 3D assets from textual prompts, similar to DreamFusion.

Feed‑forward transformer that regresses a 3D representation from a single image in seconds.

Transformer‑based model that directly outputs a 3D mesh from an image.

3D diffusion model that generates Gaussian‑splatting representations from a single view.

Combines diffusion with a hierarchical representation to produce high‑quality 3D assets.

In contrast, our DIRECT framework lifts a single image into an explicit 3D proxy and injects it, the reference image, and scene context through decomposed pathways, achieving precise pose control without high‑quality 3D assets or test‑time optimization.

The DIRECT Framework

Our method decomposes object insertion into geometry, appearance, and context pathways to achieve pose‑controllable generation.

The framework splits object insertion into three isolated pathways—geometry, appearance, and context—so each requirement can be satisfied without the others interfering.

The model receives three visual signals—a geometry image, a reference appearance image, and a background context image—that together steer pose, identity, and scene integration.

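As a toy example, suppose $I_{ref}$ shows a red square, $I_{geo}$ shows the proxy rendered at a rotated orientation, and $I_{bg}$ is a plain gray background.

Encode $I_{ref}$ → latent tokens $z_{ref}$ (captures red color).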

Encode $I_{geo}$ → latent tokens $z_{geo}$ (captures rotated orientation).

Encode $I_{bg}$ with frozen encoder → global tokens $c_{global}$ (captures gray tone).

Form the noisy diffusion latent $z_t$ at timestep $t$ (random noise).

Concatenate $Z=[c_{global}, z_t, z_{ref}, z_{geo}]$ and run through the DiT block.

Decode the output latent to an image in which the red square keeps the appearance of $I_{ref}$ but takes the pose encoded in $I_{geo}$, blended seamlessly into the gray background.

This toy example shows how each component contributes a distinct piece of information—color from $I_{ref}$, pose from $I_{geo}$, and scene consistency from $c_{global}$—without any one signal dominating the others.

Encode the appearance image $I_{ref}$ into latent tokens $z_{ref}$.

Encode the geometry image $I_{geo}$ into latent tokens $z_{geo}$.

Encode the full‑frame background $I_{bg}$ with a frozen SIGLIP encoder to obtain global tokens $c_{global}$.

At diffusion timestep $t$, obtain the noisy latent $z_t$.

Form the unified sequence $Z = [c_{global}, z_t, z_{ref}, z_{geo}]$.

Apply independent Rotary Positional Embeddings to $z_{ref}$ and $z_{geo}$, keeping them spatially isolated.

Pass $Z$ through the DiT block where separate LoRA adapters transform each modality (appearance, geometry, context).

Decode the resulting latent to produce the final image $I_{out}$ that satisfies the triplet constraints.
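The sequence assembly in the steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the token counts, the helper names `build_sequence` and `grid_positions`, and the choice of shifting each condition grid by a fixed offset to keep its position ids disjoint are all hypothetical. The source states only that the reference and geometry tokens receive independent rotary positional embeddings.

```python
import numpy as np

def build_sequence(c_global, z_t, z_ref, z_geo):
    """Concatenate the four token groups and tag each token with its group index."""
    groups = [c_global, z_t, z_ref, z_geo]  # order follows Z = [c_global, z_t, z_ref, z_geo]
    tokens = np.concatenate(groups, axis=0)
    segment_ids = np.concatenate(
        [np.full(len(g), i, dtype=np.int64) for i, g in enumerate(groups)]
    )
    return tokens, segment_ids

def grid_positions(height, width, offset=(0, 0)):
    """(row, col) position ids for a height x width token grid, shifted by offset."""
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    return np.stack([rows.ravel() + offset[0], cols.ravel() + offset[1]], axis=1)

rng = np.random.default_rng(0)
h, w, dim = 4, 4, 8                      # toy sizes, not the paper's
c_global = rng.standard_normal((3, dim))
z_t, z_ref, z_geo = (rng.standard_normal((h * w, dim)) for _ in range(3))

tokens, segment_ids = build_sequence(c_global, z_t, z_ref, z_geo)
pos_t = grid_positions(h, w)
pos_ref = grid_positions(h, w, offset=(0, w))  # assumed shift: keeps ids disjoint
pos_geo = grid_positions(h, w, offset=(h, 0))  # assumed shift: keeps ids disjoint
```

The position ids would then be fed to whatever rotary embedding the backbone uses; the segment ids are one possible way to route each token to its own adapter.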

**Figure 2. Illustration of our framework.** The generation process is controlled by three types of conditions: appearance guidance from the original reference object, geometry guidance from the rendered image with the user-specified pose, and context guidance from global features of the background image. These conditions are injected through decomposed LoRA pathways to reduce interference. The standard masked background condition is modified by pasting the rendered object with the desired pose into the masked region. The editing region is cropped for focused local insertion and pasted back into the high-resolution image after generation.

**Figure 3. Geometric semantic ambiguity.** Standard spatial signals, such as depth and normal maps, fail to distinguish the orientation of symmetric objects, whereas our RGB geometric condition explicitly preserves semantic pose.

Data construction proceeds in two stages. First, a VLM agent (Qwen3‑VL) together with SAM‑3 proposes salient objects, segments them, and verifies mask quality by a zoom‑in check. Second, the verified object is extracted and rotated by an angle‑editing adapter to synthesize a reference image $I_{ref}$, while the original capture serves as the ground‑truth target $I_{gt}$.

Training freezes the backbone and optimizes only the LoRA adapters and linear projectors using the rectified flow matching objective. This isolates the learnable parameters to the condition‑specific pathways, ensuring stable convergence.
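For readers unfamiliar with the objective, a rectified flow matching loss regresses a velocity field along straight lines between data and noise. The sketch below uses one common convention ($t=0$ is data, $t=1$ is noise); the paper's exact time sampling and weighting are not given in this summary, so those parts are assumptions, and `velocity_fn` is a placeholder for the conditioned backbone.

```python
import numpy as np

def rectified_flow_loss(velocity_fn, x0, rng):
    """Monte Carlo estimate of the flow matching loss for one batch.

    x0: clean latents, shape (batch, dim).
    velocity_fn(x_t, t) -> predicted velocity with the same shape as x_t.
    """
    noise = rng.standard_normal(x0.shape)
    t = rng.uniform(0.0, 1.0, size=(x0.shape[0], 1))  # assumed uniform time sampling
    x_t = (1.0 - t) * x0 + t * noise   # straight-line interpolation
    target = noise - x0                # constant velocity along that line
    pred = velocity_fn(x_t, t)
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 16))
loss_zero_model = rectified_flow_loss(lambda x_t, t: np.zeros_like(x_t), x0, rng)
```

In the setting described above, only the LoRA and projector parameters inside `velocity_fn` would receive gradients.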

Shape‑Decomposed Mask Augmentation mitigates “shape leakage” by perturbing the inpainting mask so the model cannot rely on exact mask boundaries; instead it must use the geometry signal $I_{geo}$ to infer object shape.
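The summary does not say how the mask is perturbed. One simple possibility, shown here purely as an assumption, is to replace the tight object mask with a randomly padded bounding box, so the mask outline no longer reveals the object silhouette. The function name `augment_mask` and the `max_pad` parameter are hypothetical.

```python
import numpy as np

def augment_mask(mask, rng, max_pad=8):
    """Replace a tight binary mask with a randomly padded bounding box (assumed scheme)."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return mask.copy()  # nothing to augment
    h, w = mask.shape
    pad = rng.integers(0, max_pad + 1, size=4)  # top, bottom, left, right
    top = max(int(ys.min()) - int(pad[0]), 0)
    bottom = min(int(ys.max()) + int(pad[1]) + 1, h)
    left = max(int(xs.min()) - int(pad[2]), 0)
    right = min(int(xs.max()) + int(pad[3]) + 1, w)
    out = np.zeros_like(mask)
    out[top:bottom, left:right] = 1
    return out

rng = np.random.default_rng(0)
yy, xx = np.mgrid[:32, :32]
disc = ((yy - 16) ** 2 + (xx - 16) ** 2 <= 36).astype(np.uint8)  # toy object mask
box = augment_mask(disc, rng)
```

The augmented mask always covers the original object but carries no boundary detail, so the object shape has to come from $I_{geo}$.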

Results and Evaluation

Experiments demonstrate superior identity preservation, pose accuracy, and visual fidelity across metrics.

We build our generator on the FLUX.1-Fill-dev backbone, augment it with LoRA adapters (rank 128), and randomly drop conditions during training (rate 0.1) to enable classifier‑free guidance. Training proceeds in two stages: 200k steps on 4 A100 GPUs (batch size 4), followed by 40k steps on 8 A100 GPUs (batch size 8). Inference uses the Euler scheduler (28 steps) with a CFG scale of 2.0.
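The inference settings can be read as an Euler integration of the learned velocity field with classifier-free guidance applied at every step. The sketch below assumes the standard two-pass form of guidance and a uniform time grid from $t=1$ (noise) to $t=0$ (image); the actual scheduler may use a shifted grid, and `velocity_fn` is a stand-in for the conditioned model.

```python
import numpy as np

def euler_sample_cfg(velocity_fn, x_init, cond, num_steps=28, cfg_scale=2.0):
    """Integrate from t=1 to t=0 with Euler steps and classifier-free guidance."""
    x = x_init.copy()
    ts = np.linspace(1.0, 0.0, num_steps + 1)  # assumed uniform time grid
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v_cond = velocity_fn(x, t_cur, cond)    # with conditions
        v_uncond = velocity_fn(x, t_cur, None)  # conditions dropped
        v = v_uncond + cfg_scale * (v_cond - v_uncond)
        x = x + (t_next - t_cur) * v            # Euler step (dt is negative)
    return x

def toy_velocity(x, t, cond):
    # Toy field: unit velocity when conditioned, zero otherwise.
    return np.ones_like(x) if cond is not None else np.zeros_like(x)

x_init = np.zeros((2, 4))
x_final = euler_sample_cfg(toy_velocity, x_init, cond="any")
```

With a guidance scale of 2.0, the conditional direction is weighted twice as strongly as in unguided sampling, at the cost of two forward passes per step.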

For comparison we construct two families of baselines. The Stable Diffusion family combines Object3DIT or TRELLIS with the AnyDoor inserter; the FLUX family pairs the same 3D‑aware editors with InsertAnything. This yields four reference points against which we evaluate our unified approach.

Our test set consists of 200 paired images (100 from MVImgNet, 100 from SA‑1B) with manually verified pose consistency. We assess reconstruction quality (PSNR, SSIM, LPIPS), identity preservation (CLIP‑I, DINO), and pose fidelity (Matching Error).
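Two of these metrics have standard closed forms that are easy to state. The sketch below computes PSNR between two images and the cosine similarity that identity scores such as CLIP-I and DINO are conventionally built on (applied to encoder embeddings). The embedding extraction itself is omitted, and Matching Error is not sketched because its definition is not given in this summary.

```python
import numpy as np

def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the target."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

black = np.zeros((4, 4, 3), dtype=np.uint8)
white = np.full((4, 4, 3), 255, dtype=np.uint8)
worst_case = psnr(white, black)
```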

Our method improves identity preservation, achieving CLIP‑I 0.943 versus 0.904 for the next best baseline.

Table 1 reports CLIP‑I 0.943 for our FLUX‑based model, while the strongest competitor reaches 0.904.

**Figure 6. Qualitative Comparison.** We compare our method against Object3DIT (Michel et al., 2023) and TRELLIS (Xiang et al., 2025). Our method achieves superior identity preservation and background consistency, avoiding the appearance artifacts observed in TRELLIS and the geometric distortions in Object3DIT. IA denotes InsertAnything (Song et al., 2026).

**Figure 7.** Large pose-change examples. Representative cases show substantial pose variations between the reference object and target pose. These examples require synthesis of largely unseen object views from limited reference appearance, including large rotations, top-view to side-view transformation, and near 180° viewpoint changes. Our method preserves object identity while following the specified pose.

**Figure 8. Comparison of geometry guidance signals.** Top row: Reference object, RGB/normal guidance at 0°, and RGB/normal guidance at 180°. Bottom row: Background image and the four corresponding generation results. For the symmetric road sign, the normal maps are invariant to the 180° rotation, leading to semantic ambiguity and orientation errors in the normal-based results. In contrast, our RGB proxy provides semantically rich textural cues, ensuring the model follows the desired pose.

**Figure 9. Effectiveness of the decomposed injection.** We compare our approach against a vanilla LoRA baseline that naively concatenates the appearance and geometry guidance. When the 3D proxy contains texture artifacts, the vanilla baseline suffers from feature entanglement, incorrectly inheriting degraded details. Our decomposed strategy successfully isolates these conflicting signals, leveraging the proxy for geometry guidance while preserving high-fidelity identity from the reference.

**Figure 10.** Robustness to degraded 3D proxies. In an extreme object insertion case with rich textual details on the object surface, the 3D proxy suffers from significant quality degradation. Despite this, our model inserts precise, legible details.

**Figure 11. Failure case.** The upstream model incorrectly reconstructs the rectangular reference as a square proxy. Our model strictly follows this distorted geometric condition, resulting in an incorrect aspect ratio in the final output.

**Figure 12.** Overview of the Interactive Inference Pipeline. First, the reference image is lifted into a 3D proxy. Users then manipulate the proxy over the background canvas via a visual gizmo to determine the target 6-DoF pose. Finally, the system automatically renders the necessary conditions to guide our generative framework, yielding a high-fidelity composite image that respects the user-specified pose.

**Figure 15. Performance in complex environments.** We show representative examples involving occlusion, lighting, and reflection. For occlusion, a pen is inserted into a pen holder, where the generated result exhibits a plausible depth relationship between the pen and the holder structure. For lighting, a car is inserted into a scene with strong directional illumination, and the model generates a plausible shadow consistent with the surrounding scene. For reflection, a boat is inserted onto a reflective water surface, and the generated result includes a visually plausible reflection on the background surface.

**Figure 16. Visual Demonstrations.** We showcase our model's capability to insert various objects into complex real-world backgrounds with high visual fidelity. The results show that our method supports explicit pose control (e.g., varying angles and orientations) while strictly preserving the identity and texture details of the reference objects.

Intrinsic-Guided Compositing Baseline

Standard 2D inpainting lacks 3D awareness; DIRECT splits insertion into geometry, appearance, and context pathways.

We evaluate an intrinsic‑guided compositing baseline built on TRELLIS for asset reconstruction and ZeroComp for insertion. This baseline serves as a reference point for the ablation study.

It composites a 3D asset into a target image by conditioning on intrinsic maps (e.g., albedo, depth) of both the asset and the scene, guaranteeing geometric alignment while leaving appearance fidelity to the downstream renderer.

TRELLIS+ZeroComp attains a low Matching Error of $5.2$ but suffers large drops in image fidelity and identity preservation.

Table 4 shows that the baseline has the lowest ME but is substantially worse than our method on PSNR, SSIM, LPIPS, CLIP‑I, and DINO.

**Table 4.** Additional intrinsic-guided compositing baseline. TRELLIS+ZeroComp achieves low Matching Error via direct intrinsic guidance from the asset and target scene, but performs worse in image fidelity and identity preservation. ME denotes Matching Error.

**Figure 13.** Qualitative comparison with intrinsic-guided compositing. The intrinsic-guided compositing baseline provides strong geometric adherence, but struggles to preserve fine-grained reference appearance and overall image realism. In contrast, our method simultaneously achieves pose control, identity preservation, and realistic scene integration.

Sensitivity and Failure Modes

We examine DIRECT’s robustness to proxy misalignment and complex scene effects.

Standard 2D inpainting models ignore 3D geometry, so they cannot respect user‑specified object poses; DIRECT overcomes this by separating geometry, appearance, and context conditioning.

We evaluate how tolerant DIRECT is to modest errors in the user‑provided 3D proxy placement. Across six representative cases the method corrects slight height or surface misalignments and still yields natural insertions, indicating that precise proxy‑scene alignment is not required.

**Figure 14.** Sensitivity to 3D proxy-scene misalignment. We show representative cases where the user-specified 3D proxy is mildly misaligned with the target scene. In the first example, the proxy is placed slightly above the ground. In the second example, the proxy is not perfectly aligned with the supporting surface. Despite these mild proxy-scene placement errors, our method produces natural insertion results, suggesting robustness to small inaccuracies in user-specified proxy placement.

Beyond alignment, DIRECT handles challenging scene effects—occlusion, directional lighting, and reflections—without explicit physical simulation. In a pen‑holder example the inserted pen respects depth ordering; in a car scene the generated shadow matches the strong light direction; and a boat on water shows a plausible reflection on the surface.

Finally, we showcase diverse objects inserted with exact pose control while preserving identity and texture, confirming that DIRECT generalizes to complex real‑world backgrounds.

Conclusions

DIRECT delivers pose‑controllable insertion and outlines next steps.

We introduced DIRECT, a pose‑controllable object insertion framework that splits conditioning into a visual triplet—geometry, appearance, and context—processed through independent pathways, achieving state‑of‑the‑art results.

Future work will integrate end‑to‑end geometry refinement to mitigate severe proxy topology errors and further improve 3D‑aware editing.

The authors acknowledge funding from NSFC, Shenzhen Science and Technology Program, the Zhongguancun Academy, NTU S‑Lab, and industry partners.

This work advances generative media and AR by lowering barriers to high‑fidelity, controllable composition, yet it also raises misuse concerns for photorealistic manipulation of visual evidence.

We recommend responsible deployment, including digital watermarking and provenance tracking, to mitigate potential abuse.

The appendix details prompts for baselines (A), data construction (B), inference pipeline (C), a comparison with an intrinsic‑guided compositing baseline (D), latency and memory analysis (E), sensitivity to 3D proxy misalignment (F), performance under complex conditions (G), and extra visual results (H).

Implementation Details and Data

Details on training data, prompts, and interactive inference steps

For Figure 1, the two competing baselines were prompted as follows. Nano Banana Pro (Google, 2025) was asked to insert the book from the first image onto the second shelf from the top, leaning against the right‑hand books. Object3DIT (Michel et al., 2023) was instructed to rotate the same book by 320°.

Our hybrid training set mixes automatically curated pairs from SA‑1B with filtered multi‑view samples from MVImgNet. For SA‑1B we enforce a three‑step curation: (1) object image curation keeps only fully visible, unoccluded objects with precise masks; (2) SAM‑3 generates candidate masks, discarding those that touch image borders or are too small; (3) a verification zoom‑in checks structural completeness before acceptance.
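The border and size checks in step (2) amount to a small predicate on each candidate mask. The following is a sketch only: the function name `keep_mask` and the `min_area_frac` threshold are hypothetical, since the summary does not state the actual cutoff.

```python
import numpy as np

def keep_mask(mask, min_area_frac=0.01):
    """Return True if the mask avoids the image border and is large enough.

    min_area_frac is a hypothetical threshold, not a value from the paper.
    """
    m = np.asarray(mask).astype(bool)
    touches_border = bool(m[0].any() or m[-1].any() or m[:, 0].any() or m[:, -1].any())
    area_frac = m.sum() / m.size
    return (not touches_border) and bool(area_frac >= min_area_frac)

centered = np.zeros((20, 20), dtype=np.uint8)
centered[5:15, 5:15] = 1            # interior object, 25% of the image
clipped = np.zeros((20, 20), dtype=np.uint8)
clipped[0:10, 5:15] = 1             # touches the top border
results = (keep_mask(centered), keep_mask(clipped))
```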

Using the verified mask, we remove the background and synthesize a novel‑view reference image with an image‑editing model, while the original real image remains the supervision target. MVImgNet samples are kept only if they pass Tenengrad, CLIP‑IQA, and Q‑Align quality scores, ensuring sharp, well‑exposed views. Table 3 shows that training on this hybrid data markedly improves identity preservation and pose accuracy.
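Of the three quality scores, Tenengrad is simple enough to write out: it averages the squared Sobel gradient magnitude, so sharper images score higher. The sketch below implements it with array slices on a grayscale image. How the score is thresholded, and how it is combined with CLIP-IQA and Q-Align, is not specified in this summary.

```python
import numpy as np

def tenengrad(gray):
    """Mean squared Sobel gradient magnitude over interior pixels."""
    g = np.asarray(gray, dtype=np.float64)
    # Horizontal Sobel response: right column minus left column, weights 1-2-1.
    gx = (g[:-2, 2:] + 2 * g[1:-1, 2:] + g[2:, 2:]) \
       - (g[:-2, :-2] + 2 * g[1:-1, :-2] + g[2:, :-2])
    # Vertical Sobel response: bottom row minus top row, weights 1-2-1.
    gy = (g[2:, :-2] + 2 * g[2:, 1:-1] + g[2:, 2:]) \
       - (g[:-2, :-2] + 2 * g[:-2, 1:-1] + g[:-2, 2:])
    return float(np.mean(gx ** 2 + gy ** 2))

sharp = np.zeros((8, 8))
sharp[:, 4:] = 1.0                                # hard vertical edge
soft = np.tile(np.linspace(0.0, 1.0, 8), (8, 1))  # same range as a smooth ramp
scores = (tenengrad(sharp), tenengrad(soft))
```

A filter would keep a view only if its score exceeds a chosen cutoff, which has to be tuned per dataset.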

Despite careful filtering, the dataset inherits source biases such as category imbalance and occasional mask or synthesis artifacts.

At inference time, the interactive pipeline converts the user's manipulation of the 3D proxy $P$ into a deterministic 6‑DoF pose $\xi$ and an insertion region $M$, which feed the generator.

Proxy lifting first extracts a clean reference object image $I_{ref}$ via foreground segmentation, then passes it through TRELLIS to obtain $P$ as a set of 3D Gaussians displayed in an interactive viewer. After the user aligns $P$, the system renders it at $\xi$ to produce $I_{render}$ and its binary alpha mask $m$, builds a composite background $I_{paste}$, recenters the object to get $I_{geo}$, and finally assembles the tuple $(I_{ref}, I_{geo}, I_{paste}, I_{bg}, M)$ for the generative model.
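The condition-building step can be illustrated with plain array operations. In this sketch, `build_paste` blanks the insertion region $M$ of the background and pastes $I_{render}$ using its alpha mask $m$, and `recenter` shifts the rendered object to the middle of the canvas. Both helper names are hypothetical, and the real pipeline may also crop or rescale, which this summary does not describe.

```python
import numpy as np

def build_paste(background, region_mask, render, alpha):
    """Blank the insertion region of the background, then paste the rendered proxy."""
    masked_bg = background * (1.0 - region_mask)[..., None]
    a = alpha[..., None]
    return a * render + (1.0 - a) * masked_bg

def recenter(render, alpha):
    """Translate the rendered object so its bounding box sits at the canvas centre."""
    ys, xs = np.nonzero(alpha)
    out = np.zeros_like(render)
    if ys.size == 0:
        return out  # empty render: nothing to move
    h, w = alpha.shape
    dy = (h - 1 - int(ys.min()) - int(ys.max())) // 2
    dx = (w - 1 - int(xs.min()) - int(xs.max())) // 2
    out[ys + dy, xs + dx] = render[ys, xs]
    return out

background = np.full((6, 6, 3), 0.5)              # stand-in for I_bg
region = np.zeros((6, 6)); region[1:5, 1:5] = 1   # stand-in for M
alpha = np.zeros((6, 6)); alpha[1:3, 1:3] = 1     # stand-in for m
render = np.zeros((6, 6, 3)); render[1:3, 1:3] = 1.0

i_paste = build_paste(background, region, render, alpha)
i_geo = recenter(render, alpha)
```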

Inference Latency and Memory

We report latency and memory costs, showing our method matches baselines overall.

We evaluate inference latency and peak GPU memory in the SD-based setting, breaking the runtime into 3D proxy generation, 2D generation, and other processing steps. This lets us compare our approach against SD-based baselines.

**Table 5.** Inference latency and memory overhead. We report the runtime breakdown, overall end-to-end latency, and peak allocated GPU memory in the SD-based setting.
