PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding

Shaohui Dai, Yansong Qu, You Shen, Shengchuan Zhang, Liujuan Cao

PAR3D enables unified 3D scene understanding by modeling functional object parts alongside objects.

How can we enable 3D multimodal large language models to understand and interact with object parts, rather than just whole objects?

Existing 3D multimodal large language models (3D-MLLMs) are object-centric, meaning they struggle to identify functional components like a chair's seat or a refrigerator's handle. This structural bias prevents these models from performing fine-grained tasks like part-level referring segmentation or affordance-aware reasoning. PAR3D addresses this by introducing a part-aware framework that models parts as functional components embedded within objects. It uses a new synthetic dataset, ScenePart, to provide part-level supervision, adapts the visual backbone to capture fine-grained geometry, and generates hierarchical queries to ground both objects and parts. On part-level referring segmentation, PAR3D consistently outperforms existing 3D-MLLMs, while simultaneously achieving state-of-the-art results on standard object-level benchmarks like ScanRefer.

Paper Primer

The core mechanism hinges on Hierarchical Segmentation Query Generation: instead of using a single generic grounding token, the model generates distinct, coupled tokens for the host object and its target part. This allows the model to maintain the object-part hierarchy in its internal representation, ensuring that part-level targets are grounded within their specific functional context.

PAR3D significantly improves part-level grounding and reasoning without sacrificing object-level performance.

On the ScenePart-Seg benchmark, PAR3D achieves an mIoU of 60.7%, compared to 11.1% for the baseline 3D-LLaVA model. It also sets a new state-of-the-art for 3D-MLLMs on ScanRefer, outperforming the previous best generalist model by 6.6% absolute mIoU.

To support this, the authors introduce ScenePart, a synthetic dataset containing 800 scenes with 44,000 part masks and 273,000 language-task annotations. This dataset supplies the part-level supervision for two training objectives: part-aware contrastive learning, which sharpens part-level features, and self-distillation, which aligns the decoder's task-specific features with the pretrained encoder's semantic structure.

Why is an object-centric approach insufficient for 3D scene understanding?

Real-world interactions often require manipulating functional components—like opening a drawer or sitting on a seat—rather than interacting with an entire object. Object-centric models lack the fine-grained geometric and semantic cues necessary to distinguish these parts from their host objects.

How does PAR3D handle the lack of real-world part-level training data?

The authors constructed ScenePart, a synthetic dataset that places part-annotated objects into realistic indoor scenes. This provides the necessary supervision for the model to learn part-aware semantics, which it then generalizes to real-world scans during inference.

PAR3D demonstrates that 3D-MLLMs can move beyond object-level recognition to fine-grained part understanding by explicitly modeling the hierarchical relationship between objects and their functional components.

Introduction and Motivation

We expose the part‑awareness gap in 3D‑MLLMs and introduce PAR3D to fill it.

Existing 3D‑MLLMs treat objects as indivisible units, which hampers tasks that require manipulating object parts. To overcome this limitation we propose PAR3D (Part‑Aware 3D‑MLLM), a framework that jointly models objects and their functional components.

Current 3D‑MLLMs cannot represent or reason about object parts, yet many embodied tasks depend on fine‑grained part semantics.

**Figure 1.** We propose PAR3D, a unified 3D-MLLM with part-aware representation, together with the ScenePart dataset. *Left:* ScenePart provides fine-grained object-part annotations and language instructions for 3D scenes. *Right:* PAR3D enables part-aware understanding across question answering, segmentation, and reasoning, going beyond the object-level understanding of existing 3D-MLLMs.

The shift from object‑level to part‑level understanding unlocks fine‑grained interaction capabilities in 3D scenes.

Related Work and Baselines

We situate PAR3D among prior 3D‑MLLMs and outline the 3D‑LLaVA foundation.

Prior work on 3D vision‑language models has focused on object‑level grounding, captioning, and QA, often using dedicated benchmarks such as ScanRefer or Scan2Cap.

A baseline 3D‑MLLM encodes a point cloud into a set of visual tokens, feeds them to a large language model, and uses a special [SEG] token to request an object‑level mask.

Our method builds on the 3D‑LLaVA baseline, extending its encoder‑decoder pipeline to capture part‑aware cues.

We replace the vanilla superpoint pooling with a part‑aware representation and augment the query decoder to emit hierarchical segmentation tokens for both objects and their constituent parts.

The ScenePart Dataset

We add contrastive and distillation regularizers so the decoder respects fine‑grained part geometry.

Existing 3D‑MLLMs excel at object‑level perception but cannot reason about individual object parts. This gap forces us to design a backbone that preserves fine‑grained geometry while staying compatible with a frozen encoder.

ScenePart is a synthetic indoor‑scene collection where every object is split into annotated parts, giving both object‑ and part‑level masks and language instructions.

Consider a toy scene containing one chair and one table. Assign object IDs: chair = 1, table = 2; part IDs: seat = 1-1, back = 1-2, top = 2-1, leg = 2-2.

Generate a point cloud of 30 points; label each point with its object and part ID, producing four part masks.

Build a scene graph linking the chair and table via a “next‑to” relation.

Produce language prompts such as “What is the color of the chair’s seat?” using the part annotations.

This tiny example shows how part masks are inherited from object assets and how object–part correspondences enable part‑level language grounding.
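The toy annotation steps above can be sketched in a few lines of Python. This is only an illustration: the per-part point counts, the dictionary names, and the helper functions are hypothetical and not taken from the ScenePart tooling.

```python
# Hypothetical toy labels: 30 points, each tagged with (object_id, part_id).
# The per-part point counts are made up for illustration only.
PART_NAMES = {"1-1": "seat", "1-2": "back", "2-1": "top", "2-2": "leg"}
OBJECT_NAMES = {1: "chair", 2: "table"}


def build_toy_labels():
    """Return a list of 30 (object_id, part_id) labels, one per point."""
    counts = [(1, "1-1", 8), (1, "1-2", 7), (2, "2-1", 9), (2, "2-2", 6)]
    labels = []
    for object_id, part_id, n in counts:
        labels.extend([(object_id, part_id)] * n)
    return labels


def part_masks(labels):
    """One boolean mask per part ID; True where the point belongs to that part."""
    part_ids = sorted({part_id for _, part_id in labels})
    return {pid: [p == pid for _, p in labels] for pid in part_ids}


def object_masks(labels):
    """One boolean mask per object ID; each is the union of its parts' masks."""
    object_ids = sorted({obj for obj, _ in labels})
    return {oid: [o == oid for o, _ in labels] for oid in object_ids}


labels = build_toy_labels()
masks = part_masks(labels)
obj_masks = object_masks(labels)

# Scene graph edge and a templated part-level question.
scene_graph = [("chair", "next-to", "table")]
prompt = f"What is the color of the {OBJECT_NAMES[1]}'s {PART_NAMES['1-1']}?"
```

Because every point carries both IDs, each part mask is by construction a subset of its host object's mask, which is the object-part correspondence the language prompts rely on.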

Why not reuse an existing object‑level dataset instead of building ScenePart?

Object‑level datasets lack part masks, so a model trained on them never sees the geometry of individual parts. ScenePart supplies the missing supervision that lets the decoder learn part‑aware features.

We pull together decoder features that belong to the same part while pushing apart features from different parts, using a contrastive loss on superpoints.

Compute cosine similarities: $\operatorname{sim}(f_1^{d},f_2^{d})=0.9$, $\operatorname{sim}(f_1^{d},f_4^{d})=0.2$, etc.

Positive sum for anchor $i=1$: $\exp(0.9/\tau)+\exp(0.85/\tau)$ (using $\tau=0.1$).

Denominator sum includes positives plus negatives $\exp(0.2/\tau)+\exp(0.1/\tau)$.

Loss $\mathcal{L}_{\text{pcl}}$ evaluates to roughly $0.35$, indicating strong intra‑part cohesion and inter‑part separation.

The numeric example shows how a high similarity among same‑part superpoints dominates the numerator, while low cross‑part similarities keep the denominator from overwhelming the loss.
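As a rough check of the arithmetic above, the anchor-1 term can be computed directly. This sketch assumes the per-anchor loss is the negative log of the positive sum divided by the full denominator, which is the form the steps suggest but which the text does not print; `anchor_term` is a hypothetical helper name. Under that assumption, the four similarities shown give a term close to zero at $\tau=0.1$ (about 0.0008), so the roughly 0.35 figure presumably depends on the elided similarities or on a different normalization.

```python
import math


def anchor_term(pos_sims, neg_sims, tau=0.1):
    """Assumed per-anchor loss: -log(sum_pos / (sum_pos + sum_neg)),
    where each sum runs over exp(similarity / tau)."""
    pos = sum(math.exp(s / tau) for s in pos_sims)
    neg = sum(math.exp(s / tau) for s in neg_sims)
    return -math.log(pos / (pos + neg))


# Similarities quoted in the worked example for anchor i = 1.
term_tau_01 = anchor_term([0.9, 0.85], [0.2, 0.1], tau=0.1)

# Same similarities at a softer temperature, for comparison.
term_tau_1 = anchor_term([0.9, 0.85], [0.2, 0.1], tau=1.0)
```

With $\tau=1$ the same similarities give a term of about 0.40, which shows how strongly the temperature scales the loss.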

How does this contrastive loss differ from standard object‑level contrastive learning?

Standard contrastive learning treats each whole object as a single instance; here we operate on superpoints, defining positives as other superpoints of the *same part* and negatives as superpoints of *different parts*. This yields finer granularity and directly shapes part‑level feature geometry.
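To make the positive/negative definition concrete, here is a minimal pure-Python sketch of a superpoint-level contrastive loss driven by part labels. The exact loss used by PAR3D is not printed in this summary, so the formulation (mean over anchors of the negative log ratio of same-part mass to total mass), the function names, and the toy features are all assumptions. Features are assumed to be non-zero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def part_contrastive_loss(features, part_ids, tau=0.1):
    """Assumed form: for each anchor superpoint, positives are the other
    superpoints with the same part ID, negatives are all remaining ones."""
    terms = []
    for i, (f_i, p_i) in enumerate(zip(features, part_ids)):
        pos, total = 0.0, 0.0
        for j, (f_j, p_j) in enumerate(zip(features, part_ids)):
            if i == j:
                continue  # an anchor is never its own positive
            weight = math.exp(cosine(f_i, f_j) / tau)
            total += weight
            if p_j == p_i:
                pos += weight
        if pos > 0.0:  # skip anchors that have no same-part partner
            terms.append(-math.log(pos / total))
    return sum(terms) / len(terms) if terms else 0.0


# Hypothetical decoder features for four superpoints.
feats = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
loss_aligned = part_contrastive_loss(feats, ["seat", "seat", "back", "back"])
loss_mixed = part_contrastive_loss(feats, ["seat", "back", "seat", "back"])
```

When the part labels agree with the feature clusters the loss is near zero; when the labels cut across the clusters it is large, which is the pressure that reshapes the decoder features at part granularity.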

We keep decoder features close to the frozen encoder’s features, preventing the decoder from drifting away from the pretrained semantic space.

Cosine similarity for $i=1$: $\operatorname{sim}=0.99$ (almost aligned).

Cosine similarity for $i=2$: $\operatorname{sim}=0.96$.

Average similarity $= (0.99+0.96)/2 = 0.975$.

Loss $\mathcal{L}_{\text{rep}} = 1 - 0.975 = 0.025$, indicating minimal drift.

This tiny calculation shows how the loss directly measures the angular gap between decoder and encoder features, encouraging the decoder to stay in the encoder’s semantic cone.
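The calculation above amounts to one minus the mean cosine similarity between paired decoder and encoder features. A minimal sketch follows; the function names are hypothetical, and the stop-gradient is only implied here by treating the encoder features as plain constants.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rep_loss_from_sims(sims):
    """1 - mean cosine similarity, given the per-superpoint similarities."""
    return 1.0 - sum(sims) / len(sims)


def rep_loss(decoder_feats, encoder_feats):
    """Same loss computed from paired feature vectors. The encoder features
    act as a fixed target (no gradient would flow into them)."""
    sims = [cosine(d, e) for d, e in zip(decoder_feats, encoder_feats)]
    return rep_loss_from_sims(sims)


# Reproduce the two-superpoint example from the text.
loss = rep_loss_from_sims([0.99, 0.96])
```

Perfectly aligned features give a loss of 0 and orthogonal features give 1, so the value can be read as the average angular drift of the decoder away from the encoder.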

Why freeze the encoder and apply stop‑gradient instead of jointly training both sides?

Freezing preserves the rich geometric and semantic priors learned on large‑scale point clouds; updating the encoder would erase those priors and blur the distinction between pretrained knowledge and task‑specific adaptation.

Together, part‑aware contrastive learning sharpens intra‑part feature cohesion, while self‑distillation anchors the decoder to the pretrained encoder, yielding a backbone that captures both fine‑grained part geometry and high‑level scene semantics.

Hierarchical Grounding Mechanism

We train PAR3D in two stages, first grounding parts, then tuning language.

Using a single grounding token [SEG] forces object‑level and part‑level queries to compete for the same slot, which blurs fine‑grained part grounding.

Instead of a single [SEG] token, the LLM first emits an object token [OBJ] and, when a part is requested, follows it with a part token [PART]—like naming a room before pointing to a specific piece of furniture inside it.

Project $h_{\text{obj}}$: $s_{\text{obj}} = W h_{\text{obj}} = (2,0)$.

Project $h_{\text{part}}$: $s_{\text{part}} = W h_{\text{part}} = (0,2)$.

Decoder $D$ computes a dot‑product between each query and the encoder feature $F_e = [(1,1), (1,1)]$, yielding similarity scores $(2,2)$ for both queries.

Softmax over the scores gives equal attention weights, so each query activates the same superpoint region.

Upsampling maps the superpoint mask to point-level masks: the object mask covers all points of object A, while the part mask isolates the subset belonging to the requested part.

The two queries share the same encoder context but remain distinct because their projected vectors differ, ensuring the part mask is always anchored to the correct object.
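The dot products and softmax in the toy example can be checked with a short script. The first feature set is the one given in the text, where both superpoint features are identical and the two queries therefore score them equally. The second set, `F_alt`, is a hypothetical addition that shows how the differing query vectors select different superpoints once the features themselves differ.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


s_obj = (2.0, 0.0)   # projected [OBJ] query from the example
s_part = (0.0, 2.0)  # projected [PART] query from the example

# Encoder features from the text: two identical superpoints.
F_e = [(1.0, 1.0), (1.0, 1.0)]
obj_scores = [dot(s_obj, f) for f in F_e]
part_scores = [dot(s_part, f) for f in F_e]
obj_weights = softmax(obj_scores)
part_weights = softmax(part_scores)

# Hypothetical features that differ per superpoint (not from the source).
F_alt = [(1.0, 0.0), (0.0, 1.0)]
obj_scores_alt = [dot(s_obj, f) for f in F_alt]
part_scores_alt = [dot(s_part, f) for f in F_alt]
```

In the original toy case the attention weights are uniform for both queries; with `F_alt` the object query favours the first superpoint and the part query the second, which is the separation the distinct projections are meant to allow.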

Why not keep a single [SEG] token and let the model infer granularity?

With only one token the model must encode both object identity and part identity in the same vector, which forces a trade-off: improving part precision degrades object precision and vice versa. Separate [OBJ] and [PART] tokens give the model dedicated capacity for each level, eliminating that conflict.

Stage 1: Freeze the pretrained point encoder $E$; train the query decoder $D$ on ScanNet instance segmentation ($\mathcal{L}_{\text{inst}}$) and on ScenePart with the part-aware contrastive loss ($\mathcal{L}_{\text{pcl}}$) plus the representation-preserving loss ($\mathcal{L}_{\text{rep}}$).

Stage 2: Freeze the entire visual backbone; attach a projector $P$ and LoRA adapters to the LLM; jointly optimize $P$, $\phi$, and the LoRA parameters on the merged instruction-tuning corpus using the text loss $\mathcal{L}_{\text{text}}$ and the mask loss $\mathcal{L}_{\text{mask}}$.
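The two-stage schedule can be summarised as a small configuration sketch. All identifiers below are hypothetical labels for the components named in the text, not names from the PAR3D code. Two points are assumptions: the base LLM weights are listed as frozen in Stage 2 because that is the usual LoRA setup, and `phi` is kept as an opaque name because the text does not define it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    name: str
    frozen: tuple      # modules whose weights are not updated
    trainable: tuple   # modules that receive gradient updates
    losses: tuple      # loss terms summed in this stage
    data: tuple        # training sources


STAGE_1 = StageConfig(
    name="stage1_visual_backbone",
    frozen=("point_encoder_E",),
    trainable=("query_decoder_D",),
    losses=("L_inst", "L_pcl", "L_rep"),
    data=("ScanNet", "ScenePart"),
)

STAGE_2 = StageConfig(
    name="stage2_instruction_tuning",
    # Assumption: base LLM weights stay frozen, as is usual with LoRA.
    frozen=("visual_backbone", "llm_base_weights"),
    # "phi" is named in the text but not defined there.
    trainable=("projector_P", "phi", "lora_adapters"),
    losses=("L_text", "L_mask"),
    data=("merged_instruction_tuning_corpus",),
)


def is_trainable(stage, module):
    """True if the module is updated in the given stage."""
    return module in stage.trainable and module not in stage.frozen
```

Keeping the frozen and trainable sets explicit per stage makes it easy to verify that the pretrained encoder is never updated in either stage.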

LoRA‑based instruction‑tuning loop (Stage 2)

Evaluation Setup

PAR3D’s performance on part‑aware and object‑level 3D vision‑language benchmarks.

PAR3D reaches 69.6 mIoU on Multi3DRefer, the highest score among all evaluated 3D‑MLLMs.

Table 1 lists PAR3D (ours) at 69.6 mIoU, while the next‑best Generalist model scores 64.9 mIoU.

Beyond part‑aware tasks (ScenePart‑Seg, ScenePart‑QA), we also report object‑level results on five established benchmarks, covering referring segmentation, question answering, and dense captioning.

**Table.** Comparison of different methods on 3D vision-language tasks.

Quantitative Performance

PAR3D sets new state‑of‑the‑art scores on the ScenePart benchmark.

PAR3D outperforms all prior 3D‑MLLMs on the ScenePart benchmark, achieving the highest All‑Acc@0.5.

PAR3D reaches an All-Acc@0.5 of 81.4 %, versus 78.8 % for 3D-LLaVA and 70.6 % for Reason3D.

**Table 2.** Quantitative Comparison on the ScenePart Benchmark. We compare PAR3D with representative 3D-MLLMs on ScenePart-Seg and ScenePart-QA, covering referring segmentation at different granularities and visual question answering. The best results are highlighted in bold.

Ablation Studies

Component ablations show steady gains in part and object segmentation performance.

Adding ScenePart Data yields a massive absolute gain on part‑level segmentation.

ScenePart‑Seg mIoU rises from 11.1 % to 51.8 %.

Hierarchical Segmentation Query gives the final edge in both benchmarks.

ScenePart-Seg mIoU climbs from 59.4 % to 60.7 % (+1.3 points) and ScanRefer from 49.2 % to 49.9 % (+0.7 points).

Benchmark Statistics

Key statistics of the dataset’s metric distribution are reported.

The dataset exhibits a wide spread of metric values, ranging from low single‑digit scores to values exceeding one hundred.

Dataset Construction Details

Details on evaluation metrics and how the ScenePart dataset is built.

We evaluate PAR3D on three families of tasks (referring segmentation, visual question answering, and dense captioning) using the standard task-specific metrics from prior 3D scene-language benchmarks. For referring segmentation we report intersection-over-union (IoU) and, uniquely for ScenePart-Seg, Acc@0.5, the fraction of predictions whose IoU exceeds 0.5. For VQA we report CIDEr, METEOR, ROUGE-L, and BLEU-4, and for SQA3D we also include EM and EM-R. Dense captioning uses the same four language scores, but counts a caption only when the predicted region meets $\text{IoU}\ge0.5$, denoted with the @0.5 suffix.
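The segmentation metrics are simple to state in code. The sketch below computes IoU on boolean point masks and Acc@0.5 with a strict "exceeds" threshold, following the wording above. Returning 0 for an empty union is an assumption, and the function names are illustrative.

```python
def iou(pred, gt):
    """Intersection-over-union of two equal-length boolean point masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    # Assumption: an empty union (both masks empty) scores 0.
    return inter / union if union else 0.0


def mean_iou(preds, gts):
    """Mean IoU over a list of (prediction, ground-truth) mask pairs."""
    scores = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)


def acc_at(preds, gts, threshold=0.5):
    """Fraction of predictions whose IoU strictly exceeds the threshold."""
    hits = sum(1 for p, g in zip(preds, gts) if iou(p, g) > threshold)
    return hits / len(preds)


# Two toy predictions over four points each.
preds = [[True, True, False, False], [True, True, True, False]]
gts = [[True, False, False, False], [True, True, False, False]]
toy_miou = mean_iou(preds, gts)  # IoUs are 1/2 and 2/3
toy_acc = acc_at(preds, gts)     # only the second prediction exceeds 0.5
```

Note the boundary case: an IoU of exactly 0.5 does not count under the strict reading used for Acc@0.5, whereas the dense-captioning filter is stated with a non-strict inequality.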

ScenePart is built through a four-step pipeline. First, we filter and normalize part-annotated 3D assets from 3D-CoMPaT, preserving their part labels and object-part correspondences. Second, MiDiffusion synthesizes indoor layouts from floor plans, placing furniture with category, pose, and scale. Third, we instantiate these layouts with the preprocessed assets, augmenting with compatible objects from 3D-FUTURE when necessary and preserving both object and part masks. Finally, we generate language-task annotations (referring expressions, QA pairs, and captions) using template rules followed by LLM refinement. The resulting dataset comprises 800 scenes, 21 K object masks, 44 K part masks, and 273 K language annotations, with 200 K training samples and two held-out splits of 10 K each.

**Train/test split.** We split ScenePart at the scene level to avoid leakage between training and evaluation. ScenePart-200K is constructed from annotations associated with the training scenes. The held-out scenes are used to build two evaluation splits: ScenePart-QA for part-aware visual question answering and ScenePart-Seg for referring segmentation at multiple granularities.
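A scene-level split can be sketched as follows. The record layout (`scene_id` key), the held-out fraction, and the seed are assumptions for illustration; the actual ScenePart split procedure is not described beyond being scene-level.

```python
import random


def split_by_scene(samples, eval_fraction=0.2, seed=0):
    """Split annotation records so that no scene contributes to both sides."""
    scene_ids = sorted({s["scene_id"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(scene_ids)
    n_eval = max(1, round(len(scene_ids) * eval_fraction))
    eval_scenes = set(scene_ids[:n_eval])
    train = [s for s in samples if s["scene_id"] not in eval_scenes]
    held_out = [s for s in samples if s["scene_id"] in eval_scenes]
    return train, held_out


# Hypothetical corpus: 10 scenes with 3 annotation records each.
samples = [
    {"scene_id": f"scene_{i:03d}", "annotation": j}
    for i in range(10)
    for j in range(3)
]
train, held_out = split_by_scene(samples)
```

Splitting on scene IDs rather than on individual annotations is what prevents a model from seeing the geometry of an evaluation scene during training.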

Dataset Statistics

Key dataset statistics and split details for ScenePart.

We partition the data at the scene level so that no scene appears in both training and evaluation, eliminating any leakage between the two splits.

**Table 5.** Breakdown of the ScenePart-Seg and ScenePart-QA Test Splits.

Questions & answers

What is PAR3D and what is its main contribution?

PAR3D (Part-Aware 3D-MLLM) is a unified 3D multimodal large language model that extends object-centric 3D scene understanding to fine-grained part-level reasoning by introducing hierarchical segmentation query generation, a part-aware visual backbone, and a new synthetic dataset called ScenePart.

What problem does PAR3D address?

PAR3D addresses the limitation of existing 3D-MLLMs, which treat objects as indivisible units and therefore cannot identify or reason about functional components such as a chair's seat or a refrigerator's handle, making them unsuitable for tasks like part-level referring segmentation or affordance-aware reasoning.

Why is an object-centric approach insufficient for 3D scene understanding?

Real-world interactions often require manipulating functional components—like opening a drawer or sitting on a seat—rather than interacting with an entire object, and object-centric models lack the fine-grained geometric and semantic cues necessary to distinguish these parts from their host objects.

How does PAR3D's hierarchical grounding mechanism work?

Instead of using a single generic grounding token [SEG], PAR3D generates two distinct coupled tokens—[OBJ] for the host object and [PART] for the target part—giving the model dedicated representational capacity for each level of the hierarchy and eliminating the trade-off that arises when a single token must encode both object and part identity.

What is the ScenePart dataset and why was it created?

ScenePart is a synthetic dataset containing 800 scenes, 44,000 part masks, and 273,000 language-task annotations, created because existing object-level 3D datasets lack part masks and therefore cannot provide the supervision needed for a model to learn part-aware representations.

How is ScenePart constructed?

ScenePart is built through a four-step pipeline: (1) filtering and normalizing part-annotated 3D assets from 3D-CoMPaT, (2) using MiDiffusion to synthesize indoor layouts from floor plans, (3) instantiating layouts with preprocessed assets, augmented with compatible objects from 3D-FUTURE when necessary, and (4) generating language-task annotations (referring expressions, QA pairs, and captions) with template rules followed by LLM refinement. The data is then split at the scene level to prevent train/evaluation leakage.

How does PAR3D handle the lack of real-world part-level training data?

PAR3D uses ScenePart, a synthetic dataset that places part-annotated objects into realistic indoor scenes, to provide part-level supervision, and the model generalizes the learned part-aware semantics to real-world scans during inference.

What training techniques does PAR3D use to learn part-aware features?

PAR3D uses part-aware contrastive learning, where superpoints of the same part are treated as positives and superpoints of different parts as negatives, combined with self-distillation that anchors the decoder's task-specific features to the pretrained encoder's semantic structure via a stop-gradient mechanism.

Why does PAR3D freeze the encoder rather than jointly training both encoder and decoder?

Freezing the encoder preserves the rich geometric and semantic priors learned on large-scale point clouds; jointly training would erase those priors and blur the distinction between pretrained knowledge and task-specific adaptation.

What baseline does PAR3D build upon?

PAR3D builds on the 3D-LLaVA baseline, extending its encoder-decoder pipeline by replacing vanilla superpoint pooling with a part-aware representation and augmenting the query decoder to emit hierarchical segmentation tokens for both objects and their constituent parts.

What benchmarks and tasks are used to evaluate PAR3D?

PAR3D is evaluated on three families of tasks—referring segmentation, visual question answering (VQA), and dense captioning—using benchmarks including ScenePart-Seg, ScenePart-QA, and five established object-level benchmarks such as ScanRefer and Scan2Cap, with metrics including IoU, Acc@0.5, CIDEr, and METEOR.

What are PAR3D's key quantitative results?

On ScenePart-Seg, PAR3D reaches 60.7% mIoU compared to 11.1% for the 3D-LLaVA baseline, and an All-Acc@0.5 of 81.4% versus 78.8% for 3D-LLaVA and 70.6% for Reason3D. On object-level benchmarks it reports 69.6 mIoU on Multi3DRefer, ahead of the next-best generalist model at 64.9, and it also achieves state-of-the-art results among 3D-MLLMs on ScanRefer.

What are the limitations of PAR3D as described in the paper?

The paper does not explicitly enumerate limitations, but implicit constraints include reliance on synthetic ScenePart data for part-level supervision (requiring generalization to real-world scans) and the need to freeze the encoder, which limits end-to-end adaptation.

How does PAR3D differ from prior 3D vision-language models?

Prior 3D vision-language models focus on object-level grounding, captioning, and QA; PAR3D instead explicitly models the hierarchical relationship between objects and their functional parts through dedicated [OBJ] and [PART] tokens and part-aware contrastive training.

What metrics are used for part-level referring segmentation evaluation?

For referring segmentation the paper reports Intersection-over-Union (IoU) and, uniquely for ScenePart-Seg, Acc@0.5, which is the fraction of predictions whose IoU exceeds 50%.

What metrics are used for VQA and dense captioning evaluation?

For VQA the paper reports CIDEr, METEOR, ROUGE-L, and BLEU-4, plus EM and EM-R on SQA3D. Dense captioning uses the same four language scores, computed only when the predicted region meets IoU ≥ 0.5 (the @0.5 suffix).

How does PAR3D's contrastive learning differ from standard object-level contrastive learning?

Standard contrastive learning treats each whole object as a single instance, whereas PAR3D operates at the superpoint level, defining positives as superpoints belonging to the same part and negatives as superpoints belonging to different parts, yielding finer granularity and directly shaping part-level feature geometry.

Where was PAR3D published and who are the authors?

The paper is available on arXiv (arxiv.org/abs/2606.06485). The authors are Shaohui Dai, Yansong Qu, You Shen, Shengchuan Zhang, and Liujuan Cao; the provided text does not specify a conference or journal venue.

Key terms

3D-MLLM: A multimodal large language model that processes 3D point-cloud scene data alongside language to perform tasks such as referring segmentation, visual question answering, and dense captioning.
PAR3D: The Part-Aware 3D-MLLM proposed in this paper, which extends object-centric 3D scene understanding to fine-grained part-level reasoning using hierarchical grounding tokens and a synthetic part-annotated dataset.
ScenePart: A synthetic dataset introduced in this paper containing 800 indoor scenes with 44,000 part masks and 273,000 language-task annotations, designed to provide part-level supervision for training 3D-MLLMs.
Hierarchical Segmentation Query Generation: PAR3D's core mechanism that generates two distinct coupled tokens—one for the host object and one for the target part—so the model can ground both levels of the object-part hierarchy simultaneously.
[OBJ] token: A dedicated grounding token in PAR3D that encodes the identity and location of the host object within the hierarchical query generation framework.
[PART] token: A dedicated grounding token in PAR3D that encodes the identity and location of a specific functional part within its host object, paired with the [OBJ] token.
[SEG] token: The single generic grounding token used by prior 3D-MLLMs that PAR3D replaces with separate [OBJ] and [PART] tokens to avoid conflating object-level and part-level representations.
superpoint: A small cluster of geometrically similar points in a 3D point cloud, used as the basic unit for pooling and contrastive learning in PAR3D's part-aware backbone.
part-aware contrastive learning: A training objective in PAR3D that pulls together superpoints belonging to the same object part (positives) and pushes apart superpoints from different parts (negatives) to sharpen part-level feature representations.
self-distillation: A training technique in PAR3D where the decoder's task-specific features are aligned to the frozen pretrained encoder's outputs using a stop-gradient, preserving the encoder's semantic priors while adapting the decoder.
stop-gradient: A training operation that prevents gradients from flowing back through a particular branch of the network, used in PAR3D's self-distillation so that the frozen encoder's features serve as a fixed target while the decoder is trained.
part-level referring segmentation: A task in which a model must identify and segment a specific functional part of an object in a 3D scene based on a natural-language description.
affordance-aware reasoning: The ability of a model to understand which parts of objects can be used for specific actions or interactions, such as recognizing that a handle is the part used to open a door.
IoU (Intersection over Union): A metric that measures the overlap between a predicted segmentation mask and the ground-truth mask, expressed as the ratio of their intersection to their union.
Acc@0.5: An accuracy metric used in ScenePart-Seg that reports the fraction of predictions whose IoU with the ground truth exceeds 50%.
CIDEr: A text generation metric that measures the consensus between a generated caption and a set of reference captions by weighting n-gram matches by their informativeness.
METEOR: A text generation metric that evaluates caption quality by computing a harmonic mean of precision and recall over unigrams, accounting for stemming and synonymy.
3D-CoMPaT: An existing 3D asset dataset with part annotations that PAR3D uses as a source of part-labeled objects for constructing the ScenePart dataset.
MiDiffusion: A generative model used in ScenePart's construction pipeline to synthesize realistic indoor room layouts from floor plans for placing part-annotated furniture assets.
ScanRefer: An established object-level 3D referring segmentation benchmark used to evaluate PAR3D's performance on standard object-level tasks.
Scan2Cap: An established 3D dense captioning benchmark used alongside ScanRefer to evaluate PAR3D on object-level scene description tasks.
3D-LLaVA: The existing 3D multimodal large language model that PAR3D uses as its baseline, extending its encoder-decoder pipeline with part-aware components.
