ScanEdit: Hierarchically-Guided Functional 3D Scan Editing

Abstract

With the fast pace of 3D capture technology and the resulting abundance of 3D data, effective 3D scene editing becomes essential for a variety of graphics applications. In this work, we present ScanEdit, an instruction-driven method for functional editing of complex, real-world 3D scans.

To model large and interdependent sets of objects, we propose a hierarchically-guided approach. Given a 3D scan decomposed into its object instances, we first construct a hierarchical scene graph representation to enable effective, tractable editing. We then leverage the reasoning capabilities of Large Language Models (LLMs) and translate high-level language instructions into actionable commands applied hierarchically to the scene graph.

Finally, ScanEdit integrates LLM-based guidance with explicit physical constraints and generates realistic scenes where object arrangements obey both physics and common sense. In our extensive experimental evaluation, ScanEdit outperforms the state of the art and demonstrates excellent results for a variety of real-world scenes and input instructions.

Video

Method

Hierarchical scan editing

In this stage, we use a VLM and LLMs to initialize the edited scene with a 3D transformation and a set of constraints, which are then used in the graph optimization step.

We first use a VLM \(\Gamma\) and 3D heuristics to construct a hierarchical graph \(\mathcal{G}\) in which the 'on top of' edges define the hierarchy. Next, we use an LLM \(\Omega\) to reduce the input graph \(\mathcal{G}\) to a sub-graph \(\mathcal{G}_s\) containing only the relevant nodes and edges.
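The graph construction and reduction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `SceneGraph` class, the `reduce_graph` function, and the rule "keep the relevant nodes plus their supporting ancestors" are assumptions made for illustration, and the list of relevant nodes stands in for the selection that the LLM \(\Omega\) would make.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    # parent[child] is the object that `child` rests on top of
    # (None for roots such as the floor). Hypothetical structure.
    parent: dict = field(default_factory=dict)

    def add(self, name, on_top_of=None):
        self.parent[name] = on_top_of

    def ancestors(self, name):
        # Walk the 'on top of' edges upward until a root is reached.
        out = []
        p = self.parent[name]
        while p is not None:
            out.append(p)
            p = self.parent[p]
        return out


def reduce_graph(graph, relevant):
    """Keep the relevant nodes plus the ancestors that support them.

    `relevant` stands in for the node selection an LLM would produce.
    """
    keep = set(relevant)
    for name in relevant:
        keep.update(graph.ancestors(name))
    sub = SceneGraph()
    for name, par in graph.parent.items():  # preserves insertion order
        if name in keep:
            sub.add(name, par)
    return sub


scene = SceneGraph()
scene.add("floor")
scene.add("table", on_top_of="floor")
scene.add("sofa", on_top_of="floor")
scene.add("cup", on_top_of="table")
scene.add("book", on_top_of="table")

sub_graph = reduce_graph(scene, ["cup"])
print(list(sub_graph.parent))
```

Keeping the ancestors is one simple way to make sure every retained node still has a support surface in the sub-graph; the actual selection criterion is up to the LLM.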

Given the input text instruction \(\mathcal{I}\) and the sub-graph \(\mathcal{G}_s\), we use an LLM \(\Psi\) to generate an edited sub-graph \(\mathcal{G}'_s\) with the new desired edges. Furthermore, each node contains a node-specific instruction, which the LLM \(\Psi\) generates based on the input sub-graph state and the instruction \(\mathcal{I}\). Next, the output edited sub-graph \(\mathcal{G}'_s\) is used to construct an instruction queue ordered according to the hierarchy.
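One way to picture the instruction queue is a top-down traversal of the edited sub-graph, so that a supporting object is handled before the objects resting on it. The sketch below assumes a breadth-first order and a plain `parent` dictionary; both, along with the function name `build_instruction_queue` and the example instructions, are illustrative assumptions rather than details given by the source.

```python
from collections import deque


def build_instruction_queue(parent, instructions):
    """Order node-specific instructions so parents precede their children.

    parent: maps each node to the node it is on top of (None for roots).
    instructions: maps a node to its node-specific instruction text.
    """
    children = {}
    roots = []
    for node, par in parent.items():
        if par is None:
            roots.append(node)
        else:
            children.setdefault(par, []).append(node)

    queue = []
    frontier = deque(roots)
    while frontier:
        node = frontier.popleft()
        if node in instructions:  # nodes without an instruction are skipped
            queue.append((node, instructions[node]))
        frontier.extend(children.get(node, []))
    return queue


edited_parent = {"floor": None, "table": "floor", "chair": "floor", "cup": "table"}
node_instructions = {
    "cup": "place the cup near the table edge",
    "table": "move the table next to the window",
    "chair": "tuck the chair under the table",
}
for node, text in build_instruction_queue(edited_parent, node_instructions):
    print(node, "->", text)
```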

To place the objects and generate per-node constraints, we use another LLM \(\Phi\), which processes one node at a time and outputs a transformation and constraints for its children, following the child-level instruction.

Scene Sub-graph optimization

To achieve physical plausibility and adherence to the instruction after optimization, we propose three types of losses: a Graph loss \(\mathcal{L}_{\mathcal{G}_s}\), a Group loss \(\mathcal{L}_{G}\), and a Collision loss \(\mathcal{L}_{col}\).

Graph loss \(\mathcal{L}_{\mathcal{G}_s}\):

This loss optimizes for the "on top of" and "against wall" constraints, if available. These constraints are generated by the LLM, and we optimize for them using our defined "on top of" loss \(\mathcal{L}_{\text{On-Top-Of}}\) and "against wall" loss \(\mathcal{L}_{\text{AgainstWall}}\).
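The source does not give the formulas for these two terms, so the following is only a rough sketch of what such penalties could look like. It assumes axis-aligned bounding boxes with the z axis pointing up and an axis-aligned wall plane; the function names, the box format, and the squared-distance form are all hypothetical.

```python
def on_top_of_loss(child, parent):
    """Penalty that vanishes when `child` rests on the top face of `parent`.

    Boxes are dicts with 'min' and 'max' corners given as (x, y, z), z up.
    """
    # Vertical term: the child's bottom should touch the parent's top.
    gap = child["min"][2] - parent["max"][2]
    loss = gap ** 2
    # Horizontal term: the child's center should lie over the parent's footprint.
    for axis in (0, 1):
        center = 0.5 * (child["min"][axis] + child["max"][axis])
        below = max(0.0, parent["min"][axis] - center)
        above = max(0.0, center - parent["max"][axis])
        loss += (below + above) ** 2
    return loss


def against_wall_loss(box, wall_axis, wall_coord):
    """Squared distance between the wall plane and the nearest box face."""
    d = min(abs(box["min"][wall_axis] - wall_coord),
            abs(box["max"][wall_axis] - wall_coord))
    return d ** 2


table = {"min": (0.0, 0.0, 0.0), "max": (2.0, 2.0, 1.0)}
cup = {"min": (0.5, 0.5, 1.5), "max": (0.75, 0.75, 1.75)}  # floating above the table
print(on_top_of_loss(cup, table))
print(against_wall_loss(table, wall_axis=0, wall_coord=-0.5))
```

Both terms are zero exactly when the constraint is met, which makes them easy to combine with other penalties during optimization.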

Group loss \(\mathcal{L}_{G}\):

Since LLMs are good at generating relative locations for groups of objects (like chairs around a table), but fail at placing them in physically plausible locations, we propose this loss, which preserves the group's 3D structure during optimization.
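As an illustration of structure preservation, the sketch below penalizes changes in the pairwise offsets between group members relative to a reference layout (for example, the LLM's initial arrangement). This particular formulation, the function name `group_loss`, and the use of object centers as points are assumptions; the source does not specify how the loss is defined.

```python
from itertools import combinations


def group_loss(current, reference):
    """Sum of squared changes in pairwise offsets between group members.

    current, reference: lists of (x, y, z) object centers in the same order.
    The loss is zero when the group is moved as a whole by a translation.
    """
    loss = 0.0
    for i, j in combinations(range(len(current)), 2):
        for axis in range(3):
            cur = current[i][axis] - current[j][axis]
            ref = reference[i][axis] - reference[j][axis]
            loss += (cur - ref) ** 2
    return loss


reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shifted = [(x + 2.0, y + 3.0, z) for x, y, z in reference]  # whole group moved
distorted = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 1.0, 0.0)]  # one member moved

print(group_loss(shifted, reference))
print(group_loss(distorted, reference))
```

Note that this version is invariant to translating the group but not to rotating it; a rotation-invariant variant would compare pairwise distances instead of offsets.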

Collision loss \(\mathcal{L}_{col}\):

In order to resolve collisions between 3D objects, we propose this loss, which pushes colliding objects apart, with a stop condition defined by the distance between pairs of points for a colliding pair of objects.
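A point-based penalty with a distance threshold is one plausible reading of this description. The sketch below treats each object as a set of sampled surface points and penalizes every cross-object point pair closer than a margin, so the loss reaches zero (the stop condition) once all pairs are at least `margin` apart. The function name, the `margin` parameter and its default value, and the quadratic form are assumptions, not the authors' definition.

```python
import math


def collision_loss(points_a, points_b, margin=0.05):
    """Penalize point pairs from two objects that are closer than `margin`.

    points_a, points_b: lists of (x, y, z) points sampled on each object.
    Returns 0.0 when every cross-object pair is at least `margin` apart.
    """
    loss = 0.0
    for p in points_a:
        for q in points_b:
            d = math.dist(p, q)
            if d < margin:
                loss += (margin - d) ** 2
    return loss


chair = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
table_leg = [(0.12, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(collision_loss(chair, table_leg, margin=0.05))
```

The double loop is quadratic in the number of points, which is acceptable for a sketch; a spatial index would be the usual choice for dense point sets.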

Baseline comparison

Editing results with machine-generated instance masks

BibTeX

@misc{boudjoghra2025scanedithierarchicallyguidedfunctional3d,
      title={ScanEdit: Hierarchically-Guided Functional 3D Scan Editing}, 
      author={Mohamed el amine Boudjoghra and Ivan Laptev and Angela Dai},
      year={2025},
      eprint={2504.15049},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.15049}, 
}
