UniPR: Unified Object-level Real-to-Sim Perception and Reconstruction from a Single Stereo Pair

CVPR 2026
Chuanrui Zhang1,2* Yingshuang Zou4* Zhengxian Wu5* Yonggen Ling2,3†✉
Yuxiao Yang5 Ziwei Wang1✉
1NTU    2Tencent Robotics X    3Futian Laboratory    4HKUST    5THU
*Equal Contribution    †Project Leader    ✉Corresponding Authors

TL;DR

We present UniPR, an end-to-end stereo framework that unifies 6D pose estimation and metric-scale 3D shape reconstruction, achieving up to 100× faster generation and 3× better shape-proportion accuracy for real-to-sim robotic manipulation.

Abstract

Perceiving and reconstructing objects from images is critical for real-to-sim transfer, which is widely used in the robotics community. Existing methods rely on multiple submodules such as detection, segmentation, shape reconstruction, and pose estimation to complete the pipeline. However, such modular pipelines suffer from inefficiency and cumulative error, as each stage operates on only partial or locally refined information while discarding global context. To address these limitations, we propose UniPR, the first end-to-end object-level real-to-sim perception and reconstruction framework. Operating directly on a single stereo image pair, UniPR leverages geometric constraints to resolve scale ambiguity. We introduce a Pose-Aware Shape Representation that eliminates the need for per-category canonical definitions and bridges the gap between reconstruction and pose estimation. Furthermore, we construct LVS6D, a large-vocabulary stereo dataset comprising over 6,300 objects, to facilitate large-scale research in this area. Extensive experiments demonstrate that UniPR reconstructs all objects in a scene in parallel within a single forward pass, achieving significant efficiency gains while preserving true physical proportions across diverse object types, highlighting its potential for practical robotic applications.
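The geometric constraint that lets a calibrated stereo pair resolve metric scale is the standard disparity-to-depth relation, Z = f·B/d. The sketch below illustrates that relation only; the focal length, baseline, and disparity values are illustrative and not taken from the paper.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Metric depth from stereo disparity: Z = f * B / d.

    disparity_px: horizontal pixel shift of a point between the left
    and right views; focal_px: focal length in pixels; baseline_m:
    distance between the two camera centers in meters.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# Illustrative values: a 720 px focal length, 6 cm baseline rig.
depth = disparity_to_depth(disparity_px=36.0, focal_px=720.0, baseline_m=0.06)
print(round(depth, 3))  # → 1.2 (meters)
```

Because the baseline is a known physical length, depths recovered this way are in meters rather than up to an unknown scale, which is why a single stereo pair suffices where monocular methods remain scale-ambiguous.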

Architecture

architecture
Overview of UniPR. Taking a stereo image pair as input, UniPR first encodes the scene into Tri-Plane View (TPV) features that comprehensively capture spatial and geometric information. Within the transformer decoder, object queries extract instance-specific features from these TPV embeddings, enabling the network to reason about multiple objects in parallel. The resulting object embeddings are then fed into specialized prediction heads that infer each object's semantic label, 3D position, physical scale, and pose-aware shape representation.
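As a concrete sketch of this decoding step, the snippet below runs one cross-attention pass of object queries over flattened tri-plane tokens, followed by parallel per-object heads. All sizes, head names, and the use of plain linear maps are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, not from the paper: 16 object queries, 3 planes
# of 8x8 cells, 32-dim features.
num_queries, plane_hw, dim = 16, 8, 32
tpv = rng.standard_normal((3 * plane_hw * plane_hw, dim))  # flattened tri-plane tokens
queries = rng.standard_normal((num_queries, dim))          # learnable object queries

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# One cross-attention step: each query gathers instance-specific
# evidence from every tri-plane token at once, so all objects are
# decoded in parallel within a single forward pass.
attn = softmax(queries @ tpv.T / np.sqrt(dim))   # (16, 192) attention weights
obj_embed = attn @ tpv                           # (16, 32) object embeddings

# Parallel prediction heads (plain linear maps standing in for the
# learned networks): label logits, 3D position, scale, and shape code.
heads = {
    "label": rng.standard_normal((dim, 10)),   # 10 illustrative classes
    "position": rng.standard_normal((dim, 3)),
    "scale": rng.standard_normal((dim, 3)),
    "shape": rng.standard_normal((dim, 64)),   # pose-aware shape code
}
outputs = {name: obj_embed @ w for name, w in heads.items()}
print({name: out.shape for name, out in outputs.items()})
```

The design point this sketch captures is that every head reads from the same shared object embedding, so pose, scale, and shape stay consistent per instance instead of being estimated by disconnected pipeline stages.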

Comparisons with the State-of-the-art

Qualitative shape reconstruction results compared with image-to-3D models:

SOTA comparisons

Qualitative pose-aware shape reconstruction results on the LVS6D dataset:

SOTA comparisons