Category-level Object Detection, Pose Estimation and Reconstruction from Stereo Images

ECCV 2024
Chuanrui Zhang (1,2,*), Yonggen Ling (2,*), Minglei Lu (2), Minghan Qin (1), Haoqian Wang (1)
(1) Tsinghua University, (2) Tencent Robotics X
(*) Equal contribution

TL;DR

We present CODERS, a one-stage approach for category-level object detection, pose estimation and reconstruction from stereo images.

Abstract

We study the 3D object understanding task for manipulating everyday objects with different material properties (diffuse, specular, transparent and mixed). Existing monocular and RGB-D methods suffer from scale ambiguity due to missing or imprecise depth measurements. We present CODERS, a one-stage approach for Category-level Object Detection, pose Estimation and Reconstruction from Stereo images. The base of our pipeline is an implicit stereo matching module that combines stereo image features with 3D position information. Combining this module with the transformer-decoder architecture that follows it enables end-to-end learning of the multiple tasks required for robot manipulation. Our approach significantly outperforms all competing methods on the public TOD dataset. Furthermore, trained on simulated data, CODERS generalizes well to unseen category-level object instances in real-world robot manipulation experiments.

Architecture

Overview of CODERS. We present a single-stage network that processes multiple unknown objects and outputs detections, classes, 6D poses and 3D shapes concurrently. Taking stereo images as input, the network generates stereo-aware features for easier alignment in implicit feature space. In the transformer decoder stage, object queries interact with the 3D stereo-aware features, yielding object embeddings. The corresponding modules then infer the category, pose and shape of each object from these embeddings, and these predictions form the final output. In the Implicit Stereo Matching module, CT denotes coordinate transformer.
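The data flow described in the overview can be illustrated with a small NumPy sketch. This is not the CODERS implementation: the random features, the frustum-based position embedding, the single cross-attention layer, the linear heads, and every name and dimension below (`frustum_points`, `W_pos`, the 0.12 m baseline, the quaternion pose output) are assumptions chosen only to show how left and right image tokens tagged with 3D position information could be attended to by object queries to produce per-object class, pose and shape outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, D, Q = 32, 4, 6, 8, 5  # channels, feature-map height/width, depth bins, object queries


def frustum_points(h, w, depths, focal, cx, cy, cam_x):
    """3D points sampled along each pixel's viewing ray, in the left-camera frame.

    cam_x is the camera centre's x-offset (0 for left, baseline for right,
    assuming a rectified pair). Returns an array of shape (h*w, len(depths)*3).
    """
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    rays = np.stack(
        [(u - cx) / focal, (v - cy) / focal, np.ones_like(u, dtype=float)], axis=-1
    )  # (h, w, 3)
    pts = rays[:, :, None, :] * depths[None, None, :, None]  # (h, w, D, 3)
    pts[..., 0] += cam_x  # express right-camera points in the left-camera frame
    return pts.reshape(h * w, -1)


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def cross_attention(queries, tokens):
    """Single-head cross-attention with a residual connection (no learned projections)."""
    attn = softmax(queries @ tokens.T / np.sqrt(queries.shape[1]))  # (Q, T)
    return queries + attn @ tokens, attn


# Random stand-ins for backbone features of the left and right images.
left_feat = rng.normal(size=(H * W, C))
right_feat = rng.normal(size=(H * W, C))

# Hypothetical position embedding: a linear map of frustum points (a learned MLP in practice).
depths = np.linspace(0.3, 1.5, D)
W_pos = rng.normal(size=(D * 3, C)) * 0.1
left_pts = frustum_points(H, W, depths, focal=5.0, cx=W / 2, cy=H / 2, cam_x=0.0)
right_pts = frustum_points(H, W, depths, focal=5.0, cx=W / 2, cy=H / 2, cam_x=0.12)

# Both views share one 3D frame, so attention can relate them without explicit disparity.
tokens = np.concatenate(
    [left_feat + left_pts @ W_pos, right_feat + right_pts @ W_pos], axis=0
)  # (2*H*W, C)

# Object queries gather evidence from the stereo-aware tokens.
queries = rng.normal(size=(Q, C))
embeddings, attn = cross_attention(queries, tokens)

# Hypothetical linear heads: category logits, pose (translation + unit quaternion), shape code.
num_classes, shape_dim = 4, 16
class_logits = embeddings @ rng.normal(size=(C, num_classes))
pose_raw = embeddings @ rng.normal(size=(C, 7))
translation = pose_raw[:, :3]
quat = pose_raw[:, 3:] / np.linalg.norm(pose_raw[:, 3:], axis=1, keepdims=True)
shape_code = embeddings @ rng.normal(size=(C, shape_dim))

print(class_logits.shape, translation.shape, quat.shape, shape_code.shape)
```

The point of the sketch is the design choice rather than the numbers: because each token carries position information in a common 3D frame, the decoder can associate left and right observations of the same object inside attention, which is one plausible reading of "implicit" stereo matching.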

Comparisons with the State-of-the-art

We present qualitative comparisons with state-of-the-art models on the TOD dataset.

Real World Test

CODERS can handle everyday objects with various surface properties.

Robot Manipulation

CODERS can provide reliable estimation results for robot manipulation.