GaussianGrasper: 3D Language Gaussian Splatting for Open-vocabulary Robotic Grasping


1 Beihang University, 2 EncoSmart, 3 HKU, 4 CASIA, 5 Tsinghua AIR, 6 Imperial College London, 7 Shanghai AI Laboratory, 8 China Mobile Research Institute, 9 LightIllusions, 10 Wuhan University

Abstract

Constructing a 3D scene capable of accommodating open-ended language queries is a pivotal pursuit, particularly within the domain of robotics. Such technology enables robots to execute object manipulations based on human language directives. To tackle this challenge, some research efforts have been dedicated to the development of language-embedded implicit fields. However, implicit fields (e.g., NeRF) encounter limitations due to the necessity of processing a large number of input views for reconstruction, coupled with their inherent inefficiency at inference time. Thus, we present GaussianGrasper, which utilizes 3D Gaussian Splatting to explicitly represent the scene as a collection of Gaussian primitives. Our approach takes a limited set of RGB-D views and employs a tile-based splatting technique to create a feature field. In particular, we propose an Efficient Feature Distillation (EFD) module that employs contrastive learning to efficiently and accurately distill language embeddings derived from foundation models. With the reconstructed geometry of the Gaussian field, our method enables a pre-trained grasping model to generate collision-free grasp pose candidates. Furthermore, we propose a normal-guided grasp module to select the best grasp pose. Through comprehensive real-world experiments, we demonstrate that GaussianGrasper enables robots to accurately query and grasp objects with language instructions, providing a new solution for language-guided manipulation tasks.
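For a concrete sense of the open-vocabulary querying step, the sketch below matches a CLIP text embedding against rendered per-pixel features. It assumes the features have already been recovered to the CLIP embedding space (e.g., via the MLP in EFD), and the thresholding is purely illustrative, not the exact procedure used in the paper.

```python
# Sketch: locating a queried object from rendered language features.
# Assumes per-pixel features are already in the CLIP embedding space;
# the threshold value is an illustrative assumption.
import torch
import torch.nn.functional as F

def relevancy_map(pixel_feats, text_embed):
    """Cosine similarity between rendered features and a text query.

    pixel_feats: (H, W, D) rendered per-pixel features in CLIP space
    text_embed:  (D,)      CLIP text embedding of the language instruction
    """
    pix = F.normalize(pixel_feats, dim=-1)
    txt = F.normalize(text_embed, dim=-1)
    return torch.einsum("hwd,d->hw", pix, txt)

def query_object(pixel_feats, text_embed, threshold=0.25):
    """Binary mask of pixels whose relevancy exceeds the threshold."""
    rel = relevancy_map(pixel_feats, text_embed)
    return rel > threshold
```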

Contributions

  • We introduce GaussianGrasper, a robot manipulation system built on a 3D Gaussian field endowed with consistent open-vocabulary semantics and accurate geometry to support open-world manipulation tasks guided by language instructions.
  • We propose EFD, which leverages contrastive learning to efficiently distill CLIP features and augments the feature field with a SAM segmentation prior, addressing the challenges of computational expense and boundary ambiguity.
  • We propose a normal-guided grasp module that uses rendered surface normals to select the best grasp pose from the generated candidates (a minimal sketch of this filtering step follows the list).
  • We demonstrate GaussianGrasper's capability on language-guided manipulation tasks across multiple real-world tabletop scenes and with common objects.
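As referenced in the third bullet, the following is a minimal sketch of a normal-guided, force-closure-style check for two-finger grasp candidates: each contact normal must lie inside the friction cone around the grasp closing axis. The friction coefficient, the candidate dictionary format, and the antipodal test itself are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: normal-guided filtering of two-finger grasp candidates.
# Antipodal force-closure check: the line between the two contacts must
# lie inside the friction cone at each contact.
import numpy as np

def passes_force_closure(p1, n1, p2, n2, friction_mu=0.5):
    """p1/p2: contact points (3,); n1/n2: outward unit surface normals (3,)."""
    axis = p2 - p1
    axis = axis / (np.linalg.norm(axis) + 1e-9)
    half_angle = np.arctan(friction_mu)  # friction cone half-angle
    # The closing direction is +axis at contact 1 and -axis at contact 2;
    # each must oppose the outward normal within the friction cone.
    ang1 = np.arccos(np.clip(np.dot(-n1, axis), -1.0, 1.0))
    ang2 = np.arccos(np.clip(np.dot(-n2, -axis), -1.0, 1.0))
    return ang1 <= half_angle and ang2 <= half_angle

def filter_candidates(candidates, friction_mu=0.5):
    """Keep grasps whose contacts satisfy the antipodal force-closure test."""
    return [g for g in candidates
            if passes_force_closure(g["p1"], g["n1"], g["p2"], g["n2"], friction_mu)]
```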

Pipeline of GaussianGrasper


The architecture of our proposed method. (a) Our pipeline: we scan multi-view RGB-D images for initialization and reconstruct a 3D Gaussian field via feature distillation and geometry reconstruction. Given a language instruction, we locate the target object via open-vocabulary querying. Grasp pose candidates for the target object are then generated by a pre-trained grasping model. Finally, a normal-guided module that uses surface normals to filter out infeasible candidates selects the best grasp pose. (b) EFD: we leverage contrastive learning to constrain the rendered latent feature L and sample only a few pixels, whose features are recovered to the CLIP space via an MLP; the recovered features are then used to compute a distillation loss against the CLIP features. (c) Normal-guided grasp: force-closure theory is used to filter out infeasible grasp poses.
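The two losses in (b) could look roughly as follows. The InfoNCE-style contrastive form, the cosine distillation loss, and all tensor shapes are assumptions for illustration and may differ from the released implementation.

```python
# Sketch of the two EFD losses, under assumed shapes and loss forms.
import torch
import torch.nn.functional as F

def contrastive_loss(latent, sam_masks, temperature=0.1):
    """Pull rendered latent features together within a SAM segment and
    push them apart across segments.

    latent:    (N, d) low-dimensional rendered features at N sampled pixels
    sam_masks: (N,)   integer SAM segment id per sampled pixel
    """
    z = F.normalize(latent, dim=-1)
    sim = z @ z.t() / temperature                    # (N, N) pairwise similarities
    same = sam_masks[:, None] == sam_masks[None, :]  # pixels in the same segment
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = same & ~eye                                # positives, excluding self
    # InfoNCE-style: normalize each row over all non-self pairs.
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, -1e9), dim=1, keepdim=True)
    return -(log_prob * pos).sum() / pos.sum().clamp(min=1)

def distillation_loss(latent_samples, clip_targets, mlp):
    """Recover sampled latents to CLIP space with an MLP and match the
    CLIP features extracted at those pixels (cosine distance here)."""
    recovered = F.normalize(mlp(latent_samples), dim=-1)
    targets = F.normalize(clip_targets, dim=-1)
    return (1.0 - (recovered * targets).sum(dim=-1)).mean()
```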

Qualitative Results


Baselines: (1) LERF with extra depth supervision; (2) SAM + CLIP.

3D Gaussian Field Reconstruction


* LERF is trained with extra depth supervision

Language-guided Grasp


Scene Update

1. Language-guided pick-and-place
2. Scan new scene



3. Update the scene
4. Continuous grasp on the same object


BibTeX

@article{zheng2024gaussiangrasper,
  title={GaussianGrasper: 3D Language Gaussian Splatting for Open-vocabulary Robotic Grasping},
  author={Zheng, Yuhang and Chen, Xiangyu and Zheng, Yupeng and Gu, Songen and Yang, Runyi and Jin, Bu and Li, Pengfei and Zhong, Chengliang and Wang, Zengmao and Liu, Lina and others},
  journal={arXiv preprint arXiv:2403.09637},
  year={2024}
}