GaussianGrasper: 3D Language Gaussian Splatting for Open-vocabulary Robotic Grasping


1 Beihang University, 2 EncoSmart, 3 HKU, 4 CASIA, 5 Tsinghua AIR, 6 Imperial College London, 7 Shanghai AI Laboratory, 8 China Mobile Research Institute, 9 LightIllusions, 10 Wuhan University

Abstract

Constructing a 3D scene capable of accommodating open-ended language queries is a pivotal pursuit, particularly within the domain of robotics: such technology enables robots to execute object manipulations from human language directives. To tackle this challenge, some research efforts have been devoted to developing language-embedded implicit fields. However, implicit fields (e.g., NeRF) are limited by the large number of input views they require for reconstruction and by their inherent inefficiency at inference time. We therefore present GaussianGrasper, which utilizes 3D Gaussian Splatting to explicitly represent the scene as a collection of Gaussian primitives. Our approach takes a limited set of RGB-D views and employs a tile-based splatting technique to create a feature field. In particular, we propose an Efficient Feature Distillation (EFD) module that employs contrastive learning to efficiently and accurately distill language embeddings derived from foundation models. With the reconstructed geometry of the Gaussian field, our method enables a pre-trained grasping model to generate collision-free grasp pose candidates, and a proposed normal-guided grasp module selects the best grasp pose among them. Through comprehensive real-world experiments, we demonstrate that GaussianGrasper enables robots to accurately query and grasp objects via language instructions, providing a new solution for language-guided manipulation tasks.
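To make the querying step concrete, here is a minimal sketch of matching a language instruction against a rendered feature map. It assumes the Gaussian field can render per-pixel features `feat_map` already aligned to CLIP's embedding space; the `query_heatmap` helper and all tensor shapes are illustrative assumptions, not the paper's code.

```python
# Hedged sketch: relate a text query to rendered CLIP-space features.
import clip  # OpenAI CLIP package
import torch
import torch.nn.functional as F

model, _ = clip.load("ViT-B/32", device="cpu")

def query_heatmap(feat_map, text):
    """feat_map: (H, W, D) rendered CLIP-space features; returns (H, W) relevancy."""
    with torch.no_grad():
        t = model.encode_text(clip.tokenize([text])).float()
    t = F.normalize(t, dim=-1)                    # (1, D) text embedding
    f = F.normalize(feat_map, dim=-1)             # (H, W, D) pixel features
    return (f @ t.squeeze(0)).clamp(min=0)        # per-pixel cosine relevancy

# Dummy rendered features; in the real system these come from splatting.
feat_map = torch.randn(48, 64, 512)
hm = query_heatmap(feat_map, "a red mug")
v, u = divmod(int(hm.argmax()), hm.shape[1])      # hottest pixel (row, col)
```

Combined with rendered depth, the hottest region gives a 3D location of the queried object that downstream grasp generation can use.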

Contributions

  • We introduce GaussianGrasper, a robot manipulation system built on a 3D Gaussian field endowed with consistent open-vocabulary semantics and accurate geometry, supporting open-world manipulation tasks guided by language instructions.
  • We propose EFD, which leverages contrastive learning to efficiently distill CLIP features and augments the feature field with SAM segmentation priors, addressing the challenges of computational expense and boundary ambiguity (a sketch of this step follows the list).
  • We propose a normal-guided grasp module that uses rendered surface normals to select the best grasp pose from the generated candidates.
  • We demonstrate GaussianGrasper's ability to perform language-guided manipulation tasks on multiple real-world tabletop scenes and common objects.
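As referenced in the list above, the following is a hedged PyTorch sketch of what EFD-style training could look like: a low-dimensional latent feature map rendered from the Gaussian field is decoded to CLIP space by an MLP at a few sampled pixels for the distillation loss, while SAM mask ids drive a simple contrastive term. All names, dimensions, and loss weights are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of EFD-style losses; shapes, weights, and names are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM, CLIP_DIM, N_SAMPLES = 16, 512, 256

decoder = nn.Sequential(                      # shallow MLP: latent -> CLIP space
    nn.Linear(LATENT_DIM, 256), nn.ReLU(),
    nn.Linear(256, CLIP_DIM),
)

def efd_losses(latent_map, clip_map, mask_ids):
    """latent_map: (H, W, LATENT_DIM) rendered from the Gaussian field,
    clip_map: (H, W, CLIP_DIM) CLIP targets, mask_ids: (H, W) SAM segment ids."""
    H, W, _ = latent_map.shape
    idx = torch.randint(0, H * W, (N_SAMPLES,))           # sample a few pixels only
    lat = latent_map.reshape(-1, LATENT_DIM)[idx]
    tgt = clip_map.reshape(-1, CLIP_DIM)[idx]
    ids = mask_ids.reshape(-1)[idx]

    # Distillation: MLP-recovered features should match CLIP targets (cosine).
    rec = F.normalize(decoder(lat), dim=-1)
    distill = (1 - (rec * F.normalize(tgt, dim=-1)).sum(-1)).mean()

    # Contrastive term: latents agree inside a SAM mask, separate across masks,
    # one way to sharpen object boundaries in the feature field.
    z = F.normalize(lat, dim=-1)
    sim = z @ z.t()
    same = (ids[:, None] == ids[None, :]).float()
    contrast = ((1 - sim) * same + sim.clamp(min=0) * (1 - same)).mean()
    return distill, contrast

# Toy run with random tensors, just to show the interface:
H = W = 64
d, c = efd_losses(torch.randn(H, W, LATENT_DIM),
                  torch.randn(H, W, CLIP_DIM),
                  torch.randint(0, 8, (H, W)))
loss = d + 0.1 * c                                        # weight is arbitrary here
```

Because only a few sampled pixels pass through the MLP and the high-dimensional CLIP-space loss, the expensive supervision stays sparse, matching the paper's stated motivation of recovering features at only a few pixels.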

Pipeline of GaussianGrasper

[Figure: overall architecture of GaussianGrasper]

The architecture of our proposed method. (a) The overall pipeline: we scan multi-view RGB-D images for initialization and reconstruct the 3D Gaussian field via feature distillation and geometry reconstruction. Given a language instruction, we then locate the target object via open-vocabulary querying, and a pre-trained grasping model generates grasp pose candidates for it. Finally, a normal-guided module uses rendered surface normals to filter out infeasible candidates and select the best grasp pose. (b) Details of EFD: contrastive learning constrains the rendered latent features L, and only a few pixels are sampled and recovered to the CLIP space via an MLP; the recovered features are then used to compute a distillation loss against the CLIP features. (c) The normal-guided grasp, which applies force-closure theory to filter out infeasible grasp poses (a toy version of this check is sketched below).
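As a toy illustration of the force-closure filtering in (c), the snippet below tests a two-finger antipodal candidate: both contact normals (read from the rendered normal map) must lie inside the friction cone around the finger-closing axis. The friction coefficient and the candidate format are assumptions for illustration, not the paper's exact criterion.

```python
# Hedged sketch of normal-guided grasp filtering via an antipodal
# friction-cone test; inputs and mu are illustrative assumptions.
import numpy as np

def force_closure_ok(p1, p2, n1, n2, mu=0.4):
    """p1/p2: contact points (3,), n1/n2: inward unit normals at the contacts
    (derived from the rendered normal map), mu: assumed friction coefficient."""
    axis = (p2 - p1) / np.linalg.norm(p2 - p1)   # finger-closing direction
    cone = np.cos(np.arctan(mu))                 # cos of friction-cone half-angle
    # Antipodal condition: each normal aligns with the closing line to within
    # the friction cone, which implies force closure for two frictional contacts.
    return np.dot(n1, axis) >= cone and np.dot(n2, -axis) >= cone

def filter_candidates(candidates, mu=0.4):
    """candidates: iterable of (p1, p2, n1, n2); keeps feasible grasps."""
    return [c for c in candidates if force_closure_ok(*c, mu=mu)]

# Toy check: opposite faces of a box pass, a glancing contact fails.
p1, p2 = np.zeros(3), np.array([0.05, 0.0, 0.0])
print(force_closure_ok(p1, p2, np.array([1.0, 0, 0]), np.array([-1.0, 0, 0])))  # True
print(force_closure_ok(p1, p2, np.array([0.0, 1, 0]), np.array([-1.0, 0, 0])))  # False
```

Candidates that fail the cone test cannot resist arbitrary external wrenches with two frictional point contacts and are discarded before execution.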

Qualitative Results

[Figure: qualitative comparison of open-vocabulary querying]

Baselines: (1) LERF with extra depth supervision; (2) SAM + CLIP.

3D Gaussian Field Reconstruction

[Figures: reconstruction results on two scenes and comparison with LERF]

* LERF is trained with extra depth supervision

Language-guided Grasp

[Figures: language-guided grasping demonstrations in two scenes]

Scene Update

1. Language-guided pick-and-place
2. Scan the new scene
3. Update the scene
4. Continuous grasping of the same object

[Figures: scene-update sequence]

Acknowledgement

We would like to thank the DRL group of CASIA for providing devices, and Prof. Qichao Zhang and Prof. Haoran Li for their valuable and insightful suggestions.

BibTeX

@article{zheng2024gaussiangrasper,
      title={GaussianGrasper: 3D Language Gaussian Splatting for Open-vocabulary Robotic Grasping},
      author={Zheng, Yuhang and Chen, Xiangyu and Zheng, Yupeng and Gu, Songen and Yang, Runyi and Jin, Bu and Li, Pengfei and Zhong, Chengliang and Wang, Zengmao and Liu, Lina and others},
      journal={arXiv preprint arXiv:2403.09637},
      year={2024}}