OmniVTA: Visuo-Tactile World Modeling for Contact-Rich Robotic Manipulation

Affiliations

Yuhang Zheng^1,2*, Songen Gu^1,3*, Weize Li^1, Yupeng Zheng^1,4†, Yujie Zang^2, Shuai Tian^4, Xiang Li^1,5,
Ce Hao^6, Chen Gao^2,7, Si Liu^7, Haoran Li^4, Yilun Chen^1, Shuicheng Yan^2†, Wenchao Ding^2†
(^*Equal Contribution; ^†Corresponding Author)

Overview

Dataset Statistics

Dataset Samples

Open-loop Visuo-Tactile Generation

We show manipulation videos and dual-finger tactile signals generated by our visuo-tactile world model. Both remain highly consistent with the ground truth.

Assembly

Insert the Red USB

Insert the Silver USB

Insert the White Plug

Cutting

Cut the Cucumber

Cut the Pepper

Cut the Chinese Yam

Peeling

Peel the Cucumber

Peel the Radish

Peel the Chinese Yam

Wiping

Wipe the Blue Middle Vase

Wipe the White Big Vase

Wipe the White Small Vase

Adjustment

Adjust the White Cuboid then Insert it into the Socket

Adjust the Yellow Cylinder then Insert it into the Socket

Adjust the Test Tube then Insert it into the Socket

Grasping

Grasp the Cherry

Grasp the Blueberry

Grasp the Grape

Real-Robot Execution

We demonstrate the responsiveness and robustness of our slow-fast visuo-tactile manipulation framework across a variety of tasks and under perturbations.

Assembly

Cutting

Peeling

Wiping

Adjustment

Grasping

Case Study and Qualitative Results

We visualize the visual input, predicted tactile signals, predicted future contact states, and visuo-tactile fusion weights during closed-loop inference. When the vase is disturbed, the policy re-predicts the contact-related signals and adjusts its actions accordingly. When tactile prediction degrades, the policy progressively loses the ability to generate reasonable actions.

We show additional real-world examples with predicted contact states and fusion weights. The policy adaptively balances visual and tactile inputs during execution. Even with an unseen knife, it still predicts future contact states accurately and maintains effective visuo-tactile fusion.
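
To give a rough sense of what per-step fusion weights can look like, the sketch below computes two weights with a softmax over per-modality scores and uses them to blend a visual and a tactile feature vector. This is an illustrative assumption, not the OmniVTA implementation: the function names `fusion_weights` and `fuse`, the scalar scores, and the convex-combination form are all hypothetical.

```python
import math
from typing import List, Tuple


def fusion_weights(vision_score: float, tactile_score: float) -> Tuple[float, float]:
    """Softmax over two hypothetical per-modality scores.

    Returns (w_vision, w_tactile), which are non-negative and sum to 1.
    """
    m = max(vision_score, tactile_score)  # subtract the max for numerical stability
    e_v = math.exp(vision_score - m)
    e_t = math.exp(tactile_score - m)
    total = e_v + e_t
    return e_v / total, e_t / total


def fuse(vision_feat: List[float], tactile_feat: List[float],
         vision_score: float, tactile_score: float) -> List[float]:
    """Blend two equal-length feature vectors as a convex combination."""
    if len(vision_feat) != len(tactile_feat):
        raise ValueError("feature vectors must have the same length")
    w_v, w_t = fusion_weights(vision_score, tactile_score)
    return [w_v * v + w_t * t for v, t in zip(vision_feat, tactile_feat)]
```

Under this toy formulation, a larger tactile score shifts the blend toward the tactile features, which is one simple way a policy could lean on touch during contact and on vision otherwise.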