Recent advances in generalist robot manipulation leverage pre-trained Vision Language Models (VLMs) and large-scale robot demonstrations to tackle diverse tasks in a zero-shot manner. A key challenge remains: scaling high-quality, action-labeled robot demonstration data, which existing methods rely on for robustness and generalization.
To address this, we propose a method that benefits from videos without action labels, featuring humans and/or robots in action, to enhance open-vocabulary performance and enable data-efficient learning of new tasks. Our method extracts dense, dynamic 3D point clouds at the hand or gripper location and trains a proposed 3D dynamics predictor on them in a self-supervised manner. This predictor is then adapted into an action predictor using a smaller labeled dataset for action alignment.
We show that our method not only learns from unlabeled human and robot demonstrations—improving downstream generalist robot policies—but also enables robots to learn new tasks without action labels (i.e., out-of-action generalization) in both real-world and simulated settings.
Our method enables robots to perform actions for which no labels are available during training. Such unlabeled demonstrations may come from humans or from other robots performing the tasks. This out-of-action domain generalization is achieved through large-scale training on dynamic point clouds, followed by action alignment on a smaller dataset with action labels.
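To make the dynamic point cloud representation concrete, the following is a minimal NumPy sketch, not the authors' released code: it back-projects RGB-D frames into 3D points restricted to a detected hand or gripper region and stacks them over time. The function names, the fixed point count, and the assumption of an externally supplied segmentation mask are all illustrative choices, not details taken from the paper.

import numpy as np

def backproject_hand_points(depth, intrinsics, mask):
    # depth: (H, W) metric depth; intrinsics: 3x3 pinhole matrix; mask: (H, W) bool hand/gripper region.
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    v, u = np.nonzero(mask)                      # pixel rows/cols inside the mask
    z = depth[v, u]
    u, v, z = u[z > 0], v[z > 0], z[z > 0]       # drop pixels with missing depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)          # (N, 3) points in the camera frame

def dynamic_point_cloud(depths, masks, intrinsics, num_points=512):
    # Build a fixed-size, hand/gripper-centric point cloud per frame -> (T, num_points, 3).
    clouds = []
    for depth, mask in zip(depths, masks):
        pts = backproject_hand_points(depth, intrinsics, mask)
        idx = np.random.choice(len(pts), num_points, replace=len(pts) < num_points)
        clouds.append(pts[idx])
    return np.stack(clouds)

Stacking these per-frame clouds over time yields an embodiment-agnostic motion signal, since the same representation can be computed for a human hand or a robot gripper.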
We utilize a 3D Dynamics Predictor to learn an embodiment-agnostic representation from unlabeled videos, followed by an Action Predictor for generalist manipulation. The approach uses a two-stage training process with dynamic point clouds as the common representation: we initialize the Action Predictor with the Dynamics Predictor and finetune it on a smaller dataset with action labels.
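The paper's actual architecture is not reproduced here; the PyTorch sketch below only illustrates the two-stage recipe under stated assumptions. A toy permutation-invariant encoder stands in for the real Dynamics Predictor, a Chamfer-style loss stands in for the self-supervised dynamics objective, and a linear head with a 7-dimensional output (e.g. 6-DoF delta pose plus gripper) stands in for the Action Predictor.

import torch
import torch.nn as nn

class DynamicsPredictor(nn.Module):
    # Toy stand-in: encodes a point cloud and predicts the next frame's point cloud.
    def __init__(self, num_points=512, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.decoder = nn.Linear(hidden, num_points * 3)
        self.num_points = num_points

    def encode(self, cloud):                           # cloud: (B, N, 3)
        return self.encoder(cloud).max(dim=1).values   # permutation-invariant pooling -> (B, hidden)

    def forward(self, cloud):
        return self.decoder(self.encode(cloud)).view(-1, self.num_points, 3)

def chamfer_loss(pred, target):
    # Symmetric Chamfer distance between two batched point clouds.
    d = torch.cdist(pred, target)                      # (B, N, M) pairwise distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

# Stage 1: self-supervised dynamics pre-training on unlabeled (human or robot) videos.
dyn = DynamicsPredictor()
opt = torch.optim.AdamW(dyn.parameters(), lr=1e-4)
cloud_t, cloud_t1 = torch.randn(8, 512, 3), torch.randn(8, 512, 3)  # placeholder batch
chamfer_loss(dyn(cloud_t), cloud_t1).backward()
opt.step()

# Stage 2: initialize from the pre-trained encoder and finetune an action head
# on the smaller action-labeled dataset (action alignment).
action_head = nn.Linear(256, 7)                        # hypothetical 7-D action parameterization
opt2 = torch.optim.AdamW(list(dyn.parameters()) + list(action_head.parameters()), lr=1e-5)
actions = torch.randn(8, 7)                            # placeholder labels
nn.functional.mse_loss(action_head(dyn.encode(cloud_t)), actions).backward()
opt2.step()

The key design choice illustrated here is that both stages share the point-cloud encoder, so motion priors learned from unlabeled videos carry over to the action-supervised stage.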
MotoVLA (R + H) achieves a 68.2% average success rate in SIMPLER simulation, outperforming LAPA by 14.1% and the π₀ baseline by 11.4%. Dynamic point cloud pre-training improves performance even for tasks with action supervision, demonstrating effective learning of cross-domain motion priors.
Our method achieves superior performance in real-robot evaluation across 8 out-of-domain tasks. Significant improvements are observed for tasks present in the human demonstration data (Push Button, Cube on Scale, Cable in Basket, Clamp in Cup), demonstrating direct knowledge transfer from unlabeled cross-embodiment demonstrations.
@inproceedings{spiridonov2025generalist,
  title     = {Generalist Robot Manipulation beyond Action Labeled Data},
  author    = {Alexander Spiridonov and Jan-Nico Zaech and Nikolay Nikolov and Luc Van Gool and Danda Pani Paudel},
  booktitle = {9th Annual Conference on Robot Learning},
  year      = {2025},
  url       = {https://openreview.net/forum?id=ZqBXnR6ppz}
}