Supervise What Survives: Geometry-Guided VLA Adaptation from Synthetic Robot Videos

arXiv:2606.24448v1. Abstract: Vision-Language-Action (VLA) models require large-scale video-action pairs, yet real teleoperation data remains scarce. While generated robot videos offer a scalable alternative, existing methods treat them as real robot data by recovering pseudo-actions from synthesized pixels. We argue that deriving low-level control from generated visuals is a mismatched abstraction. A video captures only geometry: the spatial trajectory representing the ...

arXiv cs.RO · Danze Chen, Yanzhe Chen, Qiming Huang, Zhijun Cao, Chen Gao, Mike Zheng Shou