The Moving Eye: Enhancing VLA Spatial Generalization via Hybrid Dynamic Data Collection

arXiv:2607.02322v1. Abstract: Vision-Language-Action (VLA) models have shown remarkable promise in generalized robotic manipulation. However, their spatial generalization remains fragile. We argue that simply increasing the number of viewpoints is insufficient. Models often fall into the trap of Shortcut Learning, latching onto spurious correlations (e.g., fixed relative poses between objects or between the camera and robot base) rather than learning true spatial relationships....

arXiv cs.RO · Jincheng Tang, Yilong Zhu, Zhengyuan Xie, Jiang-Jiang Liu, Jiaxing Zhang