MIRTH: Mutual-Information Reasoning with Temporal Hubs for Vision-Language-Action Agents

arXiv:2606.31167v1. Abstract: Vision-Language-Action (VLA) models have emerged as a powerful paradigm for transferring semantic knowledge from web-scale data to physical robotic control. However, current single-frame architectures suffer from three intrinsic limitations: temporal myopia that discards historical dynamics, reasoning gaps between high-level instructions and low-level motor commands, and inference inefficiency due to autoregressive scalar decoding. In this work, we propose MIRTH, a unified frame...

arXiv cs.RO · Hao Sun, Yu Song, Shiyu Teng, Ziwei Niu, Yen-Wei Chen