RoboGaze: Evaluating Robot World Models via Structured Vision-Language Analysis

arXiv:2606.28385v1. Abstract: Recent advances in robot world models enable synthetic video generation for embodied prediction and planning. However, evaluating these videos is challenging: visually realistic outputs often violate physical laws, temporal consistency, or task logic, while conventional metrics and monolithic Vision-Language Model (VLM) judges fail to generalize or provide precise diagnostic value. We present RoboGaze, a training-free, multi-agent VLM framework tha...

arXiv cs.RO · Minh-Loi Nguyen, Nghiem Tuong Diep, Hung Khang Nguyen, Minh Le, Doanh Le Thien, Hoang H. Tran, Dung D. Le, Vu N. Duong, Daniel Sonntag, An Thai Le, Duy Minh Ho Nguyen, Vien Anh Ngo, Tran Van Nhiem