REALM: A Unified Red-Teaming Benchmark for Physical-World VLMs

arXiv:2606.23892v1 Announce Type: new

Abstract: Vision-language models (VLMs) are increasingly used as perception-reasoning backbones for embodied intelligence in safety-critical physical systems, where perception or reasoning errors can lead to unsafe decisions or actions. Although many red-teaming methods have been developed to probe VLM vulnerabilities, their evaluation remains fragmented across datasets, metrics, and threat models, making direct comparison difficult and obscuring whether obs...

arXiv cs.CV · Yifei Zhao, Qian Lou, Mengxin Zheng