
PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking

arXiv:2607.00115v1. Abstract: This paper explores multi-turn visual reasoning and observes that MLLMs repeatedly fail to localize the target, leading to long, redundant trajectories. We attribute this failure to the entanglement of reasoning and perception within a single model: the MLLM reasons and localizes simultaneously, and inaccurate localization triggers additional reasoning turns that bloat the trajectory. To solve this problem, we propose PixelEyes, a multi-turn visual...

arXiv cs.CV · Dengxian Gong, Yuanzheng Wu, Haobo Yuan, Zhengdong Hu, Tao Zhang, Yikang Zhou, Shihao Chen, Quanzhu Niu, Kai Wang, Jason Li, Haochen Wang, Lu Qi, Shunping Ji, Ming-Hsuan Yang · January 2, 2026


