An LMM for Precisely Grounding Elements in Documents

arXiv:2606.24118v1 (new). Abstract: Visual grounding in documents is a crucial ability for Large Multimodal Models (LMMs) in areas such as document understanding, deep research, and document error detection. However, existing approaches exhibit poor grounding precision in text-rich document images, often failing to accurately locate the critical document elements needed for reliable reasoning. To address this gap, we introduce PreciseDoc, an LMM specifically designed for precise eleme...

arXiv cs.CV · Yijian Lu, Chuangxin Zhao, Kai Sun, Lei Hou, Juanzi Li, Ji Qi · January 24, 2026
