An LMM for Precisely Grounding Elements in Documents

arXiv:2606.24118v1 Announce Type: new Abstract: Visual grounding in documents is a crucial ability for Large Multimodal Models (LMMs) in areas such as document understanding, deep research and document error detection. However, existing approaches exhibit poor grounding precision in text-rich document images, often failing to accurately locate the critical document elements needed for reliable reasoning. To address this gap, we introduce PreciseDoc, an LMM specifically designed for precise eleme...

arXiv cs.CV ·Yijian Lu, Chuangxin Zhao, Kai Sun, Lei Hou, Juanzi Li, Ji Qi ·
compartilhar: