Listening makes Vision Clear for VLMs

arXiv:2606.23763v1. Abstract: Recent work typically assesses vision-language consistency using the attention distributions of answer-side tokens. However, we observe that the highest-attention regions are not always consistent with the intended semantic token. This probably stems from decoding drift, where language priors from previously generated answer tokens accumulate and become mismatched with visual attention. Besides the priors from previous answer tokens, we find that structural tokens...

arXiv cs.CV · Yiyang Chen, Yixin Tan, Binrui Shen