What LLMs explain is not what they believe: Evaluating explanation sufficiency under models' own input beliefs

arXiv:2606.28615v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed in high-stakes domains, where free-text explanations such as chain-of-thought and post-hoc rationales are used to justify model outputs. Yet it remains unclear whether these explanations are sufficient, i.e., if they contain enough information to explain the model's output-generating process. We generalize classical sufficiency from feature attributions to arbitrary explanations and prove that ...

arXiv cs.LG ·Nhi Nguyen, Shauli Ravfogel, Rajesh Ranganath ·
compartilhar: