Epistemic Goggles: A Pretrained Module that Induces an Epistemic Frame via Gradient Editing

arXiv:2607.01690v1 Announce Type: new Abstract: Finetuning a language model on documents that are explicitly annotated as fictional results in a model that still actually believes the documents' core claims, an effect known as Negation Neglect. In our evaluations, models trained on documents prefixed and suffixed with such annotations correctly identify the relevant claims as fictional only about 9% of the time. To address this, we introduce Goggles, a learned module that intervenes on the finet...

arXiv cs.AI ·Joshua Penman ·
compartilhar: