A Validation-Gated Mechanistic Account of Suicidality Detection in LLMs

arXiv:2606.21078v1 (announce type: new)

Abstract: Large language models are increasingly proposed for mental-health applications such as detecting suicidal content, raising the question of what they rely on. We study this mechanistically and use it to ask a narrower question: how to make a causal claim about a model's internal features more trustworthy. Our validation-gated framework, with suicidality detection as a case study, interprets a behavior only after the model is shown to perform it: a c...

arXiv cs.CL · Nafiz Ahmed, Sarah Sharif, Dingjing Shi, Mike Banad