Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG

arXiv:2606.30989v1 Announce Type: new Abstract: Warning: This paper contains several toxic and offensive statements. While reasoning generally improves fairness in recent large language models (LLMs), failures persist. In this work, we identify a failure mode, deductive stereotyping, in which models apply population-level statistical regularities to individual cases, producing logically coherent yet socially biased inferences. We provide a statistical interpretation of this phenomenon. To steer ...

arXiv cs.CL ·Naihao Deng, Yilun Zhu, Joan Nwatu, Clayton Scott, Rada Mihalcea ·
compartilhar: