DrugBench: Evaluating AI Control Protocols for Medication Harm Mitigation

arXiv:2606.20663v1. Abstract: Large Language Models have the potential to expand and improve access to clinical information by enabling new ways of interacting with medical knowledge in natural language. However, their deployment in medical question-answering settings is safety-critical, since misaligned outputs can lead to severe patient harm. AI control is an emerging approach that introduces external safeguards to mitigate unsafe behaviours in misaligned systems and has ...

arXiv cs.AI · Guido Freire, Agustín Martínez-Suñé, Viviana Cotik