Do Thinking Tokens Help with Safety?
Hugging Face · Daily Papers
This paper is featured in Hugging Face's Daily Papers selection, curated by the AI research community.
Authors: Narutatsu Ri, Abhishek Panigrahi, Sanjeev Arora
- 1 community upvote
- Topics: thinking tokens, instruction-tuned counterparts, alignment, safety, reasoning models, hidden representation
Abstract
Original abstract, extracted from the paper:
Research reveals that reasoning models' safety outcomes are predictable from early hidden representations, with deliberation appearing but not substantially influencing final responses, and current safety interventions inadvertently suppress genuine deliberation signals.