Do Thinking Tokens Help with Safety?

Hugging Face · Daily Papers · Narutatsu Ri, Abhishek Panigrahi · ▲ 1 upvote

This paper is featured in Hugging Face's Daily Papers selection, curated by the AI research community.

Authors: Narutatsu Ri, Abhishek Panigrahi, Sanjeev Arora

  • 1 community upvote
  • Topics: thinking tokens, instruction-tuned counterparts, alignment, safety, reasoning models, hidden representation

Summary

Original summary, extracted from the paper:

Research reveals that reasoning models' safety outcomes are predictable from early hidden representations, with deliberation appearing but not substantially influencing final responses, and current safety interventions inadvertently suppress genuine deliberation signals.
