
Do Thinking Tokens Help with Safety?


Hugging Face · Daily Papers · Narutatsu Ri, Abhishek Panigrahi · January 23, 2026 · ▲ 1 upvote

This paper is featured in Hugging Face's Daily Papers selection, curated by the AI research community.

Authors: Narutatsu Ri, Abhishek Panigrahi, Sanjeev Arora

1 community upvote
Topics: thinking tokens, instruction-tuned counterparts, alignment, safety, reasoning models, hidden representation

Summary

Original summary, extracted from the paper:

Research reveals that reasoning models' safety outcomes are predictable from early hidden representations, with deliberation appearing but not substantially influencing final responses, and current safety interventions inadvertently suppress genuine deliberation signals.

Where to read

View on Hugging Face
