
HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment

arXiv:2607.00572v1. Abstract: Understanding how aligned LLMs internally represent safety is critical for diagnosing alignment vulnerabilities, as it explains why jailbreaks succeed and informs the design of robust alignment strategies. Prior work shows that aligned LLMs encode harmfulness and refusal as separable directions in the residual stream at prompt-side token positions. We show that jailbreaks succeed at prompt encoding by suppressing either the refusal or harmfulness d...

arXiv cs.AI · Shei Pern Chua, Fangzhao Wu · January 2, 2026


