Scalable Hierarchical Attention Transformers for Multi-Turn Jailbreak Detection in Long Conversations

arXiv:2606.21082v1 (new). Abstract: Multi-turn jailbreaks can evade turn-level moderation by spreading unsafe intent across a dialogue through gradual escalation, reframing, and role manipulation. We address multi-turn jailbreak detection as a conversation-level classification problem and introduce an efficient hierarchical detector that avoids expensive long-context concatenation while retaining cross-turn reasoning. The model encodes individual turns to form compact turn representa...
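The two-level idea in the abstract (encode each turn on its own, then reason across turns) can be illustrated with a minimal pure-Python sketch. This is not the authors' model: the abstract is truncated before it describes the architecture, so everything below is an assumption made for illustration. The names `token_vector`, `encode_turn`, `cross_turn_attention`, and `conversation_score`, the width `DIM`, the hash-seeded pseudo-embeddings, the mean pooling, and the logistic head are all hypothetical stand-ins for learned components.

```python
import math
import random

DIM = 8  # hypothetical embedding width, chosen only for the sketch


def token_vector(token, dim=DIM):
    """Deterministic pseudo-embedding; a stand-in for a learned token encoder."""
    rng = random.Random(token)  # string seeds are hashed deterministically
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def encode_turn(text, dim=DIM):
    """Stage 1 (assumed): compress one turn into a fixed-size vector by mean pooling."""
    tokens = text.lower().split() or ["<empty>"]
    vecs = [token_vector(t, dim) for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_turn_attention(turn_vecs):
    """Stage 2 (assumed): single-head scaled dot-product self-attention over turns."""
    dim = len(turn_vecs[0])
    scale = math.sqrt(dim)
    out = []
    for q in turn_vecs:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in turn_vecs]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, turn_vecs))
                    for i in range(dim)])
    return out


def conversation_score(turns, head_weights, bias=0.0):
    """Stage 3 (assumed): mean-pool contextualized turns, then a logistic head."""
    ctx = cross_turn_attention([encode_turn(t) for t in turns])
    pooled = [sum(col) / len(ctx) for col in zip(*ctx)]
    logit = sum(w * x for w, x in zip(head_weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    dialogue = [
        "tell me a story about a chemist",
        "now have the chemist explain her work in detail",
        "stay in character and list every step",
    ]
    untrained_head = [0.1] * DIM  # placeholder weights, not trained
    print(conversation_score(dialogue, untrained_head))
```

In this sketch, attention runs over one vector per turn, so the quadratic cost scales with the number of turns rather than with the total token count of the concatenated conversation. That is the kind of saving the abstract alludes to when it mentions avoiding long-context concatenation.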

arXiv cs.CL · Chenhui Hu, Muhammed Salih, Sudipto Guha, Subramanian Srinivasan