Blog LLMs & Texto

DriftGuard: Safety-Aware Multi-Monitor Detection and Selective Adaptation for Evolving Toxicity Moderation

arXiv:2606.28725v1 Announce Type: new Abstract: Automated toxicity moderation systems operate in dynamic online environments where harmful behavior evolves through coded language, shifting targets, and strategic adaptation to enforcement. Existing drift detection methods often focus on global distributional change, but such signals may miss safety-relevant shifts that emerge in localized harm subspaces or high-risk model-error regions. This paper introduces DriftGuard, a safety-aware adaptive mo...

arXiv cs.CL ·Yuting Xin, Hanyu Cai, Binqi Shen, Lier Jin, Lan Hu · 30 de janeiro de 2026

Ver no Hugging Face

// relacionados

DriftGuard: Safety-Aware Multi-Monitor Detection and Selective Adaptation for Evolving Toxicity Moderation

Leia também

nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16

OpenClaw is finally available on Android and iOS

Claude Science is Anthropic’s newest flagship product

Anthropic Claude Sonnet 5 vs Sonnet 4.6 vs Opus 4.8: Agentic Coding Benchmarks, API Pricing, and Cost-Performance Tradeoffs Compared