Blog LLMs & Texto

Scaling Trends for Lie Detector Oversight in Preference Learning

arXiv:2607.01567v1 Announce Type: new Abstract: Deceptive behavior in LLMs is costly to monitor and prevent, motivating approaches such as Scalable Oversight via Lie Detectors (SOLiD) (Cundy & Gleave, 2025), which uses lie detectors to identify responses for review by high-cost labelers. In this paper, we scale SOLiD to larger models and evaluate it in more diverse and realistic preference-learning settings. We find favorable scaling: undetected deception drops from 34% for 1B-parameter models t...

arXiv cs.AI ·Oskar J. Hollinsworth, Ann-Kathrin Dombrowski, Sam Adam-Day, Adam Gleave, Chris Cundy · 03 de janeiro de 2026

Ver no Hugging Face

// relacionados

Scaling Trends for Lie Detector Oversight in Preference Learning

Leia também

O complicado problema do Claude Code com a China envolve proibições dos dois lados do Pacífico

AI Security Institute do Reino Unido descobre que benchmarks padrão subestimam sistematicamente o que agentes de IA realmente conseguem fazer

ByteDance-Seed/EdgeBench

Google DeepMind e A24 anunciam parceria de pesquisa inédita