Blog LLMs & Texto Robótica & RL

Prompt Framing Distorts Count-Based Evaluation of LLM Error Detection: Evidence from Numeric Anchoring

arXiv:2607.01240v1 Announce Type: new Abstract: Count-based F1 is widely used as a proxy for LLM error-detection quality, but this paper shows that it can rise dramatically without a corresponding improvement in span localization, a gap termed F1 Inflation. The paper introduces ErrorBench, a controlled stress-test protocol for prompt-induced count distortion. ErrorBench evaluates six contemporary LLMs under five prompt conditions over 4,290 responses from 143 CoNLL-2014 passages. Under CoNLL-201...

arXiv cs.CL ·Dekun Yang · 03 de janeiro de 2026

Ver no Hugging Face

// relacionados

Prompt Framing Distorts Count-Based Evaluation of LLM Error Detection: Evidence from Numeric Anchoring

Leia também

O complicado problema do Claude Code com a China envolve proibições dos dois lados do Pacífico

AI Security Institute do Reino Unido descobre que benchmarks padrão subestimam sistematicamente o que agentes de IA realmente conseguem fazer

ByteDance-Seed/EdgeBench

Google DeepMind e A24 anunciam parceria de pesquisa inédita