CELEUS: Certifiable and Efficient LLM Evaluation via E-Processes

arXiv:2606.20820v1 Announce Type: new Abstract: Can we trust evaluation scores to capture an LLM's true real-world performance? Certifiable evaluation answers this question by providing guarantees for LLM evaluation. In particular, existing methods curate evaluation samples sequentially and keep updating confidence intervals (CIs) that cover the true performance with high probability (e.g., 95%) until a stopping condition is satisfied, e.g., the CI width reaches a target precision. However, existing ...
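The sequential loop the abstract attributes to existing methods (draw a sample, update the CI, stop once the width reaches the target) can be sketched as follows. This is not the CELEUS procedure and does not use e-processes. It is a minimal illustration that assumes i.i.d. scores in [0, 1] and swaps in a plain Hoeffding bound with a union bound over time so that the interval remains valid at whatever step the loop stops. The names `hoeffding_radius` and `sequential_evaluate`, and the parameters `alpha`, `target_width`, and `max_samples`, are hypothetical.

```python
import math
import random


def hoeffding_radius(t, alpha):
    """Half-width of a time-uniform Hoeffding interval after t samples.

    Assumption: scores are i.i.d. in [0, 1]. Step t spends error budget
    alpha / (t * (t + 1)); these budgets sum to alpha over all t, so the
    intervals hold simultaneously with probability at least 1 - alpha.
    """
    alpha_t = alpha / (t * (t + 1))
    return math.sqrt(math.log(2.0 / alpha_t) / (2.0 * t))


def sequential_evaluate(score_stream, alpha=0.05, target_width=0.1,
                        max_samples=100_000):
    """Consume scores one at a time until the CI is narrow enough.

    Returns (lower, upper, samples_used). Stops early when the interval
    width drops to target_width, or after max_samples scores.
    """
    total = 0.0
    t = 0
    lower, upper = 0.0, 1.0
    for score in score_stream:
        t += 1
        total += score
        mean = total / t
        radius = hoeffding_radius(t, alpha)
        # Clip to [0, 1] because the true mean score cannot leave that range.
        lower = max(0.0, mean - radius)
        upper = min(1.0, mean + radius)
        if upper - lower <= target_width or t >= max_samples:
            break
    return lower, upper, t


if __name__ == "__main__":
    # Hypothetical model whose true accuracy is 0.7 on the evaluation task.
    rng = random.Random(0)
    stream = (1.0 if rng.random() < 0.7 else 0.0 for _ in range(100_000))
    lower, upper, used = sequential_evaluate(stream)
    print(f"CI = [{lower:.3f}, {upper:.3f}] after {used} samples")
```

The union bound makes this interval conservative, so it typically needs several thousand samples to reach a width of 0.1. That sample cost is the kind of inefficiency a tighter sequential method would aim to reduce.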

arXiv cs.LG · Zhijian Zhou, Zesheng Ye, Zhaorun Chen, Bo Li, Feng Liu