When Retrieval Metrics Mislead: Measuring Policy Signal in Long-Horizon Tool-Use Agents
arXiv:2606.23937v1
Announce Type: new
Abstract: Exact-match retrieval recall is often used as a proxy for whether a retriever supplies useful policy context to a downstream decision model. We test this proxy for pre-action policy classification in tau-bench using Qwen2.5-3B/7B classifiers. Under gold-policy conditioning, a compact structured state improves macro-F1 over raw trajectories by 0.13-0.17 after tuning. We then replace the benchmark-designated policy clause with the top-ranked clause r...
arXiv cs.CL
Tianyu Ding, Juan Pablo De la Cruz Weinstein