TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents
Hugging Face · Daily Papers
This paper is featured in Hugging Face's Daily Papers selection, curated by the AI research community.
Authors: Shoufa Chen, Luyuan Wang, Xuan Yang, Zhiheng Liu, Yuren Cong, Yuanfeng Ji
- 44 community upvotes
- Topics: terminal-use agents, general-purpose agents, computer-use tasks, graphical user interfaces, shell-based workflows, execution-based scoring protocol
Abstract
Original abstract (in English), taken from the paper:
TUA-Bench presents a comprehensive benchmark for evaluating general-purpose terminal-use agents across diverse digital activities and specialized workflows, revealing significant performance gaps among current frontier agents.