OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

Hugging Face · Daily Papers · Mengqi Yuan, Zilong Zhou · ▲ 15 upvotes

This paper is featured in Hugging Face's Daily Papers selection, curated by the AI research community.

Authors: Mengqi Yuan, Zilong Zhou, Xinzhuang Xiong, Weiming Wu, Jiayang Sun, Jiamin Song

  • 15 community upvotes
  • Topics: computer-use workflows, long-horizon tasks, agent-pattern challenges, cross-source reasoning, implicit-state inference, visual-spatial precision

Abstract

Original abstract, taken from the paper:

OSWorld 2.0 presents a comprehensive benchmark for evaluating computer-use agents through complex, real-world workflows that reveal current limitations in agent reasoning and task completion.
