Blog LLMs & Texto Dados & Embeddings

PHREEQC-MCQ-200: A Diagnostic Benchmark for Tool-Augmented Scientific Simulator Agents

arXiv:2607.00436v1 Announce Type: new Abstract: Large language model agents are increasingly connected to scientific software, yet it remains unclear when tool access makes scientific computation more reliable rather than merely more complex. We introduce PHREEQC-MCQ-200, a benchmark for evaluating tool-augmented agents on deterministic aqueous-geochemistry simulations. The benchmark contains 200 multiple-choice questions derived from 21 validated PHREEQC scenarios, requiring agents to construct...

arXiv cs.AI ·Ke Zhang, Sahchit Chundur, Mohammad Javad Qomi, Maziar Raissi · 02 de janeiro de 2026

Ver no Hugging Face

// relacionados

PHREEQC-MCQ-200: A Diagnostic Benchmark for Tool-Augmented Scientific Simulator Agents

Leia também

Claude Sonnet 5: a Anthropic aposta que o modelo do meio faz o trabalho do topo

Google’s AI buildout drove 37% increase in electricity use in 2025

OpenAI reportedly offers the Trump administration a five percent stake in the company

The Google Health API Got a CLI: ghealth is an Open-Source Tool for Your Fitbit Air Data