Testing Frontier Large Language Models' Physics Literacy in Parallel Physical Worlds
arXiv:2607.00276v1 (announce type: new)
Abstract: Current large-language-model (LLM) physics benchmarks are usually scored by answer accuracy, which cannot distinguish genuine reasoning from recall of familiar problem patterns and reveals little about where a model's reasoning breaks down. We introduce an auditable four-stage diagnostic that evaluates whether an LLM can reason inside an unfamiliar physics framework through induction, formulation, prediction, and review. The diagnostic combines loc...
arXiv cs.LG
·Dong Zhang