
Quantization Inflates Reasoning: Token Inflation as a Hidden Cost of Low-Bit Reasoning Models

arXiv:2606.25519v1. Abstract: Quantization is widely used to reduce the inference cost of large language models, but its effect on reasoning models is not fully captured by final-answer accuracy or per-token latency. We show that low-bit post-training quantization can introduce a hidden test-time compute cost: quantized reasoning models often generate longer chains of thought even when they still answer correctly. Across mathematical reasoning, code generation, scientific quest...

arXiv cs.AI · Xinyu Lian, Walid Krichene, Beichen Huang, Masahiro Tanaka, Olatunji Ruwase, Li Zhang, Minjia Zhang · January 25, 2026
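
To make the abstract's point concrete, here is a minimal sketch of how longer outputs can eat into a per-token speedup. It is not the paper's metric or code: the function names `token_inflation` and `effective_speedup`, the choice to average length ratios over prompts both models answer correctly, and all the numbers are assumptions made for illustration.

```python
from statistics import mean


def token_inflation(baseline, quantized):
    """Mean quantized/baseline output-length ratio.

    Each argument is a list of (num_output_tokens, is_correct) pairs,
    one per prompt, in the same prompt order. Only prompts that BOTH
    models answer correctly are counted (an assumption of this sketch),
    so the ratio reflects longer reasoning rather than lost accuracy.
    """
    ratios = [
        q_tokens / b_tokens
        for (b_tokens, b_ok), (q_tokens, q_ok) in zip(baseline, quantized)
        if b_ok and q_ok
    ]
    if not ratios:
        raise ValueError("no prompt was answered correctly by both models")
    return mean(ratios)


def effective_speedup(per_token_speedup, inflation):
    """End-to-end speedup once the extra generated tokens are paid for."""
    return per_token_speedup / inflation


# Made-up measurements: (output tokens, answered correctly?)
baseline = [(1000, True), (2000, True), (1500, False)]
quantized = [(1500, True), (2500, True), (1800, True)]

inflation = token_inflation(baseline, quantized)
speedup = effective_speedup(2.0, inflation)  # 2.0x per-token speedup is hypothetical
print(f"inflation={inflation:.3f}, effective speedup={speedup:.2f}x")
```

With these invented numbers, a model that decodes each token twice as fast but writes about 1.4 times as many tokens ends up well short of a 2x end-to-end gain, which is the kind of cost that per-token latency alone would not show.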
