Blog
LLMs & Texto
Kara: Efficient Reasoning LLM Serving via Sliding-Window KV Cache Compression
arXiv:2607.01237v1 Announce Type: new Abstract: Reasoning language models often generate long chain-of-thought (CoT), which accumulates a massive KV cache during the decoding phase and incurs high decoding latency and limited throughput. To address these issues, KV cache compression has emerged as a promising technique for reducing memory overhead by selectively removing unimportant KV pairs while preserving useful ones for subsequent decoding. Nevertheless, we identify two key limitations in ex...
arXiv cs.CL
·Shen Han, Yuyang Wu
·
// relacionados
Leia também
Blog
O complicado problema do Claude Code com a China envolve proibições dos dois lados do Pacífico
Blog
AI Security Institute do Reino Unido descobre que benchmarks padrão subestimam sistematicamente o que agentes de IA realmente conseguem fazer
Dataset
ByteDance-Seed/EdgeBench
Blog