Kara: Efficient Reasoning LLM Serving via Sliding-Window KV Cache Compression

arXiv:2607.01237v1. Abstract: Reasoning language models often generate long chain-of-thought (CoT), which accumulates a massive KV cache during the decoding phase and incurs high decoding latency and limited throughput. To address these issues, KV cache compression has emerged as a promising technique for reducing memory overhead by selectively removing unimportant KV pairs while preserving useful ones for subsequent decoding. Nevertheless, we identify two key limitations in ex...

arXiv cs.CL · Shen Han, Yuyang Wu