
Forget Without Compromise: Nexus Sampling for Streaming KV-Cache Eviction Under Fixed Budgets

arXiv:2606.23961v1

Abstract: Long-context and agentic LLM workloads push the KV cache past any fixed memory budget, forcing the inference stack to permanently evict tokens at every step of a continuous-inference stream. Existing methods all share the same template, a per-step direct-attention score followed by deterministic top-$K$ selection, which converts a single below-cutoff step into an irreversible verdict and permanently erases any subtly important token that direct att...

arXiv cs.LG · Duc Duong, Hoang Anh Duy Le, Jianwen Xie, Anshumali Shrivastava, Zhaozhuo Xu · January 24, 2026
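
To make the criticized baseline concrete, here is a minimal sketch of the template the abstract describes: score every cached token at each step, keep the top-$K$, and drop the rest for good. It illustrates the existing approach only, not Nexus Sampling, whose details the truncated abstract does not give. The function names (`topk_evict`, `run_stream`), the tie-breaking rule, and the toy scores are all assumptions made for illustration.

```python
from typing import Dict, List


def topk_evict(cached_ids: List[int], scores: List[float], budget: int) -> List[int]:
    """Keep the `budget` highest-scoring token ids; everything else is dropped."""
    if len(cached_ids) != len(scores):
        raise ValueError("cached_ids and scores must have the same length")
    if len(cached_ids) <= budget:
        return list(cached_ids)
    # Rank positions by score, highest first. Ties go to the larger (more
    # recent) token id; this tie-break is an assumption, not from the paper.
    ranked = sorted(
        range(len(cached_ids)),
        key=lambda i: (scores[i], cached_ids[i]),
        reverse=True,
    )
    keep = sorted(ranked[:budget])  # restore the original cache order
    return [cached_ids[i] for i in keep]


def run_stream(per_step_scores: List[Dict[int, float]], budget: int) -> List[int]:
    """Simulate a stream: one new token per step, then top-K eviction.

    `per_step_scores[t]` maps token id -> hypothetical direct-attention score
    at step t. Scores for tokens that were already evicted are ignored,
    because an evicted token can never re-enter the cache.
    """
    cache: List[int] = []
    for t, step_scores in enumerate(per_step_scores):
        cache.append(t)  # the token generated at step t enters the cache
        scores = [step_scores.get(tok, 0.0) for tok in cache]
        cache = topk_evict(cache, scores, budget)
    return cache


# Toy scores: token 1 dips below the cutoff once (step 2), then would have
# been the most attended token at step 3.
steps = [
    {0: 1.0},
    {0: 0.9, 1: 0.8},
    {0: 0.9, 1: 0.1, 2: 0.5},
    {0: 0.2, 1: 0.99, 2: 0.3, 3: 0.4},
]
final_cache = run_stream(steps, budget=2)
print(final_cache)
```

In this toy run, token 1 is evicted at step 2 after a single low score, so its high score at step 3 has no effect. That is the "irreversible verdict" behavior the abstract attributes to deterministic top-$K$ selection.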


