RoPE-Aware Bit Allocation for KV-Cache Quantization

Hugging Face · Daily Papers · Fengfeng Liang, Yuechen Zhang · ▲ 4 upvotes

This paper is featured in the Hugging Face Daily Papers selection, curated by the AI research community.

Authors: Fengfeng Liang, Yuechen Zhang, Jiaya Jia

  • 4 community upvotes
  • Topics: RoPE, KV-cache quantization, bit allocation, TurboQuant-MSE, TQ-MSE, attention logit

Abstract

Original abstract, taken from the paper:

Block-GTQ introduces a RoPE-aware bit allocation method for key-cache quantization that improves attention accuracy and downstream performance through adaptive bit distribution and packed cache serving.

