EntMTP: Accelerating LLM Inference with Entropy Guided Multi Token Prediction

arXiv:2606.27550v1 Announce Type: new Abstract: Multi-token prediction has been shown to increase data density during training, improve downstream text-generation quality, and serves as the defacto approach for self-speculative decoding. Existing foundation and open source models that use MTP heads commit to a static tree-based attention topology throughout the entire generation sequence, meaning the speculation depth, and thus the compute required during verification, stays constant regardless ...

arXiv cs.CL ·Carrie Chen ·
compartilhar: