Comparing Transformers and Hybrid Models at the Token Level

arXiv:2606.20936v1 Announce Type: new Abstract: Hybrid language models that mix attention and recurrent layers have shown promise: theoretically, recurrent layers ameliorate the limitations of pure transformers on state tracking, and empirically, hybrids can outperform pure transformers in loss and downstream evaluations \citep{waleffe2024empirical,merrill2026olmohybrid}. Yet it remains unclear which data or capabilities drive these gains, and to what degree they reflect the theoretical advantag...

arXiv cs.CL ·Yanhong Li, William Merrill ·
compartilhar: