Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL

arXiv:2607.01490v1 · Announce Type: new

Abstract: Reinforcement learning post-training dramatically improves LLM reasoning, but suffers from training instability and diversity collapse. Advantage functions offer an appealing fix: they reshape the training objective, reweight which rollouts drive learning, and are trivial to implement. Yet a proliferation of methods makes it unclear which advantage to use and when. We cut through the confusion with a unifying framework that decomposes any advantage...

arXiv cs.LG · Juliette Decugis, Sean O'Brien, Francis Bach, Gabriel Synnaeve, Taco Cohen