DOPD: Dual On-policy Distillation

Hugging Face · Daily Papers · Xinlei Yu, Gen Li · ▲ 75 upvotes

This paper is featured in the Hugging Face Daily Papers selection, curated by the AI research community.

Authors: Xinlei Yu, Gen Li, Qingyi Si, Guibin Zhang, Yuqi Xu, Congcong Wang

  • 75 community upvotes
  • Topics: on-policy distillation, token-level signals, privileged information, privilege illusion, advantage-aware dual distillation, dynamic routing

Summary

Original summary, taken from the paper:

DOPD addresses privilege illusion in on-policy distillation by dynamically routing token-level supervision between teacher and student policies based on advantage gaps and probabilities, improving capability transfer in large and vision-language models.
