AVTok: 1D Unified Tokenization for Holistic Audio-Video Generation

arXiv:2606.30811v1 Announce Type: new Abstract: Audio-video generation has recently gained unprecedented research attention, aiming to synthesize high-quality sounding video content with fine-grained synchronization and semantic alignment between the auditory and visual components. Prior methods predominantly adopt a dual-branch design with separate tokenization and generation modules per modality, neglecting the representation gap while necessitating intensive computational resources fo...

arXiv cs.CV · Kien T. Pham, I Chieh Chen, Qifeng Chen, Long Chen