MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning
arXiv:2606.25225v1 Announce Type: new
Abstract: Self-supervised learning from large-scale video data has emerged as a dominant paradigm for visual representation learning. Since audio and visual streams naturally co-occur in video data, extending this success to jointly learn from both modalities is a natural next step, yet it remains challenging. Existing audio-visual self-supervised methods rely on modality-specific encoders and complex combinations of contrastive or reconstruction objectives,...
arXiv cs.CV
Revant Teotia, Adrien Bardes, Michael Rabbat, Sumit Chopra, Matthew J. Muckley, Nicolas Ballas