MJEPA: A Simple and Scalable Joint-Embedding Predictive Architecture for Audio-Visual Learning

arXiv:2606.25225v1. Abstract: Self-supervised learning from large-scale video data has emerged as a dominant paradigm for visual representation learning. Since audio and visual streams naturally co-occur in video data, extending this success to jointly learn from both modalities is a natural next step, yet it remains challenging. Existing audio-visual self-supervised methods rely on modality-specific encoders and complex combinations of contrastive or reconstruction objectives,...

arXiv cs.CV · Revant Teotia, Adrien Bardes, Michael Rabbat, Sumit Chopra, Matthew J. Muckley, Nicolas Ballas · January 25, 2026
