Multimodal — Radar de IA

Robust Zero-Shot Generalization for Open-Vocabulary Action Recognition via Task Arithmetic

arXiv:2606.20734v1 Announce Type: new Abstract: Open Vocabulary Action Recognition (OVAR) enables the recognition of novel actions by leveraging vision-language representations, overcoming the limitations of traditional closed-set approaches. However, achieving robust performance in real-world scenarios typically requires domain-specific fine-tuning, which is often costly and raises privacy and regulatory concerns. In this work, we propose an alternative paradigm that bypasses target-domain trai...

23.06.2026

Blog LLMs & Text

Evaluation of Medical Vision Language Models HuluMed and MedGemma, and general purpose chatbots Gemma 3, ChatGPT Plus, and Claude Pro on real previously unseen wound images

arXiv:2606.20723v1 Announce Type: new Abstract: Chronic wound assessment remains a clinically challenging task that requires accurate interpretation of wound morphology, tissue composition, vascular characteristics, and infection risk. Recent advances in Vision-Language Models (VLMs) have introduced the possibility of automated multimodal wound analysis through image understanding combined with clinical reasoning. This study evaluates the performance of several general-purpose and medically spec...

23.06.2026

Blog Multimodal

AEF-Econ: Toward Plug-and-Play Socioeconomic Foundation Embeddings from AlphaEarth for Urban Remote Sensing

arXiv:2606.20697v1 Announce Type: new Abstract: AlphaEarth Foundations (AEF) unify global remote sensing foundation embeddings through multimodal self-supervised learning, but their pretraining focuses on physical land-surface signals, limiting plug-and-play use in socioeconomic tasks. We integrate seven heterogeneous data streams across 36 Chinese cities over eight years - AEF embeddings, population, nighttime lights, remote sensing indices, points of interest (POIs), urban morphology, and cros...

23.06.2026

Blog Multimodal

Beyond 'One Language, One Script': Quantifying Orthographic Bias in Multilingual VLMs with PuMVR

arXiv:2606.20770v1 Announce Type: new Abstract: Current Vision-Language Models (VLMs) are celebrated for their multilingual capabilities, yet they operate under a flawed assumption: that one language corresponds to a single writing system. This overlooks billions of users of multi-script languages like Punjabi, Serbian, Hindi-Urdu, Kurdish, among many others, for whom a model's capability may be fractured by orthographic bias. We introduce PuMVR (Punjabi Multimodal Visual Reasoning), the first b...

23.06.2026

Blog Multimodal

Evidential Fusion Network for Multimodal Survival Prediction under Missing Modalities

arXiv:2606.20757v1 Announce Type: new Abstract: Recent multimodal survival prediction models have demonstrated strong predictive performance by leveraging complementary information across modalities. However, such models generally assume data completeness and exhibit limited robustness toward missing modalities, which are frequently encountered in real-world clinical settings. We propose the Evidential Missing Modality Survival Fusion (EMMS) model for multimodal survival prediction under missing...

23.06.2026

Blog LLMs & Text

GEOPHYS: The Geometry of Physical Plausibility

arXiv:2606.20707v1 Announce Type: new Abstract: While humans can identify physically implausible events within milliseconds, machine learning approaches addressing the same problem are extremely slow and expensive. They either rely on external multimodal-LLM judges or require ad-hoc modifications to the training procedure. In this work, we argue that indicators of physical plausibility are implicitly captured by five geometric properties of the per-frame embeddings produced by frozen image encod...

23.06.2026

Blog LLMs & Text

Video2Code: Generating Interactive Webpages from UI Videos via Action-Aware Revisit

arXiv:2606.20711v1 Announce Type: new Abstract: UI videos provide a natural input for generating interactive webpages, as they capture both webpage appearance and action-triggered state transitions. However, directly applying video-capable vision-language models to this task remains insufficient. Existing models typically rely on sparse sampling or compressed temporal representations, which may miss short action boundaries and break the state-action-state transitions needed to implement webpage ...

23.06.2026

Blog LLMs & Text

MAGNIFIED: RL Fine-tuning of Multimodal Large Language Models for Motion Planning

arXiv:2606.20641v1 Announce Type: new Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in semantic understanding and common sense reasoning, making them promising candidates for solving planning problems in autonomous driving. However, the next-token text prediction objectives traditionally used in pre-training and supervised fine-tuning (SFT) of MLLMs may fall short of fulfilling the planning objectives for autonomous vehicles. The next-token predict...

23.06.2026

Blog LLMs & Text

Beyond Templates: Revisiting Zero-Shot Remote Sensing through Meta-Prompting

arXiv:2606.20702v1 Announce Type: new Abstract: Vision-language models (VLMs) have sparked growing interest in zero-shot Earth Observation (EO) downstream tasks, with further gains enabled by remote-sensing-adapted models. We examine this setting across 17 VLM variants and 12 remote sensing (RS) datasets under Meta-Prompting for Visual Recognition (MPVR), and show that zero-shot performance remains highly sensitive to textual design choices, from the meta-prompts used to guide the LLM in generat...

23.06.2026

Blog Robotics & RL

SafeDojo: Safe Reinforcement Learning for VLA via Interactive World Model

arXiv:2606.20698v1 Announce Type: new Abstract: Safe control is a prerequisite for real-world embodied intelligence, for which safe reinforcement learning has emerged as a promising paradigm. However, existing safe reinforcement learning methods either require costly real-world exploration or depend on hand-crafted safety functions. Neither scales to vision-language-action models deployed in open-world physical environments. We propose SafeDojo, the first model-based safe reinforcement learning ...

23.06.2026

Editorial Multimodal

MiniMax M3: the open-weight multimodal model that runs for 24 hours nonstop

With 428 billion parameters, a 1-million-token context window, and native image and video support, the M3 was shown optimizing GPU kernels for 24 continuous hours, raising utilization from 7.6% to 71.3%.

22.06.2026

Model Multimodal

baidu/Unlimited-OCR

Vision-language model trending on Hugging Face, with 47 downloads and 133 community likes.

22.06.2026