Search: multimodal | Ponto Zero

Guide

Multimodal AI: Models That See, Read, and Hear

Multimodal

Understand multimodal AI: how a single model combines text, image, audio, and video, why that matters, the path from CLIP to VLMs and any-to-any models, real-world uses, and limits.

Article

Visual Tokens: How a Model 'Reads' an Image

Multimodal

How a multimodal model turns an image into tokens the LLM understands: patches, projection into the text space, and why images cost so many tokens.

News

Jury Duty: Calibration and Orientation Failures in MLLM-as-a-Judge Under Cultural Ambiguity

LLMs & Text

arXiv:2606.20676v1 Announce Type: new Abstract: MLLM-as-a-Judge is conventionally validated by agreement with human annotations, but this metric is undefined when the human pool is culturally heterogeneous. We introduce VOIR DIRE, a multimodal benchmark of 626 culturally paired image–prompt artifacts spanning U.S. and mainland Chinese contexts across food, fashion, and architecture, with annotator pools that are within-pool reliable (a = 0.86/0.74) but cross-pool divergent on evaluation (Q1 r =...

News

Is Our Benchmark Enough? An Analysis of Continual Learning for MLLMs

LLMs & Text

arXiv:2606.20961v1 Announce Type: new Abstract: Continual adaptation is essential for multimodal large language models (MLLMs) deployed across evolving domains, but the state-of-the-art MR-LoRA method relies heavily on the assumption that an MLLM-based router is necessary to process complex multimodal inputs. This paper revisits this claim on the MLLM-CL benchmark and argues for two claims. First, routing does not require an MLLM: a simple training-free, replay-free prototypical routing met...

News

SciLens: Multi-modal Scientific Claim Verification with Agentic Entailment and Grounding

LLMs & Text

arXiv:2606.20873v1 Announce Type: new Abstract: Scientific discovery increasingly relies on automated systems that generate hypotheses, inspect multimodal evidence, and validate claims at scale. Yet scientific claim verification is not well served by asking a vision-language model for a direct binary judgment: claims often combine numerical results, comparisons, scope qualifiers, and explanatory context, while evidence is encoded in tables and figures with distinct grounding structures. We prese...

News

UNSEEN: Uncertainty-aware Navigation via Sparse Estimation in Unknown Environments

Robotics & RL

arXiv:2606.20755v1 Announce Type: new Abstract: Visual navigation in unknown environments remains a core challenge in mobile robotics, especially for resource-constrained platforms. Most existing approaches rely on loosely coupled modular pipelines and strong assumptions on perception quality or environmental structure, often resorting to multi-modal sensor suites that increase system complexity and deployment cost. Vision-only navigation offers a lightweight alternative, but its performance deg...

News

MIRAGE: Stealthy Visual Prompt Injection for Vulnerability Detection in Web Agents

LLMs & Text

arXiv:2606.20717v1 Announce Type: new Abstract: Multimodal Large Language Model (MLLM)-based web agents provide practical, high-precision solutions for visual browser automation; however, they inherently expand the attack surface, introducing novel vision-based vulnerabilities. Existing adversarial evaluations targeting these agents frequently rely on permissive threat models and visually conspicuous artifacts. In this paper, we investigate a constrained vulnerability detection setting: a truste...

News

MindAlign: Decoding Inner Speech from fMRI Signals via Multimodal Embedding Alignment under Limited Data

LLMs & Text

arXiv:2606.20696v1 Announce Type: new Abstract: Decoding inner speech from non-invasive brain signals remains a fundamental challenge due to the absence of overt linguistic output, limited training data, and large inter-subject variability. Existing brain-to-text approaches often rely on task-specific decoder fine-tuning, which restricts scalability and complicates adaptation to new participants. We propose MindAlign, a decoupled two-stage brain-to-language framework that enables open-ended text...

News

A UAV-Based Multi-Modal Vision System for Automated Sideslope Deformation Monitoring and Hazard Detection

LLMs & Text

arXiv:2606.20681v1 Announce Type: new Abstract: Slope hazards constitute a major safety threat to expressway infrastructure, and their evolution is typically manifested as slow surface deformation. Conventional manual inspection suffers from low efficiency and inadequate operational safety, especially on severely deteriorated slopes. Accordingly, there is an urgent need for an automated, high-precision solution capable of large-area slope observation and analysis. This study aims to develop a hi...

News

SPARC: A Multi-Agent System for Electrical Circuit Question Answering

LLMs & Text

arXiv:2606.20643v1 Announce Type: new Abstract: Electrical circuit diagram QA tasks require complex mathematical reasoning, which remains challenging for multimodal LLMs. We present SPARC, a multi-agent system that answers questions over circuit diagrams by grounding reasoning in executable physics-based simulations. SPARC uses LLM agents to synthesize, execute, and analyze simulation programs, improving accuracy and reliability by design. It achieves 83% accuracy, with up to a 58% absolute impr...

News

An approach with Visual and Tabular Mamba to multimodal medical data using Mixed Fusion

Multimodal

arXiv:2606.20738v1 Announce Type: new Abstract: This article presents a complementary approach for integrating multimodal medical data in cancer classification, based on state space models represented by the Mamba architecture. To this end, a mixed multimodal fusion architecture, called Mixed Fusion, was employed and developed to enhance the interpretability of the decision-making process. The proposed approach explores two variants of Mamba: one dedicated to visual processing, responsible for c...

News

Evaluation of Medical Vision Language Models HuluMed and MedGemma, and general purpose chatbots Gemma 3, ChatGPT Plus, and Claude Pro on real previously unseen wound images

LLMs & Text

arXiv:2606.20723v1 Announce Type: new Abstract: Chronic wound assessment remains a clinically challenging task that requires accurate interpretation of wound morphology, tissue composition, vascular characteristics, and infection risk. Recent advances in Vision-Language Models (VLMs) have introduced the possibility of automated multimodal wound analysis through image understanding combined with clinical reasoning. This study evaluates the performance of several general-purpose and medically spec...

News

AEF-Econ: Toward Plug-and-Play Socioeconomic Foundation Embeddings from AlphaEarth for Urban Remote Sensing

Multimodal

arXiv:2606.20697v1 Announce Type: new Abstract: AlphaEarth Foundations (AEF) unify global remote sensing foundation embeddings through multimodal self-supervised learning, but their pretraining focuses on physical land-surface signals, limiting plug-and-play use in socioeconomic tasks. We integrate seven heterogeneous data streams across 36 Chinese cities over eight years - AEF embeddings, population, nighttime lights, remote sensing indices, points of interest (POIs), urban morphology, and cros...

News

Beyond 'One Language, One Script': Quantifying Orthographic Bias in Multilingual VLMs with PuMVR

Multimodal

arXiv:2606.20770v1 Announce Type: new Abstract: Current Vision-Language Models (VLMs) are celebrated for their multilingual capabilities, yet they operate under a flawed assumption: that one language corresponds to a single writing system. This overlooks billions of users of multi-script languages like Punjabi, Serbian, Hindi-Urdu, Kurdish, among many others, for whom a model's capability may be fractured by orthographic bias. We introduce PuMVR (Punjabi Multimodal Visual Reasoning), the first b...

News

Evidential Fusion Network for Multimodal Survival Prediction under Missing Modalities

Multimodal

arXiv:2606.20757v1 Announce Type: new Abstract: Recent multimodal survival prediction models have demonstrated strong predictive performance by leveraging complementary information across modalities. However, such models generally assume data completeness and exhibit limited robustness toward missing modalities, which are frequently encountered in real-world clinical settings. We propose the Evidential Missing Modality Survival Fusion (EMMS) model for multimodal survival prediction under missing...

News

GEOPHYS: The Geometry of Physical Plausibility

LLMs & Text

arXiv:2606.20707v1 Announce Type: new Abstract: While humans can identify physically implausible events within milliseconds, machine learning approaches addressing the same problem are extremely slow and expensive. They either rely on external multimodal-LLM judges or require ad-hoc modifications to the training procedure. In this work, we argue that indicators of physical plausibility are implicitly captured by five geometric properties of the per-frame embeddings produced by frozen image encod...

News

MAGNIFIED: RL Fine-tuning of Multimodal Large Language Models for Motion Planning

LLMs & Text

arXiv:2606.20641v1 Announce Type: new Abstract: Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in semantic understanding and common sense reasoning, making them promising candidates for solving planning problems in autonomous driving. However, the next-token text prediction objectives traditionally used in pre-training and supervised fine-tuning (SFT) of MLLMs may fall short of fulfilling the planning objectives for autonomous vehicles. The next-token predict...

News

MiniMax M3: the open-weight multimodal model that runs for 24 hours nonstop

Multimodal

With 428 billion parameters, a 1-million-token context, and native image and video support, M3 was shown optimizing GPU kernels for 24 hours straight, raising utilization from 7.6% to 71.3%.

News

BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language

Multimodal

BioMatrix is a novel multimodal foundation model that integrates molecular sequences, structures, and natural language into a unified decoder-only architecture for diverse biologic…

Modality

Multimodal

Models that combine text, image, audio, and video in a single line of reasoning.

20 results for "multimodal"
