Multimodal — Radar de IA

DiffusionGemma: o Google aplica difusão à escrita de texto

Em vez de produzir uma palavra por vez, o novo modelo aberto do Google revela blocos inteiros de texto de uma só vez — e passa de 1.100 tokens por segundo. O preço é uma troca explícita de precisão por velocidade.

23.06.2026

Modelo Multimodal

HauhauCS/Gemma4-12B-QAT-Uncensored-HauhauCS-Balanced

Modelo de visão e linguagem em alta no Hugging Face — 2.1 mil downloads e 45 curtidas da comunidade.

23.06.2026 ·↓ 2056

Blog LLMs & Texto

Jury Duty: Calibration and Orientation Failures in MLLM-as-a-Judge Under Cultural Ambiguity

arXiv:2606.20676v1 Announce Type: new Abstract: MLLM-as-a-Judge is conventionally validated by agreement with human annotations, but this metric is undefined when the human pool is culturally heterogeneous. We introduce VOIR DIRE, a multimodal benchmark of 626 culturally paired image--prompt artifacts spanning U.S. and mainland Chinese contexts across food, fashion, and architecture, with annotator pools that are within-pool reliable (a = 0.86/0.74) but cross-pool divergent on evaluation (Q1 r =...

23.06.2026

Blog LLMs & Texto

Is Our Benchmark Enough? An Analysis of Continual Learning for MLLMs

arXiv:2606.20961v1 Announce Type: new Abstract: Continual adaptation is essential for multimodal large language models (MLLMs) deployed across evolving domains, but the state-of-the-art MR-LoRA method highly relies on the assumption that a MLLM-based router is necessary to process complex multimodal inputs. This paper revisits this claim on the MLLM-CL benchmark and argues for two claims. \textbf{First}, routing does not require an MLLM: a simple training-free, replay-free ptotypical routing met...

23.06.2026

Blog LLMs & Texto

SciLens: Multi-modal Scientific Claim Verification with Agentic Entailment and Grounding

arXiv:2606.20873v1 Announce Type: new Abstract: Scientific discovery increasingly relies on automated systems that generate hypotheses, inspect multimodal evidence, and validate claims at scale. Yet scientific claim verification is not well served by asking a vision-language model for a direct binary judgment: claims often combine numerical results, comparisons, scope qualifiers, and explanatory context, while evidence is encoded in tables and figures with distinct grounding structures. We prese...

23.06.2026

Blog LLMs & Texto

MIRAGE: Stealthy Visual Prompt Injection for Vulnerability Detection in Web Agents

arXiv:2606.20717v1 Announce Type: new Abstract: Multimodal Large Language Model (MLLM)-based web agents provide practical, high-precision solutions for visual browser automation; however, they inherently expand the attack surface, introducing novel vision-based vulnerabilities. Existing adversarial evaluations targeting these agents frequently rely on permissive threat models and visually conspicuous artifacts. In this paper, we investigate a constrained vulnerability detection setting: a truste...

23.06.2026

Blog Robótica & RL

Perturbation-Based Uncertainty for Failure Detection in Vision-Language-Action Models

arXiv:2606.20754v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models have shown strong performance in robotic manipulation, but reliable uncertainty quantification remains challenging, particularly under distribution shift. Unlike autoregressive policies, many modern VLA models generate continuous actions through regression or flow-based generation, where explicit predictive probabilities are unavailable. Moreover, existing approaches often rely on stochastic action sampling or su...

23.06.2026

Blog Multimodal

How Well Can Your Video Model Remember? Measuring Memory-Budget Trade-offs in Long Video Understanding

arXiv:2606.20726v1 Announce Type: new Abstract: We introduce a compact empirical model that quantifies how answer accuracy degrades as a function of frame budget B and temporal distance D in long video understanding -- analyzing performance when recalling content from D seconds in the past using a fraction B of total frames. Long-form models operate under strict budgets, yet no prior framework predicts how accuracy degrades as B shrinks and events recede. We fit a weighted least-squares model on...

23.06.2026

Blog Robótica & RL

Latent Goal Prediction from Language for Model-Based Planning

arXiv:2606.20627v1 Announce Type: new Abstract: Planning with world models is bottlenecked by compounding prediction errors and the difficulty of defining optimizable goals. Visual targets provide precise local gradients but poor distant guidance, while language is flexible yet limited by noisy cross-modal alignment or dependence on large generative models unsuited for the high-sampling nature of model-based planning. To address these challenges, we introduce Latent Goal Prediction from Language...

23.06.2026

Blog LLMs & Texto

REKEY: Metadata-Grounded Visual-Key Regeneration for Contamination-Resilient VQA Evaluation

arXiv:2606.20736v1 Announce Type: new Abstract: Static visual question answering (VQA) benchmarks age quickly: Once the items leak into training corpora, scores can reflect memorization rather than genuine visual ability, thus obscuring real progress. Rebuilding high-quality benchmarks such as V*Bench requires substantial human annotation, yet each static release can quickly become another leaked artifact. We propose ReKey, a live benchmark protocol that randomly regenerates the answer-bearing l...

23.06.2026

Blog LLMs & Texto

SPARC: A Multi-Agent System for Electrical Circuit Question Answering

arXiv:2606.20643v1 Announce Type: new Abstract: Electrical circuit diagram QA tasks require complex mathematical reasoning, which remains challenging for multimodal LLMs. We present SPARC, a multi-agent system that answers questions over circuit diagrams by grounding reasoning in executable physics-based simulations. SPARC uses LLM agents to synthesize, execute, and analyze simulation programs, improving accuracy and reliability by design. It achieves 83% accuracy, with up to a 58% absolute impr...

23.06.2026

Blog Multimodal

An approach with Visual and Tabular Mamba to multimodal medical data using Mixed Fusion

arXiv:2606.20738v1 Announce Type: new Abstract: This article presents a complementary approach for integrating multimodal medical data in cancer classification, based on state space models represented by the Mamba architecture. To this end, a mixed multimodal fusion architecture, called Mixed Fusion, was employed and developed to enhance the interpretability of the decision-making process. The proposed approach explores two variants of Mamba: one dedicated to visual processing, responsible for c...

23.06.2026