44 results for "LLM"

Article

Reinforcement Learning Explained

Robotics & RL

What reinforcement learning is: agent, environment, action, and reward. How AI learns by trial and error, from AlphaGo to the RLHF used in LLMs, and where it fails.

Article

Mixture of Experts: Giant Models That Cost Little

LLMs & Text

What an MoE (mixture-of-experts) model is: how an LLM can have a trillion parameters and activate only a fraction of them for each token, and why this design dominates current releases.

Article

AI Agents: What They Are and How They Think

LLMs & Text

AI agents explained: how an LLM goes from merely answering to using tools, planning, and acting, and why this is more fragile than it looks.

Article

VLMs: Vision-Language Models

Multimodal

What a VLM (vision-language model) is: how it joins an image encoder to an LLM, what it can do (VQA, captioning, document reading), and its limits.

Guide

LLMs: How Language Models Work

LLMs & Text

Understand once and for all how LLMs work: the transformer architecture, training, why they hallucinate, fine-tuning, RAG, quantization, and agents.

Article

Quantization: Running LLMs on Your Own Computer

LLMs & Text

What quantization is and how it lets you run high-quality LLMs on your own computer, with privacy and zero cost per request.

Article

Visual Tokens: How a Model 'Reads' an Image

Multimodal

How a multimodal model turns an image into tokens the LLM understands: patches, projection into the text space, and why images cost so many tokens.

News

Jury Duty: Calibration and Orientation Failures in MLLM-as-a-Judge Under Cultural Ambiguity

LLMs & Text

arXiv:2606.20676v1. MLLM-as-a-Judge is conventionally validated by agreement with human annotations, but this metric is undefined when the human pool is culturally heterogeneous. We introduce VOIR DIRE, a multimodal benchmark of 626 culturally paired image-prompt artifacts spanning U.S. and mainland Chinese contexts across food, fashion, and architecture, with annotator pools that are within-pool reliable (α = 0.86/0.74) but cross-pool divergent on evaluation (Q1 r =...

News

Is Our Benchmark Enough? An Analysis of Continual Learning for MLLMs

LLMs & Text

arXiv:2606.20961v1. Continual adaptation is essential for multimodal large language models (MLLMs) deployed across evolving domains, but the state-of-the-art MR-LoRA method relies heavily on the assumption that an MLLM-based router is necessary to process complex multimodal inputs. This paper revisits this claim on the MLLM-CL benchmark and argues for two claims. First, routing does not require an MLLM: a simple training-free, replay-free prototypical routing met...

News

A Validation-Gated Mechanistic Account of Suicidality Detection in LLMs

LLMs & Text

arXiv:2606.21078v1. Large language models are increasingly proposed for mental-health applications such as detecting suicidal content, raising the question of what they rely on. We study this mechanistically and use it to ask a narrower question: how to make a causal claim about a model's internal features more trustworthy. Our validation-gated framework, with suicidality detection as a case study, interprets a behavior only after the model is shown to perform it: a c...

News

An LLM-Explainable DRL Framework for Passenger-Directed Autonomous Driving

LLMs & Text

arXiv:2606.20640v1. Autonomous vehicles offer the potential for safer and more efficient mobility, yet public trust remains limited due to the lack of transparency in their decision-making. This work addresses this issue by combining deep reinforcement learning (DRL) for adaptive driving control with large language model (LLM)-based explainability modules designed to communicate agent behavior to passengers. DRL agents were trained in simulation using a Dueling Double...

News

Less is More: Lightweight Prompt Compression for Question Answering Applications on Edge Devices

LLMs & Text

arXiv:2606.20571v1. In agent-driven question answering (QA) applications, retrieval-augmented generation (RAG) is commonly introduced to enhance the response accuracy of large language models (LLMs) by providing additional context. Due to the inherent noise in retrieval results and the coarse granularity of document-level retrieval, the retrieved context often contains substantial redundant information. In this setting, the agent prompt, consisting of the user query a...

News

MIRAGE: Stealthy Visual Prompt Injection for Vulnerability Detection in Web Agents

LLMs & Text

arXiv:2606.20717v1. Multimodal Large Language Model (MLLM)-based web agents provide practical, high-precision solutions for visual browser automation; however, they inherently expand the attack surface, introducing novel vision-based vulnerabilities. Existing adversarial evaluations targeting these agents frequently rely on permissive threat models and visually conspicuous artifacts. In this paper, we investigate a constrained vulnerability detection setting: a truste...

News

The Metanym Game: A Self-Contained, Self-Consistent LLM Peer-Community Benchmark for Structural Intelligence

LLMs & Text

arXiv:2606.21008v1. The metanym game is a competitive word game for LLMs that measures structural intelligence against established cognitive-science constructs. No content is given in advance; the contestants create all of it -- a new kind of analogy test, analogical production falsifiable sentence by sentence, with no fixed test set to leak into training (contamination-resistant by construction). In the council-of-peers benchmark, the contestants also rate each other...

News

Peeking Inside LLMs: Leveraging Internal Artifacts of LLMs for Enhancing Reliability in Legal Classification

LLMs & Text

arXiv:2606.20929v1. Large Language Models (LLMs) are increasingly being adopted in the legal domain. However, despite their strong performance, LLMs are prone to generating incorrect or hallucinated outputs, raising serious concerns about their reliability in high-stakes domains such as law. Detecting the correctness of responses of LLM-based systems is therefore a critical challenge. In this work, we explore the potential of leveraging internal artifacts of LLM to de...

News

Investigating Linguistic Steering: An Analysis of Adjectival Effects Across Large Language Model Architectures

LLMs & Text

arXiv:2606.20572v1. Achieving reliable control of Large Language Models (LLMs) requires a precise, scalable understanding of how they interpret linguistic cues. We introduce a rigorous framework using Shapley values to quantify the steering effect of individual adjectives on model performance, moving beyond anecdotal heuristics to principled attribution. Applying this method to 100 adjectives across a diverse suite of models (including o3, gpt-4o-mini, phi-3, llama-3-...

News

SPARC: A Multi-Agent System for Electrical Circuit Question Answering

LLMs & Text

arXiv:2606.20643v1. Electrical circuit diagram QA tasks require complex mathematical reasoning, which remains challenging for multimodal LLMs. We present SPARC, a multi-agent system that answers questions over circuit diagrams by grounding reasoning in executable physics-based simulations. SPARC uses LLM agents to synthesize, execute, and analyze simulation programs, improving accuracy and reliability by design. It achieves 83% accuracy, with up to a 58% absolute impr...

News

CELEUS: Certifiable and Efficient LLM Evaluation via E-Processes

LLMs & Text

arXiv:2606.20820v1. Can we trust evaluation scores to capture an LLM's true real-world performance? Certifiable evaluation answers this question by providing guarantees for LLM evaluation. In particular, existing methods sequentially curate evaluation samples and keep updating confidence intervals (CIs) that cover the true performance with high probability (e.g., 95%) until some conditions are satisfied, e.g., the CI width reaches a target precision. However, existing ...

News

VeriBound: PAC-Bayesian Generalization Bounds for Process Reward Models Trained with Formal Verification Tools

LLMs & Text

arXiv:2606.20740v1. Process Reward Models (PRMs) provide step-level verification for Large Language Model (LLM) reasoning, yet their training data acquisition remains a bottleneck: human annotation is costly and Monte Carlo roll-out estimates are noisy. A recent approach, FOVER, trains PRMs on step-level error labels automatically annotated by formal verification tools such as Z3 and Isabelle, and empirically observes cross-task generalization from symbolic tasks to d...

News

Latent Personal Memory: Represent personal memory as dynamic soft prompts

LLMs & Text

arXiv:2606.20911v1. Personalizing large language models (LLMs) requires encoding long-term, user-specific behavioral patterns in a way that is computationally efficient, scalable, and compatible with a frozen base model. We present Latent Personal Memory (LPM), a scalable framework that represents user-specific history as a compact, persistent matrix of N latent slots that are interpretable. A shared cross-attention projection network maps these slots into dynamic, i...

News

PeerCheck: Enhancing LLM-Generated Academic Reviews Towards Human-Level Quality

LLMs & Text

arXiv:2606.20897v1. As academic submissions grow, the traditional peer review process struggles to keep up, raising concerns about quality and fairness. A trend of using large language models (LLMs) for assistance has emerged. In this work, we take a critical step toward improving the quality of LLM-generated reviews. We propose the PeerCheck framework, which investigates LLM-human review differences (RQ1) and explores methods to improve LLM-generated review quality (...

News

AlphaMemo: Structured Search-Process Memory for Self-Evolving Alpha Mining Agents

LLMs & Text

arXiv:2606.20625v1. LLM agents are promising for alpha mining by combining financial priors, symbolic reasoning, executable factor generation, and feedback-driven refinement. Yet, they face a combinatorial search space, noisy non-stationary feedback, redundant discoveries, and overfitting risks from naively reusing past successes. To address these challenges, we propose AlphaMemo, a self-evolving alpha mining agent with Structured Search-Process Memory. Rather than m...

News

PEAR: Permutation-Equivariant Adaptive Routing Multi-Agent Debate

LLMs & Text

arXiv:2606.20621v1. Multi-agent debate improves the reliability of large language models (LLMs) through iterative peer critiques. However, fixed topologies often introduce persistent positional biases, amplify unreliable agents, and cause high sensitivity to role assignments. We introduce Permutation-Equivariant Adaptive Routing Multi-Agent Debate (PEAR), an inference-time protocol that dynamically reconfigures communication roles and sparse topologies across...

News

A Multi-Agent Audit Framework for High-Stakes Reasoning: Evaluation and Interpretability in Clinical Mental Health Screening

LLMs & Text

arXiv:2606.21123v1. High-stakes reasoning tasks necessitate transparent and verifiable workflows, yet conventional single-model large language models (LLMs) often struggle with hallucination and low interpretability under zero-shot paradigms. To address this general AI challenge, we propose a Multi-Agent Audit Framework that simulates a collaborative, multi-step verification process. We empirically validate this architecture in the sensitive domain of clinical mental ...

News

Post-Training Recipe, More Than Model Family, Shapes Multi-Agent LLM Conversational Behavior

LLMs & Text

arXiv:2606.20632v1. Multi-LLM systems use multiple language models to deliberate, judge each other's outputs, or coordinate as agents. Their value depends on the models producing measurably different conversational behaviors when given the same input. Prior offline studies recommend drawing one model per family for behavioral diversity, because LLMs prefer outputs from their own family when rating one another in isolation. Whether the same family label predicts behavi...

News

From Knowing to Acting: Benchmarking Self-Awareness Capability of LLM Agents

LLMs & Text

arXiv:2606.20661v1. The integration of external tools has transitioned LLM agents from passive responders to autonomous systems. However, current benchmarks prioritize execution success, neglecting self-awareness capability, the ability to discern whether a problem requires necessary external resources or can be solved via internal parametric knowledge. To address this, we introduce KAPRO (Knowing-Acting Quadrant PRObe), a framework that evaluates cognitive-behavioral...

News

LLM-Based Multi-Reference Evaluation for Efficient and Robust Assessment of Phrase Break Annotations

LLMs & Text

arXiv:2606.21098v1. Reliable evaluation of phrase break annotations is crucial, as subtle variations in prosodic boundaries directly affect the clarity and naturalness of speech. However, existing approaches exhibit major limitations: single-reference evaluation assumes a unique gold phrasing for an utterance despite multiple valid phrasings, while human judgment, though flexible, is labor-intensive and unscalable. To address these, we propose LLM-based Multi-Referenc...

News

In LLM Reasoning, there is Irrationality on top of Value Misalignment

LLMs & Text

arXiv:2606.20624v1. Significant progress has been made in aligning LLMs with target value functions. We argue that, even when an LLM has been well aligned in (post-)training, it may still fail to maximise the aligned value in reasoning. We mathematically formalise this gap as rational value risk: the utility discrepancy between a model's deployed reasoning strategy and its rational counterpart, which is defined to be the responses that maximise expected utility in the...

News

GEOPHYS: The Geometry of Physical Plausibility

LLMs & Text

arXiv:2606.20707v1. While humans can identify physically implausible events within milliseconds, machine learning approaches addressing the same problem are extremely slow and expensive. They either rely on external multimodal-LLM judges or require ad-hoc modifications to the training procedure. In this work, we argue that indicators of physical plausibility are implicitly captured by five geometric properties of the per-frame embeddings produced by frozen image encod...

News

MAGNIFIED: RL Fine-tuning of Multimodal Large Language Models for Motion Planning

LLMs & Text

arXiv:2606.20641v1. Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in semantic understanding and common sense reasoning, making them promising candidates for solving planning problems in autonomous driving. However, the next-token text prediction objectives traditionally used in pre-training and supervised fine-tuning (SFT) of MLLMs may fall short of fulfilling the planning objectives for autonomous vehicles. The next-token predict...

News

Demystifying Numerical Instability in LLM Inference: Achieving Reproducible Inference for Mission-Critical Tasks with HEAL

LLMs & Text

arXiv:2606.21023v1. As Large Language Models (LLMs) deploy into mission-critical domains (e.g., finance, medicine, and law), output reproducibility has become a strict system requirement. While practitioners use greedy decoding to eliminate algorithmic stochasticity, empirical deployments with 16-bit precisions still exhibit catastrophic output divergence across heterogeneous GPUs. Through SASS-level profiling, we reveal that this inconsistency is fundamentally driven...

News

Beyond Templates: Revisiting Zero-Shot Remote Sensing through Meta-Prompting

LLMs & Text

arXiv:2606.20702v1. Vision-language models (VLMs) have sparked growing interest in zero-shot Earth Observation (EO) downstream tasks, with further gains enabled by remote-sensing-adapted models. We examine this setting across 17 VLM variants and 12 remote sensing (RS) datasets under Meta-Prompting for Visual Recognition (MPVR), and show that zero-shot performance remains highly sensitive to textual design choices, from the meta-prompts used to guide the LLM in generat...

News

Harnessing Agent Skills: Architectural Patterns and a Reference Architecture for Skill-Mediated LLM Agents

LLMs & Text

arXiv:2606.20631v1. Agent skills externalise reusable agent-facing behavioural knowledge and guidance as persistent artefacts that can be discovered, activated, and interpreted by LLM agents. Although a skill artefact is static at rest, its architectural responsibilities arise in use, when the artefact is selected for a run, bound to context and authority constraints, interpreted by a stochastic agent, and recorded as run evidence. We call this run-specific relation s...

News

Specific Domain Ontology Construction Using Large Language Models

LLMs & Text

arXiv:2606.20691v1. Ontologies are useful structures to organize and maintain information that can be understood both by humans and systems. However, since their manual crafting is a laborious task, many specific domains lack reference ontologies. The outstanding ability to understand natural language demonstrated by Large Language Models (LLMs) has motivated their application in a variety of fields, including ontology development. This work presents...

News

Event Ontology Expansion via LLM-Based Conceptualization

LLMs & Text

arXiv:2606.21048v1. Event ontology expansion aims to discover emerging event types from data and extend them to appropriate positions in the existing event ontology. Existing methods typically cluster contextualized trigger representations and attach induced clusters to the ontology based on instance-level similarity. However, ontology expansion requires concept-level semantics that characterize event types, whereas contextualized trigger representations often confla...

News

Sakana AI Launches Sakana Fugu: An Orchestration Model That Routes Tasks Across a Swappable Pool of Frontier LLMs

LLMs & Text

Fugu and Fugu Ultra route tasks across a swappable model pool, leading most coding, reasoning, and agentic benchmarks.

News

Sakana AI's Fugu orchestrates multiple LLMs to match Anthropic's Fable and Mythos benchmarks

LLMs & Text

Japanese AI startup Sakana AI is launching Fugu, a system that coordinates multiple AI models on the fly to compete with leaders like Anthropic's Fable 5. The approach also aims to cut dependence on any single AI provider.

News

The 7 Types of Agent Memory: A Technical Guide for AI Engineers

LLMs & Text

LLMs are stateless by default. Agent memory fixes that. This guide breaks down all 7 types: working, semantic, episodic, procedural, retrieval, parametric, and prospective. It covers what each stores, where it lives, and when to build it. Includes a comparison table and working Python code.

News

Kimi-K2.7-Code: The Open LLM Aimed at the Work of Programmers

LLMs & Text

With around 363 thousand downloads on Hugging Face, Moonshot AI's model captures a trend: open, code-specialized language models are going from curiosity to working tool.

News

PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems

LLMs & Text

PlanBench-XL evaluates large language model agents' ability to plan and adapt in complex tool-rich environments with limited visibility and dynamic disruptions.

News

Cisco AI Introduces FAPO: Pipeline-Aware Prompt Optimization With Step-Level Failure Attribution and Claude Code Orchestration

LLMs & Text

Cisco Foundation AI has open-sourced FAPO (Fully Automated Prompt Optimization), a Claude Code-driven system that autonomously optimizes multi-step LLM pipelines from baseline prompts to target accuracy. FAPO evaluates a chain, attributes failures at the step level, proposes variants across prompt, parameter, and chain-structure levels, and validates each through an independent reviewer. In Cisco's evaluation, it beat GEPA on 15 of 18 model-benchmark comparisons. Here's how the optimization loop...

News

How to Build a Forecasting Pipeline with TimeCopilot Using Foundation Models and Automated Anomaly Detection

LLMs & Text

We build an end-to-end forecasting workflow with TimeCopilot on a panel of real airline passenger data and a synthetic seasonal series with injected anomalies. We evaluate statistical, foundation, and optional GPU-based models using rolling cross-validation and multiple error metrics. We generate probabilistic forecasts with prediction intervals, visualize future trends, and flag unusual observations. We then explore TimeCopilot's optional LLM agent, which selects a model and explains its predic...

News

A startup claims it broke through a bottleneck that’s holding back LLMs

LLMs & Text

The Miami-based AI startup Subquadratic came out of stealth mode last month with a huge claim. It announced that it had solved a mathematical bottleneck that had been holding back large language models for almost a decade. The details were thin, and many people were unconvinced. But Subquadratic has started to bring the receipts, sharing…

Modality

LLMs & Text

Language models, agents, reasoning, and the state of the art in text.