44 results for "LLM"
Reinforcement Learning Explained
Robotics & RL | What reinforcement learning is: agent, environment, action, and reward. How AI learns by trial and error, from AlphaGo to the RLHF behind LLMs, and where it fails.
Mixture of Experts: Giant Models That Cost Little
LLMs & Text | What an MoE (mixture-of-experts) model is: how an LLM can have a trillion parameters yet activate only a fraction of them for each token, and why this design dominates current releases.
AI Agents: What They Are and How They Think
LLMs & Text | AI agents explained: how an LLM goes from merely answering to using tools, planning, and acting, and why this is more fragile than it looks.
VLMs: Vision-Language Models
Multimodal | What a VLM (vision-language model) is: how it joins an image encoder to an LLM, what it can do (VQA, captioning, document reading), and its limits.
LLMs: How Language Models Work
LLMs & Text | Understand once and for all how LLMs work: the transformer architecture, training, why they hallucinate, fine-tuning, RAG, quantization, and agents.
Quantization: Running LLMs on Your Own Computer
LLMs & Text | What quantization is and how it lets you run quality LLMs on your own computer, with privacy and zero cost per request.
Visual Tokens: How a Model 'Reads' an Image
Multimodal | How a multimodal model turns an image into tokens the LLM understands: patches, projection into the text space, and why images cost so many tokens.
Jury Duty: Calibration and Orientation Failures in MLLM-as-a-Judge Under Cultural Ambiguity
LLMs & Text | arXiv:2606.20676v1 | MLLM-as-a-Judge is conventionally validated by agreement with human annotations, but this metric is undefined when the human pool is culturally heterogeneous. We introduce VOIR DIRE, a multimodal benchmark of 626 culturally paired image-prompt artifacts spanning U.S. and mainland Chinese contexts across food, fashion, and architecture, with annotator pools that are within-pool reliable (a = 0.86/0.74) but cross-pool divergent on evaluation (Q1 r =...
Is Our Benchmark Enough? An Analysis of Continual Learning for MLLMs
LLMs & Text | arXiv:2606.20961v1 | Continual adaptation is essential for multimodal large language models (MLLMs) deployed across evolving domains, but the state-of-the-art MR-LoRA method relies heavily on the assumption that an MLLM-based router is necessary to process complex multimodal inputs. This paper revisits this claim on the MLLM-CL benchmark and argues for two claims. First, routing does not require an MLLM: a simple training-free, replay-free prototypical routing met...
A Validation-Gated Mechanistic Account of Suicidality Detection in LLMs
LLMs & Text | arXiv:2606.21078v1 | Large language models are increasingly proposed for mental-health applications such as detecting suicidal content, raising the question of what they rely on. We study this mechanistically and use it to ask a narrower question: how to make a causal claim about a model's internal features more trustworthy. Our validation-gated framework, with suicidality detection as a case study, interprets a behavior only after the model is shown to perform it: a c...
An LLM-Explainable DRL Framework for Passenger-Directed Autonomous Driving
LLMs & Text | arXiv:2606.20640v1 | Autonomous vehicles offer the potential for safer and more efficient mobility, yet public trust remains limited due to the lack of transparency in their decision-making. This work addresses this issue by combining deep reinforcement learning (DRL) for adaptive driving control with large language model (LLM)-based explainability modules designed to communicate agent behavior to passengers. DRL agents were trained in simulation using a Dueling Double...
Less is More: Lightweight Prompt Compression for Question Answering Applications on Edge Devices
LLMs & Text | arXiv:2606.20571v1 | In agent-driven question answering (QA) applications, retrieval-augmented generation (RAG) is commonly introduced to enhance the response accuracy of large language models (LLMs) by providing additional context. Due to the inherent noise in retrieval results and the coarse granularity of document-level retrieval, the retrieved context often contains substantial redundant information. In this setting, the agent prompt, consisting of the user query a...
MIRAGE: Stealthy Visual Prompt Injection for Vulnerability Detection in Web Agents
LLMs & Text | arXiv:2606.20717v1 | Multimodal Large Language Model (MLLM)-based web agents provide practical, high-precision solutions for visual browser automation; however, they inherently expand the attack surface, introducing novel vision-based vulnerabilities. Existing adversarial evaluations targeting these agents frequently rely on permissive threat models and visually conspicuous artifacts. In this paper, we investigate a constrained vulnerability detection setting: a truste...
The Metanym Game: A Self-Contained, Self-Consistent LLM Peer-Community Benchmark for Structural Intelligence
LLMs & Text | arXiv:2606.21008v1 | The metanym game is a competitive word game for LLMs that measures structural intelligence against established cognitive-science constructs. No content is given in advance; the contestants create all of it: a new kind of analogy test, analogical production falsifiable sentence by sentence, with no fixed test set to leak into training (contamination-resistant by construction). In the council-of-peers benchmark, the contestants also rate each other...
Peeking Inside LLMs: Leveraging Internal Artifacts of LLMs for Enhancing Reliability in Legal Classification
LLMs & Text | arXiv:2606.20929v1 | Large Language Models (LLMs) are increasingly being adopted in the legal domain. However, despite their strong performance, LLMs are prone to generating incorrect or hallucinated outputs, raising serious concerns about their reliability in high-stakes domains such as law. Detecting the correctness of responses of LLM-based systems is therefore a critical challenge. In this work, we explore the potential of leveraging internal artifacts of LLMs to de...
Investigating Linguistic Steering: An Analysis of Adjectival Effects Across Large Language Model Architectures
LLMs & Text | arXiv:2606.20572v1 | Achieving reliable control of Large Language Models (LLMs) requires a precise, scalable understanding of how they interpret linguistic cues. We introduce a rigorous framework using Shapley values to quantify the steering effect of individual adjectives on model performance, moving beyond anecdotal heuristics to principled attribution. Applying this method to 100 adjectives across a diverse suite of models (including o3, gpt-4o-mini, phi-3, llama-3-...
SPARC: A Multi-Agent System for Electrical Circuit Question Answering
LLMs & Text | arXiv:2606.20643v1 | Electrical circuit diagram QA tasks require complex mathematical reasoning, which remains challenging for multimodal LLMs. We present SPARC, a multi-agent system that answers questions over circuit diagrams by grounding reasoning in executable physics-based simulations. SPARC uses LLM agents to synthesize, execute, and analyze simulation programs, improving accuracy and reliability by design. It achieves 83% accuracy, with up to a 58% absolute impr...
CELEUS: Certifiable and Efficient LLM Evaluation via E-Processes
LLMs & Text | arXiv:2606.20820v1 | Can we trust evaluation scores to capture an LLM's true real-world performance? Certifiable evaluation answers this question by providing guarantees for LLM evaluation. In particular, existing methods sequentially curate evaluation samples and keep updating confidence intervals (CIs) that cover the true performance with high probability (e.g., 95%) until some conditions are satisfied, e.g., the CI width reaches a target precision. However, existing ...
VeriBound: PAC-Bayesian Generalization Bounds for Process Reward Models Trained with Formal Verification Tools
LLMs & Text | arXiv:2606.20740v1 | Process Reward Models (PRMs) provide step-level verification for Large Language Model (LLM) reasoning, yet their training data acquisition remains a bottleneck: human annotation is costly and Monte Carlo roll-out estimates are noisy. A recent approach, FOVER, trains PRMs on step-level error labels automatically annotated by formal verification tools such as Z3 and Isabelle, and empirically observes cross-task generalization from symbolic tasks to d...
Latent Personal Memory: Represent personal memory as dynamic soft prompts
LLMs & Text | arXiv:2606.20911v1 | Personalizing large language models (LLMs) requires encoding long-term, user-specific behavioral patterns in a way that is computationally efficient, scalable, and compatible with a frozen base model. We present Latent Personal Memory (LPM), a scalable framework that represents user-specific history as a compact, persistent matrix of N latent slots that are interpretable. A shared cross-attention projection network maps these slots into dynamic, i...
PeerCheck: Enhancing LLM-Generated Academic Reviews Towards Human-Level Quality
LLMs & Text | arXiv:2606.20897v1 | As academic submissions grow, the traditional peer review process struggles to keep up, raising concerns about quality and fairness. A trend of using large language models (LLMs) for assistance has emerged. In this work, we take a critical step toward improving the quality of LLM-generated reviews. We propose the PeerCheck framework, which investigates LLM-human review differences (RQ1) and explores methods to improve LLM-generated review quality (...
AlphaMemo: Structured Search-Process Memory for Self-Evolving Alpha Mining Agents
LLMs & Text | arXiv:2606.20625v1 | LLM agents are promising for alpha mining because they combine financial priors, symbolic reasoning, executable factor generation, and feedback-driven refinement. Yet, they face a combinatorial search space, noisy non-stationary feedback, redundant discoveries, and overfitting risks from naively reusing past successes. To address these challenges, we propose AlphaMemo, a self-evolving alpha mining agent with Structured Search-Process Memory. Rather than m...
PEAR: Permutation-Equivariant Adaptive Routing Multi-Agent Debate
LLMs & Text | arXiv:2606.20621v1 | Multi-agent debate improves the reliability of large language models (LLMs) through iterative peer critiques. However, fixed topologies often introduce persistent positional biases, amplify unreliable agents, and cause high sensitivity to role assignments. We introduce Permutation-Equivariant Adaptive Routing Multi-Agent Debate (PEAR), an inference-time protocol that dynamically reconfigures communication roles and sparse topologies across...
A Multi-Agent Audit Framework for High-Stakes Reasoning: Evaluation and Interpretability in Clinical Mental Health Screening
LLMs & Text | arXiv:2606.21123v1 | High-stakes reasoning tasks necessitate transparent and verifiable workflows, yet conventional single-model large language models (LLMs) often struggle with hallucination and low interpretability under zero-shot paradigms. To address this general AI challenge, we propose a Multi-Agent Audit Framework that simulates a collaborative, multi-step verification process. We empirically validate this architecture in the sensitive domain of clinical mental ...
Post-Training Recipe, More Than Model Family, Shapes Multi-Agent LLM Conversational Behavior
LLMs & Text | arXiv:2606.20632v1 | Multi-LLM systems use multiple language models to deliberate, judge each other's outputs, or coordinate as agents. Their value depends on the models producing measurably different conversational behaviors when given the same input. Prior offline studies recommend drawing one model per family for behavioral diversity, because LLMs prefer outputs from their own family when rating one another in isolation. Whether the same family label predicts behavi...
From Knowing to Acting: Benchmarking Self-Awareness Capability of LLM Agents
LLMs & Text | arXiv:2606.20661v1 | The integration of external tools has transitioned LLM agents from passive responders to autonomous systems. However, current benchmarks prioritize execution success, neglecting self-awareness capability, the ability to discern whether a problem requires necessary external resources or can be solved via internal parametric knowledge. To address this, we introduce KAPRO (Knowing-Acting Quadrant PRObe), a framework that evaluates cognitive-behavioral...
LLM-Based Multi-Reference Evaluation for Efficient and Robust Assessment of Phrase Break Annotations
LLMs & Text | arXiv:2606.21098v1 | Reliable evaluation of phrase break annotations is crucial, as subtle variations in prosodic boundaries directly affect the clarity and naturalness of speech. However, existing approaches exhibit major limitations: single-reference evaluation assumes a unique gold phrasing for an utterance despite multiple valid phrasings, while human judgment, though flexible, is labor-intensive and unscalable. To address these, we propose LLM-based Multi-Referenc...
In LLM Reasoning, there is Irrationality on top of Value Misalignment
LLMs & Text | arXiv:2606.20624v1 | Significant progress has been made in aligning LLMs with target value functions. We argue that, even when an LLM has been well aligned in (post-)training, it may still fail to maximise the aligned value in reasoning. We mathematically formalise this gap as rational value risk: the utility discrepancy between a model's deployed reasoning strategy and its rational counterpart, which is defined to be the responses that maximise expected utility in the...
GEOPHYS: The Geometry of Physical Plausibility
LLMs & Text | arXiv:2606.20707v1 | While humans can identify physically implausible events within milliseconds, machine learning approaches addressing the same problem are extremely slow and expensive. They either rely on external multimodal-LLM judges or require ad-hoc modifications to the training procedure. In this work, we argue that indicators of physical plausibility are implicitly captured by five geometric properties of the per-frame embeddings produced by frozen image encod...
MAGNIFIED: RL Fine-tuning of Multimodal Large Language Models for Motion Planning
LLMs & Text | arXiv:2606.20641v1 | Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in semantic understanding and common sense reasoning, making them promising candidates for solving planning problems in autonomous driving. However, the next-token text prediction objectives traditionally used in pre-training and supervised fine-tuning (SFT) of MLLMs may fall short of fulfilling the planning objectives for autonomous vehicles. The next-token predict...
Demystifying Numerical Instability in LLM Inference: Achieving Reproducible Inference for Mission-Critical Tasks with HEAL
LLMs & Text | arXiv:2606.21023v1 | As Large Language Models (LLMs) are deployed in mission-critical domains (e.g., finance, medicine, and law), output reproducibility has become a strict system requirement. While practitioners use greedy decoding to eliminate algorithmic stochasticity, empirical deployments with 16-bit precisions still exhibit catastrophic output divergence across heterogeneous GPUs. Through SASS-level profiling, we reveal that this inconsistency is fundamentally driven...
Beyond Templates: Revisiting Zero-Shot Remote Sensing through Meta-Prompting
LLMs & Text | arXiv:2606.20702v1 | Vision-language models (VLMs) have sparked growing interest in zero-shot Earth Observation (EO) downstream tasks, with further gains enabled by remote-sensing-adapted models. We examine this setting across 17 VLM variants and 12 remote sensing (RS) datasets under Meta-Prompting for Visual Recognition (MPVR), and show that zero-shot performance remains highly sensitive to textual design choices, from the meta-prompts used to guide the LLM in generat...
Harnessing Agent Skills: Architectural Patterns and a Reference Architecture for Skill-Mediated LLM Agents
LLMs & Text | arXiv:2606.20631v1 | Agent skills externalise reusable agent-facing behavioural knowledge and guidance as persistent artefacts that can be discovered, activated, and interpreted by LLM agents. Although a skill artefact is static at rest, its architectural responsibilities arise in use, when the artefact is selected for a run, bound to context and authority constraints, interpreted by a stochastic agent, and recorded as run evidence. We call this run-specific relation s...
Specific Domain Ontology Construction Using Large Language Models
LLMs & Text | arXiv:2606.20691v1 | Ontologies are useful structures for organizing and maintaining information that can be understood by both humans and systems. However, since crafting them manually is laborious, many specific domains lack reference ontologies. The outstanding natural language understanding demonstrated by Large Language Models (LLMs) has motivated their application in a variety of fields, including ontology development. This work presents...
Event Ontology Expansion via LLM-Based Conceptualization
LLMs & Text | arXiv:2606.21048v1 | Event ontology expansion aims to discover emerging event types from data and extend them to appropriate positions in the existing event ontology. Existing methods typically cluster contextualized trigger representations and attach induced clusters to the ontology based on instance-level similarity. However, ontology expansion requires concept-level semantics that characterize event types, whereas contextualized trigger representations often confla...
Sakana AI Launches Sakana Fugu: An Orchestration Model That Routes Tasks Across a Swappable Pool of Frontier LLMs
LLMs & Text | Fugu and Fugu Ultra route tasks across a swappable model pool, leading most coding, reasoning, and agentic benchmarks. Source: MarkTechPost.
Sakana AI's Fugu orchestrates multiple LLMs to match Anthropic's Fable and Mythos benchmarks
LLMs & Text | Japanese AI startup Sakana AI is launching Fugu, a system that coordinates multiple AI models on the fly to compete with leaders like Anthropic's Fable 5. The approach also aims to cut dependence on any single AI provider. Source: The Decoder.
The 7 Types of Agent Memory: A Technical Guide for AI Engineers
LLMs & Text | LLMs are stateless by default. Agent memory fixes that. This guide breaks down all 7 types: working, semantic, episodic, procedural, retrieval, parametric, and prospective. It covers what each stores, where it lives, and when to build it. Includes a comparison table and working Python code. Source: MarkTechPost.
Kimi-K2.7-Code: The Open LLM Aimed at Programmers' Work
LLMs & Text | With roughly 363,000 downloads on Hugging Face, Moonshot AI's model captures a trend: open, code-specialized language models are moving from curiosity to working tool.
PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems
LLMs & Text | PlanBench-XL evaluates large language model agents' ability to plan and adapt in complex tool-rich environments with limited visibility and dynamic disruptions.
Cisco AI Introduces FAPO: Pipeline-Aware Prompt Optimization With Step-Level Failure Attribution and Claude Code Orchestration
LLMs & Text | Cisco Foundation AI has open-sourced FAPO (Fully Automated Prompt Optimization), a Claude Code-driven system that autonomously optimizes multi-step LLM pipelines from baseline prompts to target accuracy. FAPO evaluates a chain, attributes failures at the step level, proposes variants across prompt, parameter, and chain-structure levels, and validates each through an independent reviewer. In Cisco's evaluation, it beat GEPA on 15 of 18 model-benchmark comparisons. Here's how the optimization loop...
How to Build a Forecasting Pipeline with TimeCopilot Using Foundation Models and Automated Anomaly Detection
LLMs & Text | We build an end-to-end forecasting workflow with TimeCopilot on a panel of real airline passenger data and a synthetic seasonal series with injected anomalies. We evaluate statistical, foundation, and optional GPU-based models using rolling cross-validation and multiple error metrics. We generate probabilistic forecasts with prediction intervals, visualize future trends, and flag unusual observations. We then explore TimeCopilot's optional LLM agent, which selects a model and explains its predic...
A startup claims it broke through a bottleneck that’s holding back LLMs
LLMs & Text | The Miami-based AI startup Subquadratic came out of stealth mode last month with a huge claim. It announced that it had solved a mathematical bottleneck that had been holding back large language models for almost a decade. The details were thin, and many people were unconvinced. But Subquadratic has started to bring the receipts, sharing…
LLMs & Text
LLMs & Text | Language models, agents, reasoning, and the state of the art in text.