Blog LLMs & Texto

The Curse of Multiple Mediators: Hidden Interaction Effects in Activation Patching

arXiv:2606.27510v1 Announce Type: new Abstract: Activation patching is the primary tool in mechanistic interpretability. It attributes causal responsibility for a model behavior to each of its individual components by estimating its natural indirect effect (NIE). Re-deriving the activation patching estimand from causal mediation analysis, we find that the NIE does not solely capture the causal effect through the specific component. It also contains interaction effects (INT) that measure how much...

arXiv cs.LG ·Sankaran Vaidyanathan, David Arbour, Aaron Mueller, Scott Niekum, David Jensen · 29 de janeiro de 2026

Ver no Hugging Face

// relacionados

The Curse of Multiple Mediators: Hidden Interaction Effects in Activation Patching

Leia também

The US military used AI to pick thousands of targets but missed a note saying one was a school

HP accelerates enterprise workflows with OpenAI Frontier

O fantasma do Fable 5: banido, o modelo vive nos datasets que o destilam

MultiHashFormer: e se cada palavra fosse uma impressão digital?