The Curse of Multiple Mediators: Hidden Interaction Effects in Activation Patching

arXiv:2606.27510v1 Announce Type: new Abstract: Activation patching is the primary tool in mechanistic interpretability. It attributes causal responsibility for a model behavior to each of its individual components by estimating its natural indirect effect (NIE). Re-deriving the activation patching estimand from causal mediation analysis, we find that the NIE does not solely capture the causal effect through the specific component. It also contains interaction effects (INT) that measure how much...

arXiv cs.LG ·Sankaran Vaidyanathan, David Arbour, Aaron Mueller, Scott Niekum, David Jensen ·
compartilhar: