IvRA: A Framework to Enhance Attention-Based Explanations for Language Models with Interpretability-Driven Training
Link:
https://openreview.net/pdf/c03c96f5ac017bf32c503143f8887d1e49fbdf5e.pdf
Abstract:
Attention has long served as a foundational technique for generating explanations. With recent developments in Explainable AI (XAI), the multi-faceted nature of interpretability has become more apparent. Can attention, as an explanation method, be adapted to meet the diverse needs that our expanded understanding of interpretability demands? In this work, we address this question by introducing IvRA, a framework that directly trains a language model's attention distribution through regularization to produce attribution explanations that align with interpretability criteria such as simulatability, faithfulness, and consistency. Our extensive experimental analysis demonstrates that IvRA outperforms existing methods in guiding language models to generate explanations that are simulatable, faithful, and consistent in tandem with their predictions. Furthermore, we perform ablation studies to verify the robustness of IvRA across various experimental settings and to shed light on the interactions among different interpretability criteria.
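The general idea of training attention through regularization can be illustrated with a minimal sketch (this is not the authors' implementation): a standard task loss is combined with a divergence penalty that pulls the model's attention distribution toward a target attribution signal chosen to reflect a given interpretability criterion. The function name, tensor shapes, and the choice of a KL penalty below are illustrative assumptions.

    # Hypothetical sketch of attention-regularized training, not the IvRA code.
    # The task loss is augmented with a KL term that aligns the model's
    # attention over input tokens with a target attribution distribution
    # (e.g., one derived from an interpretability criterion).
    import torch
    import torch.nn.functional as F

    def attention_regularized_loss(logits, labels, attn_weights, target_attr, lam=0.1):
        """Task loss plus a KL penalty aligning attention with a target attribution.

        logits:       (batch, num_classes) model predictions
        labels:       (batch,) gold labels
        attn_weights: (batch, seq_len) attention over input tokens (rows sum to 1)
        target_attr:  (batch, seq_len) target attribution distribution (rows sum to 1)
        lam:          weight of the regularization term
        """
        task_loss = F.cross_entropy(logits, labels)
        # KL(target || attention): penalize attention mass that deviates from the target.
        reg_loss = F.kl_div(attn_weights.clamp_min(1e-12).log(), target_attr,
                            reduction="batchmean")
        return task_loss + lam * reg_loss

In this sketch, lam trades off predictive accuracy against how closely the attention-based explanation tracks the target attribution; the paper studies how such criteria (simulatability, faithfulness, consistency) interact under this kind of training.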
Citation:
Xie S, Vosoughi S, Hassanpour S. IvRA: A Framework to Enhance Attention-Based Explanations for Language Models with Interpretability-Driven Training. BlackboxNLP; 2024 Sept 21.