source: arxiv artificial intelligence: a definition of good explanations and the challenges explaining llm outputs

level: research

what makes an explanation good is a long-standing philosophical question, and one that is now urgent for ai systems. this paper proposes a definition based on counterfactual explanations, with a key twist: the explanation must account for what the listener already believes about each fact offered. the idea is that an explanation is good if, given the listener's existing knowledge, it would change their mind in a relevant way.
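the idea can be made concrete with a small sketch. this is not the paper's formal definition; it is a toy illustration under stated assumptions: facts are booleans, a fact is "counterfactually relevant" if flipping it alone changes the decision, and the listener's prior is a hypothetical probability that each fact is true. the decision rule, the fact names, and the `surprise` score are all invented for illustration.

```python
from typing import Callable, Dict, List

Facts = Dict[str, bool]


def counterfactual_facts(decide: Callable[[Facts], bool], facts: Facts) -> List[str]:
    """Return the facts whose flip, on its own, changes the decision."""
    actual = decide(facts)
    relevant = []
    for name, value in facts.items():
        flipped = {**facts, name: not value}
        if decide(flipped) != actual:
            relevant.append(name)
    return relevant


def rank_for_listener(relevant: List[str], facts: Facts,
                      listener_belief: Dict[str, float]) -> List[str]:
    """Order relevant facts by how surprising their actual value is to the listener.

    listener_belief[name] is the listener's (assumed) probability that the
    fact is True; unknown facts default to 0.5.
    """
    def surprise(name: str) -> float:
        p_true = listener_belief.get(name, 0.5)
        p_actual = p_true if facts[name] else 1.0 - p_true
        return 1.0 - p_actual

    return sorted(relevant, key=surprise, reverse=True)


# hypothetical decision rule: approve when at least two of three conditions hold
def decide(f: Facts) -> bool:
    score = int(f["high_income"]) + int(f["owns_home"]) + int(not f["prior_default"])
    return score >= 2


facts = {"high_income": True, "owns_home": False, "prior_default": True}
# hypothetical listener: already expects no home ownership, does not expect a default
listener_belief = {"high_income": 0.9, "owns_home": 0.2, "prior_default": 0.1}

relevant = counterfactual_facts(decide, facts)
ranked = rank_for_listener(relevant, facts, listener_belief)
print(relevant)  # facts that would flip the decision
print(ranked)    # the same facts, most surprising to this listener first
```

both `owns_home` and `prior_default` are counterfactually relevant here, but only the prior default contradicts what this listener believes, so it is ranked first. a different listener with different priors would get a different ordering from the same facts.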

the authors argue that most ai explainability methods ignore the human receiver's prior beliefs. they show that for an explanation to be effective, it must bridge the gap between the system's reasoning and the user's understanding. this means simply providing feature importance or decision rules is not enough; the explanation must be tailored to what the user finds plausible or surprising.

the challenge is especially acute for large language models. llm outputs are hard to explain because they rely on vast, opaque knowledge and generate text that can seem coherent yet lacks a clear causal structure. the paper explores why standard counterfactual approaches fail here: changing a few input tokens rarely yields a meaningful alternative output that aligns with a user's expectations. the work calls for explanation methods that actively model the user's beliefs.
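to show what a token-level counterfactual search looks like, here is a minimal sketch. it assumes a toy keyword classifier (`toy_model`) standing in for a real model, and a hypothetical list of substitute tokens; neither comes from the paper. the search tries every single-token substitution and keeps those that change the output.

```python
from typing import Callable, List, Tuple


def single_token_counterfactuals(
    tokens: List[str],
    candidates: List[str],
    model: Callable[[List[str]], str],
) -> List[Tuple[int, str, str]]:
    """Try each one-token substitution; keep those that change the model output.

    Returns (position, substitute token, new output) triples.
    """
    original = model(tokens)
    found = []
    for i in range(len(tokens)):
        for cand in candidates:
            if cand == tokens[i]:
                continue
            edited = tokens[:i] + [cand] + tokens[i + 1:]
            new_output = model(edited)
            if new_output != original:
                found.append((i, cand, new_output))
    return found


# toy stand-in for a model: labels text "negative" if it contains "not"
def toy_model(tokens: List[str]) -> str:
    return "negative" if "not" in tokens else "positive"


tokens = ["the", "film", "was", "not", "good"]
candidates = ["very", "bad"]  # hypothetical substitution vocabulary

flips = single_token_counterfactuals(tokens, candidates, toy_model)
for position, substitute, new_output in flips:
    print(position, substitute, new_output)
```

even in this toy case, one of the output-changing edits produces "the film was bad good", which flips the label but is not something a reader would accept as a meaningful alternative. with free-form llm output the space of edits and outputs is far larger, and the search has no notion of what the user expected to see.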

why it matters: for ai and data science, this reframes explainability as a user-centered problem, pushing developers to build tools that adapt to what people already know, which is critical for trust and adoption in high-stakes domains.

