5.2. Can You Trace the Model's Thinking, or Just Its Output?
💬 “People don’t trust what they don’t understand, and they can’t contest what they can’t see.”
Sandra Wachter, Professor of Technology and Regulation, University of Oxford [1]
When AI systems present their outputs, they rarely present their reasoning. And when they do, the reasoning is often a performance, not a reflection of what actually happened inside the model.
In 2024, a Canadian court ruled that Air Canada was responsible for misleading a passenger through an AI chatbot that fabricated a refund policy. The chatbot’s response sounded articulate, plausible, and entirely false. There was no warning. No signal of doubt. No way for the user to see how the answer was produced. The failure wasn’t just inaccuracy. It was persuasion, delivered with confidence where caution was needed [2].
At first glance, this might seem like a simple technical fault: an error in policy retrieval or chatbot logic. But what made it truly dangerous was the way it disguised its failure: fluent, confident, and unchallengeable. The case revealed a deeper systemic risk. Modern AI persuades by appearance, not by truth, and users trust what looks certain, even when no evidence lies beneath.
An unfaithful explanation is more dangerous than no explanation. It masks the system’s failure behind the illusion of clarity.
🧠 The Crisis of Surface-Level Interpretability
AI developers have made tremendous progress in creating tools that seem to explain decisions. And in many cases, these tools do just enough to satisfy users’ expectations of clarity. But this is exactly what makes them dangerous. When a system highlights what looks plausible rather than what caused the decision, it offers comfort, not truth, and turns transparency into illusion.
In practice, users are shown what they expect to see, not what the model actually computed. This isn’t transparency; it’s theatrical alignment: the phenomenon where AI outputs explanations that match user expectations rather than the model’s actual reasoning.
Recent interpretability research from Anthropic shows that Claude 3.5, for instance, can offer a detailed step-by-step explanation of how it solved a math problem, even when the internal traces reveal that no such reasoning took place [3]. The explanation is reverse-engineered to match the output: a rationalization, not a rationale.
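One way to make this concrete is a black-box probe: if a stated explanation genuinely drove the answer, removing the fact it cites should change the answer. The sketch below is illustrative only, not Anthropic’s internal method; `query_model` and the example strings are hypothetical stand-ins for a real LLM call.

```python
def query_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model API call. This toy stub always
    # answers the same way regardless of the prompt, which is exactly the
    # failure mode the probe is meant to surface.
    return "Yes, you can request the refund retroactively."


def explanation_is_suspect(question: str, cited_fact: str) -> bool:
    """Flag an explanation whose cited fact has no effect on the answer."""
    with_fact = query_model(f"{cited_fact}\n\nQ: {question}\nA:")
    without_fact = query_model(f"Q: {question}\nA:")
    # Identical answers with and without the cited fact suggest the stated
    # reasoning is a post-hoc rationalization, not the actual cause.
    return with_fact.strip() == without_fact.strip()


if __name__ == "__main__":
    print(explanation_is_suspect(
        question="Can I claim a bereavement fare refund after my trip?",
        cited_fact="Policy excerpt the explanation cites: refunds may be claimed within 90 days.",
    ))  # -> True for this stub: the cited policy never influenced the answer
```

A probe like this cannot prove an explanation is faithful, but a failed probe is strong evidence that the narrative the model produced was written for the reader, not derived from its own computation.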
This pattern also appears in legal AI tools like DoNotPay, which justified fabricated case citations [4], and in medical models that recommend procedures based on confounded correlations while attributing their choices to “common symptoms” or “clinical norms.”
In all these cases, the danger isn’t that the model is opaque; it’s that it pretends to be transparent.
🧭 Why This Matters for Trustworthy AI
We often ask for explanations to increase trust. But if the explanation is unreliable, it does the opposite. It creates epistemic risk: a condition in which users believe they understand a system, and act on that belief, when in fact they are being misled.
This makes interpretability not just a UX concern, but a risk management challenge. As outlined in Chapter 3, false confidence in opaque logic is one of the hardest risks to detect, and one of the most dangerous to ignore.
In this section, we explore two connected failures:
- When interpretability tools like SHAP and LIME offer the illusion of insight while hiding causal gaps (a short sketch of this failure follows the list)
- When users need to understand failure modes, but get generic justifications instead
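The first failure is easy to reproduce. Below is a minimal Python sketch, assuming `shap` and `scikit-learn` are installed; the synthetic features `proxy` and `noise` are invented for illustration. A model trained only on a proxy of the true cause receives a clean, plausible-looking SHAP attribution, yet that attribution says nothing about the variable the decision actually depends on.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000

true_cause = rng.normal(size=n)                     # the variable that actually drives the outcome
proxy = true_cause + rng.normal(scale=0.1, size=n)  # correlated stand-in the model can latch onto
noise = rng.normal(size=n)
y = 3.0 * true_cause + rng.normal(scale=0.5, size=n)

# The model never sees the true cause, only the proxy and noise.
X = np.column_stack([proxy, noise])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# SHAP faithfully attributes the *model's* behaviour: nearly all credit goes to
# the proxy. The explanation looks crisp and plausible, yet it reveals nothing
# about the causal structure the user actually cares about.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:200])        # shape: (200, 2)
print("mean |SHAP| per feature [proxy, noise]:", np.abs(shap_values).mean(axis=0))
```

The point is not that SHAP is wrong; it describes the model faithfully. The gap is between explaining the model and explaining the world, and users routinely read the former as the latter.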
We’ll also examine how new techniques, like Claude’s token-level attribution graphs, signal a potential shift toward causal traceability over surface transparency: the condition where a system appears to provide explanations or reasoning but does not expose its genuine internal logic [3]. But we’ll see that even these newer tools must be used carefully; a toy sketch of what such a graph records follows below.
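To give a feel for what causal traceability means in data terms, the sketch below models an attribution graph as weighted edges from input tokens through intermediate features to the output. It is an illustration of the idea only, not Anthropic’s actual tooling, and the node names in the example trace are invented.

```python
from dataclasses import dataclass, field


@dataclass
class AttributionGraph:
    # edges[(source, target)] = contribution weight of source to target
    edges: dict[tuple[str, str], float] = field(default_factory=dict)

    def add_edge(self, source: str, target: str, weight: float) -> None:
        self.edges[(source, target)] = weight

    def upstream_of(self, node: str) -> list[tuple[str, float]]:
        """Trace which nodes causally fed a given node, strongest first."""
        parents = [(s, w) for (s, t), w in self.edges.items() if t == node]
        return sorted(parents, key=lambda p: -abs(p[1]))


# A hypothetical trace for "36 + 59": the graph records which internal features
# actually contributed to the answer, which may differ from the step-by-step
# story the model writes out for the user.
g = AttributionGraph()
g.add_edge("token:36", "feature:units-digit-6", 0.8)
g.add_edge("token:59", "feature:units-digit-9", 0.7)
g.add_edge("feature:units-digit-6", "feature:sum-ends-in-5", 0.9)
g.add_edge("feature:units-digit-9", "feature:sum-ends-in-5", 0.9)
g.add_edge("feature:sum-ends-in-5", "output:95", 1.2)

print(g.upstream_of("output:95"))  # the causal parents of the answer, not its narrative
```

The value of this framing is that the trace is recovered from the computation itself rather than generated afterwards, which is what makes it auditable in a way a written rationale is not.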
Because if a model’s explanation isn’t faithful to what actually happened, then what exactly are we trusting?
Bibliography
1. Wachter, S., Mittelstadt, B., & Russell, C. (2021). Why fairness cannot be automated: Bridging the gap between EU non-discrimination law and AI. Computer Law & Security Review, 41, 105567.
2. Khan, A. (2024). Court rejects Air Canada’s “remarkable” denial of liability regarding misinformation by its chatbot. Law360 Canada; Civil Resolution Tribunal of British Columbia. https://www.law360.ca/ca/articles/1804075/court-rejects-air-canada-s-remarkable-denial-of-liability-regarding-misinformation-by-its-chatbot
3. Anthropic. (2025). On the Biology of a Large Language Model: Interpretability Studies in Claude 3.5 Haiku. https://transformer-circuits.pub/2025/attribution-graphs/biology.html
4. NDTV. (2023). “Robot Lawyer” Faces Lawsuit For Practicing Law Without A License In US. https://www.ndtv.com/feature/robot-lawyer-faces-lawsuit-for-practicing-law-without-a-license-in-us-3855043