Wrap Up
Points to remember
- AI model trustworthiness is built through layered safeguards: design logic, faithful explanation, fairness integrity, and output reliability must work together.
- Models that cannot reveal their reasoning at the design stage create hidden failure points; explainability must be part of architecture, not just documentation.
- Over-delegation without contestability erodes human authority; systems like the Apple Card and Uber’s self-driving car showed the cost of oversight gaps.
- Interpretation tools (e.g., SHAP, LIME, saliency maps) often provide comfort rather than truth, aligning with user expectations rather than exposing causal logic (see the first sketch after this list).
- Plausible explanations can mask serious flaws; faithfulness requires tools like token-level attribution, causal tracing, and circuit diagnostics (see the second sketch after this list).
- Fairness audits focused on global metrics can miss harm hidden in subgroups; parity metrics alone are not sufficient (see the third sketch after this list).
- Bias often arises not from intent but from feature proxies (e.g., ZIP code, or healthcare cost as a proxy for health need), untested thresholds, or context-free feature inclusion.
- Models can pass fairness audits and still amplify structural inequalities, as seen in risk-scoring tools and in representational bias cases like Twitter’s image cropping.
- Output integrity is the last safeguard: confident errors, hallucinations, and prompt injection can mislead users even when models are well designed.
- Trustworthy models integrate safeguards at every layer:
  - Confidence overlays and uncertainty-aware architectures
  - Causal path diagnostics, token-level attribution, circuit tracing
  - Subgroup calibration curves, bias amplification tests, counterfactual probes (see the final sketch after this list)
  - Hallucination control layers, prompt injection defenses, output traceability
- Legal and standards anchors, including ISO/IEC 23894 (AI risk management), ISO/IEC 24028 (trustworthiness in AI), ISO/IEC 24027 (bias in AI), and the EU AI Act, call for explainability, fairness, and integrity to be embedded in model design for high-risk AI.
- Case studies in this chapter illustrate how model-level gaps translate into real-world harm:
  - Apple Card: Delegation without human contestability
  - Uber self-driving car: Oversight without functional authority
  - Twitter cropping: Inclusion without understanding
  - DoNotPay / Air Canada chatbot: Plausibility without truth
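
To make a few of these points concrete, the sketches below illustrate them in code. They are minimal sketches under stated assumptions, not production implementations.

The first sketch relates to interpretation tools: a post-hoc SHAP explanation of a toy classifier. The model, the synthetic data, and the feature count are invented for illustration; the takeaway is that the attributions describe local model behavior, not the causal logic behind a decision.

```python
# Minimal SHAP sketch (assumes `pip install shap scikit-learn`).
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                  # four synthetic features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # label depends on features 0 and 1 only

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer computes per-feature SHAP values for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])

# Depending on the SHAP version, the result is a list of per-class arrays or a
# single 3-D array; either way, attributions for features 2 and 3 should be
# near zero, because the label never used them.
print(np.asarray(shap_values).shape)
```

The explanation is faithful to the fitted model, but it says nothing about whether the model's reliance on those features is appropriate in the deployment context.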
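The second sketch shows one common form of token-level attribution, gradient-times-input, applied to an off-the-shelf sentiment classifier. The checkpoint name and the input sentence are illustrative choices only, and this is one technique among several (causal tracing and circuit analysis operate on model internals rather than the input).

```python
# Gradient-times-input token attribution sketch (assumes `pip install torch transformers`).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

text = "The claim was denied because the applicant missed one payment."
enc = tokenizer(text, return_tensors="pt")

# Look up the token embeddings and make them a leaf tensor so gradients flow to them.
embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)

# Forward pass on embeddings instead of token ids.
logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits
pred = logits.argmax(dim=-1).item()

# Gradient of the predicted-class logit with respect to each token embedding.
logits[0, pred].backward()

# Gradient x input, summed over the embedding dimension, gives one score per token.
scores = (embeds.grad * embeds).sum(dim=-1).squeeze(0).detach()

for token, score in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist()), scores):
    print(f"{token:>12s}  {score.item():+.4f}")
```

Scores like these are only a starting point for faithfulness checks, for example by masking high-scoring tokens and verifying that the prediction actually changes.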
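The third sketch is a subgroup audit: the same error rates computed overall and per group. The toy data frame stands in for real labels, model decisions, and a group attribute; the overall numbers can look acceptable while one subgroup carries most of the errors.

```python
# Subgroup audit sketch (assumes `pip install pandas`).
import pandas as pd

df = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1],
    "y_pred": [1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0],
    "group":  ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"],
})

def rates(g: pd.DataFrame) -> pd.Series:
    tp = ((g.y_true == 1) & (g.y_pred == 1)).sum()
    fn = ((g.y_true == 1) & (g.y_pred == 0)).sum()
    fp = ((g.y_true == 0) & (g.y_pred == 1)).sum()
    tn = ((g.y_true == 0) & (g.y_pred == 0)).sum()
    return pd.Series({
        "n": len(g),
        "tpr": tp / max(tp + fn, 1),            # true positive rate (equal opportunity)
        "fpr": fp / max(fp + tn, 1),            # false positive rate
        "selection_rate": (g.y_pred == 1).mean(),
    })

print("Overall:\n", rates(df), "\n")
print("By subgroup:\n", df.groupby("group")[["y_true", "y_pred"]].apply(rates))
```

In this toy run the overall true positive rate looks middling, while group B's sits well below group A's, which is exactly the kind of gap a single global parity metric can hide.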
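Finally, a sketch of per-subgroup calibration curves using scikit-learn's calibration_curve. The synthetic scores are deliberately inflated for one group, showing how a model can look reasonably calibrated in aggregate while being overconfident for a subgroup.

```python
# Per-subgroup calibration sketch (assumes `pip install numpy scikit-learn`).
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
n = 4000
group = rng.choice(["A", "B"], size=n)

# Synthetic ground truth and scores: group B's scores are inflated by 0.2,
# so the model is overconfident for that subgroup.
true_prob = rng.uniform(0, 1, size=n)
y_true = (rng.uniform(0, 1, size=n) < true_prob).astype(int)
y_score = np.clip(np.where(group == "B", true_prob + 0.2, true_prob), 0, 1)

for g in ["A", "B"]:
    mask = group == g
    frac_pos, mean_pred = calibration_curve(y_true[mask], y_score[mask], n_bins=5)
    print(f"group {g}:")
    for observed, predicted in zip(frac_pos, mean_pred):
        print(f"  predicted {predicted:.2f}  ->  observed {observed:.2f}")
```

For group A, predicted and observed frequencies should track each other; for group B, the observed frequency falls below the predicted score in the upper bins, which is the signature of overconfidence that a single global calibration plot would average away.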