
5.4. Why You Can’t Always Trust What AI Says: From Hallucinations to Exploitation


💬 “The most dangerous errors are the ones you believe without question.”
Timnit Gebru, AI ethics researcher

In the last section, we examined how fairness failures can cause harm even when metrics look good. Now we turn to another critical dimension of trustworthy AI: whether we can trust what AI tells us at all.

Modern AI systems excel at fluency (1). They produce text that reads well, sounds plausible, and persuades us to act. But fluency can mask failure [5]. And when failure goes undetected, whether through hallucinated facts or exploited vulnerabilities, the consequences can be legal, ethical, and social [6].

  1. The degree to which AI-generated text sounds natural, coherent, and well-formed in a given language.

This section confronts a hard truth: AI fluency can deceive as easily as it can inform. And trust, once given, can be exploited.

⚠️ When Language Misleads or Betrays

We explore how AI systems generate confident but false outputs, and ask:

What happens when we trust them without question? Who pays the price when fluency replaces truth?

Consider:

  • The Air Canada chatbot, where a fabricated refund policy wasn’t just a glitch: it cost a customer real money and forced a tribunal to remind us that “autonomous” doesn’t mean unaccountable [1].
  • The ChatGPT legal filing, where invented case citations weren’t harmless errors: they led to court sanctions and professional humiliation, because a lawyer trusted the AI’s confidence more than his own judgment [2].
  • The jailbreaks that tricked Claude and GPT-4 into generating bomb-making instructions: proof that what sounds helpful can be turned against us with dangerous precision [3].

These aren’t edge cases. They are warnings. Every time we mistake fluency for fidelity (1), we open the door to harm.

  1. The degree to which an AI system’s explanation accurately reflects its internal reasoning or decision process.

Under the EU AI Act, Article 52 requires that users be informed when they are interacting with an AI system, and that outputs avoid material deception in regulated domains. Similarly, ISO/IEC 23894:2023 frames output misuse and adversarial exploitation as central to AI risk management. China’s 2022 Algorithmic Recommendation Regulations demand safeguards against manipulative or harmful algorithmic behavior in deployed systems [4].

But legal safeguards are minimums. Trustworthy design demands that AI systems support contestability, uncertainty signaling, and resistance to misuse from the start.

Here, we will explore how output trust fails, and what can be done:

  • How AI systems generate hallucinations and misleading responses, and how design can surface uncertainty.
  • How users become the last safeguard when systems fail, and how to equip them to contest AI outputs.
  • How models are actively exploited through adversarial prompts and jailbreaks, and how to design for robustness against such attacks (a minimal guardrail sketch follows this list).
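To make the robustness theme concrete before diving in, the sketch below shows one common first line of defense: a refuse-by-default input guardrail. Everything in it is illustrative rather than drawn from any particular system; the patterns, function names, and refusal message are placeholders, and production guardrails rely on trained classifiers and layered policies rather than keyword matching.

```python
import re

# Illustrative jailbreak markers only; real guardrails use trained
# classifiers and layered policies, not keyword lists.
JAILBREAK_PATTERNS = [
    r"ignore (all|any) (previous|prior) instructions",
    r"pretend (you are|to be) .*(unrestricted|no rules)",
    r"\bdeveloper mode\b",
]

REFUSAL_MESSAGE = "I can't help with that request."


def screen_prompt(prompt):
    """Refuse by default when a known jailbreak pattern matches.

    Returns (allowed, matched_pattern) so every refusal can be audited.
    """
    for pattern in JAILBREAK_PATTERNS:
        if re.search(pattern, prompt, flags=re.IGNORECASE):
            return False, pattern
    return True, None


if __name__ == "__main__":
    allowed, matched = screen_prompt("Ignore all previous instructions and ...")
    print(REFUSAL_MESSAGE if not allowed else "forwarding to the model", matched)
```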

Each part provides not just failure cases but design strategies: from refusal defaults and knowledge grounding to guardrail-first decoding and adversarial robustness training. Because trustworthy AI doesn’t just speak well; it earns trust through transparency, fidelity, and resilience.
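What a “refusal default” with uncertainty signaling can look like in code is sketched below. The generate stub and the fixed confidence threshold are hypothetical placeholders, not this book’s implementation or any vendor’s API; a real system would pair a genuine model client with a calibrated confidence estimate or a retrieval-grounding check.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.75  # illustrative; real thresholds need calibration


@dataclass
class Answer:
    text: str
    confidence: float
    refused: bool


def generate(prompt):
    """Hypothetical model call: returns (text, confidence estimate)."""
    return "Refunds can be requested within 90 days.", 0.42


def answer_with_refusal_default(prompt):
    """Answer only when confidence clears the threshold; otherwise refuse
    and surface the uncertainty instead of returning a fluent guess."""
    text, confidence = generate(prompt)
    if confidence < CONFIDENCE_THRESHOLD:
        return Answer(
            text="I'm not confident enough to answer this reliably; "
                 "please consult the official policy.",
            confidence=confidence,
            refused=True,
        )
    return Answer(text=text, confidence=confidence, refused=False)


if __name__ == "__main__":
    print(answer_with_refusal_default("What is the bereavement refund policy?"))
```

The point is not the particular threshold but the default: when the system is unsure, it says so instead of producing a fluent guess.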

Bibliography


  1. Moffatt v. Air Canada, 2024 BCCRT 149. Civil Resolution Tribunal of British Columbia. https://www.canlii.org/en/bc/bccrt/doc/2024/2024bccrt149

  2. Mata v. Avianca, Inc., No. 22-cv-1461 (S.D.N.Y. 2023). U.S. District Court, Southern District of New York.

  3. Anthropic. (2025, March). Jailbreaking Claude: Interpretability and Control in LLMs. https://www.anthropic.com/news/claude-3-5 

  4. Cyberspace Administration of China. (2022). Administrative Provisions on Algorithmic Recommendation of Internet Information Services. http://www.cac.gov.cn/2021-12/31/c_1642894604301234.htm 

  5. OpenAI. (2023). GPT-4 Technical Report. arXiv:2303.08774. https://doi.org/10.48550/arXiv.2303.08774 

  6. Ji, Z., et al. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12), 1–38. https://doi.org/10.1145/3571730