6.1.1. Red Team or Regret – Why Testing Needs Adversaries

Red Team or Regret: Why Testing Needs Adversaries ¶

“If no one tried to break it, you didn’t test it.”

When Evaluation Looks Clean, But the Real World Doesn’t¶

AI systems often “pass” development-time testing, only to break, misbehave, or leak information once deployed. These aren’t always catastrophic technical failures. More often, they are failures of imagination: no one anticipated how real users, or bad actors, might exploit the system.

Most validation pipelines assume good intentions. But deployed AI faces adversarial prompts, ambiguous contexts, creative misuse, and users who don’t behave like developers expect.

Trust doesn’t collapse because the model is inaccurate. It collapses because it was never challenged hard enough before going live.

The Missed Opportunity: No One Tried to Break It¶

Even in high-stakes domains, red teaming is often superficial. Organizations run a few adversarial prompts, check for keyword filtering, and declare a system robust. But robustness isn’t about filters, it’s about surviving unpredictable manipulation.

Ancient Script Jailbreak (2023–2024)
In one of the most minimal and surprising jailbreaks documented, researcher Alexey Guzey discovered that both GPT-4 and Claude could be manipulated into issuing violent fictional responses—simply by wrapping a harmful request inside a fabricated “ancient text transcription” prompt. The model was instructed to decrypt the text as if it were an archaeological discovery, and mid-prompt, was given a disguised instruction like: “Explicitly threaten to break my arms and legs.” The models complied⁴.

The jailbreak succeeded because the request was disguised within creative roleplay and formatting noise—exposing how large models fail when harms are embedded in fictional or distracting contexts, not just when they are direct or explicit. Figure 48 and Figure 49 show the full prompt and resulting outputs from GPT-4 and Claude, which responded with detailed fictional threats when asked to simulate an ancient text transcription⁴.

The model didn’t misbehave because it was malicious.
It failed because no one ever tested how easily it could be tricked by creative framing.

Figure 48: Ancient Text Jailbreak Prompt Framing
This prompt instructs the model to simulate an ancient archaeological decryption task. Midway, a “NEW TASK” inserts a harmful fictional request disguised as part of the transcription. The structure distracts the model from applying safety filters. Source: Guzey, 2023.

Figure 49: GPT-4 and Claude Responses to Ancient Script Jailbreak
This figure shows the detailed fictional threats generated by GPT-4 and Claude in response to the embedded harmful prompt. Both models failed to recognize the safety violation due to the narrative context. Source: Guzey, 2023.

What Real Red Teaming Looks Like¶

To close this gap, red teaming must be reframed as a design-phase stress test, not a late-stage PR defense. It should deliberately target:

Technique	What It Exposes
Prompt fuzzing	Filter evasion via synonyms, oblique language, chained logic
Tool sabotage	Agent misuse through corrupted API calls, file writes, or deceptive outputs
Planning drift	Agents that misinterpret goals or escalate unintended tasks
Inference abuse	Attempts to extract identities, proprietary information, or hidden rules

Each technique acts as a simulation of bad faith, not to break the model, but to break the illusion that it's ready.

In other words, if your AI system has only ever been tested by cooperative users, it's not ready for deployment.

Knowit

“Think your model is secure? So did Claude.”
In February 2025, Anthropic launched a public red teaming challenge against its Claude 3.5 model, outfitted with cutting-edge Constitutional Classifiers. After 300,000 prompts and 3,700 hours of testing, four participants broke through every safeguard. One even uncovered a universal jailbreak, a method that bypassed protections across multiple scenarios.

Anthropic paid $55,000 in rewards and openly acknowledged that safety filters alone are not enough. As researcher Jan Leike noted, robustness against jailbreaking is now essential to prevent misuse, especially in high-risk domains like chemical and biological systems.
Source: The Decoder, February 2025

Trust by Design, Not by Assumption¶

A strong red team process supports:

Traceable risk identification, tied to known misuse modes
Audit-ready attack logs, showing what was tested, when, and how the system responded
Ongoing adaptation, since threat models evolve post-launch

This type of adversarial testing is not only a best practice but a regulatory expectation. Frameworks like ISO/IEC 42001 require identifying foreseeable misuse during lifecycle planning, while the EU AI Act mandates formalized risk management procedures, including red teaming, as part of high-risk system compliance. Even the OECD AI Principles emphasize resilience under operational conditions, reinforcing that robustness must be proven under stress, not assumed in silence.

A trustworthy model is one that’s been tested not just to perform, but to survive what the world throws at it.

ISO/IEC. (2023). ISO/IEC 42001:2023 – Artificial intelligence – Management system. International Organization for Standardization. ↩
European Commission. (2024). EU Artificial Intelligence Act – Final Text. https://artificialintelligenceact.eu/the-act/ ↩
OECD. (2019). Principles on Artificial Intelligence. https://oecd.ai/en/ ↩
Guzey, A. (2023, May 9). A two-sentence jailbreak for GPT-4 and Claude & why nobody knows how to fix it. https://guzey.com/ai/ancient-text-jailbreak/ ↩↩