
6.1.3. The Checklist Trap – When Formal V&V Misses Real Risk


While the previous section focused on validation as a real-world challenge, this section examines how validation processes themselves can create a false sense of readiness.

A False Sense of Security

Many AI teams rely on formal verification and validation (V&V) procedures to declare their systems ready for launch. These procedures typically involve documentation, predefined test cases, and regulatory compliance. When everything is checked off, the system is marked as "safe."

But here’s the problem:

Checklists only prove you tested what you remembered to test.

Real-world failures often emerge not from broken components but from missing context: conditions no one thought to simulate. Formal procedures rarely capture interactions, edge-case ambiguity, or system-level surprises.

In critical systems, what’s not on the checklist is often what matters most.
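A toy illustration of this gap, as a minimal sketch with entirely hypothetical scenario parameters (not any vendor's test data): a fixed checklist only exercises the conditions someone enumerated, while even naive random perturbation of scenario parameters surfaces combinations nobody wrote down.

```python
# Minimal sketch: hypothetical scenario parameters, not any real test suite.
import random

# The "checklist": the handful of conditions someone remembered to enumerate.
CHECKLIST = [
    {"pedestrian_visible": True,  "prior_collision": False, "pedestrian_under_vehicle": False},
    {"pedestrian_visible": False, "prior_collision": False, "pedestrian_under_vehicle": False},
]

def random_scenario(rng: random.Random) -> dict:
    """Sample scenario conditions without regard to what the checklist anticipated."""
    return {
        "pedestrian_visible": rng.random() < 0.5,
        "prior_collision": rng.random() < 0.5,          # e.g. struck by another vehicle first
        "pedestrian_under_vehicle": rng.random() < 0.2,
    }

rng = random.Random(0)
sampled = [random_scenario(rng) for _ in range(1_000)]
uncovered = [s for s in sampled if s not in CHECKLIST]
print(f"{len(uncovered)} of {len(sampled)} sampled scenarios were never on the checklist")
```

The numbers are arbitrary; the point is that the checklist's coverage is bounded by what its authors imagined, not by what deployment can produce.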

Case Study 020: Cruise Robotaxi Pedestrian Dragging (Location: San Francisco, USA | Theme: System‑Level Oversight Failure)

📌 Overview:
On October 2, 2023, a Cruise autonomous taxi in San Francisco struck and then dragged a pedestrian roughly 20 feet following a prior collision caused by another vehicle.

🚧 Challenges:
Although the vehicle’s components (object detection, motion planning) performed within spec, the system failed to maintain awareness of the injured pedestrian and initiated a second movement, treating the scenario as resolved.

🎯 Impact:
The pedestrian was critically injured. California's DMV suspended Cruise's driverless permits, and the company halted all autonomous operations. Nearly 1,000 vehicles were recalled.

🛠️ Action:
Cruise issued a software update ensuring vehicles would remain stationary in similar situations. The company paid a $1.5M federal penalty for failing to disclose details of the crash promptly.

📈 Results:
The incident revealed a deep governance gap: validated components passed, but no human had authority to intervene in real time. Public trust collapsed, and operations were halted nationwide [4].

Knowit

“Cruise has lost all public trust, and they need to start over again.”
This statement by Aaron Peskin, President of the San Francisco Board of Supervisors, came after Cruise's vehicles were recalled and its permits suspended when a robotaxi dragged a pedestrian 20 feet under its wheels. The state ruled that Cruise posed an active danger to public safety [5].

When Process Replaces Judgment

Even when each subsystem of an AI application is verified in isolation (object detection, motion planning, decision logic), the system as a whole can still behave unpredictably when deployed. This gap becomes especially dangerous in autonomous systems that operate in dynamic environments.

A striking example occurred in 2023, when a Cruise autonomous vehicle in San Francisco struck a pedestrian who had already been hit by another vehicle. After coming to a stop, the robotaxi proceeded to move forward again, dragging the injured person approximately 20 feet. Technically, Cruise’s object detection and navigation systems met their design specifications. Yet the system failed to maintain awareness of the pedestrian during the critical moment of re-engagement. No formal checklist had accounted for this type of real-world recovery scenario.

The aftermath revealed not only a failure in perception continuity, but also flaws in reporting and system accountability, prompting regulatory action and the suspension of Cruise’s driverless operations. This incident illustrates how component-level verification cannot substitute for system-level judgment, especially when human safety is involved.

Standards like ISO/IEC 42001 [1] and ISO/IEC 23894 [2] explicitly emphasize this point: validation must reflect the entire system's behavior under real-world conditions, not just internal component checks. Similarly, the NIST AI Risk Management Framework [3] warns that assurance grounded solely in internal controls or paper-based V&V procedures can create a false sense of readiness.

Thinkbox

“The checklist said go. The system said stop. Nobody listened to the system.”
The Cruise incident revealed that formal validation isn’t enough. If the architecture lacks escalation triggers or post-event awareness, safety collapses, even if every box was checked. This aligns with the NIST AI RMF’s warning that compliance is not a substitute for contextual judgment.
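As a thought experiment, the missing "escalation trigger" can be pictured as a small gate sitting in front of the planner. The following is a minimal sketch of that idea only; it is not Cruise's actual architecture, and every name in it is hypothetical.

```python
# Minimal sketch of a post-event escalation gate. Illustrative only; all names hypothetical.
from dataclasses import dataclass
from enum import Enum, auto

class Mode(Enum):
    NOMINAL = auto()
    POST_COLLISION_HOLD = auto()   # vehicle must remain stationary

@dataclass
class SafetyGate:
    mode: Mode = Mode.NOMINAL

    def on_collision_detected(self) -> None:
        # Post-event awareness: an incident is never treated as "resolved" automatically.
        self.mode = Mode.POST_COLLISION_HOLD

    def may_move(self) -> bool:
        return self.mode is Mode.NOMINAL

    def operator_clears_scene(self) -> None:
        # Escalation trigger: only an explicit human decision returns the system to nominal.
        self.mode = Mode.NOMINAL

gate = SafetyGate()
gate.on_collision_detected()
assert not gate.may_move()   # planner output stays suppressed until a human intervenes
```

The point is architectural: because the gate sits outside the perception stack, losing track of the pedestrian cannot, by itself, release the hold.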

Validating Systems, Not Just Components

Avoiding the checklist trap means shifting focus from whether each part works to whether the entire system behaves safely under deployment conditions.

Table 42: System-Level Validation Techniques

| Technique | Role in Resilient Validation |
| --- | --- |
| Environment emulation | Simulate live data, diverse edge cases, and interface behaviors |
| Workflow validation | Analyze full-system behavior across inputs, users, and APIs |
| Traceability mapping | Link each test case to specific operational risks, not just specs |
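A hedged sketch of what these techniques can look like in practice, using a deliberately tiny stand-in for a driving stack. All names, including ToyPlanner and RISK-017, are invented for illustration: the test replays a scripted recovery scenario end to end and is traced to a named operational risk rather than a component spec.

```python
# Hypothetical system-level scenario test. All names are invented for illustration;
# real validation would exercise the actual stack, not this toy stand-in.

RISK_ID = "RISK-017: vehicle resumes motion while a pedestrian is trapped beneath it"

class ToyPlanner:
    """Deliberately tiny stand-in for a full perception -> planning -> control loop."""
    def __init__(self) -> None:
        self.collision_seen = False

    def step(self, observation: dict) -> float:
        """Return the commanded speed (m/s) for one timestep."""
        if observation.get("collision"):
            self.collision_seen = True
        # Desired system-level behavior: once a collision has occurred, hold position
        # even if perception no longer reports the pedestrian.
        return 0.0 if self.collision_seen else 2.0

def test_holds_position_after_collision() -> None:
    # Environment emulation: a scripted recovery scenario the component checklist never covered.
    scenario = [
        {"collision": False, "pedestrian_visible": True},
        {"collision": True,  "pedestrian_visible": True},    # struck by another vehicle
        {"collision": False, "pedestrian_visible": False},   # pedestrian now occluded under the car
    ]
    planner = ToyPlanner()
    speeds = [planner.step(obs) for obs in scenario]

    # Workflow validation: judge the end-to-end behavior, not each component's spec.
    # Traceability mapping: the assertion message names the operational risk it covers.
    assert speeds[-1] == 0.0, RISK_ID

test_holds_position_after_collision()
```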

This approach recognizes that failures rarely occur at the model level alone; they happen when system assumptions meet reality.

🎯 A “safe” model embedded in an unvalidated system can still cause harm.

Without system-level validation, teams risk treating AI readiness as a checklist exercise when it is, in fact, a multi-layered safety challenge.

Checklists can prove compliance. But trust is built by testing the system, not just its parts.
Because when failure happens in the field, no one blames the checklist; they blame the system.


  1. ISO/IEC. (2023). ISO/IEC 42001: Artificial intelligence – Management system. International Organization for Standardization. https://www.iso.org/standard/81230.html 

  2. ISO/IEC. (2022). ISO/IEC 23894: Artificial intelligence – Risk management. International Organization for Standardization. https://www.iso.org/standard/77304.html 

  3. National Institute of Standards and Technology. (2023). AI Risk Management Framework 1.0. https://www.nist.gov/itl/ai-risk-management-framework 

  4. Tesla Oracle. (2024, October 22). Tesla FSD investigated by NHTSA for 4 incidents out of 2.4 million vehicles and billions of miles driven. https://www.teslaoracle.com/2024/10/22/tesla-fsd-investigated-by-nhtsa-for-4-incidents-out-of-2-4-million-vehicles-and-billions-of-miles-driven/ 

  5. ABC7 News. (2023, November). Cruise recalls nearly 1,000 driverless vehicles after crash involving pedestrian. https://abc7news.com/cruise-recall-pedestrian-crash-dmv-driverless-car/14009486/