
3.3.1. Oversight in Action (and Inaction) - Lessons from the Real World


“The system failed. But no one knew who could stop it. Or when. Or how.”

Human oversight doesn’t fail because people are careless.
It fails when systems make intervention impossible: by design, by speed, or by omission.

When oversight lacks authority, clarity, or access to critical information, the result is not simply a governance lapse.
It’s a failure of risk routing: the right signals don’t reach the right humans in time to matter.

The following case studies reveal that oversight failures are not isolated incidents, but systemic breakdowns. For each case, we use ISO 31000 and ISO/IEC 23894 to show what should have been built in, but wasn’t.

Case Study 013: Ofqual A-Level Grading Algorithm (Location: United Kingdom | Theme: Education Algorithm Bias)

📌 Overview
In 2020, the UK’s Office of Qualifications and Examinations Regulation (Ofqual) implemented an algorithm to assign A-Level grades after exams were cancelled due to the COVID-19 pandemic. The algorithm combined teacher predictions, historical school performance, and national standardization to determine final results [4].
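To make that mechanism concrete, the sketch below (Python) shows one hypothetical, much-simplified way a cohort-level standardization could blend teacher predictions with a school’s historical grade distribution. The function names, the weighting rule, and the numbers are illustrative assumptions, not Ofqual’s actual model; the point is only to show how historical results can come to dominate teacher predictions as cohort size grows.

```python
# Hypothetical, much-simplified sketch of cohort-level standardization.
# The real Ofqual model was considerably more complex; all names, numbers,
# and the weighting rule here are illustrative assumptions.

def standardize(teacher_predictions, historical_grades, full_weight_at=15):
    """Blend teacher-predicted grades (dict: student -> grade points) with a
    school's historical grade distribution (list of grade points, best first),
    giving the historical distribution more weight for larger cohorts."""
    cohort = sorted(teacher_predictions.items(), key=lambda kv: kv[1], reverse=True)
    w_hist = min(1.0, len(cohort) / full_weight_at)  # illustrative weighting rule

    adjusted = {}
    for rank, (student, predicted) in enumerate(cohort):
        # Grade the school's historical distribution "expects" at this rank.
        hist_idx = int(rank / len(cohort) * len(historical_grades))
        expected = historical_grades[min(hist_idx, len(historical_grades) - 1)]
        adjusted[student] = round(w_hist * expected + (1 - w_hist) * predicted)
    return adjusted

# Example: a 20-student cohort, all predicted grade B (5 points), at a school
# whose previous cohort earned a broad spread of results.
predictions = {f"student_{i}": 5 for i in range(20)}
history = [6, 5, 5, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2, 1, 1]  # last year, best first
print(standardize(predictions, history))  # most students are pulled below grade B
```

In this toy example, a large cohort at a historically lower-performing school sees teacher predictions pulled down toward last year’s distribution, which is the pattern described in the challenges below.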

🚧 Challenges
The algorithm downgraded approximately 39% of predicted grades, with a higher likelihood of downgrades in larger classes and lower-performing schools. Students and teachers were not provided with individual model rationales, and there was no formal appeal process for algorithm-assigned results at the individual level.

🎯 Impact
The grading results triggered widespread student protests, public backlash, and legal threats. On August 17, 2020, the UK government reversed the decision, announcing that students would instead receive their centre-assessed grades (teacher predictions).

🛠️ Action
The algorithm was abandoned. Ofqual issued a formal statement accepting the policy change, and the government committed to reviewing the use of automated decision-making in public-sector algorithms [6].

📈 Results
The incident sparked parliamentary inquiry, reviews by the Centre for Data Ethics and Innovation, and international debate over the use of algorithms in education policy.

Case Study: Ofqual A-Level Grading Algorithm (UK) [4]

During the COVID-19 pandemic, Ofqual introduced an algorithm to predict student exam grades. It did the math, but missed the meaning.

The Outcome:

  • Nearly 40% of grades downgraded
  • Disproportionate harm to lower-income students
  • Public outcry and policy reversal within a week

Oversight Breakdown:
Teachers were disempowered, unable to contest or explain outcomes.
No mechanisms existed to detect or prevent systemic demographic harm.

Standards like ISO 31000 and ISO/IEC 23894 provide structured steps that could have prevented these blind spots.

Table 20: Oversight Gaps and Framework Remedies (Case Study 013)

| Oversight Gap | Risk Framework | Remedy |
| --- | --- | --- |
| No override mechanism | ISO 31000, Clause 6.5 | Require fallback interventions before release |
| No explanation for predictions | ISO/IEC 23894, Clause 6.5.3 | Mandate feature attribution and transparency |
| No demographic impact testing | ISO 31000, Clause 6.3.1 | Map contextual and stakeholder risks |
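As an illustration of how the “No demographic impact testing” remedy could be operationalized, the sketch below (Python) gates a release on the gap in downgrade rates between groups. The threshold, group labels, record schema, and function names are assumptions made for this sketch, not values prescribed by ISO 31000 or ISO/IEC 23894.

```python
# Illustrative pre-release gate for the "No demographic impact testing" gap.
# Thresholds, group labels, record fields, and function names are assumptions
# made for this sketch, not values prescribed by ISO 31000 or ISO/IEC 23894.

def downgrade_rate(records):
    """Share of records whose final grade fell below the teacher prediction."""
    downgraded = sum(1 for r in records if r["final"] < r["predicted"])
    return downgraded / len(records) if records else 0.0

def release_gate(records_by_group, max_gap=0.10):
    """Block release if downgrade rates differ across groups by more than max_gap."""
    rates = {group: downgrade_rate(recs) for group, recs in records_by_group.items()}
    gap = max(rates.values()) - min(rates.values())
    if gap > max_gap:
        return False, f"Blocked: downgrade-rate gap {gap:.0%} exceeds {max_gap:.0%} {rates}"
    return True, f"Cleared: downgrade-rate gap {gap:.0%} within tolerance {rates}"

# Synthetic example grouped by school type.
groups = {
    "selective": [{"predicted": 6, "final": 6}] * 90 + [{"predicted": 6, "final": 5}] * 10,
    "non-selective": [{"predicted": 5, "final": 5}] * 55 + [{"predicted": 5, "final": 4}] * 45,
}
released, message = release_gate(groups)
print(message)  # Blocked: downgrade-rate gap 35% exceeds 10% ...
```

The design point is that the check runs before release and can block it, rather than measuring harm after results have already been issued.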
Note

Designing for fairness means more than mathematical parity.
It means empowering stakeholders to intervene when systems fail.

Case Study 014: COMPAS Recidivism Risk Tool (Location: United States | Theme: Algorithmic Fairness in Criminal Justice)

📌 Overview
The Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) is a risk assessment tool used in U.S. courts to predict a defendant’s likelihood of recidivism. It has been used in pretrial decisions, parole recommendations, and sentencing in several U.S. states.

🚧 Challenges
A 2016 investigation by ProPublica analyzed risk scores for more than 7,000 people arrested in Broward County, Florida, and found that Black defendants were nearly twice as likely as white defendants to be incorrectly labeled high-risk for reoffending. The tool’s methodology and feature weights were proprietary and not publicly disclosed, limiting transparency [2].
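A minimal sketch of the kind of error-rate audit ProPublica performed is shown below (Python). The records are synthetic, with counts chosen only to mirror the roughly two-to-one false-positive disparity described above; field names and data are assumptions, not ProPublica’s dataset.

```python
# Minimal sketch of the kind of error-rate audit ProPublica performed, run on
# synthetic records. Field names and counts are assumptions, chosen only to
# mirror the roughly two-to-one false-positive disparity described above.

def false_positive_rate(records):
    """Share of people who did not reoffend but were labeled high risk."""
    did_not_reoffend = [r for r in records if not r["reoffended"]]
    flagged = sum(1 for r in did_not_reoffend if r["high_risk"])
    return flagged / len(did_not_reoffend) if did_not_reoffend else 0.0

def fpr_by_group(records, group_key="race"):
    """False-positive rate per demographic group."""
    groups = {}
    for r in records:
        groups.setdefault(r[group_key], []).append(r)
    return {g: false_positive_rate(recs) for g, recs in groups.items()}

# Synthetic example only; not ProPublica's data.
records = (
    [{"race": "Black", "high_risk": True,  "reoffended": False}] * 45
    + [{"race": "Black", "high_risk": False, "reoffended": False}] * 55
    + [{"race": "White", "high_risk": True,  "reoffended": False}] * 23
    + [{"race": "White", "high_risk": False, "reoffended": False}] * 77
)
print(fpr_by_group(records))  # {'Black': 0.45, 'White': 0.23}
```

An audit of this kind only requires outcomes and labels, not access to the proprietary model internals, which is what made ProPublica’s external analysis possible.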

🎯 Impact
Defendants could not fully contest or interpret their risk scores in court. Judges relied on COMPAS outputs despite lack of explainability or individualized justification. The findings sparked national debate over the use of black-box algorithms in the justice system.

🛠️ Action
Northpointe (now Equivant), the developer of COMPAS, defended the model’s validity but did not release its internal logic. The case prompted calls for greater transparency, explainability, and legal standards in algorithmic decision-making.

📈 Results
The COMPAS controversy remains one of the most cited examples in AI ethics and law. It has influenced research on algorithmic bias, due process, and the need for auditable, explainable AI systems in public-sector decision-making [7].

Case Study: COMPAS Recidivism Risk Tool (USA) [2]

The COMPAS algorithm predicted who might reoffend, but no one could explain how.

The Outcome:

  • Black defendants nearly twice as likely as white defendants to be falsely flagged as high risk
  • Judges relied on opaque scores with no legal contestability
  • Defendants had no access to model reasoning

Oversight Breakdown:
Judicial stakeholders were blindfolded. No transparency, no audit hooks, and no appeal process.

Risk frameworks would have made these failures visible before deployment.

Table 21: Oversight Gaps and Framework Remedies (Case Study 014)

| Oversight Gap | Risk Framework | Remedy |
| --- | --- | --- |
| Black-box decision logic | ISO/IEC TR 24028:2020, Clause 9.2 | Require explainability for high-risk AI systems |
| No challenge or appeal | ISO 31000, Clause 6.5 | Map procedural risks to enable contestability |
| No audit mechanism | ISO/IEC 42001, Clause 9.1 | Implement ongoing oversight with third-party review |
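One way to read the “No audit mechanism” row is as a concrete logging and retrieval requirement. The sketch below (Python) appends an auditable record for every score presented to a decision-maker and lets an appeal retrieve exactly what was seen. The record schema, file name, and function names are assumptions for illustration, not text taken from ISO/IEC 42001.

```python
# Sketch of the audit-trail and contestability hooks named in the table above.
# The record schema, file name, and function names are assumptions for
# illustration, not text taken from ISO/IEC 42001.

import json
import time
import uuid

AUDIT_LOG = "risk_score_audit.jsonl"  # hypothetical append-only log

def log_scored_decision(defendant_id, features, score, model_version):
    """Append an auditable record for every score shown to a decision-maker."""
    record = {
        "audit_id": str(uuid.uuid4()),  # reference a defendant can cite in an appeal
        "timestamp": time.time(),
        "defendant_id": defendant_id,
        "model_version": model_version,
        "inputs": features,             # the inputs exactly as the model saw them
        "score": score,
    }
    with open(AUDIT_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["audit_id"]

def records_for_appeal(audit_id):
    """Retrieve the logged record behind a contested decision."""
    with open(AUDIT_LOG) as f:
        records = [json.loads(line) for line in f]
    return [r for r in records if r["audit_id"] == audit_id]

# Usage: log at scoring time, hand the audit_id to the defendant, and resolve
# any appeal from the same record the court relied on.
ref = log_scored_decision("D-1042", {"age": 27, "priors": 1}, score=7, model_version="v2.1")
print(records_for_appeal(ref)[0]["score"])  # 7
```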
Note

An algorithm cannot be fair if no one can challenge its judgment.

Case Study 015: Google Ads and Job Targeting (Location: United States | Theme: Algorithmic Bias in Online Advertising)

📌 Overview
In 2015, researchers from Carnegie Mellon University and the International Computer Science Institute conducted a study using AdFisher, a tool designed to simulate user behavior and measure ad delivery differences. The study found that Google Ads showed significantly fewer high-paying job advertisements to female users compared to male users, despite identical simulated browsing behavior [5].

🚧 Challenges
The ad delivery process operated as a black box, with limited visibility into how user profiles and advertiser settings interacted. Researchers were unable to determine whether the bias originated from Google’s ad-serving algorithm, advertiser targeting choices, or user feedback loops.

🎯 Impact
The study revealed potential gender discrimination in employment-related advertising. The findings raised concerns about unintentional bias in automated ad delivery and lack of transparency in algorithmic decision-making systems.

🛠️ Action
Google did not disclose specific corrective measures in response to the study. The issue contributed to broader public and regulatory scrutiny of algorithmic bias in online platforms.

📈 Results
The case remains a frequently cited example in academic and policy discussions on bias and fairness in targeted advertising systems. It has influenced ongoing debate around auditability, accountability, and transparency in digital ad ecosystems [6].

Case Study: Google Ads and Job Targeting (USA) [3]

Google’s ad algorithm showed high-paying job ads to men more often than women, without anyone noticing.

The Outcome:

  • Gender bias in ad delivery
  • Lack of visibility into targeting mechanisms
  • No internal audit flagged the discrimination

Oversight Breakdown:
This was a case of invisible harm. Users had no idea they were excluded.
Advertisers couldn’t inspect or fix what went wrong.

Here’s how technical governance could have caught it.

Table 22: Oversight Gaps and Framework Remedies (Case Study 015)

| Oversight Gap | Risk Framework | Remedy |
| --- | --- | --- |
| No demographic auditing | ISO/IEC 23894, Clause 6.6 | Require fairness testing at system level |
| No rationale traceability | ISO/IEC 42001, Clause 9.1 | Enforce full input-output logging for transparency |
| No external review | ISO 31000, Clause 6.5 | Treat bias breaches as risk triggers requiring intervention |
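To show what system-level demographic auditing could look like in practice, the sketch below (Python) runs a two-proportion z-test on ad-delivery rates for two simulated profile groups, in the spirit of the AdFisher methodology. The counts, significance threshold, and function names are illustrative assumptions, not figures from the study.

```python
# Sketch of a system-level demographic delivery audit in the spirit of the
# AdFisher methodology: compare how often two simulated profile groups were
# shown a given ad category. Counts, threshold, and names are illustrative.

from math import erf, sqrt

def two_proportion_z(shown_a, total_a, shown_b, total_b):
    """Two-sided two-proportion z-test on ad-delivery rates for two groups."""
    p_a, p_b = shown_a / total_a, shown_b / total_b
    p_pool = (shown_a + shown_b) / (total_a + total_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal approximation
    return p_a, p_b, z, p_value

# Hypothetical audit: 1,000 simulated "male" and 1,000 simulated "female"
# profiles, counting impressions of high-paying job ads.
p_m, p_f, z, p = two_proportion_z(shown_a=180, total_a=1000, shown_b=90, total_b=1000)
if p < 0.01:
    print(f"Flag for review: delivery rates {p_m:.1%} vs {p_f:.1%} (z = {z:.1f}, p = {p:.2g})")
```

A check like this treats a statistically significant delivery gap as a risk trigger for human review, which is the intent of the “No external review” remedy above.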
Note

If your oversight system can’t see exclusion, it’s part of the problem.

Oversight Performance Metrics (What to Monitor)

Just as we monitor model accuracy, we must track oversight effectiveness.

Table 23: Oversight Effectiveness Metrics

| Metric | Description |
| --- | --- |
| Intervention Rate | % of AI decisions overridden by humans (low = rubber-stamping; high = misalignment) |
| Time-to-Override | How quickly humans can act on risky AI outputs |
| Explanation Utilization | % of times humans accessed system explanations before action |
| Appeal Frequency | # of downstream challenges to AI-influenced decisions |
| Escalation Ratio | # of human overrides escalated for further review |
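A minimal sketch of how these metrics could be computed from an oversight event log is shown below (Python). The event schema, field names, and the reading of “Escalation Ratio” as escalated overrides divided by total overrides are assumptions made for the sketch.

```python
# Sketch of computing the oversight metrics above from a per-decision event log.
# The event schema and field names are assumptions for illustration.

def oversight_metrics(events):
    """events: one dict per AI-assisted decision, e.g.
    {"overridden": True, "override_latency_s": 42.0, "explanation_viewed": True,
     "appealed": False, "escalated": False}"""
    if not events:
        return {}
    n = len(events)
    overrides = [e for e in events if e["overridden"]]
    latencies = sorted(e["override_latency_s"] for e in overrides
                       if e.get("override_latency_s") is not None)
    return {
        "intervention_rate": len(overrides) / n,
        "median_time_to_override_s": latencies[len(latencies) // 2] if latencies else None,
        "explanation_utilization": sum(e["explanation_viewed"] for e in events) / n,
        "appeal_frequency": sum(e["appealed"] for e in events),
        # Assumed reading: share of overrides escalated for further review.
        "escalation_ratio": (sum(e["escalated"] for e in overrides) / len(overrides))
                            if overrides else None,
    }
```

Tracked over time, a falling intervention rate combined with low explanation utilization is an early warning of rubber-stamping.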

Final Reflection

AI oversight only works when:

  • Humans are treated as final arbiters, not passive witnesses
  • Systems are built to surface risk, not obscure it
  • Interfaces are designed for action, not illusion

TRAI Challenges: Oversight That Works vs. Oversight That Fails


📘 Scenario:
Choose one real case from Section 3.3 (A-Level grading, COMPAS, or Google Ads). Each system failed, but the oversight gaps differed.


🎯 Your Task:
  • Describe what went wrong with oversight in that case
  • Map the failure to one or more missing elements of the 3Cs of Oversight
  • Suggest one design fix using ISO 31000, ISO/IEC 23894, or EU AI Act Article 14

In the next section, we shift from individual oversight to system-level accountability:

  • How do organizations, not just people, own the outcomes of the AI they build and deploy?
  • What structures ensure that responsibility scales with complexity?

Bibliography


  1. Heaven, W. D. (2020). Why it’s hard to hold algorithms accountable for bias. MIT Technology Review. https://www.technologyreview.com/2016/11/17/155957/how-to-hold-algorithms-accountable/ 

  2. Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016). Machine bias: There’s software used across the country to predict future criminals. And it’s biased against blacks. ProPublica. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing 

  3. Datta, A., Tschantz, M. C., & Datta, A. (2015). Automated experiments on ad privacy settings: A tale of opacity, choice, and discrimination. Proceedings on Privacy Enhancing Technologies, 2015(1), 92–112. https://doi.org/10.1515/popets-2015-0007 

  4. Ofqual. (2020). Standardising grades: Ofqual's approach. https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/945802/Standardisation_of_grades_in_general_qualifications_in_summer_2020_-_outliers.pdf 

  5. Datta, A., Tschantz, M. C., & Datta, A. (2015). Automated Experiments on Ad Privacy Settings. Proceedings on Privacy Enhancing Technologies. https://doi.org/10.1515/popets-2015-0007 

  6. Barocas, S., Hardt, M., & Narayanan, A. (2019). Fairness and Machine Learning: Limitations and Opportunities. https://fairmlbook.org 

  7. USACM. (2017). Statement on Algorithmic Transparency and Accountability. Association for Computing Machinery. https://www.acm.org/binaries/content/assets/public-policy/2017_usacm_statement_algorithms.pdf