Who’s Responsible for a Dataset After It’s Created?¶
"A trustworthy dataset isn’t just built once. It is cultivated, versioned, and watched, like infrastructure."
Many conversations about fairness and auditability focus on model outputs or initial data collection. But in real-world pipelines, the most common point of failure lies elsewhere: in what happens between. Between the moment data is collected and the moment it’s used, it passes through layers of transformation, augmentation, labeling, and filtration. If those stages are undocumented or ungoverned, trust can quietly erode, regardless of how ethical the original intent may have been.
Where Trust Breaks Down¶
Imagine a dataset collected from hospital admissions. Initially, it’s complete and representative. But over time:
- Sensitive features (e.g., ethnicity, insurance type) are removed during anonymization.
- Some records are synthetically expanded using generative tools, but without clear documentation.
- Labels are corrected based on shifting diagnostic standards, but version control is loose.
Eventually, a model trained on this data performs poorly for specific subgroups. Yet developers can’t trace back what changed, or why.
The failure here wasn't malicious. It was a failure of lineage and stewardship.
Below, we examine the key elements that sustain trust across the data lifecycle.
Figure: Trust or risk levels (e.g., "Looks Good", "Medium Risk", "High Risk") flowing across interconnected data nodes in a pipeline, showing how upstream data issues can affect downstream outputs. (Source)
1. Lineage: Seeing How Data Evolved¶
Data lineage refers to the ability to track a dataset’s journey: where it came from, how it was changed, and who changed it. Without this traceability, it’s impossible to perform ethical audits, reproduce decisions, or unwind errors.
ISO/IEC 5259-3:2024, Clause 7.3.4.4 – Data Handling
Requires organizations to establish processes for traceability of data origin, modifications, and version control.
This effectively covers lineage and provenance, ensuring that dataset evolution (sources, processing logs, and transformations) is documented.
➤ These practices directly support transparency and audit readiness.1
A dataset without lineage is a black box. And a black box cannot be governed.
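The traceability that ISO/IEC 5259-3 calls for can be approximated even in small pipelines. The sketch below, using hypothetical field names, records a content hash of a dataset before and after each transformation, so that any later audit can confirm what changed, when, and by whom:

```python
import hashlib
import json
from datetime import datetime, timezone

def fingerprint(records):
    """Content hash of a list of records; identical data yields an identical hash."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

def log_transformation(lineage, records_before, records_after, step, author):
    """Append one auditable lineage entry describing a single transformation."""
    lineage.append({
        "step": step,
        "author": author,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "hash_before": fingerprint(records_before),
        "hash_after": fingerprint(records_after),
        "rows_before": len(records_before),
        "rows_after": len(records_after),
    })
    return lineage

# Example: an anonymization step that drops a sensitive field
raw = [{"id": 1, "ethnicity": "A", "dx": "flu"},
       {"id": 2, "ethnicity": "B", "dx": "asthma"}]
anon = [{k: v for k, v in r.items() if k != "ethnicity"} for r in raw]

lineage = log_transformation([], raw, anon,
                             step="drop ethnicity (anonymization)",
                             author="curator@example.org")
```

Because the hashes are derived from content, a reviewer can verify that a logged step actually produced the dataset in hand, rather than trusting free-text notes.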
2. Curation: More Than Maintenance¶
Lineage tells you what changed. Curation explains why. Ethical dataset stewardship requires curators to:
- Regularly assess relevance and bias across subgroups
- Document every change, including what was removed or added
- Track who performed the curation and under what assumptions

Figure: How different data nodes in a curation workflow propagate varying risk levels, highlighting where interventions like audits, re-labeling, or traceability checks are most needed. (Source)
“Curation is governance made visible.” (paraphrased from ISO/IEC 38505-1 on data governance accountability2)
Unfortunately, many AI teams treat datasets as static assets. There’s a download step, maybe a filter, and then years of reuse. But like software, datasets evolve, and need changelogs, owners, and update schedules.
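The "changelogs, owners, and update schedules" idea can be made concrete with a minimal data structure. This is an illustrative sketch, not a prescribed schema; all names are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class ChangeEntry:
    """One curated change: what was done, why, and by whom."""
    version: str
    reason: str   # why the change was made
    change: str   # what was added or removed
    curator: str  # who performed it

@dataclass
class DatasetChangelog:
    name: str
    entries: list = field(default_factory=list)

    def record(self, version, reason, change, curator):
        self.entries.append(ChangeEntry(version, reason, change, curator))

    def history(self):
        """Human-readable audit trail, newest last."""
        return [f"{e.version}: {e.change} ({e.reason}) by {e.curator}"
                for e in self.entries]

log = DatasetChangelog("hospital-admissions")
log.record("1.1.0", "shifting diagnostic standards", "relabeled 312 records", "alice")
log.record("1.2.0", "privacy review", "removed insurance_type column", "bob")
```

Even this much structure answers the questions the hospital scenario above could not: what changed, why, and who is accountable.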
3. Data Tracking¶
As synthetic datasets (artificially created data that mirrors the statistical characteristics of real data) become more common, particularly for vision-language tasks, researchers have raised concerns about drift introduced through self-reinforcing loops. One documented risk arises when captioning models are trained on synthetic outputs originally generated by other models, leading to circular learning.
The loop typically looks like this:
- Image-text pairs are generated using automated captioning systems
- These captioning models were themselves trained on scraped or synthetic corpora (e.g., LAION-5B3)
- Over time, the model begins to overfit to its own stylistic biases and ignore natural linguistic diversity

Figure: Critical functions performed by data stewards, including identity management, compliance monitoring, policy communication, and security culture enhancement. Data stewards serve as a bridge between governance policies and operational execution. (Source: Digital Curation Centre)
Lesson from Drift
Reusing model-generated captions without source diversity or provenance tracking can create a feedback loop of linguistic uniformity and conceptual narrowing, especially in visual tasks.
This kind of distributional drift often remains hidden unless metadata and data lineage tracking are in place to surface it.
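One cheap way to surface this kind of drift is to compare the lexical diversity of model-generated captions against a human-written reference set. The sketch below uses a type-token ratio as a crude proxy; the threshold and function names are illustrative assumptions, not an established metric:

```python
def type_token_ratio(captions):
    """Crude lexical-diversity proxy: unique tokens / total tokens."""
    tokens = [t for c in captions for t in c.lower().split()]
    return len(set(tokens)) / len(tokens)

def drift_alert(reference, generated, min_ratio=0.8):
    """Flag drift when generated captions are much less diverse than the reference."""
    ref = type_token_ratio(reference)
    gen = type_token_ratio(generated)
    return (gen / ref) < min_ratio, ref, gen

human = ["a dog runs across a muddy field",
         "two children share an umbrella in the rain",
         "an old tram passes a flower market"]
model = ["a dog in a field", "a dog in a park",
         "a dog in a field", "a cat in a field"]

flagged, ref_ttr, gen_ttr = drift_alert(human, model)
```

Here the model captions collapse onto a narrow template, so the diversity ratio drops sharply and the check fires. In production one would use richer measures (embedding dispersion, n-gram entropy), but the principle is the same: drift stays invisible unless something is measuring it.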
4. Stewardship: The Human Layer¶
Curation isn’t just about pipelines. It’s about responsibility. Someone must own:
- Dataset versioning and update protocols
- Ethical review checkpoints
- Communication of known limitations to users and auditors
Data Steward Role
ISO/IEC 5259-3 outlines the role of dataset stewards who are responsible for continuous dataset quality monitoring and lineage capture.
➤ These roles are critical for regulated sectors like healthcare and finance1.
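A steward's responsibilities can be partly encoded as a machine-readable "dataset card" that travels with the data. The field names below are illustrative assumptions, not part of ISO/IEC 5259-3:

```python
import json

# Hypothetical dataset card a steward might maintain alongside the data;
# every field name here is illustrative.
def make_dataset_card(name, version, steward, limitations, review_due):
    card = {
        "name": name,
        "version": version,
        "steward": steward,                 # accountable owner
        "known_limitations": limitations,   # communicated to users and auditors
        "next_ethical_review": review_due,  # scheduled review checkpoint
    }
    return json.dumps(card, indent=2)

card = make_dataset_card(
    name="hospital-admissions",
    version="1.2.0",
    steward="data-governance@example.org",
    limitations=["underrepresents pediatric cases",
                 "pre-2020 labels use older diagnostic codes"],
    review_due="2025-06-01",
)
```

Because the card is plain JSON, it can be versioned with the dataset itself, so auditors see the limitations that applied to the exact version they are reviewing.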
Synthetic Data as a Curation Strategy
In cases where original data cannot be retained due to privacy concerns, licensing limits, or underrepresentation of key groups, synthetic data can serve as an effective remediation tool.
- Privacy Protection: Techniques like GANs, VAEs, or diffusion models generate records that maintain statistical patterns without exposing PII4.
- Fairness Repair: Minority feature combinations (e.g., “non-binary veterans”) can be created to correct data gaps.
- Regulatory Compliance: Synthetic datasets help satisfy constraints under GDPR, HIPAA, or PIPA, where real data use is limited.
📊 According to the U.S. National Institute of Standards and Technology (NIST), synthetic data must be evaluated for statistical fidelity, privacy leakage risk, and downstream utility5.
While synthetic data is not a cure-all, it offers a flexible option for safe, inclusive dataset refinement when the original is inaccessible or ethically problematic.
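The statistical-fidelity evaluation NIST describes can be illustrated with a simple check: compare the marginal distribution of a key categorical feature between real and synthetic records using total variation distance. This is a minimal sketch with an illustrative threshold, not a complete fidelity audit:

```python
from collections import Counter

def marginal(values):
    """Empirical distribution of a categorical feature."""
    n = len(values)
    return {k: v / n for k, v in Counter(values).items()}

def total_variation(real, synth):
    """TV distance between two categorical marginals: 0 = identical, 1 = disjoint."""
    p, q = marginal(real), marginal(synth)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

# Toy diagnosis-code marginals for real vs. synthetic records
real_dx = ["flu"] * 60 + ["asthma"] * 30 + ["copd"] * 10
synth_dx = ["flu"] * 55 + ["asthma"] * 35 + ["copd"] * 10

tv = total_variation(real_dx, synth_dx)
acceptable = tv < 0.10  # illustrative threshold, not a standard
```

A full evaluation would also probe joint distributions, privacy leakage (e.g., nearest-neighbor distance to real records), and downstream model utility, but marginal checks like this catch the grossest fidelity failures early.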
Why This Matters¶
A dataset’s trustworthiness is not fixed at creation. It must be defended, documented, and improved over time. That means:
- Tracking how data is reshaped during preprocessing or augmentation
- Creating version histories with reasons for changes
- Establishing custodial roles for ethical oversight
Stewardship and maintenance are essential; they are how trust is put into practice rather than merely asserted.
In high-impact domains like housing, healthcare, or social services, trust isn't just a principle, it's a requirement. Models trained on flawed data can exclude families, misallocate benefits, or reinforce systemic bias.
In the next section, we apply everything learned so far through a structured audit challenge.
Bibliography¶
1. ISO/IEC 5259-3:2024(E). Artificial intelligence – Data quality for analytics and machine learning – Part 3: Data quality management process.
2. ISO/IEC 38505-1:2017. Governance of data – Part 1: Application of ISO/IEC 38500 to the governance of data.
3. Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., ... & Jitsev, J. (2022). LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35, 25278–25294.
4. Jordon, J., et al. (2022). Synthetic Data Generation for Privacy-Preserving Machine Learning: A Review. Journal of Privacy and Confidentiality.
5. NIST. (2021). NISTIR 8219: A Taxonomy and Terminology of Adversarial Machine Learning. https://doi.org/10.6028/NIST.IR.8219