
4.1.2. Who Owns the Data—And How Do We Prove It?


As language models grow more fluent, their outputs often impress, seamlessly mimicking styles, summarizing documents, or generating responses that feel intuitive and informed. But the real question isn’t how fluently they speak. It’s whose voice they’re echoing, and whether that voice ever agreed to be part of the conversation.

Fluency without consent is not innovation.
It is extraction.

In the pursuit of ever-larger training datasets, many developers have treated the internet as an ethical grey zone. If content is online, it is treated as fair game. But this assumption is crumbling. Legal action, regulatory scrutiny, and public backlash are mounting. People are asking:
Who owns the words I’ve written? Who gave permission for them to be used to train artificial intelligence?

In traditional contexts like healthcare or finance, consent is bounded and specific: you authorize your data for a clearly stated purpose. But in large-scale LLM training, the consent chain is often broken or invisible:

Table 29: Where Consent Breaks Down in AI Training Pipelines

| Problem Point | Real-World Example |
| --- | --- |
| Author Consent Missing | Books, blogs, or code used without creator permission |
| User Privacy Breached | Private messages, forums, or health content scraped and reused |
| Developer Blindness | Training on datasets with no record of source or licensing |

This lack of traceability is not just poor practice. It’s a systemic governance failure. It opens the door to:

  • Copyright infringement
  • Privacy violations
  • Ethical overreach at industrial scale

Under the (1) General Data Protection Regulation (GDPR), data processing requires a valid legal basis. Similarly, Korea’s (2) Personal Information Protection Act (PIPA) requires clear consent for the collection and use of personal data, with specific limitations on reuse and cross-border transfers.

  1. EU law regulating personal data protection, requiring legal basis and consent for processing.
  2. Korea’s personal data protection law requiring purpose-limited consent and safeguards.

Both laws emphasize that public availability does not imply permission. Under GDPR Article 7 and PIPA Article 15, consent must be:

  • Freely given
  • Specific and informed
  • Unambiguous
  • Documented (provable)

❗ Scraping personal data from public websites still counts as data processing and therefore triggers legal obligations, including when that data is reused for AI model training.

Case Study 014: The New York Times vs. OpenAI (Location: U.S. | Theme: Copyright & Consent Failure)

📌 Overview:
In December 2023, The New York Times filed a federal lawsuit against OpenAI and Microsoft, alleging that GPT-4 was trained on millions of paywalled Times articles without permission and could reproduce verbatim passages.

🚧 Challenges:
OpenAI’s training pipelines lacked mechanisms to identify or exclude licensed, paywalled content: no provenance tags, consent checks, or content-owner opt-out processes were in place.

🎯 Impact:
The suit sought damages and an injunction barring the use of Times content in AI training, triggering publisher demands for transparent consent frameworks and legal accountability in dataset curation.

🛠️ Action:
OpenAI removed major scraped book and article datasets (e.g., Books1/Books2), initiated licensing negotiations with publishers, and announced plans for an opt-out registry for content owners.

📈 Results:
The litigation remains ongoing, but it catalyzed industry adoption of metadata-driven consent systems and spurred proposals for a universal rights-management registry for AI training data.
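The technical details of these remediation steps have not been published, but the basic shape of a metadata-driven consent check is easy to sketch. The Python fragment below is a minimal, hypothetical illustration: each candidate document carries a provenance record, and an opt-out registry is consulted before the document is admitted to a training corpus. The `ProvenanceRecord` and `OptOutRegistry` types, the field names, and the allow-list of licenses are assumptions made for illustration, not a description of OpenAI’s actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class ProvenanceRecord:
    """Minimal provenance metadata attached to each candidate training document (hypothetical)."""
    source_url: str            # where the document was obtained
    license: str | None        # e.g. "CC-BY-4.0", "proprietary", or None if unknown
    rights_holder: str | None  # named owner, if known


class OptOutRegistry:
    """Hypothetical registry of domains whose owners declined AI training use."""

    def __init__(self, opted_out_domains: set[str]):
        self._opted_out = opted_out_domains

    def opted_out(self, record: ProvenanceRecord) -> bool:
        # Crude domain extraction, sufficient for a sketch.
        domain = record.source_url.split("/")[2] if "://" in record.source_url else record.source_url
        return domain in self._opted_out


# Only documents with an explicitly permissive, known license are eligible (assumed policy).
ALLOWED_LICENSES = {"CC0-1.0", "CC-BY-4.0", "explicitly-licensed"}


def admit_to_corpus(record: ProvenanceRecord, registry: OptOutRegistry) -> bool:
    """Admit a document only if its license is known and permissive and the owner has not opted out."""
    if record.license not in ALLOWED_LICENSES:
        return False  # unknown or restrictive license: excluded by default
    if registry.opted_out(record):
        return False  # rights holder has opted out of AI training
    return True


# Usage: a paywalled news article with no compatible license is rejected.
registry = OptOutRegistry({"www.nytimes.com"})
article = ProvenanceRecord("https://www.nytimes.com/2023/some-article", "proprietary", "The New York Times")
assert admit_to_corpus(article, registry) is False
```

The important design choice is the default: a document whose licensing cannot be verified is excluded until its status is resolved, rather than included on the assumption that no one will object.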

Case Spotlight: The New York Times vs. OpenAI

A high-profile example of this failure surfaced in December 2023, when The New York Times filed a lawsuit against OpenAI and Microsoft. The case revealed that GPT-4 had been trained on millions of Times articles without permission, evidenced by the model reproducing paywalled content nearly verbatim. OpenAI could not confirm whether those articles were intentionally included or even trace their origin, because the datasets lacked proper metadata or lineage logs.

This wasn't just a copyright violation; it was a wake-up call. Without traceability, developers were unable to audit their own training sets. They couldn't distinguish licensed from unlicensed content, nor prevent exposure of proprietary or restricted materials.

Far from being a one-off incident, the lawsuit became part of a wider backlash. Around the same time:

  • OpenAI removed its Books1 and Books2 datasets after lawsuits from authors [1].
  • LinkedIn was criticized for using private messages in training without user knowledge [2].
  • Italy fined OpenAI €15 million for GDPR violations tied to unconsented data processing [3].

Nor did the backlash stop there. Around the same time, a second controversy emerged, this time involving OpenAI’s Whisper model and YouTube creators.

Case Study 015: Whisper & YouTube Transcripts (Location: Global | Theme: Consent & Privacy Failure)

📌 Overview:
In 2024, OpenAI’s Whisper model transcribed over a million hours of YouTube videos, many published under restrictive Creative Commons terms or without any clear license, and those transcripts were used to fine-tune GPT-4 without uploader consent.

🚧 Challenges:
There was no system to track video licensing or obtain permission from content creators. Transcripts carried no metadata about original copyright or usage restrictions.

🎯 Impact:
Creators discovered their work being used in AI outputs without attribution or compensation, sparking creator backlash and formal complaints to privacy regulators over unauthorized data processing.

🛠️ Action:
OpenAI paused using Whisper-derived transcripts from unlicensed videos, began tagging transcripts with licensing metadata, and explored opt-in content submission programs for willing creators.

📈 Results:
The incident prompted major platforms to review and strengthen their API terms, and it led to the development of consent-centric transcription services that enforce licensing constraints at ingestion.

Case Spotlight: Whisper and YouTube Transcripts

In 2024, it was revealed that OpenAI’s Whisper model had transcribed over a million hours of YouTube videos. Many of these videos were either licensed under Creative Commons or lacked any clear licensing information. These transcriptions, including creators’ voices, scripts, and personal expressions, were later used to fine-tune GPT-4 [4].

The core issue wasn’t just data collection. It was invisible extraction at scale, done without consent, disclosure, or auditability.
  • Some creators had explicitly restricted redistribution or monetization.
  • Others had no idea their content had been absorbed into model training.
  • In most cases, no metadata was preserved to track source identity, ownership rights, or intended use.

Whisper may be powerful, but without governance, it became a silent thief.
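What would the missing metadata have looked like? The sketch below is a hedged illustration only: it attaches a provenance record to each transcript at ingestion, covering the source video, the uploader, the license, and the uses that license permits, plus a check that refuses any purpose the record does not cover. The schema, field names, and `Use` categories are assumptions for illustration; they do not describe Whisper’s or YouTube’s actual data structures.

```python
from dataclasses import dataclass, field
from enum import Enum


class Use(Enum):
    TRANSCRIPTION = "transcription"
    MODEL_TRAINING = "model_training"
    REDISTRIBUTION = "redistribution"


@dataclass
class TranscriptRecord:
    """Hypothetical provenance metadata kept alongside each transcript at ingestion."""
    video_id: str
    uploader: str
    license: str                              # e.g. "CC-BY-NC-4.0", "youtube-standard", "unknown"
    permitted_uses: set[Use] = field(default_factory=set)
    consent_reference: str | None = None      # pointer to an explicit consent record, if any


def may_use(record: TranscriptRecord, purpose: Use) -> bool:
    """A transcript may be used only for purposes its license, or an explicit consent, covers."""
    if purpose in record.permitted_uses:
        return True
    return record.consent_reference is not None  # explicit uploader consent can extend the license


# A CC-BY-NC video permits transcription for accessibility, but not model training.
clip = TranscriptRecord(
    video_id="abc123",
    uploader="creator_handle",
    license="CC-BY-NC-4.0",
    permitted_uses={Use.TRANSCRIPTION},
)
assert may_use(clip, Use.MODEL_TRAINING) is False
```

Even a record this small would have let developers answer the questions regulators and creators later asked: where did this transcript come from, who owns it, and what did they agree to?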

What Went Wrong: The Privacy and Traceability Breakdown

This incident illustrates three fundamental failures:

Table 30: Traceability and Consent Failures in the Whisper Case

| Failure Type | Description |
| --- | --- |
| Lack of Consent | No system to confirm creators allowed reuse of their content |
| No Traceability | Transcripts not linked to original source or license |
| No Risk Controls | No filtering or safeguards against use of protected or sensitive content |

This isn't just a technical lapse; it’s a governance failure that violates principles enshrined in:

  • GDPR (Articles 6 & 7): Consent must be specific, informed, and documented
  • PIPA (Korea) (Articles 15 & 17): Requires prior and purpose-specific consent, including for third-party reuse
  • ISO/IEC 27701 & 5259: Mandate structured metadata for traceability, consent logs, and privacy assurance

A model trained on untraceable data is unaccountable by design.
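In practice, "documented (provable)" consent implies an audit trail that can be replayed long after training has finished. The sketch below is a minimal, hypothetical consent log in Python: each entry records who consented, for what purpose and scope, and when, and entries are hash-chained so the record is tamper-evident. It illustrates the shape of the requirement, not any specific product or the exact controls ISO/IEC 27701 prescribes.

```python
import hashlib
import json
from datetime import datetime, timezone


class ConsentLog:
    """Hypothetical append-only, hash-chained log of consent events (a sketch, not a certified control)."""

    def __init__(self):
        self._entries: list[dict] = []

    def record_consent(self, subject_id: str, purpose: str, scope: str) -> dict:
        """Append a consent event; each entry embeds the hash of the previous one."""
        prev_hash = self._entries[-1]["entry_hash"] if self._entries else "genesis"
        entry = {
            "subject_id": subject_id,   # who consented (author, uploader, data subject)
            "purpose": purpose,         # e.g. "inclusion in LLM training corpus v3"
            "scope": scope,             # e.g. "blog posts published 2020-2023"
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "prev_hash": prev_hash,
        }
        entry["entry_hash"] = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()
        ).hexdigest()
        self._entries.append(entry)
        return entry

    def verify_chain(self) -> bool:
        """Recompute hashes to confirm no entry has been altered or removed."""
        prev = "genesis"
        for e in self._entries:
            body = {k: v for k, v in e.items() if k != "entry_hash"}
            if body["prev_hash"] != prev:
                return False
            if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != e["entry_hash"]:
                return False
            prev = e["entry_hash"]
        return True


# Usage: record one consent event and verify the trail is intact.
log = ConsentLog()
log.record_consent("author:jane-doe", "LLM training corpus v3", "blog posts 2020-2023")
assert log.verify_chain()
```

Because each entry embeds the hash of its predecessor, silently removing or rewriting a past consent record breaks verification, which is exactly the property an auditor, or a court, needs.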

Why This Matters

Models trained without explicit consent and traceable lineage don’t just risk legal violations; they erode trust and legitimacy.
When knowledge is taken without recognition and scaled without restraint, we don’t just build flawed systems:

We build systems that silence the very voices they depend on.

In the next section, we explore how these principles become operational: through metadata structures, consent audit trails, and lifecycle governance mechanisms that ensure datasets remain both legally sound and socially accountable.

Bibliography


  1. Business Insider. (2023, October 18). OpenAI destroyed a trove of books used to train AI models. https://www.businessinsider.com/openai-destroyed-ai-training-datasets-lawsuit-authors-books-copyright-2024-5

  2. BBC News. (2025, January 23). LinkedIn used private messages to train AI, report finds. https://www.bbc.com/news/articles/cdxevpzy3yko

  3. Financial Express. (2024, December 20). Italy fines OpenAI €15 million for privacy violations. https://www.financialexpress.com/life/technology-italy-fines-openai-15-million-euros-for-privacy-violations-3696630/

  4. The Verge. (2024, April 6). OpenAI transcribed over a million hours of YouTube videos to train GPT-4. https://www.theverge.com/2024/4/6/24122915/openai-youtube-transcripts-gpt-4-training-data-google