AI Training Data: The Consent Problem Nobody Solved
Most large AI models were trained on massive datasets scraped from the internet without meaningful consent from content creators, website operators, or the individuals whose personal information was included. The Common Crawl dataset underlying many models contains over 250 billion web pages, including copyrighted articles, personal blogs, medical forum posts, and personal conversations that were publicly accessible but never intended as AI training material. Multiple lawsuits from publishers, artists, and authors challenge the legality of this mass appropriation.
Scale of Unconsented Data Use
OpenAI's published description of GPT-3 lists Common Crawl, WebText2, two books corpora, and Wikipedia as training sources; the company has not disclosed the composition of GPT-4's training data. Meta trained the original Llama on a similar mix of publicly available data totaling over a trillion tokens. Google has described Gemini's training data only in general terms, as including web documents, books, and code. None of these companies obtained individual consent from the vast number of content creators whose work was included. The terms of service of most websites do not explicitly authorize AI training, and robots.txt, a voluntary protocol with no enforcement mechanism, has often been ignored or treated as non-binding.
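To make the robots.txt point concrete, the sketch below shows how a well-behaved crawler could check a site's stated preferences before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt contents and the `example.com` URLs are hypothetical, written only for illustration; `GPTBot` and `CCBot` are the user-agent names that OpenAI and Common Crawl publish for their crawlers. Nothing in the protocol forces a crawler to run a check like this, which is why it is only as strong as the crawler's willingness to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: blocks two AI-related crawlers
# site-wide and keeps /private/ off limits to everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "CCBot", "SomeOtherBot"):
        allowed = is_allowed(ROBOTS_TXT, agent, "https://example.com/articles/1")
        print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```

A crawler that skips this check, or ignores its result, faces no technical barrier, so the file expresses a preference rather than enforcing one.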
Legal Challenges
The New York Times, Getty Images, and numerous individual authors and artists have filed lawsuits alleging copyright infringement through unauthorized training. In the United States, the core legal question is whether training AI on copyrighted material qualifies as fair use, and in particular whether that use is sufficiently transformative. No definitive ruling has been issued, but early decisions suggest the answer may depend on whether AI outputs compete with the original works. The EU AI Act requires providers of general-purpose AI (GPAI) models to adopt a policy for complying with EU copyright law and to publish summaries of the content used for training.
Personal Data Implications
Beyond copyright, AI training data includes personal information protected by privacy laws. Medical questions posted on health forums, dating profiles, personal social media posts that were publicly accessible, and personal photographs are all present in training datasets. Under GDPR, processing personal data requires a legal basis such as consent or legitimate interests, and making data publicly accessible does not by itself supply one. No AI company has yet established an uncontested legal basis for processing the personal data of the many individuals included in web-scraped training sets.
Key Findings
- AI models were trained on trillions of tokens scraped from the internet without meaningful consent from content creators
- Multiple major lawsuits challenge whether AI training on copyrighted material is legal under fair use doctrine
- Training datasets contain personal data protected by GDPR with no clear legal basis for processing
Timeline
- Getty Images sues Stability AI for copyright infringement in training
- New York Times sues OpenAI and Microsoft over training data
- EU AI Act GPAI transparency requirements take effect
- First major court ruling on AI training data fair use expected