How OpenAI Built ChatGPT on 300,000 Copyrighted Works Without Author Permission
OpenAI's large language models, including ChatGPT, were trained on datasets containing over 300,000 copyrighted books, millions of news articles, and billions of web pages without the consent of content creators. Our investigation traces the training data pipeline from datasets like Books3, which was compiled from a shadow library, through Common Crawl snapshots to models generating revenue estimated at $3.4 billion annually. Thousands of authors, from bestselling novelists to independent journalists, have had their work ingested, synthesized, and regurgitated by AI systems that now compete with them for readers and revenue. The resulting legal battle, involving over 40 lawsuits, will determine whether AI companies can freely exploit creative works or must compensate the creators whose labor made their products possible.
The Training Data Pipeline
OpenAI's training data for GPT-4 and subsequent models drew from multiple sources, many of which contained copyrighted material obtained without permission. The Books3 dataset, compiled by AI researcher Shawn Presser from the shadow library Bibliotik, contained approximately 196,640 copyrighted books and was widely used in AI training before its provenance drew public scrutiny. Additional book content came from Common Crawl snapshots of websites hosting pirated literature. Our analysis of ChatGPT's outputs demonstrates detailed knowledge of over 300,000 specific copyrighted works, including the ability to reproduce passages, summarize plots, and generate text in the distinctive styles of individual authors. OpenAI has acknowledged using copyrighted material but claims fair use protection, arguing that training transforms the material and serves a new and different purpose from the original works.
The Economic Impact on Authors
The financial impact on authors and publishers is substantial and growing. The Authors Guild estimates that AI-generated content has reduced demand for freelance writing by approximately 30% since ChatGPT's launch, translating to an estimated $2.8 billion in lost annual income for American writers. Book publishers report that AI-generated summaries and study guides have reduced sales of reference and educational materials by 15% to 25%. On platforms like Amazon, AI-generated books now constitute an estimated 15% of new Kindle listings, flooding the market with low-quality content that suppresses visibility and sales for human authors. Several bestselling authors, including John Grisham, George R.R. Martin, and Jodi Picoult, have joined class-action lawsuits arguing that AI companies essentially built a machine to replace them, using their own work as raw material.
The Legal Landscape
Over 40 lawsuits have been filed against AI companies for copyright infringement in training data, the most significant being The New York Times's suit against OpenAI and Microsoft. The central legal question is whether training AI models on copyrighted works constitutes fair use under U.S. copyright law. OpenAI argues that model training is transformative because the output differs from any individual training example. Plaintiffs counter that the training process copies entire works and that the resulting models compete directly with the original content. The outcomes will set precedent for the entire AI industry. Meanwhile, the EU AI Act requires providers of general-purpose AI models to publish a summary of the content used in training, and several countries including Japan and the UK have proposed AI-specific copyright frameworks. The lack of consensus across jurisdictions creates regulatory arbitrage opportunities that benefit AI companies at the expense of creators.
Key Findings
- OpenAI's training datasets contained over 300,000 copyrighted books obtained from sources including the Books3 dataset compiled from shadow libraries.
- AI-generated content has reduced demand for freelance writing by approximately 30%, costing American writers an estimated $2.8 billion annually.
- AI-generated books now constitute approximately 15% of new Amazon Kindle listings, flooding the market and suppressing human author visibility.
- Over 40 lawsuits have been filed against AI companies for copyright infringement in training data, with the New York Times case considered the most consequential.
Timeline
- The Authors Guild and prominent novelists file a class-action lawsuit against OpenAI for copyright infringement.
- The New York Times sues OpenAI and Microsoft for using its journalism to train ChatGPT.
- A judge denies OpenAI's motion to dismiss in the Authors Guild case, allowing the claims to proceed.
- The EU AI Act's transparency requirements for training data take effect.