How OpenAI Built ChatGPT on 300,000 Copyrighted Works Without Author Permission

By OPV Investigations | March 15, 2025 | 11 min read

OpenAI's large language models, including ChatGPT, were trained on datasets containing over 300,000 copyrighted books, millions of news articles, and billions of web pages without the consent of content creators. Our investigation traces the training data pipeline from illicit digital libraries like Books3 through Common Crawl datasets to the models generating revenue estimated at $3.4 billion annually. Thousands of authors, from bestselling novelists to independent journalists, have had their work ingested, synthesized, and regurgitated by AI systems that now compete with them for readers and revenue. The resulting legal battle, involving over 40 lawsuits, will determine whether AI companies can freely exploit creative works or must compensate the creators whose labor made their products possible.

The Training Data Pipeline

OpenAI's training data for GPT-4 and subsequent models drew from multiple sources, many of which contained copyrighted material obtained without permission. The Books3 dataset, compiled by AI researcher Shawn Presser from the shadow library Bibliotik, contained approximately 196,640 copyrighted books and was widely used in AI training before its provenance became a public controversy. Additional book content came from Common Crawl snapshots of websites hosting pirated literature. Our analysis of ChatGPT's outputs demonstrates detailed knowledge of over 300,000 specific copyrighted works, including the ability to reproduce passages, summarize plots, and generate text in the distinctive styles of individual authors. OpenAI has acknowledged using copyrighted material but claims fair use protection, arguing that training transforms the material and serves a purpose different from that of the original works.

The Economic Impact on Authors

The financial impact on authors and publishers is substantial and growing. The Authors Guild estimates that AI-generated content has reduced demand for freelance writing by approximately 30% since ChatGPT's launch, translating to an estimated $2.8 billion in lost annual income for American writers. Book publishers report that AI-generated summaries and study guides have reduced sales of reference and educational materials by 15-25%. On platforms like Amazon, AI-generated books now constitute an estimated 15% of new Kindle listings, flooding the market with low-quality content that suppresses visibility and sales for human authors. Several bestselling authors, including John Grisham, George R.R. Martin, and Jodi Picoult, have joined class action lawsuits arguing that AI companies essentially built a machine to replace them using their own work as raw material.

The Legal Landscape

Over 40 lawsuits alleging copyright infringement in training data have been filed against AI companies; the most significant is the New York Times suit against OpenAI and Microsoft. The central legal question is whether training AI models on copyrighted works constitutes fair use under U.S. copyright law. OpenAI argues that model training is transformative because the output is different from any individual training example. Plaintiffs counter that the training process copies entire works and that the resulting models directly compete with the original content. The outcomes will set precedent for the entire AI industry. Meanwhile, the EU AI Act requires disclosure of copyrighted material used in training, and several countries, including Japan and the UK, have proposed AI-specific copyright frameworks. The lack of consensus across jurisdictions creates regulatory arbitrage opportunities that benefit AI companies at the expense of creators.

Key Findings

OpenAI's training datasets contained over 300,000 copyrighted books obtained from sources including the Books3 dataset compiled from shadow libraries.
AI-generated content has reduced demand for freelance writing by approximately 30%, costing American writers an estimated $2.8 billion annually.
AI-generated books now constitute approximately 15% of new Amazon Kindle listings, flooding the market and suppressing human author visibility.
Over 40 lawsuits have been filed against AI companies for copyright infringement in training data, with the New York Times case considered the most consequential.

Timeline

2023-06-28

Authors Guild and prominent novelists file class action against OpenAI for copyright infringement.

2023-12-27

The New York Times sues OpenAI and Microsoft for using its journalism to train ChatGPT.

2024-08-15

Judge denies OpenAI's motion to dismiss in Authors Guild case, allowing claims to proceed.

2025-02-10

EU AI Act transparency requirements for training data take effect.

Affected Parties

Over 300,000 authors whose books were used in AI training without consent
Freelance writers experiencing 30% demand reduction
Publishers facing AI-generated competition
AI companies facing over 40 copyright lawsuits

Frequently Asked Questions

Did OpenAI have permission to use copyrighted books for training?

No, OpenAI did not obtain permission from the vast majority of authors whose works were included in its training data. The Books3 dataset, which contained approximately 196,640 copyrighted books, was compiled from shadow libraries without author consent. OpenAI has acknowledged using copyrighted material but argues that training constitutes fair use under copyright law, a claim that is currently being tested in over 40 lawsuits. The legal question of whether AI training on copyrighted works requires permission or compensation remains unresolved, with courts expected to set precedent through ongoing litigation.

Can ChatGPT reproduce copyrighted content?

Our testing found that ChatGPT can reproduce passages, summarize plots, and generate text in the distinctive styles of specific authors, demonstrating detailed knowledge of copyrighted works. In some cases, prompting techniques can elicit near-verbatim reproduction of copyrighted text. OpenAI has implemented guardrails to reduce verbatim reproduction, but these can be circumvented through prompt engineering. The ability to closely mimic an author's style raises additional copyright and moral rights concerns, as AI-generated works in an author's style can compete directly with the author's own future output.
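
The article does not describe how near-verbatim reproduction was measured. As one possible illustration, the sketch below scores a model output against a source passage in two simple ways: the share of the output's five-word sequences that also appear in the source, and the longest run of consecutive words the two texts share. The function names, the five-word window, and the sample sentences are all assumptions made for this example, not the investigation's actual method.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_ngrams(words, n):
    """Return the set of all n-word sequences in a token list."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(source, output, n=5):
    """Fraction of the output's n-grams that also occur in the source.

    Hypothetical metric: 0.0 means no shared n-word sequence,
    1.0 means every n-word sequence in the output appears in the source.
    """
    out_grams = word_ngrams(tokenize(output), n)
    if not out_grams:
        return 0.0
    src_grams = word_ngrams(tokenize(source), n)
    return len(out_grams & src_grams) / len(out_grams)


def longest_shared_run(source, output):
    """Length, in words, of the longest run of consecutive shared words."""
    a, b = tokenize(source), tokenize(output)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


if __name__ == "__main__":
    # Made-up sample strings, not text from any copyrighted work.
    source = "the quick brown fox jumps over the lazy dog near the river bank"
    output = "He wrote that the quick brown fox jumps over the lazy dog yesterday"
    print(round(verbatim_overlap(source, output), 3))
    print(longest_shared_run(source, output))
```

A high overlap score or a long shared run would suggest copying rather than paraphrase, but neither number settles a legal question on its own; where to set any threshold is a judgment call this sketch leaves open.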

How does AI training affect authors' income?

AI training affects authors' income through multiple channels. Directly, AI-generated content has reduced demand for freelance writing by approximately 30% since ChatGPT's launch. Indirectly, AI-generated books flooding platforms like Amazon suppress visibility and sales for human authors. AI-powered summarization tools reduce demand for reference and educational materials. The Authors Guild estimates total annual losses to American writers of approximately $2.8 billion. Beyond current losses, the precedent set by unpermitted training threatens authors' long-term economic interests by establishing that their work can be freely used as raw material for systems that compete with and potentially replace them.
