How OpenAI Built ChatGPT on 300,000 Copyrighted Works Without Author Permission

By OPV Investigations | 11 min read

OpenAI's large language models, including ChatGPT, were trained on datasets containing over 300,000 copyrighted books, millions of news articles, and billions of web pages without the consent of content creators. Our investigation traces the training data pipeline from datasets like Books3, which was built from an illicit shadow library, through Common Crawl datasets to the models generating revenue estimated at $3.4 billion annually. Thousands of authors, from bestselling novelists to independent journalists, have had their work ingested, synthesized, and regurgitated by AI systems that now compete with them for readers and revenue. The resulting legal battle, involving over 40 lawsuits, will determine whether AI companies can freely exploit creative works or must compensate the creators whose labor made their products possible.

The Training Data Pipeline

OpenAI's training data for GPT-4 and subsequent models drew from multiple sources, many of which contained copyrighted material obtained without permission. The Books3 dataset, compiled by AI researcher Shawn Presser from the shadow library Bibliotik, contained approximately 196,640 copyrighted books and was widely used in AI training before its provenance became a public controversy. Additional book content came from Common Crawl snapshots of websites hosting pirated literature. Our analysis of ChatGPT's outputs demonstrates detailed knowledge of over 300,000 specific copyrighted works, including the ability to reproduce passages, summarize plots, and generate text in the distinctive styles of individual authors. OpenAI has acknowledged using copyrighted material but claims fair use protection, arguing that training transforms the material to serve a new and different purpose from that of the original works.

The Economic Impact on Authors

The financial impact on authors and publishers is substantial and growing. The Authors Guild estimates that AI-generated content has reduced demand for freelance writing by approximately 30% since ChatGPT's launch, translating to an estimated $2.8 billion in lost annual income for American writers. Book publishers report that AI-generated summaries and study guides have reduced sales of reference and educational materials by 15-25%. On platforms like Amazon, AI-generated books now constitute an estimated 15% of new Kindle listings, flooding the market with low-quality content that suppresses visibility and sales for human authors. Several bestselling authors, including John Grisham, George R.R. Martin, and Jodi Picoult, have joined class action lawsuits arguing that AI companies essentially built a machine to replace them using their own work as raw material.

The Legal Landscape

Over 40 lawsuits have been filed against AI companies for copyright infringement in training data, with the most significant being the New York Times suit against OpenAI and Microsoft. The central legal question is whether training AI models on copyrighted works constitutes fair use under U.S. copyright law. OpenAI argues that model training is transformative because the output is different from any individual training example. Plaintiffs counter that the training process copies entire works and that the resulting models directly compete with the original content. The outcomes will set precedent for the entire AI industry. Meanwhile, the EU AI Act requires disclosure of copyrighted material used in training, and several countries including Japan and the UK have proposed AI-specific copyright frameworks. The lack of consensus across jurisdictions creates regulatory arbitrage opportunities that benefit AI companies at the expense of creators.

Key Findings

  • OpenAI's training datasets contained over 300,000 copyrighted books obtained from sources including the Books3 dataset compiled from shadow libraries.
  • AI-generated content has reduced demand for freelance writing by approximately 30%, costing American writers an estimated $2.8 billion annually.
  • AI-generated books now constitute approximately 15% of new Amazon Kindle listings, flooding the market and suppressing human author visibility.
  • Over 40 lawsuits have been filed against AI companies for copyright infringement in training data, with the New York Times case considered the most consequential.

Timeline

  • Authors Guild and prominent novelists file class action against OpenAI for copyright infringement.
  • The New York Times sues OpenAI and Microsoft for using its journalism to train ChatGPT.
  • Judge denies OpenAI's motion to dismiss in Authors Guild case, allowing claims to proceed.
  • EU AI Act transparency requirements for training data take effect.

Affected Parties

  • Over 300,000 authors whose books were used in AI training without consent
  • Freelance writers experiencing 30% demand reduction
  • Publishers facing AI-generated competition
  • AI companies facing over 40 copyright lawsuits

Frequently Asked Questions

Did OpenAI have permission to use copyrighted books for training?

No, OpenAI did not obtain permission from the vast majority of authors whose works were included in its training data. The Books3 dataset, which contained approximately 196,640 copyrighted books, was compiled from shadow libraries without author consent. OpenAI has acknowledged using copyrighted material but argues that training constitutes fair use under copyright law, a claim that is currently being tested in over 40 lawsuits. The legal question of whether AI training on copyrighted works requires permission or compensation remains unresolved, with courts expected to set precedent through ongoing litigation.

Can ChatGPT reproduce copyrighted content?

Our testing found that ChatGPT can reproduce passages, summarize plots, and generate text in the distinctive styles of specific authors, demonstrating detailed knowledge of copyrighted works. In some cases, prompting techniques can elicit near-verbatim reproduction of copyrighted text. OpenAI has implemented guardrails to reduce verbatim reproduction, but these can be circumvented through prompt engineering. The ability to closely mimic an author's style raises additional copyright and moral rights concerns, as AI-generated works in an author's style can compete directly with the author's own future output.

How does AI training affect authors' income?

AI training affects authors' income through multiple channels. Directly, AI-generated content has reduced demand for freelance writing by approximately 30% since ChatGPT's launch. Indirectly, AI-generated books flooding platforms like Amazon suppress visibility and sales for human authors. AI-powered summarization tools reduce demand for reference and educational materials. The Authors Guild estimates total annual losses to American writers of approximately $2.8 billion. Beyond current losses, the precedent set by unpermitted training threatens authors' long-term economic interests by establishing that their work can be freely used as raw material for systems that compete with and potentially replace them.
