AI Training Copyright: Anthropic Wins Scan-Book Fair Use, Faces Piracy Trial

Landmark AI Ruling: Anthropic Can Scan Purchased Books, But Pirating Is Illegal

In a significant decision for the rapidly evolving field of artificial intelligence and copyright law, a tech-savvy U.S. federal judge has issued a mixed ruling regarding how AI company Anthropic trained its Claude large language model (LLM) using book content. While the court found that scanning legitimately purchased books for training constitutes fair use, it simultaneously ruled that downloading and retaining millions of pirated copies is not protected and opens the door for a high-stakes trial on damages.

The ruling, handed down by Judge William Alsup of the U.S. District Court for the Northern District of California, is seen as a partial victory for Anthropic in a lawsuit brought by authors Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson. The authors alleged that Anthropic illegally used their copyrighted fiction and non-fiction works, which were found within pirated datasets the company utilized.

Scanning Purchased Books: A “Transformative” Fair Use

According to the court documents, Anthropic employed two primary methods to acquire book data for training its AI. One method involved purchasing millions of physical books, many second-hand. The company then digitized this content by cutting up and scanning the pages, destroying the original print copies in the process.

Judge Alsup ruled that this practice qualifies as “fair use” under current U.S. copyright law. His reasoning hinged on the transformative nature of the use: by scanning purchased books and destroying the originals, Anthropic created a format-shifted copy rather than an additional duplicate that could substitute for the original work. The digital copies were then used to train an AI model whose output, while drawing on the grammar, composition, and style learned from vast datasets, does not reproduce any given work’s creative core or an author’s identifiable expressive style. Anthropic welcomed this aspect of the ruling, quoting the court’s description of the training use as “transformative — spectacularly so.”

This finding offers a potential legal pathway for AI developers seeking to train models on print materials, provided they adhere to a rigorous process involving legitimate acquisition and transformation of the original copies.

The Piracy Problem: Millions of Illegally Obtained Books

However, Anthropic’s second method of data acquisition presented a major legal hurdle. The company downloaded over 7 million pirated copies of books from sources known as “pirate libraries,” including the Books3 dataset, Library Genesis (Libgen), and the Pirate Library Mirror (PiLiMi).

Judge Alsup explicitly ruled that the downloading and building of a central digital library from these pirated books was not legally justified by fair use. The court found that Anthropic downloaded these copies and retained them, even acknowledging internal concerns about using pirated material for “legal reasons.” Despite these concerns, the pirated copies were kept, which the judge determined was done for “Anthropic’s pocketbook and convenience.”

Crucially, the judge denied Anthropic’s request for summary judgment on the issue of these pirated copies. This means Anthropic may now face a trial specifically over the use and retention of the 7 million-plus illegally obtained books and the resulting damages.

The High Stakes: Potential Damages Exceeding $5 Billion

The upcoming trial carries significant financial implications for Anthropic. Under U.S. copyright law, statutory damages for infringement start at $750 per work. With an estimated seven million pirated books involved, potential statutory damages could exceed $5 billion even at that minimum rate. The figure looms large against Anthropic’s recent annualized revenue, reported to be around $3 billion. The ruling also clarified that subsequently purchasing a legitimate copy of a book initially obtained through piracy “will not absolve it of liability for the theft,” though it might influence the extent of statutory damages.
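The article’s back-of-the-envelope figure is easy to check. A minimal sketch, assuming the statutory range in 17 U.S.C. § 504(c) — a $750 floor per infringed work and, as an additional point of reference not discussed in the ruling, a $150,000 ceiling for willful infringement — and the seven-million-book count cited in the case:

```python
# Back-of-the-envelope statutory damages check.
# Figures assumed from 17 U.S.C. § 504(c): $750 minimum per infringed work,
# up to $150,000 per work for willful infringement.
works = 7_000_000  # pirated books cited in the ruling

minimum_exposure = works * 750       # floor of the statutory range
willful_ceiling = works * 150_000    # theoretical ceiling if willfulness is found

print(f"Minimum exposure: ${minimum_exposure:,}")   # $5,250,000,000
print(f"Willful ceiling:  ${willful_ceiling:,}")    # $1,050,000,000,000
```

Even the floor, $5.25 billion, exceeds Anthropic’s reported annualized revenue; in practice, awards would also depend on how many works were registered and how a jury weighs the facts.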

Wider Implications for AI Data & Copyright

This split decision by Judge Alsup is seen as highly influential, potentially shaping the legal landscape for other AI companies facing similar copyright challenges. It marks one of the first instances where an AI company’s fair use defense for the training process itself has succeeded in court.

The case highlights the tension between the AI industry’s need to “hoover up” vast datasets – often compiled without explicit consent or notice to copyright holders – and existing copyright and data protection laws. While the court accepted the transformative use argument for legally acquired content, it drew a clear line against the unauthorized acquisition and retention of pirated material.

Many other lawsuits are ongoing across the industry, from authors and journalists to artists and photographers, all grappling with the use of their creative works in training AI models. Cases involving Getty Images vs. Stability AI and artists vs. Midjourney/Stability AI underscore the widespread nature of these legal battles. The secrecy surrounding the specific data used by many commercial AI developers further complicates these issues, making it difficult for creators to know if their work has been used.

The Judge’s Perspective

Judge Alsup is uniquely positioned to rule on such complex tech matters. He is known for his technical expertise, including decades of coding experience, and has presided over landmark tech trials such as the Oracle v. Google fair use case and the Anthony Levandowski trade secrets case. His rulings carry significant weight and are widely respected for their grasp of the underlying technology.

While Anthropic secured a win on the transformative fair use of legitimately scanned books, the cloud of potential liability for the massive trove of pirated data remains. The upcoming trial will further define the boundaries of fair use and copyright protection in the age of powerful, data-hungry AI.

