Landmark AI Ruling: Training on Copyrighted Books is Fair Use


US Judge Sides With AI Company on Copyrighted Training Data

In a significant legal development for the artificial intelligence industry, a United States federal judge has ruled that AI company Anthropic made “fair use” of copyrighted books when utilizing them to train its large language models (LLMs), including its chatbot, Claude.

The ruling by US District Judge William Alsup in San Francisco came in response to a class-action lawsuit filed by a group of authors, including Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson. The authors alleged that Anthropic’s use of their literary works to train its AI models without obtaining consent constituted illegal “large-scale theft.”

Judge Alsup rejected the authors’ central argument regarding the training process itself. He determined that Anthropic’s use of the books to teach its AI system fell within the bounds of US copyright law’s “fair use” doctrine. This doctrine allows limited use of copyrighted materials for purposes such as commentary, criticism, education, or research, particularly when the new use is deemed “transformative.”

What is “Transformative Use” in This Context?

A key factor in Judge Alsup’s decision was the concept of “transformative use.” He accepted Anthropic’s argument that the output generated by the AI after training was “exceedingly transformative.”

Judge Alsup explained his reasoning, stating that “Like any reader aspiring to be a writer, Anthropic’s LLMs trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different.” He viewed the AI’s learning process as analogous to a human studying existing works to develop their own distinct creative abilities, rather than simply copying or replacing the original content. This perspective suggests that the purpose of using the books for training was fundamentally different from the original purpose of the creative works themselves.

A Split Decision: Piracy Claims Remain

While the ruling on the training aspect is seen as a major victory for AI developers who rely on vast datasets, the decision was not entirely favorable to Anthropic.

Judge Alsup also found that Anthropic’s copying and storage of approximately seven million pirated books in a “central library” did infringe author copyrights and did not constitute fair use. He explicitly stated that the company had “no entitlement to use pirated copies for its central library.”

This means that although using the books for training was deemed fair use in this instance, the method by which Anthropic acquired and stored a large portion of that training data was found to infringe the authors' copyrights.

As a result of this split ruling, Anthropic must still face a trial in December specifically focused on the allegations related to the theft and storage of these pirated works. The financial stakes are considerable; based on minimum statutory damages under copyright law ($750 per work), the potential liability for pirating seven million books could exceed $5 billion. This figure looms large when compared to Anthropic’s reported annualized revenue of around $3 billion.

Broader Implications and the Evolving Legal Landscape

This case is being watched closely across the tech and creative industries. It represents one of the first instances where a court has definitively applied the “fair use” doctrine to the core process of training AI models on copyrighted material. It could set a significant precedent for numerous other pending lawsuits against major AI companies like OpenAI, Meta, and Google, which face similar allegations regarding their training data practices.

However, the legal landscape surrounding AI and copyright remains complex and rapidly evolving. This ruling stands in contrast to decisions in other cases, such as Thomson Reuters v. ROSS Intelligence, where another judge rejected a fair use defense for AI training. In that case, the judge found the AI’s use not transformative because the resulting legal research product competed directly with the source material, applying a test from the Supreme Court’s Warhol decision that emphasizes whether the new use serves “substantially the same purpose” as the original.

The differing outcomes in these early cases highlight the ongoing debate about whether AI training should be considered inherently transformative under copyright law, especially when the AI’s ultimate output might compete with the original creators’ work. This legal uncertainty could eventually lead to the issue being considered by higher courts, potentially even the US Supreme Court.

The ruling also underscores the challenge for AI companies in acquiring training data legally. While training might be deemed fair use in some instances, the source and storage of that data are subject to strict copyright rules. This could push AI developers towards exploring licensing agreements with copyright holders as a safer alternative to relying solely on the fair use defense and public or potentially pirated datasets.

The debate over generative AI's impact on creative fields continues fiercely, with observers questioning whether the technology will foster new creativity or enable the mass production of content that undermines human artists. This ruling provides a nuanced perspective, separating the legality of the training process from the legality of the data acquisition and storage.
