AI Training Ruled Fair Use

This week, in Bartz v. Anthropic, Judge Alsup (Northern District of California) ruled that training AI large language models (LLMs) on lawfully acquired works of authorship is fair use.

This is a landmark ruling by a highly respected judge, who also presided over the Oracle v. Google case.

Infringement claims regarding AI come in two basic flavors: that the act of training is infringement, and that the AI producing output similar to the input is infringement. This ruling addresses only the first flavor: the training stage.

Two Acts of Copying

In this case, the defendant purchased copyrighted books, tore off the bindings, scanned every page, and stored them in digitized, searchable files. (This is called destructive scanning, which is faster and easier to do than non-destructive scanning that preserves the original book.) It used selected portions of the resulting database to train various large language models. But Anthropic also downloaded many pirated copies of books, though it later decided not to use them for training. These copies were retained in a digital library for possible future use.

The plaintiffs are authors of some of the books.

Anthropic moved for summary judgment based on fair use, and Alsup found the act of training to be transformative, a key consideration in modern fair use doctrine. Regarding transformation, Alsup cited the Google Books case, one of the key decisions on fair use in the digital age. (Authors Guild v. Google, Inc., 804 F.3d 202, 217 (2d Cir. 2015)).

The Fair Use Analysis

Fair use is analyzed according to four non-exclusive factors set out in 17 USC 107. On the first factor, the court distinguished between the scanning and pirating activities. It called the destructive scanning of the books a “mere format change,” which supported a finding of fair use: the purpose of the copy was to support searchability, and because the originals were destroyed, Anthropic ended up with only the digital copies, not the books.

Before buying the physical books, Anthropic “downloaded over seven million pirated copies of books, paid nothing, and kept these pirated copies…even after deciding it would not use them to train its AI.” The court viewed this differently from the scanning: “Such piracy of otherwise available copies is inherently, irredeemably infringing even if the pirated copies are immediately used for the transformative use and immediately discarded.” The court was not convinced by Anthropic’s argument that the use would ultimately be transformative. Citing the recent Warhol case, the order says, “what a copyist says or thinks or feels matters only to the extent it shows what a copyist in fact does with the work.”

The last of the four factors in a fair use analysis, usually considered the most important, is the effect of the otherwise infringing activity on the market for the original work. The court said, “The copies used to train specific LLMs did not and will not displace demand for copies of Authors’ works, or not in the way that counts under the Copyright Act.” But this was only for the purchased copies; the court reached the opposite conclusion for the pirated copies.

What’s Next?

The case can now proceed to trial only as to the pirated copies. For the purchased books that were destructively scanned, the court resolved the claims in Anthropic’s favor.

This case is a class action, and the motion for class certification is still pending. If the class is not certified, plaintiffs often give up or settle for small amounts. Law firms that specialize in bringing class actions depend on certification of a large class to increase damages and, accordingly, their fees.

There are about 40 pending cases in the US on AI and copyright, and many of them may have suffered a setback with this opinion. Alsup’s opinion is in line with what many copyright commentators (including me) have proposed: that training is lawful if done with lawful access to the training material. The decision of a district court will not bind cases pending in other districts. However, because Alsup is a well-respected jurist, his analysis may persuade other courts to follow suit.

The court did not reach the second flavor of infringement claims, regarding output, because it was not at issue here. But many commentators are skeptical that such claims will succeed against properly trained models. ML models typically do not produce “copies” in the sense intended by copyright law. Claims regarding output may therefore be relegated to trademark, publicity, and trade dress claims, which are outside the ambit of copyright law.

Author: heatherjmeeker

Technology licensing lawyer, drummer

