Data Scraping Opinion Implications for AI and Open Source Copyright Issues

On May 9, 2024, the US District Court for the Northern District of California issued an opinion in the case of X v. Bright Data on the topic of copyright preemption. At first blush, this opinion is important for what it says about the limited ability of social media sites to prevent data scraping via their terms of service. But it also provides some interesting commentary on the more general issue of copyright preemption.

In this case, X sued Bright Data based on violation of the X terms of service, which prohibit “Misuse of the Services,” and specifically, “scraping the Services in any form, for any purpose.” Such terms have long been common in online terms of service, and have become even more common after last year’s rush of machine learning model developers scraping public sites for training data.

The decision was issued by Judge Alsup, who famously ruled for Google on the issue of protectability (or lack thereof) of APIs under copyright law. Based on that opinion, if nothing else, Alsup is known for his sophistication about technology issues; he even learned some Java to opine on that case.

Preemption: State Versus Federal Law

Preemption is a key concept in copyright law: it dictates the interaction between US state law and federal law. Copyright is federal law only; the US Copyright Act of 1976 made this crystal clear in Section 301(a). But under Ninth Circuit case law, legal claims “are not preempted if they fall outside the scope of 301(a)’s express preemption and are not otherwise in conflict with the Act.” Ryan v. Editions Ltd. W., Inc., 786 F.3d 754, 760 (9th Cir. 2015).

The policy reason behind this strong statement of copyright preemption was mainly to prevent individual states from making laws creating their own more restrictive, conflicting versions of copyright law. Copyright law is a balancing act: it allows authors exclusive control of certain activities, like copying and distribution of their works, but that power is balanced against the rights of others to use works of authorship in some ways.

Particularly for works like databases and software, copyright protection has many limitations. In the Oracle v. Google case, all of these doctrines came into play: idea/expression dichotomy, merger, short words and phrases, de minimis–and pivotally, fair use. All these limit the power of the author to control certain uses of their works. More restrictive state law–in the form of contract, unfair competition, and similar theories–threaten to rewrite the balance that federal copyright law represents.

Alsup notes in the decision that there are two clauses to Section 301(a)–scope and conflict. State law claims can survive preemption if they deal with something outside the scope of copyright law, such as overloading servers or use of name and likeness. But the statute also refers to conflict. “Although conflict preemption has played second fiddle to express preemption in the caselaw as of late, it is the more appropriate consideration when … enforcement of state law undermines federal copyright law.” Therefore, even if the state law claim is not within the scope of copyright law, conflict preemption can exist when enforcing the contract would be “an obstacle to the accomplishment and execution of the full purposes and objectives of Congress.” Crosby v. National Foreign Trade Council, 530 U.S. 363 (2000), at 373.

Alsup also emphasizes that conflict preemption is particularly important when the contract to be enforced is a standard form contract, as opposed to a contract negotiated between two parties. This kind of one-to-many relationship looks closer to what copyright law was intended to govern, whereas one-to-one contracts allow parties to negotiate a different balance of rights if they desire to do so.

The opinion goes on to list three ways that enforcing a contract prohibiting scraping data would undermine the policy of copyright.

  • Copyright empowers copyright owners to exclude others from reproducing, adapting, distributing, and displaying their copyrighted works. But X did not own the copyright to the user-generated content (UGC) on its site; its terms of service, unsurprisingly, only grant X a non-exclusive license. “X Corp.’s state-law claims based on scraping and selling of data would empower X Corp., as a non-exclusive licensee, to exclude others from reproducing, adapting, distributing, and displaying X users’ copyrighted content”—even though X users licensed their copyrighted content to X to make it freely available. Enforcing the contract would take the power to enforce the copyright away from its true owners.
  • Similarly, enforcing the contract would interfere with the copyright doctrine of fair use, which grants everyone the right to use copyrightable works in ways that encourage creativity and other policy benefits.
  • Last, enforcing the terms would upset the balance of copyright law, which is a “scheme of carefully balanced property rights that give authors and their publishers sufficient inducements to produce and disseminate original creative works and, at the same time, allow others to draw on these works in their own creative and educational activities.” Goldstein on Copyright § 1.14 (3d ed. 2023).

Accordingly, the court ruled that X’s claims under its terms of service were preempted by copyright law, to the extent based on scraping of data.

Sauce for the ML is Sauce for the OSS

This case could have significant implications for the tech world. First, it could create opportunities for those training AI models to scrape content from websites, regardless of contrary prohibitions in the sites’ terms of service. Of course, site operators with treasure troves of data will still have an advantage over scrapers. Even if site owners cannot entirely prevent others from scraping UGC, they can sell preferential access to their APIs. Moreover, the reasoning of the decision might not hold for sites whose content is not primarily UGC.

But second, it could have implications for open source enforcement. For decades, enforcement of open source licenses has been prosecuted under copyright law. The pending SFC v. Vizio case is an attempt to avoid this avenue and bring an action under contract law. In that case, Software Freedom Conservancy brought an action for violation of GPL based on a pure contract theory, seeking specific performance of the contract (i.e. an order to release source code) and not seeking any damages or other copyright remedies. Specific performance is primarily a contract remedy–and a rare one at that–and is nearly unheard of under copyright law. The defendant moved to remove the action to federal court based on copyright subject matter and preemption, but lost that battle. The claim was bounced back to state court, where it currently awaits trial.

The Vizio case, like the X case, is in the 9th Circuit. Like terms of service, open source licenses are one-to-many arrangements, and as in the X case, the plaintiff is not the author. Alsup’s shift of focus to conflict preemption could provide a basis for appeal, or otherwise influence the outcome of that case.

Apple Releases an Almost Open Source AI Model

This week, Apple released an SLM (small language model) called OpenELM, which was touted as open source. It did so under a license that got very close to meeting the Open Source Definition–with the caveat that no such official definition exists yet for AI models.

The license says,

Apple grants you a personal, non-exclusive license, under Apple’s copyrights in this original Apple software (the “Apple Software”), to use, reproduce, modify and redistribute the Apple Software, with or without modifications, in source and/or binary forms…

Except as expressly stated in this notice, no other rights or licenses, express or implied, are granted by Apple herein, including but not limited to any patent rights that may be infringed by your derivative works or by other works in which the Apple Software may be incorporated.

So close, and yet, no cigar. The reservation of patent rights in “derivative works” and the grant under “copyrights” apparently seeks to reserve Apple’s ability to sue for patent infringement for use of the model–or perhaps only for changes to the model. But in any case, it doesn’t grant the full rights necessary to be open source.

This is an unfortunate near-miss. It’s not clear that there actually could be any patent rights embodied in or necessary to use the model itself, given the current state of the law in the US, which does not allow patenting of inventions created automatically without a human inventor (which probably includes any machine learning model). Also, it seems unlikely that Apple actually intends to sue anyone for patent infringement for using this model, or for modifying it (at least to the extent any inventions were already embodied in the model). So this is probably a case of cautious drafting for an issue that does not really exist.

But at least it’s not RAIL.

French Court Issues Damages Award for Violation of GPL

On February 14, 2024, the Court of Appeal of Paris issued an order stating that Orange, a major French telecom provider, had infringed the copyright of Entr’Ouvert’s Lasso software and violated the GPL, ordering Orange to pay €500,000 in compensatory damages and €150,000 for moral damages.

This case has been ongoing for many years.

Entr’Ouvert is the publisher of Lasso, a reference library for the Security Assertion Markup Language (SAML) protocol, an open standard for identity providers to authenticate users and pass authentication tokens to online services. This is the open protocol that enables single sign-on (SSO). The Lasso product is dual licensed by Entr’Ouvert under GPL or commercial licenses.
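For the technically curious, the first step of a SAML SSO flow can be sketched in a few lines. In SAML’s HTTP-Redirect binding, the service provider sends the user’s browser to the identity provider carrying a deflated, base64-encoded authentication request. This is only a minimal illustration of the binding with made-up endpoints (sp.example.com, idp.example.com) and a toy request–it is not Lasso’s actual API, and real requests carry signatures and more attributes:

```python
import base64
import zlib
from urllib.parse import urlencode

# A toy SAML AuthnRequest (real ones are signed and carry more attributes).
authn_request = (
    '<samlp:AuthnRequest xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol" '
    'ID="_demo" Version="2.0" IssueInstant="2024-01-01T00:00:00Z" '
    'AssertionConsumerServiceURL="https://sp.example.com/acs"/>'
)

# HTTP-Redirect binding: raw DEFLATE, then base64, then URL-encode.
compressor = zlib.compressobj(9, zlib.DEFLATED, -15)  # -15 = raw deflate, no zlib header
deflated = compressor.compress(authn_request.encode()) + compressor.flush()
saml_request = base64.b64encode(deflated).decode()

# The browser is redirected to the identity provider with this URL.
redirect_url = "https://idp.example.com/sso?" + urlencode({"SAMLRequest": saml_request})
```

The identity provider reverses the encoding, authenticates the user, and posts a signed assertion back to the service provider’s AssertionConsumerServiceURL.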

In 2005, Orange won a contract with the French Agency for the Development of Electronic Administration to develop parts of the service-public.fr portal, which allows users to interact online with the government for administrative procedures. Orange used the Lasso software in the solution, but did not pass on the rights to its modifications free of charge under GPL, or make the source code to its modifications available.

Entr’Ouvert sued Orange in 2010, and the case wended its way through the courts, turning on, among other things, issues of proof of Entr’Ouvert’s copyright interest in the software, and whether the case properly sounded in breach of contract or copyright infringement.

On March 19, 2021, the Court of Appeal first rejected Entr’Ouvert’s claims for copyright infringement, saying that the case was a breach of contract claim. The Court of Cassation, which is the supreme court of France, reviewed the case and issued an order on October 5, 2022, overturning the decision of the Court of Appeal. The case was then remanded to the Court of Appeal, which issued its order this week.

The compensatory damages were based on both lost profits of the plaintiff and disgorgement of profits of Orange. Moral damages compensate the plaintiff for harm to reputation or other non-monetary injury.

Note: I patched this information together from various articles, mostly read in translation. The dates in the lawsuit process were inconsistent in those sources. If I find any errors, I will update this post.

Update 2/22/24: Here is the decision in French.

Top 10 Software Events of 2023

The days are getting shorter, the shopping (physical and virtual) is ramping up, Mariah Carey is playing everywhere on loudspeakers, and that means…it’s time for top ten lists. Here are my most memorable software events of 2023. I offer only six, given that the government takes 40% of whatever I make.

Red Hat’s License Change That Wasn’t. In June 2023, Red Hat continued on a course to try to limit access to its RHEL software distro. But of course it can’t do that, because…GPL. Red Hat changed its customer agreement, which was widely reported as prohibiting distribution of software–though that was not quite accurate. Many large commercial open source companies that get bought (as Red Hat was, by IBM in 2019) try to increase profit margins by limiting access to open source. They imagine they can convert the downstream “freeloaders” to paying customers. That kind of scheme usually doesn’t work to increase profits, or even sales. But it does work to alienate the community! Red Hat’s unpopular changes to CentOS enabled alternative distros like Rocky Linux. See my video here.

The Unity Meltdown. In September, 2023, Unity made a change to its pricing model, resulting in death threats and a general indie developer outcry. Then the CEO left, then they fired a bunch of devs. Unity still remains one of the two big dogs in the game engine space, but these missteps pave the way for open source alternatives like Godot. See my video here.

OpenAI’s Revival. November, 2023. I seem to remember this story about a guy who died and rose three days later. But Sam Altman needed about five. Maybe the less said about this debacle the better, given it’s dominated the news ever since Thanksgiving. One day, the entire tech world woke up to realize that its genAI darling unicorn, OpenAI, is really a non-profit run by AI-doomers (or guardians of humanity, depending on your viewpoint). Microsoft nearly accomplished the world’s biggest reverse-acquihire, but then the AI-doomers were kicked off the board. Sam and the AI-boomers will take the company forward, but OpenAI is still saddled with a weird corporate structure, and a lot of technical debt. I’m looking forward to seeing other companies eat OpenAI’s lunch in 2024. See my video here.

17 Lawsuits About GenAI. Good heavens, it’s exhausting just to list them, much less explain how nonsensical most of them are. Most of these will fail, if the judges actually follow existing copyright law. If content generators want to prohibit machines from reading their books/music/pictures, they need to get Congress to change the law. Pro tip: when it comes to copyright, Congress usually does whatever the media industry wants.

Feds’ Continuing Vendetta Against Tech (Including Crypto). The US government is still going after crypto. SBF of FTX was convicted. CZ of Binance pled guilty to money laundering. The SEC threatened to sue Coinbase unless it stopped trading all crypto other than Bitcoin. (Coinbase declined.) But there’s more. The Feds continue to file (mostly unsuccessful) lawsuits against tech giants claiming novel antitrust theories. See my videos on SBF and antitrust here.

What Didn’t Happen: Open Source AI. We still don’t have a definition of open source AI, and efforts to define it are stalled by the disarray of the open source community. Meanwhile, OpenAI and others are trying to set a narrative that openness is not necessary. Now *that’s* scary. (This was the subject of my TED Talk in September, but it’s still not published yet. I will update this blog post when (or if) it comes out. An article with similar substance is here.)

Happy new year, everyone!

The video for this post is here.

OpenAI–What Happened?

This week, the tech world was shocked by the sudden and unexplained firing of Sam Altman, the now-former CEO of OpenAI.

OpenAI has been the darling of the tech industry for the last year. Its current fundraising goals would put it on track to be one of the biggest unicorns in the US. But here are a few reasons why OpenAI probably can’t live up to its own hype. The events of this past weekend only underscore them.

My video on this topic is here.

Technical Debt, or at Least Technical Debt-for-Equity

OpenAI made a big splash by releasing ChatGPT in late 2022, and followed with updates this year. But OpenAI did not invent the core transformer technology behind GPT–the transformer architecture originally came from Google. That’s why so many companies have been training up their own models lately. The barriers to entry in LLM development right now aren’t about technology, they are about resources. OpenAI trained GPT-3 and GPT-4 using a lot of data, a lot of money and a lot of compute cycles.

The problem with this is that ChatGPT has a thin first mover advantage. Each iteration of a model takes tons of resources to produce. If OpenAI continues to draft on existing tech, and keeps building new models, it will be in a never-ending race that will require immense capital and resources, without significant economies of scale. That’s not sustainable.

Lack of Transparency and Weird Organization

The spat between the company’s board and its investors is very weird indeed. In most companies, the investors control the board. Boards and investors usually don’t have spats, particularly not public ones. But OpenAI is no ordinary company.

OpenAI’s parent entity is a non-profit called OpenAI, Inc., with a for-profit subsidiary called OpenAI Global, LLC. That is a bizarre structure for a tech startup. Non-profits don’t have shareholders; they have a board of directors. In contrast, the board of directors of a for-profit company is elected by its shareholders. The structure of control of an LLC is what we lawyers call “flexible”–which means opaque and idiosyncratic. But it is usually run more like a for-profit corporation.

To see how this odd structure happened, you need to read between the lines. In 2015, the original contributors to the non-profit pledged about $1 billion to the project, but many did not fulfill their pledges. So, in 2019, OpenAI transitioned from a non-profit to a “capped for-profit” by creating a subsidiary LLC with shareholder profits capped at 100 times any investment. Just for reference, a “capped for-profit” is not really a thing in corporate law, but with an LLC, anything goes. OpenAI then got a $1 billion investment from Microsoft into the for-profit subsidiary, along with a deal for access to Microsoft’s cloud computing services. It then announced its intention to commercialize its products. But the for-profit entity is controlled by the non-profit.

This transition troubled initial contributors, and generated criticism. It’s kind of a joke in the tech business today that OpenAI as a company name is beyond ironic. It’s not open at all. In fact, open source advocates have generally viewed Altman and OpenAI as trying to set a narrative that avoids transparency in AI. They are pushing for regulation, to forestall demands for transparency. OpenAI has justified its move away from transparency due to its need to compete–but that’s circular logic.

Good Old-Fashioned Over-Valuation

OpenAI’s latest funding round is set to value the company at $80 billion. That’s bigger than the market caps of lots of existing public companies, including Boston Scientific and Mercedes-Benz. Could it be worth that much?

It’s probably fair to say that OpenAI is the only company making significant money on large language models at the moment. (Though GitHub’s Copilot is in the running, too, it’s currently based on ChatGPT.) Most companies that have released LLMs have not monetized them directly. In fact, LLMs are probably difficult to monetize at a rate that will be profitable over time–at least based on the current data- and resource-guzzling tech.

In this year of tech business malaise, AI has been the only bright spot. The conventional wisdom for 2023 is that no company can raise money, except AI startups, which are swimming in investment dollars. As usual, when private investors jump on a bandwagon, they tend to fund some terrible businesses. OpenAI is the superstar of startups today, but superstars have not fared well in recent years, tending to burn out or fade away.