X Sues Over Open Source “Exfiltration”

This week, a new lawsuit cropped up relating to corporate open source releases. 

X Corp. v. Yao Yue and IOP Systems, N.D. Cal., filed 12/4/2025

X, the-micro-blogging-service-formerly-known-as-Twitter, sued Yao Yue, a former engineer at X, alleging theft of proprietary source code. The facts sprang from the fraught events surrounding Elon Musk’s purchase of the company in 2022. The complaint linked above sets out the allegations in detail.

“Yue had been making repeated requests to open-source certain X Corp. data, as well as made comments that she had exfiltrated X Corp. source code to benefit her new company. …[A]fter Yue’s termination, Yue began contacting [X Director of Performance Engineering] Ms. Strong, asking her to push through a project that had been underway at the time of the Musk Acquisition. The aim of the project was to open-source certain data logs, with the purported goal of educating the broader technology community as to the performance of systems in a company of X Corp.’s scale. Prior to the Musk Acquisition, the project was going through the normal process for approval.  Yue wanted Ms. Strong to “nudge” the project along, claiming that it was simply waiting for someone to sign off on the open-source designation.”

Later, Yue “bragged about how she and other former X Corp. colleagues had exfiltrated X Corp. source code needed to start their own venture, IOP Systems.” 

The complaint alleges that an article in The Verge quoted Yue as an anonymous source, and that “after her termination, Yue used the service elevator to sneak into X Corp.’s San Francisco, CA office and purportedly gather personal belongings,” but used the opportunity to “exfiltrate to a USB drive 6 million lines of X Corp.’s proprietary and confidential source code from her company-issued laptop.”

The software or data in question was developed by X’s Redbird group, which focuses on infrastructure technology.

An academic paper co-authored by Yue described a tool, “LatenSeer,” which can “predict end-to-end latency of a complex internet platform.” The paper stated that “LatenSeer is open-sourced at: https://github.com/yazhuo/LatenSeer, and the Twitter traces will be released upon legal approval.”

As of this writing, the repository is still online–which is interesting, because if the code is infringing, I would expect a DMCA takedown request to have been issued. I checked GitHub’s repository of takedown requests and did not find anything about this repository.

The Open Source Skunkworks

This case points up a trend that has troubled technology companies for some time: employees push their company to release software under open source licenses, or release the software without company authority, then quit, create a new company, and use the released software to build competing products.

This happened most spectacularly with NGINX–or at least, that is what a 2020 lawsuit claimed. For some details, see my post here: https://heathermeeker.com/2020/07/23/lawsuit-alleges-nginx-conspiracy/. The lawsuit was later dismissed, but its complaint told quite a tale, so I recommend reading the original complaint, if only for entertainment value.

The skunkworks problem illustrates why it is key to have a process for corporate open source releases. Here, because X had an approval process, there is less likelihood of a dispute over whether the open source release was actually authorized. Companies without formal policies can end up in finger-pointing disputes, with potential defendants claiming they had the right to use the software because it was released under an open source license, and the company claiming it did not authorize the release.

The X lawsuit claims misappropriation of trade secrets, and related claims like violation of the California Comprehensive Computer Data Access and Fraud Act, and unfair competition–but not, notably, copyright infringement. So it is not clear whether the LatenSeer code is alleged to be infringing.

What Happens Next

The allegations in the complaint are only the plaintiff’s version of events at this point, not proven. The defendants will likely respond by denying the allegations, and the suit will plug along the way lawsuits do.

Investing in the Red Zone: Commercial Open Source and the Bear Market

Note: This is an article from 2020 whose original link has broken. I’ve posted it here for continuity.

When business is on an upward trajectory, investing is not so hard. After all, the stock market always goes up over the long term. But it takes more work to identify good investments during down markets. Fortunately, commercial open source software (COSS) businesses can be a great investment during bad times.

There is plenty of anecdotal evidence for this premise. First, the demand curve. Linux was getting popular before 2001, but its popularity skyrocketed during the Internet bust. That shows the power of COSS on the demand side, and it fits a classic demand curve analysis. When times are bad, and profits are down, buyers turn to lower cost goods. COSS is often developed as a substitute for more costly proprietary software. IT managers tasked with cutting budgets turned to it in earnest beginning with the downturn of 2001.

But the economic profile of COSS is also about the supply side. In difficult times, the companies that use capital most efficiently survive. COSS companies are fundamentally more efficient at running on less capital, and that makes COSS one of the most interesting investments in a down market.

COSS companies leverage capital efficiently for many reasons. First, consider the people who write open source software (OSS). Today, even though OSS is heavily underwritten by industry, an OSS project is still usually the brainchild of an individual or small team who came up with the idea on their own and had no top-down direction to create it. So, the roots of most projects are still in the garage. That means the labor to make the initial development sprint is usually a volunteer effort. This can play out in different ways. Perhaps an engineer starts a side project while employed doing something else. Perhaps an engineer expends time during slow periods, or while between jobs, to create a resume trail, network, or prepare for the next opportunity. The cost here is the engineer’s sweat equity–definitely not a zero cost. But it is, undeniably, an efficient cost. No expensive office lease, no free lunches, no swag. Just work.

Once a project is underway, it gets a slew of free marketing advice from adopters who vote with their feet. Downloads are not dollars, but downloads can tell you a lot. Is the project going in the right direction to meet market needs? Is it reliable? Is it structured correctly? Are its goals and its value properly communicated to the community? As they say, criticism is a gift. All this feedback is a trial by fire. Any project that comes out on the other end of its initial pipeline alive has been road tested in the most ruthless way imaginable.

There are also efficiencies in ongoing maintenance, but this can be a red herring. Lots of people focus on this most obvious benefit of COSS–that the world is your maintenance and support team. But that is the bean-counter viewpoint, and it is also misleading; the real efficiency has more to do with the costs of a mature project versus a nascent one. In truth, most OSS projects are primarily maintained by their core committer team, and the value of community input is primarily in feedback: bug reporting, feature requests, and evaluation. In fact, projects like Linux that have a wide and active community of committers vying for PRs are the exception, not the rule. So if you find one of those, it’s probably a great investment. But until that unicorn comes along, there are plenty of projects with great potential that don’t “outsource” their support to the community.

If we take all the above as a given, a COSS company makes ridiculously efficient use of resources in its early stages. Now, suppose you are an investor looking for your highest long-term multiple. Consider that a COSS company does not usually get formed on day one of this process. It usually gets formed after all this initial honing has taken place. So, if you have a choice between investing in a fledgling COSS company, and a proprietary company, that choice is simple. The proprietary company will be using your capital for initial development, feature definition, and road testing–not to mention the financing roadshow. The COSS company already has a foothold on its product and market. So, at a minimum, investing in COSS companies takes place at a better inflection point than for fully proprietary companies.

But that is theory, and now we have to road test the hypothesis: do COSS companies survive downturns well? The analysis below suggests that the answer is a resounding yes.

To investigate this proposition, I looked at approximately 50 COSS companies. These included companies with notable exits and companies with notable ongoing businesses.

Then I identified the major down markets of the last 30 years. My working assumption was that development would have started in the 12 months prior to first release. So, I included in the RED ZONE companies that released in, or within one year after, a down market.

The RED ZONE shows companies whose initial development took place during one of the downturns identified below.

The results speak for themselves–most of the biggest COSS companies were built on software developed during a downturn–even if we eliminate the Linux distros. Of the companies considered, over 50% were started during a recession. The table appears below.

So, I am excited for whatever comes next. If the market is great, there will be lots of winners. If not, I will be following the winners.

The exceptions are also notable: the wave of acquisitions by Oracle in the 1990s and 2000s, and Kubernetes, Docker, and several Apache projects during the past seven years, well after the recession of 2008.

Now, to be scientific, I would have to also pick a control group, but I haven’t done that. To be candid, this is like one of those clinical trials where they stop the control group for moral reasons once the initial data comes in. I don’t need to know if proprietary companies would do as well–I don’t think that’s likely. But even if they did, the more efficient capital use by COSS companies would still tip the balance. Capital, thoughtfully applied to COSS businesses in bad times, is a big countercyclical advantage.

A note on methodology: The selection of 50 companies was to some degree arbitrary, in the sense that I applied a few rules to identify COSS companies according to my definition. I did not include companies like Facebook or Google, which contribute greatly to open source development, but whose primary business is not open source development. (In other words, primarily proprietary companies with open source activities, rather than vice-versa. For more on this distinction see www.chinstrap.community.) I excluded cryptocurrencies (because their products are not primarily based on providing software), companies that sold only proprietary versions of open source software written by others, and a few others whose business was too complex to map–positively or negatively–to the analysis. Other methodology notes appear in the table.

Is AI the re-Democratization of the Web?

For a few years now, the news has been full of prognosticators screeching about the dangers of AI. And while some of it is potentially concerning, we all know that the news tends to lean into the catastrophic. So, I’ve been thinking about one aspect of the advent of AI that might actually be great – at least for the time being.

Once upon a time, the web was a level playing field. I remember my delight in being able to use algorithmic search results. In those results, even small webpages sometimes came up before big ones.

Then the commercialization of search started–and never stopped.

Don’t get me wrong, there were some things about the commercialization of search that were great. The theory was that people who were willing to pay to show search results typically had more resources and therefore offered better products or more interesting information. And those who complain about targeted ads have surely forgotten the early days where every ad was for Viagra.

Once Upon a Query

For a while, search engines like Google clearly separated algorithmic and paid search results–whereas some search engines leaned more heavily into paid results without identifying them as paid. And each of us used the search engine that fit our needs best. I was an AltaVista fan until it got acquired by Yahoo and mothballed. AltaVista was the algorithmic search engine beloved by nerds everywhere.

But eventually, paid search took over the web experience. These days, you can’t even search for information about hotels without getting an entire page of results from aggregators–so much so that the official sites of the hoteliers are actually hard to find. And don’t get me started about trying to file government documents; the actual government sites are buried in a slew of ads by charlatans who want to charge you money to file something that is usually just as easy to file yourself.

For Now, AI is Better

Now, recently, we’ve seen some hue and cry in the press about AI taking over search. Let me remind you that, a few years ago, the same hue and cry was about videos taking over search. All these articles seemed to imply that anything taking over search was a danger, because (reading between the lines) search yielded up purer, more factual, or less brain-rot results. These articles bemoaned that the golden days of search were over, and possibly that Google’s ad-related business model was doomed–though given the Google-hating so common in media, it wasn’t clear why that was supposed to be a cause for alarm.

Recently, OpenAI announced a browser called Atlas. Again, the alarm bells sounded for the death of search.

Then I started thinking, is that really a bad thing? When I ask AI a question, the AI answers based on what it knows. And mostly, it knows facts, not the potential for ad revenue. I also get web links as references in the answer. Those references seem to be more like the old days of search, where information took precedence over advertising.

Here’s an example: I searched for a flight to Samarkand. With Google Search, the entire first page was paid results. It found Turkish Air, which was good, but the first hit was Delta.

Now, Delta and Lufthansa are not the best way to fly anywhere, in my experience, but guess what? Delta–the top result–apparently doesn’t even go there.

Meanwhile, Claude gave me a lot of useful information. But even AI is at the mercy of what is on the web, so it pointed me to an aggregator instead of an airline.

And so, exactly who is surprised that AI is replacing search? I mean, AI is helpful, but the problem is that search is broken. 

Waiting for the Other Shoe to Drop

Now the question is: where will the search ads go? What will be the next business initiative to divert my attention from what I want to see, to what advertisers want me to see? Ads aren’t in AI results yet, because the AI providers are getting paid by users for access to their models. In that sense, Google search is more like the old over-the-airwaves TV model: the service is free, but the ads pay for it. Now, for AI, we seem to be in the equivalent of the early streaming days: pay for the service, but no ads. But we all know what happens next: pay for the service, and see ads as well.

Meanwhile, let’s enjoy this time, which we might later look back on as a golden age of ad-free AI search results.

Amicus Brief in Thomson Reuters v. Ross

I am excited to announce that I filed an amicus brief in this case (about which I wrote a while ago). The case is on interlocutory appeal to the Third Circuit on topics of protectability of legal headnotes under copyright, and fair use of legal headnotes in AI training. My brief is focused on protectability.

On a personal note: This is the first time I’ve ever filed an amicus brief–or any brief–and the process was a learning experience for me. Writing the argument was fun, but for me, that was only the beginning. It was truly a case of the 90/10 rule: the writing was the easy 10 percent, and the mechanics of filing were the other 90. In the end, with the help of the excellent team at Counsel Press, I was able to get it filed.

I look forward to the court’s eventual decision on this case.

Anthropic Settling AI Class Action

Of all the many pending lawsuits about AI and copyright, the Anthropic class action has been blazing trails in the US courts. The case is still not precisely over, but apparently heading toward settlement.

Update as of 9/5/25: Under the proposed settlement, Anthropic will pay about $3,000 for each of about 500,000 books used from pirate sites, for a total of at least $1.5 billion. “All works in the Class are treated the same in this settlement, entitled to the same pro-rata amount of the Settlement Fund, reflective of the per-work statutory damages remedy authorized by the Copyright Act itself. The allocation for each Class Work will be calculated by dividing the total amount of the Settlement Fund (less fees and expenses) by the total number of Class Works.”
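
As a rough check on the arithmetic, using the reported figures (about 500,000 class works and a fund of at least $1.5 billion), the quoted allocation formula works out to:

    \text{per-work allocation} = \frac{\text{Settlement Fund} - \text{fees and expenses}}{\text{number of Class Works}} \approx \frac{\$1{,}500{,}000{,}000}{500{,}000} = \$3{,}000

per work, before fees and expenses are deducted.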

And in case you were wondering, this summary was done almost entirely with Claude, Anthropic’s LLM, with minimal editing. So…

Don't believe me, just watch!

Background

Case: Bartz v. Anthropic PBC, Case No. 3:24-cv-05417 (N.D. Cal.)

Court Docket: CourtListener.com

Plaintiffs: Three named plaintiffs – Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson – filed a class action lawsuit against Anthropic.

The Training: Anthropic downloaded over seven million books from pirate sites and digitized millions of purchased print books to build a “central library” of “all the books in the world” to support the training of its large language models. Specifically:

  • Anthropic used millions of copyrighted books to train its Claude LLMs for use with its AI services capable of generating writings that mimic the writing style of humans.
  • Millions of books were downloaded from shadow library sites like Pirate Library Mirror and Library Genesis and stored in a central repository that Anthropic employees could access for model training and internal research.
  • The Court relied in part on an internal Anthropic email in which an employee was tasked with obtaining “all the books in the world” while avoiding as much “legal/practice/business slog” as possible.

Legal Claims

The plaintiffs claimed Anthropic infringed their copyrights by (1) pirating copies of their works for Anthropic’s library and (2) reproducing their works to train Anthropic’s LLMs. The authors argued that use of their books to train Anthropic’s LLMs could result in the production of works that compete with, and displace demand for, their books, and that Anthropic’s unauthorized use has the potential to displace an emerging market for licensing the plaintiffs’ works for the purpose of training LLMs.

Procedural Facts

  • Filed: August 19, 2024
  • Complaint: Bartz et al. v. Anthropic PBC – 3:24-cv-05417
  • Judge: U.S. District Judge William Alsup of the Northern District of California
  • Motion: Anthropic moved for summary judgment on an asserted defense of fair use. Judge Alsup issued a mixed decision on June 23, 2025, granting summary judgment on some issues while denying it on others.
  • Class Action Status: The case was certified as a class action on July 17, 2025.
  • Interlocutory Appeal: Anthropic filed a Rule 23(f) petition seeking interlocutory appeal of Judge Alsup’s class certification in Bartz v. Anthropic.
  • Motion to Stay: Anthropic moved to stay the case pending its Rule 23(f) petition for interlocutory appeal of the class certification. Judge Alsup denied Anthropic’s request to stay the case on August 11, 2025.
  • Notice of Settlement and Joint Stipulation for Stay was filed August 25, 2025, indicating the parties were close to a settlement.
  • Order re: Settlement. The case is stayed for the parties to file a settlement by September 5, 2025.

The June 23 Summary Judgment Order:

Granted Summary Judgment (Fair Use Found):

  • Training LLMs: The court concluded that use of the books at issue to train Anthropic’s LLMs was “exceedingly transformative” and a fair use under Section 107 of the Copyright Act. Judge Alsup wrote that the “purpose and character of using works to train LLMs was transformative – spectacularly so” and described it as “quintessentially transformative.”
  • Digitizing Purchased Books: The Court concluded this was fair use because the new digital copies were not redistributed, but rather, simply, convenient space-saving replacements of the discarded print copies.

Denied Summary Judgment (Not Fair Use):

  • Pirated Books: The court found that downloading and copying pirated books for its library was not fair use. Because Anthropic never paid for the pirated copies, the court thought it was clear the pirated copies displaced demand for the authors’ works, copy for copy.

Fair Use Analysis

  • Factor 1 (Purpose/Character): The court noted that authors cannot exclude others from using their works to learn. It noted that, for centuries, people have read and re-read books, and that the training was for the purpose of creating something different, not to supplant the works.
  • Factor 4 (Market Effect): The Court found that the copies used by Anthropic to train LLMs did not (and will not) displace demand for the authors’ works. The court dismissed concerns by analogizing to complaining that “training schoolchildren to write well would result in an explosion of competing works.”
  • Partial Victory: Upon weighing all the fair use factors, the Court granted Anthropic’s summary judgment motion for fair use as to the training of LLMs and the digitization (format change) of legally purchased works. The Court, however, denied summary judgment relating to pirated copies and ordered a trial on that issue and any related damages.
  • Trial Scheduled: The court wrote, “We will have a trial on the pirated copies used to create Anthropic’s central library and the resulting damages, actual or statutory (including for willfulness).” Trial was scheduled for December 2025.
  • Potential Damages: Depending on how many titles were involved, Anthropic’s potential liability could reach into the billions.

PHP License Metamorphoses to BSD

The PHP project announced it is moving to a new license.

PHP is a scripting language used for web development. It can be embedded within HTML and used to create dynamic web pages. It is the “P” in the LAMP stack (Linux, Apache, MySQL and PHP)–although some people use the P to refer to Python or Perl. The Zend Engine is the core of PHP.

For years, PHP as a whole has been offered under the PHP License and the Zend Engine License–both permissive licenses. The PHP License is OSI approved, but the Zend Engine License is not. The Zend Engine License has specific naming restrictions related to “Zend” and “Zend Engine,” sometimes referred to as advertising clauses or attribution clauses. Such restrictions were common in early permissive licenses like Apache 1.0, but have since been deprecated by the open source community and do not appear in most recent permissive licenses.

License changes can be a challenge, because unless a project uses a contributor license agreement (CLA), it must get permission from all contributors to change the license for their contributions. In major projects, this has been done a few times, such as the Wikipedia migration and the OpenSSL change, but it is a big undertaking that requires broad socialization and carries the risk that a contributor will object. These changes usually take place with popular projects whose licenses are outdated, ad hoc, or confusing.

But PHP has found a neat trick to avoid having to get permission from every contributor. Like many open source licenses, the PHP license allows the license steward to issue new versions.

  5. The PHP Group may publish revised and/or new versions of the
     license from time to time. Each version will be given a
     distinguishing version number.
     Once covered code has been published under a particular version
     of the license, you may always continue to use it under the terms
     of that version. You may also choose to use such covered code
     under the terms of any subsequent version of the license
     published by the PHP Group. No one other than the PHP Group has
     the right to modify the terms applicable to covered code created
     under this License.

Apparently, the PHP Group, as license steward, is redefining its own license as the BSD license. The announcement says that the BSD License will be adopted as the PHP License v. 4 and as the Zend Engine License v. 3.

Meta Wins Partial Summary Judgment in AI Infringement Claim

On the heels of the landmark judgment in favor of Anthropic this week, a judge in another pending AI copyright case, Kadrey v. Meta, ruled for the defendants.

Thirteen authors, including most notably Sarah Silverman, sued Meta for using their copyrighted books, downloaded from “shadow libraries,” to train its large language model (Llama). The court explained, “A shadow library is an online repository that provides things like books, academic journal articles, music, or films for free download, regardless of whether that media is copyrighted.” The most notorious of these is Library Genesis, known as LibGen.

Even though Judge Chhabria ruled for the defendants, the language of his opinion was extremely favorable to the plaintiffs. The court said, for example: “[B]y training generative AI models with copyrighted works, companies are creating something that often will dramatically undermine the market for those works, and thus dramatically undermine the incentive for human beings to create things the old-fashioned way.” This statement points to the final and most important factor of fair use–effect on the market for the original work–and suggests that, if the case were argued correctly, this factor would weigh in favor of infringement.

The plaintiffs had argued that Llama could reproduce snippets of the text of their works, and that Meta’s unauthorized training diminished their ability to license the works for AI training. However, the court stated that “Llama is not capable of generating enough text from the plaintiffs’ books to matter, and the plaintiffs are not entitled to the market for licensing their works as AI training data.”

Keep in mind that this same judge had stated in a previous hearing on this case, “I understand your core theory. Your remaining theories of liability I don’t understand even a little bit.” https://www.reuters.com/legal/litigation/us-judge-trims-ai-copyright-lawsuit-against-meta-2023-11-09/

The court implicitly lamented that the plaintiffs did not assert sufficient facts to withstand summary judgment, noting, “Because the issue of market dilution is so important in this context, had the plaintiffs presented any evidence that a jury could use to find in their favor on the issue, factor four would have needed to go to a jury.”

The court strongly hinted that similar cases could benefit from better advocacy. “As for the potentially winning argument—that Meta has copied their works to create a product that will likely flood the market with similar works, causing market dilution—the plaintiffs barely give this issue lip service, and they present no evidence about how the current or expected outputs from Meta’s models would dilute the market for their own works.” This is what one might call a playbook for bringing a more successful claim.

The court concluded: “Given the state of the record, the Court has no choice but to grant summary judgment to Meta on the plaintiffs’ claim that the company violated copyright law by training its models with their books. But in the grand scheme of things, the consequences of this ruling are limited.”

This particular case is not quite over yet. But removing the infringement claims is a significant win for the defense.

It may be no coincidence that this case came on the heels of Judge Alsup’s opinion only days ago. The order in this Meta case referred specifically to Judge Alsup’s opinion, disagreeing with some of his fair use analysis.

AI Training Ruled Fair Use

This week, in Bartz v. Anthropic, Judge Alsup (Northern District of California) ruled that training AI large language models (LLMs) on lawfully acquired works of authorship is fair use.

This is a landmark ruling by the highly respected judge, who handled the Oracle v. Google case.

Infringement claims regarding AI come in two basic flavors: that the act of training is infringement, and that the AI producing output similar to the input is infringement. This ruling is only about the first flavor–the training stage.

Two Acts of Copying

In this case, the defendant purchased copyrighted books, tore off the bindings, scanned every page, and stored them in digitized, searchable files. (This is called destructive scanning, which is faster and easier to do than non-destructive scanning that preserves the original book.) It used selected portions of the resulting database to train various large language models. But Anthropic also downloaded many pirated copies of books, though it later decided not to use them for training. These copies were retained in a digital library for possible future use.

The plaintiffs are authors of some of the books.

Anthropic moved for summary judgment based on fair use, and Alsup found the act of training to be transformative, one of the key factors in modern fair use doctrine. Regarding transformation, Alsup cited the Google Books case, one of the key decisions on fair use in the digital age. (Authors Guild v. Google, Inc., 804 F.3d 202, 217 (2d Cir. 2015)).

The Fair Use Analysis

Fair use is analyzed according to four non-exclusive factors set out in 17 USC 107. On the first factor of fair use, the court distinguished between scanning and pirating activities. The court called the destructive scanning of the books a “mere format change,” which supported a finding of fair use. The purpose of the copy was to support searchability. Anthropic only ended up with the digital copies, not the books.

Before buying the physical books, Anthropic “downloaded over seven million pirated copies of books, paid nothing, and kept these pirated copies…even after deciding it would not use them to train its AI.” The court viewed this differently from the scanning: “Such piracy of otherwise available copies is inherently, irredeemably infringing even if the pirated copies are immediately used for the transformative use and immediately discarded.” The court was not convinced by Anthropic’s argument that the use would ultimately be transformative. Citing the recent Warhol case, the order says, “what a copyist says or thinks or feels matters only to the extent it shows what a copyist in fact does with the work.”

The last of the factors in a fair use analysis–usually considered the most important factor–is the effect of the otherwise infringing activity on the market for the original work. The court said, “The copies used to train specific LLMs did not and will not displace demand for copies of Authors’ works, or not in the way that counts under the Copyright Act.” But this was only for the purchased copies; the court reached the opposite conclusion for the pirated copies.

What’s Next?

The case can now proceed to trial only for the pirated copies. For the purchased books that were destructively scanned, the claims were dismissed.

This case is a class action, and the motion for class certification is still pending. If the class is not certified, plaintiffs often give up or settle for small amounts. Law firms that specialize in bringing class actions depend on certification of a large class to increase damages and, accordingly, their fees.

There are about 40 pending cases in the US on AI and copyright, and many of them may have suffered a setback with this opinion. Alsup’s opinion is in line with what many copyright commentators (including me) have proposed: that training is lawful if done with lawful access to the training material. The decision of a district court will not bind cases pending in other districts. However, because Alsup is a well-respected jurist, his analysis may persuade other courts to follow suit.

The court did not reach the second flavor of infringement claims regarding output, because it was not at issue here. But many commentators are skeptical that such claims will be successful for properly trained models. ML models typically do not produce “copies” in the sense intended by the copyright law. Claims regarding output may therefore be relegated to trademark, publicity and trade dress claims, which are outside of the ambit of copyright law.

Postmodern Art and Cannabis Law

I’m intrigued as to why an article about cannabis law cites to an article I wrote over thirty years ago about copyright fair use in postmodern art. But the Tulsa Law Review has paywalled its prestigious journal, forcing me to pay if I want to find out, and honestly, I’m not quite that intrigued.

By the way, you can download my article for free, and there is even an update here.

AI Could Be Your Next Team for Clean Room Development

Clean room developments are necessary when a developer wants to “cleanse” the intellectual property burden of third party software. The need arises when third party software is provided under unacceptable license terms, or not licensed at all. This is one of the trickiest tasks in software development, but it has a long history of best practices.

The canonical clean room development seeks to avoid trade secrets of proprietary software. But the rise of open source has resulted in the need to do a different kind of clean room project, meant to avoid the copyright in open source software–usually for GPL licensed packages. The two situations call for a slightly different approach. A clean room process for proprietary code seeks to avoid trade secrets and copyright burdens, whereas clean room development in open source is entirely about copyright–because there are no trade secrets in open source software. In either case, a team of developers seeks to write new implementing code from scratch, so that code will perform the same tasks, with the same inputs and outputs, as the original or “target” code.

A traditional clean room development process looks something like this:

  • Separate Development Teams: Create two teams of developers: a specification team that develops a specification from the target code, and an implementation team that writes the new implementing code.
  • Create a Specification: The specification team, which has access to the target code, extracts the specifications for the software’s requirements and expected behavior. Software, at the end of the day, is a set of inputs and outputs, and its specifications state what outputs you should expect when certain inputs are used.
  • Reimplementation: The implementation team writes the new software according to the specification developed by the specification team. This must be done in an environment that is “cleansed” of the target code. Ideally, the implementation team has never read the target code.
  • Verification: The implementation team tests the newly implemented clean code (see the sketch after this list). If there are bugs, the specification team can only confirm the accuracy of the specification. The specification team cannot suggest bug fixes, because that might result in inadvertent copying. Bug fixes are done by the implementation team.
  • Iterate: Repeat until the development is done.
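
To make the verification step concrete, here is a minimal sketch in Python, under stated assumptions: the target is a hypothetical CRC-16/CCITT-FALSE checksum routine, and the test vectors stand in for the input/output pairs a specification team would deliver. None of the names come from a real project.

    # Minimal sketch of the verification step (hypothetical names throughout).
    # The specification team delivers only input/output pairs and prose; the
    # implementation team writes clean_crc16() without ever seeing target code.

    SPEC_VECTORS = [
        # (input bytes, expected checksum), per the written specification.
        # 0x29B1 is the published check value for CRC-16/CCITT-FALSE.
        (b"", 0xFFFF),
        (b"123456789", 0x29B1),
    ]

    def clean_crc16(data: bytes) -> int:
        """Clean-room reimplementation, written only from the specification."""
        crc = 0xFFFF
        for byte in data:
            crc ^= byte << 8
            for _ in range(8):
                crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
                crc &= 0xFFFF
        return crc

    def verify() -> None:
        # On failure, only the mismatch is reported back; the specification
        # team confirms the spec, and the implementation team fixes the bug.
        for data, expected in SPEC_VECTORS:
            actual = clean_crc16(data)
            assert actual == expected, (
                f"spec mismatch for {data!r}: got {actual:#06x}, "
                f"want {expected:#06x}"
            )
        print("all specification vectors pass")

    if __name__ == "__main__":
        verify()

The separation shows up in the code’s shape: the vectors encode what the target does, while clean_crc16() is free to implement that behavior however it likes.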

Of course, there are far more complex processes for clean room development. Some have three teams, and most have a lot more steps. I have seen guidelines so many pages long they have a table of contents. But the above is the essence–not to mention the most my clients have the patience to read.

Not Enough Humans

The problem most companies have when performing a clean room development is that they don’t have the resources to create two separate teams. Even if they do, they usually cannot create an implementation team that has never been exposed to the target software–and doing so is particularly difficult when the target software is open source, because there is no way to prove lack of access to publicly available materials. For an open source clean room process, we usually make do with developing implementing code in an environment that does not have local access to the target code.

But now, with the advent of AI, we have an alternative way to approach clean room development.

I pause here to note that while there are those who think that all generative AI is prima facie copyright infringement, I don’t agree. As long as the model has been trained on enough inputs, it should not parrot any one input. (More on that here.) So let’s set that issue aside, because if you disagree with me, you shouldn’t be using AI coding tools at all and you should just put this article aside.

An AI that writes code (like Claude or Copilot) has probably been exposed to almost all the open source code ever written. But via that training process, it is unlikely to focus on specific target code. So, companies struggling to staff a clean room development might consider replacing one or both of the teams with AI. As always, some human oversight is necessary to check that an AI generative process has been done correctly. But using it would still greatly reduce the headcount necessary to implement the clean room process.

  • Specification Team. AI is better at some tasks than others, but I have found that AI is quite good at summarizing text. If you ask it to write the specifications for target software, it will probably do a good job. You could use an AI for your specification team, and that would help avoid “contaminating” your implementation team with access to the target software. (A minimal sketch of this idea appears after this list.)
  • Implementation Team. AI is quite good at writing code, though more human oversight would probably be necessary to use AI for this purpose. AI-assisted coding still requires human curation, and also usually requires human debugging. Debugging is a complex logical task, and the current flavor of AI–transformer-based models–is better at text generation than logic. But in a pinch, you might use AI as your implementation team and use the specification team for quality control.
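
Here is a minimal sketch of the AI-as-specification-team idea, in Python. It only illustrates the workflow: ask_model() is a hypothetical placeholder for whatever LLM API you use, and the prompt wording is my own, not drawn from any established clean room guideline.

    # Hypothetical sketch of an AI "specification team" in a clean room process.
    # ask_model() is a placeholder, NOT a real API; wire it to your provider.

    def ask_model(prompt: str) -> str:
        """Placeholder for a call to an LLM of your choice (hypothetical)."""
        raise NotImplementedError("connect this to your model provider")

    def extract_specification(target_source: str) -> str:
        """Ask the model to describe behavior only, never the code itself."""
        # The constraints below aim to keep copyrightable expression (the
        # code) out of the specification, so that only functional behavior
        # reaches the human implementation team.
        prompt = (
            "You are the specification team in a clean-room software "
            "process. Read the following source code and describe ONLY its "
            "externally observable behavior: inputs, outputs, edge cases, "
            "error handling, and performance constraints. Do not quote, "
            "paraphrase, or reproduce any code, identifiers, or comments.\n\n"
            + target_source
        )
        return ask_model(prompt)

The design point is the prompt’s constraint: the specification should carry only functional behavior across the wall, which mirrors what a human specification team is instructed to do.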

Neither of these suggestions should be surprising. AI code generation greatly reduces the human effort necessary to produce code, and clean room projects are human-intensive. For an open source target, I think the use of AI as a specification team is quite interesting. For proprietary code, using AI as the implementation team may be particularly interesting, because AIs are mostly not trained on proprietary code, making the cleansing more reliable.

Always remember: wash your hands before you code!