AGPL In the Light of Day

Recently, Open Core Ventures posted a blog entry called “AGPL license is a non-starter for most companies.” It commented:

Companies fear the AGPL because of the risk of a developer combining code in a way that would require them to open source parts of their code base that were not intended to be open source. It restricts a user’s ability to adapt the code to their needs to the point where it gives exclusive commercial rights to the original copyright holder.

The blog made some interesting points, but I disagree with the conclusion, so I wanted to add some of my own perspective.

AGPL has come a long way. About a decade ago, I wrote an article called “AGPL–Out of the Shadows.”* That article identified some of the same issues with AGPL adoption mentioned in the OCV blog, but mentioned a budding trend toward increased understanding and acceptance. After all these years, it’s time for an update on how AGPL is used in the 2020s. In fact, AGPL has emerged as the license of choice for commercial open source software (COSS) companies in the application space.

When is a Ban not a Ban?

The fear of AGPL is still out there. Google still bans AGPL. Most companies don’t publish their open source usage policies, but Google does, so its policy is one of the few public data points about corporate open source policy, and lots of other companies pay attention to it. Lots of companies still place AGPL on their “stop” list (see the example policy here) because they worry they don’t have the internal controls to comply with it. That’s a conservative approach, and there’s nothing wrong with being careful.

But the “stop” category in a license policy is not the same as a ban. Having worked with hundreds of clients on open source compliance in my law practice over the years, I’ve seen first-hand how this works. It means ad hoc human approval is needed to use software under the license. That approval does create friction in adoption. But what balances against that friction is a great product.
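To make the distinction concrete, here is a minimal sketch of how a “stop” category might behave in an internal compliance tool. Everything in it is an assumption for illustration: the category names other than “stop,” the policy table, the component names, and the `may_use` helper are all hypothetical, and real corporate policies vary widely.

```python
from enum import Enum


class Category(Enum):
    GO = "go"      # hypothetical category: pre-approved for use
    STOP = "stop"  # requires ad hoc human approval; not a ban


# Hypothetical policy table keyed by SPDX license identifier.
POLICY = {
    "MIT": Category.GO,
    "Apache-2.0": Category.GO,
    "AGPL-3.0-only": Category.STOP,
}


def may_use(component, license_id, approvals):
    """Return True if the component may be used under this sketch policy."""
    # Unknown licenses default to "stop" so that a human reviews them.
    category = POLICY.get(license_id, Category.STOP)
    if category is Category.GO:
        return True
    # "Stop" means approval is needed, so look for a recorded approval.
    return (component, license_id) in approvals


# Illustrative approval record: one AGPL component has been reviewed.
approvals = {("grafana", "AGPL-3.0-only")}
print(may_use("grafana", "AGPL-3.0-only", approvals))    # approved exception
print(may_use("some-tool", "AGPL-3.0-only", approvals))  # no approval yet
```

The point of the sketch is that a “stop” license is gated by a lookup in an approvals record, not rejected outright. The friction lies in getting an entry into that record.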

This has always been how copyleft licenses work. Copyleft is complicated. There is a non-zero price to understanding anything complicated. For GPL, the killer app was Linux. It took nearly 20 years for companies to calm down about GPL and adopt Linux. For AGPL, the killer app was originally MongoDB. (In the 2010s, in my legal practice, almost every approval of AGPL software on a corporate “stop” list was for MongoDB.) Products drive adoption, not licenses. Users don’t adopt licenses, they adopt software. The license simply represents a tax–in the economic sense–on its use. In other words, the conditions of the license are the cost of using the software. And for years, AGPL didn’t have enough killer apps to make many corporate users develop compliance processes for its conditions.

So, the friction OCV described exists, but it’s not entirely about the license. It’s about the balance between the complexity of AGPL and the attractiveness of software products using it.

Applications Thrive as Open Source

What’s changed since 2016 is an explosion in COSS development. At OSS Capital, we have embraced that phenomenon. But we believe the licenses serve the products, not vice-versa. To us, open source development is a huge business advantage, and that advantage can be gained with any open source license.**

Since we started our fund, many excellent COSS businesses have chosen AGPL as part of their licensing and business strategy. At this moment, it’s probably fair to say that Grafana is the “killer app” of AGPL in the business world. But there are many, many startups choosing AGPL. And in this difficult economy, their success is truly amazing. That’s no accident, because open source adoption wins a lot of business during recessionary and inflationary periods. Below is a list of only some of the companies using AGPL very successfully from OSS Capital’s own network. If AGPL does not work, that’s news to all these companies.

So, why does AGPL work for all these companies, if it is so scary? First, AGPL is not really all that scary. As I alluded to back in 2016, the main problem with AGPL was that it was relatively new, and different from other open source licenses, and required compliance processes that most companies had never implemented. Over time, adopters have become less fearful about AGPL. Compliance processes have improved. Recent focus on software BOMs and security has hauled most companies into open source compliance, because the same tools usually track both needs. So these days, it’s easier to track AGPL code in an organization.

But more importantly, the rise in adoption of AGPL in COSS has tracked the rise of COSS in applications. That’s a relatively recent development. Open source originally thrived in the basic computing stack. Permissive licenses like Apache or MIT work great for infrastructure software. Copyleft, in general, is more problematic for IT managers adopting infrastructure software, because they need flexibility to make significant changes to integrate infrastructure code, and don’t want to wade into analyzing copyleft licensing requirements. Consider Confluent, Redis, Elastic (back in their Apache days)–the list goes on. All of those infrastructure tools grew up under permissively licensed cores.

But application space is different. From the use point of view, the copyleft compliance requirements in application space are less troubling than in infrastructure. Applications are usually stand-alone processes, not libraries or tools. That means the scope of copyleft requirements (one “Program”) is not so difficult to figure out. Also, most users of applications today are accustomed to using SaaS instead of installed software. That means if you choose copyleft, you need a network copyleft license, so AGPL is the only real choice.

CLAs and Loopholes

OCV, and many others, have pointed out a “loophole” that allows vendors of AGPL applications to grant alternative commercial licenses. But that “loophole” merely describes the limits of what an open source model can do. Copyleft licenses don’t limit or condition the use of software by its authors.

In privately funded business, AGPL is almost always used as part of a dual licensing strategy. That means the software is available under AGPL, but if you don’t want to comply, you can buy an alternative license from the vendor. In that case, vendors almost always use a contributor license agreement (CLA) for contributions from the community. Otherwise, the vendor sometimes can’t sell alternative licenses, because the contributions are encumbered with AGPL conditions. For an explanation about how this works, see my video here.

There is a lot of FUD about CLAs, but at the end of the day, if a contributor doesn’t want to allow her contributions to an open source code base to be used in a vendor’s commercial products, the license allows her to fork the code as a pure AGPL project. But of course, that means someone has to fund the maintenance of that fork. Given private companies sink millions of dollars, not to mention years of their founders’ lives, into maintaining open core products, contributors often feel that signing a CLA is a reasonable quid pro quo for that commitment of capital and sweat equity. Those who object to CLAs, at the end of the day, mostly object to privately funded open source development as a general proposition. And that’s a personal choice for contributors.

For a COSS developer, the choice is between a permissive license, like Apache/BSD/MIT, and AGPL. Most in-between choices eventually migrate to one pole or the other. Infrastructure is mostly under permissive licenses, and applications are mostly under AGPL.

Don’t Believe Me, Just Watch

In any case, try out some of the great products that are thriving under the AGPL/dual-licensing model! As with pudding, the proof of the license is in the using.

*That link shows the article as authored by the “Synopsys Editorial Team” but if so, they used a time machine and an LLM to write an article in exactly my style. But seriously, I wrote it, and as far as I know, I’m not on their staff. It’s pretty rare that anyone tries to take credit for my writing–most authors would not want to take the heat for the things I say!

**Well, not exactly. Non-standard or superseded licenses like Common Public License don’t work so well. But that’s about standardization, not substantive license terms. COSS businesses usually need to choose one of the “big 6”: AGPL, GPL, LGPL, BSD, MIT, Apache.

OpenAtom Foundation, and My Book in Chinese

I am excited to announce that my book Open Source for Business is now available for free download in Chinese.

Many thanks to OpenAtom Foundation, China’s first open source foundation, for preparing the translation and making it available. I gave a virtual talk at OpenAtom’s recent conference, and it is available here (start at 1:53).

Toward an Open Weights Definition

Today, OSS Capital is starting a project to collect community input on a definition of Open Weights. We have published our initial effort at https://github.com/Open-Weights/Definition.

Here is the TLDR:

It is critical for the industry to develop and standardize on “Open Weights” licensing frameworks. These frameworks should align closely with the Four Freedoms of free software but should be specifically tailored for Neural Net Weights (NNWs). Recently, my partner and the founder of OSS Capital, Joseph Jacks, posted about this issue, and there was significant interest.

We need a standard for Open Weights that recognizes the unique nature of NNWs and provides legal and practical guidelines for their use, distribution and sharing. This requires collaboration from the entire AI community, including developers, researchers, legal experts, and regulatory bodies.

Also, we do not believe that a definition of Open Weights needs to import subjects such as privacy, human rights, or clearance of data inputs into its licensing principles at this time. We know those are important topics, but they will take time to figure out. We are focused instead on the original idea of openness, and preserving the original goals of Freedom Zero of free software and the non-discrimination principles of open source. We encourage others to develop their own standards for restrictions and ethical licensing, and to participate in the legislative process to set the standards of society for limiting activity to proper use of AI, the information used to train it, and the information it produces.

Also, we applaud those communities who are working on their own definitions. At OSS Capital, we have committed to sponsor the Open Source Initiative’s efforts in this regard, and we hope our efforts will dovetail. But we believe time is of the essence, so we hope our effort will jumpstart collaboration.

We believe that this definition should be developed in the open, much like open source software itself. Therefore this definition and license will be published on GitHub and the community is invited to improve it.

Like OSI, we are less concerned with the exact substance of the definition than with making sure there is a definition everyone can trust.

Here are some things we considered when creating our draft definition.

  • Time is of the essence. We need a definition soon; developers and users alike are struggling because they don’t have one, and they need one, so they can make proper choices about which models to use.
  • Keep it simple. We are leaving questions like privacy and ethics to other initiatives. Those are important, but much more complicated, and will take time to work out. We are also not tackling the issue of clearing rights in training data.
  • Define both the licensed material and the license. We need to know not only license terms for licensees, but what needs to be disclosed by the licensor to make the licensed material open. This is, we think, the most difficult challenge for creating this definition.
  • Get community input. We welcome everyone to comment and make suggestions on GitHub. We hope to help the discussion but not control it.

If you’d like to contribute or open issues, please see our GitHub.

If you’d like to follow the project, you can watch the repository (see GitHub’s instructions on how to watch a repo).

You can engage in discussion here: https://github.com/Open-Weights/Definition/discussions

FOSSDA Launches to Celebrate FOSS Month

I am thrilled to announce that, with generous support from Mozilla and the Apache Foundation, we have officially launched FOSSDA, the Free and Open Source Stories Digital Archive. It’s time to tell the story of the free and open source movement! This project is now officially underway, thanks to all those who have helped make it happen.

Press release here.

Is Copyright Eating AI? 

Marc Andreessen famously said that software is eating the world. But the latest and greatest software trend–generative AI–is in danger of being swallowed up by copyright law. Like a cruise ship heading for a scary iceberg, AI is in trouble, and the problems are mostly below the surface. 

We now have a pair of lawsuits claiming that GitHub’s Copilot model is stealing open source code from its authors, and that companies using Stable Diffusion or other models (including Stability AI, DeviantArt, and Midjourney) are stealing images from visual artists. Both lawsuits are being prosecuted by Matthew Butterick (best known as the author of Typography for Lawyers) along with the Joseph Saveri Law Firm, a class action firm.

The Copilot lawsuit is widely touted in the press as a copyright infringement case, but in fact it doesn’t claim copyright infringement. It does claim a litany of other wrongs, such as removal of copyright information, breach of contract, and fraud. The Stable Diffusion suit is in fact a copyright infringement suit. More importantly, and sadly, these lawsuits are probably a bellwether of more to come.

The Copilot suit is ostensibly being brought in the name of all open source programmers. Yes, that’s right, people crusading in the name of open source–a movement intended to promote freedom to use source code–are now claiming that a neural network, designed to save programmers the onus of re-inventing the wheel when they need code to perform programming tasks, is de facto unlawful. The open source movement is wonderful in many ways, but its tendency to engage in legal maximalism to “protect” open source is sometimes disappointing.

The Stable Diffusion suit alleges copyright infringement, stating that, “The resulting image is necessarily a derivative work, because it is generated exclusively from a combination of the conditioning data and the latent images, all of which are copies of copyrighted images. It is, in short, a 21st-century collage tool.” That characterization is the essence and conclusion of the lawsuit, and one with which many AI designers would disagree.

So, all neural network developers, get ready for the lawyers, because they are coming to get you. 

Fair Use or “Fair & Ethical”?

The crux of the problem is that US copyright law, despite many landmark cases, still gives us little or no guidance on how the defense of fair use applies. The Oracle v. Google case, the biggest fair use case of this century, ambled on a lengthy and astonishingly expensive road to a Supreme Court decision. As Larry Lessig famously quipped, “fair use is the right to hire a lawyer,” and the Supremes proved that true by issuing an opinion that provided little guidance outside of the specific facts of the case.

However you may feel about Google, it’s lucky that Google has the determination and resources to have spent astronomical legal fees defending the right of fair use–from books, to thumbnail photos, to news headlines, to software interface specifications. Users of the web benefit from that. If the AI industry avoids this iceberg, it will be partly because of Google’s historical unwillingness to roll over on fair use cases.

Let’s hope Microsoft (which funded OpenAI and owns GitHub) has the Google-like intestinal fortitude and money to win this battle. But if Oracle v. Google is any measure, the answer might not come for 10 years, by which time the neural network industry may have been litigated out of existence–or worse yet, limited to those large players who can fund an expensive legal defense. For startups, having a lawsuit hanging over their heads is usually a death knell, between expensive legal bills siphoning off their development resources, and investors shying away from the risk.

Tell Me What You Want, What You Really, Really Want

One perplexing aspect of the lawsuits–and likely all that will follow in their footsteps–is what best practices the plaintiffs actually would want the AI industry to adopt going forward. Butterick says his class action cases are “another step toward making AI fair & ethical for everyone.” But other than netting a hefty fee for the lawyers who bring the suit, what is the endgame, exactly?

Both lawsuits ask for permanent injunctive relief, which would essentially shut down the use of the accused models, but that is part of the playbook for litigation and probably not the result they would prefer. And even for most people who sympathize with the lawsuits, that is not the preferred endgame. Though there are lots of memes out there about Skynet, most people do not want AI to shut down, and if they do, it’s not because of copyright law.

One possible best practice would be to allow authors to specifically opt out of use of their output for ML training. (In fact, Stability has suggested this approach.) This type of approach can work when technical development bumps up against the limits of copyright law. For example, there is a “do not index” mechanism (robots.txt) for web sites that is broadly honored by large scale search engines. But such a convention would have a prodigious backlog to tag, and also, for software authors, prohibiting ML training would be antithetical to the Open Source Definition. So that probably won’t work.
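For readers who have not seen the robots.txt convention in action, here is a minimal sketch using Python’s standard `urllib.robotparser` module. The user-agent name `example-ml-trainer` and the URLs are hypothetical, invented for illustration. The sketch does not describe any opt-out convention that actually exists for ML training; it only shows how a crawler that chose to honor such a rule could check it.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: the site opts out of one (invented)
# ML-training crawler while remaining open to every other agent.
ROBOTS_TXT = """\
User-agent: example-ml-trainer
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(agent, url):
    """Return True if the parsed rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


# The hypothetical training crawler is refused; a generic crawler is not.
print(may_crawl("example-ml-trainer", "https://example.com/gallery/1.png"))
print(may_crawl("example-search-bot", "https://example.com/gallery/1.png"))
```

Note that the mechanism is purely voluntary. It works for search only because large crawlers choose to run a check like this before fetching.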

Another possibility is compensation for those who wrote the original material used to train the models. Over the years, there have been various attempts to compensate authors for numerous and small contributions to copyrightable works. This is primarily an information problem, and those who try to solve it usually propose a blockchain-based approach, lest payment transaction costs outweigh the compensation. None has been successful yet.

Even if there were a technical solution to the information problem, it would be difficult to allocate compensation to a broad set of creators in a fair way. In the music business, there are artists’ rights organizations like ASCAP and BMI that amalgamate the power to grant blanket music performance licenses to consumers of music, like restaurants that play music over their sound systems. In fact, these rights amalgamation organizations enjoy a limited safe harbor from antitrust law, because they facilitate what would otherwise require millions of small, individual licensing deals. 

But this will not work for generative AI. Performance rights organizations reward their authors roughly according to the popularity of their songs. For generative AI, it would be functionally impossible to track which work had been used, because the output is not, in fact, a copy of the original, nor even a collage–but a new work synthesized from a model trained using the original work. If compensation is not tracked to the images actually used, then we would likely see a spate of garbage images being thrown into the mix to grab some of the proceeds. It would be easier to set up a grant fund for artists generally than to track the contributions among millions of artists to a single AI-generated image.

The problem is that neural network models, and their outputs, are not copies of the original works. They are a set of probabilities (weights) that are trained based on thousands or even millions of data points. And at least as of now, it is not possible to look at ML output and determine which inputs, nodes and weights created it. ML, for now, is mostly a “black box” whose inputs and outputs are impossible to connect. In fact, the lack of reproducibility of ML has already been tagged as a social issue: if you build a model that discriminates in its output, how do you audit it? Eventually, the ML industry may solve this problem, but for now, it means there is usually a disconnect between the inputs and outputs, and that probably means that copying could never be inferred in a way that could reliably allocate compensation to the authors of inputs. That, in turn, should mean there is no copyright infringement, but the lawsuits posit otherwise.

Moreover, there is a notice problem. Each of the lawsuits alleges a claim under the Digital Millennium Copyright Act (DMCA), which prohibits removal of copyright management information (“CMI,” defined in 17 USC §1202(c)) such as copyright notices. But even assuming that some license notice, or copyright notice, would have to be communicated whenever an AI output was generated, how exactly would that happen? Would each resulting image require thousands or millions of notices? Even now, conventional users of open source code struggle greatly with management and delivery of license notices–anyone who has worked on open source compliance knows how difficult that can be. But these lawsuits make that problem look like child’s play.

If AI Dies, Who Wins?

Both of the Butterick suits are being brought as class actions–a type of lawsuit popularized in the US and still relatively unusual elsewhere. You may have gotten notices from class action lawyers asking you to opt in to a settlement class to which you belong. If you’re like me, you toss them out, because your reward for joining the class will probably be a coupon or a princely settlement of $20.

And so, who benefits from class action suits? Well, class action lawyers. When you hear that a class action suit has resulted in $6 million in damages, the lawyers probably get about $2 million (one-third). Because a class can consist of thousands of members (or in the case of the Butterick suits, probably millions), the damages allocated to the individual class members can be tiny indeed. Sometimes, the lawyers actually get bigger payouts than all the plaintiffs combined. The US class action model has been strongly criticized for being a vehicle for enrichment of plaintiffs’ lawyers that provides relatively little real compensation to the plaintiffs they represent. Class action proponents use populist rhetoric and anecdotes to justify their suits, but empirical studies are relatively few, and sometimes, damning. (See for example: https://instituteforlegalreform.com/research/do-class-actions-benefit-class-members/ and https://www.tortreform.com/news/study-class-action-lawyers-often-take-more-money-from-settlements-than-class-members/.)
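The arithmetic above can be sketched in a few lines. The $6 million award and the one-third fee come from the example in the text. The class size of one million is purely an assumption for illustration, and the `class_action_split` helper is hypothetical. Real fee awards and distribution costs vary from case to case.

```python
from fractions import Fraction


def class_action_split(total_damages, class_size, fee_fraction=Fraction(1, 3)):
    """Split an award into attorney fees and an equal per-member payout.

    Assumes (for illustration only) that fees are a flat fraction of the
    award and that the remainder is divided equally, with no other costs.
    """
    fees = total_damages * fee_fraction
    per_member = (total_damages - fees) / class_size
    return fees, per_member


# $6 million award, one-third fee, assumed class of one million members.
fees, per_member = class_action_split(6_000_000, 1_000_000)
print(f"Lawyers receive: ${float(fees):,.2f}")
print(f"Each class member receives: ${float(per_member):,.2f}")
```

Under those assumptions, the lawyers take $2 million while each class member receives $4, which is why the individual payout so often amounts to a coupon.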

Don’t Die

If the AI industry is to survive, we need a clear legal rule that neural networks, and the outputs they produce, are not presumed to be copies of the data used to train them. Otherwise, the entire industry will be plagued with lawsuits that will stifle innovation and only enrich plaintiff’s lawyers. Matthew Butterick has stated that these lawsuits are an attempt to set a precedent in favor of artists, because the law is unclear. Lack of clarity causes people to act conservatively to avoid liability, and that stifles innovation. Given that the courts are unlikely to come up with a common-law rule in this decade, clarity probably needs to come in the form of a legislative amendment to the copyright law. Unless it comes soon, the generative AI industry may be in trouble.

It’s unclear whether Butterick’s suits are mostly a publicity stunt and a ploy for the plaintiff’s lawyers to make a windfall, or a selfless attempt to provide equity for authors, or somewhere in between. But one thing is sure: they will spark a cottage industry for plaintiff’s lawyers, cause crippling expenses for AI developers, and thwart innovation in the generative AI field. As the tech industry celebrates the frothy emergence of machine learning in a time of economic doom and gloom, let’s hope this nascent field doesn’t sink because of the copyright iceberg looming ahead.

Note: Since I started preparing this article for publication yesterday, an additional case was threatened in London by Getty Images regarding Stability AI. Because Getty is the single owner of so many images, and outside the US, this is not a class action suit, and may be more likely to result in a settlement. 

Update February 6, 2023: The other shoe drops: Getty filed a complaint in Delaware against Stability AI.

Also, this blog post is a personal opinion, and nothing I have written here should be attributed to any of the parties involved.