I’ve started a new video series about software, copyright, and other tech subjects I’m thinking about. If you enjoy them, please subscribe.
I am excited to announce that my book Open Source for Business is now available for free download in Chinese.
Many thanks to OpenAtom Foundation, China’s first open source foundation, for preparing the translation and making it available. I gave a virtual talk at OpenAtom’s recent conference, and it is available here (start at 1:53).
Today, OSS Capital is starting a project to collect community input on a definition of Open Weights. We have published our initial effort at https://github.com/Open-Weights/Definition.
Here is the TLDR:
It is critical for the industry to develop and standardize on “Open Weights” licensing frameworks. These frameworks should align closely with the Four Freedoms of free software but should be specifically tailored for Neural Net Weights (NNWs). Recently, my partner and the founder of OSS Capital, Joseph Jacks, posted about this issue, and there was significant interest.
We need a standard for Open Weights that recognizes the unique nature of NNWs and provides legal and practical guidelines for their use, distribution and sharing. This requires collaboration from the entire AI community, including developers, researchers, legal experts, and regulatory bodies.
Also, we do not believe that a definition of Open Weights needs to import subjects such as privacy, human rights, or clearance of data inputs into its licensing principles at this time. We know those are important topics, but they will take time to figure out. We are focused instead on the original idea of openness, and preserving the original goals of Freedom Zero of free software and the non-discrimination principles of open source. We encourage others to develop their own standards for restrictions and ethical licensing, and to participate in the legislative process to set the standards of society for limiting activity to proper use of AI, the information used to train it, and the information it produces.
Also, we applaud those communities who are working on their own definitions. At OSS Capital, we have committed to sponsor the Open Source Initiative’s efforts in this regard, and we hope our efforts will dovetail. But we believe time is of the essence, so we hope our effort will jumpstart collaboration.
We believe that this definition should be developed in the open, much like open source software itself. Therefore this definition and license will be published on GitHub and the community is invited to improve it.
Like OSI, we are less concerned with the exact substance of the definition than with making sure there is a definition everyone can trust.
Here are some things we considered when creating our draft definition.
- Time is of the essence. We need a definition soon; developers and users alike are struggling without one, and they need it to make proper choices about which models to use.
- Keep it simple. We are leaving questions like privacy and ethics to other initiatives. Those are important, but much more complicated, and will take time to work out. We are also not tackling the issue of clearing rights in training data.
- Define both the licensed material and the license. We need to know not only license terms for licensees, but what needs to be disclosed by the licensor to make the licensed material open. This is, we think, the most difficult challenge for creating this definition.
- Get community input. We welcome everyone to comment and make suggestions on GitHub. We hope to help the discussion but not control it.
If you’d like to contribute or open issues, please see our GitHub.
If you’d like to follow the project, you can watch the repository (GitHub provides instructions on how to watch a repo).
You can engage in discussion here: https://github.com/Open-Weights/Definition/discussions
Check out FOSSA’s video on my recent talk.
I am thrilled to announce that, with generous support from Mozilla and the Apache Foundation, we have officially launched FOSSDA, the Free and Open Source Stories Digital Archive. It’s time to tell the story of the free and open source movement! This project is now officially underway, thanks to all those who have helped make it happen.
Marc Andreessen famously said that software is eating the world. But the latest and greatest software trend–generative AI–is in danger of being swallowed up by copyright law. Like a cruise ship heading for a scary iceberg, AI is in trouble, and the problems are mostly below the surface.
We now have a pair of lawsuits claiming that GitHub’s Copilot model is stealing open source code from its authors, and that companies using Stable Diffusion or other models (including Stability AI, DeviantArt, and Midjourney) are stealing images from visual artists. Both lawsuits are being prosecuted by Matthew Butterick (best known as the author of Typography for Lawyers) along with the Joseph Saveri Law Firm, a class action firm.
The Copilot lawsuit is widely touted in the press as a copyright infringement case, but in fact it doesn’t claim copyright infringement. It does claim a litany of other wrongs, based on theories like removal of copyright information, breach of contract, and fraud. The Stable Diffusion suit is in fact a copyright infringement suit. More importantly, and sadly, these lawsuits are probably a bellwether of more to come.
The Copilot suit is ostensibly being brought in the name of all open source programmers. Yes, that’s right, people crusading in the name of open source–a movement intended to promote freedom to use source code–are now claiming that a neural network, designed to save programmers the onus of re-inventing the wheel when they need code to perform programming tasks, is de facto unlawful. The open source movement is wonderful in many ways, but its tendency to engage in legal maximalism to “protect” open source is sometimes disappointing.
The Stable Diffusion suit alleges copyright infringement, stating that, “The resulting image is necessarily a derivative work, because it is generated exclusively from a combination of the conditioning data and the latent images, all of which are copies of copyrighted images. It is, in short, a 21st-century collage tool.” That characterization is the essence and conclusion of the lawsuit, and one with which many AI designers would disagree.
So, all neural network developers, get ready for the lawyers, because they are coming to get you.
Fair Use or “Fair & Ethical”?
The crux of the problem is that US copyright law, despite many landmark cases, still gives us little or no guidance on how the defense of fair use applies. The Oracle v. Google case, the biggest fair use case of this century, ambled down a lengthy and astonishingly expensive road to a Supreme Court decision. As Larry Lessig famously quipped, “fair use is the right to hire a lawyer,” and the Supremes proved that true by issuing an opinion that provided little guidance outside of the specific facts of the case.
However you may feel about Google, it’s lucky that Google has had the determination and resources to spend astronomical legal fees defending the right of fair use–from books, to thumbnail photos, to news headlines, to software interface specifications. Users of the web benefit from that. If the AI industry avoids this iceberg, it will be partly because of Google’s historical unwillingness to roll over on fair use cases.
Let’s hope Microsoft (which funded OpenAI and owns GitHub) has the Google-like intestinal fortitude and money to win this battle. But if Oracle v. Google is any measure, the answer might not come for 10 years, by which time the neural network industry may have been litigated out of existence–or worse yet, limited to those large players who can fund an expensive legal defense. For startups, having a lawsuit hanging over their heads is usually a death knell, between expensive legal bills siphoning off their development resources, and investors shying away from the risk.
Tell Me What You Want, What You Really, Really Want
One perplexing aspect of the lawsuits–and likely all that will follow in their footsteps–is what best practices the plaintiffs actually would want the AI industry to adopt going forward. Butterick says his class action cases are “another step toward making AI fair & ethical for everyone.” But other than netting a hefty fee for the lawyers who bring the suit, what is the endgame, exactly?
Both lawsuits ask for permanent injunctive relief, which would essentially shut down the use of the accused models, but that is part of the playbook for litigation and probably not the result they would prefer. And even for most people who sympathize with the lawsuits, that is not the preferred endgame. Though there are lots of memes out there about Skynet, most people do not want AI to shut down, and if they do, it’s not because of copyright law.
One possible best practice would be to allow authors to specifically opt out of the use of their output for ML training. (In fact, Stability has suggested this approach.) This type of approach can work when technical development bumps up against the limits of copyright law. For example, there is a “do not index” mechanism (robots.txt) for web sites that is broadly honored by large-scale search engines. But such a convention would leave a prodigious backlog of existing works to tag; also, for software authors, prohibiting ML training would be antithetical to the Open Source Definition. So that probably won’t work.
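For comparison, the web’s opt-out convention is nothing more than a plain-text file served at a well-known path. A hypothetical training opt-out could take a similar shape; to be clear, the `ml-trainer` agent name and `Disallow-Training` directive below are invented for illustration, and no such standard exists today:

```text
# robots.txt - the existing "do not index" convention, honored by major crawlers
User-agent: *
Disallow: /private/

# Hypothetical ML-training analog (illustrative only, not a real standard):
# User-agent: ml-trainer
# Disallow-Training: /artwork/
```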
Another possibility is compensation for those who wrote the original material used to train the models. Over the years, there have been various attempts to compensate authors for numerous and small contributions to copyrightable works. This is primarily an information problem, and those who try to solve it usually propose a blockchain based approach, lest payment transaction costs outweigh the compensation. None has been successful yet.
Even if there were a technical solution to the information problem, it would be difficult to allocate compensation to a broad set of creators in a fair way. In the music business, there are artists’ rights organizations like ASCAP and BMI that amalgamate the power to grant blanket music performance licenses to consumers of music, like restaurants that play music over their sound systems. In fact, these rights amalgamation organizations enjoy a limited safe harbor from antitrust law, because they facilitate what would otherwise require millions of small, individual licensing deals.
But this will not work for generative AI. Performance rights organizations reward their authors roughly according to the popularity of their songs. For generative AI, it would be functionally impossible to track which work had been used, because the output is not, in fact, a copy of the original, nor even a collage–but a new work synthesized from a model trained using the original work. If compensation is not tracked to the images actually used, then we would likely see a spate of garbage images being thrown into the mix to grab some of the proceeds. It would be easier to set up a grant fund for artists generally than to track the contributions among millions of artists to a single AI-generated image.
The problem is that neural network models, and their outputs, are not copies of the original works. They are a set of probabilities (weights) that are trained based on thousands or even millions of data points. And at least as of now, it is not possible to look at ML output and determine which inputs, nodes and weights created it. ML, for now, is mostly a “black box” whose inputs and outputs are impossible to connect. In fact, the lack of reproducibility of ML has already been tagged as a social issue: if you build a model that discriminates in its output, how do you audit it? Eventually, the ML industry may solve this problem, but for now, it means there is usually a disconnect between the inputs and outputs, and that probably means that copying could never be inferred in a way that could reliably allocate compensation to the authors of inputs. That, in turn, should mean there is no copyright infringement, but the lawsuits posit otherwise.
Moreover, there is a notice problem. Each of the lawsuits alleges a claim under the Digital Millennium Copyright Act (DMCA), 17 USC § 1202, which prohibits removal of copyright management information (“CMI”) such as copyright notices. But even assuming that some license notice, or copyright notice, would have to be communicated whenever an AI output was generated, how exactly would that happen? Would each resulting image require thousands or millions of notices? Even now, conventional users of open source code struggle greatly with management and delivery of license notices–anyone who has worked on open source compliance knows how difficult that can be. But these lawsuits make that problem look like child’s play.
If AI Dies, Who Wins?
Both of the Butterick suits are being brought as class actions–a type of lawsuit popularized in the US and still relatively unusual elsewhere. You may have gotten notices from class action lawyers asking you to opt in to a settlement class to which you belong. If you’re like me, you toss them out, because your reward for joining the class will probably be a coupon or a princely settlement of $20.
And so, who benefits from class action suits? Well, class action lawyers. When you hear that a class action suit has resulted in $6 million in damages, the lawyers probably get about $2 million (one-third). Because a class can consist of thousands of members (or in the case of the Butterick suits, probably millions), the damages allocated to the individual class members can be tiny indeed. Sometimes, the lawyers actually get bigger payouts than all the plaintiffs combined. The US class action model has been strongly criticized for being a vehicle for enrichment of plaintiff’s lawyers that provides relatively little real compensation to the plaintiffs they represent. Class action proponents use populist rhetoric and anecdotes to justify their suits, but empirical studies are relatively few, and sometimes, damning. (See for example: https://instituteforlegalreform.com/research/do-class-actions-benefit-class-members/ and https://www.tortreform.com/news/study-class-action-lawyers-often-take-more-money-from-settlements-than-class-members/)
If the AI industry is to survive, we need a clear legal rule that neural networks, and the outputs they produce, are not presumed to be copies of the data used to train them. Otherwise, the entire industry will be plagued with lawsuits that will stifle innovation and only enrich plaintiff’s lawyers. Matthew Butterick has stated that these lawsuits are an attempt to set a precedent in favor of artists, because the law is unclear. Lack of clarity causes people to act conservatively to avoid liability, and that stifles innovation. Given that the courts are unlikely to come up with a common-law rule in this decade, clarity probably needs to come in the form of a legislative amendment to the copyright law. Unless it comes soon, the generative AI industry may be in trouble.
It’s unclear whether Butterick’s suits are mostly a publicity stunt and a ploy for the plaintiff’s lawyers to make a windfall, or a selfless attempt to provide equity for authors, or somewhere in between. But one thing is sure: they will spark a cottage industry for plaintiff’s lawyers, cause crippling expenses for AI developers, and thwart innovation in the generative AI field. As the tech industry celebrates the frothy emergence of machine learning in a time of economic doom and gloom, let’s hope this nascent field doesn’t sink because of the copyright iceberg looming ahead.
Note: Since I started preparing this article for publication yesterday, an additional case was threatened in London by Getty Images against Stability AI. Because Getty is the single owner of so many images, and the case is outside the US, it is not a class action suit, and may be more likely to result in a settlement.
Update February 6, 2023: The other shoe drops: Getty filed a complaint in Delaware against Stability AI.
Also, this blog post is a personal opinion, and nothing I have written here should be attributed to any of the parties involved.
A bill was recently introduced in the US Senate, entitled the Securing Open Source Software Act of 2022.
I don’t usually write much about pending legislation, because it often never becomes law, or changes substantially before it does. This bill is unlikely to pass this year because of its timing. But it has a few interesting characteristics.
- It is a bipartisan bill, introduced by Gary Peters (D-Mich.) and Rob Portman (R-Ohio), both members of the Senate Homeland Security Committee.
- It defines both “open source software” and “open source software community.”
- It focuses on requirements for software bills of materials (SBOMs) and security concerns, building on last year’s Executive Order on Improving the Nation’s Cybersecurity (EO 14028), which addressed software security.
- It establishes “the duties of the Director of the Cybersecurity and Infrastructure Security Agency regarding open source software security” and requires the Director to regularly assess open source software used by the federal government. So it establishes a process, more than substance.
If I had to guess, I would say the bill seems likely to pass in some form, next year. If it does, it’s unclear exactly how it will interact with the recent EO. Also, improved security assessment is good for all software, not just open source — open source security breaches get a lot of press, but all software has potential security issues, and the government should be concerned about its use of proprietary software as well. Finally, to the extent new law establishes requirements for government, or even other customers of software, the private market is mostly ahead of these requirements already. Most software vendors know that customers are already very demanding regarding security requirements. The effect of new law could be to normalize those market demands in private sectors and government — but we will have to wait and see.
About a year ago, I wrote about a copyright case involving fireworks firing codes. This case did not get a lot of attention at the time, and it was yet another example of a plaintiff using copyright law as unexploded ordnance (if you will forgive the pun) to harass its competitors, rather than to protect works of authorship.
Fortunately, the Third Circuit recently vacated a prior injunction in the case, for lack of likelihood of success on the merits, and remanded to the district court with an order to dismiss the claim with prejudice.
The court analyzed the copyright protection of both Pyrotechnics’ digital message format, and the digital messages created with it. The opinion linked above provides interesting detail on how the messages worked.
The court said, “Pyrotechnics’s digital message format is an uncopyrightable idea and the individual digital messages described in the [copyright registration] are insufficiently original to qualify for copyright protection.” Regarding the message format, it concluded:
Pyrotechnics admits that there is no way for the control panel to communicate with the field module without using the digital message format. Because there are no other “means of achieving the [protocol’s] desired purpose” of communicating with the devices, the digital message format must be part of the uncopyrightable idea and not a protectable expression. (Citing Whelan, 797 F.2d at 1236.)
As to the messages using the format, the court said, “The digital message format provides rules for constructing messages with particular meanings, and individual messages are generated by applying those rules mechanically.” Because there was insufficient human creativity in creating the messages, they were not protected by copyright. It further noted that even assuming the messages were creatively produced, there was no creativity in their structure and ordering, because “using leading header bytes for synchronization and a trailing byte as a cyclic redundancy check are standard communication practices, not creative sequencing.” The court relied on its prior decision in Southco, Inc. v. Kanebridge, 390 F.3d 276, 282 (3d Cir. 2004) (en banc), regarding a numbering system for fasteners, from which it drew many parallels.
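As a concrete illustration of why the court saw no creative sequencing: framing a payload with fixed synchronization bytes and a trailing cyclic redundancy check is a purely mechanical exercise. The sketch below is a generic Python example; the sync bytes and CRC-8 polynomial are invented for illustration and are not Pyrotechnics’ actual protocol.

```python
SYNC = b"\x7e\x7e"  # hypothetical leading header bytes used for synchronization

def crc8(data: bytes, poly: int = 0x07) -> int:
    """Compute a simple CRC-8 checksum over the payload (polynomial 0x07)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift left; XOR in the polynomial whenever the high bit falls off.
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def encode_message(payload: bytes) -> bytes:
    """Frame a payload mechanically: sync header + payload + trailing CRC byte."""
    return SYNC + payload + bytes([crc8(payload)])
```

Given the same payload, any implementer of such a format produces byte-identical framing; applying the rules leaves no room for expressive choice, which is the point the court makes.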
Fighting a David vs. Goliath Copyright Battle
I got a chance to correspond with the owner of FireTEK. Here is what he had to say about the case:
How did you become interested in fireworks systems? How did you learn to engineer them?
The first time I was backstage at a fireworks display, it was with my own system. I started this project in 2009, because someone asked me if I could build this kind of system to control fireworks. I said no a few times, but he insisted, so I decided to try. But then he decided he did not need the system anymore. I had also hired some people to do the work, but they quit. So I started researching and learning, did almost everything myself, and then became passionate about developing and innovating.
Even though the project started with a lot of problems, I can now see that almost all of those problems helped me make a better product. For example, a main component went out of production and I was forced to find a replacement, but the replacement I found was even better. Because I could not afford to hire someone to do the project, I was forced to learn it myself. Even this copyright trial was a learning experience, and a good thing in the end.
Copyright cases can be complicated. What encouraged you to fight the claim?
I run a very small company in Romania. Maybe the plaintiff thought I would not be able to defend myself, but that was not true. This was a difficult and expensive case for me to defend. I was also concerned that, because I was outside the US, I would not get justice from a US court. I knew I had the law on my side, but I didn’t know whether that meant I would win. I kept going, though, because this is my business and I need to protect it.
What do you think tipped the case in your favor?
To be honest, I thought it would be an easy win at first, from what I had read about copyright and compatibility. I don’t think the district judge understood the differences between my product and theirs. It seemed to me that the district court opinion mostly copied the plaintiff’s briefs. That opinion never mentioned my main argument under 17 USC 102(b) and the US Supreme Court decision in Baker v. Selden. Furthermore, the district court found the work equivalent to object code and found fixation, even though one of the authors clearly stated it is “not source code that resides in a computer or in a microprocessor somewhere.” The deposit material for the plaintiff’s copyright was just a simple text briefly describing the protocol, created after the infringement from the memory of someone who was not even listed as an author. The judge in the appeal understood this. That opinion shows the judge studied the case carefully. But I don’t think the district court understood it, though I find that hard to believe.
After reading the district court opinion and some other rulings made against me, I felt like the main character of “The Trial” by Kafka.
What advice do you have for small businesses fighting legal claims?
It will not be an easy fight, but if you know you are right, fight. Very important in my case was my involvement in the strategy, briefs, and arguments, and of course learning from mistakes. You know your case better than your attorney, so you should read, understand, and correct any mistakes your attorney may make in the briefs.
What is next for you in your business?
Even though I won this case, my goal is not to win clients with cheaper compatible products. I try to innovate and convince them to buy my product because of all it offers. Compatibility is only an argument to convince a potential client to try my product, because I know they cannot switch so easily to another product. It is like an entire company running CRM software: you might decide to switch to another product that has some advantages, but find that the cost of switching is much higher if there is no compatibility, for example to migrate databases.
Congratulations to FireTEK for winning a battle against copyright maximalism, just in time for Independence Day!
Here are my latest musings about why open source is so much like ESG — both in legal risk assessment and investment consideration.