Policy for Generative AI Best Practices
Updated May 17, 2023
Twenty-five years ago, when I started practicing in the area of open source licensing, there was a flurry of doubt about the legality and safety of using open source software. At the time, some reacted by prohibiting use of open source in business settings—a stricture that was roundly ignored by most developers. Today, we see the same phenomenon with generative AI. Everyone is using it, claims have been raised about its legality, and no one knows quite how to assess the risks.
Using generative AI (such as Jasper for images, or ChatGPT or Copilot for code) is certainly not without risk. However, some of these tools have the capacity to increase productivity by 50%-100%, so prohibiting their use also has real and immediate costs. Therefore, best practices in this area should take into account the balance of risk and reward. In this policy, the objective is to manage risk rather than eliminate it.
A machine learning (ML) model is a set of weights (probabilities) that constructs output, such as images, text, or music. It is trained on input data, usually of the same kind. However, ML models do not actually store the input data on which they are trained. For more on the technology, see the description of the technology behind ChatGPT and the Electronic Frontier Foundation's analysis of AI and ML. Understanding the technology is key to understanding the copyright issues, and descriptions of generative AI in the press can be misleading. Because most people find the technology opaque, how it is characterized directly affects their perception of risk.
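To make the point concrete, here is a minimal, purely illustrative sketch (nothing like a production generative model): a trivial model is fit to a few (x, y) pairs, and afterward the entire "model" is just two learned weights. The training examples can be discarded, yet the model still generates output.

```python
# Illustrative only: a "model" is just learned weights. After training,
# output is generated from the weights alone -- the training examples
# themselves are not stored in the model.

def train(data, epochs=2000, lr=0.01):
    """Fit y = w*x + b by gradient descent on (x, y) pairs."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        # Gradients of mean squared error with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b  # the entire "model" is these two numbers

training_data = [(0, 1), (1, 3), (2, 5), (3, 7)]  # samples of y = 2x + 1
w, b = train(training_data)

del training_data  # the inputs are gone; only the weights remain
predict = lambda x: w * x + b
print(round(predict(10)))  # prints 21, generated from the weights alone
```

The sketch is a caricature, but the structural point holds for large generative models: what ships is a set of parameters, not a database of the training inputs.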
There are two main sources of risk:
- Input Risk. The data on which models are trained are offered under a broad variety of licensing terms, or none at all, making it tricky to decide which data to select for training. Additional risks include the interpretation of “no AI” metadata in online content and restrictive website terms of service, which may conflict with posted license terms. Input risk falls mainly on the developers of AI tools rather than their users.
- Output Risk. Output generated by a model can appear similar to the data on which it was trained, prompting public claims that ML models infringe the copyrighted works used to train them. This risk applies to both developers and users of the model: users are in the primary path of liability, while developers are potentially liable for indirect infringement (subject to the Betamax doctrine, under which makers of tools with substantial non-infringing uses are not liable).
Similarity does not imply infringement; inferring infringement from similarity alone reverses the logic. Similarity is an element of copyright infringement but not a sufficient condition for it. Infringement requires copying, and access plus substantial similarity is evidence of copying. However, similarity can occur for reasons other than copying, and ML is a quintessential example of how this happens. Copyright law does not limit, for example:
- the use of functional elements of a copyrighted work (17 U.S.C. § 102(b))
- generic elements of styles or genres (the scènes à faire doctrine)
- short words and phrases
Copyright is also subject to the defense of fair use (17 U.S.C. § 107 and case law). However, disputes over such issues can be long, complex, and expensive to litigate. For example, the Oracle v. Google case, which addressed similar issues, took ten years and over $100 million in combined legal fees. Fair use is a fact-specific inquiry, which makes outcomes hard to predict.
Regarding input risk, many “free culture” licenses, such as open source software licenses and Creative Commons licenses, expressly allow the activities involved in ML training. What may be unclear is which license conditions apply to the use or distribution of the model or its outputs. There are also licensing gray areas, such as CC-NC (Creative Commons Non-Commercial), because it is unclear whether ML training would violate its non-commercial limitation.
Note that the most notorious current claim in the area of generative AI inputs for software, the class action against Copilot, is not a copyright infringement claim. However, other suits alleging copyright infringement have been brought and will probably continue to be brought in the years to come. Other cases as of this writing include:
- Midjourney and Stability AI: Class action regarding training images
- Getty Images v. Stability AI: Use of images to train Stable Diffusion
- Defamation suit against ChatGPT (Note that defamation law varies significantly across jurisdictions.)
- OpenAI and Meta: Class action regarding books
There is unlikely to be a clear answer to these IP risk questions for some years. However, the likelihood of a legal resolution that forecloses the right to use the output of any specific tool is probably small. ML model developers and users alike will have many useful defenses against infringement claims, and some of the defendants in the first claims (Microsoft, OpenAI, etc.) have significant resources to litigate and may be willing to press the issue for years to a conclusion.
Taking into account the enormous value of these tools weighed against unclear risks, companies may reasonably decide to use the tools but should consider best practices to manage their risk.
Accordingly, below are best practices for using generative AI.
- Only use tools from reputable sources trained on large data sets. Responsibly trained models are less likely to be infringing. You should expect a developer to be able to explain how they chose their input data and why they believe that choice was lawful. Training on a larger input data set also reduces the likelihood of infringement, because the output is more likely to be generic. At this time, training is expensive and requires significant computing power, so larger companies are more likely to have the resources to train on large data sets. (This will change as new technologies allow less expensive training.)
- Do not rely exclusively on the tools. Where appropriate, use the tools to assist, but not completely replace, human authoring. Editing the output will often improve its quality and make it less likely to be infringing. For example, you might use ChatGPT to create a first draft but then edit the results freely before further use. This practice also increases the chance that your output will be protectable under copyright, as the US Copyright Office currently will not register works produced by AI without human intervention.
- Use the tools for generic output. The tools are less likely to produce infringing content when used to create generic output that can easily be replaced. Do not instruct tools to create material that is likely to be or appear infringing of copyright or trademark (e.g., “A picture of Pikachu in a compromising situation”).
- Use the tools for ephemeral output. Non-persistent output, such as answers to search queries, is less likely to create significant liability. Output intended for persistent use or heavy reuse is riskier.
- Segregate output where feasible. Discrete files like images and music should be tagged and separated from files not generated with AI tools.
- Turn on all filtering. Many AI products are beginning to offer filtering options. For example, GitHub Copilot offers a setting that suppresses suggestions matching public code and will soon offer a feature to reference matches to the training set. Opt in to any such filtering in AI tools that is designed to reduce liability.
- Do not use tools on sensitive input. Most free online tools do not protect the confidentiality of your input, which can have negative consequences for sensitive material, including descriptions of patentable inventions and other confidential information. Paid accounts with tool providers often have more protective terms, and of course, running your own open source tools avoids disclosing your input and output to others. Commercial tools are being developed to screen information fed to chatbots, and these may help keep tabs on employee use of the tools.
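Several of the practices above, particularly segregating output and tagging AI-generated files, can be supported with simple in-house tooling. Below is a hypothetical sketch of a JSON manifest that records which files were produced by AI tools; the manifest file name, function names, and example entries are all invented for illustration.

```python
# Hypothetical sketch: track AI-generated assets in a JSON manifest so
# they can be located and swapped out later if needed.
import json
from pathlib import Path

MANIFEST = Path("ai_assets_manifest.json")  # hypothetical manifest file
MANIFEST.unlink(missing_ok=True)            # start fresh for this demo

def record_ai_asset(path, tool, prompt_summary=""):
    """Append one AI-generated file to the manifest."""
    entries = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else []
    entries.append({
        "path": str(path),
        "tool": tool,                    # which generator produced the file
        "prompt_summary": prompt_summary,
    })
    MANIFEST.write_text(json.dumps(entries, indent=2))

def ai_assets():
    """List all tracked AI-generated files, e.g. for later replacement."""
    return json.loads(MANIFEST.read_text()) if MANIFEST.exists() else []

# Invented example entries
record_ai_asset("img/hero.png", "Stable Diffusion", "abstract background")
record_ai_asset("img/icon.png", "Midjourney", "generic gear icon")
print([e["path"] for e in ai_assets()])  # ['img/hero.png', 'img/icon.png']
```

The point is less the mechanism than the discipline: if a claim later forces replacement of AI-generated assets, a manifest like this tells you exactly which files to regenerate or swap for stock content.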
If input processes are deemed infringing, tool makers will likely react quickly by creating new models trained on less risky data. If output is deemed infringing, users would have an opportunity to react and eliminate the infringement. How users would do so varies depending on whether the output is images, text, or source code.
- For images or music, users would swap out any potentially infringing output with new files (possibly regenerated with a non-infringing tool, or simply stock images or music). Users should therefore be sure to track which files were generated with AI tools. These files usually do not contain metadata that allows users to identify them easily, so users should mark or segregate them accordingly. This is the easiest category to address, because music and image files are discrete and less likely than text to be integrated closely with other files. Check images for any indicia of third-party ownership, such as trademarks, and do not use any that contain such content.
- For source code, users would run a scan with a tool like Black Duck. Remediation will be more challenging than for images or music, because the generated code will likely be interspersed with other material in the development code base. Any content that looks infringing could be addressed by removing it, rewriting it, or, for code available under a permissive open source license, adding a license notice where needed. While this can be laborious, it is already a common process in code audits; development code bases often contain snippets copied from sources like Code Project or Stack Overflow that need to be culled due to license incompatibility.
- For non-software text (such as marketing collateral), remediation would be much more challenging, and perhaps infeasible other than by rewriting from scratch. This output will not live in identifiable, discrete files, and infringing text may be impossible to locate or excise systematically. Users should keep track of anything written with the tools that cannot easily be replaced. The risk in this category is tempered by the fact that much of the generated text may be generic, with no intrinsic market value, which strengthens the fair use argument and discourages claims.
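As a rough illustration of the kind of scan described for source code above, here is a toy matcher that flags runs of lines in a code base matching known third-party snippets after normalization. Real audit tools like Black Duck are far more sophisticated; every name, snippet, and threshold here is invented for illustration.

```python
# Toy stand-in for a code-audit scan: match normalized code lines against
# known third-party snippets so potential copies can be reviewed.

def normalize(line):
    """Strip whitespace and trailing comments so trivial edits don't hide a match."""
    return line.split("#")[0].strip()

def scan(codebase_lines, known_snippets, window=3):
    """Return 1-based line numbers where a known snippet of at least
    `window` lines appears in the code base."""
    norm = [normalize(l) for l in codebase_lines]
    hits = []
    for snippet in known_snippets:
        snip = [normalize(l) for l in snippet]
        if len(snip) < window:
            continue  # ignore matches too short to be meaningful
        for i in range(len(norm) - len(snip) + 1):
            if norm[i:i + len(snip)] == snip:
                hits.append(i + 1)
    return hits

# Invented example: two lines copied verbatim into a code base
codebase = [
    "def area(r):",
    "    return 3.14159 * r * r  # circle",
    "x = area(2)",
]
third_party = [["def area(r):", "    return 3.14159 * r * r"]]
print(scan(codebase, third_party, window=2))  # [1]
```

Flagged locations would then be triaged as described above: removed, rewritten, or retained with a license notice if the source license permits.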
Keep in mind that there are issues other than infringement, such as privacy, rights of publicity, regulatory restrictions on the use of data (e.g., HIPAA), defamation, and security, which are not discussed here. Also, AI models can produce inappropriate, erroneous, or offensive output, and human review may be necessary to manage those risks.
Disclaimer: This information is provided by Heather Meeker for your information only. It does not create an attorney-client relationship. Ms. Meeker and her law firm, Tech Law Partners LLP, will not be obligated to you for the accuracy or quality of this information. Law and best practices in this area are rapidly evolving, so this document may become outdated quickly. Adhering to these practices will not insulate you from liability or claims. Your risks will depend on your specific facts. Please work with your own counsel for legal advice.
This article was originally published on PLI PLUS, the online research database of PLI.