AI Could Be Your Next Team for Clean Room Development

heatherjmeeker

10 months ago

Clean room developments are necessary when a developer wants to “cleanse” the intellectual property burden of third party software. The need arises when third party software is provided under unacceptable license terms, or not licensed at all. This is one of the trickiest tasks in software development, but it has a long history of best practices.

The canonical clean room development seeks to avoid trade secrets of proprietary software. But the rise of open source has resulted in the need to do a different kind of clean room project, meant to avoid the copyright in open source software, usually for GPL-licensed packages. The two situations call for slightly different approaches. A clean room process for proprietary code seeks to avoid trade secrets and copyright burdens, whereas clean room development in open source is entirely about copyright, because there are no trade secrets in open source software. In either case, a team of developers seeks to write new implementing code from scratch, so that code will perform the same tasks, with the same inputs and outputs, as the original or “target” code.

A traditional clean room development process looks something like this:

Separate Development Teams: Create two teams of developers: a specification team that works on specification development for the target code, and an implementation team that writes the new implementing code.
Create a Specification: The specification team, which has access to the target code, extracts the specifications for the software’s requirements and expected behavior. Software, at the end of the day, is a set of inputs and outputs, and its specifications state what outputs you should expect when certain inputs are used.
Reimplementation: The implementation team writes the new software according to the specification developed by the specification team. This must be done in an environment that is “cleansed” of the target code. Ideally, the implementation team has never read the target code.
Verification: The implementation team tests the newly implemented clean code. If there are bugs, the specification team can only confirm the accuracy of the specification. The specification team cannot suggest bug fixes, because that might result in inadvertent copying. Bug fixes are done by the implementation team.
Iterate: Repeat until the development is done.
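
To make the idea of a specification as inputs and expected outputs concrete, here is a minimal Python sketch. It is an illustration only, not a prescribed process: the names SpecCase, SPEC, slugify, and verify are hypothetical, and the example behavior (turning a title into a URL slug) is invented for the example. The point is that the verification step reports only which specified cases fail, so nothing about the target code flows back to the implementation team.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class SpecCase:
    """One line of the specification: a given input and the output to expect."""
    description: str
    given: str
    expected: str


# Hypothetical specification written by the specification team.
# It describes behavior only; it contains no target code.
SPEC: List[SpecCase] = [
    SpecCase("lowercases letters", "Hello", "hello"),
    SpecCase("replaces spaces with hyphens", "clean room", "clean-room"),
    SpecCase("trims surrounding whitespace", "  AI team ", "ai-team"),
]


def slugify(title: str) -> str:
    """Hypothetical reimplementation written from the specification alone."""
    return "-".join(title.lower().split())


def verify(implementation: Callable[[str], str], spec: List[SpecCase]) -> List[str]:
    """Return the descriptions of the spec cases the implementation fails."""
    failures = []
    for case in spec:
        if implementation(case.given) != case.expected:
            failures.append(case.description)
    return failures


if __name__ == "__main__":
    failed = verify(slugify, SPEC)
    print("all cases pass" if not failed else f"failing cases: {failed}")
```

If a case fails, the specification team would confirm whether the case itself is accurate, and the fix would stay with the implementation team, as in the Verification step.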

Of course, there are far more complex processes for clean room development. Some have three teams, and most have a lot more steps. I have seen guidelines so many pages long they have a table of contents. But the above is the essence–not to mention the most my clients have the patience to read.

Not Enough Humans

The problem most companies have when performing a clean room development is that they don’t have the resources to create two separate teams. Even if they do, they usually cannot create an implementation team that has never been exposed to the target software–and doing so is particularly difficult when the target software is open source, because there is no way to prove lack of access to publicly available materials. For an open source clean room process, we usually make do with developing implementing code in an environment that does not have local access to the target code.

But now, with the advent of AI, we have an alternative way to approach clean room development.

I pause here to note that while there are those who think that all generative AI is prima facie copyright infringement, I don’t agree. As long as the model has been trained on enough inputs, it should not parrot any one input. (More on that here.) So let’s set that issue aside, because if you disagree with me, you shouldn’t be using AI coding tools at all and you should just put this article aside.

An AI that writes code (like Claude or Copilot) has probably been exposed to almost all the open source code ever written. But via that training process, it is unlikely to focus on specific target code. So, companies struggling to staff a clean room development might consider replacing one or both of the teams with AI. As always, some human oversight is necessary to check that an AI generative process has been done correctly. But using it would still greatly reduce the headcount necessary to implement the clean room process.

Specification Team. AI is better at some tasks than others, but I have found that AI is quite good at summarizing text. If you ask it to write the specifications for target software, it will probably do a good job. You could use an AI for your specification team, and that would help avoid “contaminating” your implementation team with access to the target software.
Implementation Team. AI is quite good at writing code, though more human oversight would probably be necessary to use AI for this purpose. AI-assisted coding still requires human curation, and also usually requires human debugging. Debugging is a complex logical task, and the current flavor of AI (transformer-based models) is better at text generation than logic. But in a pinch, you might use AI as your implementation team and use the specification team for quality control.
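
One way to support the human oversight described above is an automated screen that flags generated code that looks too much like the target. The following is a rough sketch, assuming the reviewer (not the implementation team) has access to the target code. The function names, the 0.8 threshold, and the comparison method are all assumptions made for illustration. It uses Python's standard difflib module, which measures textual similarity only: a low score is not proof of independent creation, and a high score is only a prompt for human and legal review.

```python
import difflib


def similarity(candidate: str, target: str) -> float:
    """Return a 0.0-1.0 textual similarity ratio between two pieces of code."""
    # Compare line by line, ignoring indentation and blank lines.
    cand_lines = [ln.strip() for ln in candidate.splitlines() if ln.strip()]
    targ_lines = [ln.strip() for ln in target.splitlines() if ln.strip()]
    return difflib.SequenceMatcher(None, cand_lines, targ_lines).ratio()


def needs_review(candidate: str, target: str, threshold: float = 0.8) -> bool:
    """Flag the candidate for human review if it is suspiciously similar.

    The threshold is an arbitrary illustrative value, not a legal standard.
    """
    return similarity(candidate, target) >= threshold
```

A screen like this would run on the reviewer's side of the wall, and only a pass or flag result, never the target code itself, would go back to whoever is doing the implementation.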

Neither of these suggestions should be surprising. AI code generation greatly reduces the human effort necessary to produce code, and clean room projects are human-intensive. For an open source target, I think the use of AI as a specification team is quite interesting. For proprietary code, using AI as the implementation team may be particularly interesting, because AIs are mostly not trained on proprietary code, making the cleansing more reliable.

Always remember: wash your hands before you code!
