seirdy.one/content/posts/experiment-copilot-legality.gmi

I am not a lawyer. This post is satirical commentary on:

* The absurdity of Microsoft and OpenAI’s legal justification for GitHub Copilot.
* The oversimplifications people use to argue against GitHub Copilot (I don’t like it when people agree with me for the wrong reasons).
* The relationship between capital and legal outcomes.
* How civil cases seem like sporting events where people “win” or “lose”, rather than opportunities to improve our understanding of law.

In the process, I intentionally misrepresent how the judicial system works: I portray the system the way people like to imagine it works. Please don’t make any important legal decisions based on anything I say.

The only section you should take seriously is “Context”.

## Introduction

GitHub is enabling copyleft violation ✨at scale✨ with Copilot. GitHub Copilot encourages people to make derivative works of source code without complying with the original code’s license. This facilitates the creation of permissively-licensed or proprietary derivatives of copyleft code.

Unfortunately, challenging Microsoft (GitHub’s parent company) in court is a bad idea: their legal budget probably ensures their victory, and they likely already have a comprehensive defense planned. How can we determine Copilot’s legality on a level playing field? We can create legal precedent that they haven’t had a chance to study yet!

A chat with Matt Campbell about a speech synthesizer gave me a horrible idea. I think I know a way to find out if GitHub Copilot is legal: we could use its legal justification against another software project with a smaller legal budget. Specifically, against a speech synthesizer. The outcome of our actions could set a legal precedent to determine the legality of Copilot.

## Context: the relevant technologies

Let’s cover the technologies and actors at play _before_ I start my evil monologue.

### Exhibit A: GitHub Copilot

GitHub Copilot is a predictive autocompletion service for writing software. It’s powered by OpenAI Codex, a language model based on GPT-3. It was trained using the source code of public repositories hosted on GitHub, regardless of their licensing.
=> https://openai.com/blog/openai-codex/ OpenAI Codex announcement
=> https://en.wikipedia.org/wiki/GPT-3 Wikipedia on GPT-3

In response to a Request for Comments from the US Patent and Trademark Office, OpenAI claimed that “Artificial Intelligence Innovation”, such as code written by GitHub Copilot, should be considered “fair use”:
=> https://www.uspto.gov/sites/default/files/documents/OpenAI_RFC-84-FR-58141.pdf Comment Regarding Request for Comments on Intellectual Property Protection for Artificial Intelligence Innovation

Many of the code snippets it suggests are exact copies of source code from various GitHub repositories. For an example, see this tweet:
=> https://twitter.com/mitsuhiko/status/1410886329924194309 "I don't want to say anything but that's not the right license Mr Copilot." by Armin Ronacher
=> https://web.archive.org/web/20220701010012/https://nitter.pussthecat.org/mitsuhiko/status/1410886329924194309 archive link that doesn’t require JavaScript, captured on 2022-07-01

It contains a screen recording of Copilot suggesting this Quake code:
=> https://github.com/id-Software/Quake-III-Arena/blob/dbe4ddb10315479fc00086f08e25d968b4b43c49/code/game/q_math.c#L552 Quake III source code snippet

When prompted to do so, it obediently fills in a permissive license. That permissive license violates the Quake code’s GPL-2.0 license. Copilot provides no indication that a license violation is taking place.

GitHub performed its own research into the matter.[1] You can read about it on their blog:
=> https://github.blog/2021-06-30-github-copilot-research-recitation/ GitHub Copilot research recitation, by Albert Ziegler

I’m not convinced that it accounts for the fact that suggested code might have mechanical alterations to match surrounding text, while still remaining close enough to trained data to be a license violation.

### Exhibit B: The Eloquence speech synthesizer

I recently had a chat with Matt on IRC about screen readers and different types of speech synthesizers. I mentioned that while I do like some variety, I always find myself returning to the underrated robotic voice of eSpeak NG. He shared some of my fondness, and also shared his preference for a similar speech synthesizer called Eloquence.

=> https://github.com/espeak-ng/espeak-ng/ eSpeak NG Git repository

Downloads of Eloquence are easy to find (it’s even included with the JAWS screen reader), but I struggle to find any “official” pages about the original Eloquence. Nuance acquired Eloquent Technology, the developer of Eloquence. Microsoft later acquired Nuance.

I like the Eloquence speech synthesizer. It sounds similar to the robotic yet predictable voice of my beloved eSpeak NG, but with improved overall quality. Unfortunately, Eloquence is proprietary.

#### Eloquence sample audio

=> gemini://seirdy.one/misc/eloquence.opus Sample audio of Eloquence (audio/ogg, opus codec)

Matt recorded this sample audio clip of Eloquence reading some text. The text is from the introduction of another post of mine:
=> ../../../../2020/11/23/website-best-practices/ Best practices for inclusive textual websites

### Exhibit C: Deep learning speech synthesis

Deep learning speech synthesis is a recent approach to speech synthesizer creation. It involves training a deep neural network on voice samples, and using the trained model to generate speech similar to a real human voice. One synthesizer using deep learning speech synthesis is Mozilla’s TTS.
=> https://en.wikipedia.org/wiki/Deep_learning_speech_synthesis Wikipedia on deep learning speech synthesis
=> https://github.com/mozilla/TTS Mozilla TTS is an example of deep speech synthesis

Zero-shot approaches could allow a pre-trained model to generate multiple different voices. This could allow us to synthetically re-create a person’s voice more easily.
=> https://doi.org/10.48550/arXiv.2112.02418 YourTTS is one such example.

## My horrible plan

My horrible plan revolves around going through two different lawsuits to set some judicial precedents; these precedents could improve the odds of succeeding in a lawsuit against Microsoft for Copilot’s licensing violations.

If this succeeds, we have new legal justification that GitHub Copilot is illegal; if it fails, we have still gained a means to legally re-create proprietary software. It’s a win-win situation.

### Part One: set a precedent

1. Train a modern text-to-speech (TTS) engine using the voice a proprietary one made by a company with a small legal budget. Keep the model’s internals hidden.
2. Then release the final TTS under a permissive license. Remember, we’re still keeping the machine-learning model hidden!
3. Wait for that company to file suit (or file an anticipatory suit against them; it's common for declaratory judgement regarding intellectual property rights).
4. Win or lose the case.

### Part Two: use that precedent against Microsoft’s Nuance

Our goal here is to get the same legal outcome as the low-stakes “trial run” of Part One.

Microsoft owns Nuance. Nuance previously bought Eloquent Technology, the developers of the Eloquence speech synthesizer.

1. Repeat Part One against Nuance speech synthesizers, including Eloquence. Go to court.
2. Have the ruling from Part One cited as legal precedent.
3. Achieve the same outcome as Part One, demonstrating that we have indeed set precedent that works against Microsoft’s legal department.

### Implications of the outcomes

If we _win_ both cases: Microsoft has the legal high ground. Making a derivative of a copyrighted work using a machine-learning algorithm allows us to bypass copyright licenses.

If we _lose_ both cases: Microsoft does not have the legal high ground. We have good judicial precedent against Microsoft to use when filing suit for Copilot’s behavior.

Either way, it’s an absolute win for free software. Taking down Copilot protects copyleft from enabling proprietary derivatives (and by extension, protects software freedom). But if we accidentally win these two low-stakes “test” cases, we still gain something else: we can liberate huge swaths of proprietary software, starting with speech synthesizers.

## Update: on satire

This post isn't "satire through-and-through" like something from The Onion. Rather, my intent was to make some clear points, but extrapolate them to absurdity to highlight other problems. I don't think I was clear enough when doing this. I'm sorry.

Copilot has been found to suggest significant amounts of code that is dangerously similar to existing works. It does this without disclosing obligations that come with those works' licenses. Training a model on copyrighted works may not be wrong in and of itself; however, using that model to generate new works that are not sufficiently distinct from original works is where things get problematic. Copilot's users could apply proprietary licenses to the generated works, defeating the point of copyleft.

When a tool almost exclusively encourages problematic behavior, the makers of that tool should have put thought into its implications. GitHub and OpenAI have not demonstrated a sufficiently careful approach.

I don't think that "going after" a smaller player just to manipulate our legal system is a good thing to do. The fact that this idea seems plausible to some of my readers shows how warped our perception of the judicial system is. Even if it's accurate (I doubt it's accurate, but I'm not certain), it's sad. Judicial systems incentivise too much predatory behavior.

## Corrections

It's come to my attention that Eloquence may or may not still belong to Nuance. Further research is needed.
Eloquent Technology was acquired by SpeechWorks in 2000.

## Footnotes

1. I doubt anybody worth their salt would count on a company to hold itself accountable, but at least they tried.