Could AI-generated programming code be an infringement of software licenses? I asked myself this question this morning while going through all kinds of interesting news. To generate "new" stuff, AI software needs training data. The training data is fed into machine learning algorithms to train the AI engine. So if you want to create an AI solution that writes computer code, as ChatGPT, GitHub and others have done, you need lots of data to train your model. When training an AI to generate programming code, the input to the model is existing programming code. But which programming code is available?
Proprietary code? Difficult. The code is closed and thus secret, so it is only available if the proprietary vendor gives the source to the AI company, which I guess they won't do. Maybe Microsoft will give some source to OpenAI, as they are a major investor in the company.
Free and Open Source code? Sounds reasonable: it's available, they can download it and add it to their machine learning tooling. So let's assume AI LLMs primarily use FOSS code for their models. That brings us back to my original question: could this lead to an infringement of software licenses?
All, yes really all, Free and Open Source Software comes with a license. There is a huge range of different licenses, such as GPL, Apache, MIT and many more. These licenses consist of rules which you have to obey if you want to use, change or publish your version of the software. If you create code with an AI based on this FOSS software, you also have to obey that license. In most cases, you at least have to credit the original source and keep its copyright notice, which I haven't seen any AI tool I tested do.
Let's use an example. Say I have developed some code for a sorting algorithm and published the source under a GPL-3 license. Another programmer has also developed a sorting algorithm and published it under an MIT license. The MIT license is permissive, placing very few restrictions on how the software can be used. The GPL-3 is something else. It is a copyleft license: if you distribute a changed version of the code, or software that incorporates the code, you have to release it under the same license. Now let's assume an AI LLM uses both pieces of code as training input. A third programmer then uses the AI tool to generate a sorting algorithm, which the AI bases on the input that I and the other programmer have written, the first under GPL-3 and the second under MIT. So what is the license of the generated code? And why doesn't the AI tell you which license applies?
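To make the scenario a bit more concrete, here is a minimal Python sketch of the kind of bookkeeping I am missing. It is purely illustrative and makes no claim about how any real AI tool works or about what the law requires. The names `TrainingSample`, `COPYLEFT`, `licenses_behind` and `has_copyleft_input` are my own hypothetical inventions, and the two code strings are placeholders. The only real-world detail is the use of SPDX-style license identifiers (`GPL-3.0-only`, `MIT`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSample:
    """One piece of source code used as training input (hypothetical model)."""
    author: str
    spdx_license: str  # SPDX-style license identifier
    code: str


# Assumption for this sketch: which identifiers we treat as copyleft.
COPYLEFT = {"GPL-3.0-only", "GPL-3.0-or-later"}


def licenses_behind(samples):
    """Return the sorted set of licenses carried by the training inputs."""
    return sorted({s.spdx_license for s in samples})


def has_copyleft_input(samples):
    """Report whether at least one training input is under a copyleft license."""
    return any(s.spdx_license in COPYLEFT for s in samples)


# The two sorting implementations from the example above (placeholder code).
samples = [
    TrainingSample("me", "GPL-3.0-only", "def sort_a(xs): return sorted(xs)"),
    TrainingSample("another programmer", "MIT", "def sort_b(xs): return sorted(xs)"),
]

print(licenses_behind(samples))     # ['GPL-3.0-only', 'MIT']
print(has_copyleft_input(samples))  # True
```

Even this toy version shows the problem: once both samples are mixed into one model, the output has two candidate licenses behind it, and nothing in the generated code tells you which one applies.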
Many FOSS licenses, the copyleft ones in particular, state that the software needs to stay open. Doesn't this mean we should be able to see how the FOSS software is incorporated into the Large Language Model of the AI?
Doesn't this mean the Large Language Model, which is partly built from the FOSS source code, needs to be Open Source itself?
And how does the LLM cope with the different licenses of its many different sources? I would really like to see if we can get a discussion going on this subject.