Free and Open Source Large Language Models (LLMs) should be the basis for the future of AI

The AI hype is undeniably happening right now. Everywhere you look online, you see AI this and AI that. A couple of blogs ago, we wrote about AI, and the introduction is still relevant:

It seems only logical that we also give some attention to Generative AI in our blog, right? AI being this nice bling bling 'new' technology invented by 'OpenAI'. Yeah, right. When I was developing my Colony Prediction application in 2005–2006, I couldn't have thought that almost 20 years later I would be writing on a blog that AI was new. You have probably guessed it by now: OpenAI did not invent AI, nor is it a new technology. And yes, my Colony Predictor was an AI application, and yes, it was successful. It was so successful that we got a scientific publication out of it in 2008.

In this blog, I want to emphasize that #foss should be the way to go for creating LLMs. We will also explain how you can create LLMs yourself, without having to rely on fancy proprietary companies that try to convince you that you cannot miss the AI boat and that creating models is very difficult. Because, of course, they are after your money and DATA.

What are FOSS LLMs?

Without going into the academic discussion of what a real-world FOSS Large Language Model is, we give our vision on this. We should take into consideration that there are, in reality, three angles to this problem. The first is, of course, the data that is used to create the LLM. A Large Language Model can, in our opinion, only be considered FOSS if the data used for training the model is also free and open source. In the case of most proprietary models (even those that claim to be open source; yes, Meta's Llama, we are talking about you), this is not the case. The second element is the creation of the LLM. Is the software used for creating the model Free and Open Source Software? The application we use is GPT4All, which is a fully FOSS application. In the case of Llama from Meta, you don't get the source code.

The third element is the openness of the chat results. What do the licenses say about the result of using the GPT chat engine? Are you the owner/author of the result? Is the result freely and openly accessible? With most FOSS GPT engines, the results are free and open as well. GPT engines like Llama and ChatGPT claim this too, but how can they claim to be FOSS when the input data and algorithm aren't?

Why should LLMs be FOSS?

As you would expect from OS-SCi, we are great advocates of everything being free and open source, so this should also apply to AI Large Language Models. For us, being in control of the software you use, and which influences your life, is always important. But when this software also uses data from third parties, it is even more important, especially when you take into account that it is often unknown which data is used. People should have the right to check whether LLMs have incorporated their personal data. Mainly for this reason, all LLMs should be fully free and open source.

How can you create your own FOSS LLMs?

Depending on the size of the data pool and the computing power available to you as a user, building an LLM is fairly easy. Time-consuming, but easy. Our manual for building an LLM with GPT4All fits on one A4 sheet of paper.

The easy way: adding files to an existing LLM

When using a FOSS application like GPT4All, you can use a free and open source model like Mistral and add your own documents to LocalDocs. After you have done that, you can chat with your own documents. To give an example: last week, someone came to us with a biological research question. In 25 minutes, we had set up an LLM chatbot on top of 10 scientific papers and could ask it relevant questions.
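
To make the idea of chatting with your own documents a bit more concrete, here is a minimal sketch in Python of what such a setup roughly does: it looks up the pieces of your documents that match the question and hands them to the model as context. This is not GPT4All's own code. The function names, the file names and the two example texts are made up for illustration, and a simple word-overlap ranking stands in for the smarter search a real application would use.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_into_chunks(text, max_words=50):
    """Cut a document into pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def score(question, chunk):
    """Count how often the words of the question occur in the chunk."""
    question_words = set(tokenize(question))
    chunk_counts = Counter(tokenize(chunk))
    return sum(chunk_counts[word] for word in question_words)


def retrieve(question, documents, top_k=2):
    """Return the top_k best matching (score, file name, chunk) tuples."""
    scored = []
    for name, text in documents.items():
        for chunk in split_into_chunks(text):
            points = score(question, chunk)
            if points > 0:  # skip chunks that share no words with the question
                scored.append((points, name, chunk))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


def build_prompt(question, hits):
    """Put the retrieved chunks in front of the question."""
    context = "\n".join(f"[{name}] {chunk}" for _, name, chunk in hits)
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer:")


# Hypothetical placeholder documents, not real papers.
documents = {
    "paper_a.txt": "Honey bee colonies decline when varroa mites spread through the hive.",
    "paper_b.txt": "Open source licenses let everyone study, modify and share software.",
}

question = "What causes bee colonies to decline?"
hits = retrieve(question, documents)
prompt = build_prompt(question, hits)
print(prompt)
# The prompt can then be handed to a locally running model of your choice;
# that step is left out here so the sketch runs without downloading a model.
```

The point of the sketch is that the model itself is not changed: your documents stay on your own machine and are only looked up when you ask a question.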

Conclusion

In this blog you have learned that there are multiple ways to look at Open Source in relation to the multiple aspects of AI and Large Language Models. We have discussed the importance of FOSS in relation to AI and emphasized the origin of the data, and the right that people have to know and check whether their data is part of an LLM. We have also explained that it's quite easy to set up and use LLMs yourself. You don't need any bogus specialist for that.

