Introduction to Fine-Tuning in Large Language Models
[Figure: Fine-tuning in LLMs]
In this article, we're going to look at the techniques used to adapt a generic pre-trained model to a specific task or application. Pre-trained language models can do impressive things straight off the shelf, including text generation, summarization, and even writing code. But LLMs are not one-size-fits-all, and you may want to specialize a model further after it has been trained, for example to make it really good at answering questions about a particular topic such as medicine, travel, or code. This is where the idea of fine-tuning comes in.
Methods Used in Fine-Tuning:
Fine-tuning a large language model refers to the process of taking a model that has already been pre-trained on a large dataset and further training it on a smaller, task-specific dataset.
This process allows the model to adapt and specialize its capabilities. Pre-trained models often do not perform well on out-of-distribution examples, so we fine-tune the model with new data to update its parameters, and that data comes from the specific task we care about, such as medicine, travel, or whatever domain we want our large language model to handle. We also have to make sure we are using the right method of fine-tuning for our task: not all forms of fine-tuning are equal, and different approaches are useful for different applications.
In some cases, you may want to repurpose a model for a different application. For example, say you have a pre-trained large language model that is really good at generating text, and you want to use it for a different type of application such as sentiment analysis or topic classification. To do this, you repurpose the model by making a small change to its architecture.
This is where we consider whether the model even needs to be repurposed. So what does repurposing actually mean? In practice, it means connecting the model's embedding layer to a classifier.
The transformer part of the model produces embeddings for us. As a reminder, embeddings are numerical vectors that capture different aspects of the input prompt. Some language models directly expose these embeddings, while others, like the GPT family, use the embeddings to generate tokens or text. To repurpose the model, we connect its embedding layer to a classifier, which could be a set of fully connected layers that map the embeddings to class probabilities, as in the sketch below.
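Here is a minimal sketch of that idea in Python, assuming PyTorch and the Hugging Face transformers library are available; the base model name ("bert-base-uncased"), the number of classes, and the size of the head are placeholders chosen for illustration rather than values from the article.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class EmbeddingClassifier(nn.Module):
    def __init__(self, base_model_name="bert-base-uncased", num_classes=3):
        super().__init__()
        # Pre-trained transformer that produces embeddings for the input text.
        self.backbone = AutoModel.from_pretrained(base_model_name)
        hidden_size = self.backbone.config.hidden_size
        # Small fully connected "head" that maps embeddings to class scores.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        # Use the embedding of the first token as a summary of the whole input.
        pooled = outputs.last_hidden_state[:, 0]
        # Raw scores; applying softmax turns them into class probabilities.
        return self.classifier(pooled)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = EmbeddingClassifier()
batch = tokenizer(["I loved this movie"], return_tensors="pt", padding=True)
logits = model(batch["input_ids"], batch["attention_mask"])
```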
Let's discuss the types of fine-tuning in LLMs:
Next, we train a classifier on those embeddings; that is the classifier's whole job. The LLM's attention layers do not need to be updated at all, but to train the classifier you need a supervised learning dataset composed of examples of text and the corresponding class. We then use a supervised learning approach, making sure there is an actual label on each piece of training data, as in the sketch below.
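A minimal sketch of that supervised training step, reusing the EmbeddingClassifier and tokenizer from the previous sketch; the example texts, labels, learning rate, and epoch count are made up for illustration.

```python
import torch
import torch.nn as nn

texts = ["great product", "terrible service", "works as expected"]
labels = torch.tensor([1, 0, 1])  # each training example carries an explicit label

# Freeze the backbone so only the classifier head's parameters are updated.
for p in model.backbone.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(model.classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(3):
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    logits = model(batch["input_ids"], batch["attention_mask"])
    loss = loss_fn(logits, labels)   # compare predictions against the labels
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```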
Now that we understand this, we can move on to how exactly we repurpose the model: we may need to update the parameter weights, which means going through a training process again. After all, our pre-trained model already has parameter weights that have been set.
But with this fine-tuning approach, because we are repurposing the model using the classifier described above, we may need to update those parameter weights. For this, you unfreeze parts of the model such as the attention layers and perform a full fine-tuning of the entire model. Unfreezing the attention layers is what makes full fine-tuning possible, but the operation can be computationally expensive and complicated, and for a GPT-style model with billions of parameters, repurposing in this way can cost a great deal.
In some cases, you can keep parts of the model frozen to reduce the cost of fine-tuning, but you have to understand the architecture and identify which parts will change, as in the sketch below. There are also further distinctions to draw, such as supervised versus unsupervised fine-tuning. In some cases, you just want to update the knowledge of the LLM, so you may want to fine-tune the model on, say, medical literature or a new language.
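A sketch of that partial-freezing idea, again reusing the model from the earlier snippets; note that the `encoder.layer` attribute path assumes a BERT-style backbone, so it is an assumption rather than something that works for every architecture.

```python
# Freeze everything first.
for param in model.backbone.parameters():
    param.requires_grad = False

# Unfreeze only the top two encoder layers so their attention weights can adapt,
# while the rest of the backbone stays frozen to keep training cheap.
for layer in model.backbone.encoder.layer[-2:]:
    for param in layer.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Training {trainable:,} of {total:,} parameters")
```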
For those situations, you can use an unstructured dataset and an unsupervised approach, which updates what the model knows. In other cases, updating the knowledge of the model is not enough: you want to change the model's behavior, and for that you need a supervised fine-tuning (SFT) dataset.
This is also known as instruction fine-tuning. Instead of raw text, the dataset is a collection of prompts and their corresponding responses, as in the sketch below. SFT datasets can be manually curated by users or even generated by other large language models. Instruction fine-tuning is important for LLMs such as ChatGPT, which have been designed to follow user instructions and stay on specific tasks.
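A small sketch of what an SFT / instruction fine-tuning dataset might look like; the file name and the example prompt-response pairs are invented for illustration.

```python
import json

# Each example pairs a prompt with the response we want the model to imitate.
sft_examples = [
    {"prompt": "Summarize: The patient presented with a mild fever and cough.",
     "response": "A patient with mild fever and cough."},
    {"prompt": "Translate to French: Where is the train station?",
     "response": "Où est la gare ?"},
]

with open("sft_dataset.jsonl", "w") as f:
    for ex in sft_examples:
        f.write(json.dumps(ex) + "\n")

# During training, each prompt and response is concatenated into one sequence,
# and the model learns to generate the response given the prompt.
```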
By contrast, the unsupervised approach is the right choice when we just want to update the LLM's knowledge, that is, give it more data without changing its behavior. For example, GPT was trained up to a certain point in time; 2021, I believe, was the cutoff as of the writing of this article. If I want to add newer data, I would use the unsupervised approach, because I am only updating the model's knowledge, not trying to change its behavior in any way, as in the sketch below.
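A minimal sketch of that unsupervised "knowledge update" style of fine-tuning, framed as continued pre-training with the usual next-token objective; the model name ("gpt2") and the domain text are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

# Raw, unlabeled domain text; no prompts, responses, or class labels needed.
new_domain_text = "In 2023, the clinic adopted a new triage protocol for chest pain."
inputs = tok(new_domain_text, return_tensors="pt")

# For causal language modeling, the labels are simply the input tokens themselves.
outputs = lm(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()  # one gradient step of the ordinary next-token objective
```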
There is also reinforcement learning from human feedback (RLHF), where we, the humans, provide feedback to the large language model and guide it. GPT already does something similar: we can provide feedback on the output we got from a prompt, and OpenAI fine-tunes its models based on that input. This human-guided approach of rating generated outputs is part of how a model like GPT-4 is pushed a little further.
You have to understand that reinforcement learning from human feedback is really powerful, but it is a complicated and expensive process: it requires recruiting human reviewers and setting up auxiliary models to fine-tune the large language model, and for the moment only companies and AI labs with large technical and financial resources can afford it. The underlying motivation is that when you train an LLM on billions of tokens, it generates sequences of tokens such as sentences, and the text is mostly coherent and makes sense, but it may not be what the user or the application actually requires.
With human feedback, we bring humans back into the loop to steer the LLM in the right direction. Human reviewers rate the model's outputs on prompts, and these ratings act as a signal to fine-tune the model to generate highly rated outputs, as in the sketch below. Simple as that: essentially, quality assurance done by humans.
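A toy sketch of the "ratings as a training signal" idea: a small reward model is trained so that a human-preferred response scores higher than a rejected one. This is an illustration of the concept only, not OpenAI's actual pipeline; the model name and the example responses are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
reward_head = nn.Linear(encoder.config.hidden_size, 1)  # scalar "how good" score

def score(text):
    batch = tok(text, return_tensors="pt", truncation=True)
    emb = encoder(**batch).last_hidden_state[:, 0]
    return reward_head(emb)

# One human comparison: the reviewer preferred response A over response B.
chosen = "Paris is the capital of France."
rejected = "France's capital is probably Lyon."

# Pairwise loss pushes the chosen response's score above the rejected one's.
loss = -torch.nn.functional.logsigmoid(score(chosen) - score(rejected)).mean()
loss.backward()
```

The trained reward model is then used as the signal that steers the LLM toward outputs humans would rate highly.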
Parameter-Efficient Fine-Tuning (PEFT):
Parameter-efficient fine-tuning, or PEFT, is an interesting area of research aimed at reducing the cost of updating a model's parameters. PEFT is a set of techniques that try to reduce the number of parameters that need to be updated. There are various PEFT techniques; one of them is low-rank adaptation (LoRA), which has become especially popular in many open-source language models. The idea is that the weight update can be represented by low-rank matrices with very little loss of accuracy. Low-rank adaptation is just a mathematical operation on the model's weight matrices, and the details of how it works can get a little complicated to explain.
If you're interested in learning more about low-rank adaptation and how it operates on these matrices, have a look online; there are plenty of resources that discuss it. The essence of the process is that fine-tuning with LoRA updates small additional matrices instead of the parameters of the LLM itself, and those learned weights are then merged back into the LLM for direct inference, as in the sketch below.
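A from-scratch sketch of the LoRA idea: the original weight matrix stays frozen, the update is learned as the product of two small low-rank matrices, and that update can be merged back into the weights for inference. The dimensions and rank here are placeholders; in practice you would typically use a library such as Hugging Face's peft rather than writing this by hand.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.weight.requires_grad = False                 # frozen pre-trained weights
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)  # trainable, rank r
        self.B = nn.Parameter(torch.zeros(out_features, r))        # trainable, rank r
        self.scale = alpha / r

    def forward(self, x):
        # Original projection plus the low-rank correction B(Ax).
        return x @ self.weight.T + self.scale * (x @ self.A.T @ self.B.T)

    def merge(self):
        # Fold the learned low-rank update into the frozen weights for inference.
        self.weight.data += self.scale * (self.B @ self.A)

layer = LoRALinear(768, 768)
out = layer(torch.randn(2, 768))
layer.merge()
```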
When Not to Use Fine-Tuning:
In some cases, fine-tuning isn't really useful or even possible. Some models are only available through APIs (application programming interfaces) that offer limited or no fine-tuning services. There are also dynamic or context-sensitive applications where the data changes frequently, so fine-tuning the model often enough might not be practical: data in news-related applications, for example, changes every single day, and having to redo the model that often is simply too expensive. You might also not have enough data to fine-tune the model, that is, an insufficient amount of data for your downstream task or application domain. And remember, with constantly changing data like that news data, we could fine-tune a new model each time, but if the data changes too quickly, retraining every day or every week becomes very expensive, and you will not get the most out of the resources you spend fine-tuning your large language model. So that is a little bit about fine-tuning a large language model, when to use it, and when not to use it.