In Part-1 of this research paper summary, we will learn about LoRA in detail and why there was a need for such a technique. In Part-2 of the blog, we will look at the experiments and the ablation study, with LoRA applied to different LLMs (RoBERTa, DeBERTa, GPT-3 and others).
Introduction to LoRA:
- The paper introduces "Low-Rank Adaptation" (LoRA) as a method to adapt large-scale pre-trained language models to specific tasks or domains.
LoRA addresses the cost of adapting such huge models by freezing the pre-trained model weights and injecting trainable rank decomposition matrices into each layer of the Transformer architecture. This approach significantly reduces the number of trainable parameters needed for adaptation.
Story Time:
With so many different Large Language Models (LLMs) around today, most with billions of parameters and some even reaching into the trillions, you might be wondering how anyone actually adapts them. Reflecting on my past experience, managing BERT was challenging, especially back in 2018 when it was first introduced. My typical approach consisted of:
a) Starting with the pretrained model, conducting a thorough Root Cause Analysis (RCA), and recording the metrics. This was my first experiment.
b) If the first technique was ineffective, I'd resort to subset fine-tuning. A simple strategy was to freeze most of the layers and tune only the last few layers (a small fraction of the parameters) on a GPU setup. Afterward, I'd conduct another detailed RCA and record the metrics. This was my second experiment.
(Subset Fine-tuning: A more general approach to fine-tuning involves adjusting only a subset of the pre-trained model's parameters. For example, instead of adjusting all the parameters of a model that recognizes animals, you might adjust only the parameters related to recognizing dogs.)
c) If the previous method didn't work well, I'd resort to full fine-tuning, which is among the most intricate tasks in working with LLMs. In my nine years of experience, only twice have I ventured to fully fine-tune BERT or another LLM on my own (days of training on multiple GPUs involved, don't even ask about the cost!). However, these instances were not for production deployment but were research-oriented, with an eye toward patenting a new way of training BERT.
(Traditional Fine-tuning: In traditional fine-tuning, you adjust all the parameters of a pre-trained model to adapt it to a new task. For instance, if you have a pre-trained model that recognizes animals, you might fine-tune it to specifically recognize different breeds of dogs)
While techniques a), b), and c) are effective for getting models to work, the real challenge arises when deploying these models for production and real-time inference. As we move from a) to c), both the compute and inference costs escalate rapidly. To date, method c) is the one least adopted by organizations.
So, how do we manage these LLMs with billions or even trillions of parameters? Various methods have been explored, and today, we'll delve into them (Beyond just summarizing this paper). We'll also discuss LoRA, which stands out as one of the most useful techniques for deploying these LLMs in production.
What is the LoRA approach?
Low-Rank Adaptation: In full fine-tuning, the updates to the weight matrices can be full-rank. LoRA instead constrains these updates to be low-rank. This means that the changes made to the model during adaptation are restricted to capture only the most essential information.
Simple Explanation:
Imagine you have a massive bookshelf (representing the pre-trained model) filled with books (parameters). Now, you want to adapt this bookshelf for a specific topic, say "Space Exploration." Instead of rearranging all the books, you decide to add a small, specialized shelf (low-rank matrix) dedicated to "Space Exploration." This small shelf is much easier to organize and modify than the entire bookshelf.
In technical terms:
- Original Model: Think of this as a large matrix (the big bookshelf).
- Low-Rank Matrices: These are smaller matrices (the specialized shelves) that are introduced and are trainable. They capture the essential information needed for the new task.
- Combining: The original model's parameters and the low-rank matrices are combined to adapt the model to the new task.
Benefits:
- Efficiency: By only adjusting the low-rank matrices, LoRA drastically reduces the number of parameters that need to be trained (see the quick calculation after this list). This is like only organizing a small shelf instead of an entire library.
- Memory: Since you're not adjusting the entire model, the GPU memory requirement is much lower.
- Performance: Despite its simplicity and efficiency, LoRA can achieve performance on par with or even better than full fine-tuning.
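To make the Efficiency point concrete, here is a quick back-of-the-envelope calculation in Python. The matrix size and rank are illustrative assumptions, not numbers taken from the paper:

```python
# Trainable parameters for adapting a single weight matrix (illustrative sizes).
d, k = 768, 768          # shape of the pre-trained weight matrix W
r = 8                    # LoRA rank, a typical small value

full_update_params = d * k            # updating W directly: 589,824 parameters
lora_params = d * r + r * k           # training B (d x r) and A (r x k): 12,288 parameters

print(full_update_params // lora_params)  # ~48x fewer trainable parameters for this matrix
```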
Now comes the most important part of the paper: the mathematical design of LoRA. Many of you might wonder how to understand these equations in a simple way. No need to worry, let me explain the math step by step, with a good story around it!
Mathematical Explanation:
Low-Rank-Parametrized Update Matrices:
Matrix Multiplication in Neural Networks: Neural networks contain many dense layers that perform matrix multiplication. The weight matrices in these layers are typically full-rank, meaning they don't have any linearly dependent rows or columns.
Low Intrinsic Dimension: Aghajanyan et al. (2020) showed that pre-trained language models have a low "intrinsic dimension". This means that even if you project these models into a smaller subspace, they can still learn efficiently.
Low Intrinsic Rank of Updates: The authors hypothesize that the updates to the weights during adaptation also have a low "intrinsic rank".
Mathematical Representation:
For a pre-trained weight matrix W0 of size d x k, LoRA constrains its update ΔW with a low-rank decomposition: ΔW = BA, where B is a d x r matrix, A is an r x k matrix, and the rank r is much smaller than both d and k. The forward pass h = W0x then becomes h = W0x + ΔWx = W0x + BAx. During training, W0 is frozen and receives no gradient updates, while A and B contain the trainable parameters. A is initialized with a random Gaussian and B with zeros, so ΔW = BA is zero at the start of training, and BAx is scaled by α/r, where α is a constant.
Simple Explanation:
Instead of learning a full d x k update to the weight matrix, LoRA learns two much smaller matrices whose product plays the role of that update. Because r is tiny compared to d and k, the number of trainable parameters drops dramatically, while the frozen W0 still carries all the pre-trained knowledge.
Hence, LoRA introduces a parameter-efficient way to adapt neural networks by focusing on low-rank updates to the weight matrices rather than adjusting the entire matrix. This approach is both compute- and memory-efficient. I hope this clears things up on the mathematical front!
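To make the equation above concrete, here is a minimal PyTorch sketch of a LoRA-style linear layer. This is a simplified illustration, not the paper's official loralib implementation; the rank, scaling, and initialization values are just reasonable defaults:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch: a frozen linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # freeze W0
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        d, k = base.out_features, base.in_features
        # A starts with small random values, B starts at zero,
        # so the update BA is zero when training begins.
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)
        self.B = nn.Parameter(torch.zeros(d, r))
        self.scaling = alpha / r

    def forward(self, x):
        # h = W0 x + (alpha / r) * B A x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Example: wrap a 768x768 projection and run a dummy batch through it.
layer = LoRALinear(nn.Linear(768, 768), r=8, alpha=16)
out = layer(torch.randn(2, 10, 768))    # (batch, seq_len, d_model)
```

Only A and B receive gradients, which is exactly why the trainable parameter count (and the optimizer state) shrinks so much.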
Methods before LoRA
Let's understand what methods existed before LoRA. There were two main strategies: adapter layers and directly optimizing the prompt. We will look at the second in more detail in Part-2. Let's understand the adapter layers method and the challenges it faced.
Adapter Layers:
Many researchers have proposed inserting adapter layers between existing layers in a neural network. The idea behind adapter layers is to introduce additional, smaller-sized layers between the pre-existing layers of a neural network, allowing for task-specific adaptations without modifying the original weights of the network. This makes the adaptation process more parameter-efficient.
In this design, the standard Transformer is extended with an adapter layer added after each sub-layer, before the skip connection is added back. The output of the adapter layer is then forwarded to layer normalization.
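For intuition, here is a rough sketch of what a single bottleneck adapter block looks like in PyTorch. The hidden size, bottleneck dimension, and activation are illustrative choices, not the exact configuration from Houlsby et al. (2019):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Sketch of a bottleneck adapter: project down, apply a nonlinearity, project back up."""

    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)   # shrink to a small dimension
        self.up = nn.Linear(bottleneck, d_model)     # expand back to the model dimension
        self.act = nn.GELU()

    def forward(self, h):
        # The adapter's output is added back to its own input (internal skip connection).
        return h + self.up(self.act(self.down(h)))

# Inserted after a Transformer sub-layer's output; the result then flows to layer normalization.
adapter = Adapter()
adapted = adapter(torch.randn(2, 10, 768))
```

Because the adapter sits in the forward path as an extra layer, every input has to pass through it sequentially, which is at the heart of the latency issue discussed below.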
Variants of Adapter Layers:
The paper compares two specific variants of adapter tuning: AdapterH, the original design by Houlsby et al. (2019), and AdapterL, a more efficient variant proposed by Lin et al. (2020).
Now, the question is: what are the challenges with adapter layers?
- Inference Latency: One of the challenges with adapter layers is that they introduce additional latency during inference, especially when the model's batch size and/or sequence length are not large enough to fully utilize hardware parallelism. This added latency can be significant in online scenarios where the batch size is small. In contrast, LoRA does not introduce this additional latency, making it more suitable for real-time applications. Not clear? Let me walk you through, in practical terms, how inference latency gets added:
Imagine a Transformer model with 10 blocks. If we use the original design by Houlsby et al. (2019), we would add 2 adapter layers to each of these blocks, resulting in 20 additional layers. If we use the design by Lin et al. (2020), we would add only 1 adapter layer per block, but with an extra LayerNorm, so we still have 10 additional layers.
Latency:
Let's say, without any adapters, our model takes 10 milliseconds (ms) to process a single input on a GPU. Now, even if each adapter layer adds just 0.5 ms of processing time (because they are designed to be lightweight with few parameters), with 20 adapter layers, we're looking at an additional 10 ms (20 layers * 0.5 ms/layer) of latency. This doubles the inference time to 20 ms.
Batch Size and Parallelism:
Large neural networks often rely on processing multiple inputs at once (batch processing) to optimize the use of hardware. However, in online settings, where you might be processing user queries one at a time, the batch size is typically just one. This means you can't take advantage of this parallelism, and the added latency from adapter layers becomes more pronounced.
Model Sharding:
Sometimes, models are too large to fit on a single GPU, so they are split (or "sharded") across multiple GPUs. In such scenarios, adding depth to the model (like with adapter layers) can introduce more synchronization operations between GPUs, further increasing latency. For instance, if every adapter layer requires an additional 0.2 ms for synchronization across GPUs, our 20 adapter layers would add another 4 ms, taking our total inference time to 24 ms.
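Putting the two back-of-the-envelope calculations above into a tiny script (the latency figures are the same illustrative assumptions used in the text, not measurements):

```python
# Illustrative latency math for a 10-block model with Houlsby-style adapters.
base_latency_ms = 10.0       # model without adapters
num_adapters = 20            # 2 adapters per block x 10 blocks
per_adapter_ms = 0.5         # assumed cost of one lightweight adapter
per_adapter_sync_ms = 0.2    # assumed extra cross-GPU synchronization when sharded

single_gpu_ms = base_latency_ms + num_adapters * per_adapter_ms     # 20.0 ms
sharded_ms = single_gpu_ms + num_adapters * per_adapter_sync_ms     # 24.0 ms
print(single_gpu_ms, sharded_ms)
```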
Storage Overhead:
The paper also hints at a potential solution to the synchronization problem: storing adapter parameters redundantly across GPUs. However, this would increase the memory usage and isn't always feasible.
Phew, these were the problems that had to be tackled with adapter layers before LoRA. Even though adapter layers are designed to be lightweight and parameter-efficient, they still add latency when compared to LoRA.
Apart from the adapter layer method, some prefix-based methods were also introduced, and these struggled to handle long input sequences. Let me briefly explain those:
Prefix-Based Methods:
Prefix-based methods involve training a "prefix", a set of trainable vectors (virtual tokens) that is prepended to the input before it's fed into the pre-trained model. This prefix is tuned to adapt the frozen model to the specific task at hand.
(Comparison with LoRA: As you increase the number of trainable parameters in prefix-based methods, there's a limitation: they become constrained in their ability to handle long input sequences. This is because the prefix occupies a portion of the model's sequence length. So, if your input sequence is very long, the prefix uses up a significant part of the available context, leaving less room for the actual task input.)
In simpler terms, imagine you have a conveyor belt (representing the model's sequence capacity) that can handle 100 items at a time. If the prefix method reserves space for 50 items, then the actual input only has space for 50 more items. This contrasts with LoRA, which doesn't have this limitation because it doesn't place anything in the input sequence at all.
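In code terms, the trade-off looks roughly like this (the sequence and prefix lengths are illustrative assumptions):

```python
# How a trainable prefix eats into the usable context window (illustrative numbers).
max_seq_len = 512      # model's maximum sequence length
prefix_len = 64        # positions reserved for the trainable prefix tokens

usable_for_task = max_seq_len - prefix_len   # 448 tokens left for the actual input
print(usable_for_task)

# With LoRA, nothing is prepended to the input, so all 512 positions stay available:
# the adaptation lives inside the weight matrices, not in the sequence.
```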
That's a brief overview of prefix-based methods.
Finally, LoRA offers a lot of benefits. In Part-2 of the blog we will look at some interesting experiments on LLMs, along with the ablation study, and go through some of the paper's most striking results. To summarise the above blog:
- A pre-trained model can be shared across many tasks. By freezing the shared model and switching out the matrices A and B (from the rank decomposition), storage requirements and task-switching overhead are significantly reduced.
- LoRA makes training more efficient and lowers the hardware barrier. This is achieved by not having to calculate gradients or maintain optimizer states for most parameters. Instead, only the injected, much smaller low-rank matrices are optimized.
- The design of LoRA allows the trainable matrices to be merged with the frozen weights at deployment time, ensuring no additional inference latency compared to a fully fine-tuned model (see the sketch after this list).
- LoRA can be combined with many prior methods, such as prefix-tuning.
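As a follow-up to the earlier LoRALinear sketch, merging the low-rank update into the frozen weight for deployment could look like this (again a simplified, assumed implementation, reusing the hypothetical LoRALinear class from above):

```python
import torch

def merge_lora(layer):
    """Fold the trained low-rank update into the frozen weight: W = W0 + (alpha/r) * B A."""
    with torch.no_grad():
        layer.base.weight += layer.scaling * (layer.B @ layer.A)
    return layer.base   # an ordinary linear layer with the adaptation baked in

# After merging, inference uses a single dense layer, so there is no extra latency,
# and switching tasks just means subtracting one BA product and adding another.
```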
Happy Reading!