Introduction
Continuing from my previous blog post about LoRA, this is Part 2 of the research paper summary. Initially, I had intended for this to be a two-part summary. However, after delving into the intricate details presented in the paper, I felt it would be beneficial to describe each aspect in depth for the reader's convenience.
Often, my colleagues and mentees have shared that when they begin reading research papers, they encounter numerous new concepts in literature reviews, experimentation sections, etc. This can be overwhelming, leading them to frequently search online to grasp these concepts fully.
Thus, the motivation behind these summaries is not just to summarise the current paper but also to address the various topics that are briefly mentioned. We will take deep dives into these topics alongside our primary focus on LoRA. So, let's expand this into a three-part blog series and dive into the second part!
Part 2 of this summary discusses how LoRA is applied to the Transformer architecture. It covers the GLUE benchmark used for evaluation and presents experimental results from applying LoRA to RoBERTa, DeBERTa, GPT-2, and more.
How has LoRA been applied to Transformers?
(It's very important to understand which modules of the Transformer architecture LoRA is applied to!)
LoRA (Low-Rank Adaptation) is a technique that can be applied to specific subsets of weight matrices within a neural network to minimize the number of parameters that need training. Within the Transformer architecture, there are four weight matrices in the self-attention module, namely Wq, Wk, Wv, and Wo, plus two weight matrices in the MLP (Multi-Layer Perceptron) module. For the purpose of the study, each of Wq, Wk, and Wv is treated as a single matrix of dimensions dmodel × dmodel, even though in practice the output of these matrices is sliced into multiple attention heads. The researchers chose to apply LoRA only to the attention weights and not to the MLP module, a decision based on simplicity and parameter efficiency. They also explore the effects of adapting different types of attention weight matrices in a later section (Section 7.1), while leaving the adaptation of the MLP layers, LayerNorm layers, and biases to future work.
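To make this concrete, here is a minimal PyTorch sketch of the idea (my own illustration, not the authors' code): a frozen pretrained linear layer, such as Wq or Wv, gets a trainable low-rank update B·A added alongside it. The class name LoRALinear and the default values of r and alpha are just illustrative choices.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update.

    Computes h = W0 x + (alpha / r) * B A x, where A is (r x d_in) and
    B is (d_out x r). Only A and B are trained; W0 stays frozen.
    """
    def __init__(self, base_linear: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base_linear
        self.base.weight.requires_grad_(False)          # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)        # freeze the bias as well
        d_out, d_in = base_linear.weight.shape
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # small random init (A)
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))        # zero init (B), so training starts from W0
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# In the paper's setup, a wrapper like this would be applied only to the
# attention projections (e.g. Wq and Wv), leaving the MLP weights untouched.
```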
LoRA's Benefits:
Memory and Storage Reduction: Imagine you have a bookshelf (representing VRAM) that can hold 100 books (parameters). With traditional training, you might use up all the space. But with LoRA, you're only updating a small subset, so maybe you only need space for 10 books.
In the case of GPT-3 175B, this is like reducing the space needed on the bookshelf from 1.2TB (completely filled) to just 350GB (partially filled).
Checkpoint Size Reduction: Let's say each book on your bookshelf has 1000 pages (representing the size of the model checkpoint). With LoRA, since you're only updating a small part, it's like only needing a booklet of 10 pages instead of a full book.
For GPT-3 175B, this reduction is massive: from 350GB (a 1000-page book) to just 35MB (a 10-page booklet). A rough back-of-the-envelope check of this number follows right after this list of benefits.
Swapping Between Tasks: Imagine you have different sets of bookmarks for different topics you're studying. Instead of swapping out entire books (all parameters) for different topics, with LoRA, you just swap out the bookmarks (LoRA weights). This is much quicker and easier.
Training Speedup: Going back to the book analogy, if you had to read and make notes (calculate gradients) on all 1000 pages, it would take a long time. With LoRA you only make notes on 10 of the pages, although you still have to read the whole book (the forward pass through the frozen weights); the paper reports roughly a 25% speedup during training on GPT-3 175B compared to full fine-tuning.
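Here is the promised back-of-the-envelope check of the checkpoint-size figure, using my own arithmetic for the configuration the paper describes (r = 4, adapting only Wq and Wv in each of GPT-3 175B's 96 layers with dmodel = 12288); the paper itself simply quotes roughly 35MB.

```python
# Rough sanity check of the ~35MB LoRA checkpoint figure for GPT-3 175B.
n_layers = 96           # Transformer layers in GPT-3 175B
d_model = 12288         # hidden size
r = 4                   # LoRA rank used in the paper's example
adapted_per_layer = 2   # only Wq and Wv are adapted

params_per_matrix = 2 * d_model * r                       # A (r x d) + B (d x r)
total = n_layers * adapted_per_layer * params_per_matrix  # ~18.9M trainable params
size_mb = total * 2 / 1e6                                 # FP16 -> 2 bytes per parameter
print(total, f"{size_mb:.0f} MB")                         # ~38 MB, in the same ballpark as the paper's ~35MB
```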
LoRA's Limitations:
Batching Different Tasks: Imagine you're studying two subjects, Math and History. If your bookmarks (A and B matrices) are different for each subject, it's not straightforward to study both subjects in one sitting (a single forward pass). However, if you're okay with a bit of interruption (latency isn't critical), you can switch between the subjects (tasks) as needed.
Hence, you can see that there are some minor challenges when using LoRA for multiple tasks in a single batch.
I hope it is clear now how LoRA has been applied to Transformers. Now, moving on to the empirical experiments. Let's first see the setup and methods used in these experiments, and later we will see the results on RoBERTa, DeBERTa, and GPT-2. I will make this section as detailed as possible by diving into the mathematics of some methods and by looking at the GLUE benchmark in a bit more detail.
Quick Glimpse/Summary:
- LoRA's performance was evaluated on RoBERTa, DeBERTa, and GPT-2, and scaled up to GPT-3 175B - we will see more about these in the next section
- Experiments spanned tasks from natural language understanding (NLU) to generation (NLG).
- RoBERTa and DeBERTa were tested on the GLUE benchmark - we will first understand which GLUE tasks and metrics were used here
- GPT-2 evaluation followed the setup of Li & Liang (2021) for direct comparison - You will understand more about this
- Additional tests on GPT-3 included WikiSQL (translating natural language to SQL queries) and SAMSum (conversation summarization) - This will be covered in the part-3 of the paper along with the remaining topics
- Further details on datasets are available in Appendix C - We will understand this in our current blog on the datasets used
- All experiments were conducted using NVIDIA Tesla V100.
Let's first understand the baselines/methods used for benchmarking results against LoRA. Most of them were covered in Part 1 of the blog, but let's quickly recap.
In the previous part, we covered fine-tuning, bias-only, adapter tuning, and LoRA itself; feel free to refer to the previous blog here. I will explain prefix-embedding and prefix-layer tuning (methods 3 & 4 below) in detail here, as they involve the most math and were not covered in the previous blog.
Let's go through a fun and simple analogy for the methods above. Hopefully it will make them clearer :)
Imagine you're trying to adjust the settings on a complex music player to get the perfect sound for different genres.
- Fine-Tuning: Adjust all the settings.
- Bias-only: Only adjust the volume.
- Prefix-embedding: Add special sound effects and adjust their intensity.
- Prefix-layer: Adjust the sound effects after each song layer (like vocals, instruments, etc.).
- Adapter tuning: Insert special sound filters and adjust their intensity.
- LoRA: Add adjustable mini-equalizers alongside the main settings.
Each method offers a different way to tweak the music player (model) to get the desired sound (performance) for different genres (tasks).
Now, let's dive into the math of prefix-embedding and prefix-layer tuning (methods 3 & 4). Don't worry, this also has a practical explanation!
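As a quick fill-in here, using the paper's own notation for these baselines: l_p is the number of trainable prefix tokens, l_i the number of infix tokens, d_model the hidden size, and L the number of Transformer layers. The trainable-parameter counts then work out to:

$$|\Theta|_{\text{prefix-embedding}} = d_{model} \times (l_p + l_i)$$

$$|\Theta|_{\text{prefix-layer}} = L \times d_{model} \times (l_p + l_i)$$

Intuitively, prefix-embedding tuning only trains new token embeddings at the input, while prefix-layer tuning trains the corresponding activations after every layer, hence the extra factor of L. For comparison, LoRA's count is $|\Theta| = 2 \times L_{LoRA} \times d_{model} \times r$, where $L_{LoRA}$ is the number of weight matrices LoRA is applied to.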
Now that we've understood the six different methods used for benchmarking, let's delve into the GLUE benchmark used for model evaluation.
Before I begin explaining GLUE, I'd like to share the motivation behind this discussion. A few months ago, a few of my mentees posed a question about it. They noted that many individuals in their network had never delved deeply into the GLUE benchmark or understood the datasets and metrics involved. Grasping this benchmark is crucial, since it constitutes a significant portion of the results in almost every NLP paper: most NLP papers benchmark against GLUE, which encompasses a broad spectrum of natural language tasks and is available under permissive licenses. Furthermore, it's vital to understand the metric associated with each dataset within GLUE; these metrics are fundamental to statistics, with examples like Matthews correlation for CoLA, among others. So let's first understand this in a technical way, right from the paper, and then with a simple explanation/example for each task.
(Taken as it is from the paper)
Now, let's understand each of these datasets in a simple way. Sometimes so much text in a paper makes me weary and consumes a lot of energy to read (especially after reading 16 lengthy pages and reaching the 17th for this) :p
MNLI (MultiNLI)
- Type: Natural Language Inference (NLI)
- Description: Given a pair of sentences, the goal is to predict whether the second sentence is an entailment, contradiction, or neutral with respect to the first one.
- Example:
- Sentence 1: "The cat sat on the mat."
- Sentence 2: "There is a cat on the mat."
- Prediction: Entailment.
- Metric: Accuracy (both matched and mismatched)
SST-2 (Stanford Sentiment Treebank):
- Type: Sentiment Analysis
- Description: Given a movie review, the task is to determine if it's positive or negative.
- Example:
- Sentence: "The movie was fantastic!"
- Prediction: Positive
- Metric: Accuracy.
MRPC (Microsoft Research Paraphrase Corpus):
- Type: Paraphrase Identification
- Description: Given a pair of sentences, the task is to determine whether the two sentences are paraphrases of each other.
- Example:
- Sentence 1: "The world is round."
- Sentence 2: "The earth has a circular shape."
- Prediction: Paraphrase
- Metric: Accuracy and F1 score. However, in the paper's results table, only accuracy is reported.
CoLA (Corpus of Linguistic Acceptability):
- Type: Linguistic Acceptability
- Description: Given a sentence, the task is to predict whether it's grammatically correct or not.
- Example:
- Sentence: "Him wants to go."
- Prediction: Unacceptable
- Metric: Matthews correlation coefficient (MCC). MCC is a measure of the quality of binary classifications, providing a balanced metric even if classes are of very different sizes.
QNLI (Question Natural Language Inference):
- Type: Question Answering/NLI
- Description: It's a version of the Stanford Question Answering Dataset (SQuAD) adapted to the NLI task. Given a question and a sentence, the task is to determine if the sentence contains the answer to the question.
- Example:
- Question: "What color is the sky?"
- Sentence: "The sky is blue during a clear day."
- Prediction: Entailment
- Metric: Accuracy.
QQP (Quora Question Pairs):
- Type: Duplicate Question Detection
- Description: Given two questions, the task is to determine if they are semantically equivalent.
- Example:
- Question 1: "What's the best way to lose weight?"
- Question 2: "How can I reduce my weight effectively?"
- Prediction: Duplicate
- Metric: Accuracy and F1 score. Again, in the paper's results table, only accuracy is reported.
RTE (Recognizing Textual Entailment):
- Type: Natural Language Inference
- Description: Similar to MNLI but with a different dataset. Given two sentences, the task is to determine whether the second sentence is entailed by the first or not.
- Example:
- Sentence 1: "All birds can fly."
- Sentence 2: "Penguins cannot fly."
- Prediction: Not Entailment
- Metric: Accuracy.
STS-B (Semantic Textual Similarity Benchmark):
- Type: Semantic Textual Similarity
- Description: Given two sentences, the task is to rate their similarity on a scale from 0 (completely dissimilar) to 5 (completely similar).
- Example:
- Sentence 1: "A boy is playing soccer."
- Sentence 2: "A child is playing football."
- Prediction: 4.5 (highly similar)
- Metric: Pearson correlation coefficient, which measures the linear correlation between two datasets.
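To make the metrics above a bit more tangible, here is a tiny, hedged sketch of how you could compute each kind of GLUE metric with scikit-learn and SciPy. The paper does not prescribe these libraries, and the toy labels below are made up purely for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef
from scipy.stats import pearsonr

# Accuracy / F1 -- the kind of metric used for MNLI, SST-2, MRPC, QNLI, QQP, RTE
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print("accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))

# Matthews correlation coefficient -- used for CoLA
print("MCC:", matthews_corrcoef(y_true, y_pred))

# Pearson correlation -- used for STS-B (similarity scores on a 0-5 scale)
gold = [4.5, 1.0, 3.2, 0.5]
pred = [4.1, 1.3, 2.9, 0.8]
print("Pearson r:", pearsonr(gold, pred)[0])
```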
That completes our understanding of the tasks in the GLUE benchmark against which the models and methods are evaluated. Now, let's get into the last part of the blog: which models were used with these methods, and what the results look like!
Technical Explanation:
RoBERTa:
- RoBERTa is a variant of the BERT model. It optimized the pre-training process of BERT, achieving better performance without significantly increasing the number of parameters.
- Even though newer models have surpassed RoBERTa on benchmarks like GLUE, RoBERTa remains popular due to its efficiency relative to its size.
Evaluation:
- The study aims to evaluate how well RoBERTa can be adapted to specific tasks using different adaptation methods.
- They used pre-trained versions of RoBERTa base (with 125M parameters) and RoBERTa large (with 355M parameters) from the HuggingFace Transformers library.
- The performance of these models, when adapted using different methods, was tested on tasks from the GLUE benchmark.
- They replicated the setups from previous works (Houlsby et al. 2019 and Pfeiffer et al. 2021) to ensure consistency
Changes for Fair Comparison:
- They used a consistent batch size for all tasks.
- They limited the sequence length to 128 to match the baselines that use adapters.
- For certain tasks (MRPC, RTE, and STS-B), they initialized the model with the pre-trained version, rather than a version already adapted to another task (MNLI).
The results of these evaluations are presented in Table 2, which we will see shortly.
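For readers who want to try this kind of setup themselves, here is a minimal sketch using today's HuggingFace Transformers plus the peft library. This is not the authors' original code (the paper predates peft and used the authors' own implementation), and the specific hyperparameters below (r = 8, alpha = 16, dropout 0.1) are illustrative assumptions rather than the exact values from the paper.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "roberta-base"   # the 125M-parameter RoBERTa base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

lora_config = LoraConfig(
    r=8,                                # rank of the low-rank update
    lora_alpha=16,                      # scaling factor
    target_modules=["query", "value"],  # adapt only Wq and Wv, as in the paper
    lora_dropout=0.1,
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # shows how few parameters are actually trained
```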
DeBERTa:
- DeBERTa is a recent variant of the BERT model. BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model designed for a wide range of NLP tasks. DeBERTa improves upon BERT by incorporating certain architectural changes.
- Scale and Performance: DeBERTa is trained on a significantly larger scale than its predecessors. This scale, combined with its architectural innovations, allows it to achieve competitive performance on benchmarks like GLUE and SuperGLUE.
- Evaluation with LoRA: The main focus here is to see if the LoRA (Low-Rank Adaptation) method can still provide competitive performance when applied to a model as large and powerful as DeBERTa XXL (which has 1.5 billion parameters) on the GLUE benchmark. The results of this evaluation are presented in Table 2's bottom section - we will see this shortly
- Hyperparameters: Specific details about the settings used during training and evaluation (like learning rate, batch size, etc.) are provided in Section D.2 of the paper - This will be covered in the part-3 of the blog with results and graphs.
GPT-2:
- GPT-2 (Generative Pre-trained Transformer 2) is a model primarily designed for text generation tasks. Unlike BERT or DeBERTa, which are used for understanding tasks (NLU), GPT-2 is used for generating tasks (NLG).
- Evaluation with LoRA: The main question being addressed here is whether LoRA, which has been shown to be effective for NLU tasks, can also be effective for NLG tasks when applied to models like GPT-2 medium and large.
- Setup: The evaluation setup is kept consistent with the setup used by Li & Liang (2021) to ensure that the results are directly comparable.
- Results: Due to space constraints in the paper, results are only presented for the E2E NLG Challenge in Table 3. However, results for other datasets like WebNLG and DART are available in other sections of the paper (Section F.1) - We will see this shortly and some part of it in part-3 blog
- Hyperparameters: Specific details about the settings used during training and evaluation for GPT-2 are provided in Section D.3 of the paper - This will be covered in the part-3 of the blog with results and graphs.
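Similarly, here is a hedged sketch of the GPT-2 side using peft. One caveat: in the HuggingFace GPT-2 implementation the Q, K, and V projections are fused into a single module called c_attn, so targeting it adapts all three at once, which is slightly broader than the paper's choice of adapting only Wq and Wv. The hyperparameters here are again just illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2-medium")
tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")

config = LoraConfig(
    r=4,
    lora_alpha=32,
    target_modules=["c_attn"],  # fused Q/K/V projection in HF GPT-2
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
```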
Phew, we have finally made it to the last part of this blog summary, "The Results Table" that has been mentioned repeatedly in the sections above. So let's take a look at the results:
(Here you will clearly notice that LoRA either outperforms the other setups or comes quite close to the best benchmark. Honestly, I would prefer LoRA over the others any day, accepting a small compromise on accuracy in exchange for a drastic reduction in memory, compute costs, and much more.)
Table Structure:
- Model & Method: This column lists the model and the method used for adaptation. For example, "RoB_base (FT)" refers to the RoBERTa base model that has been fine-tuned (FT).
- # Trainable Parameters: This column indicates the number of parameters that are trainable during the adaptation process. For instance, "0.1M" means there are 100,000 trainable parameters.
- The subsequent columns (like MNLI, SST-2, MRPC, etc.) represent different benchmark tasks. The numbers in these columns indicate the performance of the model on that specific task, usually measured in terms of accuracy or some other relevant metric.
Understanding the Results:
- RoBERTa Base/Large: The authors evaluated the performance of different adaptation methods on the RoBERTa base and RoBERTa large models.
- Performance Metrics: The numbers in the task-specific columns (like MNLI, SST-2, etc.) represent the performance of the model on that task. Higher numbers indicate better performance. For instance, for "RoB_base (FT)" on the MNLI task, the performance is 87.6 (accuracy, as a percentage).
- Variations in Methods: The table showcases the performance of different adaptation methods, such as Fine-Tuning (FT), Bias-only (BitFit), Adapter tuning (AdptD, AdptP, AdptH), and LoRA. Each method has a different number of trainable parameters, which impacts the model's performance on the tasks.
- Symbols like †: These symbols indicate specific setups or conditions under which the model was evaluated. For instance, runs labeled with "†" followed a more restricted setup from a particular reference.
Finally, let's look at the GPT-2 benchmark on the E2E NLG Challenge. I know we haven't covered this the way we covered GLUE, so I will quickly cover it here for the reader's convenience. The methods used for evaluation on this dataset remain the same.
E2E NLG Challenge:
The E2E NLG (End-to-End Natural Language Generation) Challenge is a shared task that focuses on the generation of textual descriptions from structured data inputs. In the context of this challenge, the structured data typically consists of sets of attributes and values, and the goal is to produce a coherent and fluent textual description that accurately reflects this data.
For instance, given a structured input like:
{
"name": "The Blue Lagoon",
"food": "French",
"priceRange": "moderate",
"customerRating": "5 out of 5"
}
A possible generated textual description might be: "The Blue Lagoon is a French restaurant with a moderate price range and has received a customer rating of 5 out of 5."
The challenge evaluates the ability of models to generate such descriptions across various domains and under different conditions.
Evaluation Metrics:
BLEU (Bilingual Evaluation Understudy)
- Measures how many n-grams (sequences of n words) in the generated text match the n-grams in the reference text.
- It's a popular metric for machine translation but is also used in other NLG tasks.
- Ranges from 0 to 1, with 1 being a perfect match.
NIST (National Institute of Standards and Technology)
- An extension of BLEU, it also considers the informativeness of the n-grams.
- It gives higher scores to less frequent n-grams.
METEOR (Metric for Evaluation of Translation with Explicit ORdering)
- Considers precision, recall, synonymy, stemming, and word order to evaluate the quality of generated text.
- It's more sophisticated than BLEU and often correlates better with human judgment.
ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation - Longest Common Subsequence)
- Measures the longest common subsequence between the generated text and reference text.
- It's particularly useful for tasks like text summarization.
CIDEr (Consensus-based Image Description Evaluation):
- Developed for evaluating image captioning tasks.
- It measures the similarity between the generated description and several reference descriptions, considering n-grams that are deemed important (frequently appearing in reference descriptions but not in others).
In the context of the E2E NLG Challenge, these metrics help in quantifying how well the generated textual descriptions match the reference descriptions, both in terms of content (are all the facts right?) and fluency (does it read like something a human would say?).
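As a tiny illustration of one of these metrics, here is a hedged BLEU example using NLTK. The E2E Challenge ships its own official scoring scripts, and the paper's BLEU/NIST/METEOR/ROUGE-L/CIDEr numbers come from that kind of tooling rather than from a snippet like this; the sentences below are made up.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["The Blue Lagoon is a French restaurant with a moderate price "
             "range and a customer rating of 5 out of 5 .".split()]
candidate = ("The Blue Lagoon serves French food at moderate prices "
             "and is rated 5 out of 5 .").split()

# Smoothing avoids zero scores when some higher-order n-grams have no match.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # in [0, 1]; closer to 1 means closer to the reference
```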
Hope this is clear to you all. Now, let's quickly take a look at the GPT-2 benchmarks.
Finally, we have reached the end of this blog summary, Part 2. Below is a quick recap of the things we covered in this part:
- How the paper applies LoRA to the Transformer architecture, and some benefits of doing so
- The methods used for benchmarking on GLUE and the metric evaluated for each dataset
- Mathematical explanation of various methods used along with LoRA
- Some understanding of the setups used for RoBERTa, DeBERTa, and GPT-2
In the next part of the blog, which is the third and final part, we will cover how they have scaled LoRA to GPT-3 models and the setup. We will also see some of the hyperparameters that they have tuned in the setup with some graphical results to it. We will end our blog with the conclusion and several variants of LoRA that have been introduced after this paper.
I hope you all thoroughly enjoyed reading this article. Please feel free to reach out to me if you have any questions or if you want me to add any more details while summarising these papers. Feedback and comments are always welcome!
Happy Reading!