RoBERTa: A Robustly Optimized BERT Pretraining Approach
Paper : https://arxiv.org/pdf/1907.11692.pdf
Code :
https://github.com/pytorch/fairseq
https://github.com/pytorch/fairseq/blob/main/examples/roberta/README.md
Description : RoBERTa
Abstract
Motivation (Background of Study)
- Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results.
Achievement (Research Result)
- We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size.
- We find that BERT was significantly undertrained and that, with a better training recipe, it can match or exceed the performance of every model published after it.
Impact (Significance of Study)
- Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
- These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements.
1. Introduction
Limitations of self-training
- Self-training methods such as ELMo (Peters et al., 2018), GPT (Radford et al., 2018), BERT (Devlin et al., 2019), XLM (Lample and Conneau, 2019), and XLNet (Yang et al., 2019) have brought significant performance gains, but it can be challenging to determine which aspects of the methods contribute the most.
- Training is computationally expensive, limiting the amount of tuning that can be done, and is often done with private training data of varying sizes, limiting our ability to measure the effects of the modeling advance.
Research Objective
- We present a replication study of BERT pretraining (Devlin et al., 2019), which includes a careful evaluation of the effects of hyperparameter tuning and training set size.
- We find that BERT was significantly undertrained and propose an improved recipe for training BERT models, which we call RoBERTa, that can match or exceed the performance of all of the post-BERT methods.
Research Procedures
- (1) training the model longer, with bigger batches, over more data
- (2) removing the next sentence prediction objective
- (3) training on longer sequences
- (4) dynamically changing the masking pattern applied to the training data
- We also collect a large new dataset (CC-NEWS) of comparable size to other privately used datasets, to better control for training set size effects
Research Result
- (1) We present a set of important BERT design choices and training strategies and introduce alternatives that lead to better downstream task performance
- (2) We use a novel dataset, CCNEWS, and confirm that using more data for pretraining further improves performance on downstream tasks
- (3) Our training improvements show that masked language model pretraining, under the right design choices, is competitive with all other recently published methods
2. Background
- a brief overview of the BERT (Devlin et al., 2019) pretraining approach and some of the training choices that we will examine experimentally in the following section
2.1 Setup
- BERT takes as input a concatenation of two token segments of lengths M and N, which are constrained such that M + N < T, where T is a parameter that controls the maximum sequence length during training
2.2 Architecture
- We use a transformer architecture with L layers. Each block uses A self-attention heads and hidden dimension H.
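As a concrete illustration of these hyperparameters (not from the paper, which uses fairseq), here is a minimal sketch that instantiates a BERTBASE-sized encoder with the Hugging Face transformers library, assuming that library's BertConfig/BertModel API:

```python
from transformers import BertConfig, BertModel

# BERT_BASE-sized configuration: L = 12 layers, H = 768 hidden size, A = 12 heads.
config = BertConfig(
    vocab_size=30522,          # BERT's ~30K subword vocabulary (see Sections 2.5 and 4.4)
    num_hidden_layers=12,      # L
    hidden_size=768,           # H
    num_attention_heads=12,    # A
)
model = BertModel(config)
print(f"{sum(p.numel() for p in model.parameters()):,}")  # roughly 110M parameters
```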
2.3 Training Objectives
- Masked Language Model (MLM)
- The MLM objective is a cross-entropy loss on predicting the masked tokens.
- BERT uniformly selects 15% of the input tokens for possible replacement. Of the selected tokens, 80% are replaced with [MASK], 10% are left unchanged, and 10% are replaced by a randomly selected vocabulary token (see the sketch after this list).
- In the original implementation, random masking and replacement are performed once at the beginning and saved for the duration of training, although in practice the data is duplicated so the mask is not always the same for every training sentence (see Section 4.1).
- Next Sentence Prediction (NSP)
- NSP is a binary classification loss for predicting whether two segments follow each other in the original text
- Positive and negative examples are sampled with equal probability
- The NSP objective was designed to improve performance on downstream tasks, such as Natural Language Inference (Bowman et al., 2015), which require reasoning about the relationships between pairs of sentences
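The 80/10/10 masking rule above can be written down in a few lines. The following is a minimal PyTorch sketch, not the fairseq implementation; for brevity it ignores special tokens such as [CLS] and [SEP], which should be excluded from masking in practice:

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Apply BERT's 80/10/10 masking rule to a batch of token ids (special tokens ignored for brevity)."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Uniformly select 15% of positions as prediction targets.
    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100                        # positions that do not contribute to the MLM loss

    # 80% of the selected positions are replaced with [MASK].
    masked = selected & (torch.rand(input_ids.shape) < 0.8)
    input_ids[masked] = mask_token_id

    # Half of the remaining selected positions (10% overall) get a random vocabulary token;
    # the other half (10% overall) are left unchanged.
    randomized = selected & ~masked & (torch.rand(input_ids.shape) < 0.5)
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]

    return input_ids, labels
```

The MLM loss is then the cross-entropy between the model's predictions and `labels`, computed only at positions where `labels != -100`.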
2.4 Optimization
- BERT is optimized with Adam (Kingma and Ba, 2015) using the following parameters: β1 = 0.9, β2 = 0.999, ε = 1e-6, and L2 weight decay of 0.01
- The learning rate is warmed up over the first 10,000 steps to a peak value of 1e-4, and then linearly decayed
- BERT trains with a dropout of 0.1 on all layers and attention weights, and a GELU activation function (Hendrycks and Gimpel, 2016)
- Models are pretrained for S = 1,000,000 updates, with minibatches containing B = 256 sequences of maximum length T = 512 tokens
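A sketch of this optimization setup in plain PyTorch (the paper itself uses fairseq); AdamW is used here as a stand-in for Adam with weight decay, and the schedule implements the 10,000-step linear warmup to a peak learning rate of 1e-4 followed by a linear decay over the S = 1,000,000 total updates:

```python
import torch

def build_optimizer_and_scheduler(model, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=peak_lr,
        betas=(0.9, 0.999), eps=1e-6, weight_decay=0.01,
    )

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                      # linear warmup from 0 to peak_lr
        remaining = total_steps - step
        return max(0.0, remaining / (total_steps - warmup_steps))   # linear decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```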
2.5 Data
- BERT is trained on a combination of BOOKCORPUS (Zhu et al., 2015) plus English WIKIPEDIA, which totals 16GB of uncompressed text
3. Experimental Setup
- we describe the experimental setup for our replication study of BERT
3.1 Implementation
- We primarily follow the original BERT optimization hyperparameters, given in Section 2 except for the peak learning rate and number of warmup steps, which are tuned separately for each setting
- We additionally found training to be very sensitive to the Adam epsilon term, and in some cases we obtained better performance or improved stability after tuning it
- We found setting β2 = 0.98 to improve stability when training with large batch sizes
- We pretrain with sequences of at most T = 512 tokens.
- Unlike Devlin et al. (2019), we do not randomly inject short sequences, and we do not train with a reduced sequence length for the first 90% of updates.
- We train only with full-length sequences.
- We train with mixed precision floating point arithmetic on DGX-1 machines, each with 8 × 32GB Nvidia V100 GPUs interconnected by Infiniband.
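For illustration, a minimal mixed-precision training step using torch.cuda.amp; this is only a sketch of how one might reproduce the setup, not the fairseq code, and it assumes a model whose forward pass returns an object with a `.loss` attribute:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, batch, optimizer, scheduler):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():       # run the forward pass in float16 where safe
        loss = model(**batch).loss        # assumed interface: forward returns an object with .loss
    scaler.scale(loss).backward()         # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    return loss.item()
```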
3.2 Data
- BOOKCORPUS (Zhu et al., 2015) plus English WIKIPEDIA. This is the original data used to train BERT. (16GB)
- CC-NEWS, which we collected from the English portion of the CommonCrawl News dataset (Nagel, 2016). The data contains 63 million English news articles crawled between September 2016 and February 2019. (76GB after filtering)
- OPENWEBTEXT (Gokaslan and Cohen, 2019), an open-source recreation of the WebText corpus described in Radford et al. (2019). The text is web content extracted from URLs shared on Reddit with at least three upvotes. (38GB)
- STORIES, a dataset introduced in Trinh and Le (2018) containing a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas. (31GB).
3.3 Evaluation
- GLUE
- The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019b) is a collection of 9 datasets for evaluating natural language understanding systems.
- Tasks are framed as either single-sentence classification or sentence-pair classification tasks
- The GLUE organizers provide training and development data splits as well as a submission server and leaderboard that allows participants to evaluate and compare their systems on private held-out test data
- single-task training data (i.e., without multi-task training or ensembling)
- Our finetuning procedure follows the original BERT paper
- SQuAD
- The Stanford Question Answering Dataset (SQuAD) provides a paragraph of context and a question
- The task is to answer the question by extracting the relevant span from the context
- Evaluate on two versions of SQuAD: V1.1 and V2.0
- For SQuAD V1.1 we adopt the same span prediction method as BERT
- For SQuAD V2.0, we add an additional binary classifier to predict whether the question is answerable, which we train jointly by summing the classification and span loss terms
- RACE
- The ReAding Comprehension from Examinations (RACE) (Lai et al., 2017) task is a large-scale reading comprehension dataset with more than 28,000 passages and nearly 100,000 questions
- The dataset is collected from English examinations in China, which are designed for middle and high school students.
- In RACE, each passage is associated with multiple questions. For every question, the task is to select one correct answer from four options
- RACE has significantly longer contexts than other popular reading comprehension datasets, and the proportion of questions that require reasoning is very large
4. Training Procedure Analysis
- This section explores and quantifies which choices are important for successfully pretraining BERT models.
- We keep the model architecture fixed.
- We begin by training BERT models with the same configuration as BERTBASE (L = 12, H = 768, A = 12, 110M params).
4.1 Static vs. Dynamic Masking
- The original BERT implementation performed masking once during data preprocessing, resulting in a single static mask
- To avoid using the same mask for each training instance in every epoch, the training data was duplicated 10 times so that each sequence is masked in 10 different ways over the 40 epochs of training. Thus, each training sequence was seen with the same mask four times during training.
- We compare this strategy with dynamic masking where we generate the masking pattern every time we feed a sequence to the model
- This becomes crucial when pretraining for more steps or with larger datasets.
- RESULT
- We find that our reimplementation with static masking performs similar to the original BERT model, and dynamic masking is comparable or slightly better than static masking
- Given these results and the additional efficiency benefits of dynamic masking, we use dynamic masking in the remainder of the experiments
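Dynamic masking is straightforward to express as a collate function that re-samples the mask every time a batch is built; this sketch reuses the hypothetical `mask_tokens` helper from the Section 2.3 sketch:

```python
import torch
# Reuses the `mask_tokens` helper sketched after Section 2.3.

def dynamic_masking_collate(batch, mask_token_id, vocab_size):
    """Collate function that draws a fresh mask every time the batch is built (dynamic masking)."""
    input_ids = torch.stack(batch)                    # batch of pre-tokenized, equal-length sequences
    return mask_tokens(input_ids, mask_token_id, vocab_size)

# Static masking would instead call mask_tokens once during preprocessing and reuse the
# resulting (input_ids, labels) for all epochs, e.g. duplicating the data 10 times to
# obtain 10 fixed masks as in the original BERT implementation.
```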
4.2 Model Input Format and Next Sentence Prediction
- In addition to the masked language modeling objective, the model is trained to predict whether the observed document segments come from the same or distinct documents via an auxiliary Next Sentence Prediction (NSP) loss
- The NSP loss was hypothesized to be an important factor in training the original BERT model. Devlin et al. (2019) observe that removing NSP hurts performance, with significant performance degradation on QNLI, MNLI, and SQuAD 1.1.
- However, some recent work has questioned the necessity of the NSP loss (Lample and Conneau, 2019; Yang et al., 2019; Joshi et al., 2019).
- To better understand this discrepancy, we compare several alternative training formats:
- SEGMENT-PAIR+NSP
- This follows the original input format used in BERT (Devlin et al., 2019), with the NSP loss. Each input has a pair of segments, which can each contain multiple natural sentences, but the total combined length must be less than 512 tokens
- ex) [CLS] Hello, I'm BERT. I originated from the Transformer. [SEP] Hi, I'm GPT2. [SEP] (< 512 tokens)
- SENTENCE-PAIR+NSP
- Each input contains a pair of natural sentences, either sampled from a contiguous portion of one document or from separate documents. Since these inputs are significantly shorter than 512 tokens, we increase the batch size so that the total number of tokens remains similar to SEGMENT-PAIR+NSP. We retain the NSP loss.
- ex) [CLS] Hello, I'm BERT. [SEP] Hi, I'm GPT2. [SEP]
- These inputs are mostly much shorter than 512 tokens, so the batch size is increased to keep the total number of pretraining tokens comparable to SEGMENT-PAIR+NSP (and the other formats)
- FULL-SENTENCES
- Each input is packed with full sentences sampled contiguously from one or more documents, such that the total length is at most 512 tokens. Inputs may cross document boundaries. When we reach the end of one document, we begin sampling sentences from the next document and add an extra separator token between documents. We remove the NSP loss
- ex) [CLS] Hello, I'm BERT. I originated from the Transformer. I was the SOTA model. [SEP] Hi, I'm GPT2. [SEP]
- DOC-SENTENCES
- Inputs are constructed similarly to FULL-SENTENCES, except that they may not cross document boundaries. Inputs sampled near the end of a document may be shorter than 512 tokens, so we dynamically increase the batch size in these cases to achieve a similar number of total tokens as FULL-SENTENCES. We remove the NSP loss.
- ex) [CLS] Hello, I'm BERT. I originated from the Transformer. I was the SOTA model. [SEP]
- These inputs can be shorter than 512 tokens, so the batch size is dynamically increased to keep the total number of pretraining tokens comparable to FULL-SENTENCES (and the other formats)
- RESULT
- SENTENCE-PAIR format hurts performance on downstream tasks, which we hypothesize is because the model is not able to learn long-range dependencies.
- DOC-SENTENCES outperforms the originally published BERTBASE results and that removing the NSP loss matches or slightly improves downstream task performance, in contrast to Devlin et al. (2019).
- It is possible that the original BERT implementation may only have removed the loss term while still retaining the SEGMENT-PAIR input format.
- In other words, the original BERT ablation may have kept the SEGMENT-PAIR input format while removing only the NSP loss term, which would explain its relatively lower downstream performance.
- Restricting sequences to come from a single document (DOC-SENTENCES) performs slightly better than packing sequences from multiple documents (FULL-SENTENCES)
- However, because the DOC-SENTENCES format results in variable batch sizes, we use FULL-SENTENCES in the remainder of our experiments for easier comparison with related work
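A rough sketch of FULL-SENTENCES packing, assuming documents are already tokenized into lists of token ids per sentence; `sep_id` and the exact boundary handling are simplifications for illustration:

```python
def pack_full_sentences(documents, sep_id, max_len=512):
    """documents: list of documents, each a list of sentences, each a list of token ids."""
    inputs, current = [], []
    for doc in documents:
        for sentence in doc:
            if current and len(current) + len(sentence) > max_len:
                inputs.append(current)        # flush one full-length training input
                current = []
            current.extend(sentence)          # keep packing sentences contiguously
        if len(current) + 1 <= max_len:
            current.append(sep_id)            # extra separator token between documents
    if current:
        inputs.append(current)
    return inputs
```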
4.3 Training with large batches
- We observe that training with large batches improves perplexity for the masked language modeling objective, as well as end-task accuracy
- RESULT
- Large batch training can also improve efficiency even without large-scale parallel hardware through gradient accumulation, which fairseq supports natively (see the sketch below); in subsequent experiments, RoBERTa is trained with batches of 8K sequences.
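On limited hardware, a large effective batch can be approximated with gradient accumulation: gradients from several small minibatches are accumulated before each optimizer step. The following is only a generic sketch (not the fairseq recipe), again assuming a model whose forward pass returns a `.loss` attribute:

```python
def train_with_accumulation(model, loader, optimizer, accum_steps=32):
    """Accumulate gradients over several small minibatches to emulate one large batch."""
    optimizer.zero_grad()
    for i, batch in enumerate(loader):
        loss = model(**batch).loss / accum_steps   # scale so the summed gradient matches one big batch
        loss.backward()
        if (i + 1) % accum_steps == 0:             # e.g., 32 minibatches of 256 sequences ≈ 8K sequences
            optimizer.step()
            optimizer.zero_grad()
```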
4.4 Text Encoding
- Byte-Pair Encoding (BPE) (Sennrich et al., 2016) is a hybrid between character- and word-level representations that allows handling the large vocabularies common in natural language corpora.
- BPE vocabulary sizes typically range from 10K-100K subword units. However, unicode characters can account for a sizeable portion of this vocabulary when modeling large and diverse corpora, such as the ones considered in this work.
- Radford et al. (2019) introduce a clever implementation of BPE that uses bytes instead of unicode characters as the base subword units. Using bytes makes it possible to learn a subword vocabulary of a modest size (50K units) that can still encode any input text without introducing any “unknown” tokens
- The original BERT implementation (Devlin et al., 2019) uses a character-level BPE vocabulary of size 30K, which is learned after preprocessing the input with heuristic tokenization rules.
- Following Radford et al. (2019), we instead consider training BERT with a larger byte-level BPE vocabulary containing 50K subword units, without any additional preprocessing or tokenization of the input.
- This adds approximately 15M and 20M additional parameters for BERTBASE and BERTLARGE, respectively.
- Early experiments revealed only slight differences between these encodings, with the Radford et al. (2019) BPE achieving slightly worse end-task performance on some tasks.
- Nevertheless, we believe the advantages of a universal encoding scheme outweigh the minor degradation in performance, and we use this encoding in the remainder of our experiments
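For reference, a byte-level BPE vocabulary of this size can be trained with the Hugging Face tokenizers library; this is an illustrative assumption (the paper reuses the GPT-2 BPE), and `corpus.txt` is a hypothetical path:

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],                                      # hypothetical raw-text training file
    vocab_size=50_000,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
# Byte-level units mean any input can be encoded without producing "unknown" tokens.
print(tokenizer.encode("RoBERTa uses byte-level BPE 🙂").tokens)
```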
5. RoBERTa
- We call this configuration RoBERTa for Robustly optimized BERT approach.
- RoBERTa is trained with
- dynamic masking (Section 4.1)
- FULL-SENTENCES without NSP loss (Section 4.2)
- large mini-batches (Section 4.3)
- larger byte-level BPE (Section 4.4)
- we investigate two other important factors that have been under-emphasized in previous work
- (1) the data used for pretraining
- (2) the number of training passes through the data
- We begin by training RoBERTa following the BERTLARGE architecture (L = 24, H = 1024, A = 16, 355M parameters)
- We pretrain for 100K steps over a comparable BOOKCORPUS plus WIKIPEDIA dataset as was used in Devlin et al. (2019)
- We pretrain our model using 1024 V100 GPUs for approximately one day
- RESULT
- RoBERTa provides a large improvement over the originally reported BERTLARGE results, reaffirming the importance of the design choices we explored in Section 4
- Next, we combine this data with the three additional datasets described in Section 3.2. We observe further improvements in performance across all downstream tasks, validating the importance of data size and diversity in pretraining
- Finally, we pretrain RoBERTa for significantly longer, increasing the number of pretraining steps from 100K to 300K, and then further to 500K.
- We again observe significant gains in downstream task performance, and the 300K and 500K step models outperform XLNetLARGE across most tasks
- We note that even our longest-trained model does not appear to overfit our data and would likely benefit from additional training
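As a usage note, the released RoBERTa checkpoints can be loaded through torch.hub following the fairseq README linked at the top (fairseq must be installed); a minimal sketch:

```python
import torch

roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()                                  # disable dropout for evaluation

tokens = roberta.encode('Hello world!')         # apply byte-level BPE and add special tokens
features = roberta.extract_features(tokens)     # last-layer hidden states
print(features.shape)                           # e.g., torch.Size([1, 5, 1024]) for roberta.large
```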
5.1 GLUE Results
- For GLUE we consider two finetuning settings.
- In the first setting (single-task, dev) we finetune RoBERTa separately for each of the GLUE tasks, using only the training data for the corresponding task.
- We consider a limited hyperparameter sweep for each task, with batch sizes ∈ {16, 32} and learning rates ∈ {1e−5, 2e−5, 3e−5}, with a linear warmup for the first 6% of steps followed by a linear decay to 0 (this sweep is written out as a small grid sketch below).
- We finetune for 10 epochs and perform early stopping based on each task’s evaluation metric on the dev set.
- The rest of the hyperparameters remain the same as during pretraining. In this setting, we report the median development set results for each task over five random initializations, without model ensembling.
- In the second setting (ensembles, test), we compare RoBERTa to other approaches on the test set via the GLUE leaderboard.
- For RTE, STS and MRPC we found it helpful to finetune starting from the MNLI single-task model, rather than the baseline pretrained RoBERTa. We explore a slightly wider hyperparameter space, described in the Appendix, and ensemble between 5 and 7 models per task
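The single-task sweep from the first setting, written out as a small grid for clarity (each configuration is finetuned for up to 10 epochs with early stopping on the dev metric):

```python
import itertools

# Hyperparameter grid for single-task GLUE finetuning (first setting above).
sweep = [
    {"batch_size": bsz, "lr": lr, "warmup_ratio": 0.06, "epochs": 10}
    for bsz, lr in itertools.product([16, 32], [1e-5, 2e-5, 3e-5])
]
for cfg in sweep:
    print(cfg)   # each configuration is finetuned and the best dev-set metric is kept
```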
- Task-specific modifications
- Two of the GLUE tasks require task-specific finetuning approaches to achieve competitive leaderboard results.
- QNLI
- Adopt a pairwise ranking formulation for the QNLI task
- Candidate answers are mined from the training set and compared to one another, and a single (question, candidate) pair is classified as positive (Liu et al., 2019b,a; Yang et al., 2019)
- This formulation significantly simplifies the task, but is not directly comparable to BERT (Devlin et al., 2019)
- Following recent work, we adopt the ranking approach for our test submission, but for direct comparison with BERT we report development set results based on a pure classification approach
- WNLI
- We found the provided NLI-format data to be challenging to work with.
- Instead we use the reformatted WNLI data from SuperGLUE (Wang et al., 2019a), which indicates the span of the query pronoun and referent.
- We finetune RoBERTa using the margin ranking loss from Kocijan et al. (2019).
- For a given input sentence, we use spaCy (Honnibal and Montani, 2017) to extract additional candidate noun phrases from the sentence and finetune our model so that it assigns higher scores to positive referent phrases than for any of the generated negative candidate phrases.
- One unfortunate consequence of this formulation is that we can only make use of the positive training examples, which excludes over half of the provided training examples.
- Additional notes on QNLI and WNLI
- Pairwise ranking (QNLI)
- This can be thought of as a form of negative sampling.
- Candidate answers are mined from the training set.
- A single (question, candidate) pair is labeled positive.
- The other (question, candidate) pairs serve as negatives.
- This formulation is possible because QNLI is constructed from SQuAD question-answer pairs.
- Margin ranking loss (WNLI)
- This can also be viewed as a form of negative sampling.
- In this case, each example consists of (anchor, positive, negative1, negative2, ...); a sketch follows these notes.
- Reference link:
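A minimal sketch of the margin ranking idea used for WNLI (and, loosely, of the pairwise ranking view of QNLI above): the model should score the positive referent higher than every generated negative candidate. The example scores and the margin of 1.0 are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def margin_ranking_loss(positive_score, negative_scores, margin=1.0):
    # positive_score: shape (1,); negative_scores: shape (num_negatives,)
    target = torch.ones_like(negative_scores)   # target = 1 means "first input should rank higher"
    return F.margin_ranking_loss(
        positive_score.expand_as(negative_scores), negative_scores, target, margin=margin
    )

# Example: the positive referent scores 2.3; two mined negative candidates score 1.1 and 2.0.
loss = margin_ranking_loss(torch.tensor([2.3]), torch.tensor([1.1, 2.0]))
print(loss)
```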
- RESULT
- In the first setting (single-task, dev), RoBERTa achieves state-of-the-art results on all 9 of the GLUE task development sets.
- Crucially, RoBERTa uses the same masked language modeling pretraining objective and architecture as BERTLARGE, yet consistently outperforms both BERTLARGE and XLNetLARGE.
- This raises questions about the relative importance of model architecture and pretraining objective, compared to more mundane details like dataset size and training time that we explore in this work
- In the second setting (ensembles, test), we submit RoBERTa to the GLUE leaderboard and achieve state-of-the-art results on 4 out of 9 tasks and the highest average score to date.
- This is especially exciting because RoBERTa does not depend on multi-task finetuning, unlike most of the other top submissions.
- We expect future work may further improve these results by incorporating more sophisticated multi-task finetuning procedures.
5.2 SQuAD Results
- While both BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019) augment their training data with additional QA datasets, we only finetune RoBERTa using the provided SQuAD training data.
- For SQuAD v1.1 we follow the same finetuning procedure as Devlin et al. (2019).
- For SQuAD v2.0, we additionally classify whether a given question is answerable; we train this classifier jointly with the span predictor by summing the classification and span loss terms.
- RESULT
- Our single RoBERTa model outperforms all but one of the single model submissions, and is the top scoring system among those that do not rely on data augmentation.
- Most of the top systems build upon either BERT (Devlin et al., 2019) or XLNet (Yang et al., 2019), both of which rely on additional external training data. In contrast, our submission does not use any additional data
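A sketch of the SQuAD v2.0 head described above: a span predictor plus a binary answerability classifier, trained jointly by summing the two losses. The layer shapes and tensor names are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class SquadV2Head(nn.Module):
    def __init__(self, hidden_size=1024):
        super().__init__()
        self.span = nn.Linear(hidden_size, 2)          # start/end logits per token
        self.answerable = nn.Linear(hidden_size, 2)    # binary "is answerable" classifier on [CLS]

    def forward(self, hidden_states, start_pos, end_pos, is_answerable):
        # hidden_states: (batch, seq_len, hidden_size) from the encoder
        start_logits, end_logits = self.span(hidden_states).split(1, dim=-1)
        span_loss = (
            nn.functional.cross_entropy(start_logits.squeeze(-1), start_pos)
            + nn.functional.cross_entropy(end_logits.squeeze(-1), end_pos)
        ) / 2
        cls_logits = self.answerable(hidden_states[:, 0])          # [CLS] representation
        cls_loss = nn.functional.cross_entropy(cls_logits, is_answerable)
        return span_loss + cls_loss                                 # summed joint loss
```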
5.3 RACE Results
- In RACE, systems are provided with a passage of text, an associated question, and four candidate answers. Systems are required to classify which of the four candidate answers is correct.
- We modify RoBERTa for this task by concatenating each candidate answer with the corresponding question and passage.
- We then encode each of these four sequences and pass the resulting [CLS] representations through a fully-connected layer, which is used to predict the correct answer.
- We truncate question-answer pairs that are longer than 128 tokens and, if needed, the passage so that the total length is at most 512 tokens.
- RESULT
- RoBERTa achieves state-of-the-art results on RACE, outperforming both BERTLARGE and XLNetLARGE.
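A sketch of the multiple-choice scoring approach described in this subsection: encode each (passage, question, candidate) concatenation, take its [CLS] representation, and score it with a shared linear layer. The encoder interface is an assumption (any BERT/RoBERTa-style encoder returning per-token hidden states would do):

```python
import torch
import torch.nn as nn

class MultipleChoiceHead(nn.Module):
    def __init__(self, encoder, hidden_size=1024):
        super().__init__()
        self.encoder = encoder                      # assumed to return (batch, seq_len, hidden_size)
        self.classifier = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, labels=None):
        # input_ids: (batch, 4, seq_len) -- one row per (passage, question, candidate) sequence
        b, num_choices, seq_len = input_ids.shape
        flat = input_ids.view(b * num_choices, seq_len)
        cls = self.encoder(flat)[:, 0]                      # [CLS] vector for each candidate sequence
        logits = self.classifier(cls).view(b, num_choices)  # one score per candidate
        if labels is not None:
            return nn.functional.cross_entropy(logits, labels), logits
        return logits
```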
6. Related Work
- Pretraining methods have been designed with different training objectives
- language modeling (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018)
- machine translation (McCann et al., 2017)
- masked language modeling (Devlin et al., 2019; Lample and Conneau, 2019)
- Many recent papers have used a basic recipe of finetuning models for each end task (Howard and Ruder, 2018; Radford et al., 2018)
- However, newer methods have improved performance
- multi-task fine tuning (Dong et al., 2019)
- incorporating entity embeddings (Sun et al., 2019)
- span prediction (Joshi et al., 2019)
- multiple variants of autoregressive pretraining (Song et al., 2019; Chan et al., 2019; Yang et al., 2019)
- Performance is also typically improved by training bigger models on more data (Devlin et al., 2019; Baevski et al., 2019; Yang et al., 2019; Radford et al., 2019)
- Our goal was to replicate, simplify, and better tune the training of BERT, as a reference point for better understanding the relative performance of all of these methods
7. Conclusion
- We find that performance can be substantially improved by
- training the model longer, with bigger batches over more data;
- removing the next sentence prediction objective;
- training on longer sequences;
- and dynamically changing the masking pattern applied to the training data
- Our improved pretraining procedure, which we call RoBERTa, achieves state-of-the-art results on GLUE, RACE and SQuAD, without multi-task finetuning for GLUE or additional data for SQuAD
- These results illustrate the importance of these previously overlooked design decisions and suggest that BERT’s pretraining objective remains competitive with recently proposed alternatives.
- We additionally use a novel dataset, CC-NEWS