
Fusion-in-Decoder: Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering

by cocacola0 2022. 6. 23.


Paper : https://arxiv.org/pdf/2007.01282.pdf

 

Abstract

  • Generative models for open domain question answering have proven to be competitive, without resorting to external knowledge.
  • While promising, this approach requires models with billions of parameters, which are expensive to train and query. In this paper, we investigate how much these models can benefit from retrieving text passages, potentially containing evidence.
  • We obtain state-of-the-art results on the Natural Questions and TriviaQA open benchmarks. Interestingly, we observe that the performance of this method significantly improves when increasing the number of retrieved passages. This is evidence that sequence-to-sequence models offer a flexible framework to efficiently aggregate and combine evidence from multiple passages.
  • Generative models are competitive on ODQA without relying on external knowledge, but training and querying models with billions of parameters is very expensive.
  • This paper investigates how much such models can benefit from retrieving passages that potentially contain evidence.
  • SOTA results on the Natural Questions and TriviaQA benchmarks.
  • Performance improves substantially as the number of retrieved passages grows, which suggests that sequence-to-sequence models provide a flexible framework for aggregating and combining evidence from multiple passages.

1 Introduction

  • Recently, several works have shown that factual information can be extracted from large scale language models trained on vast quantities of data (Radford et al., 2019; Petroni et al., 2019; Jiang et al., 2019; Talmor et al., 2019). Building on that observation and the advances in pretraining of natural language processing models, Roberts et al. (2020) introduced a generative model for open domain question answering. Without relying on external knowledge, this method obtained competitive results on several benchmarks. However, it requires models containing billions of parameters, since all the information needs to be stored in the weights. This makes models expensive to query and train. In this paper, we investigate how much this method could benefit from having access to an external source of knowledge, such as Wikipedia.
  • Retrieval based approaches were previously considered in the context of open domain question answering with extractive models (Chen et al., 2017). In that case, systems start by retrieving support documents, before extracting the answer from these documents. Different retrieval techniques have been considered, either using sparse representations based on TF/IDF or using dense embeddings (Guu et al., 2020; Karpukhin et al., 2020). The models which extract the answers are often based on contextualized word representations such as ELMo or BERT (Peters et al., 2018; Devlin et al., 2019), and predict a span as answer. Aggregating and combining evidence from multiple passages is not straightforward when using extractive models, and multiple techniques have been proposed to address this limitation (Clark and Gardner, 2018; Min et al., 2019a).
  • In this paper, we explore a simple approach having the best of both worlds, by building on the exciting developments in generative modeling and retrieval for open domain question answering. This method proceeds in two steps, by first retrieving supporting passages using either sparse or dense representations. Then, a sequence-to-sequence model generates the answer, taking as input the retrieved passages in addition to the question. While conceptually simple, this method sets new state-of-the-art results on the TriviaQA and NaturalQuestions benchmarks. In particular, we show that the performance of our method significantly improves when the number of retrieved passages increases. We believe that this is evidence that generative models are good at combining evidence from multiple passages, compared to extractive ones.
  • Several works (Radford et al., 2019; Petroni et al., 2019; Jiang et al., 2019; Talmor et al., 2019) have shown that factual information can be extracted from very large language models trained on vast amounts of data.
  • In particular, Roberts et al. (2020) proposed a generative model for ODQA that achieved competitive results on several benchmarks without external knowledge; however, since all information must be stored in the weights, it requires billions of parameters and is very expensive to train and query. This paper studies how much the approach benefits from access to external knowledge such as Wikipedia.
  • Step 1: retrieve supporting passages using sparse or dense representations. Step 2: feed the retrieved passages together with the question into a sequence-to-sequence model to generate the answer. This simple method achieves SOTA, which the authors take as evidence that generative models are better than extractive ones at aggregating evidence from multiple passages.

Figure 1: A simple approach to open domain question answering. First, it retrieves support text passages from an external source of knowledge such as Wikipedia. Then, a generative encoder-decoder model produces the answer, conditioned on the question and the retrieved passages. This approach scales well with the number of retrieved passages, as the performance keeps improving when retrieving up to one hundred passages.

2 Related work

  • Open domain question answering is the task of answering general domain questions, in which the evidence is not given as input to the system. While being a longstanding problem in natural language processing (Voorhees et al., 1999), this task has recently regained interest following the work by Chen et al. (2017). In that version of the problem, strong supervision is available to the learning system, in the form of spans corresponding to answers. Chen et al. (2017) proposed to solve the problem by first retrieving support documents from Wikipedia, before extracting the answer from the retrieved documents. Different methods were proposed to tackle the setting where no gold spans are given to the system, but only the correct answer. Clark and Gardner (2018) proposed to use a global normalization over all the spans corresponding to the answer, which was later applied to BERT based models (Wang et al., 2019). Min et al. (2019a) introduced a method based on hard expectation-maximization to tackle noisy supervision from this setting. Wang et al. (2018b) described a technique to aggregate answers from different paragraphs, using confidence and coverage scores.
  • Passage retrieval is an important step in open domain question answering, and is an active area of research to improve QA systems. Initially, sparse representations based on TF/IDF were used to retrieve support documents (Chen et al., 2017). Lee et al. (2018) introduced a supervised learning method to rerank paragraphs based on BiLSTM, while Wang et al. (2018a) trained a ranking system with reinforcement learning. A second approach to improve the retrieval step of QA systems is to use additional information such as the Wikipedia or Wikidata graphs (Min et al., 2019b; Asai et al., 2020). Recently, multiple works have shown that retrieval systems entirely based on dense representations and approximate nearest neighbors are competitive with traditional approaches. Such models can be trained using weak supervision in the form of question-answer pairs (Karpukhin et al., 2020), or pretrained using a cloze task and fine-tuned end-to-end (Guu et al., 2020; Lee et al., 2019).
  • Generative question answering was mostly considered in previous work for datasets that require generating answers, such as NarrativeQA (Kociský et al., 2018), CoQA (Reddy et al., 2019) or ELI5 (Fan et al., 2019). These datasets were generated in a way that answers do not correspond to spans in support documents, thus requiring abstractive models. Raffel et al. (2019) showed that generative models are competitive for reading comprehension tasks such as SQuAD (Rajpurkar et al., 2016), where answers are spans. Roberts et al. (2020) proposed to use large pretrained generative models, without using additional knowledge, for open domain question answering. Closest to our work, Min et al. (2020) and Lewis et al. (2020) introduced retrieval augmented generative models for open domain question answering. Our approach differs from these works by how the generative model processes the retrieved passages. This allows scaling to large numbers of documents, and benefiting from this large amount of evidence.
  • Open domain question answering
    • A QA setting in which the supporting evidence is not given to the system as input.
    • A long-standing problem (Voorhees et al., 1999) that only recently regained attention with Chen et al. (2017).
    • Chen et al. (2017) first retrieve evidence documents from Wikipedia and then extract the answer from the retrieved documents.
    • Several methods address the setting where only the correct answer, not a gold span, is given:
      • Clark and Gardner (2018): global normalization over all answer spans, later applied to BERT-based models (Wang et al., 2019).
      • Min et al. (2019a): a hard expectation-maximization approach to handle the noisy supervision of this setting.
      • Wang et al. (2018b): aggregate answers from different paragraphs using confidence and coverage scores.
  • Passage retrieval
    • An important step in ODQA and an active research area for improving QA systems.
    • Early work used sparse TF/IDF-based representations (Chen et al., 2017).
    • Lee et al. (2018) proposed a supervised BiLSTM-based paragraph reranker.
    • Wang et al. (2018a) trained a ranking system with reinforcement learning.
    • Other methods use additional information such as the Wikipedia or Wikidata graphs (Min et al., 2019b; Asai et al., 2020).
    • Recently, several works showed that retrieval based purely on dense representations and approximate nearest neighbors is competitive with traditional approaches:
      • Karpukhin et al. (2020): weak supervision from question-answer pairs.
      • Guu et al. (2020); Lee et al. (2019): pretraining with a cloze task and end-to-end fine-tuning.
  • Generative question answering
    • Previously considered mainly for datasets that require generating answers, such as NarrativeQA (Kociský et al., 2018), CoQA (Reddy et al., 2019), or ELI5 (Fan et al., 2019).
    • In these datasets the answers do not correspond to spans in the support documents, so abstractive models are needed.
    • Raffel et al. (2019): showed that generative models are competitive on reading comprehension tasks such as SQuAD (Rajpurkar et al., 2016), where answers are spans.
    • Roberts et al. (2020): used large pretrained generative models for ODQA without additional knowledge.
    • Min et al. (2020) and Lewis et al. (2020): the closest works, using retrieval-augmented generative models. This paper differs in how the generative model processes the retrieved passages, which lets it scale to a large number of documents and benefit from that larger amount of evidence.

3 Method

  • In this section, we describe our approach to open domain question answering. It proceeds in two steps, first retrieving support passages before processing them with a sequence to sequence model.
  • Retrieval. For the retrieval of support passages, we consider two methods: BM25 (Robertson et al., 1995) and DPR (Karpukhin et al., 2020). In BM25, passages are represented as bag of words, and the ranking function is based on term and inverse document frequencies. We use the implementation from Apache Lucene with default parameters, and tokenize questions and passages with SpaCy. In DPR, passages and questions are represented as dense vector representations, computed using two BERT networks. The ranking function is the dot product between the query and passage representations. Retrieval is performed using approximate nearest neighbors with the FAISS library.
  • Reading. Our generative model for open domain QA is based on a sequence-to-sequence network, pretrained on unsupervised data, such as T5 or BART (Raffel et al., 2019; Lewis et al., 2019). The model takes as input the question, as well as the support passages, and generates the answer. More precisely, each retrieved passage and its title are concatenated with the question, and processed independently from other passages by the encoder. We add special tokens question:, title: and context: before the question, title and text of each passage. Finally, the decoder performs attention over the concatenation of the resulting representations of all the retrieved passages. The model thus performs evidence fusion in the decoder only, and we refer to it as Fusion-in-Decoder.
  • By processing passages independently in the encoder, but jointly in the decoder, this method differs from Min et al. (2020) and Lewis et al. (2020). Processing passages independently in the encoder allows scaling to a large number of contexts, as it only performs self attention over one context at a time. This means that the computation time of the model grows linearly with the number of passages, instead of quadratically. On the other hand, processing passages jointly in the decoder allows to better aggregate evidence from multiple passages.
  • Retrieval
    • Option 1: BM25 via Apache Lucene, with questions and passages tokenized by SpaCy.
    • Option 2: DPR dense retrieval, with approximate nearest-neighbor search via FAISS (see the retrieval sketch below this list).
  • Reading
    • Each retrieved passage and its title are concatenated with the question and fed to the encoder independently of the other passages.
    • The special prefixes question:, title:, and context: are inserted before the question, title, and passage text.
    • The encoder representations of all passages are concatenated and passed to the decoder, which generates the answer (a code sketch follows the architecture figure below).
    • The authors call this model Fusion-in-Decoder.
  • Processing passages independently in the encoder but jointly in the decoder means
    • a large number of contexts can be handled: computation grows linearly with the number of passages rather than quadratically, since self-attention runs over one context at a time;
    • fusing in the decoder lets the model aggregate evidence from multiple passages more effectively.
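Below is a minimal sketch (not from the paper) of the retrieval step: text is cut into non-overlapping 100-word passages, embedded with the public DPR encoders from HuggingFace, and ranked by dot product with FAISS. The checkpoint names, the example file wiki_article.txt, and the exact-search IndexFlatIP are my own assumptions; the authors use Lucene/BM25 or their DPR setup with approximate nearest-neighbor search at Wikipedia scale.

```python
# Minimal DPR-style retrieval sketch (assumptions: public DPR checkpoints,
# exact inner-product search; the paper uses approximate search at full scale).
import faiss
import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

def split_into_passages(text, size=100):
    """Split an article into non-overlapping passages of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

ctx_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

passages = split_into_passages(open("wiki_article.txt").read())  # hypothetical file

with torch.no_grad():
    p_emb = ctx_enc(**ctx_tok(passages, return_tensors="pt",
                              padding=True, truncation=True)).pooler_output
    q_emb = q_enc(**q_tok("who wrote the declaration of independence",
                          return_tensors="pt")).pooler_output

# DPR ranking function: dot product between question and passage vectors.
index = faiss.IndexFlatIP(p_emb.shape[1])
index.add(p_emb.numpy())
scores, ids = index.search(q_emb.numpy(), k=min(100, len(passages)))
top_passages = [passages[i] for i in ids[0]]
```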

FID Architecture
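The sketch below tries to illustrate the Fusion-in-Decoder idea with a plain T5 from HuggingFace Transformers: each question/title/context string is encoded independently, the encoder outputs are concatenated into one long sequence, and the decoder attends over all of them while generating the answer. It assumes a Transformers version whose generate() accepts precomputed encoder_outputs; the authors' real implementation wraps the T5 internals instead of reshaping outputs like this, so treat it as an approximation of the architecture, not the official code.

```python
# Illustrative Fusion-in-Decoder sketch on top of a vanilla T5.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
from transformers.modeling_outputs import BaseModelOutput

tok = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

question = "who wrote the declaration of independence"
retrieved = [("Declaration of Independence", "The Declaration was drafted by ..."),
             ("Thomas Jefferson", "Jefferson was the principal author of ...")]

# One input per passage: "question: ... title: ... context: ..."
inputs = [f"question: {question} title: {t} context: {p}" for t, p in retrieved]
enc = tok(inputs, return_tensors="pt", padding=True,
          truncation=True, max_length=250)  # 250 word pieces per passage

with torch.no_grad():
    # Each passage is encoded independently -> cost grows linearly in #passages.
    enc_out = model.encoder(input_ids=enc.input_ids,
                            attention_mask=enc.attention_mask)

    # Fusion: concatenate all passage token states into one long sequence
    # so the decoder cross-attends over every passage jointly.
    hidden = enc_out.last_hidden_state                 # (n_psg, len, dim)
    fused = hidden.reshape(1, -1, hidden.size(-1))     # (1, n_psg*len, dim)
    fused_mask = enc.attention_mask.reshape(1, -1)

    answer_ids = model.generate(
        encoder_outputs=BaseModelOutput(last_hidden_state=fused),
        attention_mask=fused_mask,
        max_length=20)                                 # greedy decoding by default

print(tok.decode(answer_ids[0], skip_special_tokens=True))
```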

4 Experiments

In this section, we report empirical evaluations of Fusion-in-Decoder for open domain QA.

  • Datasets. We consider the following datasets, and use the same setting as Lee et al. (2019):
    • NaturalQuestions (Kwiatkowski et al., 2019) contains questions corresponding to Google search queries. The open-domain version of this dataset is obtained by discarding answers with more than 5 tokens.
    • TriviaQA (Joshi et al., 2017) contains questions gathered from trivia and quiz-league websites. The unfiltered version of TriviaQA is used for open-domain question answering.
    • SQuAD v1.1 (Rajpurkar et al., 2016) is a reading comprehension dataset. Given a paragraph extracted from Wikipedia, annotators were asked to write questions, for which the answer is a span from the corresponding paragraph.
    • Following Lee et al. (2019) we use the validation as test, and keep 10% of the training set for validation. We use the Wikipedia dumps from Dec. 20, 2018 for NQ and TriviaQA and from Dec. 21, 2016 for SQuAD. We apply the same preprocessing as Chen et al. (2017); Karpukhin et al. (2020), leading to passages of 100 words, which do not overlap.
  • Evaluation. Predicted answers are evaluated with the standard exact match metric (EM), as introduced by Rajpurkar et al. (2016). A generated answer is considered correct if it matches any answer of the list of acceptable answers after normalization. This normalization step consists in lowercasing and removing articles, punctuation and duplicated whitespace.
  • Technical details. We initialize our models with the pretrained T5 models (Raffel et al., 2019), available in the HuggingFace Transformers library. We consider two model sizes, base and large, containing respectively 220M and 770M parameters. We fine-tune the models on each dataset independently, using Adam (Kingma and Ba, 2014) with a constant learning rate of 10^-4 and a dropout rate of 10%. We train the model for 10k gradient steps, with a batch size of 64, using 64 Tesla V100 32GB. We evaluate models every 500 steps and select the best one on the validation set based on the Exact Match score. During training on NaturalQuestions and SQuAD, we sample the target among the list of answers, while for TriviaQA, we use the unique human-generated answer. For TriviaQA, answers in uppercase are normalized by converting all letters in lowercase except the first letter of each word, using the title Python string method. For both training and testing, we retrieve 100 passages (unless said otherwise), and truncate them to 250 word pieces. Following the results of Karpukhin et al. (2020), passages are retrieved with DPR for NQ and TriviaQA, and with BM25 for SQuAD. We generate answers by using greedy decoding.
  • Comparison to state-of-the-art. In table 1, we compare the results obtained by Fusion-in-Decoder with existing approaches for open domain question answering. We observe that while conceptually simple, this method outperforms existing work on the NaturalQuestions and TriviaQA benchmarks. In particular, generative models seem to perform well when evidence from multiple passages need to be aggregated, compared to extractive approaches. Our method also performs better than other generative models, showing that scaling to large number of passages and processing them jointly leads to improvement in accuracy. Second, we observe that using additional knowledge in generative models by using retrieval leads to important performance gains. On NaturalQuestions, the closed book T5 model obtains 36.6% accuracy with 11B parameters, while our approach obtains 44.1% with 770M parameters plus Wikipedia with BM25 retrieval. Both methods use roughly the same amount of memory to store information, indicating that text based explicit memories are competitive for knowledge retrieval tasks.
  • Scaling with number of passages. In Figure 3, we report the performance with respect to the number of retrieved passages. In particular, we observe that increasing the number of passages from 10 to 100 leads to 6% improvement on TriviaQA and 3.5% improvement on NaturalQuestions. On the other hand, the performance of most extractive models seems to peak around 10 to 20 passages (Wang et al., 2019; Yang et al., 2019). We believe that this is evidence that sequence-to-sequence models are good at combining information from multiple passages.
  • Impact of the number of training passages. In the previous section, the model was trained and evaluated with the same number of passages. To reduce the training computational budget, a simple solution consists in training the model with fewer passages. In Table 2, we report the performance obtained by training with different numbers of passages, while testing with 100 passages. We observe that reducing the number of training passages leads to a decrease of accuracy. Further, we propose to finetune the previous models using 100 passages for 1000 steps. This allows to reduce the accuracy gap, while using significantly less computational resources: we can reach 46.0 EM on NaturalQuestions, using 147 GPU hours, compared to 425 GPU hours when training on 100 passages.
  • Datasets. The following datasets are used, with the same setting as Lee et al. (2019):
    • NaturalQuestions (Kwiatkowski et al., 2019): questions corresponding to Google search queries; the open-domain version discards answers longer than 5 tokens.
    • TriviaQA (Joshi et al., 2017): questions gathered from trivia and quiz-league websites; the unfiltered version is used for ODQA.
    • SQuAD v1.1 (Rajpurkar et al., 2016): a reading comprehension (MRC) dataset.
    • Following Lee et al. (2019), the validation set is used as the test set and 10% of the training data is held out for validation. Wikipedia dumps from Dec. 20, 2018 (NQ, TriviaQA) and Dec. 21, 2016 (SQuAD) are used. Preprocessing follows Chen et al. (2017) and Karpukhin et al. (2020): non-overlapping passages of 100 words.
  • Evaluation. Exact match (EM), as introduced by Rajpurkar et al. (2016). A generated answer counts as correct if, after normalization, it matches any answer in the list of acceptable answers; normalization consists of lowercasing and removing articles, punctuation, and duplicated whitespace (see the normalization sketch after this list).
  • Technical details.
    • Models are initialized from the pretrained T5 checkpoints in the HuggingFace Transformers library; two sizes are considered, base and large (220M and 770M parameters). Each dataset is fine-tuned independently with Adam (constant learning rate of 1e-4) and a dropout rate of 10% (a fine-tuning sketch also follows this list).
    • 10k gradient steps with a batch size of 64, on 64 Tesla V100 32GB GPUs.
    • Models are evaluated on the validation set every 500 steps and selected by EM score.
    • For NaturalQuestions and SQuAD, the target is sampled from the list of answers; for TriviaQA, the unique human-generated answer is used.
    • For TriviaQA, uppercase answers are normalized so that only the first letter of each word stays capitalized.
    • For both training and testing, 100 passages are retrieved and truncated to 250 word pieces.
    • DPR is used for NQ and TriviaQA, BM25 for SQuAD.
    • Answers are generated with greedy decoding.
  • Comparison to SOTA
    • Fusion-in-Decoder outperforms prior work on the NaturalQuestions and TriviaQA benchmarks.
    • In particular, generative models outperform extractive ones when evidence from multiple passages needs to be aggregated.
    • FiD also beats other generative models (T5, GPT-3): scaling to a large number of passages and processing them jointly improves accuracy.
    • Feeding external knowledge to the generative model through a retriever yields large performance gains.
    • On NaturalQuestions, the closed-book T5 reaches 36.6% accuracy with 11B parameters, while FiD reaches 44.1% with 770M parameters plus Wikipedia and BM25 retrieval.
    • Both methods use roughly the same amount of memory to store information, indicating that text-based explicit memories are competitive for knowledge retrieval tasks.
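A sketch of the answer normalization described above: the standard SQuAD-style normalization applied before exact-match scoring, plus the TriviaQA target normalization with Python's str.title(). This follows the usual open-domain QA evaluation convention; the authors' exact scripts may differ in detail.

```python
# SQuAD-style answer normalization for exact-match (EM) scoring.
import re
import string

def normalize_answer(s):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold_answers):
    return float(any(normalize_answer(prediction) == normalize_answer(g)
                     for g in gold_answers))

# TriviaQA: uppercase gold answers are converted so that only the first letter
# of each word stays capitalized, using str.title().
def normalize_trivia_target(answer):
    return answer.title() if answer.isupper() else answer

print(exact_match("The Eiffel Tower", ["Eiffel tower"]))   # 1.0
print(normalize_trivia_target("EIFFEL TOWER"))             # "Eiffel Tower"
```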
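And a rough sketch of the fine-tuning setup (T5 initialization, Adam with a constant 10^-4 learning rate, 10% dropout, 10k gradient steps in the paper). The toy input and tiny step count here are placeholders, and a plain T5 stands in for the FiD reader.

```python
# Hedged fine-tuning sketch; real FiD training uses 100 passages per question,
# batch size 64 and 10k steps on 64 V100 GPUs.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base", dropout_rate=0.1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # constant learning rate

# Toy batch standing in for real (question + passages -> answer) training data.
inputs = tok(["question: who wrote hamlet title: Hamlet context: Hamlet is a "
              "tragedy written by William Shakespeare."], return_tensors="pt")
labels = tok(["William Shakespeare"], return_tensors="pt").input_ids

model.train()
for step in range(10):  # placeholder for the paper's 10k gradient steps
    loss = model(input_ids=inputs.input_ids,
                 attention_mask=inputs.attention_mask,
                 labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # every 500 steps the paper evaluates validation EM and keeps the best model
```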

Comparison of FiD to the state of the art (table from the paper)

 

  • Scaling with number of passages
    • Performance as a function of the number of retrieved passages.
    • Going from 10 to 100 passages improves TriviaQA by 6% and NaturalQuestions by 3.5%.
    • In contrast, most extractive models plateau around 10 to 20 passages (Wang et al., 2019; Yang et al., 2019).
    • The authors take this as evidence that sequence-to-sequence models are good at combining information from multiple passages.

FiD performance by number of retrieved passages

  • Impact of the number of training passages
    • In the previous section, the model was trained and evaluated with the same number of passages.
    • To reduce the training compute budget, the model can instead be trained with fewer passages.
    • Models are trained with different numbers of passages and always tested with 100 passages.
    • Reducing the number of training passages lowers accuracy.
    • However, fine-tuning those models on 100 passages for only 1000 additional steps
    • closes most of the accuracy gap with far less compute:
    • 46.0 EM on NaturalQuestions with 147 GPU hours, versus 425 GPU hours when training on 100 passages throughout.

FiD performance using different numbers of training passages

5 Conclusion

  • In this paper, we study a simple approach to open domain question answering, which relies on retrieving support passages before processing them with a generative model. We show that while conceptually simple, this approach is competitive with existing methods, and that it scales well with the number of retrieved passages. In future work, we plan to make this model more efficient, in particular when scaling to large number of support passages. We also plan to integrate the retrieval in our model, and to learn the whole system end-to-end.
  • The paper studies a very simple ODQA approach: retrieve supporting evidence and hand it to a generative model (really simple..).
  • It is highly competitive and scales well with the number of retrieved passages.
  • Future work: make the model more efficient when scaling to many support passages, and integrate retrieval into the model so the whole system can be learned end-to-end.
