
DPR: Dense Passage Retrieval for Open-Domain Question Answering

by cocacola0 2022. 5. 2.

DPR: Dense Passage Retrieval for Open-Domain Question Answering

Code : https://github.com/facebookresearch/DPR

Paper : https://arxiv.org/pdf/2004.04906.pdf

Abstract

  •  Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method.
  • In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework.
  • When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system greatly by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA benchmarks.
  • Traditional sparse vector space models such as TF-IDF or BM25 have been the de facto approach to passage retrieval.
  • A retriever trained with dense representations alone achieves new SOTA across a range of ODQA datasets, improving top-20 retrieval accuracy over a Lucene-BM25 system by 9%-19% absolute.

1. Introduction

Open-domain question answering (QA) 

  • task that answers factoid questions using a large collection of documents
  • early QA systems are often complicated and consist of multiple components (Ferrucci (2012); Moldovan et al. (2003), inter alia),
  • the advances of reading comprehension models suggest a much simplified two-stage framework: (1) a context retriever first selects a small subset of passages where some of them contain the answer to the question, and then (2) a machine reader can thoroughly examine the retrieved contexts and identify the correct answer (Chen et al., 2017)
  • Although reducing open-domain QA to machine reading is a very reasonable strategy, a huge performance degradation is often observed in practice, indicating the needs of improving retrieval.
  • ODQA is the task of answering factoid questions over a very large collection of documents.
  • Early QA systems were complicated and built from many components (e.g., IBM Watson); with the advances in reading comprehension, the pipeline has been simplified into a two-stage framework of (1) a retriever that selects passages likely to contain the answer and (2) a reader that extracts the answer from the retrieved contexts.
  • This approach is very reasonable, but in practice end-to-end performance degrades heavily depending on the retriever.

Retrieval

  • Retrieval in open-domain QA is usually implemented using TF-IDF or BM25 (Robertson and Zaragoza, 2009), which matches keywords efficiently with an inverted index and can be seen as representing the question and context in high-dimensional, sparse vectors (with weighting).
  • Conversely, the dense, latent semantic encoding is complementary to sparse representations by design. For example, synonyms or paraphrases that consist of completely different tokens may still be mapped to vectors close to each other.
  • Consider the question “Who is the bad guy in lord of the rings?”, which can be answered from the context “Sala Baker is best known for portraying the villain Sauron in the Lord of the Rings trilogy.” A term-based system would have difficulty retrieving such a context, while a dense retrieval system would be able to better match “bad guy” with “villain” and fetch the correct context. Dense encodings are also learnable by adjusting the embedding functions, which provides additional flexibility to have a task-specific representation.
  • With special in-memory data structures and indexing schemes, retrieval can be done efficiently using maximum inner product search (MIPS) algorithms (e.g., Shrivastava and Li (2014); Guo et al. (2016))
  • Retrieval in ODQA typically uses TF-IDF or BM25, which represent the question and passages as sparse vectors and are very effective at keyword matching.
  • In contrast, with dense latent semantic encodings (dense representations), synonyms or paraphrases made of entirely different tokens are mapped close to each other in the embedding space.
  • For example, for the question “Who is the bad guy in lord of the rings?” and the context “Sala Baker is best known for portraying the villain Sauron in the Lord of the Rings trilogy.”, there is no token match between “bad guy” and “villain”, but a dense encoder can learn to map them close together because the embedding functions are trainable.

Dense Retrieval Hurdles

  • However, it is generally believed that learning a good dense vector representation needs a large number of labeled pairs of question and contexts.
  • Dense retrieval methods have thus never been shown to outperform TF-IDF/BM25 for open-domain QA before ORQA (Lee et al., 2019), which proposes a sophisticated inverse cloze task (ICT) objective, predicting the blocks that contain the masked sentence, for additional pretraining. The question encoder and the reader model are then finetuned using pairs of questions and answers jointly.
  • Although ORQA successfully demonstrates that dense retrieval can outperform BM25, setting new state-of-the-art results on multiple open-domain QA datasets, it also suffers from two weaknesses.
  • First, ICT pretraining is computationally intensive and it is not completely clear that regular sentences are good surrogates of questions in the objective function. Second, because the context encoder is not fine-tuned using pairs of questions and answers, the corresponding representations could be suboptimal.
  • However, it is generally believed that obtaining such dense vector representations requires a large number of labeled question–context pairs.
  • Before ORQA (Lee et al., 2019), dense retrieval had never been shown to beat TF-IDF/BM25 on ODQA; ORQA demonstrated that a dense retriever can outperform BM25, but it still has two weaknesses.
    1. First, Inverse Cloze Task (ICT) pretraining is computationally very expensive, and it is not obvious that ordinary declarative sentences are good surrogates for questions in the objective function.
    2. Second, because the context encoder is never fine-tuned with question–answer pairs, its representations can be suboptimal.

Dense Passage Retriever (DPR)

  • In this paper, we address the question: can we train a better dense embedding model using only pairs of questions and passages (or answers), without additional pretraining? By leveraging the now standard BERT pretrained model (Devlin et al., 2019) and a dual-encoder architecture (Bromley et al., 1994), we focus on developing the right training scheme using a relatively small number of question and passage pairs.
  • Through a series of careful ablation studies, our final solution is surprisingly simple: the embedding is optimized for maximizing inner products of the question and relevant passage vectors, with an objective comparing all pairs of questions and passages in a batch.
  • Our Dense Passage Retriever (DPR) is exceptionally strong. It not only outperforms BM25 by a large margin (65.2% vs. 42.9% in Top-5 accuracy), but also results in a substantial improvement on the end-to-end QA accuracy compared to ORQA (41.5% vs. 33.3%) in the open Natural Questions setting (Lee et al., 2019; Kwiatkowski et al., 2019).
  • Our contributions are twofold. First, we demonstrate that with the proper training setup, simply fine-tuning the question and passage encoders on existing question-passage pairs is sufficient to greatly outperform BM25. Our empirical results also suggest that additional pretraining may not be needed. Second, we verify that, in the context of open-domain question answering, a higher retrieval precision indeed translates to a higher end-to-end QA accuracy. By applying a modern reader model to the top retrieved passages, we achieve comparable or better results on multiple QA datasets in the open-retrieval setting, compared to several much more complicated systems.
  • This paper shows how to train a better dense embedding model using only a BERT-based dual-encoder architecture and a relatively small number of question–passage pairs, with no additional pretraining.
  • The embedding is optimized to maximize the inner product between each question and its relevant passage, with an objective that compares all question–passage pairs in a batch.
  • DPR is exceptionally strong: it outperforms BM25 by a large margin and beats ORQA on end-to-end QA accuracy.
  • Contributions:
    • With the right training setup, fine-tuning on existing question–passage pairs alone is enough to greatly outperform BM25, without extra pretraining.
    • Higher retrieval precision translates into higher end-to-end QA accuracy, yielding comparable or better results on multiple QA datasets in the open-retrieval setting.

2. Background 

  • The problem of open-domain QA studied in this paper can be described as follows. Given a factoid question, such as “Who first voiced Meg on Family Guy?” or “Where was the 8th Dalai Lama born?”, a system is required to answer it using a large corpus of diversified topics.
  • More specifically, we assume the extractive QA setting, in which the answer is restricted to a span appearing in one or more passages in the corpus. Assume that our collection contains $D$ documents, $d_1, d_2, \cdots, d_D$. We first split each of the documents into text passages of equal lengths as the basic retrieval units and get $M$ total passages in our corpus $\mathcal{C} = \{p_1, p_2, \ldots, p_M\}$, where each passage $p_i$ can be viewed as a sequence of tokens $w_1^{(i)}, w_2^{(i)}, \cdots, w_{|p_i|}^{(i)}$. Given a question $q$, the task is to find a span $w_s^{(i)}, w_{s+1}^{(i)}, \cdots, w_e^{(i)}$ from one of the passages $p_i$ that can answer the question.
  • Notice that to cover a wide variety of domains, the corpus size can easily range from millions of documents (e.g., Wikipedia) to billions (e.g., the Web).
  • As a result, any open-domain QA system needs to include an efficient retriever component that can select a small set of relevant texts, before applying the reader to extract the answer (Chen et al., 2017).
  • Formally speaking, a retriever R : (q, C) → C_F is a function that takes as input a question q and a corpus C and returns a much smaller filter set of texts C_F ⊂ C, where |C_F| = k ≪ |C|. For a fixed k, a retriever can be evaluated in isolation on top-k retrieval accuracy, which is the fraction of questions for which C_F contains a span that answers the question.
  • The ODQA problem studied in this paper is to extract the answer to a factoid question from a huge corpus covering diverse topics.
  • For example, suppose the data consists of documents d_1, d_2, ..., d_D. Each document is cut into passages of a fixed length, p_1, p_2, ..., p_M, and each passage p_i is in turn a sequence of tokens w_1, w_2, ..., w_|p_i|.
  • Seen at the token level, ODQA therefore has to pull an answer span out of an enormous number of tokens spread over the passages of many documents.
  • In other words, given a question, the retriever must select from C = {p_1, p_2, ..., p_M} a much smaller set C_F of fixed size k that still contains a passage answering the question; a small sketch of the corresponding top-k retrieval accuracy metric follows below.
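As a concrete illustration of that top-k retrieval accuracy metric, here is a minimal Python sketch. It is a simplification: it checks whether any retrieved passage contains an answer string, and the variable names in the commented usage are hypothetical.

```python
from typing import List

def top_k_accuracy(retrieved: List[List[str]], answers: List[List[str]]) -> float:
    """Fraction of questions for which at least one of the k retrieved
    passages contains one of the reference answer strings."""
    hits = 0
    for passages, golds in zip(retrieved, answers):
        if any(gold.lower() in passage.lower() for passage in passages for gold in golds):
            hits += 1
    return hits / len(retrieved)

# Hypothetical usage: retrieved_top20[i] holds the 20 passages returned for question i.
# accuracy_at_20 = top_k_accuracy(retrieved_top20, reference_answers)
```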

 

3. Dense Passage Retriever (DPR)

3.1 Overview

  • Our dense passage retriever (DPR) uses a dense encoder EP(·) which maps any text passage to a d-dimensional real-valued vector and builds an index for all the M passages that we will use for retrieval. At run-time, DPR applies a different encoder EQ(·) that maps the input question to a d-dimensional vector, and retrieves k passages of which vectors are the closest to the question vector. We define the similarity between the question and the passage using the dot product of their vectors.
  • Although more expressive model forms for measuring the similarity between a question and a passage do exist, such as networks consisting of multiple layers of cross attentions, the similarity function needs to be decomposable so that the representations of the collection of passages can be precomputed.
  • Most decomposable similarity functions are some transformations of Euclidean distance (L2). For instance, cosine is equivalent to inner product for unit vectors and the Mahalanobis distance is equivalent to L2 distance in a transformed space.
  • Inner product search has been widely used and studied, as well as its connection to cosine similarity and L2 distance (Mussmann and Ermon, 2016; Ram and Gray, 2012). As our ablation study finds other similarity functions perform comparably (Section 5.2; Appendix B), we thus choose the simpler inner product function and improve the dense passage retriever by learning better encoders.
  • Encoders - Although in principle the question and passage encoders can be implemented by any neural networks, in this work we use two independent BERT (Devlin et al., 2019) networks (base, uncased) and take the representation at the [CLS] token as the output, so d = 768.
  • Inference - During inference time, we apply the passage encoder EP to all the passages and index them using FAISS (Johnson et al., 2017) offline. FAISS is an extremely efficient, open-source library for similarity search and clustering of dense vectors, which can easily be applied to billions of vectors. Given a question q at run-time, we derive its embedding vq = EQ(q) and retrieve the top k passages with embeddings closest to vq.
  • The similarity score between a question and a passage is defined as below.

Similarity score for question and passage (Eq. 1 in the paper): $\mathrm{sim}(q, p) = E_Q(q)^\top E_P(p)$

  • More expressive similarity models exist, but the similarity function must be decomposable so that the passage embeddings can be precomputed.
  • Most decomposable similarity functions are transformations of Euclidean (L2) distance: cosine similarity is equivalent to the inner product for unit vectors, and the Mahalanobis distance is equivalent to L2 distance in a transformed space.
  • Inner product search is well studied and closely related to cosine similarity and L2 distance; since the alternatives perform comparably, the simpler inner product is chosen for training DPR.
  • The encoders are two independent BERT (base, uncased) networks, and the [CLS] token representation is used as the embedding (d = 768).
  • At inference time the passage embeddings are indexed offline with FAISS, which is extremely fast and scales to billions of vectors; a dual-encoder + FAISS sketch follows below.
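To make the dual-encoder retrieval flow concrete, here is a minimal sketch using Hugging Face Transformers and FAISS. It is an illustration under assumptions: plain bert-base-uncased stands in for the trained DPR checkpoints, and a flat inner-product index stands in for whatever index configuration a real deployment would use.

```python
import faiss
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# Two independent BERT-base encoders for questions and passages, as in the paper.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
q_encoder = AutoModel.from_pretrained("bert-base-uncased")
p_encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(encoder, texts):
    """Return the [CLS] representation (d = 768) for a list of texts."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        cls = encoder(**batch).last_hidden_state[:, 0]  # [CLS] token per text
    return np.ascontiguousarray(cls.numpy(), dtype="float32")

passages = [
    "Sala Baker is best known for portraying the villain Sauron in the Lord of the Rings trilogy.",
    "The 8th Dalai Lama was born in Thobgyal, Tsang.",
]

# Offline: encode every passage and build a maximum inner product search (MIPS) index.
index = faiss.IndexFlatIP(768)
index.add(encode(p_encoder, passages))

# Run-time: encode the question and retrieve the k passages closest to it.
scores, ids = index.search(encode(q_encoder, ["Who is the bad guy in lord of the rings?"]), 2)
print([passages[i] for i in ids[0]])
```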

3.2 Training

  • Let $\mathcal{D} = \{\langle q_i, p_i^+, p_{i,1}^-, \cdots, p_{i,n}^- \rangle\}_{i=1}^{m}$ be the training data that consists of $m$ instances. Each instance contains one question $q_i$ and one relevant (positive) passage $p_i^+$, along with $n$ irrelevant (negative) passages $p_{i,j}^-$. We optimize the loss function as the negative log likelihood of the positive passage.

Loss function for training DPR (Eq. 2 in the paper):
$$L(q_i, p_i^+, p_{i,1}^-, \cdots, p_{i,n}^-) = -\log \frac{e^{\mathrm{sim}(q_i, p_i^+)}}{e^{\mathrm{sim}(q_i, p_i^+)} + \sum_{j=1}^{n} e^{\mathrm{sim}(q_i, p_{i,j}^-)}}$$

  • Given (a question, its positive passage, negative passage 1, ..., negative passage n), the loss is the negative log-likelihood of the positive passage under the similarity score defined above; a PyTorch sketch follows below.
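A minimal PyTorch sketch of this negative log-likelihood for a single question, one positive passage, and n negative passages. The tensor shapes and random inputs are purely illustrative.

```python
import torch
import torch.nn.functional as F

def dpr_nll_loss(q_vec, pos_vec, neg_vecs):
    """q_vec: (d,), pos_vec: (d,), neg_vecs: (n, d).
    Negative log-likelihood of the positive passage under a softmax over
    dot-product similarities, with the positive placed at index 0."""
    candidates = torch.cat([pos_vec.unsqueeze(0), neg_vecs])  # (n + 1, d)
    sims = candidates @ q_vec                                 # (n + 1,) similarity scores
    return F.cross_entropy(sims.unsqueeze(0), torch.tensor([0]))

# Illustrative call with random vectors (d = 768, n = 7 negatives).
d, n = 768, 7
loss = dpr_nll_loss(torch.randn(d), torch.randn(d), torch.randn(n, d))
```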

 

Positive and negative passages

  • For retrieval problems, it is often the case that positive examples are available explicitly, while negative examples need to be selected from an extremely large pool.
  • For instance, passages relevant to a question may be given in a QA dataset, or can be found using the answer. All other passages in the collection, while not specified explicitly, can be viewed as irrelevant by default.
  • In practice, how to select negative examples is often overlooked but could be decisive for learning a high-quality encoder. We consider three different types of negatives: (1) Random: any random passage from the corpus; (2) BM25: top passages returned by BM25 which don’t contain the answer but match most question tokens; (3) Gold: positive passages paired with other questions which appear in the training set.
  • We will discuss the impact of different types of negative passages and training schemes in Section 5.2. Our best model uses gold passages from the same mini-batch and one BM25 negative passage. In particular, re-using gold passages from the same batch as negatives can make the computation efficient while achieving great performance. We discuss this approach below.
  • Selecting negative passages matters for the retriever; three types of negatives are considered:
    1. Random: any random passage from the corpus.
    2. BM25: top passages returned by BM25 that do not contain the answer but match many question tokens (a small selection sketch follows after this list).
    3. Gold: positive passages paired with other questions in the training set.
  • The best model uses the gold passages of the other questions in the same mini-batch plus one BM25 negative passage per question; in particular, reusing in-batch gold passages as negatives is computationally very efficient while performing well.
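A sketch of picking BM25 hard negatives, i.e., passages that score highly against the question but do not contain the answer. The rank_bm25 package is used here only as a stand-in (the paper relies on a Lucene BM25 index), and whitespace tokenization is a simplification.

```python
from rank_bm25 import BM25Okapi

def bm25_hard_negatives(question, answer, passages, n_neg=1):
    """Return the top-scoring BM25 passages that do NOT contain the answer string."""
    bm25 = BM25Okapi([p.lower().split() for p in passages])
    scores = bm25.get_scores(question.lower().split())
    ranked = sorted(range(len(passages)), key=lambda i: -scores[i])
    negatives = [passages[i] for i in ranked if answer.lower() not in passages[i].lower()]
    return negatives[:n_neg]

# Illustrative usage with a tiny in-memory corpus.
corpus = ["Sauron is the villain of the Lord of the Rings.",
          "The Lord of the Rings was written by J. R. R. Tolkien.",
          "Paris is the capital of France."]
print(bm25_hard_negatives("who is the bad guy in lord of the rings", "Sauron", corpus))
```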

In-batch negatives

  • Assume that we have B questions in a mini-batch and each one is associated with a relevant passage. Let Q and P be the (B × d) matrices of question and passage embeddings in a batch of size B. S = QPᵀ is a (B × B) matrix of similarity scores, where each row corresponds to a question paired with B passages. In this way, we reuse computation and effectively train on B² (q_i, p_j) question/passage pairs in each batch. Any (q_i, p_j) pair is a positive example when i = j, and negative otherwise. This creates B training instances in each batch, where there are B − 1 negative passages for each question.
  • The trick of in-batch negatives has been used in the full batch setting (Yih et al., 2011) and more recently for mini-batch (Henderson et al., 2017; Gillick et al., 2019). It has been shown to be an effective strategy for learning a dual-encoder model that boosts the number of training examples.
  • Explanation of the in-batch negatives mentioned above.
  • A batch contains B questions and their B positive passages, giving matrices Q = (B, d) and P = (B, d), where d is the model dimension. Q · Pᵀ is then a (B, B) matrix whose diagonal entries are the similarity scores between each question and its own positive passage, while the off-diagonal entries are the scores between a question and negative passages.
  • These scores are plugged directly into the loss function defined earlier, and that is all there is to it (see the sketch after this list).
  • This in-batch negative trick is an effective way to train a dual-encoder model.
  • In the review session the question came up whether in-batch negatives are the same thing as simply using other questions' gold passages as negatives; the difference is illustrated by the two cases below.
    • Case 1) plain training -> the first epoch trains on (q1, p1, p2) and the second epoch trains on exactly the same (q1, p1, p2).
    • Case 2) in-batch negatives -> the first epoch trains on (q1, p1, p2), but in the second epoch the batch composition changes and a different negative is used, e.g., (q1, p1, p3) -> the model is exposed to more distinct examples!
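A minimal sketch of the in-batch negative objective: with B questions and their B positive passages, the (B × B) score matrix has the positives on the diagonal, so the loss reduces to cross-entropy with targets 0..B−1. Shapes and random inputs are illustrative.

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(Q, P):
    """Q: (B, d) question embeddings, P: (B, d) positive passage embeddings.
    Row i of S = Q @ P.T scores question i against all B passages in the batch;
    its positive sits on the diagonal and the other B - 1 passages act as negatives."""
    S = Q @ P.T                        # (B, B) similarity matrix
    targets = torch.arange(Q.size(0))  # index of the positive = diagonal
    return F.cross_entropy(S, targets)

B, d = 128, 768
loss = in_batch_negative_loss(torch.randn(B, d), torch.randn(B, d))
```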

4. Experimental Setup

4.1 Wikipedia Data Pre-processing

  • Following (Lee et al., 2019), we use the English Wikipedia dump from Dec. 20, 2018 as the source documents for answering questions. We first apply the pre-processing code released in DrQA (Chen et al., 2017) to extract the clean, text-portion of articles from the Wikipedia dump. This step removes semi-structured data, such as tables, infoboxes, lists, as well as the disambiguation pages. We then split each article into multiple, disjoint text blocks of 100 words as passages, serving as our basic retrieval units, following (Wang et al., 2019), which results in 21,015,324 passages in the end. Each passage is also prepended with the title of the Wikipedia article where the passage is from, along with an [SEP] token.
  • Wikipedia pre-processing: the same English Wikipedia dump as ORQA is used.
  • The DrQA pre-processing code is applied.
  • Each article is split into 100-word blocks (the basic retrieval units) -> 21,015,324 passages (a splitting sketch follows below).
  • Passage format: title + [SEP] + text.
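A rough sketch of the 100-word splitting and "title [SEP] text" formatting described above. The whitespace tokenization and the exact [SEP] handling are simplifications of the released pre-processing code.

```python
def split_into_passages(title, article_text, words_per_passage=100):
    """Split a cleaned article into disjoint 100-word blocks, each prepended
    with the article title and a [SEP] token."""
    words = article_text.split()
    passages = []
    for start in range(0, len(words), words_per_passage):
        block = " ".join(words[start:start + words_per_passage])
        passages.append(f"{title} [SEP] {block}")
    return passages

# Example: a 250-word article becomes three passages (100 + 100 + 50 words).
print(len(split_into_passages("Sauron", "word " * 250)))
```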

4.2 Question Answering Datasets

  • We use the same five QA datasets and training/dev/testing splitting method as in previous work (Lee et al., 2019).
  • Below we briefly describe each dataset and refer readers to their paper for the details of data preparation.
    • Natural Questions (NQ) (Kwiatkowski et al., 2019) was designed for end-to-end question answering. The questions were mined from real Google search queries and the answers were spans in Wikipedia articles identified by annotators.
    • TriviaQA (Joshi et al., 2017) contains a set of trivia questions with answers that were originally scraped from the Web.
    • WebQuestions (WQ) (Berant et al., 2013) consists of questions selected using Google Suggest API, where the answers are entities in Freebase.
    • CuratedTREC (TREC) (Baudiš and Šedivý, 2015) sources questions from TREC QA tracks as well as various Web sources and is intended for open-domain QA from unstructured corpora.
    • SQuAD v1.1 (Rajpurkar et al., 2016) is a popular benchmark dataset for reading comprehension. Annotators were presented with a Wikipedia paragraph, and asked to write questions that could be answered from the given text. Although SQuAD has been used previously for open-domain QA research, it is not ideal because many questions lack context in absence of the provided paragraph. We still include it in our experiments for providing a fair comparison to previous work and we will discuss more in Section 5.1.
  • Brief description of the five QA datasets used.

4.3 Selection of positive passages

  • Because only pairs of questions and answers are provided in TREC, WebQuestions and TriviaQA, we use the highest-ranked passage from BM25 that contains the answer as the positive passage.
  • If none of the top 100 retrieved passages has the answer, the question will be discarded.
  • For SQuAD and Natural Questions, since the original passages have been split and processed differently than our pool of candidate passages, we match and replace each gold passage with the corresponding passage in the candidate pool. We discard the questions when the matching fails due to different Wikipedia versions or pre-processing. Table 1 shows the number of questions in training/dev/test sets for all the datasets and the actual questions used for training the retriever.

Summary of ODQA dataset

  • TREC, WebQuestions, TriviaQA: only question–answer pairs are provided, so the highest-ranked BM25 passage that contains the answer is taken as the positive passage.
  • SQuAD, Natural Questions: each original gold passage is matched and replaced with the corresponding passage in the pre-processed candidate pool; if no matching positive passage exists, the question is discarded.

 

5. Experiments: Passage Retrieval

  • The DPR model used in our main experiments is trained using the in-batch negative setting (Section 3.2) with a batch size of 128 and one additional BM25 negative passage per question.
  • We trained the question and passage encoders for up to 40 epochs for large datasets (NQ, TriviaQA, SQuAD) and 100 epochs for small datasets (TREC, WQ), with a learning rate of 10⁻⁵ using Adam, linear scheduling with warm-up and dropout rate 0.1.
  • While it is good to have the flexibility to adapt the retriever to each dataset, it would also be desirable to obtain a single retriever that works well across the board. To this end, we train a multi-dataset encoder by combining training data from all datasets excluding SQuAD.
  • In addition to DPR, we also present the results of BM25, the traditional retrieval method, and BM25+DPR, using a linear combination of their scores as the new ranking function. Specifically, we obtain two initial sets of top-2000 passages based on BM25 and DPR, respectively, and rerank the union of them using BM25(q,p) + λ · sim(q, p) as the ranking function. We used λ = 1.1 based on the retrieval accuracy in the development set.
  • See the training settings above; a small sketch of the BM25+DPR score combination follows below.
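A small sketch of the linear score combination used for BM25+DPR. The per-passage score dictionaries are hypothetical inputs; in the paper the union of the two top-2000 lists is reranked with λ = 1.1.

```python
def rerank_bm25_dpr(bm25_scores, dpr_scores, lam=1.1, top_k=100):
    """bm25_scores / dpr_scores: dict mapping passage_id -> score from each retriever.
    Passages missing from one list get a score of 0 there.
    New ranking score = BM25(q, p) + lam * sim(q, p)."""
    candidates = set(bm25_scores) | set(dpr_scores)
    combined = {
        pid: bm25_scores.get(pid, 0.0) + lam * dpr_scores.get(pid, 0.0)
        for pid in candidates
    }
    return sorted(combined, key=combined.get, reverse=True)[:top_k]

# Illustrative usage with made-up scores for three passage ids.
print(rerank_bm25_dpr({"p1": 12.3, "p2": 9.8}, {"p2": 81.0, "p3": 77.5}, top_k=3))
```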

5.1 Main Results

  • Table 2 compares different passage retrieval systems on five QA datasets, using the top-k accuracy (k ∈ {20, 100}). With the exception of SQuAD, DPR performs consistently better than BM25 on all datasets. The gap is especially large when k is small (e.g., 78.4% vs. 59.1% for top-20 accuracy on Natural Questions). When training with multiple datasets, TREC, the smallest dataset of the five, benefits greatly from more training examples. In contrast, Natural Questions and WebQuestions improve modestly and TriviaQA degrades slightly.
  • Results can be improved further in some cases by combining DPR with BM25 in both single- and multi-dataset settings.
  • We conjecture that the lower performance on SQuAD is due to two reasons. First, the annotators wrote questions after seeing the passage. As a result, there is a high lexical overlap between passages and questions, which gives BM25 a clear advantage. Second, the data was collected from only 500+ Wikipedia articles and thus the distribution of training examples is extremely biased, as argued previously by Lee et al. (2019).
  • As the table below shows, DPR outperforms BM25 on every dataset except SQuAD; the gap is especially large when k is small.

Retrieval performance on the five QA datasets (Table 2)

  • With multi-dataset training, TREC, the smallest of the five datasets, benefits the most from the extra training examples.
  • Combining DPR with BM25 can push performance a bit further in some cases.
  • Two reasons for the lower performance on SQuAD:
    1. Annotators wrote the questions after seeing the passage -> high lexical overlap -> BM25 has a clear advantage.
    2. SQuAD was collected from only 500+ Wikipedia articles, so the distribution of training examples is extremely biased.

5.2 Ablation Study on Model Training

 

Sample efficiency

  • We explore how many training examples are needed to achieve good passage retrieval performance. Figure 1 illustrates the top-k retrieval accuracy with respect to different numbers of training examples, measured on the development set of Natural Questions. As is shown, a dense passage retriever trained using only 1,000 examples already outperforms BM25. This suggests that with a general pretrained language model, it is possible to train a high-quality dense retriever with a small number of question–passage pairs. Adding more training examples (from 1k to 59k) further improves the retrieval accuracy consistently.
  • Experiment on how many training examples are needed: as the figure below shows, a dense retriever trained on only 1,000 examples already beats BM25, and retrieval accuracy keeps improving as more training data is added.

Top-k retrieval accuracy for different numbers of training examples (Figure 1)

In-batch negative training

  • We test different training schemes on the development set of Natural Questions and summarize the results in Table 3. The top block is the standard 1-of-N training setting, where each question in the batch is paired with a positive passage and its own set of n negative passages (Eq. (2)). We find that the choice of negatives — random, BM25 or gold passages (positive passages from other questions) — does not impact the top-k accuracy much in this setting when k ≥ 20.
  • The middle block is the in-batch negative training (Section 3.2) setting. We find that using a similar configuration (7 gold negative passages), in-batch negative training improves the results substantially. The key difference between the two is whether the gold negative passages come from the same batch or from the whole training set. Effectively, in-batch negative training is an easy and memory-efficient way to reuse the negative examples already in the batch rather than creating new ones. It produces more pairs and thus increases the number of training examples, which might contribute to the good model performance. As a result, accuracy consistently improves as the batch size grows.
  • Finally, we explore in-batch negative training with additional “hard” negative passages that have high BM25 scores given the question, but do not contain the answer string (the bottom block). These additional passages are used as negative passages for all questions in the same batch. We find that adding a single BM25 negative passage improves the result substantially while adding two does not help further.
  • Explanation of the table below.
  • Given a training instance (q1, p1, p2, p3, ..., p8), where p1 is the positive passage for q1, the question is how to pick the negatives (p2, ..., p8).
  • The top block uses (1) random passages, (2) BM25 passages that match the question but do not contain the answer, or (3) gold passages of other questions; from top-20 on, the choice makes little difference.
  • The middle block uses in-batch negatives, which outperforms the corresponding first-block setting; as explained above, the larger number of effective training pairs seems to be the reason, and accuracy keeps improving as the batch size grows.
  • The bottom block adds hard negatives: write each (q1, p1) pair together with its BM25 hard negative as a triple (q1, p1, n1).
  • With a batch size of 32, the batch holds (q1, p1, n1), ..., (q32, p32, n32).
  • With in-batch negatives, each question is then trained against 1 positive passage, 31 gold-passage negatives and 32 hard negatives, i.e., (q1 | p1, p2, ..., p32, n1, ..., n32), which matches the table below. With two hard negatives per question there are 32 × 2 = 64 hard negatives, so 31 gold-passage negatives and 64 hard negatives.
  • Using a single hard negative per question is recommended; adding a second does not improve results much (a sketch combining in-batch and hard negatives follows below).

comparison of different training schemes
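Below is a minimal sketch of the in-batch scheme extended with one shared BM25 hard negative per question: every question is scored against the B in-batch positives plus the B hard negatives, and the target is still its own positive. Shapes and random inputs are illustrative.

```python
import torch
import torch.nn.functional as F

def in_batch_with_hard_negatives(Q, P_pos, P_hard):
    """Q: (B, d) questions, P_pos: (B, d) gold passages, P_hard: (B, d) one BM25
    hard negative per question, shared as a negative by every question in the batch."""
    P_all = torch.cat([P_pos, P_hard], dim=0)  # (2B, d) candidate passages
    S = Q @ P_all.T                            # (B, 2B) similarity scores
    targets = torch.arange(Q.size(0))          # positives occupy columns 0..B-1
    return F.cross_entropy(S, targets)

B, d = 32, 768
loss = in_batch_with_hard_negatives(torch.randn(B, d), torch.randn(B, d), torch.randn(B, d))
```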

Impact of gold passages

  • We use passages that match the gold contexts in the original datasets (when available) as positive examples (Section 4.2). Our experiments on Natural Questions show that switching to distantly-supervised passages (using the highest-ranked BM25 passage that contains the answer), has only a small impact: 1 point lower top-k accuracy for retrieval. Appendix A contains more details.
  • Comparison: using the highest-ranked BM25 passage from the candidate pool that contains the answer vs. using the originally provided gold passage.
  • When the original gold passage can be matched to a passage in the candidate pool, that passage is used as the positive.
  • The result with distant supervision is only about 1 point lower in top-k accuracy, i.e., almost no difference.
  • In other words, if building exactly matched gold passages is difficult (the long pipeline of cleaning Wikipedia and building the passage pool), distantly-supervised positives are a fine substitute.

Similarity and loss

  • Besides dot product, cosine and Euclidean L2 distance are also commonly used as decomposable similarity functions. We test these alternatives and find that L2 performs comparable to dot product, and both of them are superior to cosine. Similarly, in addition to negative loglikelihood, a popular option for ranking is triplet loss, which compares a positive passage and a negative one directly with respect to a question (Burges et al., 2005). Our experiments show that using triplet loss does not affect the results much. More details can be found in Appendix B.
  • Similarity function: dot product ≈ Euclidean L2 > cosine.
  • Loss function: negative log-likelihood ≈ triplet loss.
  • According to the appendix, dot product + NLL gives the best overall performance.

Cross-dataset generalization 

  • One interesting question regarding DPR’s discriminative training is how much performance degradation it may suffer from a non-iid setting. In other words, can it still generalize well when directly applied to a different dataset without additional fine-tuning? To test the cross-dataset generalization, we train DPR on Natural Questions only and test it directly on the smaller WebQuestions and CuratedTREC datasets. We find that DPR generalizes well, with 3-5 points loss from the best performing fine-tuned model in top-20 retrieval accuracy (69.9/86.3 vs. 75.0/89.1 for WebQuestions and TREC, respectively), while still greatly outperforming the BM25 baseline (55.0/70.9).
  • Does DPR still work well in a non-i.i.d. setting?
  • In other words, does a DPR model trained on one dataset work well on a completely different dataset without any fine-tuning?
  • It does: top-20 accuracy drops only 3-5 points compared with the best fine-tuned model and still far exceeds the BM25 baseline.

5.3 Qualitative Analysis

  • Although DPR performs better than BM25 in general, passages retrieved by these two methods differ qualitatively. Term-matching methods like BM25 are sensitive to highly selective keywords and phrases, while DPR captures lexical variations or semantic relationships better. See Appendix C for examples and more discussion.
  • DPR and BM25 each have their own strengths.
  • BM25: term matching on highly selective keywords and phrases.
  • DPR: lexical variations and semantic relationships.
  • Consider carefully which one to use depending on the dataset!

5.4 Run-time Efficiency

  • The main reason that we require a retrieval component for open-domain QA is to reduce the number of candidate passages that the reader needs to consider, which is crucial for answering user’s questions in real-time. We profiled the passage retrieval speed on a server with Intel Xeon CPU E5-2698 v4 @ 2.20GHz and 512GB memory. With the help of FAISS in-memory index for real-valued vectors10 , DPR can be made incredibly efficient, processing 995.0 questions per second, returning top 100 passages per question. In contrast, BM25/Lucene (implemented in Java, using file index) processes 23.7 questions per second per CPU thread.
  • On the other hand, the time required for building an index for dense vectors is much longer. Computing dense embeddings on 21-million passages is resource intensive, but can be easily parallelized, taking roughly 8.8 hours on 8 GPUs. However, building the FAISS index on 21-million vectors on a single server takes 8.5 hours. In comparison, building an inverted index using Lucene is much cheaper and takes only about 30 minutes in total
  • FAISS: inference is very fast, but building the index takes a long time.
  • Lucene is the opposite.

6. Experiments: Question Answering

6.1 End-to-end QA System

  • We implement an end-to-end question answering system in which we can plug different retriever systems directly. Besides the retriever, our QA system consists of a neural reader that outputs the answer to the question. Given the top k retrieved passages (up to 100 in our experiments), the reader assigns a passage selection score to each passage. In addition, it extracts an answer span from each passage and assigns a span score. The best span from the passage with the highest passage selection score is chosen as the final answer. The passage selection model serves as a reranker through cross-attention between the question and the passage. Although cross-attention is not feasible for retrieving relevant passages in a large corpus due to its non-decomposable nature, it has more capacity than the dual-encoder model sim(q, p) as in Eq. (1). Applying it to selecting the passage from a small number of retrieved candidates has been shown to work well (Wang et al., 2019, 2018; Lin et al., 2018).
  • Specifically, let $\mathbf{P}_i \in \mathbb{R}^{L \times h}$ ($1 \le i \le k$) be a BERT (base, uncased in our experiments) representation for the $i$-th passage, where $L$ is the maximum length of the passage and $h$ the hidden dimension. The probabilities of a token being the starting/ending positions of an answer span and a passage being selected are defined as:
$$P_{\mathrm{start},i}(s) = \mathrm{softmax}(\mathbf{P}_i \mathbf{w}_{\mathrm{start}})_s, \quad P_{\mathrm{end},i}(t) = \mathrm{softmax}(\mathbf{P}_i \mathbf{w}_{\mathrm{end}})_t, \quad P_{\mathrm{selected}}(i) = \mathrm{softmax}(\hat{\mathbf{P}}^\top \mathbf{w}_{\mathrm{selected}})_i$$
  • where $\hat{\mathbf{P}} = [\mathbf{P}_1^{[\mathrm{CLS}]}, \ldots, \mathbf{P}_k^{[\mathrm{CLS}]}] \in \mathbb{R}^{h \times k}$ and $\mathbf{w}_{\mathrm{start}}, \mathbf{w}_{\mathrm{end}}, \mathbf{w}_{\mathrm{selected}} \in \mathbb{R}^h$ are learnable vectors. We compute a span score of the $s$-th to $t$-th words from the $i$-th passage as $P_{\mathrm{start},i}(s) \times P_{\mathrm{end},i}(t)$, and a passage selection score of the $i$-th passage as $P_{\mathrm{selected}}(i)$.
  • During training, we sample one positive and m̃ − 1 negative passages from the top 100 passages returned by the retrieval system (BM25 or DPR) for each question. m̃ is a hyper-parameter and we use m̃ = 24 in all the experiments. The training objective is to maximize the marginal log-likelihood of all the correct answer spans in the positive passage (the answer string may appear multiple times in one passage), combined with the log-likelihood of the positive passage being selected. We use the batch size of 16 for large (NQ, TriviaQA, SQuAD) and 4 for small (TREC, WQ) datasets, and tune k on the development set. For experiments on small datasets under the Multi setting, in which using other datasets is allowed, we fine-tune the reader trained on Natural Questions to the target dataset. All experiments were done on eight 32GB GPUs.
  • The retrieved passages are scored against the question via cross-attention, and the highest-scoring answer span is taken from the passage with the highest selection score.
  • Because cross-attention is not decomposable, it is not a practical option for retrieving passages from the whole corpus, but it works well when applied to the relatively small set of top-k candidates.
  • Scores are computed as below:
    • w_* -> learnable vectors
    • P_i : (L × h) token representations of the i-th passage (max passage length × hidden dimension)
    • P_hat : [CLS] token embeddings of the k retrieved passages

Score equations for answer span in each passage and passage ranker

  • One positive passage + 23 negative passages (sampled from the top 100 returned by BM25 or DPR) -> one training instance.
  • The objective maximizes the log-likelihood of the answer spans plus the log-likelihood of the positive passage being selected:
    • log-likelihood of (P_start × P_end) + log-likelihood of P_selected
  • Experiments are run with this setup; a sketch of the score computation follows below.
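A minimal sketch of the span and passage-selection scores defined above: one linear head per score applied to the BERT token representations. The single-vector heads, the absence of padding masks, and the random inputs are simplifications for illustration.

```python
import torch
import torch.nn.functional as F

def reader_scores(passage_reps, w_start, w_end, w_selected):
    """passage_reps: (k, L, h) BERT token representations for k retrieved passages.
    Returns per-token start/end probabilities and a selection probability per passage."""
    p_start = F.softmax(passage_reps @ w_start, dim=1)  # (k, L)
    p_end = F.softmax(passage_reps @ w_end, dim=1)      # (k, L)
    cls = passage_reps[:, 0, :]                         # (k, h) [CLS] vectors = P_hat
    p_selected = F.softmax(cls @ w_selected, dim=0)     # (k,)
    return p_start, p_end, p_selected

k, L, h = 100, 256, 768
p_start, p_end, p_sel = reader_scores(
    torch.randn(k, L, h), torch.randn(h), torch.randn(h), torch.randn(h))
# Span score for tokens s..t of passage i: p_start[i, s] * p_end[i, t];
# passage selection score for passage i: p_sel[i].
```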

6.2 Results

  • Table 4 summarizes our final end-to-end QA results, measured by exact match with the reference answer after minor normalization as in (Chen et al., 2017; Lee et al., 2019). From the table, we see that higher retriever accuracy typically leads to better final QA results: in all cases except SQuAD, answers extracted from the passages retrieved by DPR are more likely to be correct, compared to those from BM25. For large datasets like NQ and TriviaQA, models trained using multiple datasets (Multi) perform comparably to those trained using the individual training set (Single). Conversely, on smaller datasets like WQ and TREC, the multi-dataset setting has a clear advantage. Overall, our DPR-based models outperform the previous state-of-the-art results on four out of the five datasets, with 1% to 12% absolute differences in exact match accuracy. It is interesting to contrast our results to those of ORQA (Lee et al., 2019) and also the concurrently developed approach, REALM (Guu et al., 2020). While both methods include additional pretraining tasks and employ an expensive end-to-end training regime, DPR manages to outperform them on both NQ and TriviaQA, simply by focusing on learning a strong passage retrieval model using pairs of questions and answers. The additional pretraining tasks are likely more useful only when the target training sets are small. Although the results of DPR on WQ and TREC in the single-dataset setting are less competitive, adding more question–answer pairs helps boost the performance, achieving the new state of the art.
  • To compare our pipeline training approach with joint learning, we run an ablation on Natural Questions where the retriever and reader are jointly trained, following Lee et al. (2019). This approach obtains a score of 39.8 EM, which suggests that our strategy of training a strong retriever and reader in isolation can leverage effectively available supervision, while outperforming a comparable joint training approach with a simpler design (Appendix D).
  • One thing worth noticing is that our reader does consider more passages compared to ORQA, although it is not completely clear how much more time it takes for inference. While DPR processes up to 100 passages for each question, the reader is able to fit all of them into one batch on a single 32GB GPU, thus the latency remains almost identical to the single passage case (around 20ms). The exact impact on throughput is harder to measure: ORQA uses 2-3x longer passages compared to DPR (288 word pieces compared to our 100 tokens) and the computational complexity is superlinear in passage length. We also note that we found k = 50 to be optimal for NQ, and k = 10 leads to only marginal loss in exact match accuracy (40.8 vs. 41.5 EM on NQ), which should be roughly comparable to ORQA’s 5-passage setup.
  • Explanation of the table below.

End-to-End performance on ODQA

  • A better retriever leads to better end-to-end QA results.
  • DPR > BM25 (with the exception of SQuAD).
  • On large datasets, multi-dataset training performs comparably to single-dataset training.
  • On small datasets, multi-dataset training clearly improves performance.
  • DPR performs well without the pretraining used by ORQA and REALM.
  • Extra pretraining may only help when the target training set is small, but simply adding more question–answer pairs (the multi-dataset setting) yields even better results.
  • Training the retriever and reader separately outperforms comparable joint learning.
  • The DPR reader considers more passages than ORQA.
  • Throughput is hard to compare exactly (ORQA uses 2-3x longer passages).

7. Related Work

  • Passage retrieval has been an important component for open-domain QA (Voorhees, 1999). It not only effectively reduces the search space for answer extraction, but also identifies the support context for users to verify the answer. Strong sparse vector space models like TF-IDF or BM25 have been used as the standard method applied broadly to various QA tasks (e.g., Chen et al., 2017; Yang et al., 2019a,b; Nie et al., 2019; Min et al., 2019a; Wolfson et al., 2020). Augmenting text-based retrieval with external structured information, such as knowledge graph and Wikipedia hyperlinks, has also been explored recently (Min et al., 2019b; Asai et al., 2020).
  • The use of dense vector representations for retrieval has a long history since Latent Semantic Analysis (Deerwester et al., 1990). Using labeled pairs of queries and documents, discriminatively trained dense encoders have become popular recently (Yih et al., 2011; Huang et al., 2013; Gillick et al., 2019), with applications to cross-lingual document retrieval, ad relevance prediction, Web search and entity retrieval.  Such approaches complement the sparse vector methods as they can potentially give high similarity scores to semantically relevant text pairs, even without exact token matching. The dense representation alone, however, is typically inferior to the sparse one. While not the focus of this work, dense representations from pretrained models, along with cross-attention mechanisms, have also been shown effective in passage or dialogue re-ranking tasks (Nogueira and Cho, 2019; Humeau et al., 2020). Finally, a concurrent work (Khattab and Zaharia, 2020) demonstrates the feasibility of full dense retrieval in IR tasks. Instead of employing the dual-encoder framework, they introduced a late-interaction operator on top of the BERT encoders.
  • Dense retrieval for open-domain QA has been explored by Das et al. (2019), who propose to retrieve relevant passages iteratively using reformulated question vectors. As an alternative approach that skips passage retrieval, Seo et al. (2019) propose to encode candidate answer phrases as vectors and directly retrieve the answers to the input questions efficiently. Using additional pretraining with the objective that matches surrogates of questions and relevant passages, Lee et al. (2019) jointly train the question encoder and reader. Their approach outperforms the BM25 plus reader paradigm on multiple open-domain QA datasets in QA accuracy, and is further extended by REALM (Guu et al., 2020), which includes tuning the passage encoder asynchronously by re-indexing the passages during training. The pretraining objective has also recently been improved by Xiong et al. (2020b). In contrast, our model provides a simple and yet effective solution that shows stronger empirical performance, without relying on additional pretraining or complex joint training schemes.
  • DPR has also been used as an important module in very recent work. For instance, extending the idea of leveraging hard negatives, Xiong et al. (2020a) use the retrieval model trained in the previous iteration to discover new negatives and construct a different set of examples in each training iteration. Starting from our trained DPR model, they show that the retrieval performance can be further improved.  Recent work (Izacard and Grave, 2020; Lewis et al., 2020b) have also shown that DPR can be combined with generation models such as BART (Lewis et al., 2020a) and T5 (Raffel et al., 2019), achieving good performance on open-domain QA and other knowledge-intensive tasks. 
  • ODQA history

8. Conclusion

  • In this work, we demonstrated that dense retrieval can outperform and potentially replace the traditional sparse retrieval component in open-domain question answering. While a simple dual-encoder approach can be made to work surprisingly well, we showed that there are some critical ingredients to training a dense retriever successfully. Moreover, our empirical analysis and ablation studies indicate that more complex model frameworks or similarity functions do not necessarily provide additional values. As a result of improved retrieval performance, we obtained new state-of-the-art results on multiple open-domain question answering benchmarks.

 
