This blog gives an overview of the question answering problem, the various categories of questions and answers, and existing question answering datasets.


Document-level question answering is an active area of research in Information Retrieval and Natural Language Processing. The objective is to find the answer to a natural language question, given a context. The context may be a paragraph, a document, or a set of documents. The problem can also be seen as a reading comprehension problem [4].

Different Kinds of Question Answering

The question answering problem can be categorized from the perspective of the questions and of the answers, as below.

Based on Questions

  • Factoid Questions - Questions that can be answered with simple facts expressed in short text answers are said to be factoid questions [1].

    • The answers are short strings expressing a personal name, a location, or a temporal expression. A temporal expression denotes time: a particular point in time, a duration, or a frequency. The answers to factoid questions are typically named entities.

    • Example

      • personal name

        • Who is the Prime Minister of country X?
      • location

        • What is the capital of the country X?
      • temporal expression

        • When was the scientist X born?
        • How often are presidential elections held in country X?
  • Non-Factoid Questions - Questions that ask for complex answers such as descriptions, opinions, explanations, suggestions, or interpretations, which are mostly passage-level texts, are said to be non-factoid questions. Causal questions, hypothetical questions, and complex questions fall under non-factoid questions.

    1. Causal questions - questions that need answers describing an entity, such as reasons, explanations, or elaborations related to particular objects or events, are said to be causal questions [2].

      • What-questions

        • Example: What caused twentieth-century revolutions?
      • Why-questions

        • Example: Why do humans get cancer?
      • How-questions

        • Example: How do male penguins survive without eating for four months?
    2. Hypothetical questions - questions that need information associated with a hypothetical event are said to be hypothetical questions [2]. These questions do not have specific answers. They are based on events that could happen, and so require the respondent to express how he or she would handle a specific event or respond to a specific situation that has not occurred but, hypothetically, could occur.

      • Example: If you could be the CEO of any company, what company would you choose?
    3. Complex questions - questions that need some reasoning, mathematical analysis, or deduction to answer are said to be complex questions [2].

      • Example: A person P is the sister of Q. Q is the daughter of R and R is the son of S. If T is the father of S, then how is Q related to T, and how is P related to S?
  • Other Categories

    1. List-type questions - questions that need a list of facts or entities as answers are said to be list-type questions [2].
      • Example: List the names of movies released in 2017.
    2. Confirmation questions - questions that need answers in the form of yes or no are said to be confirmation questions [2].
      • Example: Is Abdul Kalam a scientist?
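
The categories above can often be guessed from surface cues such as the first word of the question. The following is a minimal sketch of that idea, not a method from the cited surveys: the function name `categorize_question` and the cue lists are assumptions made for illustration, and complex questions (which need reasoning rather than a cue word) are deliberately left as "unknown".

```python
import re

# Hypothetical cue lists, chosen only to cover the example questions in this post.
CONFIRMATION_STARTS = {"is", "are", "was", "were", "do", "does", "did", "can", "will"}
LIST_STARTS = {"list", "name", "enumerate"}
QUANTITY_HOW = ("how often", "how many", "how much", "how long")


def categorize_question(question: str) -> str:
    """Guess a coarse question category from surface cues only."""
    q = question.strip().lower()
    first = re.split(r"\W+", q, maxsplit=1)[0]

    if first == "if":
        return "hypothetical"
    if first in LIST_STARTS:
        return "list"
    if first in CONFIRMATION_STARTS:
        return "confirmation"
    # "Why ..." and open-ended "How ..." questions ask for explanations.
    if first in {"why", "how"} and not q.startswith(QUANTITY_HOW):
        return "causal"
    # "What caused ..." style questions also ask for reasons.
    if first == "what" and re.search(r"\bcaus", q):
        return "causal"
    if first in {"who", "where", "when", "what", "which", "how"}:
        return "factoid"
    return "unknown"


if __name__ == "__main__":
    for text in [
        "Who is the Prime Minister of country X?",
        "Why do humans get cancer?",
        "Is Abdul Kalam a scientist?",
    ]:
        print(text, "->", categorize_question(text))
```

A rule-based guess like this is brittle; it is only meant to make the distinctions between the categories concrete.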

Based on Answering method

  • Extractive answering:
    If the answering model produces answers that are copied word for word from the context, it is called extractive answering. In other words, the model selects phrases or sentences from the context as answers.

    To find answers, the answering model has to go through the context. The answers may be found directly as is, or may be scattered across the context. Sometimes coreference resolution is required.

    • Answers with multi-hop reasoning
      • The answer to a question is spread across the context, and the context has to be followed through intermediate answers to arrive at the final answer.
      • Sample data: WikiHop
    • Answers with coreference resolution
      • The context may mention the answer only through a referring expression (for example, a pronoun). While answering, such mentions have to be replaced with the entity they refer to in order to arrive at the answer.
  • Abstractive/Generative answering:
    The answering model can rewrite the information in the context documents as needed.
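
To make the extractive case concrete, here is a minimal sketch that picks the context sentence sharing the most words with the question and returns it verbatim. It is a toy word-overlap baseline, not how any particular published model works; the names `extract_answer`, `tokens`, and `STOPWORDS` are assumptions made for this example.

```python
import re

# Small hypothetical stopword list; real systems use larger ones or learned models.
STOPWORDS = {
    "a", "an", "the", "is", "was", "of", "in", "on", "to",
    "what", "who", "when", "where", "which", "did", "do", "does",
}


def tokens(text: str) -> set:
    """Lowercased content words of a text."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def extract_answer(question: str, context: str) -> str:
    """Return the context sentence with the largest word overlap with the question.

    The returned text is copied from the context unchanged, which is what
    makes the approach extractive rather than abstractive.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    if not sentences:
        return ""
    question_words = tokens(question)
    return max(sentences, key=lambda s: len(question_words & tokens(s)))


if __name__ == "__main__":
    context = (
        "Marie Curie was born in Warsaw. "
        "She won the Nobel Prize in Physics in 1903. "
        "She later moved to Paris."
    )
    print(extract_answer("Where was Marie Curie born?", context))
    print(extract_answer("When did she win the Nobel Prize?", context))
```

Note that the second sentence of the sample context only says "She"; a system that must answer "Who won the Nobel Prize in Physics in 1903?" would need coreference resolution to map "She" back to the named person.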

Based on Answers required

  • Cloze style - Cloze style answering means that the question includes a blank, and the missing word that fills the blank has to be inferred as the answer.

  • Detail style - Detail Style answering means the answer must be extracted or generated according to the context for a question.
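
A cloze-style item can be sketched as a sentence with the answer replaced by a blank. The helpers below (`make_cloze` and `fill_cloze`, both hypothetical names) only illustrate the shape of the data; they do not infer the missing word.

```python
BLANK = "_____"


def make_cloze(sentence: str, answer: str, blank: str = BLANK) -> str:
    """Turn a sentence into a cloze question by blanking out the answer once."""
    if answer not in sentence:
        raise ValueError("answer must occur in the sentence")
    return sentence.replace(answer, blank, 1)


def fill_cloze(cloze: str, candidate: str, blank: str = BLANK) -> str:
    """Fill the blank of a cloze question with a candidate answer."""
    return cloze.replace(blank, candidate, 1)


if __name__ == "__main__":
    sentence = "Paris is the capital of France."
    question = make_cloze(sentence, "Paris")
    print(question)
    print(fill_cloze(question, "Paris") == sentence)
```

In the detail style, by contrast, the question is a full natural language question ("What is the capital of France?") and the answer is extracted or generated from the context.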

Based on Source of answer in a Question answering System

  • Open domain Question Answering

    • The questions are not limited to predefined domains. Ideally, the system should be able to search through a very large number of text documents to find the answer. Question answering systems of this type are said to be open-domain question answering systems [2].
    • Sample data: SQuAD
  • Closed domain Question Answering

    • When questions are bound to a specific domain, the question answering system is said to be a closed-domain (restricted-domain) question answering system [2].
    • Sample data: PubMedQA
      • PubMed comprises biomedical literature and life science journals.
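
Searching through many documents for the one most likely to contain the answer can be sketched with a simple TF-IDF-style overlap score. This is an illustrative baseline under stated assumptions, not the retrieval method of any dataset named here; `rank_documents`, the smoothing formula, and the three toy documents are all made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_documents(question: str, documents: dict) -> list:
    """Return document ids sorted by a TF-IDF overlap score, best first."""
    doc_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    n_docs = len(documents)

    def idf(term: str) -> float:
        # Smoothed inverse document frequency: rarer terms weigh more.
        df = sum(1 for counts in doc_counts.values() if term in counts)
        return math.log((1 + n_docs) / (1 + df)) + 1.0

    question_terms = set(tokenize(question))
    scores = {
        doc_id: sum(counts[term] * idf(term) for term in question_terms)
        for doc_id, counts in doc_counts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    documents = {
        "geo": "Canberra is the capital of Australia.",
        "bio": "Insulin regulates blood glucose levels.",
        "astro": "Mars is the fourth planet from the Sun.",
    }
    print(rank_documents("What is the capital of Australia?", documents))
```

In a closed-domain setting the same ranking would simply run over a collection restricted to one domain, for example only biomedical documents.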

Existing Datasets

Question answering datasets released from 2018 to 2019 are listed below.

| Period | Dataset Name | Description |
|---|---|---|
| Jun 2018 | SQuAD 2.0 | Context is a paragraph from a set of Wikipedia articles, with detail-style question-answer pairs |
| Aug 2018 | QuAC | Context is an article; the question-answer pairs form a dialog that seeks information over a series of questions |
| Aug 2018 | CoQA | Context is a text passage; the task is to answer a series of interconnected questions that appear in a conversation |
| Aug 2018 | ShARC | Questions are underspecified (the answer cannot be retrieved directly) and can be answered using the supporting rule text in the context of a conversation |
| Sep 2018 | HotpotQA | Questions can be answered by extracting relevant facts and performing the necessary comparison over several paragraphs of a context |
| Jan 2019 | Natural Questions | Context is an entire Wikipedia article that may or may not contain the answer to the question |
| Jan 2019 | MedQuAD | Context is specific to diseases, drugs, and other medical entities |
| Apr 2019 | DROP | Questions can be answered by resolving references in the question and performing discrete operations over them (such as addition, counting, or sorting) |
| Jul 2019 | ELI5 | Questions require explanatory multi-sentence answers |
| Jul 2019 | TweetQA | Context is tweets used by journalists to write news articles |
| Sep 2019 | PubMedQA | Context is specific to the biomedical domain |
| Sep 2019 | WIQA | Questions contain a perturbation and a possible effect in the context of a paragraph |

A question answering dataset comprises contexts, questions, and answers. Earlier datasets contained a significantly higher proportion of factoid questions. As research advanced, datasets came to be collected intentionally to target a particular category of question answering, such as multi-hop answering, coreference-based answering, and answering that requires reasoning.
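
The context, question, and answer triple can be sketched as a small record type. The class below is an assumption made for illustration and does not reproduce the file format of any dataset in the table; the optional answer reflects datasets whose context may or may not contain the answer.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class QAExample:
    """One hypothetical dataset record: a context, a question, and an answer."""

    context: str
    question: str
    answer: Optional[str]  # None when the context does not contain an answer

    def answer_span(self) -> Optional[Tuple[int, int]]:
        """Character offsets (start, end) of the answer if it appears verbatim."""
        if self.answer is None:
            return None
        start = self.context.find(self.answer)
        if start == -1:
            # The answer is not a literal substring, e.g. a generated answer.
            return None
        return start, start + len(self.answer)


if __name__ == "__main__":
    example = QAExample(
        context="The Nile flows through Egypt.",
        question="Which country does the Nile flow through?",
        answer="Egypt",
    )
    span = example.answer_span()
    print(span, example.context[span[0]:span[1]])
```

A span that can be located this way corresponds to extractive answering; an answer with no literal span in the context corresponds to the abstractive case.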

Refer to this spreadsheet for additional information such as the base paper, source of data, and supporting organization.


  1. A Survey on Machine Reading Comprehension Systems
  2. A Survey on Types of Question Answering System
  3. Spreadsheet having Question and Answering Datasets information during the period from 2013 to 2016
  4. Machine Reading comprehension Vs Question Answering