Introduction

A question answering (Q&A) system is an information retrieval system that answers a user's natural language question from a set of documents fed to the system.

Problem Definition

Manually reading documents to find the answer to a question is practical only when there are few documents. As the number of documents grows, finding an answer becomes tedious. The objective is to design an application that finds the answer to a question by searching across the documents in a corpus.

System Design

The system has two components:

  1. Document Indexer
  2. Retriever

Document Indexer:

Given a pile of documents, the system captures each document's name and its page-wise paragraphs and indexes them into Elasticsearch. This enables very quick retrieval during automated Q&A. Indexing is a mandatory one-time step for bringing documents into the system.
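The indexing step can be sketched roughly as follows. This is a minimal illustration, not the project's actual indexer: the index name `qa_paragraphs`, the field names (`document_name`, `page`, `paragraph`), the ID scheme and the blank-line paragraph splitting are all assumptions made for the example. It builds one bulk-index action per paragraph, in the format accepted by the bulk helper of the official Elasticsearch Python client.

```python
from typing import Dict, Iterator, List

INDEX_NAME = "qa_paragraphs"  # hypothetical index name


def split_paragraphs(page_text: str) -> List[str]:
    """Split one page of text into paragraphs on blank lines (an assumed rule)."""
    blocks = page_text.split("\n\n")
    # Collapse internal line breaks and runs of whitespace inside each paragraph.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def build_index_actions(doc_name: str, pages: List[str]) -> Iterator[Dict]:
    """Yield one bulk-index action per paragraph, tagged with document and page."""
    for page_no, page_text in enumerate(pages, start=1):
        for para_no, paragraph in enumerate(split_paragraphs(page_text), start=1):
            yield {
                "_index": INDEX_NAME,
                "_id": f"{doc_name}:{page_no}:{para_no}",  # hypothetical ID scheme
                "_source": {
                    "document_name": doc_name,
                    "page": page_no,
                    "paragraph": paragraph,
                },
            }


# Made-up sample input: one string of extracted text per page.
pages = [
    "First paragraph on page one.\n\nSecond paragraph\nwrapped over two lines.",
    "Only paragraph on page two.",
]
actions = list(build_index_actions("contract.pdf", pages))
for action in actions:
    print(action["_id"], "->", action["_source"]["paragraph"])

# With a running cluster, the actions could be sent with the official client:
#   from elasticsearch import Elasticsearch, helpers
#   helpers.bulk(Elasticsearch("http://localhost:9200"), actions)
```

Keeping the document name and page number next to each paragraph means a retrieved paragraph can always be traced back to its source.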

Retriever:

Once indexing is complete, the system can be queried with a user's natural language question, and the retriever finds the answer to it. The retriever performs two tasks: identifying the top 'n' relevant paragraphs for the question across the corpus, and detecting the answer within those paragraphs.

1. Elasticsearch based Paragraph Retriever

Based on the natural language query, the system fetches the relevant contextual paragraphs using the metadata indexed in Elasticsearch.
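A paragraph lookup of this kind might look like the sketch below. It assumes each indexed record has hypothetical `document_name`, `page` and `paragraph` fields, and it uses a plain full-text `match` query; the query the project actually issues is not described here. The sample response is made-up data that mimics the shape of an Elasticsearch search response.

```python
from typing import Dict, List


def build_paragraph_query(question: str, top_n: int = 5) -> Dict:
    """Build a search body that full-text matches the question against paragraphs."""
    return {
        "size": top_n,  # return only the top 'n' hits
        "query": {"match": {"paragraph": {"query": question}}},
    }


def extract_paragraphs(response: Dict) -> List[Dict]:
    """Flatten a search response into simple paragraph records."""
    return [
        {
            "document_name": hit["_source"]["document_name"],
            "page": hit["_source"]["page"],
            "paragraph": hit["_source"]["paragraph"],
            "retrieval_score": hit["_score"],
        }
        for hit in response["hits"]["hits"]
    ]


body = build_paragraph_query("What is the notice period?", top_n=3)

# With a running cluster the body would be passed to the client's search call;
# the exact call signature depends on the client version.
# Made-up response for illustration only:
sample_response = {
    "hits": {
        "hits": [
            {
                "_score": 7.2,
                "_source": {
                    "document_name": "contract.pdf",
                    "page": 4,
                    "paragraph": "The notice period is thirty days.",
                },
            }
        ]
    }
}
records = extract_paragraphs(sample_response)
print(records[0]["document_name"], records[0]["page"], records[0]["paragraph"])
```

Limiting `size` to a small 'n' keeps the number of paragraphs passed to the slower answer detection model low.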

2. BERT based Answer detector

Once a set of contextual paragraphs has been identified, their features are extracted and fed to an ML model. The model we used for this task is a Transformer-based BERT model fine-tuned on the SQuAD dataset.

For every paragraph, the model predicts an answer with a corresponding probability score. The answers with the top-n probability values are suggested as the detected answers to the given question.
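The ranking step can be illustrated with a small self-contained sketch. A BERT question answering model produces a start logit and an end logit for every token; the scoring rule used here (product of the softmax start and end probabilities, with a cap on answer length) is one common choice and an assumption, not necessarily what this system does. The tokens and logits below are made-up numbers standing in for real model output.

```python
import math
from typing import Dict, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 15) -> Tuple[int, int, float]:
    """Return (start, end, probability) of the highest-scoring valid span."""
    start_probs = softmax(start_logits)
    end_probs = softmax(end_logits)
    best = (0, 0, 0.0)
    for i, p_start in enumerate(start_probs):
        # The end token may not precede the start token.
        for j in range(i, min(i + max_answer_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best[2]:
                best = (i, j, score)
    return best


def top_n_answers(candidates: List[Dict], n: int = 3) -> List[Dict]:
    """Keep the n candidate answers with the highest probability."""
    return sorted(candidates, key=lambda c: c["probability"], reverse=True)[:n]


# Made-up model output for two retrieved paragraphs.
paragraphs = [
    {"tokens": ["the", "notice", "period", "is", "thirty", "days"],
     "start_logits": [0.1, 0.2, 0.1, 0.3, 4.0, 0.5],
     "end_logits": [0.1, 0.1, 0.2, 0.1, 0.6, 4.5]},
    {"tokens": ["payment", "is", "due", "monthly"],
     "start_logits": [0.5, 0.2, 0.3, 1.0],
     "end_logits": [0.2, 0.1, 0.4, 1.2]},
]

candidates = []
for para in paragraphs:
    start, end, prob = best_span(para["start_logits"], para["end_logits"])
    answer = " ".join(para["tokens"][start:end + 1])
    candidates.append({"answer": answer, "probability": prob})

ranked = top_n_answers(candidates, n=2)
for item in ranked:
    print(f"{item['answer']!r} (p={item['probability']:.3f})")
```

Because each paragraph is scored independently, probabilities from different paragraphs are only roughly comparable, so the top-n list is best read as a set of suggestions rather than a strict ordering.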

Applications

  • Unstructured information retrieval, such as finding the law that applies to a scenario in a court summary document.
  • Finding information about a clause in an organisation's terms and conditions or contract documents.

Observations:

The pre-trained BERT model, fine-tuned on the SQuAD 2.0 dataset, was analysed for question answering. The performance of the system depends on the nature of the question being asked. Information on the various types of questions can be found here.

We observed that our system performs well at finding answers to factoid questions. Most questions in the SQuAD 2.0 dataset are of the factoid type.

The project source code can be accessed here.

Machine Comprehension with pytorch-transformers
Step-by-step guide to finetune and use question and answering models with pytorch-transformers