QA-Sci-Inf

Motivation

The number of published scientific articles has grown exponentially in the last few decades. In arXiv alone, the number of new submissions per year has increased from 40,000 to 120,000 from 2003 to 2017 in the computer science domain. As a result, it is impossible for researchers to benefit from all published articles and it has become increasingly difficult to explore all relevant information to their research.

In recent years, Natural Language Processing has emerged as a key technology to aid researchers in dealing with the information overload that is presented by these increasing numbers of scientific articles. In this project, we investigate several foundational technologies for Question Answering (QA) for scientific information over different types of data such as tables and texts. This effort will contribute to making the huge amount of information in scientific articles more accessible to researchers.

Goals

Our goal is to provide innovations that induce a shift from general-purpose question answering to specialised, context-aware, hybrid QA systems with generation capabilities. The resulting technology will significantly benefit researchers and provide them with the means to more efficiently explore a huge amount of published articles. Specifically we focus on the following objectives:

Collect a novel QA corpus for the scientific domain in order to enable the development of specialised systems in this area.
Incorporate contextual information in information retrieval. To answer a question about an article, the model needs to capture the article content, which is long and complex.
Collect a table to text dataset based on tables in scientific articles.
Develop reasoning-aware generation models to describe scientific tables. Some questions about scientific articles can be only answered by using tables. Therefore, the model needs to reason over the content of scientific tables and generate corresponding descriptions.

The image below shows a prototypical application of the developed technologies.