Our paper “Summarization Beyond News: The Automatically Acquired Fandom Corpora” was accepted at LREC 2020

We propose a way to automatically construct non-news summarization corpora and create three corpora using that approach

2020/03/24

Large corpora for training state-of-the-art neural abstractive summarization models are mostly limited to the news genre, since human-written summaries for other types of text are expensive to acquire at scale. We present a novel automatic corpus construction approach to tackle this issue, along with three new large open-licensed summarization corpora built with it that can be used to train abstractive summarization models. The constructed corpora contain fictional narratives, descriptive texts, and summaries about movie, television, and book series from different domains. All sources are published under Creative Commons licenses, so we can provide the corpora for download. We also provide a ready-to-use framework that implements our automatic construction approach and lets you create custom corpora with desired parameters such as the target summary length and the number of source documents per summary. The main idea behind our automatic construction approach is to use existing large text collections (e.g., thematic wikis), automatically classify whether their texts can serve as (query-focused) multi-document summaries, and align them with potential source texts. We demonstrate the usefulness of our approach by running state-of-the-art summarizers on the corpora and through a manual evaluation with human annotators.
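To make the construction idea more concrete, here is a minimal Python sketch of that pipeline: filter a text collection for texts that look like usable summaries, then align each candidate with its most similar source documents. The function names, the length filter, and the word-overlap alignment score below are illustrative assumptions for this sketch, not the actual implementation in our framework.

```python
# Minimal sketch of the corpus-construction idea: take texts from an existing
# collection (e.g., a thematic wiki), keep those that look like summaries,
# and align each one with the most similar candidate source documents.
# All names, thresholds, and the overlap heuristic are illustrative
# assumptions, not the framework's real API.
from dataclasses import dataclass


@dataclass
class Document:
    title: str
    text: str


def looks_like_summary(doc: Document, min_words: int = 50, max_words: int = 300) -> bool:
    """Crude length-based filter for texts that could serve as summaries."""
    n_words = len(doc.text.split())
    return min_words <= n_words <= max_words


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets, as a stand-in for a real alignment score."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def align(summaries, sources, top_k: int = 3, min_score: float = 0.1):
    """Pair each candidate summary with its top-k most similar source documents."""
    pairs = []
    for summ in summaries:
        scored = sorted(sources, key=lambda s: word_overlap(summ.text, s.text), reverse=True)
        chosen = [s for s in scored[:top_k] if word_overlap(summ.text, s.text) >= min_score]
        if chosen:
            pairs.append((summ, chosen))
    return pairs


if __name__ == "__main__":
    # Tiny toy collection with deliberately small thresholds for demonstration.
    collection = [
        Document("Episode recap",
                 "The crew lands on the planet and meets the council ..."),
        Document("Episode transcript",
                 "Captain: We are approaching the planet now. Officer: Sensors show "
                 "life signs near the council chambers. Captain: Prepare a landing "
                 "party and notify the council of our arrival."),
    ]
    candidate_summaries = [d for d in collection if looks_like_summary(d, min_words=5, max_words=15)]
    candidate_sources = [d for d in collection if d not in candidate_summaries]
    for summary, srcs in align(candidate_summaries, candidate_sources, min_score=0.05):
        print(summary.title, "->", [s.title for s in srcs])
```

In the actual framework, the classification and alignment steps are configurable, which is how parameters like the target summary length and the number of source documents per summary are controlled.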

The corpora and the framework code can be obtained from the Fandom Corpus project page, which also contains detailed descriptions of the data formats and framework usage.