FandomCorpora

Large state-of-the-art corpora for training neural networks to create abstractive summaries are mostly limited to the news genre, as it is expensive to acquire human-written summaries for other types of text at a large scale. We present a novel automatic corpus construction approach to tackle this issue as well as three new large open-licensed summarization corpora based on our approach that can be used for training abstractive summarization models.

Our constructed corpora contain fictional narratives, descriptive texts, and summaries about movies, television, and book series from different domains. All sources are published under a Creative Commons (CC) license, so we can provide the corpora for download. We also provide a ready-to-use framework that implements our automatic construction approach, so that custom corpora can be created with desired parameters such as the length of the target summary and the number of source documents from which to create the summary.

The main idea behind our automatic construction approach is to use existing large text collections (e.g., thematic wikis) and automatically classify whether the texts can be used as (query-focused) multi-document summaries and align them with potential source texts.

The essential steps of our approach are: (1) parsing and cleaning of input documents, (2) selecting potential candidates for abstractive summaries from those input documents and assigning summary candidates to them, and (3) choosing the final set of abstractive summaries based upon a new quality threshold and, if needed, splitting the selected summaries into training, validation, and test sets.
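The three steps above can be illustrated with a minimal sketch. This is not the framework provided with the corpora: every name here (`Config`, `clean`, `coverage`, `build_corpus`), the default values, the 80/10/10 split, and the use of document length to pick summary candidates are assumptions made for illustration. In particular, the quality score is a simple word-coverage ratio swapped in as a stand-in for the quality threshold mentioned above.

```python
import random
import re
from dataclasses import dataclass


@dataclass
class Config:
    # Hypothetical parameters and defaults, chosen for illustration only.
    max_summary_words: int = 50     # upper bound on target summary length
    num_sources: int = 2            # source documents aligned to each summary
    quality_threshold: float = 0.5  # minimum coverage score to keep a pair


def clean(text):
    """Step 1: strip simple wiki markup remnants and normalize whitespace."""
    text = re.sub(r"\[\[|\]\]|\{\{.*?\}\}", "", text)
    return re.sub(r"\s+", " ", text).strip()


def tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def coverage(summary, sources):
    """Share of distinct summary words that also occur in the sources.

    A stand-in quality score, not the measure used by the authors.
    """
    summary_words = tokens(summary)
    if not summary_words:
        return 0.0
    source_words = set().union(*(tokens(s) for s in sources))
    return len(summary_words & source_words) / len(summary_words)


def build_corpus(documents, config, seed=0):
    docs = [clean(d) for d in documents]

    # Step 2: short texts become summary candidates (an assumed heuristic);
    # longer texts form the pool of potential source documents.
    candidates = [d for d in docs if 0 < len(d.split()) <= config.max_summary_words]
    pool = [d for d in docs if len(d.split()) > config.max_summary_words]

    pairs = []
    for cand in candidates:
        # Align each candidate with the sources that cover it best.
        ranked = sorted(pool, key=lambda s: coverage(cand, [s]), reverse=True)
        sources = ranked[: config.num_sources]
        score = coverage(cand, sources)
        # Step 3: keep only pairs that reach the quality threshold.
        if sources and score >= config.quality_threshold:
            pairs.append({"summary": cand, "sources": sources, "score": score})

    # Step 3 (continued): shuffle reproducibly and split 80/10/10.
    random.Random(seed).shuffle(pairs)
    n_train = int(len(pairs) * 0.8)
    n_val = int(len(pairs) * 0.1)
    return {
        "train": pairs[:n_train],
        "validation": pairs[n_train:n_train + n_val],
        "test": pairs[n_train + n_val:],
    }
```

Calling `build_corpus(list_of_wiki_texts, Config(max_summary_words=40, num_sources=3))` returns a dictionary with the three splits. Raising `quality_threshold` trades corpus size for tighter agreement between each summary and its aligned sources.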

Researchers

  Name Contact
Dr. rer. nat. Benjamin Hättasch
Postdoctoral Researcher
+49 631 205752900
S2|02 E112

Publications

View complete list at TUbiblio