AIPHES Corpora and Tools: Overview

Joint work on corpora and tools is a central integrative element of AIPHES. Our strategy rests on four pillars: (1) we compile and maintain DBS, the first German reference corpus for multi-document summarization (MDS); (2) we provide new English benchmark corpora for innovative MDS setups; (3) we research methods for compiling large heterogeneous summarization corpora, either fully automatically or through crowdsourcing; and (4) we aim for the highest standards in research data management, which is why we have set up a training program and follow an open-source and open-data policy wherever possible.

AIPHES Corpora and Tools


German MDS reference corpus. Prior to AIPHES, no multi-document summarization corpora existed for the German language. We fill this gap with a new corpus of 93 summaries and 293 documents covering 30 educational topics. Each topic contains about ten heterogeneous documents linked from the Deutsche Bildungsserver platform and three to four human-written summaries. The reference corpus was first published at COLING 2016 (Benikova et al., 2016) and has already been used in multiple papers, e.g., for interactive summarization (P.V.S. and Meyer, 2017), for studying summary evaluation metrics (Peyrard et al., 2017), and for sentence regression models (Zopf et al., 2018).

English MDS reference corpus. The ClueWeb12-based focused retrieval dataset by Habernal et al. (2016) has properties similar to those of our German reference corpus: it consists of 50 educational topics with 100 heterogeneous documents each. We have used this data to create a novel benchmark corpus of concept maps, which received the best resource/application paper award at EMNLP (Falke and Gurevych, 2017), and a multi-faceted hierarchical summarization corpus (Tauchmann et al., 2018) that we plan to use for aspect-oriented summarization and for claim verification.

Crowdsourcing and automated corpus construction. Both our concept map corpus and our hierarchical summarization corpus were created using innovative crowdsourcing approaches. Since deep learning methods typically require large training sets of several thousand summaries, we have also investigated the fully automatic construction of large heterogeneous summarization corpora. Our core idea is to invert the summary generation process: we start with a Wikipedia excerpt as the summary and then find possible source documents on the web (Zopf et al., 2016). Using this methodology, we created the hMDS corpus with 91 English topics and 40 German topics. This reference corpus enabled us to bootstrap a fully automatic version of hMDS, yielding a corpus of over 7,000 summaries that is well suited for training deep learning methods (Zopf, 2018).

Innovative summarization setups. Journalism serves as a major use case in AIPHES for applying novel information preparation methods to heterogeneous data from a real-world domain. To this end, we have recently created a corpus of journalistic live blogs (P.V.S. et al., 2018). Live blogs are dynamic news articles about a certain event (e.g., an election). Journalists update them manually while the event is ongoing. To provide readers with concise, up-to-date information, they regularly write or update a summary, which can be seen as a multi-document summary of the individual news items.

In a research collaboration across all AIPHES guiding themes, we automatically annotate MDS corpora with a wide range of summary-relevant phenomena in order to compare their effectiveness for MDS. A first joint paper was recently accepted for publication (Zopf et al., 2018).

Research Data Management and Tools

Repository. Our corpora are available to the research community either openly from our GitHub repository or upon request. Since copyright severely constrains MDS research, we also contributed to the C4Corpus (Habernal et al., 2016), a subcorpus of Common Crawl containing only documents that have been explicitly published under open licenses.

Tools. Due to the severe lack of NLP tools for the German language, we have created open-licensed preprocessing components (Remus et al., 2016), annotation schemas (Mújdricza-Maydt et al., 2016), and annotation tools (Meyer et al., 2016; Eckart de Castilho et al., 2016) for German.

Training and teaching. To disseminate knowledge within AIPHES, tools and corpora have been a core topic of our retreats and qualification program. In cooperation with the Heidelberg Graduate Academy and the University and State Library Darmstadt, we organized multiple training courses on research data management. Research data management has also been a recurring topic in university teaching. In the 2018 summer term, for example, we organized a student lab project at TUDA around the hierarchical summarization corpus.


Zopf, Markus ; Botschen, Teresa ; Falke, Tobias ; Heinzerling, Benjamin ; Marasovic, Ana ; Mihaylov, Todor ; P. V. S., Avinesh ; Loza Mencía, Eneldo ; Fürnkranz, Johannes ; Frank, Anette (2018):
What's Important in a Text? An Extensive Evaluation of Linguistic Annotations for Summarization.

P. V. S., Avinesh ; Peyrard, Maxime ; Meyer, Christian M. (2018):
Live Blog Corpus for Summarization.
In: Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC), European Language Resources Association, Miyazaki, Japan.

Tauchmann, Christopher ; Arnold, Thomas ; Hanselowski, Andreas ; Meyer, Christian M. ; Mieskes, Margot (2018):
Beyond Generic Summarization: A Multi-faceted Hierarchical Summarization Corpus of Large Heterogeneous Data.
In: Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC), European Language Resources Association, Miyazaki, Japan.

Zopf, Markus (2018):
auto-hMDS: Automatic Construction of a Large Heterogeneous Multilingual Multi-Document Summarization Corpus.
In: Proceedings of the 11th edition of the Language Resources and Evaluation Conference (LREC 2018).

Falke, Tobias ; Gurevych, Iryna (2017):
Bringing Structure into Summaries: Crowdsourcing a Benchmark Corpus of Concept Maps.
In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark.

Peyrard, Maxime ; Botschen, Teresa ; Gurevych, Iryna (2017):
Learning to Score System Summaries for Better Content Selection Evaluation.
In: Proceedings of the EMNLP workshop "New Frontiers in Summarization", Association for Computational Linguistics, Copenhagen, Denmark, September 2017.

P. V. S., Avinesh ; Meyer, Christian M. (2017):
Joint Optimization of User-desired Content in Multi-document Summaries by Learning from User Feedback.
In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Volume 1: Long Papers, Association for Computational Linguistics, Vancouver, Canada, DOI: 10.18653/v1/P17-1124.

Benikova, Darina ; Mieskes, Margot ; Meyer, Christian M. ; Gurevych, Iryna (2016):
Bridging the gap between extractive and abstractive summaries: Creation and evaluation of coherent extracts from heterogeneous sources.
In: Proceedings of the 26th International Conference on Computational Linguistics (COLING), Osaka, Japan, ISBN 978-4-87974-702-0.

Eckart de Castilho, Richard ; Mújdricza-Maydt, Éva ; Yimam, Seid Muhie ; Hartmann, Silvana ; Gurevych, Iryna ; Frank, Anette ; Biemann, Chris (2016):
A Web-based Tool for the Integrated Annotation of Semantic and Syntactic Structures.
In: Proceedings of the workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH) at COLING 2016, Osaka, Japan.

Zopf, Markus ; Peyrard, Maxime ; Eckle-Kohler, Judith (2016):
The Next Step for Multi-Document Summarization: A Heterogeneous Multi-Genre Corpus Built with a Novel Construction Approach.
In: Proceedings of the 26th International Conference on Computational Linguistics, The COLING 2016 Organizing Committee, Osaka, Japan.

Meyer, Christian M. ; Benikova, Darina ; Mieskes, Margot ; Gurevych, Iryna (2016):
MDSWriter: Annotation tool for creating high-quality multi-document summarization corpora.
In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016): System Demonstrations, Berlin, Germany, DOI: 10.18653/v1/P16-4017.

Remus, Steffen ; Hintz, Gerold ; Benikova, Darina ; Arnold, Thomas ; Eckle-Kohler, Judith ; Meyer, Christian M. ; Mieskes, Margot ; Biemann, Chris (2016):
EmpiriST: AIPHES Robust Tokenization and POS-Tagging for Different Genres.
In: Proceedings of the 10th Web as Corpus Workshop (WAC-X), Berlin, Germany, DOI: 10.18653/v1/W16-2613.

Habernal, Ivan ; Sukhareva, Maria ; Raiber, Fiana ; Shtok, Anna ; Kurland, Oren ; Ronen, Hadar ; Bar-Ilan, Judit ; Gurevych, Iryna (2016):
New Collection Announcement: Focused Retrieval Over the Web.
In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '16), ACM, Pisa, Italy, ISBN 978-1-4503-4069-4, DOI: 10.1145/2911451.2914682.

Habernal, Ivan ; Zayed, Omnia ; Gurevych, Iryna (2016):
C4Corpus: Multilingual Web-size corpus with free license.
In: Calzolari, Nicoletta ; Choukri, Khalid ; Declerck, Thierry ; Grobelnik, Marko ; Maegaard, Bente ; Mariani, Joseph ; Moreno, Asuncion ; Odijk, Jan ; Piperidis, Stelios (eds.): Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), European Language Resources Association (ELRA), Portorož, Slovenia.

Mújdricza-Maydt, Éva ; Hartmann, Silvana ; Gurevych, Iryna ; Frank, Anette (2016):
Combining Semantic Annotation of Word Sense & Semantic Roles: A Novel Annotation Scheme for VerbNet Roles on German Language Data.
In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia.


For further information, please contact Dr. Christian M. Meyer.