Joint work on corpora and tools is a central integrative element of AIPHES. Our strategy rests on four main pillars: (1) we compile and maintain the first German MDS reference corpus, DBS; (2) we provide new English benchmark corpora for innovative MDS setups; (3) we research methods to compile large heterogeneous summarization corpora fully automatically or using crowdsourcing; and (4) we aim for the highest standards in research data management, for which we have set up a training program and rely on an open-source and open-data policy wherever possible.

German MDS reference corpus. Prior to AIPHES, there were no multi-document summarization corpora for the German language. We fill this gap by compiling a new corpus that consists of 93 summaries and 293 documents from 30 educational topics. Each topic contains about ten heterogeneous documents linked from the Deutsche Bildungsserver platform and three to four human-written summaries. The reference corpus was first published at COLING 2016 (Benikova et al., 2016) and has already been used in multiple papers, e.g., for interactive summarization (P.V.S. and Meyer, 2017), for studying summary evaluation metrics (Peyrard et al., 2017), and for sentence regression models (Zopf et al., 2018).

English MDS reference corpus. The ClueWeb12-based focused retrieval dataset by Habernal et al. (2016) has properties similar to those of our German reference corpus: it consists of 50 educational topics with 100 heterogeneous documents each. We have used this data to create two resources: a novel benchmark corpus of concept maps, which received the best resource/application paper award at EMNLP (Falke and Gurevych, 2017), and a multi-faceted hierarchical summarization corpus (Tauchmann et al., 2018), which we plan to use for aspect-oriented summarization and for claim verification.

Crowdsourcing and automated corpus construction. Both our concept map corpus and our hierarchical summarization corpus have been created using innovative crowdsourcing approaches. Since deep learning methods typically require large training sets of several thousand summaries, we have also investigated the fully automatic construction of large heterogeneous summarization corpora. Our core idea is to invert the summary generation process by starting with an expert-written summary from Wikipedia and then finding possible sources on the web (Zopf et al., 2016). Using this methodology, we created the hMDS corpus with 91 English topics and 40 German topics. This reference corpus enabled us to bootstrap a fully automatic version of hMDS, yielding a corpus of over 7,000 summaries that is well suited for training deep learning methods (Zopf, 2018).
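The inversion idea can be illustrated with a minimal sketch: take an existing summary as given and keep those candidate web documents that plausibly served as its sources. The sketch below is an illustration only and not the procedure of Zopf et al. (2016); the token-overlap score, the threshold min_score, and the function names tokenize, overlap_score, and select_sources are all assumptions made for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (incl. German umlauts)."""
    return re.findall(r"[a-zäöüß]+", text.lower())


def overlap_score(summary, candidate):
    """Fraction of distinct summary tokens that also occur in the candidate document.

    Hypothetical relevance measure chosen for illustration only.
    """
    summary_tokens = set(tokenize(summary))
    if not summary_tokens:
        return 0.0
    candidate_tokens = set(tokenize(candidate))
    return len(summary_tokens & candidate_tokens) / len(summary_tokens)


def select_sources(summary, candidates, min_score=0.3):
    """Return names of candidate documents scoring at least min_score, best first.

    `candidates` maps a document name to its text; min_score is an assumed threshold.
    """
    scored = [(overlap_score(summary, text), name) for name, text in candidates.items()]
    return [name for score, name in sorted(scored, reverse=True) if score >= min_score]


if __name__ == "__main__":
    summary = "The election was won by the green party."
    candidates = {
        "news_report": "The green party won the election yesterday.",
        "recipe": "Recipe for apple pie.",
    }
    print(select_sources(summary, candidates))
```

In this toy setup the summary is fixed and the sources are retrieved afterwards, which is the reverse of the usual workflow of writing a summary for a given document set. A realistic pipeline would need web search, deduplication, and license checks, none of which are modeled here.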

Innovative summarization setups. Journalism serves as a major use case in AIPHES for applying novel information preparation methods to heterogeneous data from a real-world domain. To this end, we have recently created a corpus of journalistic live blogs (P.V.S. et al., 2018). Live blogs are dynamic news articles about a certain event (e.g., an election) that journalists update manually while the event is ongoing. To provide readers with concise up-to-date information, the journalists regularly write or update a summary, which can be seen as a multi-document summary of the individual news items.

In a research collaboration of all AIPHES guiding themes, we automatically annotate MDS corpora with a wide range of summary-relevant phenomena to jointly compare their effectiveness for MDS. A first joint paper was recently accepted for publication (Zopf et al., 2018).

Research Data Management and Tools

Repository. Our corpora are available to the research community either openly from our GitHub repository or upon request. Since copyright restrictions are a severe obstacle to MDS research, we also worked towards the C4Corpus (Habernal et al., 2016), a subcorpus of Common Crawl containing only documents that have been explicitly put under open licenses.

Tools. Due to the severe lack of NLP tools for the German language, we have created open-licensed preprocessing components (Remus et al., 2016), annotation schemas (Mújdricza-Maydt et al., 2016), and annotation tools (Meyer et al., 2016; Eckart de Castilho et al., 2016) for German.

Training and teaching. To disseminate knowledge within AIPHES, tools and corpora have been a core topic of our retreats and qualification program. In cooperation with the Heidelberg Graduate Academy and the University and State Library Darmstadt, we organized multiple research data management training courses. Research data management has also been a recurring topic in university teaching. For example, in the 2018 summer term, we organized a student lab project at TUDA around the hierarchical summarization corpus.

