Nearly everyone struggles to keep up with ever-growing amounts of information, making information overload a major problem in today’s society. The news domain is no exception. It is not uncommon for large news portals to publish several articles per day on a single topic. This often makes it impossible for readers to stay up to date on every topic and every new development.
Current search engines retrieve information for given keywords and sort the results by their relevance to the search query. However, the large number of returned articles, up to several million, makes it hard to understand how an event evolved and what causal effects it had.
In this project, we aim to develop novel methods for structuring news stories in a more coherent way. The guiding questions are:
- How can we discover and model causal connections between articles?
- How can we present a complex news story to a reader in an effective way?
- How can we reduce information overload and present only the information of interest?
The goal of this project is to develop various software components that help to deal with information overload in the news domain. The components can address either readers, e.g. by recommending articles, visualizing connections between articles, or summarizing the development of a story, or editors.
Research in this project spans several methods:
1. Finding (causal) connections between articles based on their content as well as on existing metadata such as links between articles
2. Creating a user interface: presenting discovered connections in an easily accessible way and making them searchable to fit user needs
3. Automatic extraction and highlighting of updated information based on individual reading history
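To make the first item more concrete, the following is a minimal sketch of content-based linking between articles. It is not the project's actual method: it assumes a plain TF-IDF representation with cosine similarity, ignores link metadata and causality entirely, and all function names (`tokenize`, `tfidf_vectors`, `cosine`, `related_pairs`) and the threshold value are hypothetical choices for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Deliberately crude tokenizer: lowercase alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF so that terms occurring everywhere keep a small weight.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_pairs(docs, threshold=0.1):
    """Return (i, j, score) for article pairs whose similarity reaches the
    threshold, sorted from most to least similar."""
    vectors = tfidf_vectors(docs)
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            score = cosine(vectors[i], vectors[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


articles = [
    "Parliament passes new climate law after long debate",
    "Climate law passed by parliament faces court challenge",
    "Local football club wins championship final",
]
for i, j, score in related_pairs(articles):
    print(f"article {i} <-> article {j}: similarity {score:.2f}")
```

Lexical overlap of this kind only suggests that two articles cover the same topic. Deciding whether one event caused another would require additional signals, such as publication order, explicit links between articles, or extracted events.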
Events are a key component of story chains. News articles mainly report on events, how they are connected, and what consequences they have. A large part of the project was dedicated to improving the state of the art in automatic event detection and extraction.
During this project, a new event nugget detection system was developed and submitted to the Event Nugget Detection shared task at TAC 2015. Among 14 systems, it placed first in event nugget detection and realis classification and fourth in event classification. The source code is publicly available.
Events are inseparably connected to time: each event happens at a given point in time. Extracting this event time benefits many applications, including information extraction, summarization, and knowledge base population. However, extracting the event time is difficult, as it is only seldom mentioned in news articles. Current approaches to annotating the event time have disadvantages in both feasibility and quality: the annotation is hugely time-consuming and still misses the most relevant information for a majority of the events. As shown in the publication Temporal Anchoring of Events for the TimeBank Corpus, for a majority of the events the temporal information is not mentioned close to the event. Often, it can be found several sentences before or after the mention of the event. This publication proposes a new annotation scheme that is more efficient and yields better quality. We annotated data from the TimeBank corpus with this new scheme. The new corpus can be downloaded.
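The observation that temporal information often sits several sentences away from the event mention can be illustrated with a naive baseline. The sketch below is not the annotation scheme from the publication: it simply searches outward from the event's sentence for the closest explicit date. The function name `find_nearest_date`, the `max_distance` parameter, and the small date pattern (ISO dates and English "Month day, year" dates only) are assumptions made for this example.

```python
import re

# Hypothetical, deliberately small pattern: "March 3, 2015" or "2015-03-03".
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December)\s+\d{1,2},\s+\d{4}\b"
    r"|\b\d{4}-\d{2}-\d{2}\b"
)


def find_nearest_date(sentences, event_index, max_distance=3):
    """Return (date_text, sentence_offset) for the explicit date closest to
    the sentence containing the event, or None if no date is found within
    max_distance sentences."""
    for distance in range(max_distance + 1):
        # At each distance, check the preceding sentence before the following one.
        offsets = [0] if distance == 0 else [-distance, distance]
        for offset in offsets:
            i = event_index + offset
            if 0 <= i < len(sentences):
                match = DATE_PATTERN.search(sentences[i])
                if match:
                    return match.group(0), offset
    return None


sentences = [
    "The summit opened on March 3, 2015.",
    "Delegates arrived from forty countries.",
    "The two leaders signed the agreement.",
]
# The signing event is in the last sentence; the only date is two sentences earlier.
print(find_nearest_date(sentences, event_index=2))
```

A baseline like this only captures explicitly written dates. It cannot handle relative expressions such as "last week" or events whose time has to be inferred from context.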
During the course of this project, several master's theses were supervised. Ziyang Li completed his master's thesis on Related Articles Discovery in Large Corpora, and Michael Bräunlein completed his thesis on Multi-Document High Precision Event Extraction.
- Prof. Dr. Iryna Gurevych, Principal Investigator
- Nils Reimers, Doctoral Researcher
- Holtzbrinck Publishing Group
- ZEIT Online
The project was funded by the Federal Ministry of Education and Research (BMBF).