Area A: Graph-based discourse-semantic processing for heterogeneous document sources
Area A addresses information processing from a computational linguistics perspective, as a basis for automatic summarization of heterogeneous document sources. Discourse-semantic processing encompasses the identification of entities and events mentioned in a discourse, and the characterization of states of affairs from different perspectives such as sentiment or opinion. The way discourse information is encoded depends strongly on linguistic principles, and it needs to be decoded in a language interpretation process that employs both linguistic and world knowledge. An important factor in this decoding process is coherence. Coherence is instrumental in identifying the linguistic components of the discourse and in determining the linguistic quality of the source documents as well as of the target summaries for a human reader (see Area D).
The three guiding themes A1-A3 of Area A focus on complementary phenomena and interact in specific ways with guiding themes of Area B. The different phenomena are: discourse entities (A1), events (A2) and extra-propositional meanings or attitudes (such as opinion or sentiment) ascribed to entities and propositions (A3). Their treatment in the respective guiding themes involves diverse methods: clustering referential expressions into entities within and across documents (A1); identifying and aligning complete event structures within and across documents (A2); partitioning and assigning meaning attributions over subgraphs built for propositions (A3).
PhD projects in this area will deal with the following tasks. In A1, entity linking will identify the concepts mentioned, and cross-document coreference resolution will determine whether documents talk about the same entities. In A2, important events need to be identified, and alignment of events across documents will determine whether the same events are reported in different documents. In A3, we need to determine whether events or entities are viewed from different perspectives or are the subject of differing opinions.
The PhD projects in this area will approach these guiding themes with methods suited to each task. Collective classification approaches and graph-based methods are especially suited for combining document-internal and cross-document analysis. In interaction with Area C, we plan to investigate the concept of small frequent subgraphs, or motifs (C1), as a way to capture structural aspects of discourse. As specific machine learning techniques, we will explore ranking methods (C2) and deep learning (C3).
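
As a rough illustration of how a graph-based view can combine document-internal and cross-document analysis for A1, the following minimal sketch treats referential expressions as nodes and pairwise coreference decisions as edges, and reads entities off as connected components. It is not a method proposed in this programme: the function name `cluster_mentions`, the mention tuples and the links are hypothetical, and in practice the links would come from a trained pairwise or collective classifier rather than being given by hand.

```python
from collections import defaultdict


def cluster_mentions(mentions, links):
    """Group mentions into entities as connected components of the link graph.

    mentions: list of hashable mention identifiers, here (doc_id, surface_form).
    links: iterable of (mention, mention) pairs judged to be coreferent.
    Both inputs are illustrative assumptions, not data from the programme.
    """
    # Union-find: each mention starts as its own entity.
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent pointers to the representative, halving the path.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    # Merge the entities of every linked pair, within or across documents.
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(list)
    for m in mentions:
        clusters[find(m)].append(m)
    # Sort for a deterministic result.
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical toy input: two documents, four mentions.
mentions = [
    ("doc1", "Angela Merkel"),
    ("doc1", "the chancellor"),
    ("doc2", "Merkel"),
    ("doc2", "Berlin"),
]
links = [
    (mentions[0], mentions[1]),  # document-internal coreference link
    (mentions[0], mentions[2]),  # cross-document coreference link
]
entities = cluster_mentions(mentions, links)
```

With this toy input, the three mentions of the same person end up in one entity spanning both documents, while the unlinked mention forms an entity of its own. Taking connected components is the simplest possible choice; it propagates every link transitively, so a single wrong link merges two entities, which is one reason to consider collective classification instead.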