Text Mining & Analytics

Overview

In the text mining and text analytics research area, we design algorithms to extract information from unstructured text. These algorithms are used in many contexts, e.g. Digital Humanities, educational research, research about the web 2.0, or information retrieval. We particularly focus on innovative automated approaches to discovering structure in textual documents by means of text classification.

The growing text analytics field relies heavily on supervised text classification to offer services such as sentiment analysis, document categorization, or scientific discovery. In a nutshell, supervised text classification extracts relevant information (features) from manually classified documents and learns a model from the extracted information. Machine learning classifiers thus learn to make decisions autonomously, so there is no need to hand-code the rules that would otherwise drive these decisions.
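
The two steps described above, extracting information from labeled documents and learning a model from it, can be illustrated with a minimal sketch. It assumes a bag-of-words representation and a multinomial naive Bayes classifier with add-one smoothing, which is one common textbook choice and not necessarily the approach used in our projects. The training sentences, labels, and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive feature extractor)."""
    return text.lower().split()


def train(labeled_docs):
    """Learn per-class word counts and document counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability under the learned model."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # Class prior estimated from the share of training documents.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, manually labeled training documents.
TRAINING_DATA = [
    ("a wonderful and moving story", "positive"),
    ("great characters and a wonderful plot", "positive"),
    ("boring plot and terrible pacing", "negative"),
    ("a terrible and boring book", "negative"),
]

model = train(TRAINING_DATA)
print(predict(model, "a wonderful plot"))
```

No classification rule is written by hand here: the decision for a new text follows entirely from the counts collected on the labeled examples.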

We apply supervised text classification algorithms to complex language processing problems and novel datasets. In such settings, a textual document is typically enhanced with automatic annotations about grammatical and discourse structure, before the information relevant to the given problem is extracted. To reduce the effort of manually creating training data, we are currently also exploring the use of semi-supervised and unsupervised text mining algorithms.

Beyond supervised text classification for novel language processing tasks, the text mining and analytics area carries out research on:

  • pairwise classification of documents, e.g. email thread disentanglement
  • unsupervised learning on textual data, e.g. clustering Wikipedia editors
  • crowdsourcing annotations on unstructured textual data
  • handling class imbalance, e.g. in edit-turn-pair classification
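
As a rough illustration of how two of these topics can interact, the sketch below turns single messages into pairwise instances (same thread or not) and then computes inverse-frequency class weights, a common heuristic for compensating class imbalance. The message and thread identifiers are hypothetical toy data, and the weighting scheme is an assumption for illustration rather than the method used in the work listed above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: (message id, thread id) pairs.
MESSAGES = [
    ("m1", "A"), ("m2", "A"), ("m3", "B"),
    ("m4", "C"), ("m5", "C"), ("m6", "D"),
]


def build_pairs(messages):
    """Turn single documents into pairwise instances labeled same-thread or not."""
    return [
        ((id_a, id_b), thread_a == thread_b)
        for (id_a, thread_a), (id_b, thread_b) in combinations(messages, 2)
    ]


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    return {
        label: n_samples / (len(counts) * count)
        for label, count in counts.items()
    }


pairs = build_pairs(MESSAGES)
labels = [label for _, label in pairs]
weights = balanced_class_weights(labels)

print(Counter(labels))  # how many pairs fall into each class
print(weights)          # the rarer class receives the larger weight
```

In this toy setting, only a small share of the pairs belongs to the same thread, so the positive class is weighted more heavily. A classifier that supports per-class or per-instance weights could use these values during training.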

Current Projects

  • CEDIFOR: This project aims to foster interdisciplinary work between Computer Science and Digital Humanities by providing know-how and research infrastructures for text analytics to humanities researchers in the Rhein-Main area, supporting them in investigating novel research questions. This project is conducted in collaboration with the Goethe-Universität Frankfurt and the German Institute for International Educational Research (DIPF).
  • Audiovisual Content Processing: The goal of this project is the creation of frameworks which facilitate the integration of manual and automatic analysis of audiovisual content, and the identification of the most relevant audiovisual features for different tasks in Digital Humanities. The developed tools will be integrated as audiovisual processing components into the UIMA/DKPro framework.

Past Projects

  • Personality Profiling in Books: For e-book recommendation systems, it can be very helpful to know the answers to high-level content questions that readers may have, for example “What is the main hero like?”, “Is the story complicated?” or “Is the book suitable for children?”. The idea of this project is to leverage real-world knowledge resources to help a machine learning system estimate answers to such questions. To reach this goal, the initial research focus lies in identifying suitable approaches to integrating semantic knowledge into text classification algorithms.
  • VISADOC: This project investigates novel textual features for modeling content-related text properties. It aims to develop an interactive feature engineering approach for complex user-defined semantic properties, as well as visual analysis tools that support the exploration of large document collections with respect to a certain text property.
  • IT Forensics/CASED: New forms of communication in the Web 2.0 are increasingly used to prepare and organize crimes such as sexual harassment or human trafficking. This project aims to create tools that aid the investigation of such crimes by finding relevant documents, identifying relevant pieces of information, and analyzing the relations between them.
  • LOEWE TP 2.3: The “Text as Process” research area of the interdisciplinary LOEWE Research Center “Digital Humanities” deals with linguistic properties of collaboratively created texts in the web 2.0. It focuses on the investigation of mass collaboration in online settings by analyzing the quality of content, the history of documents, background discussions, collaboration patterns, and user roles. To this end, the project develops novel datasets based on article histories and discussion pages from the online encyclopedia Wikipedia.
  • THESEUS TEXO: The THESEUS project strives to develop application-oriented base technologies, technical standards, and products that will allow users and companies to access services, content, and knowledge all over the world. TEXO is a use case in the THESEUS program that focuses on the discovery of new services as well as their combination to create new business.
  • Structuring Story Chains: Nearly everyone struggles to keep up with ever larger amounts of information, making information overload a major problem in today's society. The news domain is no exception. Since current search engines retrieve information based on keywords and sort the results by their relevance to the entered search query, the large number of returned articles makes it hard to understand the evolution of an event. In this project, we aim to develop novel methods for structuring news stories in a more coherent way by attempting to discover and model causal connections between articles, to present complex news stories in a simpler way, and to reduce information overload.

Completed PhD Theses

Dr. Lucie Flekova

  • Leveraging Lexical-Semantic Knowledge for Text Classification Tasks
  • Technische Universität Darmstadt, 2017.
  • Reviewer: Prof. Dr. Iryna Gurevych
  • Co-reviewers: Prof. Dr. Benno Stein (Bauhaus Universität Weimar), Prof. Dr. Walter Daelemans (University of Antwerp)
  • http://tubiblio.ulb.tu-darmstadt.de/89322/

Dr. Johannes Daxenberger

  • The Writing Process in Online Mass Collaboration: NLP-Supported Approaches to Analyzing Collaborative Revision and User Interaction
  • Technische Universität Darmstadt, 2016.
  • Reviewer: Prof. Dr. Iryna Gurevych
  • Co-reviewers: Prof. Dr. Karsten Weihe (TU Darmstadt) and Ofer Arazy, PhD (University of Alberta)
  • http://tubiblio.ulb.tu-darmstadt.de/77229/

Dr. Oliver Ferschke

  • The Quality of Content in Open Online Collaboration Platforms: Approaches to NLP-supported Information Quality Management in Wikipedia
  • Technische Universität Darmstadt, 2014.
  • Reviewer: Prof. Dr. Iryna Gurevych
  • Co-reviewers: Prof. Dr. Hinrich Schütze (LMU München), Assoc. Prof. Carolyn P. Rosé (CMU Pittsburgh)
  • http://tubiblio.ulb.tu-darmstadt.de/65952/

Dr. Niklas Jakob

  • Extracting Opinion Targets from User-Generated Discourse with an Application to Recommendation Systems
  • Technische Universität Darmstadt, 2011.
  • Reviewer: Prof. Dr. Iryna Gurevych
  • Co-reviewer: Prof. Dr. Gerhard Heyer (Universität Leipzig)
  • http://tubiblio.ulb.tu-darmstadt.de/51784/

Resources/Tools

Primary Contact

Dr. Johannes Daxenberger