Past Projects

Past Projects by Funding Source

German Research Foundation (DFG)


Learning new languages has become very popular in recent years. Although many language learning platforms offer free lessons for many languages, their exercises are often either too easy or too difficult for effective learning. We tackle this issue by adjusting the difficulty of C-test exercises to meet the demands of the learners. C-tests are a special kind of cloze test in which learners have to complete the second half of every second word. They are frequently used as language proficiency tests, as they allow a learner to train morphological, syntactic, and semantic properties of a language at the same time.
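The C-test construction described above can be illustrated with a minimal sketch. It is not the project's implementation: the function name `make_c_test`, the gap character, the decision to skip one-letter words, and the rule of removing the larger half of odd-length words are all assumptions made for illustration.

```python
import re


def make_c_test(text: str, gap_char: str = "_") -> str:
    """Blank out the second half of every second word (simplified C-test)."""
    counter = 0  # number of eligible words seen so far

    def mutilate(match: re.Match) -> str:
        nonlocal counter
        word = match.group(0)
        # Assumption: one-letter words stay intact and are not counted.
        if len(word) < 2:
            return word
        counter += 1
        # Odd-numbered eligible words stay intact; even-numbered ones get a gap.
        if counter % 2 == 1:
            return word
        # Assumption: for odd word lengths, the larger half is removed.
        keep = len(word) // 2
        return word[:keep] + gap_char * (len(word) - keep)

    # Match runs of letters only, so punctuation and digits are left untouched.
    return re.sub(r"[^\W\d_]+", mutilate, text)


print(make_c_test("The quick brown fox jumps"))
# prints: The qu___ brown f__ jumps
```

Adjusting difficulty, as the project proposes, would then amount to changing which words receive gaps or how much of each word is removed; this sketch fixes both choices.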


Argumentation mining deals with the automatic identification of arguments and their relations from natural language text. This research project targets the specific challenges of argumentation mining for the web. We seek to establish foundations of algorithms that apply argument mining to various forms of web argumentation, efficiently leverage the scale of the web, and complement argumentation mining with an argumentation analysis to effectively assess important quality dimensions.

Feature-based Visualization and Analysis of Natural Language Documents (VisADoc)

This project, implemented in cooperation with the University of Konstanz, aims to investigate novel textual features for modeling content-related text properties, to develop an interactive feature engineering approach for complex user-defined semantic properties, and to develop visual analysis tools that support the exploration of large document collections with respect to a certain text property.

Integrating Collaborative and Linguistic Resources for Word Sense Disambiguation and Semantic Role Labeling (InCoRe)

In the InCoRe project, we address the lack of coverage typically associated with lexical semantic resources. The major goal of this project is the integration of various expert-built and collaboratively created lexical semantic resources to a large-scale resource of unprecedented coverage and quality. The second major goal of InCoRe is to scale natural language processing technologies utilizing lexical semantic resources, specifically word sense disambiguation and semantic role labeling, to real-life applications based on the developed resource.

Mining Lexical-Semantic Knowledge from Dynamic and Linguistic Sources and Integration into Question Answering for Discourse-Based Knowledge Acquisition in eLearning (QA-EL)

The project investigates novel applications of dynamic lexical-semantic resources for information search in eLearning. On the one hand, we develop novel ways of mining knowledge from Wikipedia and other Web 2.0 knowledge repositories. On the other hand, we apply question answering in the area of discourse-based knowledge acquisition in eLearning for the first time.

Feel free to download our QA-EL Flyer.

QA-EduInf: Community-based Question Answering for Educational Information

The project aims at using natural language processing techniques to analyze educational information and answer user questions on various educational topics. Since a large portion of users' questions have already been asked by other people in community question answering forums and answered by educational experts or crowds, we use the available question and answer archives to answer these questions and minimize human effort in searching through educational information. The project consists of different components including question classification, question and answer retrieval, answer quality assessment, and summarization.
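The idea of answering a new question from an archive of previously answered ones can be sketched as follows. This is only an illustration of the question retrieval component under strong assumptions: the Jaccard token-overlap score, the threshold value, the function names, and the example archive are all hypothetical and are not taken from the project.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap between two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_answer(question, archive, threshold=0.3):
    """Return (score, archived_question, answer) for the best match, or None.

    `archive` is a list of (question, answer) pairs; `threshold` is an
    arbitrary cut-off chosen for this illustration.
    """
    query = tokens(question)
    scored = ((jaccard(query, tokens(q)), q, a) for q, a in archive)
    best = max(scored, key=lambda item: item[0], default=None)
    if best is None or best[0] < threshold:
        return None
    return best


# Hypothetical archive of already answered questions.
ARCHIVE = [
    ("How do I apply for a student loan?", "Contact the financial aid office."),
    ("What is the deadline for enrolment?", "Enrolment closes in September."),
]

match = retrieve_answer("How can I apply for a student loan", ARCHIVE)
print(match[2] if match else "No sufficiently similar question found.")
# prints: Contact the financial aid office.
```

A realistic system would replace the token overlap with a stronger similarity model and add the answer quality assessment and summarization components mentioned above.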

Semantic Information Retrieval (SIR-3)

This project systematically investigates the semantic and lexical relationships between words and concepts and their usefulness in information retrieval (IR) tasks. The current phase (III) of the project focuses on the development of large-scale word sense disambiguated multilingual lexical semantic resources and the development of novel semantics-based approaches to cross-lingual IR (CLIR).

Semantic Information Retrieval Part 1 & 2 (SIR)

This project systematically investigates the possible usage of semantic and lexical relationships between words or concepts for improving the information retrieval process. The main focus is on semantic relatedness measures using different knowledge sources (e.g. WordNet, GermaNet, or Wikipedia).

Feel free to download our SIR Flyer.
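To make the notion of a semantic relatedness measure more concrete, here is a minimal sketch of a path-based measure. It is one simple family of such measures and is not claimed to be the one used in SIR. The toy taxonomy stands in for a real knowledge source such as WordNet, GermaNet, or Wikipedia, and all names and the scoring formula 1 / (1 + path length) are assumptions for illustration.

```python
from collections import deque

# Hypothetical toy taxonomy (child -> parent); a real system would draw on
# a knowledge source such as WordNet, GermaNet, or Wikipedia.
TOY_HYPERNYMS = {
    "dog": "canine", "wolf": "canine", "canine": "mammal",
    "cat": "feline", "feline": "mammal", "mammal": "animal",
    "car": "vehicle", "vehicle": "artifact",
}


def build_graph(hypernyms):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, parent in hypernyms.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def path_length(graph, source, target):
    """Shortest number of edges between two concepts (BFS), or None."""
    if source not in graph or target not in graph:
        return None
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def path_relatedness(graph, a, b):
    """Relatedness in [0, 1]: 1 / (1 + shortest path), 0.0 if unconnected."""
    dist = path_length(graph, a, b)
    return 0.0 if dist is None else 1.0 / (1 + dist)


graph = build_graph(TOY_HYPERNYMS)
print(path_relatedness(graph, "dog", "cat"))  # prints: 0.2
print(path_relatedness(graph, "dog", "car"))  # prints: 0.0
```

In a retrieval setting, such scores could be used to match query terms against related document terms rather than requiring exact keyword matches.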

Sentiment Analysis for User-Generated Discourse in eLearning 2.0

The project aims to support easy exploration of subjective content and the generation of feedback for content providers. We develop components for subjectivity identification and for opinion and topic extraction.

Feel free to download our SENTAL Flyer.

Automatic Quality Assessment and Feedback in eLearning 2.0 (AQUA)

The project investigates the use of Natural Language Processing and Machine Learning techniques to automatically measure the quality of user-generated textual documents in Web 2.0, such as forum posts, Wikipedia articles, or blog entries. This can be utilized to recommend high-quality materials to the user (e.g. the learner), to implement quality-aware information retrieval, or to predict the popularity of web sites for computational advertising.

Volkswagen Foundation

Educational Web 2.0 (EduWeb)

In the EduWeb project (Lichtenberg-Professorship program), we seek to implement our vision of technology-enhanced education of the 21st century. A vast amount of content is produced by many people every day, but despite their interconnection through the World Wide Web, their efforts are often isolated from each other. To overcome this problem, the UKP Lab will provide and explore new algorithms to simplify tedious, recurring tasks and to improve coordination with the community.

Loewe Research Center Digital Humanities (State of Hesse)

Text as an Instance

Descriptions of natural language grammars tend to focus on the canonical constructions of a language, yet actual usage also displays constructions that are marked in different ways and thus deviate from the canonical form. The project aims to validate the hypothesis that natural language grammars constitute systems of constructions that are centered on a set of canonical constructions of a particular language, which are complemented by a set of peripheral non-canonical constructions. A contrastive investigation of non-canonical grammatical constructions in English and German is performed using corpus-based methods.

Text as a Process

In this project, we aim at gaining insights into the collaboration, production, and reception processes of collaboratively created Web 2.0 texts. We aim at analyzing the change of collaboratively created texts over time, discovering quality measures, and identifying successful collaboration patterns. While focusing on Wikipedia as one of the most popular instances of collaboration platforms, our research results can be generalized to other areas of collaboration in the Web 2.0 and will foster research both in NLP and in the humanities.

Text as Product

This project examines the correspondence of linguistic concepts and automatically extracted topic models.

For our analysis, we annotate a text corpus with lexical cohesion relations and automatically acquire topics. Then, we use LDA topic models to predict lexical cohesion, using the topic membership of lexical items and significance scores between lexical items to inform an automatic system for lexical chain annotation. Besides aiming at a state-of-the-art system for lexical chain identification, we analyse the semiotic interpretability of stochastic methods.

Federal Ministry of Education and Research (BMBF)

Centre for the Digital Foundation of Research in the Humanities, Social, and Educational Sciences (CEDIFOR)

CEDIFOR intends to contribute to bridging the gap between research in the Humanities and computer-based methods, and to help researchers master the characteristic problems in this process. It is a Digital Humanities Centre providing methodological expertise for advising researchers from the Humanities, Social, and Educational Sciences on adopting computer-based methods in their research. This concerns the planning and operational stages of projects as well as the long-term provision of result data.

Construction of Research Infrastructures for eHumanities


The mission of DARIAH-EU is to enhance and support digitally-enabled research across the arts and humanities. DARIAH aims to develop and maintain an infrastructure in support of research practices based on information and communication technology, so-called virtual research environments. The UKP Lab will provide illustrative prototypes and demonstrators, specified in collaboration with researchers in the humanities, that will build upon the general infrastructure and best practices developed by DARIAH.

FAMULUS (Fostering diagnostic competence in medical and teacher education via adaptive online-case-simulations)

The interdisciplinary FAMULUS project aims to study how online case simulations that provide automatic adaptive feedback can foster students' diagnostic skills. To generate automatic feedback, we will develop novel methods for identifying and evaluating diagnostic reasoning (e.g. hypothesis generation, evidence generation and evaluation, hypothesis acceptance or rejection) in student essays. The effect of such feedback on the development of diagnostic skills will then be evaluated in a user study with students from medicine and education.


FOCUS is a joint research project within the framework of the BMBF funding program “Digital Change in Education, Science and Research” and is set up in accordance with the funding guidelines for research on the management of research data and its life cycle at universities and other research institutions.

The objective of the project is to develop subject-specific modular training courses in the area of research data management and archiving, to establish them permanently at the respective universities, and thus to offer the relevant training modules to the Hessian universities for subsequent use.

IT Forensics (as part of CASED)

This project develops tools to process the natural language in collections of Web 2.0 documents for the identification of fraud and crime. CASED brings together researchers from diverse backgrounds to collaborate on advanced security research. The UKP Lab operates the Forensic Linguistics project of CASED, with the goals of creating tools to aid the investigation of crimes on the Web, finding relevant documents using a semantic search, identifying relevant pieces of information (persons, places, times), and analyzing the relations between them.

Feel free to download our Forensic Linguistics Flyer.

CLARIN-D: Implementation of a web-based annotation platform for linguistic annotations

We develop a web-based tool that runs in a web browser without further installation effort. We support annotations on several linguistic layers within the same user interface. Further, we realize an interface to crowdsourcing platforms in order to scale simple annotation tasks to a large number of annotators. The annotation platform will be connected to the CLARIN-D infrastructure to be interoperable with the processing pipelines in WebLicht. The development of the tool is supported by a concurrent second curation project, which defines ‘best practices’ for linguistic annotation on several language layers for different annotator status groups.

Semantic Assistance Services for Career Integration and Personal Competence Development

The SABINE project (German: “Semantische Assistenzdienste für die berufliche Integration und Persönliche Kompetenzentwicklung”) develops methods to interlink the databases of recruitment agencies, personnel services and human resources departments by means of semantic methods. The UKP Lab's contribution will be in methods which extract semantic knowledge from domain-independent sources like Wikipedia by means of statistical text analysis.

Secure Documents using Individual Markers (SiDiM)

The primary goal of the project is to develop novel methods that individualize electronic documents through manipulations of their textual content that are imperceptible to a reader. The marks are supposed to be difficult to remove and, at the same time, to have no recognizable effect on the meaning of the content. This solution will be embedded in an electronic document distribution environment and remain transparent to the end user.

Semantics- and Emotion-Based Conversation Management in Customer Support (SIGMUND)

The project is concerned with an entirely new area in the situation-aware support of phone-based customer support: optimizing the work of call center agents through automatic call monitoring and a dynamic selection and presentation of the relevant documents during the call. Our contribution is in the area of semantic document analysis and context-aware information retrieval.

Feel free to download our SIGMUND Flyer.


The project investigates the use of semantic technologies to enable future business value networks. Our main focus is the use of NLP and semantic IR technologies to enable automatic service search and discovery as well as community mining methods to recognize opinions and trends about services.

Feel free to download our THESEUS Texo Flyer.

Software Campus Program (BMBF)

Argumentative Writing Support

Formulating persuasive and well-formed arguments is a challenging task and a crucial aspect in writing skills acquisition. However, current writing support is limited to feedback about grammar or spelling and there is no system that provides formative feedback about argumentative writing. In this project, we aim to research novel methods for assisting authors in writing persuasive arguments with respect to the following questions: Is my argument well structured and comprehensible? Are the given reasons relevant for my claim? Does my argument include sufficient support for being persuasive?

Open Window

Open Window aims to give learners the opportunity to explore interlinked educational content on the World Wide Web. As part of the Open Window project, technologies for automatically linking educational content with different collaboratively created media are developed. These collaboratively created media include encyclopedias, such as Wikipedia, and social media services, such as Twitter.

Personality Profiling in Books

For e-book recommendation systems, it can be very helpful to know the answers to high-level content questions that readers may have, for example “What is the main hero like?”, “Is the story complicated?” or “Is the book suitable for children?”. The idea of this project is to leverage real-world knowledge resources in order to facilitate estimating answers to such questions with a machine learning system. To reach this goal, the initial research focus lies in identifying suitable approaches to integrate semantic knowledge into text classification algorithms.

Structuring Story-Chains

Nearly everyone is struggling to keep up with ever larger amounts of information, making information overload a major problem in today's society. The news domain is no exception. Since current search engines retrieve information based on keywords and sort the results by their relevance to the entered search query, the large number of returned articles makes it hard to understand the evolution of an event. In this project, we aim to develop novel methods for structuring news stories in a more coherent way by attempting to discover and model causal connections between articles, present complex news stories in a simpler way, and reduce information overload.

European Commission (EU)


OpenMinTeD aspires to enable the creation of an infrastructure that fosters and facilitates the discovery and use of text mining technologies and interoperable services. It examines several use cases identified by experts from different scientific areas, ranging from generic scholarly communication to literature related to life sciences, food and agriculture, and social sciences and humanities.

Klaus-Tschira Foundation

Wikulu – Self-Organizing Wikis

Wikulu assists the user while creating, editing, or searching content. The self-organizing abilities of the wiki are enabled through Natural Language Processing algorithms like keyphrase extraction, document summarization, document clustering, or graph-based term weighting.

Feel free to download our WIKULU Flyer.

Automated Exercise Generation for Language Learners

In a labor market that is increasingly globalized, knowledge of one or even more than one foreign language is more relevant than ever before. New research technologies from the field of natural language processing can support self-directed learning as they offer tools for the assessment of text difficulty and enable the automated generation of adequate exercises.


Together with TU Darmstadt, researchers at DIPF were involved in several projects.

Educational Text Analytics: Automatically Grading Text Response

Initial work in automatically grading text responses using text analytics concerned participation in the 2013 SemEval challenge on textual entailment, applying automatic grading methods to a novel corpus of children's essays, and publishing a survey article on the state of the art of short answer grading methods.

Educational Monitoring on the web – Identifying and Following Educationally Relevant Controversies

The project aimed at putting into practice a monitoring system for public opinion on the most important controversially discussed, educationally relevant topics that could be found on the internet. To this end, tools were developed that provide new opportunities for tracking and analyzing different controversies that are relevant to education.

Knowledge Discovery in Scientific Literature

The main topic of this PhD program was knowledge discovery in the vast amount of scientific literature ubiquitously available on the Web and in historical texts. This research employed methods of intelligent identification and analysis of structures in scientific texts on all scales, enabling completely new, previously unforeseen forms of access to scientific information.

Visualising Complex Data in Education

The project aimed at providing support to educational information and educational research in the assessment of complex data. The focus was on processing natural language data as found in many forms in the field of education, such as free text questions in studies or publications.

Automatic Coding of Free Text Formats for Elaborated Educational Measurement

AKTeur pursued the central objective of creating an interdisciplinary co-operation network and a broad technological basis for automated coding of free text responses in manifold scenarios.

Information Extraction From Spoken and Informal Language

In this project, various aspects of extracting and using information from data that contains spoken or informal language were examined. These aspects include, among others, classification (What makes a good answer?), segmentation and keyphrase extraction (in the context of transcripts of school lessons), and summarization of data for Eduserver. In each sub-project, either the data came from the educational domain or the goal was to provide information or tools to researchers in the area of educational research.

Contextualized Information Processing of Educational Content Using Automatic Summarization (Methods)

As part of a cooperation with DIPF, a Master’s thesis at Technical University Darmstadt presented first approaches to developing a method that offers machine-based support for the manual summarization of data collections in the field of educational research.

Innovative services for educational research data

The innovative services for educational research data were aimed at reducing the manual analysis effort invested by educational researchers, thus giving them more time for the actual research matters. Research in this area was conducted in cooperation with DIPF.

Long-term UKP team projects

Darmstadt Knowledge Processing (DKPro) Repository

The DKPro Repository consists of a growing number of scalable, robust, and flexible UIMA components for various kinds of NLP tasks such as tokenization, sentence splitting, PoS tagging, negation detection, lexical chaining, and word pair extraction.

Feel free to download our DKPro Flyer.

UBY – Large-scale Sense-linked Lexical-semantic Resource

UBY is a large-scale lexical-semantic resource for natural language processing (NLP) based on the ISO standard Lexical Markup Framework (LMF). Most UBY-related software is developed as open source on Google Code. UBY combines a wide range of information from expert-constructed and collaboratively constructed resources for English and German.


Ambient Semantic Computing (ASC)

Video lectures, audio recordings, wiki content, and forum entries are often seen as separate entities. The goal of ASC is the integration of these multimodal content streams by combining techniques from Natural Language Processing and Human Computer Interfaces.

Feel free to download our ASC Flyer.

Processing of Audiovisual Content: Integration of Automatic and Manual Analysis

The main topic of this project is the application of machine learning techniques to audiovisual content from the digital humanities. This research applies well-established methods from areas such as natural language processing, speech signal processing, and computer vision to audiovisual recordings used in humanities research, in fields such as psychology, communication sciences, and pedagogy.

Utilizing Web Knowledge: Language Technologies and Psychological Processes

The project, funded by the Board for Interdisciplinary Research of TU Darmstadt, examines the usefulness of selected, innovative language technologies with respect to psychological processes and models. This research project will provide important groundwork by bringing together scientists from computer science, industrial science, and psychology.

Welt der Kinder

The digital humanities project “Welt der Kinder”, funded by the SAW program of the Leibniz Society, started in May 2014 and is designed as a test model for future similar projects. Through very close cooperation between historians, information scientists, and computer scientists, it aims to gain new insights about the period from 1850 until 1918, a time in which an accelerated production of knowledge was dominated by both globalization and nationalization at the same time.

This poster gives an overview of the project context, its goals, and its methods.