Data Analysis Software Project for Natural Language

TUCaN: 20-00-0948-pp


The project kick-off will take place on April 18th, 2016.

18.04.2016 – Project kick-off (from 09:50 to 11:30; Room S103/102)

The kick-off lecture presents the context of this project and describes the available project options and topics.

Course Content

In this project, participants will build software components for data processing and analysis in the context of automatic question answering.

Automatic question answering (QA) deals with automatically answering questions that humans formulate in natural language. This research field is related to natural language processing (NLP) as well as information retrieval (IR). Questions can range from simple factoid questions (“What is the weight of an iPhone 6s?”) to opinion questions (“What do you think about the iPhone 6s?”). Furthermore, there are often complex questions that span multiple sentences and contain additional context information. Various portals exist on the web where such complex questions can be asked, for example StackExchange. Such portals contain a large amount of knowledge that can be used to automatically provide answers to new and even complex questions. Community question answering (CQA) therefore differs from traditional QA, because it deals with a wide variety of questions and uses CQA archives as its primary data source.

End-to-end CQA systems consist of multiple individual parts, each of which can be considered a separate research area. Examples are question type classification, question retrieval, answer selection, and answer summarization. The topics of this software project focus on these individual parts. Participants will develop and implement ideas on how to use existing data from CQA datasets to provide solutions for the different components of a CQA system.
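
As a rough illustration of one of these components, question retrieval, the following minimal Python sketch ranks archived questions by TF-IDF cosine similarity to a new question. It is only an assumption about what a simple baseline could look like, not the method used in the project. All function names (`tokenize`, `build_idf`, `tfidf_vector`, `cosine`, `retrieve_similar_questions`) and the three-question archive are hypothetical and chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(token_lists):
    """Compute a smoothed inverse document frequency for each archive term."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per question
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def tfidf_vector(tokens, idf):
    """Build a sparse TF-IDF vector; terms unseen in the archive are ignored."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_similar_questions(new_question, archive, top_k=3):
    """Return up to top_k (score, question) pairs, most similar first."""
    token_lists = [tokenize(question) for question in archive]
    idf = build_idf(token_lists)
    vectors = [tfidf_vector(tokens, idf) for tokens in token_lists]
    query = tfidf_vector(tokenize(new_question), idf)
    scored = [(cosine(query, vec), question) for vec, question in zip(vectors, archive)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical mini archive standing in for a real CQA dataset.
archive = [
    "How much does an iPhone 6s weigh?",
    "How do I reset my router password?",
    "What is the battery life of the iPhone 6s?",
]

for score, question in retrieve_similar_questions(
    "What is the weight of an iPhone 6s?", archive
):
    print(f"{score:.3f}  {question}")
```

A purely lexical baseline like this treats "weigh" and "weight" as unrelated tokens, so paraphrased questions can be ranked lower than expected. Handling such cases is one reason question retrieval is considered a research area of its own.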


The course is planned with mandatory practice sessions every two weeks during the lecture period, each lasting 3 to 4 hours. The advantage of this format is that challenges can be tackled efficiently during the sessions, with direct feedback from the lecturer and other participants. The final format, however, depends on the number of participants.


Prerequisites

  • Interest in working with natural language textual data
  • Programming skills (Scala/Python/Java)

Teaching Staff

Andreas Rücklé (Please contact by e-mail for an appointment)

Prof. Dr. Iryna Gurevych