Text Analytics: Machine Learning for Text

Text Analytics: Multi-Modal Commonsense Reasoning

Course Description

When we, as humans, reason about our world, we use information from multiple modalities to reach conclusions. Our current attempts to teach machines to conduct commonsense reasoning are largely focused on learning from large bodies of text using language modeling objectives. Recent advances in tasks like visual question answering and image captioning have shown performance gains from multi-modal learning, making this area of research a hot topic. Not only are deep neural models able to generate artificial images from text descriptions, but recent work has also shown that models are able to answer complex questions referring to images. The question, however, is whether models are actually able to perform common sense reasoning, or whether they have merely learnt the idiosyncrasies of language and simply output the most likely answer given their contextual representations.

In this seminar we will look at recent research in Natural Language Processing and Computer Vision that combines the two worlds. We will also look closely at relevant work on common sense reasoning and critically ask whether such reasoning is even possible using a single modality.


Seminar kick-off on 15.10 from 15:20-17:00 at S202/C120

Additional material will be distributed via the Moodle eLearning platform. The required passcode will be announced during the first lecture.

Teaching Staff

  • Jonas Pfeiffer
  • Prof. Dr. Iryna Gurevych

Office hours: Tuesday 13:30-15:00


Will be announced during the seminar.


The first sessions will consist of introductory lectures covering the basics of the machine learning methods used in NLP and Computer Vision. The program for the remainder of the seminar will be determined according to the number of participants and will address the following questions:

  • What are the types of modern NLP and Computer Vision systems and what are they composed of?
  • How are they built? (Recent deep learning methods and available resources, such as tools and datasets, for building such systems)
  • How are the visual and textual modalities combined?
  • What is common sense?
  • Are machines currently even able to perform common sense reasoning? (Current limitations and the proposed methods to address them)


When you should send me a request for an office-hour slot: 2 weeks before your presentation (if you present in the first week, 1 week before is sufficient)

What you should tell me in your e-mail: (1) your preferred half-hour time slot, if you have a preference; (2) your name and your paper

When you should send me your presentation draft: as early as possible, and no later than 3 days before our meeting