Audiovisual Processing

Processing of Audiovisual Content – Integration of Automatic and Manual Analysis


The main topic of this project is the application of machine learning techniques to audiovisual content from the digital humanities. The research applies well-established methods from areas such as natural language processing, speech signal processing and computer vision to audiovisual recordings used in humanities research, for example in psychology, communication sciences and pedagogy.

The research question posed by this project is whether audiovisual content from the digital humanities can be automatically classified with machine learning methods. The goal of the project is to automate, or at least bootstrap, the analysis performed by researchers from the humanities. Reaching this goal requires intensive research on unsupervised and semi-supervised machine learning methods for representation learning on multimodal data.
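One common semi-supervised approach is self-training: a classifier trained on the small labeled set pseudo-labels the confident portion of the unlabeled pool and is retrained on the enlarged set. The sketch below is a minimal illustration with a nearest-centroid classifier; all data, class offsets and the confidence threshold are synthetic placeholders, not the project's actual features or models.

```python
import numpy as np

# Synthetic two-class data: class 1 is shifted by +2.0 in every dimension.
rng = np.random.default_rng(0)
y_lab = np.arange(20) % 2
X_lab = rng.normal(size=(20, 8)) + 2.0 * y_lab[:, None]       # small labeled set
X_unl = rng.normal(size=(200, 8)) + 2.0 * rng.integers(0, 2, (200, 1))  # unlabeled pool

def centroids(X, y):
    """Per-class mean vectors of the current labeled set."""
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

for _ in range(5):  # a few self-training rounds
    if len(X_unl) == 0:
        break
    C = centroids(X_lab, y_lab)
    # Distance of each unlabeled point to both class centroids.
    d = np.linalg.norm(X_unl[:, None, :] - C[None, :, :], axis=2)
    pred = d.argmin(axis=1)
    margin = np.abs(d[:, 0] - d[:, 1])
    keep = margin > 1.0                 # pseudo-label only confident points
    if not keep.any():
        break
    X_lab = np.vstack([X_lab, X_unl[keep]])
    y_lab = np.concatenate([y_lab, pred[keep]])
    X_unl = X_unl[~keep]

print(X_lab.shape[0])  # labeled pool has grown via pseudo-labeling
```

In the project itself the classifier would be a neural network and the feature vectors would come from the audio, video and text preprocessing pipelines.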


This project currently works with two domains: 1) In the psychology domain, researchers try to understand which visual cues pre-service teachers take into consideration when judging students' personalities; the judgment is made by watching 30-second videos of the students doing activities in class. 2) In the communication sciences domain, researchers analyse and manually annotate cues in the body language, prosody and speech of politicians in a debate; their goal is to investigate which elements contribute most to persuading an audience.

Challenges to tackle:

  • Automatic classification of audiovisual content from the digital humanities;
  • Machine learning techniques for small audiovisual datasets from digital humanities with very high-level abstract annotations;
  • Identification of which modalities (audio, video or text) are most informative for each particular domain analysed in this project.
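The third challenge is often approached with an ablation study: compare each modality's standalone performance, and measure how much a late-fusion ensemble degrades when that modality is removed. The sketch below uses synthetic per-modality scores (with "audio" deliberately made the most informative); the modality names, noise levels and fusion rule are illustrative assumptions, not project results.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
y = rng.integers(0, 2, size=n)  # synthetic ground-truth labels

# Hypothetical per-modality scores for the positive class; smaller noise
# means a more informative modality.
scores = {
    "audio": np.clip(y + rng.normal(0, 0.3, n), 0, 1),
    "video": np.clip(y + rng.normal(0, 0.8, n), 0, 1),
    "text":  np.clip(y + rng.normal(0, 1.2, n), 0, 1),
}

def accuracy(p):
    """Accuracy of thresholded scores against the ground truth."""
    return ((p > 0.5).astype(int) == y).mean()

# Late fusion over all modalities, then leave-one-out ablation.
full = accuracy(np.mean(list(scores.values()), axis=0))
for name in scores:
    rest = [s for m, s in scores.items() if m != name]
    drop = full - accuracy(np.mean(rest, axis=0))
    print(f"{name}: standalone={accuracy(scores[name]):.2f}, fusion drop={drop:+.2f}")
```

A large accuracy drop when a modality is left out of the fusion indicates that it carries information the other modalities do not.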


We preprocess our data using well-established open-source tools from the speech processing and computer vision communities. For the machine learning algorithms, we also use open-source frameworks. To tackle the scarce amounts of data, we investigate semi-supervised machine learning techniques to improve our results. Since neural networks achieve outstanding results for most tasks related to vision and language, we focus on neural networks in this project. Techniques for fusing the different modalities of a dataset, such as feature-level fusion and decision-level fusion, are also investigated.
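The two fusion strategies differ in where the modalities are combined. Feature-level (early) fusion concatenates the modality features into a single vector before classification; decision-level (late) fusion trains one classifier per modality and combines their output scores. The sketch below illustrates both with synthetic arrays; the feature dimensions and score values are placeholders, not the project's actual representations.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
audio_feats = rng.normal(size=(n, 16))   # hypothetical audio features
video_feats = rng.normal(size=(n, 32))   # hypothetical video features

# Feature-level (early) fusion: concatenate per-modality features into
# one vector per sample and train a single classifier on the result.
early = np.concatenate([audio_feats, video_feats], axis=1)  # shape (n, 48)

# Decision-level (late) fusion: combine per-class scores from separate
# per-modality classifiers, here simply by averaging.
audio_scores = rng.random((n, 2))  # stand-ins for classifier outputs
video_scores = rng.random((n, 2))
late = (audio_scores + video_scores) / 2
fused_pred = late.argmax(axis=1)   # final class decision per sample
```

Early fusion lets the model learn cross-modal interactions but multiplies the input dimensionality, which is a concern with small datasets; late fusion keeps the per-modality models small and is robust when one modality is missing.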


Project team

  • Prof. Dr. Iryna Gurevych, Principal Investigator
  • Pedro Santos, Doctoral Researcher


Cooperation partners

  • Caroline V. Wahle, Doctoral Researcher, Graduate School for Teaching and Learning Processes (UPGRADE), University of Koblenz-Landau
  • Prof. Dr. Marcus Maurer, Professor for Communication Sciences, Johannes Gutenberg University of Mainz

Student theses

  • Kunal Saxena (supervised by Pedro Santos and Prof. Dr. Iryna Gurevych). Automatic Prediction of Debaters' Success: A case study on Intelligence Squared. MSc final thesis, 2016.
  • Paul Michael Burkhardt (supervised by Pedro Santos and Prof. Dr. Iryna Gurevych). Automatic prediction of students' personality, self-concept and intelligence. BSc final thesis, 2016.


The project is funded by the FAZIT-Stiftung.