GeMTeX – German Medical Text Corpus


In everyday clinical practice, numerous texts are produced, such as doctors' letters and reports, which contain valuable information about the development, course, and treatment of a disease. These texts could be used by natural language processing (NLP) tools to assist doctors and researchers in their work. However, the full potential of clinical documents cannot be realised due to a lack of standardisation. The GeMTeX (German Medical Text Corpus) methodology platform aims to fill this gap and make medical texts from patient care available for research projects. The goal is to create the largest medical text corpus in the German language.

Within the framework of GeMTeX, six university medical centres in Munich, Leipzig, Essen, Berlin, Dresden and Erlangen are collecting documents from electronic patient records (elektronische Patientenakte, ePA) with the consent of the patients. These documents are annotated using INCEpTION, the annotation platform developed and maintained by UKP Lab. NLP methods are used to process the documents in compliance with data protection regulations, and the documents are then made available in anonymised form for joint use. This creates a valuable text resource for research and development.

In addition, GeMTeX will create a central technical and organisational structure to collect anonymised texts and process them for guideline-based enrichment. The resulting text database can be used to train AI models and test their usefulness in everyday clinical practice.


  • Prof. Dr. Iryna Gurevych, Principal Investigator
  • Dr.-Ing. Richard Eckart de Castilho, Postdoctoral Researcher
  • Serwar Basch, MSc, Doctoral Researcher


The GeMTeX project started on 1 June 2023 and is funded by the German Federal Ministry of Education and Research (BMBF) until 31 August 2026.