In everyday clinical practice, numerous texts are produced, such as doctors' letters and reports, which contain valuable information about the development, course, and treatment of a disease. These texts could be used by natural language processing (NLP) tools to assist doctors and researchers in their work. However, the full potential of clinical documents cannot be realised due to a lack of standardisation. The GeMTeX (German Medical Text Corpus) methodology platform aims to fill this gap and make medical texts from patient care available for research projects. The goal is to create the largest medical text corpus in the German language.

Before texts from routine care can be used for clinical and research purposes, they must first be made readable for NLP programmes. This requires large amounts of annotated text from routine patient care. Annotated texts are documents enriched with systematic annotations, e.g. information on diagnoses or medications. The annotations are manually reviewed by medical students and serve as a reference for further improving automatic annotation. Information structured in this way can be used together with existing data for analysis and statistical modelling.
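To illustrate what an annotated text can look like in practice, the following minimal Python sketch stores annotations separately from the text as character spans with a label. It is not the GeMTeX annotation scheme or tooling: the `Annotation` class, the helper functions, the labels "Diagnosis" and "Medication" and the example sentence are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A hypothetical stand-off annotation: a labelled character span."""
    start: int   # offset of the first character of the span
    end: int     # offset one past the last character of the span
    label: str   # illustrative category, e.g. "Diagnosis" or "Medication"


def annotate(text: str, phrase: str, label: str) -> Annotation:
    """Create an annotation for the first occurrence of `phrase` in `text`."""
    start = text.find(phrase)
    if start < 0:
        raise ValueError(f"{phrase!r} not found in text")
    return Annotation(start, start + len(phrase), label)


def covered_text(text: str, ann: Annotation) -> str:
    """Return the part of the text that the annotation refers to."""
    return text[ann.start:ann.end]


# Invented example sentence, not taken from any real patient document.
text = "Patient with type 2 diabetes was started on metformin 500 mg."

annotations = [
    annotate(text, "type 2 diabetes", "Diagnosis"),
    annotate(text, "metformin", "Medication"),
]

for ann in annotations:
    print(ann.label, "->", covered_text(text, ann))
```

Keeping the annotations apart from the text leaves the original document unchanged, so several annotators or automatic tools can add or correct spans independently.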

The IT infrastructure built during the development and networking phase of the Medical Informatics Initiative (MII) between 2018 and 2022 makes it possible to open up clinical documents on a large scale and enrich them with systematic annotations. The MII methodology platform GeMTeX aims to address the two major bottlenecks of current language models: data accessibility and data annotation.

Large collection of German-language medical texts from patient care is being created

Within the framework of GeMTeX, six university medical centres in Munich, Leipzig, Essen, Berlin, Dresden and Erlangen are collecting documents from electronic patient records (ePA, elektronische Patientenakte) with the consent of the patients. Using natural language processing, the documents are processed in compliance with data protection regulations and made available in anonymised form for joint use. This creates a valuable text resource for research and development.

In addition, GeMTeX will create a central technical and organisational structure to collect anonymised texts and process them for enrichment according to guidelines. The resulting text database can be used to train AI models and test their usefulness in everyday clinical practice.

The GeMTeX Methodology Platform was launched on 1 June 2023 and is funded by the German Federal Ministry of Education and Research (BMBF) with around seven million euros until 31 August 2026.

Further information:

Interview with Christina Lohr and Luise Modersohn, research assistants in the GeMTeX project


Project Lead

Prof. Dr. Martin Boeker
Network Coordinator
Professor of Medical Informatics
Technical University of Munich/University Hospital rechts der Isar



Prof. Dr. Markus Löffler
Deputy Network Coordinator
Head of the SMITH Consortium
Institute for Medical Informatics, Statistics and Epidemiology (IMISE)
Leipzig University



Project coordination:

Janina Kind
Administrative Project Management
Leipzig University


Dr. Frank Meineke
Scientific Project Management
Institute for Medical Informatics, Statistics and Epidemiology (IMISE)
Leipzig University


Christina Lohr
Scientific Project Management/Lead Annotation
Institute for Medical Informatics, Statistics and Epidemiology (IMISE)
Leipzig University



  • Charité – University Hospital Berlin
  • ID GmbH & Co. KGaA
  • Technical University of Darmstadt
  • Dresden University of Technology
  • University Hospital Erlangen
  • University Hospital Essen
  • Averbis GmbH
  • Hannover Medical School
  • Heidelberg University Hospital
  • German National Library of Medicine (ZB MED)
  • Leipzig University
  • University of Leipzig Medical Center
  • Ludwig Maximilian University of Munich
  • Technical University of Munich
  • University of Münster
  • Hasso Plattner Institute for Digital Engineering gGmbH
  • Tübingen University Hospital
  • Medical University of Graz (Associated Partner)