In everyday clinical practice, numerous texts are produced, such as doctors' letters and reports, which contain valuable information about the development, course, and treatment of a disease. These texts could be used by natural language processing (NLP) tools to assist doctors and researchers in their work. However, the full potential of clinical documents cannot be realised due to a lack of standardisation. The GeMTeX (German Medical Text Corpus) methodology platform aims to fill this gap and make medical texts from patient care available for research projects. The goal is to create the largest medical text corpus in the German language.

Before routine care texts can be used for clinical and research purposes, they must first be made readable by computer-based natural language processing (NLP) programmes. This requires large amounts of annotated text from routine patient care. Annotated texts are documents that contain additional information through systematic annotations, e.g. information on diagnoses or medications. The annotations are manually reviewed by medical students and serve as a reference for further improvement of the automatic annotation. Information structured in this way can be used together with existing data for analysis and statistical modelling.
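
As an illustration of what such annotations might look like in software, the following minimal Python sketch stores them as labelled character spans alongside the unmodified text (often called stand-off annotation). The class name, the labels and the example sentence are assumptions made for illustration; this is not GeMTeX's actual annotation format or tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A labelled character span in a text (stand-off annotation)."""
    start: int   # index of the first character of the span
    end: int     # index one past the last character of the span
    label: str   # hypothetical label, e.g. "Diagnosis" or "Medication"


def annotate(text: str, phrase: str, label: str) -> Annotation:
    """Locate the first occurrence of `phrase` and return it as an Annotation."""
    start = text.find(phrase)
    if start < 0:
        raise ValueError(f"{phrase!r} not found in text")
    return Annotation(start, start + len(phrase), label)


# Invented example sentence, not taken from any real patient record.
text = "Patient with type 2 diabetes mellitus, treated with metformin."

annotations = [
    annotate(text, "type 2 diabetes mellitus", "Diagnosis"),
    annotate(text, "metformin", "Medication"),
]

# The original text stays untouched; each annotation only points into it.
for a in annotations:
    print(a.label, "->", text[a.start:a.end])
```

Keeping the annotations separate from the text means a manual reviewer can correct a label or a span boundary without altering the underlying document.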

The IT infrastructure built during the development and networking phase of the Medical Informatics Initiative (MII) between 2018 and 2022 makes it possible to open up clinical documents on a large scale and to enrich them with systematic annotations. The MII methodology platform GeMTeX aims to address the two major bottlenecks of current language models: data accessibility and data annotation.

A large collection of German-language medical texts from patient care is being created

Within the framework of GeMTeX, six university medical centres in Munich, Leipzig, Essen, Berlin, Dresden and Erlangen are collecting documents from electronic patient records (ePA) with the patients' consent. Using natural language processing, the documents are processed in compliance with data protection regulations and made available in anonymised form for joint use. This creates a valuable body of text for research and development.

In addition, GeMTeX will create a central technical and organisational structure to collect the anonymised texts and prepare them for guideline-based annotation. The resulting text database can be used to train AI models and test their usefulness in everyday clinical practice.

The GeMTeX Methodology Platform was launched on 1 June 2023 and is funded by the German Federal Ministry of Education and Research (BMBF) with around seven million euros until 31 August 2026.

Further information:

https://www.smith.care/en/gemtex_mii/about-gemtex/

GeMTeX fact sheet

Interview with Christina Lohr and Luise Modersohn, research assistants in the GeMTeX project

Contact:

Project Lead

Prof. Dr. Martin Boeker
Network Coordinator
Head of the DIFUTURE Consortium
Professor of Medical Informatics
Technical University of Munich/University Hospital rechts der Isar

© Klinikum rechts der Isar, Technische Universität München

 

Prof. Dr. Markus Löffler
Deputy Network Coordinator
Head of the SMITH Consortium
Institute for Medical Informatics, Statistics and Epidemiology (IMISE)
Leipzig University

© Universitätsklinikum Hamburg-Eppendorf/Ronald Frommann

 

Project coordination:

Janina Kind
Administrative Project Management
SMITH-Office
Leipzig University
 

© UKL

Dr. Frank Meineke
Scientific Project Management/Technical management
Institute for Medical Informatics, Statistics and Epidemiology (IMISE)
Leipzig University

© Swen Reichhold

Luise Modersohn
Scientific Project Management/Lead Annotation
Institute for AI and Informatics in Medicine
Technical University of Munich/University Hospital rechts der Isar

© K. Czoppelt/Klinikum rechts der Isar

Christina Lohr
Scientific Project Management
Institute for Medical Informatics, Statistics and Epidemiology (IMISE)
Leipzig University

© Christina Lohr

Partners:

  • Charité – University Hospital Berlin
  • ID GmbH & Co. KGaA
  • Technical University of Darmstadt
  • Dresden University of Technology
  • University Hospital Erlangen
  • University Hospital Essen
  • Averbis GmbH
  • Hannover Medical School
  • Heidelberg University Hospital
  • German National Library of Medicine (ZB MED)
  • Leipzig University
  • University of Leipzig Medical Center
  • Ludwig Maximilian University of Munich
  • Technical University of Munich
  • University of Münster
  • Hasso Plattner Institute for Digital Engineering gGmbH
  • Tübingen University Hospital
  • Medical University of Graz (Associated Partner)