Through the Medical Informatics Initiative (MII) and the establishment of Data Integration Centers, clinical care data from multiple sources in the Hospital Information System (HIS) are being made available for medical research. The result is a unique and rich pool of clinical data that is precisely defined across all participating sites. With the methodological use case Phenotyping Pipeline, or PheP for short, the SMITH Consortium supported the development, qualitative enrichment and evaluation of this data pool from 2018 to 2023. The project was led by Leipzig University.

The PheP idea: Enriching health data and making it optimally available to science


PheP has provided tools and methods to use medical data in a privacy-compliant way and to address new types of questions. Over the course of five years, the methodological use case has laid the groundwork for several other MII projects. This required the creation of data sets that could be used for clinical epidemiological and health economic questions.

In PheP, phenotyping was used to derive further characteristics from observable patient characteristics (phenotypes) and to make them available. For example, certain laboratory values and medications can indicate other diseases or help find suitable patients for participation in a study. The Terminology- and Ontology-based Phenotyping (TOP) Junior Research Group emerged from this part of the project. Since 2021, the junior research group has been working on a software platform for modeling and executing phenotype algorithms. It is located at Leipzig University and led by Dr. Alexandr Uciteli. PheP has also supported the record linkage process, which merges patient data from different information sources, such as health insurance companies or death data from civil registries.
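The phenotyping idea described above, deriving a further characteristic from laboratory values and medications, can be illustrated with a minimal Python sketch. It is not the PheP or TOP implementation: the record structure, the function names and the choice of diabetes as the example phenotype are assumptions made purely for illustration. The rule uses an HbA1c value of at least 6.5 % or any medication from ATC group A10 (drugs used in diabetes) as an indicator.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PatientRecord:
    """Hypothetical, simplified patient record (not a PheP data model)."""
    patient_id: str
    hba1c_percent: list[float] = field(default_factory=list)  # lab values in %
    atc_codes: set[str] = field(default_factory=set)          # prescribed drugs


# Illustrative rule parameters (assumptions for this sketch).
HBA1C_THRESHOLD = 6.5            # common diagnostic cut-off for diabetes
ANTIDIABETIC_ATC_PREFIX = "A10"  # ATC group "Drugs used in diabetes"


def has_diabetes_phenotype(record: PatientRecord) -> bool:
    """Return True if a lab value or a medication indicates diabetes."""
    lab_hit = any(v >= HBA1C_THRESHOLD for v in record.hba1c_percent)
    med_hit = any(c.startswith(ANTIDIABETIC_ATC_PREFIX) for c in record.atc_codes)
    return lab_hit or med_hit


def select_candidates(records: list[PatientRecord]) -> list[str]:
    """Return the IDs of all patients matching the phenotype, e.g. for a study."""
    return [r.patient_id for r in records if has_diabetes_phenotype(r)]
```

In this sketch, a patient without a coded diabetes diagnosis is still found if a laboratory value or a prescription points to the disease, which is the kind of derived characteristic the paragraph describes.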

Making medical texts available for research

However, too little clinical information is available as machine-readable records, which was a challenge for the implementation of the PheP project. In particular, referral letters, findings, and surgical reports contain valuable information such as diagnoses, medications, side effects, and laboratory data that can only be extracted using natural language processing (NLP) and semantic text analysis methods. The PheP-NLP project has developed new methods for extracting such complex information from text. The work was led academically by the Jena University Language & Information Engineering Lab (JULIE Lab) in cooperation with companies in the field of language processing. In a pilot project with the three university hospitals in Aachen, Jena and Leipzig, medical documents of more than 3,000 patients were analyzed. The experience gained has led to a new project that is unique in Germany: the German Medical Text Corpus (GeMTeX) project. Together with 16 partners, including six university hospitals, it will build up by far the largest collection of German-language medical texts for NLP research.

The PheP Engine enables data-protection-compliant analysis

In the form of the "PheP Engine", the idea of distributed analysis became the technical basis for implementing comprehensive data use projects at the MII sites. The secure technology of the PheP Engine enabled distributed analysis of semantically and technically standardized data at all sites. With distributed analysis, sensitive patient data remains in the clinic while the algorithms access the data on site. With this technology, different clinical questions can be addressed flexibly and in compliance with privacy regulations. Distributed analysis is used in several cross-consortium MII Projectathons, for example in the cardiology data use project "NT-proBNP".
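The principle of distributed analysis can be illustrated with a minimal Python sketch. It does not show the PheP Engine itself. The names SiteSummary, summarize_locally and pool are assumptions made for illustration. Each site reduces its own measurements to a few aggregate numbers, and only these aggregates leave the clinic. A coordinator then combines them into a pooled mean and standard deviation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteSummary:
    """Aggregates that a site shares instead of patient-level data."""
    n: int           # number of measurements
    total: float     # sum of the measurements
    total_sq: float  # sum of the squared measurements


def summarize_locally(values: list[float]) -> SiteSummary:
    """Runs inside the clinic: patient-level values never leave this function."""
    return SiteSummary(
        n=len(values),
        total=sum(values),
        total_sq=sum(v * v for v in values),
    )


def pool(summaries: list[SiteSummary]) -> tuple[float, float]:
    """Runs at the coordinator: combine site aggregates into mean and sample SD."""
    n = sum(s.n for s in summaries)
    if n < 2:
        raise ValueError("need at least two measurements in total")
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    # Sample variance from pooled sums; clamp tiny negative rounding errors.
    variance = max((total_sq - n * mean * mean) / (n - 1), 0.0)
    return mean, math.sqrt(variance)
```

With hypothetical measurements [100, 200, 300] at one site and [400, 500] at another, pooling the two summaries gives a mean of 300, the same value a central analysis of all five measurements would produce. A real deployment would additionally need safeguards such as minimum cell sizes, which this sketch omits.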

PheP was funded by the German Federal Ministry of Education and Research (BMBF) from 01/01/2018 to 31/05/2023 as part of the SMITH Consortium.

Prof. Markus Löffler © Universitätsklinikum Hamburg-Eppendorf/Ronald Frommann

"SMITH focuses on the current challenges of digitization. Through the sustainable use of care data in medical research, decisive steps are taken to improve diagnosis, prevention and therapy. With these steps, health care can be taken to a new level."

Prof. Dr. Markus Löffler
Head of the SMITH Consortium | Head of the PheP Use Case, Institute for Medical Informatics, Statistics and Epidemiology (IMISE) | Leipzig University


Read more about the results of the PheP use case:


  • Meineke FA, Stäubert S, Löbe M, Uciteli A, Löffler M:
    Design and Concept of the SMITH Phenotyping Pipeline. 
    In: Stud Health Technol Inform. 2019 Sep 3;267:164-172. doi: 10.3233/SHTI190821. PMID: 31483269.
  • Uciteli A, Beger C, Kirsten T, Meineke FA, Herre H:
    Ontological representation, classification and data-driven computing of phenotypes. 
    In: J Biomed Semantics. 2020 Dec 21;11(1):15. doi: 10.1186/s13326-020-00230-0. PMID: 33349245; PMCID: PMC7751121.
  • Hahn U, Matthies F, Lohr C, Löffler M:
    3000PA-Towards a National Reference Corpus of German Clinical Language. 
    In: Stud Health Technol Inform. 2018;247:26-30. PMID: 29677916.