University Hospital Erlangen, Data Integration Center

Towards LLM-based Annotation of German Clinical Forms with SNOMED CT

EMEA

Artificial intelligence, Mapping, Tooling, Translation

We present an automated pipeline for SNOMED CT code selection to annotate German medical documentation forms from a hospital information system. In contrast to English, there is a shortage of German Large Language Models (LLMs) and of German ground truth corpora annotated with SNOMED CT for this use case. This presentation describes one of the first approaches in this direction.

Because of data protection concerns, the pipeline must run locally. Therefore, small LLMs hosted locally within the data integration centre of the University Hospital Erlangen, such as mistralai/Mistral-7B-Instruct-v0.3 and unsloth/Meta-Llama-3.1-8B-Instruct, were chosen. As data integration centres are part of the German Medical Informatics Initiative (MII), the pipeline gives special consideration to the MII core dataset: English SNOMED CT codes from this dataset are prioritised over other SNOMED CT codes of the International Edition, to encourage standardisation across German healthcare providers. Embedding models are combined with the SNOWSTORM server API search to pre-select the most appropriate codes. While manual annotation of medical forms still outperforms automated annotation, the proposed pipeline can provide useful pre-annotation suggestions. Recommending annotations with unsloth/Meta-Llama-3.1-8B-Instruct using an xlreator/biosyn-biobert-snomed embedding-based pre-selection showed the most promising results, achieving approximately 46.2% accuracy compared with human-annotated ground truth when the LLM recommended a single code. Furthermore, performance increased when more suggestions were considered, reaching a recall of 57.8% for 5 suggested codes.

Description

Introduction

The German Medical Informatics Initiative (MII) aims to make clinical routine care data accessible for research at both the local and national levels. As part of this effort, a FHIR-based MII Core Data Set (MII CDS) was created, which utilises SNOMED CT codes. Medical documentation forms are an essential part of clinical routine data, as German hospitals rely on them for daily patient documentation. Unfortunately, they suffer from poor or missing semantic annotation and from a lack of standardisation within and across departments.

Therefore, semantically linking form items to SNOMED CT codes enhances clarity and improves semantic interoperability, enabling easier analysis, sharing, and integration across systems and studies.

Method
In this presentation, we propose a first step toward accelerating the annotation process by leveraging local Large Language Models (LLMs) and using the MII CDS to support automated SNOMED CT coding of German medical forms. Locally hosted LLMs, such as unsloth/Meta-Llama-3.1-8B-Instruct and mistralai/Mistral-7B-Instruct-v0.3, were selected due to privacy concerns associated with non-local models. For each form item, the LLMs were presented with the results of a SNOWSTORM server API search as well as a pre-selection of SNOMED CT concepts from the MII CDS, obtained through a similarity search using an embedding model. For this pre-selection step, we tested sentence-transformers/all-mpnet-base-v2, xlreator/biosyn-biobert-snomed, and abhinand/MedEmbed-base-v0.1. The LLM was then asked to select the best-matching codes for each item.
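The embedding-based pre-selection described above can be illustrated with a minimal sketch. It assumes that the form item and each MII CDS concept have already been encoded as vectors by one of the embedding models; the function names `cosine_similarity` and `preselect`, the toy vectors, and the placeholder concept identifiers are hypothetical and are not taken from the pipeline itself.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def preselect(item_vector, concept_vectors, top_k=5):
    """Return the top_k (concept_id, score) pairs most similar to the form item."""
    scored = [
        (concept_id, cosine_similarity(item_vector, vector))
        for concept_id, vector in concept_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional vectors standing in for real model embeddings (assumption).
concept_vectors = {
    "concept-A": [0.9, 0.1, 0.0],
    "concept-B": [0.1, 0.9, 0.0],
    "concept-C": [0.7, 0.3, 0.1],
}
item_vector = [1.0, 0.0, 0.0]

candidates = preselect(item_vector, concept_vectors, top_k=2)
print([concept_id for concept_id, _ in candidates])
```

In such a setup, the resulting candidate list would be handed to the LLM together with the terminology server search results, so that the LLM only has to choose among a small number of plausible codes.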

Tumour board forms from the University Hospital Erlangen (UKER), which play a major role in interdisciplinary documentation and decision making in complex cancer cases, were chosen as a representative use case. Tumour boards are an essential part of cancer treatment worldwide. They are usually multidisciplinary or interdisciplinary conferences in which physicians discuss the best possible diagnosis and treatment plans for oncological patients, based on previous diagnostic imaging, laboratory results, and other examinations. Semantic interoperability of the tumour board forms is essential to effectively reuse the information collected during these meetings. The automated approach presented here was evaluated by comparing the SNOMED CT codes suggested by the pipeline with expert-annotated ground truth data.

Results
We show that 48% of the forms could be annotated with precoordinated SNOMED CT concepts (inter-annotator agreement (IAA), Cohen's kappa: κ = 0.75 micro, 0.75 macro), with 4.8% being part of the MII CDS. The small unsloth/Meta-Llama-3.1-8B-Instruct model, using xlreator/biosyn-biobert-snomed embeddings, showed the most promising results, achieving a recall of 46.2% for the top selected code and up to 57.8% when considering the top five selected codes, in comparison with the ground truth.
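The top-1 and top-5 figures can be read as recall at k: the share of ground truth items whose expert-assigned code appears among the first k suggestions. The sketch below assumes exactly one ground truth code per item and a ranked suggestion list per item; the exact evaluation protocol is not specified here, and the function name `recall_at_k` as well as the toy item and code identifiers are hypothetical.

```python
def recall_at_k(suggestions, ground_truth, k):
    """Fraction of ground truth items whose code is among the first k suggestions.

    suggestions:  dict mapping item id -> ranked list of suggested codes
    ground_truth: dict mapping item id -> single expert-assigned code (assumption)
    """
    if not ground_truth:
        return 0.0
    hits = 0
    for item_id, gold_code in ground_truth.items():
        top_k = suggestions.get(item_id, [])[:k]
        if gold_code in top_k:
            hits += 1
    return hits / len(ground_truth)


# Toy data with placeholder identifiers, not real SNOMED CT codes.
ground_truth = {"item-1": "A", "item-2": "B", "item-3": "C", "item-4": "D"}
suggestions = {
    "item-1": ["A", "X"],       # correct at rank 1
    "item-2": ["X", "B"],       # correct at rank 2
    "item-3": ["X", "Y", "Z"],  # never correct
    "item-4": ["D"],            # correct at rank 1
}

print(recall_at_k(suggestions, ground_truth, k=1))  # 2 of 4 items
print(recall_at_k(suggestions, ground_truth, k=5))  # 3 of 4 items
```

Under this definition, recall can only stay the same or rise as k grows, which matches the pattern of the top-five figure exceeding the top-one figure.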

Discussion and Conclusion
Publicly accessible medical-domain LLMs and German-language ground truth datasets for clinical forms are scarce. While manual annotation currently still outperforms automated annotation, our proposed pipeline can provide useful pre-annotation suggestions. Most non-annotated items were excluded due to non-mappable local peculiarities or non-relevant supporting protocol instructions (e.g., proper names). Outdated SNOMED CT concepts within the MII CDS and translation issues complicated the automated annotation process. Our pipeline will be used to standardise the forms at the UKER based on SNOMED CT. Challenges and opportunities for further research steps were identified on a linguistic, technical, and semantic level (e.g., prioritisation of specific semantic tags in code selection).

Scope

The goal of the German Medical Informatics Initiative (MII) is to make clinical routine data accessible to researchers on a national and international level. For local and cross-site data exchange, data integration centres are located at university hospitals, where medical forms are a common means of documentation.

The hospitals use the shared basic and extension modules of the MII, which base their data structure on the HL7 FHIR standard and their semantic annotation on international terminologies. SNOMED CT was selected as it is a leading international standard in healthcare and is supported by the German National Release Centre.

How SNOMED CT will be used

* SNOMED CT concepts are used by an LLM to automatically annotate forms from a hospital information system, improving interoperability and facilitating data integration within the German Medical Informatics Initiative (MII).

* SNOMED CT is used to annotate the ground truth for German medical forms.

* The SNOMED CT codes of the German MII Core Data Set are prioritised for better standardisation.

* We chose to work with the International Edition of SNOMED CT because, despite national efforts by the German Translation Group to translate the International SNOMED CT Edition step-by-step, there is no complete official translation available on which we could base our work.
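
One conceivable way to realise the prioritisation of MII Core Data Set codes mentioned in the list above is to place them ahead of the remaining International Edition candidates before the list is shown to the LLM. This is only a sketch under that assumption; the actual prioritisation mechanism is not described here, and the function name `merge_candidates` and the placeholder codes are hypothetical.

```python
def merge_candidates(mii_candidates, other_candidates, limit=10):
    """Merge two ranked candidate lists, MII Core Data Set codes first.

    Duplicates keep their first (highest-priority) position, and the
    merged list is truncated to at most `limit` entries.
    """
    merged = []
    seen = set()
    for code in list(mii_candidates) + list(other_candidates):
        if code not in seen:
            seen.add(code)
            merged.append(code)
    return merged[:limit]


# Placeholder identifiers, not real SNOMED CT codes.
mii_candidates = ["B", "C"]           # e.g. from the embedding-based pre-selection
other_candidates = ["A", "B", "D"]    # e.g. from the terminology server search

print(merge_candidates(mii_candidates, other_candidates))
```

Keeping the original order within each source preserves the ranking of the individual searches while still giving the shared data set precedence.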