NHG Health

Extracting disorders and contextual situations from clinical notes and standardising them to SNOMED CT: A hybrid approach leveraging Large Language Models and Named Entity Recognition and Linking (NER+L) tools

APAC

Artificial intelligence, Data analytics, Innovation, Research

This study addresses the challenge of preprocessing, extracting, and standardising valuable clinical information from unstructured medical texts such as clinical notes, and of anchoring the analysis of those notes to the SNOMED CT ontology. Although hospital data extraction and classification systems serve quality assurance and administrative purposes, they often fall short in providing comprehensive clinical information crucial for practice and research. To address this gap, we developed a novel hybrid approach that combines Large Language Models (specifically Phi4-14B) with MedCAT, an established Named Entity Recognition and Linking tool, to automatically extract and standardise this information using SNOMED CT as the reference ontology. Using selected publicly available unstructured medical texts (e.g., from the MIMIC IV dataset), our qualitative assessment shows promising results for accurate information extraction and standardisation. To illustrate a potential real-world application, we describe how this approach could be applied as clinical decision support for patient care pathway optimisation.

Description

Background:
Clinical case notes contain rich, detailed information about patient care, such as comorbidities, signs and symptoms, past medical history, and relevant family history, that is often not captured elsewhere in structured formats. Although hospital data extraction and classification systems serve quality assurance and administrative purposes, they often fall short in providing comprehensive clinical information crucial for practice and research. Recent advancements in artificial intelligence and big data analytics show promise in closing this gap, but significant challenges remain. Not only must the accuracy of extractions and classifications improve further, but the extractions and classifications also need to be anchored within a widely recognised medical ontology to enhance the accuracy, standardisation, and usability of the extracted data. Successfully addressing these challenges can enable larger-scale research, enhance data analytics capabilities, and strengthen evidence-based healthcare delivery, ultimately leading to better healthcare outcomes.

Objective:
Our study investigates the feasibility of a hybrid approach combining Large Language Models (LLMs) with MedCAT, an established Named Entity Recognition and Linking (NER+L) tool, to systematically preprocess, extract and standardise clinical disorders and contextual situations from unstructured clinical notes, anchoring the extracted clinical information to SNOMED CT clinical terminology. This hybrid approach leverages the advanced natural language processing capabilities of LLMs for information extraction, alongside MedCAT's SNOMED CT knowledge base and mapping to ensure standardised medical terminology coding. By systematically extracting clinical information and anchoring it to a standardised medical ontology knowledge graph, this hybrid approach could enhance clinical decision support, facilitate knowledge discovery, and support downstream data analytics and research applications. To demonstrate the practical utility of our hybrid approach in enhancing healthcare delivery and outcomes, we will share a potential application of this workflow in a clinical decision support tool for patient care pathway optimisation.

Methods:
We developed and experimented with several workflow variants that use Phi4-14B (an LLM), run locally, to preprocess clinical notes and extract clinical disorder and contextual situation information from them, and that use MedCAT (an NER+L tool based on SNOMED CT) to ensure that the extracted information is anchored on SNOMED CT. We conducted a qualitative assessment using purposefully selected publicly available clinical notes (e.g., from the Medical Information Mart for Intensive Care (MIMIC) IV dataset). Using SNOMED CT as a reference, we qualitatively assessed the accuracy of the extractions produced by our hybrid approach.
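The two-stage shape of such a workflow can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: `run_pipeline`, `toy_extract`, `toy_link` and `TOY_LEXICON` are hypothetical names, the extractor is a keyword stand-in for a locally run LLM, and the linker is a dictionary stand-in for an NER+L tool (in practice it might wrap a MedCAT model pack). The two concept identifiers are given for illustration only and should be checked against a current SNOMED CT release.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class LinkedConcept:
    mention: str                # text span proposed by the extraction step
    concept_id: Optional[str]   # SNOMED CT concept id, or None if unlinked


# Hypothetical stand-in for the linker's concept database (illustrative ids).
TOY_LEXICON = {
    "diabetes mellitus": "73211009",
    "myocardial infarction": "22298006",
}


def toy_extract(note: str) -> List[str]:
    """Stand-in for the LLM step: return disorder phrases found in the note."""
    candidates = ("diabetes mellitus", "myocardial infarction", "frailty")
    lowered = note.lower()
    return [phrase for phrase in candidates if phrase in lowered]


def toy_link(mention: str) -> Optional[str]:
    """Stand-in for the NER+L step: look the mention up in the toy lexicon."""
    return TOY_LEXICON.get(mention.lower())


def run_pipeline(
    note: str,
    extract: Callable[[str], List[str]],
    link: Callable[[str], Optional[str]],
) -> List[LinkedConcept]:
    """Extract mentions first, then anchor each one to a concept id."""
    return [LinkedConcept(m, link(m)) for m in extract(note)]


note = "Known diabetes mellitus. Admitted with myocardial infarction. Frailty noted."
results = run_pipeline(note, toy_extract, toy_link)
unlinked = [r.mention for r in results if r.concept_id is None]
```

Keeping the extractor and linker as separate callables means either stage can be swapped independently, and mentions that fail to link (here, "frailty") remain visible for manual review rather than being silently dropped.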

Results:
Our preliminary findings suggest that combining the complementary strengths of LLMs and NER+L tools is a promising way to preprocess, extract and standardise clinical disorders and contextual situations from unstructured clinical notes.

Scope

The study team selected SNOMED CT for our research based on several strategic considerations. First, as a healthcare cluster comprising multiple healthcare institutions, we require a comprehensive medical ontology that ensures consistent data extraction and interpretation across all our institutions. SNOMED CT is a widely adopted medical ontology that is regularly updated and maintained, and its robust concept hierarchy enables us to minimise variability in data extraction and apply uniform coding practices across different clinical settings, which is crucial for conducting meaningful cross-institutional studies.

Second, comprehensive coverage, the flexibility to adapt and expand, and the ability to cater for potential future research directions and opportunities are equally important considerations. Our current study is our initial effort to conduct case notes analyses of disorders and contextual situations using artificial intelligence. SNOMED CT, with its extensive coverage of clinical concepts, domains, and their relationships, not only provides a robust foundation to anchor our analyses, but also positions us well for future expansion of case notes analyses and applications into domains that may involve, for example, findings, procedures, substances, organisms, and social contexts. Further, regular updates by SNOMED International ensure the terminology remains current with medical advances, and where emerging clinically relevant information arises, SNOMED CT provides the flexibility to represent new or complex clinical scenarios and/or to create new concepts.

How SNOMED CT will be used

SNOMED CT plays a vital role in our case notes analysis by providing an internationally recognised medical ontology. Although Large Language Models (LLMs) can extract information from clinical notes, their outputs may not be anchored on any comprehensive medical ontology and often lack standardisation, thereby limiting usability. To address this limitation, the study team developed a workflow incorporating LLMs and MedCAT, using the latter to leverage SNOMED CT's comprehensive clinical terminology. This approach ensures that the extracted disorders, together with their relevant contextual situations (past/present, self/family), are anchored on SNOMED CT, enabling more standardised, high-quality information extraction.
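One way to keep the contextual situations (past/present, self/family) explicit before linking is to validate the LLM output against a small fixed schema. The sketch below is illustrative only: the JSON layout, the field names and the `parse_llm_output` helper are assumptions made for this example and are not taken from the study's workflow.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import List


class Temporality(Enum):
    PAST = "past"
    PRESENT = "present"


class Experiencer(Enum):
    SELF = "self"
    FAMILY = "family"


@dataclass(frozen=True)
class ContextualDisorder:
    disorder: str
    temporality: Temporality
    experiencer: Experiencer


def parse_llm_output(raw: str) -> List[ContextualDisorder]:
    """Parse a JSON array of extraction items (assumed layout).

    Items with missing fields or context values outside the two enums
    are skipped, so only well-formed records are passed on for linking.
    """
    records = []
    for item in json.loads(raw):
        try:
            records.append(
                ContextualDisorder(
                    disorder=item["disorder"].strip(),
                    temporality=Temporality(item["temporality"]),
                    experiencer=Experiencer(item["experiencer"]),
                )
            )
        except (KeyError, ValueError, TypeError, AttributeError):
            continue  # malformed item: leave it out rather than guess
    return records


# Hypothetical LLM response; the third item has an invalid temporality value.
raw = json.dumps([
    {"disorder": "asthma", "temporality": "past", "experiencer": "self"},
    {"disorder": "breast cancer", "temporality": "present", "experiencer": "family"},
    {"disorder": "gout", "temporality": "sometimes", "experiencer": "self"},
])
parsed = parse_llm_output(raw)
```

Restricting the context to closed enumerations makes malformed or free-text LLM responses easy to detect, and each validated record carries the disorder text that a linking step would then map to a SNOMED CT concept.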