Fundación Para La Investigación Biomédica Hospital Clínico San Carlos

Using Graph Databases to Harmonize Multi-Center Data with SNOMED CT

EMEA

Data quality, Research

Effective data sharing across institutions is crucial in low-prevalence disease research to achieve statistical power and improve clinical outcomes. However, independently created clinical databases often lack semantic interoperability because they use non-standardized terminologies. This project uses an implementation of SNOMED CT in a Neo4j graph database (https://github.com/IHTSDO/snomed-database-loader/) and applies graph-based algorithms to calculate semantic distances between clinical concepts. By applying the shortest-path algorithm, we significantly reduce the manual review effort required for data harmonization. This implementation supports efficient and accurate identification of equivalent or closely related clinical variables across diverse research databases, enhancing multi-center collaboration and interoperability in low-prevalence disease studies.

Description

The project addresses the challenge of harmonizing clinical research data from multiple hospitals and research centers, particularly in the context of low-prevalence diseases. Each center often develops its own variables and codes independently, resulting in fragmented data structures that complicate data sharing and collaboration. To tackle this, we deployed SNOMED CT in a graph database environment (Neo4j), using the code shared at https://github.com/IHTSDO/snomed-database-loader/, to measure semantic distances between concepts across disparate datasets. We then used Python with the neo4j library to query the Neo4j database with the shortest-path algorithm, calculating the minimum distance between pairs of concepts (i.e., the concepts assigned to different variables in different databases).
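The query step described above could look roughly like the following. This is a minimal sketch, not the project's actual code: the node label `ObjectConcept`, the property `sctid`, the relationship type `ISA`, and the 20-hop cap are all assumptions about the schema produced by the loader and should be checked against the deployed graph. The function name `concept_distance` is hypothetical.

```python
# Assumed schema: (:ObjectConcept {sctid})-[:ISA]->(:ObjectConcept).
# Verify label, property, and relationship names against your own graph.
SHORTEST_PATH_QUERY = """
MATCH (a:ObjectConcept {sctid: $source}), (b:ObjectConcept {sctid: $target})
MATCH p = shortestPath((a)-[:ISA*..20]-(b))
RETURN length(p) AS distance
"""


def concept_distance(driver, source, target):
    """Return the number of ISA hops between two concepts, or None.

    `driver` is expected to behave like a neo4j.Driver. None is returned
    when either concept is missing or no path exists within the hop cap.
    """
    # shortestPath rejects identical start and end nodes by default,
    # so the trivial case is handled without querying.
    if source == target:
        return 0
    with driver.session() as session:
        record = session.run(
            SHORTEST_PATH_QUERY, source=source, target=target
        ).single()
        return None if record is None else record["distance"]


# Example use (connection details are placeholders):
#   from neo4j import GraphDatabase
#   driver = GraphDatabase.driver("bolt://localhost:7687",
#                                 auth=("neo4j", "password"))
#   concept_distance(driver, "22298006", "57054005")
#   driver.close()
```

Note that `length(p)` counts relationships (hops) along the path, so the number of intermediate nodes between the two concepts is one less than the returned value.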

The computed distances help identify potential overlaps, reducing the manual workload involved in comparing and merging equivalent concepts. By establishing a threshold distance (i.e., the number of nodes between two concepts) to flag concepts for manual review, our approach facilitates the creation of integrated, multi-center datasets, which are critical for meaningful, statistically significant research and improved patient outcomes.
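The thresholding step can be illustrated with a short sketch. It assumes that pairs whose distance is at or below the threshold are the ones flagged for manual review, and that pairs without a path are skipped; the names `PairDistance` and `flag_for_review`, the variable names, and the default threshold of 2 are hypothetical and not taken from the project.

```python
from typing import List, NamedTuple, Optional


class PairDistance(NamedTuple):
    variable_a: str            # variable name in the first dataset
    variable_b: str            # variable name in the second dataset
    distance: Optional[int]    # semantic distance, None if no path found


def flag_for_review(pairs: List[PairDistance],
                    max_distance: int = 2) -> List[PairDistance]:
    """Keep pairs close enough to be harmonization candidates.

    Pairs with no known distance are dropped; the rest are returned
    closest first so reviewers see the likeliest matches at the top.
    """
    flagged = [
        p for p in pairs
        if p.distance is not None and p.distance <= max_distance
    ]
    return sorted(flagged, key=lambda p: p.distance)


# Illustrative input only; these are not project results.
example = [
    PairDistance("mi_history", "acute_mi", 1),
    PairDistance("mi_history", "smoker", 9),
    PairDistance("diabetes", "dm_type2", 2),
    PairDistance("free_text", "acute_mi", None),
]
candidates = flag_for_review(example, max_distance=2)
```

The appropriate threshold depends on how finely the source variables were coded, so it is best treated as a tunable parameter rather than a fixed constant.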

Scope

SNOMED CT serves as the target terminology for standardizing variables retrospectively collected across diverse research datasets. In practice, each variable or data element from partner hospitals was mapped to a SNOMED CT concept (where possible). Then, SNOMED CT was loaded into a Neo4j graph database. This graph representation, which captures concepts and their relationships, enables the application of graph algorithms (specifically shortest-path) to compute semantic distances between the SNOMED CT codes assigned to potentially similar variables across different datasets. This analysis focused only on pre-coordinated concepts to identify candidate variables for harmonization.
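As an illustration of how mapped variables from two centers might be paired before distances are computed, the sketch below enumerates every cross-dataset combination and skips variables that could not be mapped. The dictionary layout, the function name `candidate_pairs`, and the variable names are assumptions made for this example.

```python
from itertools import product
from typing import Dict, Iterator, Optional, Tuple

# Each mapping: local variable name -> SNOMED CT concept ID (None if unmapped).
Mapping = Dict[str, Optional[str]]


def candidate_pairs(
    mapping_a: Mapping, mapping_b: Mapping
) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (var_a, code_a, var_b, code_b) for every cross-dataset pair
    in which both variables have a SNOMED CT concept assigned."""
    for (var_a, code_a), (var_b, code_b) in product(
        mapping_a.items(), mapping_b.items()
    ):
        if code_a is None or code_b is None:
            continue  # unmapped variables cannot be compared in the graph
        yield var_a, code_a, var_b, code_b


# Hypothetical mappings for two centers.
center_a = {"mi_history": "22298006", "free_text_note": None}
center_b = {"acute_mi": "57054005"}
pairs = list(candidate_pairs(center_a, center_b))
```

Because the number of pairs grows with the product of the two variable counts, restricting the comparison to variables of the same broad type before querying the graph may be worthwhile for larger datasets.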

How SNOMED CT will be used

SNOMED CT was selected as the foundational terminology because of its international recognition, its comprehensive clinical coverage, and its potential to enable true semantic interoperability for healthcare data exchange and secondary use in research, in line with European recommendations. Although its granularity is known to cause coding variability, its rich, hierarchical, and multi-axial ontological structure is well suited to graph-based analysis. This structure allows us to computationally assess semantic relatedness between concepts.