Country / Region
EMEA
Tags
Clinical Practice, Data analytics, Genomics
A growing number of SNOMED CT disorder concept names include fragments of chromosome band nomenclature (CBN) of varying complexity. Using a mixture of text matching techniques, this work has identified such concepts, performed a number of analyses on the concepts returned, and considered the implications of these findings for SNOMED CT and related products. The number of disorder concepts including CBN fragments has increased significantly since 2015. The majority of CBN fragments follow a basic 'chromosome number + arm + region + band + sub-band' format, predominantly naming partial chromosome deletions or duplications implicated in rare diseases. The following findings are presented and discussed: (1) Term forms and synonymy: for many deletions and duplications, the same abnormality is represented using multiple different term forms, with no agreed 'preferred' or 'canonical' term representation. This sort of variation can be problematic, both for content management (notably redundancy/duplicate detection) and for content use (notably display, readability and discovery). SNOMED International has an opportunity to negotiate and agree standard term forms for such concepts, and the more common CBN fragment pattern 'favourites' are discussed. (2) Curation and classification: coarse-grained chromosome structure (number and arm) is already used in concept definitions. Identifying detailed CBN fragments makes possible more precise analysis of each affected concept. Opportunities for such an approach are presented, along with implications for redundancy detection and classification (including gene locus classification and annotation of standard ideogram tools). (3) Term display and interaction: the implications of CBN for concept use and readability are discussed.
Description
The scope of this project is limited to SNOMED CT disorder concepts, specifically those that include a detectable fragment of chromosome band nomenclature in at least one term (such as 778007004 | 12p12.1 microdeletion syndrome |). The overwhelming majority of such content is already correctly classified beneath 362984008 | Anomaly of chromosome pair |; however, this work seeks to investigate the variety of term patterns used for such content, detect relevant content in the wider terminology and consider how such features can be exploited to improve the product. Much of the data analysis and subsequent manipulation has been performed on the April 2025 International Edition; however, analysis was also performed on previous years (from 2003) to quantify changes in concept numbers over time. Illustrative numbers for concept growth are provided in Supplemental document Section 1.
The project originally set out to address the nature and extent of synonymy between terms that contain CBN fragments. As well as considering the findings of this aspect, the poster goes on to discuss other findings afforded by a clearer understanding of the use of CBN fragments in disorder terms, notably opportunities for improved redundancy investigation, classification, interaction, display and presentation of relevant content.
Scope
Whilst many of the disorder concepts that include CBN fragments could be studied in their originating schemes (e.g. Orphanet or OMIM), it has been both possible and valuable to undertake this work in a SNOMED CT context. It has been 'possible' because there are many tools available to undertake this work which are optimised for SNOMED CT, and because the SNOMED CT data itself is easily available in a standard form. It has been 'valuable' because it is important to see CBN-fragment-containing disorders in the context of all disorder types. Not all disorder naming conventions are alike, nor are they equally accessible to all users. CBN-fragment-containing disorder names have unusual characteristics that may need special handling and presentation if they are to be made available to all of SNOMED CT's users.
How SNOMED CT will be used
Methods and results.
Active descriptions (synonyms and definitions) of SNOMED CT disorder concepts have been isolated and analysed using a mixture of regular expression and text matching techniques in an effort to detect those terms which include fragments of chromosome band nomenclature [1]. Sample regular expressions are provided in the Supplemental document Section 2. The terms (and concepts) returned have been further processed and analysed to investigate:
* The nature and extent of synonymy between terms that contain CBN fragments.
* Improved redundancy investigation/detection and classification
* Interaction, display and presentation of relevant content.
Taking these in turn:
The nature and extent of synonymy between terms that contain CBN fragments.
Two major types of pattern were investigated: CBN patterns and surrounding term patterns. Whilst a small number of more elaborate patterns were detected, using wider features of the ISCN standard (for example 838355002 | Acute myeloid leukaemia with inv(16)(p13.1q22) | and 763796007 | Megakaryoblastic acute myeloid leukaemia with t(1;22)(p13;q13) |), the overwhelming majority of terms use a smaller set of CBN patterns. The commonest of these include a chromosome number, an arm letter ('p' or 'q') and then varying degrees of specificity from region, band and sub-band numbers. Example CBN patterns (along with examples of use) are provided in the Supplemental document Section 3.
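The basic 'chromosome number + arm + region + band + sub-band' form described above can be sketched as a regular expression. This is a minimal illustration only, not the expressions used in this work (those are in Supplemental document Section 2); the names `CBN_FRAGMENT` and `find_cbn_fragments` are hypothetical, and the sketch deliberately ignores the more elaborate ISCN forms such as inv(...) and t(...).

```python
import re

# Hypothetical pattern for the basic CBN form only (an assumption, not the
# project's own expression).
CBN_FRAGMENT = re.compile(
    r"""
    (?<![\w.])                            # do not start inside a longer token
    (?P<chromosome>1\d|2[0-2]|[1-9]|X|Y)  # chromosome 1-22, X or Y
    (?P<arm>[pq])                         # short (p) or long (q) arm
    (?:
        (?P<band>\d{1,2})                 # region digit, optionally a band digit
        (?:\.(?P<sub_band>\d{1,2}))?      # optional sub-band after the dot
    )?
    (?!\w)                                # do not stop inside a longer token
    """,
    re.VERBOSE,
)


def find_cbn_fragments(term):
    """Return one dict per basic CBN fragment found in a term."""
    return [
        dict(match.groupdict(), text=match.group(0))
        for match in CBN_FRAGMENT.finditer(term)
    ]


if __name__ == "__main__":
    for example in (
        "12p12.1 microdeletion syndrome",
        "Distal monosomy 15q syndrome",
        "Acute myeloid leukaemia with inv(16)(p13.1q22)",
    ):
        print(example, "->", find_cbn_fragments(example))
```

Under these assumptions the first two terms each yield a single fragment ('12p12.1' and '15q'), while the inv(16) term yields none, because its band descriptors are not prefixed by a chromosome number.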
Surrounding term patterns were, understandably, more numerous. The long list of surrounding term patterns is provided in the Supplemental document Section 4; the commonest are listed below (showing [frequency of occurrence] and substituting # for any CBN fragment):
* # deletion syndrome [22]
* # microdeletion syndrome [76]
* # microduplication syndrome [28]
* # partial trisomy syndrome [25]
* deletion of part of chromosome # [22]
* distal duplication # [21]
* distal monosomy # [28]
* distal trisomy # [31]
* duplication of chromosome # [22]
* monosomy # [67]
* partial trisomy of chromosome # [22]
* ring chromosome # syndrome [23]
* trisomy # [25]
As can be seen, the majority of common term patterns refer to deletion/monosomy and duplication/trisomy chromosomal abnormalities.
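One way to derive surrounding term patterns of this kind is to substitute '#' for each detected CBN fragment and count the results. The following is a sketch under stated assumptions: the compact expression `CBN` and the functions `surrounding_pattern` and `pattern_counts` are hypothetical, and lower-casing the whole pattern is a simplification that would also lower-case any eponyms.

```python
import re
from collections import Counter

# Hypothetical compact expression for basic CBN fragments (an assumption).
CBN = re.compile(
    r"(?<![\w.])(?:1\d|2[0-2]|[1-9]|X|Y)[pq](?:\d{1,2}(?:\.\d{1,2})?)?(?!\w)"
)


def surrounding_pattern(term):
    """Replace each CBN fragment with '#'; return None if the term has none."""
    if not CBN.search(term):
        return None
    return CBN.sub("#", term).lower()


def pattern_counts(terms):
    """Count how often each surrounding term pattern occurs."""
    patterns = (surrounding_pattern(term) for term in terms)
    return Counter(pattern for pattern in patterns if pattern is not None)


if __name__ == "__main__":
    sample = [
        "12p12.1 microdeletion syndrome",
        "Monosomy 13q34 syndrome",
        "Distal monosomy 13q syndrome",
    ]
    for pattern, count in pattern_counts(sample).most_common():
        print(f"{pattern} [{count}]")
```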
In total, terms associated with 701 concepts were identified in the April 2025 data (representing 0.8% of all disorder concepts). Whilst the 170 surrounding term patterns include a long tail of terms that only appear once, analysis of the more frequent patterns reveals a smaller number (approximately 45) of clusters of concepts where the more common surrounding term patterns are used synonymously. A summary visual representation of the clustering is provided in Supplemental document Section 5, and a single example from this diagram is highlighted in subsequent figures to illustrate the approach taken. These show the multiple (15) ways that 'monosomy' of a 'chromosome + arm + region + band' CBN pattern may be represented in SNOMED CT, with each edge on the diagram showing a single instance of synonymy, and each class showing how often each pattern is used.
Concentrating on the commonest six clusters, we can see that whilst some term patterns are used more frequently, none are identified as standard or canonical representations, that is, no single term pattern is reliably associated with a particular concept pattern. For example, on 16 occasions, disorders of the form 'monosomy' of a 'chromosome + arm + region + band' (cluster 16) are assigned '# microdeletion syndrome' as a preferred term (with this pattern used as a synonym on two occasions), however this leaves eight concepts in this group that would not be matched by this term pattern, with implications for reliable search and display. The full details of these six clusters are available in the Supplemental document Section 6.
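The idea of treating each instance of synonymy as an edge between two term patterns, and then grouping connected patterns into clusters, can be sketched as follows. This is an illustrative assumption about how such a graph might be built, not the procedure used to produce the diagrams in the Supplemental document; the function names and the concept keys 'c1' to 'c3' are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def synonymy_edges(patterns_by_concept):
    """Count pattern pairs that co-occur as terms of the same concept."""
    edges = Counter()
    for patterns in patterns_by_concept.values():
        for pair in combinations(sorted(set(patterns)), 2):
            edges[pair] += 1
    return edges


def pattern_clusters(edges):
    """Group patterns into connected components of the synonymy graph."""
    neighbours = defaultdict(set)
    for first, second in edges:
        neighbours[first].add(second)
        neighbours[second].add(first)
    seen, clusters = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


if __name__ == "__main__":
    # Hypothetical input: concept key -> surrounding term patterns of its terms.
    example = {
        "c1": ["monosomy #", "# microdeletion syndrome"],
        "c2": ["monosomy #", "# deletion syndrome"],
        "c3": ["trisomy #", "# microduplication syndrome"],
    }
    for cluster in pattern_clusters(synonymy_edges(example)):
        print(sorted(cluster))
```

In this toy input the deletion/monosomy patterns fall into one cluster and the duplication/trisomy patterns into another, which mirrors the grouping discussed above without reproducing the actual data.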
It is hoped that identification of standard terms for concepts of this type will add a lossless standardisation to these concepts in a SNOMED CT context, be acceptable to terminology users, improve search, display and manipulation characteristics, and add value to this complex area of disorder representation.
Improved redundancy investigation/detection and classification.
Using the synonymy identified in the earlier section it is possible to expose covert duplication in the SNOMED CT data. For example, since within the concept 766050000 | Distal monosomy 15q syndrome | that term is currently synonymous with 'Monosomy 15q26', we might reasonably ask what the relationship is between the two distinct concepts 763527007 | Distal monosomy 13q syndrome | and 766716004 | Monosomy 13q34 syndrome |. Currently such 'duplicates' are likely to be false positives (since much of the implicated content is derived from carefully curated source products), but if content is added from multiple sources it is important that there are systematic mechanisms for detection.
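A systematic check of this kind might, as a first pass, group concepts by abnormality type, chromosome and arm, and flag any group containing more than one concept for human review. The sketch below is a hypothetical heuristic, not the method used in this work: the `ABNORMALITY` mapping, the `review_key` function and the coarse 'loss'/'gain' labels are all assumptions, and the groups it returns are candidates for inspection rather than confirmed duplicates.

```python
import re
from collections import defaultdict

# Hypothetical compact expression for basic CBN fragments (an assumption).
CBN = re.compile(
    r"(?<![\w.])(?P<chrom>1\d|2[0-2]|[1-9]|X|Y)(?P<arm>[pq])"
    r"(?:\d{1,2}(?:\.\d{1,2})?)?(?!\w)"
)

# Assumed coarse grouping of abnormality words into loss or gain of material.
ABNORMALITY = {
    "monosomy": "loss", "deletion": "loss", "microdeletion": "loss",
    "trisomy": "gain", "duplication": "gain", "microduplication": "gain",
}


def review_key(term):
    """Return (abnormality, chromosome, arm) for a term, or None if unclear."""
    match = CBN.search(term)
    if match is None:
        return None
    words = re.findall(r"[a-z]+", term.lower())
    kinds = {ABNORMALITY[word] for word in words if word in ABNORMALITY}
    if len(kinds) != 1:
        return None
    return (kinds.pop(), match["chrom"], match["arm"])


def candidate_groups(terms_by_concept):
    """Return keys shared by two or more distinct concepts."""
    groups = defaultdict(set)
    for concept_id, terms in terms_by_concept.items():
        for term in terms:
            key = review_key(term)
            if key is not None:
                groups[key].add(concept_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


if __name__ == "__main__":
    example = {
        "763527007": ["Distal monosomy 13q syndrome"],
        "766716004": ["Monosomy 13q34 syndrome"],
        "778007004": ["12p12.1 microdeletion syndrome"],
    }
    print(candidate_groups(example))
```

Grouping at arm level is deliberately broad: it will surface the 13q pair discussed above, but it will also surface many legitimate distinct concepts, so any practical use would need finer band-level comparison and editorial judgement.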
It is also possible to identify disorders named for chromosome abnormalities (e.g. 205649008 | Trisomy 8 (disorder) |) that are not currently classified as kinds of 362984008 | Anomaly of chromosome pair |.
Furthermore, early analysis suggests that many of the CBN fragments in text definitions are presented as the loci of discrete genes and gene abnormalities rather than as karyotypic abnormalities. This presents an opportunity to inform how SNOMED CT will incorporate and classify genes and gene abnormalities, in particular whether gene abnormalities are usefully treated as kinds of chromosome abnormality.
Interaction, display and presentation of relevant content
For purposes of display, there is value in considering/specifying that SNOMED CT terms should be sorted using a 'natural sort' approach. Failing to do this makes lists of terms containing any number sequences (including CBN fragments) hard to read, yet the alternative sorting approach ('ASCII-betical') is still commonly used. Example screenshots are provided in the Supplemental document Section 7 to illustrate this phenomenon.
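The difference between the two sorting approaches can be shown with a short sketch. The function name `natural_key` is hypothetical and the three terms are illustrative strings following the 'trisomy #' pattern rather than quoted SNOMED CT descriptions; the approach shown (splitting each term into digit and non-digit runs and comparing the digit runs as integers) is one common way to implement a natural sort.

```python
import re


def natural_key(term):
    """Sort key that compares runs of digits numerically, not character by character."""
    parts = re.split(r"(\d+)", term)
    # re.split with a capturing group puts the digit runs at the odd indices.
    return [
        int(part) if index % 2 else part.lower()
        for index, part in enumerate(parts)
    ]


if __name__ == "__main__":
    terms = ["Trisomy 10p", "Trisomy 2p", "Trisomy 1q"]
    print(sorted(terms))                   # 'ASCII-betical' order
    print(sorted(terms, key=natural_key))  # natural order
```

With the default sort, 'Trisomy 10p' is listed before 'Trisomy 1q' and 'Trisomy 2p'; with the natural key the order becomes 1q, 2p, 10p, which matches how a reader would expect chromosome numbers to be ordered.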
Identifying CBN fragments also allows an interaction between relevant SNOMED CT concepts and chromosome visualisation tools. It is, for example, relatively straightforward to associate band and sub-band-specific CBN fragments with karyotype visualisation tools such as ideogram.js [2]. This, in turn, allows interaction with the terminology using a familiar 'chromosomal layout'. Attempting to 'read' the more complex CBN fragments is known to be cognitively testing [3], and in these situations more visual interaction techniques may well be preferable, and might be considered optional browser features where appropriate. An example screenshot is provided in Supplemental document Section 8.
[1] The chromosome band nomenclature is one part of a much more complex cytogenomic reporting standard: ISCN 2024: An International System for Human Cytogenomic Nomenclature (2024). Cytogenet Genome Res 2024;164:1–224. doi: 10.1159/000538512.
[2] https://github.com/eweitz/ideogram
[3] de Chambrier AF, Pedrotti M, Ruggeri P, Dewi J, Atzemian M, Thevenot C, Martinet C, Terrier P. Reading numbers is harder than reading words: An eye-tracking study. Acta Psychol (Amst). 2023 Jul;237:103942. doi: 10.1016/j.actpsy.2023.103942. Epub 2023 May 19. PMID: 37210866.
Why SNOMED CT will be used
Contact


