JAMIA Open

Using natural language processing to identify symptoms in systemic mastocytosis.

Fagen Xie, Kevin Y Tse, Chantal C Avila, Matt Zhou, Mary Saparudin, Hiba Atif, Robert S Zeiger, Kerri Miller, Dakota Powell, Erin Sullivan, Ben Lampson, Eric J Puttock, Chris Yuen, Wansu Chen

Published: 2025. doi: 10.1093/jamiaopen/ooaf154

Abstract

Open Access

Background: Systemic mastocytosis (SM) is a rare disorder with heterogeneous, multisystem symptoms that often lead to diagnostic delays. Real-world symptom documentation in unstructured clinical notes may support earlier recognition, but scalable extraction methods are lacking.

Objective: To develop and validate a natural language processing (NLP) algorithm that identifies 23 potentially SM-related symptoms in unstructured electronic health record (EHR) documentation, and to assess its application across SM and comparison groups.

Methods: We conducted a retrospective study using EHR data from Kaiser Permanente Southern California (2008-2023). SM patients (n = 135) were matched 1:2:3 to chronic spontaneous urticaria (CSU) patients (n = 270) and non-SM/non-CSU controls (n = 405). A rule-based NLP algorithm was developed using manually annotated training data (339 patients, 57 495 notes) and validated against double-annotated notes from 16 SM patients (818 notes). Performance was measured by precision, recall, and F1 score. The algorithm was then applied to 118 252 notes from the full cohort.

Results: The algorithm achieved strong overall performance, with precision and recall exceeding 90% for most symptoms. Lower precision was observed for "epigastric or abdominal bloating" (77%) and "swelling" (80%) due to ambiguous or non-patient references. Overall, 15.9% of notes in the full dataset documented at least one symptom; the proportion was highest for CSU notes (19.6%), followed by SM notes (15.5%) and non-SM/non-CSU notes (11.0%). SM notes had a higher frequency of gastrointestinal (6.8%) and systemic (3.9%) symptoms, while CSU notes more often included cutaneous symptoms (13.3%). SM notes also had the greatest proportion documenting 5 or more distinct symptoms (0.7%), suggesting more complex symptom patterns.

Conclusion: This study demonstrates the feasibility of using rule-based NLP to identify SM-related symptoms from unstructured EHR narratives. The approach achieved strong performance and revealed meaningful patterns in symptom documentation. These findings support the broader utility of NLP for characterizing rare disease symptoms, informing early recognition, and enhancing data-driven care strategies.
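As an illustration of the evaluation described in the Methods, the following minimal Python sketch shows how a rule-based symptom matcher could be scored against manual annotations using precision, recall, and F1. The abstract does not describe the study's actual rules or lexicon, so everything here is an assumption made for illustration: the trigger patterns, the exclusion cues for negation and non-patient references, the sentence-splitting heuristic, and the function names `extract_symptoms` and `precision_recall_f1` are all hypothetical.

```python
import re
from typing import Dict, Hashable, Set, Tuple

# Hypothetical lexicon: two symptom labels with illustrative trigger
# patterns. These patterns are NOT taken from the study.
SYMPTOM_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "swelling": re.compile(r"\b(swelling|swollen|edema)\b", re.IGNORECASE),
    "bloating": re.compile(r"\b(bloating|bloated)\b", re.IGNORECASE),
}

# Illustrative cues for negation and non-patient references (assumed).
EXCLUSION_CUES = re.compile(
    r"\b(no|denies|denied|without|mother|father|family history)\b",
    re.IGNORECASE,
)


def extract_symptoms(note: str) -> Set[str]:
    """Return symptom labels asserted for the patient in a note.

    A sentence containing any exclusion cue is skipped entirely, which is
    a deliberately crude stand-in for real negation handling.
    """
    found: Set[str] = set()
    for sentence in re.split(r"[.;\n]+", note):
        if EXCLUSION_CUES.search(sentence):
            continue
        for symptom, pattern in SYMPTOM_PATTERNS.items():
            if pattern.search(sentence):
                found.add(symptom)
    return found


def precision_recall_f1(
    predicted: Set[Hashable], gold: Set[Hashable]
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from predicted and gold label sets."""
    tp = len(predicted & gold)   # labels found by both algorithm and annotator
    fp = len(predicted - gold)   # labels found only by the algorithm
    fn = len(gold - predicted)   # labels found only by the annotator
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

In this sketch each prediction is a (note identifier, symptom) pair, so a note-level comparison against annotations could look like `precision_recall_f1({("n1", "bloating")}, {("n1", "bloating"), ("n2", "swelling")})`. Skipping every sentence that contains an exclusion cue trades recall for precision; the lower precision the abstract reports for "swelling" and "epigastric or abdominal bloating" suggests that ambiguous and non-patient mentions are hard to filter with simple rules of this kind.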