Cureus

Use of a Novel Natural Language Processing Utility to Extract Structured Data From Free-Text Medical Notes.

John Culhane

Published: 2025. DOI: 10.7759/cureus.96838

Abstract

Open Access

Introduction Electronic medical records (EMRs) contain a large amount of clinical data potentially useful for research. A barrier to the use of this data is that much of it is unstructured, mainly in the form of free-text clinical notes. Software utilities can help extract this information and convert it into a usable form. Traditionally, this was done with rule-based natural language processing (NLP) software; more recently, large language models based on machine learning have been applied to the task.
Methods A set of software utilities called Note Language Processing (NoteLP) was developed to extract information from free-text notes and convert it to datasets in a standard format. A manual data search was performed in the Epic EMR to establish a reference standard for the time required to conduct a search manually. Twenty-two sample research projects were designed. The time needed to extract the required data using NoteLP was recorded and compared with the per-patient time that a manual search for the same data would require. Eleven of the search criteria for our sample projects that had corresponding ICD-9 codes were selected. A gold standard for accuracy was established via manual review. Sensitivity and specificity of ICD-9 and note-extracted data were calculated versus the gold standard.
Results The reference standard for a manual Epic search was established at 63.4 seconds per patient to classify one condition. NoteLP required a mean of 37.5 seconds per patient to retrieve data for the sample projects involving ICD-9 comparisons. A further set of more extensive searches required a mean of 16.6 seconds per patient (p<0.001 for all time comparisons of NoteLP versus manual search). The mean ICD-9 sensitivity was 0.65 versus the mean note sensitivity of 0.98 (p<0.001). The mean ICD-9 specificity was 0.93 versus the mean note specificity of 0.94 (p=0.65).
Conclusion The use of NoteLP is more efficient than a manual search. Its accuracy compares favorably with manual coding, achieving greater sensitivity and comparable specificity. Rule-based NLP utilities such as NoteLP remain a valuable tool for the extraction of research information from unstructured medical text.
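
The accuracy comparison described in the Methods can be illustrated with a minimal sketch. It assumes that each patient has a binary label for one condition from three sources: the gold standard (manual review), the ICD-9 codes, and the note-extracted data. Sensitivity is computed as TP / (TP + FN) and specificity as TN / (TN + FP) against the gold standard. The function name `sensitivity_specificity` and the example labels are hypothetical and are not taken from NoteLP or from the study data.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    predicted: Sequence[bool], gold: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) of predicted labels versus gold labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")

    pairs = [(bool(p), bool(g)) for p, g in zip(predicted, gold)]
    tp = sum(p and g for p, g in pairs)              # condition present, detected
    fn = sum((not p) and g for p, g in pairs)        # condition present, missed
    tn = sum((not p) and (not g) for p, g in pairs)  # condition absent, not flagged
    fp = sum(p and (not g) for p, g in pairs)        # condition absent, flagged

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("gold standard needs both positive and negative cases")

    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels for eight patients and one condition (illustration only).
gold_labels = [True, True, True, True, False, False, False, False]
icd9_labels = [True, True, False, False, False, False, False, True]
note_labels = [True, True, True, True, False, False, True, False]

icd9_result = sensitivity_specificity(icd9_labels, gold_labels)
note_result = sensitivity_specificity(note_labels, gold_labels)
print("ICD-9 (sensitivity, specificity):", icd9_result)
print("Notes (sensitivity, specificity):", note_result)
```

In a study with several search criteria, this calculation would be repeated per criterion and the resulting values averaged to give mean sensitivity and specificity for each data source.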