From BERT to GPT-4: A systematic review of AI-driven toxicity extraction and grading in radiation oncology.
Federico Mastroleo, Mariana Borras-Osorio, Shiv P Patel, Sarah Peterson, Renthony Wilson, Mohammad Javad Namazi, Mi Zhou, Satomi Shiraishi, Andrew Y K Foong, David M Routman, Mark R Waddle
Abstract
Background: Toxicity assessment is a fundamental component of radiation therapy patient management. Natural language processing (NLP) and large language models (LLMs) are transforming clinical practice by efficiently extracting and synthesizing information from electronic health records (EHRs). This systematic review evaluates the current literature on the use of NLP and LLMs to extract toxicity data from radiation oncology records.
Methods: Three databases were systematically searched on 14 March 2025 for English-language studies. Two reviewers screened the articles and extracted available data; discrepancies were resolved by a third reviewer. The review adhered to PRISMA guidelines.
Results: We identified 246 manuscripts; after screening, five studies were included. Four studies focused on identifying toxicity terms and linking them to Common Terminology Criteria for Adverse Events (CTCAE) terms, and two studies addressed severity grading or longitudinal tracking of toxicities. One study explored the ability of LLMs to summarize free text or patient surveys into concise clinician notes or chatbot responses. The included studies used transformer models (BERT, BioBERT, Clinical Longformer) for recognition and grading tasks, and rule-based systems (Apache cTAKES, IDEAL-X) that rely on dictionaries and negation detection rules for toxicity identification. GPT-4 demonstrated zero-shot summarization and response capabilities for patient-reported outcomes. All included studies were single-center. Common challenges were limited generalizability, difficulty recognizing rare or negated toxicities, privacy concerns, and substantial computing requirements for fine-tuning transformer-based models.
Conclusions: Current research has focused primarily on three basic tasks (toxicity term identification and CTCAE linking, severity grading or longitudinal tracking, and summarization) and three categories of models (rule-based systems, transformer models such as BERT, and GPT-4). Multi-center datasets and secure, lightweight deployment methods are needed before widespread integration into routine radiation oncology practice can be considered.