An empirical evaluation of dimensionality reduction and class balancing for medical text classification.
Arslan Jamil, Muhammad Kashif Hanif, Muhammad Umer Sarwar, Muhammad Irfan Khan
Abstract
The exponential growth of unstructured clinical text in electronic health records presents both significant opportunities and challenges for data-driven healthcare decision-making. While automated classification of clinical narratives can unlock valuable insights, the high dimensionality and inherent sparsity of textual features often degrade model performance and computational efficiency. To address this, this study presents a lightweight continuum-reduction model that compresses longitudinal patient narratives onto a low-rank manifold without degrading clinical fidelity. Three dimensionality reduction techniques, Principal Component Analysis (PCA), t-distributed Stochastic Neighbour Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP), were each coupled with the Synthetic Minority Over-sampling Technique (SMOTE). Our results demonstrate that the combination of PCA and SMOTE consistently delivers superior performance, achieving 91.2% accuracy on the MTSamples corpus under a 5-fold cross-validation protocol and a statistically significant 6.4% macro-F1 gain over the unreduced baseline for traditional classifiers. The pipeline also reduces model training time by 42% compared to unreduced baselines, enhancing computational efficiency without compromising diagnostic accuracy. These findings provide robust empirical evidence and a practical, scalable solution for healthcare institutions to deploy efficient clinical natural language processing pipelines, enabling large-scale analysis of medical narratives while preserving decision quality.
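As an illustration of the class-balancing step named in the abstract, the following is a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority neighbours. It is not the authors' implementation. The function name `smote_oversample`, its parameters, and the toy 2-D feature values (standing in for reduced components such as PCA scores) are assumptions made for this example only.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority-class neighbours.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy 2-D features standing in for reduced components (made-up values).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_oversample(minority, n_new=6, k=2, seed=42)
```

In a cross-validated setting such as the 5-fold protocol mentioned above, oversampling of this kind is normally applied to the training folds only, so that synthetic samples never leak into the held-out fold. Whether the study does exactly this is not stated in the abstract.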