Optimizing imbalanced learning with genetic algorithm.
Muhammad Usman Safder, Syed Sarib Naveed, Khawar Khurshid, Ahmad Salman, Imran Fareed Nizami
Abstract
Training AI models on imbalanced datasets with skewed class distributions poses a significant challenge, as it biases the model towards the majority class at the expense of the minority class. Various methods, such as the Synthetic Minority Over-sampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN), Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), have been employed to generate synthetic data to address this issue. However, these methods often fail to improve model performance, especially in cases of extreme class imbalance. To overcome this challenge, a novel approach to synthetic data generation is proposed that uses Genetic Algorithms (GAs) and does not require a large sample size. It aims to outperform state-of-the-art methods such as SMOTE, ADASYN, GANs and VAEs in terms of model performance. Although GAs are traditionally used for optimization tasks, they can also produce synthetic datasets that are optimized through the choice of fitness function and population initialization. Our synthetic data generation approach analyzes both the Simple and the Elitist Genetic Algorithm, and uses Logistic Regression and Support Vector Machines to evaluate the population initialization and the fitness function. Experimental results across three datasets (Credit Card Fraud Detection, PIMA Indian Diabetes, and PHONEME) demonstrate that the proposed method significantly outperforms the previous techniques on commonly used performance metrics, including accuracy, precision, recall, F1-score, ROC-AUC, and average precision (AP). This highlights the potential of GAs in the development of accurate and reliable AI models for imbalanced datasets.
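To make the idea of GA-based oversampling concrete, the following is a minimal sketch of an elitist GA that evolves synthetic minority-class samples. It is not the authors' implementation. The abstract does not specify the fitness function or the initialization scheme, so everything here is an assumption made for illustration: the function names `fitness` and `elitist_ga_oversample`, the parameters `elite_frac` and `mutation_scale`, the jittered-copy initialization, and in particular the distance-based fitness, which stands in for the Logistic Regression and SVM based evaluation the abstract mentions so that the example needs only the Python standard library.

```python
import math
import random


def fitness(candidate, minority, majority):
    """Hypothetical fitness in [0, 1] (an assumption, not the paper's definition).

    Scores a candidate higher the closer it lies to its nearest real
    minority sample relative to its nearest real majority sample.
    """
    d_min = min(math.dist(candidate, p) for p in minority)
    d_maj = min(math.dist(candidate, p) for p in majority)
    return d_maj / (d_min + d_maj + 1e-12)  # epsilon guards against 0/0


def elitist_ga_oversample(minority, majority, n_synthetic, generations=30,
                          elite_frac=0.2, mutation_scale=0.1, seed=0):
    """Evolve n_synthetic candidate minority samples with an elitist GA."""
    rng = random.Random(seed)

    # Population initialization (assumed): jittered copies of real minority samples.
    population = [
        [x + rng.gauss(0.0, mutation_scale) for x in rng.choice(minority)]
        for _ in range(n_synthetic)
    ]
    n_elite = max(1, int(elite_frac * n_synthetic))

    for _ in range(generations):
        # Rank candidates from fittest to least fit.
        ranked = sorted(population,
                        key=lambda c: fitness(c, minority, majority),
                        reverse=True)
        # Elitism: the best candidates survive unchanged.
        elites = ranked[:n_elite]
        # Parents are drawn from the fitter half of the population.
        parents = ranked[:max(2, n_synthetic // 2)]

        children = []
        while len(elites) + len(children) < n_synthetic:
            p1, p2 = rng.sample(parents, 2)
            # Uniform crossover: each feature comes from either parent.
            child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            # Gaussian mutation on every feature.
            child = [g + rng.gauss(0.0, mutation_scale) for g in child]
            children.append(child)

        population = elites + children

    return population
```

Under these assumptions, calling `elitist_ga_oversample(minority, majority, n_synthetic=10)` with two lists of feature vectors returns ten synthetic vectors that could be appended to the minority class before training. A Simple (non-elitist) GA would correspond to dropping the `elites` carry-over so that every generation is built entirely from offspring.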