HemPepPred: Quantitative Prediction of Peptide Hemolytic Activity Based on Machine Learning and Protein Language Model-Derived Features.
Xiang Li, Wanting Zhao, Xiao Liang, Xinlan Zhuo, Shuang Yu, Guizhao Liang
Abstract
Accurate prediction of hemolytic peptides is essential for peptide safety evaluation and therapeutic design; however, existing models remain constrained by limited accuracy and interpretability. To overcome these challenges, we propose a regression framework that integrates embeddings from a protein language model with handcrafted amino acid descriptors. Specifically, sequence representations derived from the ESM2_t33 model are fused with physicochemical amino acid descriptor features, and key predictive variables are selected through a three-stage strategy involving variance filtering, F-test ranking, and mutual information analysis. The final ensemble model, composed of Random Forest, Extremely Randomized Trees, Gradient Boosting, eXtreme Gradient Boosting (XGBoost), and Ridge Regression, achieved a coefficient of determination (R²) of 0.57 and a correlation coefficient (R) of 0.76 on the test set, outperforming previous approaches. To enhance interpretability, we applied Shapley value analysis and the Calibrated_Explanation algorithm to quantify feature contributions and generate reliable sample-specific explanations. The trained model has been deployed online as HemPepPred, a tool for predicting hemolytic concentration (HC50) values, which provides a practical platform for rational peptide design and safety assessment.
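The two test-set metrics reported above measure different things, which a small calculation can make concrete. The following is a minimal sketch, assuming the standard definitions of the coefficient of determination and the Pearson correlation coefficient; the function names and the numeric values are hypothetical illustrations and are not taken from the HemPepPred data or code.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pearson_r(y_true, y_pred):
    """Pearson correlation between observed and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    cov = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))
    var_true = sum((t - mean_true) ** 2 for t in y_true)
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    return cov / math.sqrt(var_true * var_pred)


# Made-up observed and predicted values (NOT from the paper).
# The predictions carry a constant offset of +0.5.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.5, 3.5, 4.5]

r2 = r_squared(y_true, y_pred)
r = pearson_r(y_true, y_pred)
print(f"R^2 = {r2:.2f}, R = {r:.2f}")
```

In this toy case the constant offset leaves R at 1.00 but lowers R² to 0.80, because R only captures linear association while R² also penalizes systematic error in the predicted values. This is why the two figures are usually reported together for a regression model.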