Two-Stage Probability-Enhanced Regression on Property Matrices and LLM Embeddings Enables State-of-the-Art Prediction of Gene Knockdown by Modified siRNAs.
Ivan Golovkin, Denis Shatkovskii, Nikita Serov
Abstract
Six small interfering RNAs (siRNAs) have been approved as therapeutics since 2018, and their selective gene knockdown activity makes them promising nanosystems. siRNA design is complex and depends on many factors, among which chemical modifications are crucial for improving half-life and stability. Machine learning (ML) has enabled more efficient analysis of siRNA data, including the prediction of efficacy and off-target effects. This work proposes a novel pipeline for predicting the gene knockdown activity of chemically modified siRNAs across the whole range of activities, leveraging both descriptors derived from siRNA chemical composition-aware property matrices and large language model (LLM) embeddings for target gene encoding. Several general-purpose and domain-specific fine-tuned LLMs were benchmarked on the target task, and the general-purpose Mistral 7B model slightly outperformed even the models pre-trained on genomic data. The proposed two-stage probability-enhanced model successfully mitigates data imbalance toward moderately to highly active constructs and achieves state-of-the-art (SOTA) quality, with R² = 0.84 and an RMSE of 12.27% on unseen data. The probabilistic outputs of classifiers, which reach F-scores of up to 0.92, were used to supervise the regression model. Moreover, leave-one-gene-out (LOGO) experiments show that the model is able to extrapolate to unseen genes, which further supports the representativeness of the siRNA features and gene embeddings. By filling the gap in advanced chemical composition-aware siRNA design, our model aims to improve the efficacy of siRNA-based therapies.
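As an illustration of the evaluation protocol named above, the following is a minimal sketch of leave-one-gene-out splitting together with the R² and RMSE metrics, using only the Python standard library. The function names (`leave_one_gene_out`, `r2_score`, `rmse`) and the toy gene labels are assumptions made for this sketch and are not taken from the authors' code.

```python
import math
from collections import defaultdict


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def leave_one_gene_out(genes):
    """Yield (held_out_gene, train_indices, test_indices) for each gene.

    `genes` holds one target-gene label per siRNA sample; every sample of
    the held-out gene goes to the test set, so the gene is unseen in training.
    """
    by_gene = defaultdict(list)
    for idx, gene in enumerate(genes):
        by_gene[gene].append(idx)
    for held_out, test_idx in by_gene.items():
        train_idx = sorted(
            i for g, idxs in by_gene.items() if g != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


# Hypothetical toy labels, used only to show the shape of the splits.
toy_genes = ["GENE_A", "GENE_A", "GENE_B", "GENE_C", "GENE_B"]
for gene, train_idx, test_idx in leave_one_gene_out(toy_genes):
    print(gene, train_idx, test_idx)
```

In such a setup, a model would be refit on `train_idx` for every fold and scored on `test_idx`, so each reported metric reflects a gene that was absent from training.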