A hybrid unsupervised methodology on artificial intelligence filtering for automatically processing cellular DNA-encoded library (DEL) datasets.
Yiran Huang, Xiao Tan, Xiaoyu Li, Feng Xiong, Siu Ming Yiu
Abstract
MOTIVATION: DNA-encoded library (DEL) technology has become a powerful platform for drug development. Live cell-based selection methods were recently developed to expedite drug candidate discovery with higher biological relevance. Nevertheless, hit characterization is hampered by the prominent background signals of cell-based selections. An automated data processing pipeline that tolerates noisy sequencing output is therefore highly desirable.
RESULTS: Herein, we report an automated method that identifies the most promising hits from large cell-based DEL datasets with improved accuracy and efficiency. The processing workflow is based on a comprehensive unsupervised algorithm that incorporates data pre-processing, feature extraction and outlier filtering, descriptor-based classification, similarity score ranking, and active compound prediction. We developed the method using two DEL selection datasets targeting the insulin receptor (INSR) on live cells, drawn from libraries of ∼30 million and 1.033 billion members. The automated scheme showed high consistency with experimental results and adapted to on-cell DEL datasets of different library scales. Applying the method to the cellular thrombopoietin receptor (TPOR) further supported its ability to generalize across target proteins. This approach can therefore serve as a widely applicable workflow for automatically differentiating hit compounds, facilitating drug development from the candidate discovery stage onward.
AVAILABILITY AND IMPLEMENTATION: The complete datasets, source code, and pre-trained models are available at https://doi.org/10.5281/zenodo.17452392 and https://doi.org/10.5281/zenodo.17569557.
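The abstract names outlier filtering as one stage of the workflow but does not specify how it is done. As a rough illustration of the general idea only, and not the authors' algorithm, the Python sketch below scores each compound by a depth-normalized log2 enrichment between a target selection and a control selection, then flags compounds whose modified z-score (median and median absolute deviation) exceeds a cutoff. The function names, the pseudocount, the 3.5 cutoff, and the toy read counts are all assumptions made for this example.

```python
import math
from statistics import median


def log2_enrichment(target, control, pseudocount=1.0):
    """Return log2(target frequency / control frequency) per compound.

    Counts are normalized by total reads so that differences in
    sequencing depth do not masquerade as enrichment.
    """
    ids = sorted(set(target) | set(control))
    t_total = sum(target.values()) + pseudocount * len(ids)
    c_total = sum(control.values()) + pseudocount * len(ids)
    scores = {}
    for cid in ids:
        t_freq = (target.get(cid, 0) + pseudocount) / t_total
        c_freq = (control.get(cid, 0) + pseudocount) / c_total
        scores[cid] = math.log2(t_freq / c_freq)
    return scores


def flag_outliers(scores, threshold=3.5):
    """Return (compound, modified z-score) pairs above threshold, best first."""
    values = list(scores.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread: nothing can be called an outlier
    z = {cid: 0.6745 * (s - med) / mad for cid, s in scores.items()}
    flagged = [(cid, zs) for cid, zs in z.items() if zs > threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Toy read counts, invented for illustration only (not data from the study).
target_counts = {"A": 10, "B": 12, "C": 9, "D": 11, "E": 10, "F": 400}
control_counts = {"A": 11, "B": 10, "C": 10, "D": 12, "E": 9, "F": 10}

scores = log2_enrichment(target_counts, control_counts)
hits = flag_outliers(scores)
print([cid for cid, _ in hits])
```

With these toy counts only compound F stands out from the background. A median-based score is used here because a few strongly enriched compounds would inflate a mean and standard deviation, which matters when background signal dominates the data.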