Scaling transformers to high-dimensional sparse data: a Reformer-BERT approach for large-scale classification.
Wanxuan Li, Xinhua Li, Weihang Guo, Boyuan Gu, Jianjun Du, Ning Chi, Dan Shao, Kai Xiao, Ren Mo
Abstract
Objective: The precise identification of human cell types and their interactions is of fundamental importance in biological research. Single-cell RNA sequencing (scRNA-seq) has opened new avenues for such explorations, but manual cell type annotation from the high-dimensional molecular feature data it generates remains challenging. This study aimed to develop and evaluate a robust, large-scale pre-trained model for automated cell type classification, focusing on major cell categories in this initial work.
Methods: A novel methodology for cell type classification, named scReformer-BERT, was developed using a BERT (Bidirectional Encoder Representations from Transformers) architecture that integrates Reformer encoders. The framework underwent extensive self-supervised pre-training on substantial scRNA-seq datasets, after which supervised fine-tuning and five-fold cross-validation were performed to optimize predictive accuracy on targeted first-tier cell type classification tasks. A comprehensive ablation study was also conducted to dissect the contribution of each architectural component, and SHAP (SHapley Additive exPlanations) analysis was used to interpret the model's decisions.
Results: The performance of the proposed model was evaluated through a series of experiments on scRNA-seq data. These evaluations consistently showed that our approach classified major cell categories more accurately than several established baseline methods, despite the inherent difficulties in the field.
Conclusion: The developed large-scale pre-trained model, which combines Reformer encoders with a BERT architecture, offers an effective and interpretable solution for automated cell type classification from scRNA-seq data. Its performance suggests considerable utility in improving both the efficiency and precision of cellular identification in high-throughput genomic investigations.
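As an illustration of the five-fold cross-validation step mentioned in the Methods, the following is a minimal sketch of how cells might be partitioned into five folds. It is not the authors' implementation: the function name `stratified_k_fold`, the toy labels, and the choice to stratify by cell type are all assumptions made for this example, since the abstract does not state how the folds were built.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.

    Hypothetical helper: stratifying by label is an assumption, chosen so
    that every fold sees roughly the same mix of cell types.
    """
    rng = random.Random(seed)

    # Group cell indices by their annotated cell type.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal indices round-robin into k folds.
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    # Each fold serves once as the validation set; the rest form the training set.
    for i in range(k):
        val = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, val


# Toy example with made-up labels for 20 cells (not data from the study).
labels = ["T cell"] * 10 + ["B cell"] * 5 + ["Monocyte"] * 5
splits = list(stratified_k_fold(labels, k=5))
for fold_id, (train, val) in enumerate(splits):
    print(f"fold {fold_id}: {len(train)} training cells, {len(val)} validation cells")
```

In a setup like this, the model would be fine-tuned on each training split and scored on the matching validation split, and the five scores would then be averaged. Every cell appears in exactly one validation fold, so no cell is used to both fit and evaluate the same model.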