Heimdall: A Modular Framework for Tokenization in Single-Cell Foundation Models.
Ellie Haber, Shahul Alam, Nicholas Ho, Renming Liu, Evan Trop, Shaoheng Liang, Muyu Yang, Spencer Krieger, Jian Ma
Abstract
Foundation models trained on single-cell RNA-sequencing (scRNA-seq) data have rapidly become powerful tools for single-cell analysis. Their performance, however, depends critically on how cells are tokenized into model inputs, a design space that remains poorly understood. Here, we present Heimdall, a comprehensive framework and open-source toolkit for systematically evaluating tokenization strategies in single-cell foundation models (scFMs). Heimdall decomposes each scFM into modular components: a gene identity encoder (F_G), an expression encoder (F_E), and a "cell sentence" constructor (F_C) with submodules (order, sequence, and reduce) enabling fine-grained control and attribution. Using a transformer trained from scratch, we evaluate tokenization strategies for cell type classification across challenging transfer learning settings (cross-tissue, cross-species, and spatial gene-panel shifts) and separately assess reverse perturbation prediction. Tokenization choices show minimal impact in-distribution but are decisive under distribution shift, with F_G and order driving the largest gains and F_E providing additional improvements. Heimdall further shows how existing strategies can be recombined to enhance generalization. By standardizing evaluation and providing an extensive library, Heimdall establishes a foundation for reproducible, systematic exploration of single-cell tokenization and accelerates the development of next-generation scFMs.
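To make the decomposition concrete, the following is a minimal illustrative sketch of how a gene identity encoder, an expression encoder, and a cell sentence constructor with order, sequence, and reduce steps might fit together. It is not Heimdall's API: the function names (`f_g`, `f_e`, `f_c`), the random embedding tables, the fixed bin edges, and the specific choices (ranking genes by expression, truncating to a maximum length, summing embeddings) are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, dim, n_bins, max_len = 6, 4, 3, 4

# Hypothetical embedding tables: random stand-ins for learned weights.
gene_table = rng.normal(size=(n_genes, dim))
bin_table = rng.normal(size=(n_bins, dim))

# Hypothetical fixed bin edges on the log1p scale (yields n_bins = 3 bins).
bin_edges = np.array([1.0, 2.0])


def f_g(gene_ids):
    """Gene identity encoder: look up one embedding per gene index."""
    return gene_table[gene_ids]


def f_e(values):
    """Expression encoder: bin log1p-transformed expression, then embed the bin."""
    return bin_table[np.digitize(np.log1p(values), bin_edges)]


def f_c(expression):
    """Cell sentence constructor with order, sequence, and reduce steps."""
    # order: rank expressed genes from highest to lowest expression
    expressed = np.flatnonzero(expression > 0)
    ranked = expressed[np.argsort(-expression[expressed], kind="stable")]
    # sequence: truncate the ranked genes to a fixed maximum length
    gene_ids = ranked[:max_len]
    # reduce: combine identity and expression embeddings by summation
    tokens = f_g(gene_ids) + f_e(expression[gene_ids])
    return gene_ids, tokens


# One toy cell with six genes; zeros are treated as unexpressed.
cell = np.array([0.0, 5.0, 1.0, 0.0, 9.0, 2.0])
gene_ids, tokens = f_c(cell)
print(gene_ids, tokens.shape)
```

In this sketch each of the three components can be swapped independently (for example, replacing the binning in `f_e` with a continuous projection, or changing the ranking rule in the order step), which is the kind of modular comparison the abstract describes.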