Compositional data modeling of high-dimensional single cell RNA-seq (CoDA-hd): its advantages over commonly used normalization approaches.
Jinghan Huang, Sheung Chi Phillip Yam, K S Leung, Minghua Deng, Nelson L S Tang
Abstract
BACKGROUND: Compositional data analysis (CoDA) is an emerging statistical framework that has been extended to microbiome data, bulk RNA-seq, and cell type proportions in single-cell RNA-seq (scRNA-seq), which typically involve 50-200 components. Here, we explore the high-dimensional application of CoDA (CoDA-hd) and its various log-ratio (LR) transformations to the raw count matrix of scRNA-seq, which has over 20,000 components (e.g., protein-coding genes). scRNA-seq matrices are typically sparse and high-dimensional. Common normalization approaches such as log-normalization may lead to suspicious findings, as previously shown for trajectory inference. Although RNA-seq data are compositional by nature, the geometry of CoDA in the high-dimensional simplex is not compatible with most downstream analyses of scRNA-seq, which are based on Euclidean space. In this study, we explored: (1) the adaptability of CoDA to scRNA-seq; (2) the handling of zeros by prior log-normalization, imputation, or a specific count addition scheme; and (3) transformation to Euclidean space and compatibility with downstream analyses.
RESULTS: Our results suggest that (1) innovative count addition schemes (e.g., SGM) enable the application of CoDA to high-dimensional sparse data (i.e., scRNA-seq); (2) log-normalized data can be transformed to a CoDA LR representation; and (3) CoDA LR transformations such as the count-added centered log-ratio (CLR) had some advantages in dimension reduction visualization, clustering, and trajectory inference in the tested real and simulated datasets. CLR provided more distinct and well-separated clusters in dimension reductions, improved Slingshot trajectory inference, and eliminated the suspicious trajectory that was probably caused by dropouts.
CONCLUSIONS: We therefore conclude that CoDA may be a preferred scale-free model for handling scRNA-seq data in these downstream tasks.
Additionally, an R package, 'CoDAhd', was developed for conducting CoDA LR transformations on high-dimensional scRNA-seq data. The code for implementing CoDA-hd, along with some example datasets, is available at https://github.com/GO3295/CoDAhd.
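To make the count-added CLR idea concrete, the following is a minimal Python sketch of a centered log-ratio transformation applied to one cell's gene counts after a count is added to every entry. It is an illustration under stated assumptions, not the CoDAhd package's interface: the function name `clr_transform`, the toy count values, and the uniform pseudocount are all hypothetical, and the uniform pseudocount is only a stand-in for the count addition schemes (e.g., SGM) studied in the paper, whose details are not given here.

```python
import math


def clr_transform(counts, pseudocount=1.0):
    """Centered log-ratio (CLR) of one cell's counts.

    A uniform pseudocount is added first so that zeros have a defined log.
    ASSUMPTION: this uniform addition is a simple placeholder, not the
    paper's SGM or other count addition schemes.
    """
    shifted = [c + pseudocount for c in counts]
    if min(shifted) <= 0:
        raise ValueError("all count-added values must be strictly positive")
    logs = [math.log(v) for v in shifted]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    # Subtracting the mean log is the same as dividing by the geometric mean.
    return [v - mean_log for v in logs]


# Toy matrix (hypothetical values): rows are cells, columns are genes.
cells = [
    [0, 3, 12, 0, 5],
    [0, 6, 24, 0, 10],
]
clr_cells = [clr_transform(cell) for cell in cells]
```

Each transformed cell sums to zero, so the CLR coordinates lie in ordinary Euclidean space and can be passed to distance-based downstream steps. On strictly positive data with no count added, multiplying a cell's counts by a constant leaves its CLR values unchanged, which is the scale-free property the abstract refers to; once a count is added, that invariance holds only approximately.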