To comprehend how even the smallest details affect human biology, researchers have compiled enormous single-cell gene expression databases. Current analysis techniques, however, are unable to handle the vast amount of data and thus yield inconsistent and skewed results.
To produce more accurate results, researchers at St. Jude Children's Research Hospital developed a machine-learning algorithm that can scale with these single-cell data repositories. The journal Cell Genomics published the new technique.
Before single-cell analysis, bulk gene expression data provided high-level but imprecise results for numerous diseases. Single-cell analysis lets researchers examine individual cells of interest, which is similar to examining a single corn kernel rather than a whole field.
These in-depth insights have already led to breakthroughs in the understanding of some diseases and treatments, though progress has been hampered by the difficulty of reproducing and scaling analyses as the data keeps growing in size.
“We have implemented a new toolset that can be scaled as these single-cell RNA sequencing datasets continue to grow. There has been an exponential explosion in the compute time for single-cell analysis, and our method brings accurate analysis back into a tractable timeframe.”
Paul Geeleher, PhD, Study Corresponding Author, Department of Computational Biology, St. Jude Children’s Research Hospital
Large volumes of data are produced by every method used to investigate single-cell gene expression. The amount of computer memory and processing power required to handle the data is enormous when scientists test millions of cells at once. To address the issue, Geeleher's team looked to an alternative type of hardware.
“We created a method that uses graphics processing units, or GPUs. The GPU integration gave us the processing power to perform the computational load in a scalable way.”
Xueying Liu, PhD, Study First Author, Department of Computational Biology, St. Jude Children’s Research Hospital
Unsupervised Machine Learning for Single-Cell Analysis
With standard methods, the sheer amount of data frequently forces researchers to make assumptions and concessions that introduce biases. The artificial intelligence method employed by the St. Jude researchers removes the bias that these choices introduce.
Liu said, “Our method uses unsupervised machine learning, which automatically determines more robust and less arbitrary parameters for the analysis. It learns how to group cells based on their different active biological processes or cell type identities.”
Because the algorithm learns and derives its analysis from the data presented, researchers can apply it to any sizable single-cell RNA sequencing dataset. The researchers named the method Consensus and Scalable Inference of Gene Expression Programs (CSI-GEP) because it examines each new large dataset separately and draws conclusions only from that dataset's expression program clues.
CSI-GEP outperformed all other methods tested when applied to the largest single-cell RNA databases. The algorithm recognized cell types and biological processes that other methods missed.
“We have created a tool broadly applicable to studying any disease through single-cell RNA analysis. The method performed substantially better than all existing approaches we tested, so I hope other scientists consider using it to get better value out of their single-cell data.”
Paul Geeleher, PhD, Study Corresponding Author, Department of Computational Biology, St. Jude Children’s Research Hospital
Journal reference:
Liu, X., et al. (2025) CSI-GEP: A GPU-based unsupervised machine learning approach for recovering gene expression programs in atlas-scale single-cell RNA-seq data. Cell Genomics. doi.org/10.1016/j.xgen.2024.100739.