Examining gene expression changes helps researchers understand cellular functions at a molecular level, providing insights into the onset of various diseases.
However, with humans possessing around 20,000 genes that interact in complex ways, pinpointing specific gene groups to study is challenging. Additionally, genes often function in interconnected modules, regulating each other’s activity.
Researchers at MIT have now developed theoretical foundations for methods that could optimally group genes into clusters, making it easier to identify cause-and-effect relationships between multiple genes. Crucially, the approach relies only on observational data, so researchers would not need costly or impractical experimental interventions to establish causal links.
In the future, this method could help scientists more accurately identify gene targets to influence specific behaviors, potentially leading to precise treatments for patients.
“In genomics, understanding the mechanisms behind cell states is essential. But since cells have a multiscale structure, the level of summarization is also crucial. Finding the right way to aggregate observed data can make the information about the system more interpretable and useful,” says Jiaqi Zhang, a graduate student and Eric and Wendy Schmidt Center Fellow at MIT.
Zhang collaborated with co-lead author Ryan Welch, a Master’s student in Engineering, and Senior Author Caroline Uhler, a Professor in MIT's Department of Electrical Engineering and Computer Science and the Institute for Data, Systems, and Society.
Uhler is also the Director of the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard and a researcher at MIT’s Laboratory for Information and Decision Systems. Their research will be presented at the Conference on Neural Information Processing Systems.
Learning from Observational Data
The researchers tackled the challenge of learning gene programs—how genes work together to regulate other genes in biological processes such as cell development or differentiation. Since it’s impractical to study interactions among all 20,000 genes, they used a technique called causal disentanglement to combine related gene groups into a model that simplifies cause-and-effect exploration.
Previously, the researchers demonstrated this could be done effectively with interventional data, which is collected by altering network variables. However, interventional experiments are costly and, in some cases, unethical or limited by technology.
With only observational data, researchers cannot compare genes before and after an intervention to learn how groups of genes interact.
“Most causal disentanglement research assumes access to interventions, so it was unclear how much information we could disentangle with just observational data,” explains Zhang.
The MIT researchers developed a broader approach, utilizing a machine-learning algorithm to group observed variables, such as genes, using only observational data. This technique enables them to identify causal modules and accurately reconstruct underlying cause-and-effect mechanisms.
“While our research was driven by the need to understand cellular programs, we first developed new causal theory to determine what could be learned from observational data alone. With this foundation, we can apply these insights to genetic data to identify gene modules and their regulatory relationships,” says Uhler.
A Layerwise Representation
Using statistical methods, the researchers compute the variance of the Jacobian of each variable’s score, where the score is the gradient of the log-density of the data. Causal variables that do not influence any subsequent variables should have a variance of zero.
Their approach reconstructs the representation layer by layer, removing variables with zero variance at each stage. This backward elimination helps identify which gene groups are causally linked.
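The zero-variance criterion can be illustrated on a toy example. The sketch below is not the researchers' algorithm, which works on latent variables and must estimate the score from data. It assumes a hypothetical two-variable model with additive Gaussian noise, x1 ~ N(0, 1) and x2 = x1^2 + N(0, 1), for which the Jacobian of the score can be written out by hand. The function names are illustrative only.

```python
import numpy as np

def score_jacobian_diagonal(x1, x2):
    """Diagonal of the Jacobian of the score for the assumed toy model.

    log p(x1, x2) = -x1**2 / 2 - (x2 - x1**2)**2 / 2 + const
    """
    d1 = -1.0 + 2.0 * x2 - 6.0 * x1**2   # d s_1 / d x_1, depends on the sample
    d2 = -np.ones_like(x2)               # d s_2 / d x_2, constant for every sample
    return np.stack([d1, d2], axis=1)

def find_zero_variance_variable(samples):
    """Return the index whose Jacobian entry has the smallest variance."""
    diag = score_jacobian_diagonal(samples[:, 0], samples[:, 1])
    variances = diag.var(axis=0)
    return int(np.argmin(variances)), variances

# Purely observational samples from the toy model (no interventions).
rng = np.random.default_rng(0)
x1 = rng.normal(size=5000)
x2 = x1**2 + rng.normal(size=5000)
samples = np.column_stack([x1, x2])

leaf, variances = find_zero_variance_variable(samples)
print("variances:", variances)
print("variable with no downstream effect:", leaf)
```

Here x2 influences nothing downstream, so its entry is constant across samples and its variance is zero, while the entry for x1 varies. A layer-by-layer procedure would remove x2 and repeat the test on the remaining variables.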
“Identifying which variances are zero quickly becomes a complex combinatorial task, so deriving an efficient algorithm to solve it was a major challenge,” says Zhang.
The resulting model produces an abstract representation of the observed data, organizing interconnected gene groups into layers that accurately capture cause-and-effect structures. Each variable represents a gene group working together, while relationships between variables indicate how one gene group regulates another. The method retains all information necessary to determine the structure of each variable layer.
After confirming the theoretical soundness of their approach, the researchers conducted simulations to show that their algorithm can efficiently reveal meaningful causal relationships using only observational data.
Moving forward, they plan to apply this technique to real-world genetic datasets. They also intend to explore how the method could offer additional insights when some interventional data are available or aid in designing effective genetic interventions.
Ultimately, this approach could help researchers more efficiently identify gene groups functioning together in specific programs, potentially assisting in drug discovery for targeted disease treatments.