Researchers at the University of Wisconsin–Madison warn that increasing use of artificial intelligence in genetics and medicine could lead to incorrect conclusions about the relationship between genes and physical traits, such as disease risk factors for conditions like diabetes.
The concern centers on genome-wide association studies, which search large datasets such as the UK Biobank and the NIH’s All of Us project for connections between genetic variations and physical traits, including specific diseases. Applying AI in these studies can produce inaccurate predictions. The underlying problem is that these databases often lack detailed data on some of the medical conditions that researchers wish to study.
Complexities of Genetic Disease Links
While genetics play a role in many medical conditions, the connection between genetic variations and physical traits is often complex. Genome-wide association studies have made progress in identifying some genetic links to disease by using extensive databases. However, gaps remain in the data available for certain health conditions, limiting the statistical strength of some findings.
“Certain traits are costly or difficult to measure, so we often don’t have sufficient samples to draw reliable statistical conclusions about their genetic associations,” says Qiongshi Lu, Associate Professor in Biostatistics at UW–Madison.
Risks of Relying on AI to Fill Data Gaps
To address data gaps, researchers increasingly rely on advanced machine learning models to predict complex traits and disease risks from limited data. However, Lu and his team have shown that this approach carries risks if biases in AI models are not addressed. In a recent study published in Nature Genetics, they demonstrate how a widely used machine learning technique can mistakenly link numerous genetic variants to Type 2 diabetes risk.
“If you trust the AI-predicted diabetes risk as the actual risk, you may believe all these genetic variations are correlated with diabetes, even when they are not,” Lu explains.
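A toy simulation can make this failure mode concrete. The sketch below is an illustration under invented assumptions, not the analysis from the study: the variable names, effect sizes, and the simple linear "predictor" are all hypothetical. A genetic variant `g` has no effect on the true trait `y`, but it does affect a feature `x` that the model uses to predict the trait, so the predicted trait inherits an association that the real trait does not have.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical setup: g is a variant coded 0/1/2, u is a non-genetic factor.
g = rng.binomial(2, 0.3, size=n).astype(float)
u = rng.normal(size=n)
y = u + rng.normal(size=n)            # true trait: does NOT depend on g
x = u + 0.3 * g + rng.normal(size=n)  # predictor feature: DOES depend on g


def slope_and_z(pheno, geno):
    """Simple linear regression of pheno on geno: returns (slope, z-score)."""
    gc = geno - geno.mean()
    pc = pheno - pheno.mean()
    beta = (gc @ pc) / (gc @ gc)
    resid = pc - beta * gc
    se = np.sqrt((resid @ resid) / (len(geno) - 2) / (gc @ gc))
    return beta, beta / se


# Stand-in for an ML model: a linear fit of y on x, trained on a small
# subset where the trait was actually measured.
lab = slice(0, 5_000)
a, b = np.polyfit(x[lab], y[lab], 1)
y_hat = a * x + b                     # predicted trait for everyone

beta_true, z_true = slope_and_z(y, g)
beta_pred, z_pred = slope_and_z(y_hat, g)

print(f"measured trait : slope={beta_true:.4f}, z={z_true:.2f}")
print(f"predicted trait: slope={beta_pred:.4f}, z={z_pred:.2f}")
```

In this setup the regression on the measured trait should show no meaningful signal, while the regression on the predicted trait shows a strong one. The prediction itself can be reasonably accurate and still produce the spurious association.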
New Statistical Method to Reduce AI-Generated False Positives
Lu and his colleagues not only identify the risks of over-reliance on AI tools but also propose a new statistical method that reduces false positives in AI-assisted genome-wide association studies. The approach counteracts potential biases in machine learning models and gives more reliable results in studies where measured data are limited.
“This new strategy is statistically optimal,” Lu notes, adding that they used it to more accurately identify genetic connections with bone mineral density.
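The article does not spell out how the method works, so the following is not a reproduction of it. It is a minimal sketch of one general idea for debiasing predicted traits: use a small group with both measured and predicted values to estimate how far the predictions mislead, then subtract that error from the estimate obtained in the large prediction-only group. The simulated data, sample sizes, and the fixed weight of 1 on the correction term are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_lab, n_unlab = 5_000, 100_000       # hypothetical sample sizes
n = n_lab + n_unlab

g = rng.binomial(2, 0.3, size=n).astype(float)
u = rng.normal(size=n)
y = u + rng.normal(size=n)            # true trait: no effect of g
x = u + 0.3 * g + rng.normal(size=n)  # feature that carries signal from g
y_hat = 0.5 * x                       # stand-in for a pretrained predictor


def ols_slope(pheno, geno):
    """Slope from a simple linear regression of pheno on geno."""
    gc = geno - geno.mean()
    return (gc @ (pheno - pheno.mean())) / (gc @ gc)


lab = np.arange(n) < n_lab            # True where the trait was measured

# Naive analysis: treat the predicted trait as if it were the real one.
naive = ols_slope(y_hat, g)

# Correction: start from the measured-trait estimate in the labeled group,
# then add the difference between predicted-trait estimates in the
# unlabeled and labeled groups (weight fixed at 1, an assumption).
beta_y_lab = ols_slope(y[lab], g[lab])
beta_hat_lab = ols_slope(y_hat[lab], g[lab])
beta_hat_unlab = ols_slope(y_hat[~lab], g[~lab])
corrected = beta_y_lab + (beta_hat_unlab - beta_hat_lab)

print(f"naive estimate    : {naive:.4f}")
print(f"corrected estimate: {corrected:.4f}")
```

Because the bias in the predicted trait appears in both the labeled and unlabeled estimates, it cancels in the difference, and the corrected estimate should fall near the true value of zero. How best to weight the correction term is a design choice that this sketch leaves aside.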
Beyond AI: Issues in Proxy-Based Genome-Wide Association Studies
In addition to AI-related challenges, Lu’s team found issues with studies that fill data gaps using proxy data rather than direct measurements. For instance, some researchers attempt to link genetics to Alzheimer’s disease risk by using family health history as a substitute for actual diagnostic data. This approach can lead to misleading correlations, such as an erroneous link between higher cognitive abilities and Alzheimer’s risk.
“Today’s genomic researchers often work with biobank datasets containing hundreds of thousands of individuals,” Lu explains. “While this increases statistical power, it also raises the potential for bias and error in large datasets. Our recent studies emphasize the need for rigorous statistical approaches in biobank-scale research.”
Journal references:
Miao, J., et al. (2024) Valid inference for machine learning-assisted genome-wide association studies. Nature Genetics. doi.org/10.1038/s41588-024-01934-0.
Wu, Y., et al. (2024) Pervasive biases in proxy genome-wide association studies based on parental history of Alzheimer’s disease. Nature Genetics. doi.org/10.1038/s41588-024-01963-9.