A scientist from the University of Virginia School of Medicine, in collaboration with global partners, has developed a new tool designed to streamline genomic research and accelerate the development of methods to improve human health.
Nathan Sheffield, PhD, has spent the past four years creating a data standard that helps scientists make more accurate comparisons during genomic analysis—a critical step in understanding how human cells function. By improving the clarity and consistency of genomic data, the tool enhances insight into both healthy and diseased cells, ultimately supporting the discovery of new treatments and preventive strategies.
The Genomics Challenge
Genomics involves the analysis of vast and complex datasets. This complexity is amplified by the number of researchers working across different labs and the inconsistent naming conventions used for “reference sequences”—the genetic baselines against which researchers compare individual variations.
Reference sequences typically represent a compilation of genetic data from multiple individuals. They’re essential for identifying gene variants linked to disease and for understanding how cell behavior changes in diseased states. But inconsistencies in how these references are named and tracked can lead to confusion, misinterpretation, and inefficiencies across studies.
Sheffield’s solution—called refget Sequence Collections—addresses this problem by creating a standard way to define and compare reference sequences. The tool enables researchers to quickly identify and verify the references used in their analyses, improving reproducibility and fostering more effective collaboration across the scientific community.
“Imagine a class where each student has a slightly different version of the textbook—the words vary, page numbers don’t match, and chapter titles are shuffled,” said Sheffield. “It would be hard for anyone to have a productive discussion. That’s what happens in genomics when researchers use different versions of reference sequences.”
“But if students could identify each version precisely and compare differences easily, communication and collaboration would become much simpler. That’s what refget Sequence Collections aims to do for genomic data.”
A Tool for Reproducibility and Collaboration
For many researchers, determining the exact reference sequence used in a published study can be frustrating and time-consuming. The process often requires manual guesswork, even though it is well suited to automation. Sheffield’s tool removes much of that burden by letting scientists verify that they are comparing their data against the same reference sequences.
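To give a sense of the kind of check that otherwise falls to manual guesswork, here is a minimal Python sketch that compares two references described only by sequence names and lengths. It is an illustration, not part of the refget Sequence Collections standard: the function name, the category labels, and the toy names and lengths are all hypothetical.

```python
from __future__ import annotations


def compare_references(first: dict[str, int], second: dict[str, int]) -> dict[str, list[str]]:
    """Compare two references given as {sequence name: length} mappings.

    Hypothetical helper for illustration only. It sorts every sequence name
    into one of four groups so that mismatches are visible at a glance.
    """
    shared = [name for name in first if name in second]
    return {
        "same_name_and_length": [n for n in shared if first[n] == second[n]],
        "same_name_different_length": [n for n in shared if first[n] != second[n]],
        "only_in_first": [n for n in first if n not in second],
        "only_in_second": [n for n in second if n not in first],
    }


# Toy data: the names and lengths below are made up, not real genome values.
lab_a = {"chr1": 1000, "chr2": 800, "chrM": 16}
lab_b = {"chr1": 1000, "chr2": 750, "MT": 16}

for category, names in compare_references(lab_a, lab_b).items():
    print(category, names)
```

In this toy case the two labs agree on one sequence, disagree on the length of another, and use different names ("chrM" and "MT") for what may or may not be the same sequence. Names and lengths alone cannot settle that last question, which is why identifiers derived from the sequence content itself are useful.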
The tool builds on previous work by the Global Alliance for Genomics and Health (GA4GH), a nonprofit organization that develops standards and policies for the responsible use of genomic data. GA4GH had previously introduced “refget sequences,” which assign unique identifiers to individual genetic sequences. Sheffield’s work extends this concept to whole collections of sequences, such as the full set of DNA sequences in a reference genome, so that each collection can be referred to by a single identifier.
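The idea of an identifier computed from content can be sketched in a few lines of Python. The sketch below assumes a truncated SHA-512 digest encoded as URL-safe base64, which is the general approach GA4GH describes for refget identifiers, but the function names and the layout of the collection summary are simplifying assumptions. It is not the official algorithm and should not be expected to reproduce official identifiers.

```python
from __future__ import annotations

import base64
import hashlib
import json


def truncated_digest(data: bytes) -> str:
    """SHA-512 digest, cut to 24 bytes, as a 32-character URL-safe string."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def sequence_id(sequence: str) -> str:
    """Identifier for one sequence, derived only from its letters.

    Assumption for this sketch: letters are uppercased before hashing.
    """
    return truncated_digest(sequence.upper().encode("ascii"))


def collection_id(sequences: dict[str, str]) -> str:
    """Identifier for an ordered collection given as {name: sequence}.

    Assumption for this sketch: the collection is summarized by its names,
    lengths, and per-sequence identifiers, serialized as compact JSON with
    sorted keys, and that summary is hashed.
    """
    summary = {
        "names": list(sequences),
        "lengths": [len(s) for s in sequences.values()],
        "sequences": [sequence_id(s) for s in sequences.values()],
    }
    canonical = json.dumps(summary, sort_keys=True, separators=(",", ":"))
    return truncated_digest(canonical.encode("utf-8"))


# Toy sequences, far shorter than real chromosomes.
lab_a = {"chr1": "ACGTACGT", "chr2": "GGCCTTAA"}
lab_b = {"chr1": "ACGTACGT", "chr2": "GGCCTTAA"}  # same names, same content
lab_c = {"1": "ACGTACGT", "2": "GGCCTTAA"}        # same content, other names

print(collection_id(lab_a) == collection_id(lab_b))  # True
print(collection_id(lab_a) == collection_id(lab_c))  # False
print(sequence_id(lab_a["chr1"]) == sequence_id(lab_c["1"]))  # True
```

Because the identifier depends only on what the collection contains, two labs holding the same reference arrive at the same identifier without coordinating. In this sketch, renaming the sequences changes the collection identifier while the per-sequence identifiers stay the same, so a naming difference can be told apart from a difference in the underlying DNA.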
By bringing more structure and automation to genomic workflows, Sheffield hopes this innovation will help overcome longstanding challenges in data integration, analysis, and sharing.
“I hope this standard helps solve some of the difficulties the scientific community has faced integrating genomic and epigenomic data,” he said. “With an approved, standardized way to refer to reference sequences, we can speed up discoveries by making it easier to compare results across experiments.”
This tool joins more than 40 genomic research resources developed by GA4GH collaborators and marks a key step forward in building a more efficient, connected ecosystem for genomic science.