Cloud-based platform enables easy access to global genomics databases

Reviewed by Emily Henderson, B.Sc. Jan 13 2022

Examining large numbers of genomes is expensive and time-consuming, which makes it difficult to use genomics to uncover risk factors for major diseases or to search for relatives. A team led by a computer scientist from Johns Hopkins University has created a cloud-based platform that gives researchers quick access to one of the world’s largest genomics datasets.

Michael Schatz. Image Credit: Johns Hopkins University.

The new platform, known as AnVIL (Genomic Data Science Analysis, Visualization, and Informatics Lab-space), provides access to hundreds of analysis tools, patient information, and over 300,000 genomes to any researcher with an Internet connection. The study, which was funded by the National Human Genome Research Institute (NHGRI), was published on January 12th, 2022, in the journal Cell Genomics.

“AnVIL is inverting the model of genomics data sharing, offering unprecedented new opportunities for science by connecting researchers and datasets in new ways and promising to enable exciting new discoveries. Instead of painfully moving data to researchers, we allow researchers to effortlessly move to the data in the cloud.”

Michael Schatz, Bloomberg Distinguished Professor, Computer Science and Biology, Johns Hopkins University

Michael Schatz is the project co-leader.

Genomic analysis often begins with scientists downloading huge volumes of data from centralized databases to their own data centers, a process that is not only time-consuming, inefficient, and costly, but also makes it difficult to collaborate with researchers at other universities. Because the genetic risk factors for diseases like cancer and cardiovascular disease typically have modest effects, researchers must examine the genomes of hundreds of people to find new connections.

A single human genome contains around 40 GB of raw data; hence, downloading hundreds of genomes to undertake such a study can take several days to weeks.
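To make that estimate concrete, the arithmetic can be sketched in a few lines of Python. Only the 40 GB per genome figure comes from the article; the cohort size of 500 genomes and the two link speeds are hypothetical values chosen for illustration, and the function name is ours.

```python
# Back-of-the-envelope transfer time for a multi-genome download.
GB_PER_GENOME = 40  # raw data per human genome, as quoted in the article


def download_days(num_genomes: int, link_mbit_per_s: float) -> float:
    """Days needed to move num_genomes over a link with the given sustained speed."""
    total_bits = num_genomes * GB_PER_GENOME * 1e9 * 8  # decimal GB -> bits
    seconds = total_bits / (link_mbit_per_s * 1e6)      # Mbit/s -> bit/s
    return seconds / 86_400                             # seconds per day


if __name__ == "__main__":
    # Cohort size and link speeds are assumptions, not figures from the study.
    for mbit in (100, 1_000):
        days = download_days(500, mbit)
        print(f"500 genomes at {mbit} Mbit/s: {days:.1f} days")
```

Under these assumptions, 500 genomes amount to 20 TB, which takes roughly 18.5 days at a sustained 100 Mbit/s and just under two days at 1 Gbit/s, in line with the "several days to weeks" range above. Real transfers would also be affected by protocol overhead and shared bandwidth, which this sketch ignores.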

“AnVIL will be transformative for institutions of all sizes, especially smaller institutions that don’t have the resources to build their own data centers. It is our hope that AnVIL levels the playing field, so that everyone has equal access to make discoveries.”

Michael Schatz, Bloomberg Distinguished Professor, Computer Science and Biology, Johns Hopkins University

Moreover, studies that integrate data collected at several institutions require each institution to download its own copy while preserving patient data confidentiality. As researchers take on ever-larger projects requiring the simultaneous analysis of hundreds of thousands to millions of genomes, this problem is projected to become much larger in the future.

“Connecting to AnVIL remotely eliminates the need for these massive downloads and saves on the overhead. Instead of painfully moving data to researchers, we allow researchers to effortlessly move to the data in the cloud. It also makes sharing datasets much easier so that data can be connected in new ways to find new associations, and it simplifies a lot of computing issues, like providing strong encryption and privacy for patient datasets.”

Michael Schatz, Bloomberg Distinguished Professor, Computer Science and Biology, Johns Hopkins University

AnVIL also offers scientists several important analytic tools, including Galaxy, which was created in part at Johns Hopkins, as well as other prominent tools like R/Bioconductor, Jupyter notebooks, WDLs, Gen3, and Dockstore, which can be used for both interactive analysis and large-scale batch computing. Taken together, these technologies enable researchers to undertake even the most complex projects without having to build their own computing infrastructure.

The AnVIL platform is currently being used by researchers around the world to investigate a number of genetic illnesses, including autism spectrum disorders, cardiovascular disease, and epilepsy. Schatz’s Telomere-to-Telomere Consortium team used it to reanalyze hundreds of human genomes with the new reference genome and found over one million additional variants.

The AnVIL team has gathered petabytes of data (one petabyte equals one million GB) from various NHGRI projects, including hundreds of thousands of genomes from the Genotype-Tissue Expression project, the Centers for Mendelian Genetics, and the Centers for Common Disease Genomics, with plans to host many more studies in the near future.

Source:

Johns Hopkins University

Journal reference:

Schatz, M. C., et al. (2022) Inverting the model of genomics data sharing with the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space. Cell Genomics. https://doi.org/10.1016/j.xgen.2021.100085
