Examining large numbers of genomes is expensive and time-consuming, which makes it difficult to use genomics to uncover risk factors for major diseases or to search for relatives. A team led by a computer scientist from Johns Hopkins University has created a cloud-based platform that gives researchers quick access to one of the world’s largest genomics datasets.
Michael Schatz. Image Credit: Johns Hopkins University.
The new platform, known as AnVIL (Genomic Data Science Analysis, Visualization, and Informatics Lab-space), provides access to hundreds of analysis tools, patient information, and over 300,000 genomes to any researcher with an Internet connection. The study, which was funded by the National Human Genome Research Institute (NHGRI), was published on January 12, 2022, in the journal Cell Genomics.
“AnVIL is inverting the model of genomics data sharing, offering unprecedented new opportunities for science by connecting researchers and datasets in new ways and promising to enable exciting new discoveries. Instead of painfully moving data to researchers, we allow researchers to effortlessly move to the data in the cloud.”
Michael Schatz, Bloomberg Distinguished Professor, Computer Science and Biology, Johns Hopkins University
Michael Schatz is the project co-leader.
Genomic analysis often begins with scientists downloading huge volumes of data from centralized databases to their own data centers. This procedure is time-consuming, inefficient, and costly, and it also makes collaborating with researchers at other universities difficult. Because genetic risk factors for diseases like cancer and cardiovascular disease are typically modest, researchers must examine the genomes of hundreds of people to find new connections.
A single human genome contains around 40 GB of raw data; hence, downloading hundreds of genomes to undertake such a study can take several days to weeks.
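The arithmetic behind that estimate can be sketched in a few lines of Python. Only the 40 GB per genome figure comes from the article; the function name `download_days`, the cohort size of 500 genomes, and the sustained link speeds of 100 Mbps and 1 Gbps are illustrative assumptions, and real transfers also depend on protocol overhead and server load.

```python
def download_days(num_genomes, gb_per_genome=40.0, link_mbps=100.0):
    """Estimate the days needed to download raw genome data.

    Assumes decimal units (1 GB = 1000 MB) and a link that sustains
    link_mbps megabits per second for the whole transfer.
    """
    total_megabits = num_genomes * gb_per_genome * 1000 * 8  # GB -> MB -> megabits
    seconds = total_megabits / link_mbps
    return seconds / 86400  # seconds per day


# Hypothetical cohort of 500 genomes at two assumed link speeds.
for mbps in (100.0, 1000.0):
    days = download_days(500, link_mbps=mbps)
    print(f"{mbps:.0f} Mbps: {days:.1f} days")
```

Under these assumptions, 500 genomes amount to 20,000 GB, which takes roughly 18.5 days at 100 Mbps and just under two days at 1 Gbps, consistent with the range of several days to weeks.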
“AnVIL will be transformative for institutions of all sizes, especially smaller institutions that don’t have the resources to build their own data centers. It is our hope that AnVIL levels the playing field, so that everyone has equal access to make discoveries.”
Michael Schatz, Bloomberg Distinguished Professor, Computer Science and Biology, Johns Hopkins University
Moreover, in investigations that integrate data acquired at several institutions, each institution must download its own copy while preserving the confidentiality of patient data. As researchers take on ever-larger projects requiring the simultaneous analysis of hundreds of thousands to millions of genomes, this problem is projected to become much larger in the near future.
“Connecting to AnVIL remotely eliminates the need for these massive downloads and saves on the overhead. Instead of painfully moving data to researchers, we allow researchers to effortlessly move to the data in the cloud. It also makes sharing datasets much easier so that data can be connected in new ways to find new associations, and it simplifies a lot of computing issues, like providing strong encryption and privacy for patient datasets.”
Michael Schatz, Bloomberg Distinguished Professor, Computer Science and Biology, Johns Hopkins University
AnVIL also offers scientists several important analytic tools, including Galaxy, which was created in part at Johns Hopkins, as well as other prominent tools like R/Bioconductor, Jupyter notebooks, WDL workflows, Gen3, and Dockstore, which can be used for both interactive analysis and large-scale batch computing. Taken together, these technologies enable researchers to undertake even the most complex projects without having to build their own computing infrastructure.
The AnVIL platform is presently being used by researchers from all around the world to investigate a number of genetic illnesses, including autism spectrum disorders, cardiovascular disease, and epilepsy. Schatz’s Telomere-to-Telomere Consortium team used it to reanalyze hundreds of human genomes with the new reference genome and find over one million additional variants.
The AnVIL team has gathered petabytes of data (one petabyte equals one million GB) from various NHGRI projects, including hundreds of thousands of genomes from the Genotype-Tissue Expression project, the Centers for Mendelian Genetics, and the Centers for Common Disease Genomics, with plans to host many more studies in the near future.
Journal reference:
Schatz, M. C., et al. (2022) Inverting the model of genomics data sharing with the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space. Cell Genomics. https://doi.org/10.1016/j.xgen.2021.100085