Rapid, inexpensive technique helps search large DNA databases

Rice University, Jun 30 2021

Computer scientists from Rice University are deploying RAMBO—which stands for “repeated and merged bloom filter”—to help genomic researchers who at times wait for days or even weeks for search results from large DNA databases.

DNA. Image Credit: Peshkova/Shutterstock.com

DNA sequencing is now so widely used that genomic datasets are doubling in size every two years. However, the tools used to search the data have not kept pace.

Scientists studying the evolution of organisms such as the virus that causes COVID-19, or comparing DNA across genomes, often wait weeks for software to index enormous “metagenomic” databases, which grow every month and are now measured in petabytes.

RAMBO is a new method that cuts indexing times for large databases from weeks to hours and search times from hours to seconds. Computer scientists from Rice University recently presented it at SIGMOD 2021, the Association for Computing Machinery's data science conference.

“Querying millions of DNA sequences against a large database with traditional approaches can take several hours on a large compute cluster and can take several weeks on a single server. Reducing database indexing times, in addition to query times, is crucially important as the size of genomic databases are continuing to grow at an incredible pace.”

Todd Treangen, Co-Creator of RAMBO and Computer Scientist, Rice University

Treangen’s laboratory specializes in metagenomics.

To find a solution, Treangen collaborated with Anshumali Shrivastava, a computer scientist from Rice University who specializes in developing algorithms that make machine learning and big data faster and more scalable. Graduate students Gaurav Gupta and Minghao Yan, the co-lead authors of the peer-reviewed conference article on RAMBO, were also part of the study.

RAMBO employs a data structure that has a considerably faster query time than the latest genome indexing methods and also offers other benefits, like a zero false-negative rate, ease of parallelization, and a low false-positive rate.
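The zero false-negative property follows from how a Bloom filter works: adding an item sets a handful of bits, and a lookup reports a hit only if all of those bits are set, so a stored item can never be missed (an unstored item may occasionally collide, which is the false-positive case). The Python sketch below is a minimal illustration, not RAMBO's code; the `BloomFilter` class, the `kmers` helper, the bit-array size, and the SHA-256 hashing scheme are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit array plus a few salted hash functions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # A hit requires every position to be set, so anything that was
        # added is always found (no false negatives).
        return all(self.bits[pos] for pos in self._positions(item))


def kmers(sequence, k=4):
    """Return the set of length-k substrings of a DNA string."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


genome = "ACGTTGCAACGTAGCT"  # made-up sequence for illustration
bf = BloomFilter()
for kmer in kmers(genome):
    bf.add(kmer)

# Every k-mer that was inserted must be reported as present.
stored_hits = all(kmer in bf for kmer in kmers(genome))
print(stored_hits)
```

A larger bit array or a better-tuned number of hash functions lowers the false-positive rate at the cost of memory, which is the trade-off the article refers to.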

Gaurav Gupta, one of the co-lead authors of the study, stated, “The search time of RAMBO is up to 35 times faster than existing methods.” Gupta is a doctoral student in electrical and computer engineering.

In experiments utilizing a 170-terabyte dataset of microbial genomes, RAMBO decreased the indexing times from “six weeks on a sophisticated, dedicated cluster to nine hours on a shared commodity cluster,” added Gupta.

“On this huge archive, RAMBO can search for a gene sequence in a couple of milliseconds, even sub-milliseconds using a standard server cluster of 100 machines.”

Minghao Yan, Study Co-Lead Author and Master’s Student in Computer Science, Rice University

RAMBO improves on the performance of Bloom filters, a half-century-old search technique that many previous studies have applied to genomic sequence search. It improves on earlier Bloom filter methods for genomic search by using a probabilistic data structure known as a count-min sketch.

This probabilistic data structure “leads to a better query time and memory trade-off” than earlier methods and “beats the current baselines by achieving a very robust, low-memory and ultrafast indexing data structure,” wrote the study authors.
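One way to read the "repeated and merged" idea is sketched below in Python, under stated assumptions. Datasets are hashed into a small number of buckets, each bucket keeps a single Bloom filter merged over all of its datasets, and the whole partitioning is repeated with independent hashes, giving a grid of repetitions by buckets that loosely resembles the rows and columns of a count-min sketch. A query intersects the candidate datasets from each repetition. The class names, grid sizes, and hashing scheme here are hypothetical and simplified; this is not the authors' implementation.

```python
import hashlib


def _hash(value, seed, modulus):
    """Salted SHA-256 hash reduced to the range [0, modulus)."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus


class BloomFilter:
    """Compact Bloom filter that stores its bits in one Python int."""

    def __init__(self, num_bits=2048, num_hashes=3):
        self.num_bits, self.num_hashes, self.bits = num_bits, num_hashes, 0

    def add(self, item):
        for seed in range(self.num_hashes):
            self.bits |= 1 << _hash(item, seed, self.num_bits)

    def __contains__(self, item):
        return all((self.bits >> _hash(item, seed, self.num_bits)) & 1
                   for seed in range(self.num_hashes))


class RepeatedMergedIndex:
    """Hypothetical grid of merged Bloom filters: repetitions x buckets."""

    def __init__(self, repetitions=3, buckets=4):
        self.repetitions, self.buckets = repetitions, buckets
        self.filters = [[BloomFilter() for _ in range(buckets)]
                        for _ in range(repetitions)]
        self.members = [[set() for _ in range(buckets)]
                        for _ in range(repetitions)]

    def add_dataset(self, name, terms):
        for rep in range(self.repetitions):
            # Each repetition assigns the dataset to a bucket independently.
            b = _hash(name, f"partition{rep}", self.buckets)
            self.members[rep][b].add(name)
            for term in terms:
                self.filters[rep][b].add(term)  # merged filter for the bucket

    def query(self, term):
        candidates = None
        for rep in range(self.repetitions):
            hits = set()
            for b in range(self.buckets):
                if term in self.filters[rep][b]:
                    hits |= self.members[rep][b]
            # A dataset must be a candidate in every repetition to survive.
            candidates = hits if candidates is None else candidates & hits
        return candidates or set()


index = RepeatedMergedIndex()
index.add_dataset("genome_A", {"ACGT", "CGTT"})
index.add_dataset("genome_B", {"TTGA", "TGAC"})
index.add_dataset("genome_C", {"ACGT", "GGGG"})

result = index.query("ACGT")
# Datasets that contain the term are always returned; extra names can
# appear only as false positives.
print({"genome_A", "genome_C"} <= result)
```

In this sketch a query touches repetitions × buckets filters rather than one filter per dataset, and datasets that merely share a bucket with a true match in one repetition tend to be filtered out by the intersection across repetitions.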

According to Gupta and Yan, RAMBO has the potential to democratize genomic search by making it feasible for almost any laboratory to search large genomic archives quickly and cheaply with off-the-shelf computers.

“RAMBO could decrease the wait time for tons of investigations in bioinformatics, such as searching for the presence of SARS-CoV-2 in wastewater metagenomes across the globe. RAMBO could become instrumental in the study of cancer genomics and bacterial genome evolution, for example.”

Minghao Yan, Study Co-Lead Author and Master’s Student in Computer Science, Rice University

Source:

Rice University

Journal reference:

Gupta, G., et al. (2021) Fast Processing and Querying of 170TB of Genomics Data via a Repeated And Merged BloOm Filter (RAMBO). SIGMOD/PODS '21: Proceedings of the 2021 International Conference on Management of Data. doi.org/10.1145/3448016.3457333.
