Metagenomic studies focus on assessing microbial diversity, genetic and evolutionary relationships, community constituents, and microbial interaction with the environment.1
This article focuses on bioinformatic tools that help researchers process and derive insights from large and complex metagenomic datasets generated using sequencing technologies, such as shotgun sequencing, high-throughput sequencing, and next-generation sequencing (NGS).
What is metagenomics?
Metagenomics uses sequencing techniques to identify microbes, analyze their genetic composition, and characterize disease-causing agents.1 The key steps of metagenomics involve sample collection, DNA/RNA extraction, and library preparation for sequencing.2
Library preparation for metagenomics commonly uses random hexamer primers that together cover every possible combination of nucleotides. Because all possible hexamers are present, the primers can bind to any DNA or RNA molecule in a mixture of genomes.
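To make the "all possible hexamers" point concrete, the short Python sketch below enumerates every six-base combination of A, C, G, and T. The function name `all_kmers` is illustrative only and does not come from any library or kit.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

def all_kmers(k: int = 6) -> list[str]:
    """Return every possible DNA k-mer over the alphabet A, C, G, T."""
    return ["".join(bases) for bases in product(NUCLEOTIDES, repeat=k)]

hexamers = all_kmers(6)
print(len(hexamers))   # 4**6 = 4096 distinct hexamers
print(hexamers[:3])    # ['AAAAAA', 'AAAAAC', 'AAAAAG']
```

With four bases and six positions, a complete random-hexamer pool contains 4,096 distinct sequences, which is why at least one primer is expected to match almost any template.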
After the amplification step of library preparation, the polymerase chain reaction (PCR) products are loaded onto an NGS platform to obtain millions of short reads (sequences shorter than 600 bases) or long reads (approximately 1 kb or more).
Short-read sequencing is performed on NGS platforms manufactured by Illumina and Thermo Fisher Scientific, whereas long-read sequencing is performed on platforms designed by Oxford Nanopore Technologies and PacBio. The sequence data are analyzed and interpreted using bioinformatic tools.3
A key limitation of metagenomic techniques is the need for a large quantity of genomic starting material. Furthermore, there is a high risk of contamination during sample collection and analysis, which increases the possibility of a biased interpretation of the results. However, genomic contamination or cross-contamination can be managed through controls and quality checkpoints.4
Bioinformatic tools in metagenomics
Bioinformatic tools play an important role in metagenomics by aiding sequence read quality control (QC), quality trimming, assemblies, and gene predictions. Some key bioinformatic tools used are discussed below.4
Sequence reads quality assessment and trimming
Sequence QC is essential for identifying and removing technical errors in the metagenomic data to avoid false-positive and false-negative results.
The QC step entails pre-processing the sequencing data to eliminate low-quality reads or nucleotides, undesirable adapter sequences, and excessively short reads. This strategy significantly reduces computational time and cost in the following steps.
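The pre-processing described above can be sketched in a few lines of Python. This is a minimal illustration, not how Trimmomatic or Cutadapt work internally: it assumes four-line FASTQ records with Phred+33 quality encoding, trims only from the 3' end, and uses hypothetical function names and thresholds (`min_q`, `min_len`) chosen for the example.

```python
from io import StringIO
from typing import Iterator, TextIO

def read_fastq(handle: TextIO) -> Iterator[tuple[str, str, str]]:
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip()
        yield header, seq, qual

def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> tuple[str, str]:
    """Remove bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

def filter_reads(handle: TextIO, min_q: int = 20, min_len: int = 30):
    """Trim each read, then drop reads that end up shorter than min_len."""
    for header, seq, qual in read_fastq(handle):
        seq, qual = trim_3prime(seq, qual, min_q)
        if len(seq) >= min_len:
            yield header, seq, qual

# Two toy reads: 'I' encodes Phred 40, '#' encodes Phred 2.
example = StringIO(
    "@read1\nACGTACGTAC\n+\nIIIIIIII##\n"
    "@read2\nACGT\n+\n####\n"
)
kept = list(filter_reads(example, min_q=20, min_len=5))
print(kept)  # [('@read1', 'ACGTACGT', 'IIIIIIII')]
```

The first read loses its two low-quality terminal bases, and the second is discarded because nothing usable remains after trimming.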
Bioinformatic programs, such as MultiQC, FastQC, LongQC, and MinIONQC, are used for QC of both long-read and short-read sequencing data. These tools check data quality, and some generate a report summarizing the QC metrics.
In metagenomic studies, the majority of RNA sequences require trimming of unwanted elements identified in QC programs. Trimmomatic and Cutadapt are bioinformatic tools used to remove low-quality reads and adapters.
FastQ Screen and Bowtie 2 are used to filter out non-target reads. For instance, research targeting viral reads has used these tools to remove any reads belonging to the host genome and contaminants.
Assembly
For taxonomic assignments, it is essential to reconstruct the genomes present in a metagenome. Reads are therefore assembled into contigs, which are sets of overlapping sequences that together form a longer, continuous sequence.
In metagenomics, de novo assembly is commonly used; it typically relies on overlap-layout-consensus (OLC) or de Bruijn graph approaches. Sequence assembly is a complex process that is prone to errors.
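The idea of building contigs from overlapping reads can be illustrated with a toy greedy merger. This is a deliberate simplification and not the algorithm used by any production assembler: it assumes error-free reads from a single strand, ignores reverse complements and repeats, and the function names are hypothetical.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0

def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    olen = overlap_length(a, b, min_overlap)
                    if olen > best_len:
                        best_len, best_i, best_j = olen, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)

reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))  # ['ATGGCGTACCTGA']
```

Real metagenomic data contain sequencing errors, repeats, and closely related strains, which is why dedicated assemblers are needed and why assemblies still require QC.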
Several bioinformatic tools, such as MEGAHIT, metaSPAdes, and IDBA-UD, are used to perform de novo assembly in metagenomic studies. Specific tools used for long-read assembly include NECAT, metaFlye, and Canu. These tools are well adapted to data generated on Oxford Nanopore and PacBio platforms.
Various programs (e.g., OPERA-MS and hybridSPAdes) have been developed to perform hybrid assembly, which combines short and long reads. Bioinformatic tools, such as BUSCO, MetaQUAST, DeepMAsED, and REAPR, have been designed for QC of metagenome assemblies.
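One widely reported contiguity statistic in assembly QC is N50: the contig length at which contigs of that length or longer contain at least half of the assembled bases. The sketch below computes it from a list of contig lengths; it is a stand-alone illustration with a hypothetical function name, not code from any of the tools named above.

```python
def n50(contig_lengths: list[int]) -> int:
    """Return the N50 of an assembly given its contig lengths (0 if empty)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:  # reached at least half of all assembled bases
            return length
    return 0

print(n50([100, 200, 300, 400, 500]))  # 400
```

N50 describes contiguity only. It says nothing about correctness, so it is normally read alongside misassembly and completeness checks.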
Metagenome gene prediction
MetaGeneAnnotator is a metagenomic gene-finding algorithm used to predict genes on short sequences from uncharacterized metagenomic communities.5
This tool is based on the assumption that GC content correlates with di-codon frequencies. Apart from gene location, it provides information on translation initiation mechanisms, which is useful for studying evolutionary relationships.
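The two quantities named in that assumption are simple to compute. The Python sketch below shows how each can be derived from a sequence. It is not MetaGeneAnnotator's statistical model, only an illustration of the inputs such a model relies on, and the function names are hypothetical.

```python
from collections import Counter

def gc_content(seq: str) -> float:
    """Fraction of bases in `seq` that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def dicodon_counts(seq: str, frame: int = 0) -> Counter:
    """Count di-codons (pairs of adjacent codons) in one reading frame."""
    seq = seq.upper()
    counts: Counter = Counter()
    # Step one codon (3 bases) at a time; each window spans two codons (6 bases).
    for i in range(frame, len(seq) - 5, 3):
        counts[seq[i:i + 6]] += 1
    return counts

fragment = "ATGGCGGCGTAA"
print(round(gc_content(fragment), 3))  # 0.583
print(dict(dicodon_counts(fragment)))  # {'ATGGCG': 1, 'GCGGCG': 1, 'GCGTAA': 1}
```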
Orphelia is another tool with similar functions; compared with MetaGeneAnnotator, it has exhibited higher specificity but lower sensitivity in gene prediction. FragGeneScan is designed to predict genes directly from short reads without the need for assembly.
Metagenomics applications in research
Metagenomics has enabled microbial surveillance and the discovery of viruses that could not be found using conventional culturing techniques.6 Recently, it helped identify severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causal agent of the coronavirus disease 2019 (COVID-19) pandemic.
Because it is not restricted to predefined targets, metagenomics can identify pathogens in varied sample types, such as serum, cerebrospinal fluid, plasma, amniotic fluid, sputum, and stool.
In contrast to molecular assays that target a limited number of pathogens using specific primers or probes, metagenomics considers the entire genomic composition (DNA and RNA) to identify microbes.
It can detect bacteria, fungi, viruses, and parasites and help in the diagnosis of infectious diseases.7 Interestingly, metagenomics enabled the identification of viral infections in humans that occurred in the Bronze Age.
Besides clinical applications, researchers use metagenomic techniques to characterize the human gut microbiome, whose composition and abundance are associated with health.
In environmental studies, metagenomic techniques are used to analyze microbes in marine, soil, sewage, and dust samples.8 This field of science is also used in forensic investigation.
In plant science, metagenomics unlocks the relationships between soil microbes and the plant root system.8 This information could be useful to develop unique and eco-friendly biofertilizers and pesticides.
Conclusion
Bioinformatics is at the heart of metagenomics, transforming how researchers analyze and interpret complex microbial datasets. By enabling efficient processing of sequencing data, bioinformatic tools enhance every stage of metagenomic studies, from quality control and sequence assembly to gene prediction and downstream analysis.
These advancements have broadened the scope of metagenomics, making it a powerful tool for exploring microbial diversity, understanding host-microbe interactions, and addressing challenges across various scientific fields.
From clinical diagnostics to environmental science and agriculture, bioinformatics-driven metagenomics has opened new avenues for innovation and discovery. As bioinformatics tools continue to evolve, they will further refine metagenomic analyses, ensuring more accurate, scalable, and impactful insights.
This synergy between bioinformatics and metagenomics marks a significant leap forward in our ability to harness microbial knowledge for global health, environmental sustainability, and scientific progress.
References
- Zhang L, et al. Advances in Metagenomics and Its Application in Environmental Microorganisms. Front Microbiol. 2021;12:766364. doi: 10.3389/fmicb.2021.766364.
- Thomas T, et al. Metagenomics - a guide from sampling to data analysis. Microb Inform Exp. 2012;2(1):3. doi: 10.1186/2042-5783-2-3.
- Satam H, et al. Next-Generation Sequencing Technology: Current Trends and Advancements. Biology (Basel). 2023;12(7):997. doi: 10.3390/biology12070997.
- Ibañez-Lligoña M, et al. Bioinformatic Tools for NGS-Based Metagenomics to Improve the Clinical Diagnosis of Emerging, Re-Emerging and New Viruses. Viruses. 2023;15(2):587. doi: 10.3390/v15020587.
- Roumpeka DD, et al. A Review of Bioinformatics Tools for Bio-Prospecting from Metagenomic Sequence Data. Front Genet. 2017;8:23. doi: 10.3389/fgene.2017.00023.
- Ko KKK, et al. Metagenomics-enabled microbial surveillance. Nat Microbiol. 2022;7:486–496. doi: 10.1038/s41564-022-01089-w.
- Sun C, et al. Clinical application of metagenomic next-generation sequencing for the diagnosis of suspected infection in adults: A cross-sectional study. Medicine (Baltimore). 2024;103(16):e37845. doi: 10.1097/MD.0000000000037845.
- Nam NN, et al. Metagenomics: An Effective Approach for Exploring Microbial Diversity and Functions. Foods. 2023;12(11):2140. doi: 10.3390/foods12112140.