Machine Learning-Based Tool Allows Scientists to Quickly Model Families of Proteins

Microbes drive key processes of life on Earth. They affect global elemental cycles-;the movement of carbon, nitrogen, and other elements. They also promote plant growth and affect the development of diseases. These roles are essential in every ecosystem.

Research constantly expands the database of microbial DNA sequences but does not provide all the biological information about proteins. To engineer microbes for sustainable bioenergy and other bioproducts, scientists need a fuller understanding of the function of proteins and other molecules. Scientists infer the function of a protein by comparing it with reference databases of already characterized proteins.

However, these comparisons are difficult and not scalable to massive databases. To address this challenge, scientists have applied machine learning to models that predict protein function. The result is the program Snekmer, which allows scientists to quickly model families of proteins.

The Impact

Studying biological protein molecules in microbes will help scientists pursue new applications for engineered microbes. Snekmer is easy to deploy in high-performance computing environments. In addition, it is incorporated into the DOE KBase framework as a new application that will allow users to annotate their genome and metagenome sequences.

This will help scientists to better model the effects of engineering microbes. That includes these microbes' effect on the climate and their benefits for crop health and bioproduction. Snekmer will also help scientists study the evolution of microbes and patterns in microbiomes.

Summary

The inability of current methods to predict function for 30-50% of bacterial protein sequences is a significant barrier to better understanding of complex systems such as soil microbiomes. Most protocols rely on pair-wise alignments, which are becoming computationally intractable and more challenging to interpret as databases expand.

For alignment-based models of protein families, the sensitivity and accuracy depend on initial training sets, which risk obsolescence as additional sequence diversity is discovered. Many bacterial proteins have either no functional assignment or are only assigned a general function based solely on taxonomic understanding.

To address this need, researchers at Pacific Northwest National Laboratory, Baylor University, and Oregon Health & Science University developed Snekmer, a software tool leveraging redundancy of amino acid residue properties to reduce sequence space and using short protein sequence (kmer) features for machine learning to generate protein family models.

Snekmer users can recode protein sequences into reduced alphabet kmer vectors and perform construction of supervised classification models trained on input protein families, or protein functional classification based on Snekmer models.

Source:
Journal reference:

Christine, H. C., et al. (2023) Snekmer: a scalable pipeline for protein sequence fingerprinting based on amino acid recoding. Bioinformatics Advances. https://doi.org/10.1093/bioadv/vbad005.

Comments

The opinions expressed here are the views of the writer and do not necessarily reflect the views and opinions of AZoLifeSciences.
Post a new comment
Post

While we only use edited and approved content for Azthena answers, it may on occasions provide incorrect responses. Please confirm any data provided with the related suppliers or authors. We do not provide medical advice, if you search for medical information you must always consult a medical professional before acting on any information provided.

Your questions, but not your email details will be shared with OpenAI and retained for 30 days in accordance with their privacy principles.

Please do not ask questions that use sensitive or confidential information.

Read the full Terms & Conditions.

You might also like...
Study Unlocks Mechanisms Behind Nucleosome Arrays Using Signal Theory and Machine Learning