Tracing the origin of synthetic genetic code has never been easy, but it can be done with bioinformatics or, increasingly, with deep learning computational techniques.
Todd Treangen. Image Credit: Tommy LaVergne/Rice University.
While the latter gets most of the attention, a new study by Todd Treangen, a computer scientist at Rice University's Brown School of Engineering, asks whether pan-genome and sequence alignment-based techniques can outperform deep learning methods in this field.
“This is, in a sense, against the grain given that deep learning approaches have recently outperformed traditional approaches, such as BLAST. My goal with this study is to start a conversation about how to combine the expertise of both domains to achieve further improvements for this important computational challenge.”
Todd Treangen, Computer Scientist, Brown School of Engineering, Rice University
Treangen, who specializes in designing computational solutions for microbial forensics and biosecurity applications, and his research team at Rice University have introduced a bioinformatics method, called PlasmidHawk, that analyzes DNA sequences to help identify the source of engineered plasmids.
“We show that a sequence alignment-based approach can outperform a convolutional neural network (CNN) deep learning method for the specific task of lab-of-origin prediction.”
Todd Treangen, Computer Scientist, Brown School of Engineering, Rice University
The research team, led by Treangen and Qi Wang, the study's lead author and a graduate student at Rice University, described its findings in an open-access paper in Nature Communications. The program could be useful for tracking potentially harmful engineered sequences and for protecting intellectual property.
“The goal is either to help protect intellectual property rights of the contributors of the sequences or help trace the origin of a synthetic sequence if something bad does happen,” Treangen added.
Treangen noted a recent high-profile paper describing a recurrent neural network (RNN) deep learning technique for tracing the originating laboratory of a sequence. That technique achieved an accuracy of 70% in predicting the single laboratory of origin.
“Despite this important advance over the previous deep learning approach, PlasmidHawk offers improved performance over both methods,” Treangen added.
The program developed at Rice University directly aligns unknown strings of code from genome data sets to pan-genomic regions that are common or unique to synthetic biology research laboratories.
“To predict the lab-of-origin, PlasmidHawk scores each lab based on matching regions between an unclassified sequence and the plasmid pan-genome, and then assigns the unknown sequence to a lab with the minimum score,” explained Wang.
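The scoring idea Wang describes can be illustrated with a toy sketch. This is not PlasmidHawk's code or its actual scoring formula. It assumes the alignment step has already produced a list of matched pan-genome region IDs, that the pan-genome is a plain dictionary mapping each region to the labs that deposited it, and that rarer regions should count for more. The names `score_labs`, `predict_lab`, and the `1 / number_of_labs` weighting are all hypothetical, chosen only so that the best match ends up with the minimum score.

```python
from collections import defaultdict


def score_labs(matched_regions, pangenome):
    """Return a hypothetical score per lab; a lower score means a better match."""
    scores = defaultdict(float)
    for region in matched_regions:
        labs = pangenome.get(region, set())
        for lab in labs:
            # Illustrative weighting only: a region shared by fewer labs is
            # more informative, so it lowers that lab's score by more.
            scores[lab] -= 1.0 / len(labs)
    return dict(scores)


def predict_lab(matched_regions, pangenome):
    """Assign the unknown sequence to the lab with the minimum score."""
    scores = score_labs(matched_regions, pangenome)
    if not scores:
        return None  # no matched region is annotated with any lab
    # Break ties alphabetically so the result is deterministic.
    return min(scores, key=lambda lab: (scores[lab], lab))


# Toy pan-genome: region ID -> labs whose deposited plasmids contain it.
toy_pangenome = {
    "r1": {"lab_A", "lab_B", "lab_C"},  # common backbone region
    "r2": {"lab_B"},                    # unique to lab_B
    "r3": {"lab_A", "lab_B"},
    "r4": {"lab_C"},
}

# Regions that the unknown sequence is assumed to have aligned to.
matched = ["r1", "r2", "r3"]

print(predict_lab(matched, toy_pangenome))
```

In this toy case the unknown sequence matches a region unique to `lab_B`, so `lab_B` receives the lowest score and is chosen. The per-lab scores also show which regions drove the decision.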
In the recent analysis, the team used the same dataset as one of the deep learning experiments and reported correctly predicting the depositing labs of “unknown sequences” 76% of the time. The team noted that 85% of the time, the correct laboratory was among the top 10 candidates.
The team added that, unlike the deep learning methods, the PlasmidHawk approach requires less data pre-processing and does not need retraining when new sequences are added to an existing project. PlasmidHawk also differs from the earlier deep learning methods by offering a detailed explanation of its lab-of-origin predictions.
“The goal is to fill your computational toolbox with as many tools as possible. Ultimately, I believe the best results will combine machine learning, more traditional computational techniques and a deep understanding of the specific biological problem you are tackling.”
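The two figures above correspond to what is usually called top-1 and top-10 accuracy. As a minimal sketch of how such numbers are computed, the function below counts how often the true lab appears among the first `k` ranked candidates. The function name `top_k_accuracy` and the small data set are made up for illustration and are not taken from the study.

```python
def top_k_accuracy(ranked_predictions, true_labs, k):
    """Fraction of sequences whose true lab is among the top k candidates."""
    if len(ranked_predictions) != len(true_labs):
        raise ValueError("need one ranked candidate list per true lab")
    if not true_labs:
        raise ValueError("need at least one sequence to evaluate")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labs)
        if truth in ranked[:k]
    )
    return hits / len(true_labs)


# Hypothetical ranked candidates (best first) for four unknown sequences.
toy_ranked = [
    ["lab_B", "lab_A", "lab_C"],
    ["lab_A", "lab_C", "lab_B"],
    ["lab_C", "lab_B", "lab_A"],
    ["lab_A", "lab_B", "lab_C"],
]
# Hypothetical labs that actually deposited each sequence.
toy_truth = ["lab_B", "lab_C", "lab_A", "lab_B"]

print(top_k_accuracy(toy_ranked, toy_truth, k=1))
print(top_k_accuracy(toy_ranked, toy_truth, k=2))
```

Top-k accuracy can only stay the same or rise as `k` grows, which is consistent with the top-10 figure being higher than the single-lab figure.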
Ryan Leo Elworth, Study Co-Author and Postdoctoral Researcher, Rice University
Journal reference:
Wang, Q., et al. (2021) PlasmidHawk improves lab of origin prediction of engineered plasmids using sequence alignment. Nature Communications. doi.org/10.1038/s41467-021-21180-w.