Scientists have developed an AI system that can produce artificial enzymes from scratch. Even though several of these enzymes’ artificially generated amino acid sequences differed noticeably from those of any known natural protein, they nonetheless performed as well in laboratory testing as those found in nature.
The experiment shows that natural language models, despite being designed to read and generate English text, can pick up at least some of the fundamental principles of biology. ProGen, the artificial intelligence (AI) tool created by Salesforce Research, assembles amino acid sequences into artificial proteins via next-token prediction.
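In effect, the model works like autocomplete for proteins: it repeatedly predicts a plausible next amino acid given the residues generated so far. The sketch below is purely illustrative and is not ProGen's actual architecture; the uniform next_token_probs stand-in is an assumption that takes the place of a trained model.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids

def next_token_probs(prefix):
    """Placeholder for a trained language model: returns a probability
    distribution over the 20 amino acids given the sequence so far."""
    return [1.0 / len(AMINO_ACIDS)] * len(AMINO_ACIDS)  # uniform stand-in

def generate_protein(length=30):
    sequence = []
    for _ in range(length):
        probs = next_token_probs("".join(sequence))
        # sample the next residue, just as a text model samples the next word
        residue = random.choices(AMINO_ACIDS, weights=probs, k=1)[0]
        sequence.append(residue)
    return "".join(sequence)

print(generate_protein())
```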
The new technology, according to scientists, could surpass directed evolution, a protein design technique that won a Nobel Prize, and it will revitalize the 50-year-old field of protein engineering by accelerating the development of new proteins that can be used for practically anything, from therapeutics to degrading plastic.
“The artificial designs perform much better than designs that were inspired by the evolutionary process.”
James Fraser, PhD, Study Author and Professor, Bioengineering and Therapeutic Sciences, School of Pharmacy, University of California San Francisco
The study was published on January 26th, 2023, in Nature Biotechnology. An earlier version of the manuscript had been available on the preprint server bioRxiv since July 2021, where it gathered several dozen citations before appearing in a peer-reviewed journal.
Fraser added, “The language model is learning aspects of evolution, but it is different than the normal evolutionary process. We now have the ability to tune the generation of these properties for specific effects. For example, an enzyme that is incredibly thermostable or likes acidic environments or won’t interact with other proteins.”
To develop the model, the researchers simply fed it the amino acid sequences of 280 million diverse proteins of all kinds and let it process the data for a few weeks.
They then fine-tuned the model by feeding it 56,000 sequences from five lysozyme families, along with some contextual information about these particular proteins.
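One common way to condition generation on a protein family is to prefix each training sequence with a control tag naming that family. The sketch below illustrates the idea; the tag names and placeholder sequences are invented for illustration and are not the paper's actual preprocessing.

```python
# Hedged sketch of preparing family-conditioned fine-tuning data.
# Tag names and sequences here are invented placeholders.
def tag_sequence(family_tag, sequence):
    # Prepend the family tag as extra context tokens before the residues,
    # so the model can later be prompted to generate from that family.
    return f"<{family_tag}>{sequence}"

raw_examples = [
    ("lysozyme_family_1", "MKVLATT..."),   # placeholder, not a real sequence
    ("lysozyme_family_2", "KVFGRCE..."),
]
fine_tuning_corpus = [tag_sequence(fam, seq) for fam, seq in raw_examples]
print(fine_tuning_corpus[0])
```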
The model produced a million sequences in a short period of time, out of which the study team chose 100 to test based on how closely they mirrored the sequences of natural proteins and how realistic the underlying amino acid “grammar” and “semantics” of the AI proteins were.
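A rough sketch of that screening step is shown below, under the assumption that candidates are ranked by the model's own per-residue log-likelihood as a proxy for how “grammatical” a sequence is. The scorer here is a random stand-in, and the paper's actual filtering criteria may differ.

```python
import math
import random

def score_sequence(seq, log_prob_of_residue):
    # Average per-residue log-probability; higher means more "natural-looking"
    return sum(log_prob_of_residue(seq, i) for i in range(len(seq))) / len(seq)

def select_top(candidates, log_prob_of_residue, top_k=100):
    # Keep the candidates the model itself judges most plausible
    ranked = sorted(candidates,
                    key=lambda s: score_sequence(s, log_prob_of_residue),
                    reverse=True)
    return ranked[:top_k]

# Demo with a dummy scorer standing in for the trained model
dummy_scorer = lambda seq, i: math.log(random.uniform(0.01, 1.0))
pool = ["MKV" * 40 for _ in range(1000)]   # placeholder candidate sequences
shortlist = select_top(pool, dummy_scorer, top_k=100)
print(len(shortlist))
```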
From this initial batch of 100 proteins, which Tierra Biosciences evaluated in vitro, the team took five artificial proteins to test in cells, comparing their function to that of hen egg white lysozyme (HEWL), an enzyme present in the whites of chicken eggs.
Human tears, saliva, and milk all contain similar lysozymes that operate as antimicrobial defenses against bacteria and fungi.
Two of the artificial enzymes were able to degrade bacterial cell walls with activity comparable to HEWL, even though their sequences were only about 18% similar to each other. The two sequences were themselves only about 90% and 70% identical to any known protein.
In a subsequent round of screening, the scientists discovered that the AI-generated enzymes were still functional even when as little as 31.4% of their sequence matched any known natural protein. That is striking, given that a single mutation in a natural protein can be enough to stop it from functioning.
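For readers unfamiliar with the metric, “percent sequence identity” is simply the fraction of aligned positions at which two sequences share the same amino acid. The ungapped toy version below is only meant to make the percentages above concrete; real comparisons use alignment tools such as BLAST.

```python
def percent_identity(seq_a, seq_b):
    # Compare position by position (no gaps) and report the share of matches
    length = min(len(seq_a), len(seq_b))
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / length

print(percent_identity("MKVLATTLG", "MKVIATSLG"))  # ~77.8% identical
```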
By analyzing the raw sequence data alone, the AI was even able to determine how the enzymes should be shaped. The artificial proteins’ atomic structures, as determined by X-ray crystallography, looked just as they should, despite the fact that their sequences were novel.
Salesforce Research created ProGen in 2020, building on a type of natural language processing that its researchers had originally developed to generate English text.
They already knew from their earlier research that the AI system was capable of teaching itself the fundamental principles of good composition, including syntax and word meaning.
“When you train sequence-based models with lots of data, they are really powerful in learning structure and rules. They learn what words can co-occur, and also compositionality.”
Nikhil Naik, PhD, Study Senior Author and Director, AI Research, Salesforce Research
The design options for proteins were almost endless. As far as proteins go, lysozymes are small, containing up to 300 amino acids. However, given that there are 20 different amino acids, there are a staggering 20³⁰⁰ potential combinations.
That is more than the sum of all the people who have ever lived, the number of sand grains on Earth, and the number of atoms in the cosmos.
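The arithmetic behind that comparison is easy to check: a 300-residue protein built from 20 amino acids admits 20^300 possible sequences, a number with 391 digits, whereas the observable universe is commonly estimated to contain on the order of 10^80 atoms.

```python
# Quick check of the combinatorics: a 300-residue protein drawn from
# 20 amino acids has 20**300 possible sequences.
n = 20 ** 300
print(len(str(n)))  # 391 digits, far beyond ~10**80 atoms in the universe
```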
Amazingly, given this near-infinite space of possibilities, the model can produce functional enzymes with ease.
“The capability to generate functional proteins from scratch out-of-the-box demonstrates we are entering into a new era of protein design. This is a versatile new tool available to protein engineers, and we are looking forward to seeing the therapeutic applications.”
Ali Madani, PhD, Study First Author and Founder, Profluent Bio
Madani formerly held the position of research scientist at Salesforce Research.
Journal reference:
Madani, A., et al. (2023). Large language models generate functional protein sequences across diverse families. Nature Biotechnology. doi.org/10.1038/s41587-022-01618-2