To design proteins with beneficial functions, researchers typically start with a natural protein possessing a desirable function, such as emitting fluorescent light. They then subject it to numerous rounds of random mutations, ultimately producing an optimized version of the protein.
Green fluorescent protein (GFP) is one of many significant proteins that have been produced in optimized form through this process. For other proteins, however, producing an optimized version has proven challenging. MIT researchers have now developed a computational approach that, based on a comparatively small amount of data, makes it easier to predict mutations that will lead to better proteins.
Using this model, the scientists generated proteins with mutations predicted to yield improved versions of GFP, as well as a protein from adeno-associated virus (AAV), which is used to deliver DNA for gene therapy. They anticipate using the model to create additional tools for medical and neuroscience research.
“Protein design is a hard problem because the mapping from DNA sequence to protein structure and function is really complex. There might be a great protein 10 changes away in the sequence, but each intermediate change might correspond to a totally nonfunctional protein. It’s like trying to find your way to the river basin in a mountain range, when there are craggy peaks along the way that block your view. The current work tries to make the riverbed easier to find.”
Ila Fiete, Professor and Study Senior Author, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology
Fiete is a member of MIT’s McGovern Institute for Brain Research, and Director of the K. Lisa Yang Integrative Computational Neuroscience Center.
Regina Barzilay, the School of Engineering Distinguished Professor for AI and Health at MIT, and Tommi Jaakkola, the Thomas Siebel Professor of Electrical Engineering and Computer Science at MIT, are also senior authors of an open-access paper on the research, which will be presented at the International Conference on Learning Representations in May.
MIT graduate students Andrew Kirjner and Jason Yim are the study's lead authors. Other authors include Shahar Bracha, an MIT postdoc, and Raman Samusevich, a graduate student at Czech Technical University.
Optimizing Proteins
Many naturally occurring proteins could be valuable for research or medical applications, but they often require a little more engineering to optimize their functions. This study initially aimed to develop proteins that could serve as voltage indicators in living cells.
These proteins, made by certain bacteria and algae, emit fluorescent light when they detect an electric potential. If such proteins were engineered for use in mammalian cells, researchers could measure neuron activity without the need for electrodes.
These proteins have been engineered over decades of research to produce a stronger fluorescent signal on a faster timescale, but their effectiveness has not increased to the point where they can be used widely.
Bracha, a researcher at the McGovern Institute who works with Edward Boyden, contacted Fiete’s lab to discuss collaborating on a computational strategy that could expedite the protein optimization process.
“This work exemplifies the human serendipity that characterizes so much scientific discovery. It grew out of the Yang Tan Collective retreat, a scientific meeting of researchers from multiple centers at MIT with distinct missions unified by the shared support of K. Lisa Yang. We learned that some of our interests and tools in modeling how brains learn and optimize could be applied in the different domain of protein design, as being practiced in the Boyden lab.”
Ila Fiete, Professor and Study Senior Author, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology
By substituting different amino acids at each position in the sequence, researchers can create an almost infinite number of sequences for any given protein they wish to optimize. Since it is not feasible to test every potential variation experimentally, researchers have resorted to computational modeling in an attempt to forecast which will perform best.
To address those challenges, the researchers created and tested a computational model, trained on data from GFP, that could predict improved versions of the protein.
First, they used experimental data containing GFP sequences and their brightness - the feature they wished to optimize - to train a particular kind of model called a convolutional neural network (CNN).
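The article does not describe the model's input representation, but sequence-to-function CNNs typically consume one-hot encoded amino acid sequences. A minimal sketch of that encoding step (the function name is illustrative, not from the study):

```python
import numpy as np

# The 20 standard amino acids, each mapped to one column of the encoding.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence):
    """Encode a protein sequence as a (length x 20) one-hot matrix,
    a common input format for sequence-based CNNs."""
    encoding = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for pos, aa in enumerate(sequence):
        encoding[pos, AA_INDEX[aa]] = 1.0
    return encoding

x = one_hot_encode("MSKGEEL")  # the first residues of GFP
print(x.shape)  # (7, 20)
```

Each training example pairs such a matrix with its measured brightness, and the CNN learns to regress one from the other.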
With only a small amount of experimental data (roughly 1,000 GFP variants), the model produced a “fitness landscape,” a three-dimensional map showing the fitness of a particular protein and how much it deviates from the original sequence.
Peaks in these landscapes indicate fitter proteins, and valleys indicate less fit ones. Predicting the path a protein must take to reach a fitness peak can be challenging, because a protein frequently needs to pass through a fitness-reducing mutation before it can climb a nearby, higher peak. To get around this issue, the researchers “smoothed” the fitness landscape using an existing computational technique.
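The paper's smoothing operates over high-dimensional sequence space, but the idea can be illustrated on a toy one-dimensional landscape: local averaging fills in the small dips that would otherwise trap a greedy search. This is a sketch of the concept, not the study's actual algorithm:

```python
import numpy as np

def smooth(landscape):
    """One pass of local averaging over a toy 1-D fitness landscape.
    (A stand-in for the paper's smoothing, which acts on sequence space.)"""
    return np.convolve(landscape, [0.25, 0.5, 0.25], mode="same")

# A rugged landscape: the dip at index 2 would block a greedy climber.
rugged = [1.0, 2.0, 1.5, 3.0, 5.0]
smoothed = smooth(rugged)
print(smoothed)  # monotonically increasing: the dip is averaged away
```

After smoothing, every step toward the peak is uphill, so a search that only makes small improvements can reach it.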
After these minor landscape irregularities were smoothed out and the CNN was retrained, the model proved more adept at reaching higher fitness peaks. The best of these proteins were estimated to be roughly 2.5 times fitter than the original protein sequence.
The model was able to predict optimized GFP sequences that differed by as many as seven amino acids from the protein sequence they started with.
“Once we have this landscape that represents what the model thinks is nearby, we smooth it out and then we retrain the model on the smoother version of the landscape. Now there is a smooth path from your starting point to the top, which the model is now able to reach by iteratively making small improvements. The same is often impossible for unsmoothed landscapes.”
Andrew Kirjner, Graduate Student and Study Lead Author, Massachusetts Institute of Technology
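The iterative small-improvement search Kirjner describes is essentially greedy hill climbing. A toy one-dimensional illustration of why it stalls on a rugged landscape but reaches the top of a smoothed one (illustrative code, not the study's implementation):

```python
def greedy_ascent(landscape, start=0):
    """Step to a fitter neighboring point until none exists,
    i.e. until a local peak is reached."""
    pos = start
    while True:
        neighbors = [p for p in (pos - 1, pos + 1) if 0 <= p < len(landscape)]
        best = max(neighbors, key=lambda p: landscape[p])
        if landscape[best] <= landscape[pos]:
            return pos  # local peak: no neighbor improves fitness
        pos = best

rugged = [1.0, 2.0, 1.5, 3.0, 5.0]    # dip at index 2 blocks the climb
smoothed = [1.0, 1.6, 2.0, 3.1, 4.0]  # same landscape with the dip averaged away

print(greedy_ascent(rugged))    # 1 -- trapped just before the dip
print(greedy_ascent(smoothed))  # 4 -- reaches the global peak
```

On the rugged landscape the climber stops at index 1, because the only way forward passes through a less fit point; on the smoothed landscape the same greedy rule reaches the top.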