Designing novel proteins with specific and desired functions is one of the major aims of biotechnology and medicine.
However, this is a complex challenge because protein function depends not only on the amino acid sequence but also on the three-dimensional (3D) structure of the protein, and both need to be explored simultaneously.
In a recent study published in Nature Biotechnology, a team of scientists from the United States developed a sequence space diffusion model using deep-learning approaches to explore the structure and sequences of proteins, potentially improving the design of multifunctional proteins.
Study: Multistate and functional protein design using RoseTTAFold sequence space diffusion.
Background
Proteins are essential molecules for all life forms. They carry out a wide range of functions, from providing structural building blocks and catalyzing biological reactions to facilitating communication between cells and organs.
The unique 3D structure of proteins dictates their function, and the structure, in turn, is dependent on the amino acid sequence constituting the protein.
Substantial biotechnological research is focused on designing proteins with specific functions. However, exploring the sequence and structural aspects of proteins simultaneously has been challenging, although various computational methods have been designed to address this challenge.
Deep learning-based denoising diffusion probabilistic models (DDPMs) are generative models that are trained by progressively adding noise, or random changes, to data and learning to remove that noise step by step. Applied to proteins, they can generate new protein backbones starting from noise, which adds flexibility to the design process.
While DDPMs have been applied to a wide range of fields, their use in protein design has been limited.
About the study
In the present study, the researchers hypothesized that an approach using DDPMs focused on sequence-space diffusion would enable the design of proteins based on both sequence and structural features and improve the ability to create proteins with multiple possible folds and functions.
They used a deep learning-based software tool called RoseTTAFold that can predict protein structure using limited information and adapted it for sequence-space diffusion.
The study involved fine-tuning and implementing DDPMs for protein sequence generation and design. The amino acid sequences were represented numerically as one-hot tensors, in which the correct amino acid at each position was assigned a value of 1 and all others were set to -1. This continuous representation allowed the model to gradually add noise, or random changes, to the data, which were then denoised step by step.
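The encoding described above can be illustrated with a minimal Python sketch. It is not the study's code: the function names, the example sequence, and the `signal_level` parameter are hypothetical, and the noising step assumes the standard DDPM form (scaling the clean signal and adding Gaussian noise), which the article does not spell out.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def encode(sequence):
    """Return one row per residue: +1.0 at the matching amino acid, -1.0 elsewhere."""
    return [[1.0 if aa == residue else -1.0 for aa in AMINO_ACIDS]
            for residue in sequence]

def add_noise(encoded, signal_level, rng):
    """Blend the clean encoding with Gaussian noise (assumed DDPM-style step).

    signal_level = 1.0 keeps the clean encoding; values near 0.0 give almost pure noise.
    """
    keep = math.sqrt(signal_level)
    noise_scale = math.sqrt(1.0 - signal_level)
    return [[keep * value + noise_scale * rng.gauss(0.0, 1.0) for value in row]
            for row in encoded]

def decode(encoded):
    """Read back a sequence by taking the highest-scoring amino acid per position."""
    return "".join(AMINO_ACIDS[max(range(len(row)), key=row.__getitem__)]
                   for row in encoded)

rng = random.Random(0)                     # fixed seed so the sketch is repeatable
clean = encode("MKTAYIAK")                 # hypothetical 8-residue example
lightly_noised = add_noise(clean, signal_level=0.99, rng=rng)
heavily_noised = add_noise(clean, signal_level=0.01, rng=rng)

print(decode(clean))           # the original sequence
print(decode(lightly_noised))  # still readable when little noise is added
print(decode(heavily_noised))  # mostly scrambled when noise dominates
```

Because the encoding is a set of continuous numbers rather than discrete letters, noise can be added in arbitrarily small amounts, which is what makes step-by-step denoising possible.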
Noise was introduced into the sequence gradually, following a defined schedule. The DDPM was trained to recover the correct sequence and structure of the protein by minimizing two kinds of error: a categorical cross-entropy loss for the sequence and a frame-aligned point error for the protein structure.
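To make the sequence term concrete, the following Python sketch computes a categorical cross-entropy over per-position amino acid scores. It is an illustration under stated assumptions rather than the authors' implementation: the function names and the toy scores are hypothetical, and the frame-aligned point error used for the structure is not reproduced here.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_cross_entropy(logits_per_position, true_indices):
    """Mean negative log-probability assigned to the true amino acid at each position."""
    losses = []
    for logits, true_index in zip(logits_per_position, true_indices):
        probs = softmax(logits)
        losses.append(-math.log(probs[true_index]))
    return sum(losses) / len(losses)

# Toy example with 20 amino acid classes and a single position whose true class is index 3.
uninformed = [[0.0] * 20]                    # every amino acid equally likely
confident = [[0.0] * 20]
confident[0][3] = 10.0                       # strong preference for the true amino acid

print(sequence_cross_entropy(uninformed, [3]))  # equals ln(20) for a uniform guess
print(sequence_cross_entropy(confident, [3]))   # close to 0 for a confident, correct guess
```

The loss falls as the model assigns more probability to the correct amino acid, so minimizing it pushes the denoised sequence toward the true one.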
During generation, the model started from a random, noisy sequence together with an empty protein structure, then refined both step by step, lowering the noise and improving the accuracy of its predictions.
Additionally, extra information in the form of sequence and structural guidance was used to make the model’s predictions more reliable.
The designs produced by the model were then tested experimentally to confirm that the proteins folded correctly and were stable. Folding analyses and solubility tests were also performed to verify the properties of the proteins.
Major findings
The researchers developed a novel protein sequence generation approach called ProteinGenerator, using a denoising diffusion probabilistic model.
The results showed that ProteinGenerator significantly outperformed earlier protein design models and generated structurally diverse proteins with specific structures and properties.
The method incorporated important structural motifs into the designed proteins, and the predicted structures were highly accurate, with differences as small as 2 angstroms. Furthermore, the amino acid compositions of the newly generated proteins were similar to those of naturally occurring proteins.
Of the 42 proteins designed by ProteinGenerator and tested experimentally, 32 (about 76%) were found to be soluble and monomeric, indicating that they did not clump in solution.
The proteins were also stable at temperatures as high as 95 °C, suggesting that this approach could design proteins that would be robust in real-life conditions.
Even the ProteinGenerator-designed proteins with rare amino acids, such as valine, cysteine, and tryptophan, were stable and soluble. Those containing cysteine were able to form disulfide bonds, which increased their stability. These proteins also formed the expected secondary structures, such as beta sheets and alpha helices.
The results showed that ProteinGenerator was able to adjust specific properties of the proteins, such as hydrophobicity and isoelectric point, which could be valuable for drug design and the production of therapeutic proteins.
The approach was also successful in designing repeat proteins that are useful for cellular recognition and signaling, in designing proteins that release active peptides, and in adding barcodes to proteins for identification.
Conclusions
In summary, the findings indicated that the DDPM-based protein sequence generation tool ProteinGenerator could create a diverse array of functional proteins with specific properties, which were also stable in a wide range of conditions.
These newly generated proteins also closely matched natural proteins in amino acid composition, and the diversity of proteins generated through this approach highlights its utility in drug discovery and other biomedical fields.
Journal reference:
Lisanza, S. L., Gershon, J. M., Samuel, Sims, J. N., Arnoldt, L., Hendel, S. J., Simma, M. K., Liu, G., Yase, M., Wu, H., Tharp, C. D., Li, X., Kang, A., Brackenbrough, E., Bera, A. K., Gerben, S., Wittmann, B. J., McShan, A. C., & Baker, D. (2024). Multistate and functional protein design using RoseTTAFold sequence space diffusion. Nature Biotechnology. doi:10.1038/s41587-024-02395-w. https://www.nature.com/articles/s41587-024-02395-w