Ribonucleic acid (RNA) is critical in regulating plant gene expression, growth, and stress responses, with structural motifs often determining functionality. However, identifying functional RNA motifs across complex plant transcriptomes is challenging because of the sheer diversity of possible structures and sequence combinations.
In a recent study published in Nature Machine Intelligence, a team of researchers from China and the United Kingdom introduced a plant-specific RNA foundation model (FM) called PlantRNA-FM that was designed to systematically decode RNA sequence and structure features. The findings showed that by enabling transcriptome-wide analysis of functional RNA motifs, the model aids the exploration of RNA’s role in translation and its regulatory mechanisms.
Background
RNA is integral to cellular function because it encodes regulatory elements that influence gene expression, translation, and structural organization. In plants, secondary and tertiary RNA structures, such as G-quadruplexes, play vital roles, especially under conditions favoring structural stability, such as low temperatures.
Despite the importance of these motifs, systematically identifying functional RNA structures remains a challenge due to the vast sequence complexity of plant transcriptomes and experimental limitations. Computational approaches, including foundation models in genomics, have improved our understanding of molecular biology. However, these models often focus on deoxyribonucleic acid (DNA) or protein sequences, and the unique structural contributions of RNA are frequently overlooked.
The Current Study
In the present study, the researchers introduced PlantRNA-FM, an advanced computational model designed to analyze plant RNA sequences and structures. The model was pre-trained using transcriptomic data from 1,124 plant species, incorporating approximately 25 million RNA sequences and 54.2 billion nucleotides.
The program RNAfold was utilized to predict secondary structures and ensure that structural motifs remained intact during the analysis. The model architecture was based on a transformer framework that used millions of parameters, numerous layers, and attention mechanisms, and was optimized for sequence and structural representation rather than generation.
The pre-training involved three objectives: masked nucleotide modeling, RNA secondary structure prediction, and RNA annotation classification. Masked nucleotide modeling improved the model’s ability to reconstruct missing nucleotides, while secondary structure prediction relied on annotated structural data for accuracy. RNA annotation classification enabled differentiation between 5′ untranslated regions (UTRs), coding sequences, and 3′ UTRs.
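To make the first objective concrete, here is a minimal sketch of how masked nucleotide training examples could be prepared. It is an illustration only and not the authors' code: the 15% masking rate is borrowed from common masked-language-model practice, and the mask symbol `N` and the function name `mask_nucleotides` are hypothetical choices made for this example.

```python
import random

MASK = "N"  # hypothetical mask symbol; the study's actual token is not given here


def mask_nucleotides(seq, mask_rate=0.15, seed=0):
    """Hide a random subset of nucleotides and return the prediction targets.

    mask_rate=0.15 is an assumption taken from common masked-language-model
    practice, not a value reported in the article.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(seq) * mask_rate))
    positions = sorted(rng.sample(range(len(seq)), n_mask))

    masked = list(seq)
    targets = {}
    for i in positions:
        targets[i] = masked[i]  # the nucleotide the model must reconstruct
        masked[i] = MASK
    return "".join(masked), targets


masked_seq, targets = mask_nucleotides("AUGGCUACGUAGCUAGCUAA", seed=1)
print(masked_seq)
print(targets)
```

During training, a model would see the masked sequence and be scored on how well it recovers the hidden nucleotides stored in `targets`.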
The researchers developed an interpretable framework for identifying functional motifs using attention contrast matrices to highlight critical RNA features, especially in the 5′ UTR region. The study also applied hierarchical clustering to RNA structure motifs, focusing on their translation roles. Additional experimental validation involved a dual-luciferase reporter assay to confirm the biological relevance of predicted motifs.
The researchers explored the model’s capacity to integrate sequence and structure information. They used its interpretable framework to systematically identify translation-associated RNA motifs across transcriptomes, including secondary structures and RNA G-quadruplexes.
Major Findings
The study found that PlantRNA-FM could effectively identify and analyze functional RNA motifs in plants, and that its accuracy and interpretability surpassed those of existing models. Furthermore, pre-training on a vast dataset helped the model achieve F1 scores (a measure of accuracy that combines precision and recall) of 0.958 and 0.974 for genic region annotation in rice and Arabidopsis, respectively, which were significantly higher than those of other models.
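For reference, the F1 score is the harmonic mean of precision and recall. The short sketch below computes it from raw counts for a single class. The counts are invented for illustration, and the study's exact averaging scheme across the three region classes is not described here, so this should be read as the general definition rather than the authors' evaluation code.

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall for one class.

    tp, fp, fn are true-positive, false-positive and false-negative counts.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only (not data from the study)
print(f1_score(tp=90, fp=10, fn=10))
```

Because it is a harmonic mean, F1 is pulled toward the lower of the two values, so a model cannot score well by maximizing precision at the expense of recall or vice versa.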
Additionally, PlantRNA-FM attained F1 scores of 0.735 for Arabidopsis and 0.737 for rice in translation efficiency prediction, which highlighted its capability to discern functional RNA features. The model's attention contrast framework also revealed that nucleotides near the start codon in 5′ UTR regions are crucial for translation.
The model also identified conserved Kozak sequence motifs — nucleic acid motifs that initiate protein translation in eukaryotic messenger RNA — in Arabidopsis and rice. Furthermore, PlantRNA-FM systematically uncovered 112 secondary structure motifs, categorized into high- and low-translation-associated motifs based on their base pair characteristics.
High-translation motifs had balanced guanine-cytosine (GC) and adenine-uracil (AU) pairs, whereas low-translation motifs were enriched in GC pairs. Furthermore, the experimental validation confirmed the functional relevance of these motifs in translation regulation.
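The base-pair composition described above can be tallied directly from a sequence and its predicted secondary structure. The sketch below assumes the structure is given in dot-bracket notation, the format RNAfold produces, and counts GC, AU, and GU pairs. The function names are hypothetical, and the criterion the authors used to separate "balanced" from "GC-enriched" motifs is not specified here, so no threshold is applied.

```python
def base_pairs(structure):
    """Return (i, j) index pairs from a dot-bracket string such as '((..))'."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: extra ')'")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure: extra '('")
    return pairs


def pair_composition(seq, structure):
    """Count GC, AU, GU and other base pairs in a folded RNA motif."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    names = {"CG": "GC", "AU": "AU", "GU": "GU"}  # keys are sorted pair letters
    counts = {"GC": 0, "AU": 0, "GU": 0, "other": 0}
    for i, j in base_pairs(structure):
        key = "".join(sorted(seq[i] + seq[j]))
        counts[names.get(key, "other")] += 1
    return counts


# Toy hairpins, not motifs from the study
print(pair_composition("GGCAAAAGCC", "(((....)))"))  # stem made of GC pairs
print(pair_composition("GAUAAAAAUC", "(((....)))"))  # mixed GC and AU stem
```

GC pairs form three hydrogen bonds against two for AU, so a GC-rich stem is generally more stable, which is consistent with the reported link between GC-enriched motifs and lower translation.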
The model also identified translation-associated RNA G-quadruplex motifs, which act as translation repressors. Experimentally disrupting these motifs increased translation efficiency by up to 5.8-fold, further validating their regulatory role.
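For readers unfamiliar with G-quadruplexes, candidate sites are often located with a simple sequence pattern: four runs of guanines separated by short loops. The sketch below uses that widely used consensus (guanine tracts of at least three, loops of one to seven nucleotides) as a stand-in. It is not the method PlantRNA-FM uses, which relies on learned attention patterns rather than a fixed pattern, and the function name and default thresholds are assumptions for this example.

```python
import re


def find_putative_g4(seq, min_tract=3, max_loop=7):
    """Find putative G-quadruplex sites: four G-tracts joined by three loops.

    min_tract and max_loop follow a common consensus pattern and are
    assumptions for this sketch, not parameters from the study.
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    tract = "G{%d,}" % min_tract
    loop = "[ACGU]{1,%d}" % max_loop
    pattern = re.compile("%s(?:%s%s){3}" % (tract, loop, tract))
    # Non-overlapping matches as (start, end, matched sequence)
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(seq)]


# Toy 5' UTR-like sequence, not taken from the study
print(find_putative_g4("AUGGGAGGGCUGGGAAGGGCA"))
print(find_putative_g4("AUGCAUGCAUGC"))
```

A pattern match only flags a candidate. Whether the structure actually forms and represses translation has to be tested experimentally, as the reporter assays in the study did.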
Overall, the results demonstrated PlantRNA-FM’s ability to integrate sequence and structural data and to precisely identify RNA motifs that influence translation across diverse plant transcriptomes.
Conclusions
The study introduced PlantRNA-FM, a robust model that integrated RNA sequence and structure data to identify functional RNA motifs in plants. The model demonstrated superior performance in predicting genic regions and translation efficiency. The findings suggested that by revealing critical RNA features, including secondary and tertiary structures, PlantRNA-FM could advance our understanding of the regulatory roles of RNA.
Journal reference:
Yu, H., Yang, H., Sun, W., Yan, Z., Yang, X., Zhang, H., Ding, Y., & Li, K. (2024). An interpretable RNA foundation model for exploring functional RNA motifs in plants. Nature Machine Intelligence. DOI: 10.1038/s42256-024-00946-z,
https://www.nature.com/articles/s42256-024-00946-z