Phenotypic diversity is largely a result of differences in splicing and gene expression due to variations in the genome, and correlating changes in the deoxyribonucleic acid (DNA) to altered gene expression or splicing patterns has helped uncover the relationship between genes, molecular traits, and phenotypes.
However, most molecular association studies overrepresent individuals of European ancestry.
In a recent study published in Nature, a team of scientists from Johns Hopkins University developed an open-access dataset of ribonucleic acid (RNA) sequences from lymphoblastoid cell lines of 731 individuals spanning 26 populations and five continental groups.
Study: Sources of gene expression variation in a globally diverse human cohort.
Background
The phenotypic variations that arise between conspecifics and between species are largely a result of genetic variations that impact RNA splicing and gene expression.
Molecular association studies have explored the genetic basis of phenotypic traits by observing correlations between changes in splicing and expression and phenotypic variations.
However, many early molecular association studies were conducted on populations of European ancestry. This introduces a significant bias in the results and makes it difficult to generalize the findings to other populations or to understand the diversity and evolution of gene expression as a whole.
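Limited diversity in the study population also constrains fine-mapping: where linkage disequilibrium is extensive, nearby variants are inherited together, which makes it hard to tell a causal variant from its correlated neighbors.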
About the study
In the present study, researchers from Johns Hopkins University developed an open-access database of gene expression data from 26 populations of different ancestries.
This multi-ancestry gene expression analysis resource, MAGE, consists of RNA sequence data from lymphoblastoid cell lines obtained from human populations belonging to five continental groups.
Ancestry in the dataset was analyzed with tools that use allele frequencies to infer the number of ancestral populations, while the genotype data came from published results of the 1000 Genomes Project.
RNA sequences were obtained from pathogen- and contaminant-free lymphoblastoid cell lines. The researchers used this data and gene annotations to quantify gene expression and filter out genes expressed at low levels. Reference genomes were also used to quantify and assess the distribution of splicing.
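To make the filtering step concrete, here is a minimal Python sketch of how genes expressed at low levels might be removed from an expression table. The thresholds (`min_tpm`, `min_fraction`), the function name, and the toy values are illustrative assumptions, not the cutoffs or code used in the study.

```python
from typing import Dict, List


def filter_low_expression(
    tpm: Dict[str, List[float]],
    min_tpm: float = 0.1,       # assumed threshold, not taken from the study
    min_fraction: float = 0.2,  # assumed threshold, not taken from the study
) -> Dict[str, List[float]]:
    """Keep genes whose TPM exceeds min_tpm in at least min_fraction of samples."""
    kept = {}
    for gene, values in tpm.items():
        if not values:
            continue
        n_expressed = sum(1 for v in values if v > min_tpm)
        if n_expressed / len(values) >= min_fraction:
            kept[gene] = values
    return kept


# Toy expression table: gene -> TPM per sample (made-up values).
toy_tpm = {
    "GENE_A": [5.2, 3.1, 0.0, 7.8, 4.4],
    "GENE_B": [0.0, 0.0, 0.05, 0.0, 0.0],
    "GENE_C": [0.0, 0.3, 0.0, 0.5, 0.0],
}
expressed = filter_low_expression(toy_tpm)
print(sorted(expressed))
```

In this toy table, GENE_B never rises above the threshold and is dropped, while the other two genes are kept for downstream analysis.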
The expression and splicing measurements, together with the genotype data, were then used to map expression quantitative trait loci (QTLs) and splicing QTLs on the autosomes and the X chromosome.
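At its core, testing a single variant for association with a gene's expression can be pictured as a regression of expression on genotype dosage (0, 1, or 2 copies of the alternate allele). The sketch below shows that idea under strong simplifying assumptions: real QTL pipelines also adjust for covariates and correct for multiple testing, and the function name and toy values here are hypothetical.

```python
from statistics import mean
from typing import List, Tuple


def eqtl_slope(dosages: List[int], expression: List[float]) -> Tuple[float, float]:
    """Ordinary least squares of expression on genotype dosage.

    Returns (slope, pearson_r). The slope is the estimated change in
    expression per additional copy of the alternate allele.
    """
    mx, my = mean(dosages), mean(expression)
    sxx = sum((x - mx) ** 2 for x in dosages)
    syy = sum((y - my) ** 2 for y in expression)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosages, expression))
    if sxx == 0 or syy == 0:
        raise ValueError("genotype or expression has no variance")
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


# Toy data for one variant-gene pair (made-up values).
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
slope, r = eqtl_slope(dosages, expression)
print(f"slope={slope:.3f}, r={r:.3f}")
```

A slope far from zero, supported by a strong correlation, is the kind of signal that would flag the variant as a candidate expression QTL for that gene.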
The genes containing the expression QTLs and the introns of all the genes containing splicing QTLs were then fine-mapped to determine the causal variants driving the QTL signals.
The putative causal QTL signals were explored further to determine the functional and epigenomic enrichments.
Genome-wide association studies were then examined to determine if their results shared signals with the splicing and expression QTLs fine-mapped using MAGE and to explore the genetic variations involved in the expression of complex human traits.
The researchers also examined population-specific QTLs to answer the fundamental question of how widely genetic associations replicate across human groups and to what extent underlying factors drive between-group heterogeneity.
The frequency distribution of the QTLs was also evaluated across the five continental groups in the study population and compared against the results of the Genotype-Tissue Expression (GTEx) project, which consists largely of data from European and some African-American ancestries.
Major findings
The study showed that when combined with whole-genome sequence data, the large, multi-ancestry, open-access dataset MAGE developed in this study provides an opportunity to examine the evolution and diversity of splicing and gene expression patterns.
Furthermore, the dataset allows the genetic basis of variations for the major molecular phenotypes to be examined, providing information on organismal phenotypes.
The dataset consisted of lymphoblastoid cell lines from over 700 individuals belonging to 26 populations and five continental groups, namely African, European, Admixed American, East Asian, and South Asian ancestries. This wide representation of multiple ancestries in the dataset also addressed the poor representation of diverse ancestries in previous molecular association studies.
The scale of the MAGE dataset also enabled the high-resolution identification of numerous new genetic associations, causal genetic variants, and their mechanisms of action.
The study found that expression QTL effect sizes were consistent across the various populations, suggesting that a causal variant tends to have a similar effect on expression regardless of an individual's global ancestry.
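One way to picture this consistency is to estimate the same variant's effect separately within each group and compare the slopes. The sketch below does this on made-up data in which the alternate allele is common in one group and rare in the other; the group labels, values, and function names are hypothetical, and the study's own analysis is considerably more involved.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def ols_slope(x: List[float], y: List[float]) -> float:
    """Least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("genotype has no variance in this group")
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx


def per_group_slopes(samples: List[Tuple[str, int, float]]) -> Dict[str, float]:
    """Estimate the variant's effect separately within each group.

    Each sample is (group_label, genotype_dosage, expression).
    """
    by_group = defaultdict(lambda: ([], []))
    for group, dosage, expr in samples:
        by_group[group][0].append(dosage)
        by_group[group][1].append(expr)
    return {g: ols_slope(x, y) for g, (x, y) in by_group.items()}


# Made-up samples: the allele is common in G1 and rare in G2.
samples = [
    ("G1", 0, 1.0), ("G1", 1, 2.0), ("G1", 2, 3.0), ("G1", 1, 2.0),
    ("G2", 0, 1.5), ("G2", 0, 1.5), ("G2", 0, 1.5), ("G2", 1, 2.5),
]
slopes = per_group_slopes(samples)
print(slopes)
```

In this toy case the per-allele effect is the same in both groups even though the allele frequency differs, which is the pattern the consistency finding describes.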
The researchers identified over 15,000 and 16,000 putative causal expression QTLs and splicing QTLs, respectively, that showed epigenomic signature enrichments.
Of these, 1,310 expression QTLs and 1,657 splicing QTLs were from hitherto under-represented populations.
Conclusions
Overall, the study provided a large, open-access RNA sequence database from a study population of diverse ancestries. This database can be used along with other genome-wide association data to explore patterns of splicing and gene expression.
The database provides a resource for expanding the current understanding of the evolution and diversity of human gene expression. It is a useful tool for exploring the function and evolution of human genomes.
Journal reference:
Taylor, D. J., Chhetri, S. B., Tassia, M. G., Biddanda, A., Yan, S. M., Wojcik, G. L., Battle, A., & McCoy, R. C. (2024). Sources of gene expression variation in a globally diverse human cohort. Nature. doi:10.1038/s41586-024-07708-2. https://www.nature.com/articles/s41586-024-07708-2