Precision oncology aims to tailor cancer treatments to individual patients by analyzing diverse clinical and pathological data. However, despite numerous technological advancements, integrating complex multimodal data, such as images and text, remains a challenge.
In a recent study published in Nature, researchers from Stanford University School of Medicine and Harvard Medical School introduced a vision-language foundation model developed for oncology applications, called MUSK (Multimodal Transformer with Unified Masked Modeling).
They evaluated MUSK across multiple benchmarks and demonstrated its potential to improve diagnosis, prognosis, and treatment predictions, offering a new avenue for enhancing personalized cancer care and addressing the limitations of existing models.
AI Models in Oncology
Clinical decision-making in oncology often requires integrating multimodal data, including imaging, pathology, and clinical reports. Traditional artificial intelligence (AI) models have struggled to effectively combine such diverse data due to the scarcity of annotated datasets.
Foundation models that have been pre-trained on large datasets have recently shown promise in medical applications and offer the potential for efficient multitasking capabilities.
However, existing models in pathology rely heavily on paired data for pre-training and focus on relatively straightforward tasks such as cancer detection or classification.
The application of AI to more complex tasks, such as treatment response and outcome prediction, where accurate forecasting is critical for personalized therapy, remains challenging.
Moreover, current clinical methods rely on staging and other basic risk factors, which often lack the precision needed for individualized care.
The Current Study
In the present study, the research team developed MUSK, a vision-language foundation model that uses a multimodal transformer architecture to integrate pathology images and clinical text.
MUSK was pre-trained in two phases — the first phase used 50 million pathology image patches and one billion text tokens in a masked modeling approach, while the second phase refined multimodal alignment with one million image-text pairs.
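To make the masked-modeling idea concrete, the sketch below illustrates the general technique in PyTorch: a fraction of image-patch tokens and text tokens is hidden, and a transformer is trained to reconstruct the hidden entries. The vocabularies, model size, and masking scheme here are toy assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of unified masked modeling (hypothetical shapes and names):
# mask some image-patch tokens and text tokens, then train a small transformer
# to reconstruct the masked entries.
import torch
import torch.nn as nn

VOCAB = 1000          # toy text vocabulary (assumption)
PATCH_CODES = 512     # toy visual token codebook, e.g. from a patch tokenizer (assumption)
DIM, MASK_RATE = 256, 0.15

class ToyMaskedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_emb = nn.Embedding(VOCAB, DIM)
        self.img_emb = nn.Embedding(PATCH_CODES, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.text_head = nn.Linear(DIM, VOCAB)
        self.img_head = nn.Linear(DIM, PATCH_CODES)

    def forward(self, text_ids, img_ids):
        # concatenate text and image tokens into one sequence and encode jointly
        x = torch.cat([self.text_emb(text_ids), self.img_emb(img_ids)], dim=1)
        h = self.encoder(x)
        t = text_ids.size(1)
        return self.text_head(h[:, :t]), self.img_head(h[:, t:])

def masked_loss(model, text_ids, img_ids):
    # randomly choose positions to mask; id 0 serves as a stand-in [MASK] token
    text_mask = torch.rand_like(text_ids, dtype=torch.float) < MASK_RATE
    img_mask = torch.rand_like(img_ids, dtype=torch.float) < MASK_RATE
    text_logits, img_logits = model(text_ids.masked_fill(text_mask, 0),
                                    img_ids.masked_fill(img_mask, 0))
    ce = nn.functional.cross_entropy
    # reconstruction loss is computed only at the masked positions
    return ce(text_logits[text_mask], text_ids[text_mask]) + \
           ce(img_logits[img_mask], img_ids[img_mask])

model = ToyMaskedModel()
text = torch.randint(1, VOCAB, (2, 32))        # toy batch of report tokens
img = torch.randint(1, PATCH_CODES, (2, 64))   # toy batch of patch tokens
print(masked_loss(model, text, img).item())
```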
The training images were drawn from histopathology slides covering 33 tumor types from more than 11,000 patients, while the text data came from pathology reports and articles.
MUSK's architecture includes independent vision and language modules, which enable efficient processing of each modality. Furthermore, the pre-training utilized unpaired data, which helped overcome the scarcity of annotated datasets.
The model was subsequently fine-tuned for alignment using paired data and contrastive learning to ensure the robust integration of multimodal data.
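This alignment step relies on contrastive learning over paired data. A generic CLIP-style contrastive (InfoNCE) loss, shown below as an illustrative stand-in rather than MUSK's exact objective, pulls matching image and text embeddings together while pushing mismatched pairs in the same batch apart.

```python
# Generic contrastive (InfoNCE) loss over paired embeddings; an illustration of
# the alignment idea, not MUSK's actual objective or hyperparameters.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (batch, dim) outputs of the vision and language modules
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature       # pairwise cosine similarities
    targets = torch.arange(img.size(0))        # the i-th image matches the i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# toy usage with random embeddings standing in for encoder outputs
print(contrastive_loss(torch.randn(8, 256), torch.randn(8, 256)).item())
```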
The researchers evaluated the model using a range of downstream tasks, including image and text retrieval, visual question answering, and molecular biomarker prediction. MUSK was also applied to prognosis and immunotherapy response predictions using multimodal datasets, integrating clinical reports and whole-slide images.
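Cross-modal retrieval with such a model typically amounts to ranking candidates by similarity in a shared embedding space. The snippet below is a hypothetical illustration of that general procedure; the function name and shapes are assumptions, not MUSK's API.

```python
# Illustrative zero-shot cross-modal retrieval: rank candidate image embeddings
# by cosine similarity to a query text embedding (generic approach).
import torch
import torch.nn.functional as F

def retrieve(query_txt_emb, image_embs, top_k=5):
    q = F.normalize(query_txt_emb, dim=-1)      # (dim,) query embedding
    gallery = F.normalize(image_embs, dim=-1)   # (num_images, dim) candidates
    scores = gallery @ q                        # cosine similarity to each image
    return torch.topk(scores, k=top_k).indices  # indices of best-matching images

# toy usage: 100 candidate patch embeddings, one report-derived query embedding
print(retrieve(torch.randn(256), torch.randn(100, 256)))
```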
The major datasets for evaluation included benchmark sets for image classification, molecular biomarker identification, and patient outcome predictions.
Key Findings
The study demonstrated that MUSK outperformed existing models across numerous oncology tasks, highlighting the utility of integrating multimodal data.
In cross-modal retrieval, MUSK displayed notable accuracy in text-to-image and image-to-text retrieval tasks and outperformed state-of-the-art models on benchmark datasets. For visual question answering, MUSK’s performance showed improved accuracy over models specifically designed for this task.
Furthermore, in melanoma relapse prediction, MUSK integrated pathology images and clinical reports to achieve high predictive accuracy and surpassed other models in sensitivity and specificity.
Additionally, in pan-cancer prognosis prediction, MUSK consistently outperformed conventional clinical metrics and alternative AI models. It achieved high concordance indices across 16 cancer types, with notable success in renal cell carcinoma and low-grade glioma.
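The concordance index referenced here measures how often a model ranks patients' risk in the same order as their observed outcomes. The following is an illustrative computation of Harrell's concordance index on a small hand-made example, independent of the study's data.

```python
# Illustrative Harrell's concordance index on toy data (not the study's data).
from itertools import combinations

def c_index(times, events, risks):
    """Fraction of comparable patient pairs in which the higher-risk patient
    experiences the event earlier. times: observed times; events: 1 if the
    event occurred, 0 if censored; risks: model-predicted risk scores."""
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue
        first, second = (i, j) if times[i] < times[j] else (j, i)
        # a pair is comparable only if the earlier time is an actual event
        if not events[first]:
            continue
        comparable += 1
        if risks[first] > risks[second]:
            concordant += 1
        elif risks[first] == risks[second]:
            concordant += 0.5
    return concordant / comparable

# toy example: three patients with observed events and one censored case
print(c_index(times=[5, 8, 12, 20], events=[1, 1, 1, 0], risks=[0.9, 0.6, 0.4, 0.2]))
```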
MUSK also excelled in predicting molecular biomarkers, such as human epidermal growth factor receptor 2 (HER2) status in breast cancer and isocitrate dehydrogenase (IDH) mutations in brain tumors, with significantly higher accuracy than competing methods.
The model also demonstrated improved predictive power over established biomarkers in immunotherapy response prediction by identifying subsets of patients likely to benefit from treatment despite traditionally low response rates.
These findings were also validated in lung and gastro-esophageal cancer cohorts. Across all tasks, MUSK demonstrated the advantages of multimodal integration by effectively combining text and image data to provide actionable insights for precision oncology.
Conclusions
Overall, the findings established the effectiveness of the novel MUSK model in integrating multimodal data for oncological diagnosis and prognosis. MUSK outperformed existing methods across diverse tasks, including cancer prognosis, molecular biomarker prediction, and immunotherapy response prediction.
Furthermore, by leveraging unpaired pathology images and text data for pre-training, the study highlighted the potential of advanced AI to enhance precision medicine.
While promising, the researchers stated that future validation with larger, diverse datasets is essential to establish the clinical utility of MUSK before adoption in real-world healthcare settings.
Journal reference:
- Xiang, J., Wang, X., Zhang, X., Xi, Y., Eweje, F., Chen, Y., Li, Y., Bergstrom, C., Gopaulchan, M., Kim, T., Yu, K., Willens, S., Olguin, F. M., Nirschl, J. J., Neal, J., Diehn, M., Yang, S., & Li, R. (2025). A vision–language foundation model for precision oncology. Nature. doi:10.1038/s41586-024-08378-w. https://www.nature.com/articles/s41586-024-08378-w