Modern drug discovery relies on predicting the binding affinities or biochemical activity of candidate drugs against their target proteins. As data-driven methods that use artificial intelligence (AI) to predict compound activity become more common, a benchmark that can evaluate these methods for real-world applications is needed.
In a recent study published in Communications Chemistry, researchers presented the Compound Activity benchmark for Real-world Applications (CARA), a compound activity prediction benchmark curated for practical, real-world use.
Study: Benchmarking compound activity prediction for real-world drug discovery applications.
Background
Predicting the binding affinities of potential drug compounds against a target protein is an essential step of the modern drug discovery process, but the pipeline also includes numerous other steps to characterize and predict the activity of the compounds and to optimize the drugs.
Modern, data-driven methods of predicting compound activity, such as AI, deep learning, and machine learning, are more efficient and accurate than traditional knowledge-based methods, such as computer-aided drug design.
The success of data-driven methods in predicting binding affinities and compound activity depends on learning compound activity patterns from high-quality, large-scale data.
However, compound activity is measured using various cell-based experiments and biochemical and biophysical methods, which makes large-scale data difficult to obtain.
Moreover, although various large-scale benchmark datasets on compound activity are available, a benchmark designed to evaluate these data-driven methods from a real-world perspective is lacking.
About the Study
In the present study, the researchers curated CARA, a benchmark built around the characteristics of real-world data, to evaluate compound activity prediction for practical applications.
To develop CARA, the researchers first analyzed the compound activity data from existing drug-discovery processes in the ChEMBL database.
The activity data in ChEMBL are grouped by assay, so that measurements of different compounds against the same protein target under the same conditions are cataloged together.
The researchers filtered the ChEMBL data to retain single protein targets and small-molecule ligands with molecular weights below 1,000. They also removed samples that were poorly annotated or had missing values.
The samples were then arranged according to individual measurement types, and replicate measurements were combined by taking their median as the final reported value. The compound activity data were then divided into virtual screening and lead optimization categories.
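The filtering and replicate-merging steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' pipeline: the record fields (`assay`, `compound`, `mol_weight`, `type`, `value`) and the toy values are hypothetical and do not reflect the ChEMBL schema.

```python
from collections import defaultdict
from statistics import median

# Hypothetical toy records; field names are illustrative, not the ChEMBL schema.
records = [
    {"assay": "A1", "compound": "c1", "mol_weight": 320.4, "type": "IC50", "value": 7.1},
    {"assay": "A1", "compound": "c1", "mol_weight": 320.4, "type": "IC50", "value": 6.9},
    {"assay": "A1", "compound": "c1", "mol_weight": 320.4, "type": "IC50", "value": 7.0},
    {"assay": "A1", "compound": "c2", "mol_weight": 1200.0, "type": "IC50", "value": 6.0},
    {"assay": "A1", "compound": "c3", "mol_weight": 410.2, "type": "IC50", "value": None},
    {"assay": "A2", "compound": "c4", "mol_weight": 250.1, "type": "Ki", "value": 5.5},
]

MAX_MOL_WEIGHT = 1000.0  # small-molecule cutoff mentioned in the article


def filter_records(rows):
    """Drop records with missing measurements or ligands at or above the weight cutoff."""
    kept = []
    for row in rows:
        if row.get("value") is None or row.get("type") is None:
            continue
        if row["mol_weight"] >= MAX_MOL_WEIGHT:
            continue
        kept.append(row)
    return kept


def combine_replicates(rows):
    """Merge replicates of the same (assay, compound, measurement type) by their median."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["assay"], row["compound"], row["type"])].append(row["value"])
    return {key: median(values) for key, values in groups.items()}


combined = combine_replicates(filter_records(records))
```

Using the median rather than the mean keeps a single outlying replicate from dominating the reported value.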
The virtual screening process increases efficiency and success rates while lowering the experimental screening costs. The lead optimization stage is needed to ensure that the candidate compounds will be effective in the clinical experiments.
The assays were divided into training and test sets, and those with varied protein targets were used as test sets to evaluate the different models of compound activity prediction. Training and test sets were defined for both virtual screening and lead optimization tasks.
The data splitting also covered two scenarios that reflect different application settings. In the zero-shot scenario, no task-related data was available, while in the few-shot scenario, measurements for some samples were available.
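One way to picture this splitting scheme is the sketch below. It assumes, for illustration only, that whole assays are held out according to their protein target and that the few-shot setting reveals a small labelled "support" subset of each test assay; the function names, data layout, and support size are hypothetical rather than taken from the study.

```python
import random

# Hypothetical assays: each has one protein target and a list of (compound, activity) samples.
assays = {
    "A1": {"target": "T1", "samples": [("c1", 7.0), ("c2", 6.1), ("c3", 5.2)]},
    "A2": {"target": "T2", "samples": [("c4", 8.3), ("c5", 4.9)]},
    "A3": {"target": "T3", "samples": [("c6", 6.6), ("c7", 7.4), ("c8", 5.0),
                                       ("c9", 6.2), ("c10", 7.9)]},
}


def split_by_target(all_assays, test_targets):
    """Assign whole assays to train or test so that no assay is shared between them."""
    train = {a: d for a, d in all_assays.items() if d["target"] not in test_targets}
    test = {a: d for a, d in all_assays.items() if d["target"] in test_targets}
    return train, test


def few_shot_split(samples, n_support, seed=0):
    """Reveal n_support labelled samples of a test assay; the rest are used for evaluation.

    With n_support=0 this reduces to the zero-shot scenario.
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return shuffled[:n_support], shuffled[n_support:]


train_assays, test_assays = split_by_target(assays, test_targets={"T3"})
support, query = few_shot_split(test_assays["A3"]["samples"], n_support=2)
zero_support, zero_query = few_shot_split(test_assays["A3"]["samples"], n_support=0)
```

Splitting at the assay level, rather than shuffling individual measurements, keeps test-assay measurements out of the training data unless they are deliberately revealed as few-shot support samples.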
A range of deep learning and machine learning methods and training strategies for predicting compound activity were evaluated using CARA. These included DeepCPI, which used singular value decomposition; DeepDTA, which was based on convolutional neural networks; and GraphDTA, which used graph neural networks.
Major Findings
The findings indicated that, by carefully distinguishing assay types and selecting appropriate evaluation metrics, CARA could assess the bias in the distribution of real-world compound activity data and prevent the overestimation of model performance.
The assay-based evaluation metrics used in CARA provided more accurate and comprehensive results than bulk-evaluation metrics.
Tests in few-shot scenarios also revealed that training strategies that exploit cross-assay information were more effective for virtual screening tasks, while those relying on single-task information were better suited for lead optimization.
The researchers also found that the performance of various deep-learning and machine-learning methods differed across assays, and that these methods had limitations in estimating uncertainty at the sample level and in predicting activity cliffs.
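A toy example helps show why evaluating per assay can differ from pooling all predictions. The sketch below uses the Pearson correlation as a stand-in metric, which is an assumption made for illustration and not necessarily the metric used in CARA; the activity values are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented data: (true activities, predicted activities) for two assays.
# The model gets each assay's overall activity level right but ranks
# the compounds inside each assay in exactly the wrong order.
assay_results = {
    "A1": ([5.0, 5.2, 5.4], [5.4, 5.2, 5.0]),
    "A2": ([8.0, 8.2, 8.4], [8.4, 8.2, 8.0]),
}

# Bulk evaluation: pool every sample from every assay, then score once.
all_true = [t for true, _ in assay_results.values() for t in true]
all_pred = [p for _, pred in assay_results.values() for p in pred]
bulk_score = pearson(all_true, all_pred)

# Assay-based evaluation: score each assay separately, then average.
per_assay = [pearson(true, pred) for true, pred in assay_results.values()]
assay_score = sum(per_assay) / len(per_assay)
```

Here the pooled score is strongly positive because it mostly reflects the offset between the two assays, while the per-assay average is negative, which is the quantity that matters when ranking compounds within a single assay.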
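An activity cliff is a pair of structurally similar compounds with a large difference in activity, which makes such pairs hard for similarity-driven models. A minimal sketch of how such pairs could be flagged is shown below; the fingerprints are represented as plain sets of "on" bit indices, and the similarity and activity thresholds are illustrative assumptions rather than values from the study.

```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    union = bits_a | bits_b
    if not union:
        return 0.0
    return len(bits_a & bits_b) / len(union)


def find_activity_cliffs(compounds, min_similarity=0.5, min_activity_gap=2.0):
    """Return pairs of similar compounds whose activities differ by a large margin.

    compounds maps a compound name to (fingerprint bit set, activity value).
    Both thresholds are hypothetical defaults chosen for this toy example.
    """
    cliffs = []
    for name_a, name_b in combinations(sorted(compounds), 2):
        bits_a, activity_a = compounds[name_a]
        bits_b, activity_b = compounds[name_b]
        similar = tanimoto(bits_a, bits_b) >= min_similarity
        large_gap = abs(activity_a - activity_b) >= min_activity_gap
        if similar and large_gap:
            cliffs.append((name_a, name_b))
    return cliffs


# Invented toy compounds: c1 and c2 share most bits but differ sharply in activity.
toy_compounds = {
    "c1": ({1, 2, 3, 4}, 8.5),
    "c2": ({1, 2, 3, 5}, 5.0),
    "c3": ({10, 11, 12}, 5.1),
}
cliff_pairs = find_activity_cliffs(toy_compounds)
```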
Conclusions
The study aimed to develop a benchmark for predicting compound activity for drug discovery that could evaluate compound activity prediction from a real-world application perspective.
The findings indicated that CARA provided a high-quality, assay-based, large-scale dataset that could be used to evaluate and develop models for predicting compound activity. The researchers believe that CARA can pave the way to the development of more efficient data-driven drug discovery models.
Journal reference:
Tian, T., Li, S., Zhang, Z., Chen, L., Zou, Z., Zhao, D., & Zeng, J. (2024). Benchmarking compound activity prediction for real-world drug discovery applications. Communications Chemistry, 7(1), 127. doi: https://doi.org/10.1038/s42004-024-01204-4. https://www.nature.com/articles/s42004-024-01204-4