1. Model data

1) Target prediction dataset

The target prediction dataset consists of 646498 small molecules interacting with 640 targets. For details of the dataset, click here.

2) Bioactivity prediction dataset

Target Training set Validation set Test set Total
11β-HSD1 10074 1260 1257 12591
AKT1 6760 845 842 8447
AKT2 2834 354 351 3539
AKT3 2192 275 271 2738
ALK 8337 1042 1038 10417
ACE 1878 235 231 2344
AURKA 10080 1259 1257 12596
AURKB 5615 702 697 7014
BCL2 2289 286 283 2858
BRAF 7179 899 893 8971
BRD4 10235 1280 1276 12791
BTK 14095 1762 1758 17615
CCR2 7412 925 924 9261
CDK1 2691 336 332 3359
CDK2 3297 413 409 4119
CHK1 7328 917 912 9157
CB1R 8379 1046 1045 10470
Cathepsin B 2812 351 348 3511
Cathepsin K 3840 479 478 4797
Cathepsin S 5524 689 688 6901
DPP-4 10229 1278 1276 12783
EGFR 18787 2348 2347 23482
FGFR1 7105 888 885 8878
FGFR2 3142 392 389 3923
FGFR3 6170 770 765 7705
FLT3 7381 922 920 9223
HDAC1 11111 1389 1385 13885
HDAC2 3874 484 481 4839
HDAC6 8290 1036 1033 10359
IGF1R 9160 1144 1143 11447
JAK1 13713 1714 1710 17137
JAK2 19024 2379 2374 23777
JAK3 13294 1661 1658 16613
MEK1 4019 503 499 5021
MMP-2 8199 1026 1021 10246
MMP-3 5090 635 632 6357
MMP-9 6801 852 847 8500
MMP-13 10697 1336 1333 13366
MMP-14 1766 222 216 2204
MR 1811 226 222 2259
NAMPT 6611 825 822 8258
PDGFR-α 3169 397 393 3959
PDGFR-β 9904 1239 1235 12378
PI3K-α 19388 2425 2420 24233
PI3K-β 7879 985 982 9846
PI3K-δ 15129 1890 1887 18906
PI3K-γ 14943 1867 1865 18675
PKC-θ 4462 557 554 5573
Renin 9346 1169 1165 11680
SYK 16779 2098 2094 20971
TNF-α 1777 223 217 2217
VEGFR1 5139 641 639 6419
VEGFR2 27342 3420 3414 34176
VEGFR3 1984 249 244 2477
ZAP70 2616 326 323 3265
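
The per-target counts above are close to an 80/10/10 division. As a rough illustration of how a split stratified by activity could be produced, the sketch below bins activity values and divides each bin separately. The bin width, the rounding rule, the random seed, and the function name `stratified_split` are assumptions made for this example, not details of the actual pipeline.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple


def stratified_split(
    activities: Sequence[float],
    ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1),
    bin_width: float = 1.0,   # assumed width of each activity bin
    seed: int = 0,            # assumed seed, for reproducibility only
) -> Tuple[List[int], List[int], List[int]]:
    """Return (train, validation, test) index lists, stratified by activity bin."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for idx, value in enumerate(activities):
        bins[int(value // bin_width)].append(idx)

    train, valid, test = [], [], []
    for key in sorted(bins):
        members = bins[key]
        rng.shuffle(members)
        n_train = round(ratios[0] * len(members))
        n_valid = round(ratios[1] * len(members))
        train += members[:n_train]
        valid += members[n_train:n_train + n_valid]
        test += members[n_train + n_valid:]   # remainder of the bin
    return train, valid, test


# Toy data: 100 made-up activity values spread evenly over five bins.
toy_activities = [4.5 + (i % 5) for i in range(100)]
train_idx, valid_idx, test_idx = stratified_split(toy_activities)
```

Splitting within each bin keeps the activity distribution similar across the three subsets, which is the purpose of stratification.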

3) ADMET prediction dataset

Endpoint Training set Validation set Test set Total
Log D7.4 2940 420 840 4200
Log S 6987 998 1997 9982
HIA 403 58 117 578
Pgp inhibitor 851 122 245 1218
BBB Penetration 1422 202 406 2030
CYP2C9 inhibitor 8464 1209 2419 12092
CYP2D6 inhibitor 9191 1313 2626 13130
CYP3A4 inhibitor 8628 1233 2467 12328
CYP2C19 inhibitor 8865 1266 2534 12665
CYP1A2 inhibitor 8805 1258 2516 12579
CYP2D6 substrate 465 67 135 667
CYP3A4 substrate 468 67 135 670
T1/2 465 67 135 667
Tox21 6265 783 783 7831

2. Model framework

1) DMGP framework for target prediction

To implement target prediction for small-molecule compounds, we constructed a Double Molecular Graph Perception (DMGP) framework from TrimNet and DMPNN, which combines the predictions of the two algorithms to rank the probable targets of a query molecule. First, we designed a multi-task binary classification model using TrimNet to learn the effect of a compound on multiple targets (positive or negative ligand). TrimNet is a graph-based approach recently proposed by our research group; it has few parameters and high prediction accuracy, and adopts a novel triplet message mechanism to learn molecular representations effectively. For an input molecule, TrimNet outputs a 640-dimensional probability vector with values between 0 and 1, one element per target; each element is the probability that the query molecule is a positive molecule for the corresponding target. The DMPNN model, the other branch of the DMGP framework, estimates the high-dimensional similarity of the query molecule to the positive molecules of the 640 targets. For an input molecule, DMPNN also outputs a 640-dimensional probability vector with values between 0 and 1, whose elements sum to 1. Finally, elementwise multiplication of the TrimNet and DMPNN output vectors yields a 640-dimensional relevance score vector with values between 0 and 1, whose 640 elements are the final relevance scores of the query molecule for the 640 targets. The greater the relevance score of a target, the more likely it is to be a target of the query molecule. The workflow of the DMGP framework is shown in Figure 1.

...

Figure 1. Workflow of DMGP framework for target prediction
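
The final scoring step of DMGP, elementwise multiplication of the two branch outputs followed by ranking, can be sketched as follows. This is a minimal illustration only: the function name `rank_targets`, the target labels, and the toy probabilities (4 targets instead of 640) are hypothetical, and the two input vectors stand in for real TrimNet and DMPNN outputs.

```python
from typing import List, Sequence, Tuple


def rank_targets(
    trimnet_probs: Sequence[float],
    dmpnn_probs: Sequence[float],
    target_names: Sequence[str],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Multiply the two branch outputs elementwise and return the top-k targets."""
    if not (len(trimnet_probs) == len(dmpnn_probs) == len(target_names)):
        raise ValueError("all inputs must have one entry per target")
    # Relevance score per target: product of the two branch probabilities.
    scores = [p * q for p, q in zip(trimnet_probs, dmpnn_probs)]
    ranked = sorted(zip(target_names, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Toy example with made-up values.
names = ["T1", "T2", "T3", "T4"]           # hypothetical target labels
trimnet = [0.9, 0.2, 0.6, 0.1]             # independent per-target probabilities
dmpnn = [0.25, 0.5, 0.125, 0.125]          # sums to 1 across targets
top = rank_targets(trimnet, dmpnn, names, top_k=2)
```

Because the DMPNN vector sums to 1 while the TrimNet vector does not, the product is best read as a ranking score rather than a calibrated probability.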

2) MSAP framework for bioactivity prediction

To implement bioactivity prediction for small-molecule compounds, we developed a Multi-model Self-validation Activity Prediction (MSAP) framework consisting of 7 ML regression models: 4 graph-based deep learning models, namely Message Passing Neural Network (MPNN), Directed Message Passing Neural Network (DMPNN), Graph Attention Network (GAT), and Graph Isomorphism Network (GIN), and 3 traditional ML models based on molecular fingerprints, namely Support Vector Machine (SVM), Random Forest (RF), and eXtreme Gradient Boosting (XGBoost). For each target, we randomly divided the structure-activity dataset of small-molecule inhibitors into training, validation, and test sets in a ratio of 8:1:1 by stratified sampling of the activity data, and then trained, validated, and tested the MSAP framework. For each query molecule, once a target of interest is selected, two prediction modes are available: Best-mode and Merge-mode. Both are based on the performance of the MSAP framework on the test set. Best-mode uses the best-performing model to predict the pIC50 value of the query molecule. Merge-mode uses several models in the framework whose performance meets the established criteria to predict the pIC50 value simultaneously; after abnormal predicted values are excluded, the average of the remaining predictions is taken as the final predicted pIC50 value. The workflow of small-molecule bioactivity prediction is shown in Figure 2.

...

Figure 2. Workflow of bioactivity prediction
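
The two prediction modes can be sketched roughly as below. The selection criterion and the rule for excluding abnormal predictions are not specified here, so the sketch assumes a test-set MAE cutoff (`mae_cutoff`) and a maximum deviation from the median prediction (`max_dev`). Both parameters, the function names, and the constant toy predictors are hypothetical; only the MAE values are taken from the AKT1 row of Table 3.

```python
from statistics import mean, median_low
from typing import Callable, Dict

Predictor = Callable[[str], float]  # maps a SMILES string to a predicted pIC50


def best_mode(models: Dict[str, Predictor], test_mae: Dict[str, float], smiles: str) -> float:
    """Predict with the single model that has the lowest test-set MAE."""
    best_name = min(test_mae, key=test_mae.get)
    return models[best_name](smiles)


def merge_mode(
    models: Dict[str, Predictor],
    test_mae: Dict[str, float],
    smiles: str,
    mae_cutoff: float = 0.47,   # assumed selection criterion
    max_dev: float = 1.0,       # assumed outlier tolerance, in pIC50 units
) -> float:
    """Average the predictions of all models that pass the assumed MAE cutoff."""
    selected = [name for name, mae in test_mae.items() if mae <= mae_cutoff]
    if not selected:
        raise ValueError("no model meets the selection criterion")
    preds = [models[name](smiles) for name in selected]
    # median_low always returns one of the predictions, so at least one is kept.
    centre = median_low(preds)
    kept = [p for p in preds if abs(p - centre) <= max_dev]
    return mean(kept)


# Test-set MAE values for AKT1 (Table 3); the predictors return made-up constants.
akt1_mae = {"RF": 0.454, "SVM": 0.453, "XGBoost": 0.457, "DMPNN": 0.466, "MPNN": 0.565}
toy_models = {
    "RF": lambda s: 6.9,
    "SVM": lambda s: 7.1,
    "XGBoost": lambda s: 7.0,
    "DMPNN": lambda s: 9.5,   # deliberately abnormal prediction
    "MPNN": lambda s: 5.0,    # excluded by the MAE cutoff
}
best_pred = best_mode(toy_models, akt1_mae, "CCO")
merged_pred = merge_mode(toy_models, akt1_mae, "CCO")
```

In this toy case Best-mode returns the SVM prediction, while Merge-mode averages the RF, SVM, and XGBoost predictions after dropping the abnormal DMPNN value.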

3) Model for ADMET prediction

For ADME-related endpoints, many reliable, high-performing open-source computational methods have been developed. Among them, Therapeutics Data Commons (TDC), developed by Huang et al., integrates several machine learning datasets and tasks related to drug development, helping to accelerate the development, validation, and transition of machine learning models to clinical implementation. To predict the ADME-related properties of compounds accurately and quickly, we used the molecular characterization approach in TDC together with suitable machine learning models to model the ADME-related datasets, and obtained models with good predictive performance. The models used on the ADME-related datasets and the validation methods are shown in Table 1. In addition, since our previously developed TrimNet performs comparably to state-of-the-art models for toxicity prediction, we used TrimNet to model the toxicity-related dataset.

Table 1. Models and validation methods used for the ADMET endpoint datasets

Dataset Model Validation Reference
Log D7.4 ContextPred 5-fold CV arXiv preprint arXiv:2102.09548, 2021.
Log S AttentiveFP
HIA AttrMasking
Pgp inhibitor AttrMasking
BBB Penetration ContextPred
CYP2C9 inhibitor ContextPred
CYP2D6 inhibitor ContextPred
CYP3A4 inhibitor ContextPred
CYP2C19 inhibitor ContextPred
CYP1A2 inhibitor ContextPred
CYP2D6 substrate ContextPred
CYP3A4 substrate CNN
T1/2 AttrMasking
Tox21 TrimNet Train/validation early stopping Briefings in Bioinformatics, 2021, 22(4): bbaa266.

3. Model performance

1) Model performance of target prediction

Table 2. Performance of DMGP framework on the external validation dataset

Hit target number Top-1 (%) Top-5 (%) Top-10 (%) Top-15 (%)
1 55.8 84.1 90.7 93
2 - 67.9 77.4 81.7
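
As an illustration of how a top-K accuracy of this kind can be computed, the sketch below counts a molecule as a success when at least `min_hits` of its known targets appear among the K highest-ranked predictions. Reading "hit target number" this way, and skipping molecules with fewer than `min_hits` known targets, are assumptions; the function name and toy data are hypothetical.

```python
from typing import Sequence, Set


def top_k_accuracy(
    ranked: Sequence[Sequence[str]],
    known: Sequence[Set[str]],
    k: int,
    min_hits: int = 1,
) -> float:
    """Percentage of molecules with at least `min_hits` known targets in the top k."""
    eligible = 0
    successes = 0
    for ranking, truth in zip(ranked, known):
        if len(truth) < min_hits:
            continue  # assumed: molecule cannot reach min_hits, so it is skipped
        eligible += 1
        hits = len(set(ranking[:k]) & truth)
        if hits >= min_hits:
            successes += 1
    if eligible == 0:
        raise ValueError("no molecule has enough known targets")
    return 100.0 * successes / eligible


# Toy data: predicted rankings (best first) and known targets for 4 molecules.
toy_ranked = [
    ["A", "B", "C", "D"],
    ["B", "A", "D", "C"],
    ["C", "D", "A", "B"],
    ["D", "C", "B", "A"],
]
toy_known = [{"A"}, {"A", "D"}, {"B"}, {"C", "D"}]

top1_one_hit = top_k_accuracy(toy_ranked, toy_known, k=1, min_hits=1)
top2_two_hits = top_k_accuracy(toy_ranked, toy_known, k=2, min_hits=2)
top3_two_hits = top_k_accuracy(toy_ranked, toy_known, k=3, min_hits=2)
```

Under this reading, two hits cannot occur within the top 1, which matches the empty top-1 cell in the second row of Table 2.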

2) Model performance of bioactivity prediction

Table 3. MAE of MSAP framework on structure-activity datasets of small-molecule inhibitors of 56 disease-related targets

Target RF SVM XGBoost DMPNN MPNN
11β-HSD1 0.398 0.435 0.449 0.43 0.452
AKT1 0.454 0.453 0.457 0.466 0.565
AKT2 0.325 0.353 0.378 0.355 0.457
AKT3 0.158 0.185 0.196 0.176 0.386
ALK 0.282 0.309 0.337 0.326 0.361
ACE 0.699 - 0.74 0.831 -
AURKA 0.339 0.389 0.395 0.406 0.424
AURKB 0.397 0.426 0.459 - -
BCL2 0.296 0.304 0.327 0.328 0.465
BRAF 0.32 0.342 0.374 0.366 0.396
BRD4 0.266 0.283 0.298 0.288 0.347
BTK 0.393 0.429 0.443 0.439 -
CCR2 0.314 0.341 0.368 0.387 -
c-Src 0.343 0.387 0.395 0.362 0.439
CDK1 0.295 0.336 0.35 0.332 -
CDK2 0.508 0.568 0.553 - -
CHK1 0.487 - 0.547 0.539 -
CB1R 0.3 0.371 0.371 0.385 0.445
Cathepsin B 0.346 0.415 0.393 0.404 -
Cathepsin K 0.355 0.396 0.407 0.384 0.522
Cathepsin S 0.324 0.371 0.358 0.368 0.452
DPP-4 0.447 0.5 0.488 0.475 0.532
EGFR 0.405 0.481 0.479 0.457 0.492
FGFR1 0.323 0.335 0.345 0.33 0.403
FGFR2 0.2 0.224 0.25 0.24 0.347
FGFR3 0.192 0.234 0.247 0.239 0.36
FLT3 0.322 0.361 0.38 0.357 0.421
HDAC1 0.388 0.447 0.439 0.405 0.473
HDAC2 0.388 0.464 0.45 0.409 0.503
HDAC6 0.369 0.422 0.404 0.362 0.414
IGF1R 0.277 0.314 0.315 0.328 0.374
JAK1 0.399 0.417 0.423 0.411 0.464
JAK2 0.364 0.396 0.411 0.392 0.428
JAK3 0.338 0.396 0.394 0.399 0.437
MEK1 0.243 0.29 0.286 0.302 0.377
MMP-2 0.484 0.601 0.505 0.505 0.63
MMP-3 0.404 0.433 0.443 0.443 0.542
MMP-9 0.418 0.485 0.447 0.429 0.585
MMP-13 0.489 0.567 0.549 0.581 0.61
MMP-14 0.454 - 0.447 0.455 -
MR 0.365 - 0.387 0.395 -
NAMPT 0.327 0.363 0.334 0.35 0.417
PDGFR-α 0.229 0.312 0.291 0.326 0.366
PDGFR-β 0.297 0.353 0.368 0.344 0.389
PI3K-α 0.338 0.373 0.383 0.379 0.401
PI3K-β 0.368 0.398 0.407 0.416 0.457
PI3K-δ 0.352 0.378 0.39 0.378 0.423
PI3K-γ 0.302 0.333 0.335 0.338 0.368
PKC-θ 0.346 0.378 0.378 - -
Renin 0.533 0.579 0.592 0.583 -
SYK 0.282 0.306 0.33 0.31 0.336
TNF-α 0.159 0.167 0.19 0.153 -
VEGFR1 0.214 0.252 0.257 0.248 0.304
VEGFR2 0.331 - 0.403 0.369 0.392
VEGFR3 0.236 0.267 0.295 0.323 0.359
ZAP70 0.197 0.252 0.255 0.243 0.407

3) Model performance of ADMET prediction

Table 4. Performance of ADMET prediction models

Property Task type AUROC ACC SE SP
HIA classification 0.976 0.945 0.96 0.896
Pgp inhibitor classification 0.929 0.846 0.893 0.797
BBB Penetration classification 0.897 0.871 0.927 0.631
CYP1A2 inhibitor classification 0.948 0.878 0.861 0.893
CYP2C9 inhibitor classification 0.919 0.852 0.792 0.88
CYP2C19 inhibitor classification 0.932 0.86 0.884 0.84
CYP2D6 inhibitor classification 0.907 0.895 0.635 0.948
CYP2D6 substrate classification 0.848 0.787 0.684 0.835
CYP3A4 inhibitor classification 0.921 0.834 0.85 0.822
CYP3A4 substrate classification 0.641 0.567 0.509 0.65
T1/2 classification 0.763 0.695 0.749 0.621
Tox21 classification 0.856 0.828 0.738 0.836
- - MAE MSE RMSE R2
Log D7.4 regression 0.535 0.471 0.686 0.665
Log S regression 0.789 1.205 1.097 0.771
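
For reference, the threshold-based classification metrics and the regression metrics in Table 4 can be computed as in the sketch below. It assumes that SE and SP denote sensitivity and specificity, uses the standard definitions, and runs on made-up labels and values; the function names are hypothetical. AUROC is omitted because it requires predicted scores rather than hard labels.

```python
import math
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity (SE), and specificity (SP) from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Assumes both classes are present, so neither denominator is zero.
    return {
        "ACC": (tp + tn) / len(pairs),
        "SE": tp / (tp + fn),
        "SP": tn / (tn + fp),
    }


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """MAE, MSE, RMSE, and coefficient of determination (R2)."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = sum(e * e for e in errors)
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "R2": 1.0 - ss_res / ss_tot}


# Toy inputs (made up for illustration).
cls = classification_metrics([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 1])
reg = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

Since RMSE is the square root of MSE, the two regression columns carry the same information on different scales.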