
1. Model data

1) Target prediction dataset

The target prediction dataset consists of 646498 small molecules interacting with 640 targets. For details of the dataset, click here.

2) Bioactivity prediction dataset

Target	Training set	Validation set	Test set	Total
11β-HSD1	10074	1260	1257	12591
AKT1	6760	845	842	8447
AKT2	2834	354	351	3539
AKT3	2192	275	271	2738
ALK	8337	1042	1038	10417
ACE	1878	235	231	2344
AURKA	10080	1259	1257	12596
AURKB	5615	702	697	7014
BCL2	2289	286	283	2858
BRAF	7179	899	893	8971
BRD4	10235	1280	1276	12791
BTK	14095	1762	1758	17615
CCR2	7412	925	924	9261
CDK1	2691	336	332	3359
CDK2	3297	413	409	4119
CHK1	7328	917	912	9157
CB1R	8379	1046	1045	10470
Cathepsin B	2812	351	348	3511
Cathepsin K	3840	479	478	4797
Cathepsin S	5524	689	688	6901
DPP-4	10229	1278	1276	12783
EGFR	18787	2348	2347	23482
FGFR1	7105	888	885	8878
FGFR2	3142	392	389	3923
FGFR3	6170	770	765	7705
FLT3	7381	922	920	9223
HDAC1	11111	1389	1385	13885
HDAC2	3874	484	481	4839
HDAC6	8290	1036	1033	10359
IGF1R	9160	1144	1143	11447
JAK1	13713	1714	1710	17137
JAK2	19024	2379	2374	23777
JAK3	13294	1661	1658	16613
MEK1	4019	503	499	5021
MMP-2	8199	1026	1021	10246
MMP-3	5090	635	632	6357
MMP-9	6801	852	847	8500
MMP-13	10697	1336	1333	13366
MMP-14	1766	222	216	2204
MR	1811	226	222	2259
NAMPT	6611	825	822	8258
PDGFR-α	3169	397	393	3959
PDGFR-β	9904	1239	1235	12378
PI3K-α	19388	2425	2420	24233
PI3K-β	7879	985	982	9846
PI3K-δ	15129	1890	1887	18906
PI3K-γ	14943	1867	1865	18675
PKC-θ	4462	557	554	5573
Renin	9346	1169	1165	11680
SYK	16779	2098	2094	20971
TNF-α	1777	223	217	2217
VEGFR1	5139	641	639	6419
VEGFR2	27342	3420	3414	34176
VEGFR3	1984	249	244	2477
ZAP70	2616	326	323	3265
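The per-target counts above are consistent with the 8:1:1 split by stratified sampling of the activity data that is described for the MSAP framework. A minimal sketch of such a split is given below. It assumes that stratification is approximated by sorting compounds by activity, cutting them into equal-sized bins, and splitting each bin 8:1:1. The function name, the number of bins, and the seed are illustrative assumptions, not the procedure actually used to build the dataset.

```python
import random


def stratified_split(activities, ratios=(0.8, 0.1, 0.1), n_bins=5, seed=0):
    """Split sample indices into train/validation/test sets, stratified by activity.

    Assumption: samples are sorted by activity, cut into n_bins equal-sized
    bins, and each bin is shuffled and split according to `ratios`.
    """
    rng = random.Random(seed)
    order = sorted(range(len(activities)), key=lambda i: activities[i])
    bin_size = -(-len(order) // n_bins)  # ceiling division
    train, valid, test = [], [], []
    for start in range(0, len(order), bin_size):
        bin_idx = order[start:start + bin_size]
        rng.shuffle(bin_idx)
        n_train = round(ratios[0] * len(bin_idx))
        n_valid = round(ratios[1] * len(bin_idx))
        train += bin_idx[:n_train]
        valid += bin_idx[n_train:n_train + n_valid]
        test += bin_idx[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test
```

Because every bin is split separately, each of the three subsets covers the full activity range rather than only its most populated part.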

3) ADMET prediction dataset

Endpoint	Training set	Validation set	Test set	Total
Log D7.4	2940	420	840	4200
Log S	6987	998	1997	9982
HIA	403	58	117	578
Pgp inhibitor	851	122	245	1218
BBB Penetration	1422	202	406	2030
CYP2C9 inhibitor	8464	1209	2419	12092
CYP2D6 inhibitor	9191	1313	2626	13130
CYP3A4 inhibitor	8628	1233	2467	12328
CYP2C19 inhibitor	8865	1266	2534	12665
CYP1A2 inhibitor	8805	1258	2516	12579
CYP2D6 substrate	465	67	135	667
CYP3A4 substrate	468	67	135	670
T1/2	465	67	135	667
Tox21	6265	783	783	7831

2. Model framework

1) DMGP framework for target prediction

To implement target prediction for small-molecule compounds, we constructed a Double Molecular Graph Perception (DMGP) framework using TrimNet and DMPNN, which combines the predictions of the two algorithms to rank the probable targets of a query molecule. First, we designed a multi-task binary classification model using TrimNet to learn the effect of a compound on multiple targets (positive or negative ligand). TrimNet is a graph-based approach with few parameters and high prediction accuracy recently proposed by our research group; it adopts a novel triplet message mechanism to learn molecular representations effectively. For an input molecule, TrimNet outputs a 640-dimensional vector of probabilities between 0 and 1, one per target, where each element is the probability that the query molecule is a positive molecule for the corresponding target. The DMPNN model, the other branch of the DMGP framework, estimates the high-dimensional similarity of the query molecule to the positive molecules of the 640 targets. For an input molecule, DMPNN also outputs a 640-dimensional vector of values between 0 and 1, whose elements sum to 1. Finally, element-wise multiplication of the TrimNet and DMPNN output vectors yields a 640-dimensional relevance score vector with values between 0 and 1, whose 640 elements are the final relevance scores of the query molecule for the 640 targets. The higher the relevance score of a target, the more likely it is to be a target of the query molecule. The workflow of the DMGP framework is shown in Figure 1.

Figure 1. Workflow of DMGP framework for target prediction
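The final scoring step of the DMGP framework, element-wise multiplication of the two branch outputs followed by ranking, can be sketched as follows. This is an illustration of that single step only: the function name rank_targets and its parameters are hypothetical, and the two input vectors stand in for the outputs of the trained TrimNet and DMPNN models, which are not reproduced here.

```python
def rank_targets(p_trimnet, p_dmpnn, target_names, top_k=5):
    """Combine two per-target score vectors and return the top_k targets.

    p_trimnet: per-target probabilities in [0, 1] (classification branch).
    p_dmpnn:   per-target similarity scores in [0, 1] that sum to 1.
    Both inputs are placeholders for real model outputs.
    """
    if not (len(p_trimnet) == len(p_dmpnn) == len(target_names)):
        raise ValueError("all inputs must have one entry per target")
    # The element-wise product is the relevance score of each target.
    relevance = [a * b for a, b in zip(p_trimnet, p_dmpnn)]
    ranked = sorted(zip(target_names, relevance), key=lambda t: t[1], reverse=True)
    return ranked[:top_k]
```

With four targets instead of 640, calling rank_targets([0.9, 0.2, 0.6, 0.1], [0.5, 0.1, 0.3, 0.1], ["EGFR", "JAK2", "BTK", "SYK"], top_k=2) ranks EGFR first and BTK second, because their products (0.45 and 0.18) are the two largest.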

2) MSAP framework for bioactivity prediction

To implement bioactivity prediction for small-molecule compounds, we developed a Multi-model Self-validation Activity Prediction (MSAP) framework consisting of 7 ML regression models: 4 graph-based deep learning models, namely Message Passing Neural Network (MPNN), Directed Message Passing Neural Network (DMPNN), Graph Attention Network (GAT), and Graph Isomorphism Network (GIN), and 3 traditional ML models based on molecular fingerprints, namely Support Vector Machine (SVM), Random Forest (RF), and eXtreme Gradient Boosting (XGBoost). For each target, we randomly divided the structure-activity dataset of its small-molecule inhibitors into training, validation, and test sets in a ratio of 8:1:1 by stratified sampling of the activity data, and then trained, validated, and tested the MSAP framework. For each query molecule, after a target of interest is selected, two prediction modes are available: Best-mode and Merge-mode. Both rely on the performance of the MSAP framework on the test set. Best-mode uses the best-performing model to predict the pIC₅₀ value of the query molecule. Merge-mode uses all models in the framework whose performance meets the established criteria to predict the pIC₅₀ value of the query molecule simultaneously; after abnormal predicted values are excluded, the average of the remaining predictions is taken as the final predicted pIC₅₀ value. The workflow of small-molecule bioactivity prediction is shown in Figure 2.

Figure 2. Workflow of bioactivity prediction
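The two prediction modes can be sketched as follows. The source does not state the selection criteria or the rule for excluding abnormal values, so both are assumptions here: a model qualifies for Merge-mode if its test-set MAE is at most mae_cutoff, and a prediction is treated as abnormal if it deviates from the median of the qualifying predictions by more than max_dev pIC₅₀ units. All function and parameter names are hypothetical.

```python
from statistics import mean, median


def best_mode(predictions, test_mae):
    """Return the name and prediction of the model with the lowest test-set MAE."""
    best_model = min(test_mae, key=test_mae.get)
    return best_model, predictions[best_model]


def merge_mode(predictions, test_mae, mae_cutoff=0.45, max_dev=1.0):
    """Average the predictions of qualifying models after dropping outliers.

    mae_cutoff and max_dev are placeholder criteria (assumptions), not the
    thresholds used by the MSAP framework.
    """
    kept = [predictions[m] for m, mae in test_mae.items() if mae <= mae_cutoff]
    if not kept:
        raise ValueError("no model meets the MAE cutoff")
    centre = median(kept)
    inliers = [p for p in kept if abs(p - centre) <= max_dev]
    # If every prediction is far from the median, fall back to the median itself.
    return mean(inliers) if inliers else centre
```

As a usage example, take the AKT3 test-set MAE values from Table 3 (RF 0.158, SVM 0.185, XGBoost 0.196, DMPNN 0.176, MPNN 0.386) and made-up predictions of 7.2, 7.0, 7.4, 9.5, and 6.0. Best-mode returns the RF prediction. Merge-mode with mae_cutoff=0.2 keeps the first four models, drops the 9.5 prediction as abnormal, and averages the remaining three.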

3) Model for ADMET prediction

For ADME-related endpoints, many reliable, high-performing open-source computational methods have been developed. Among them, Therapeutics Data Commons (TDC), developed by Huang et al., integrates several machine learning datasets and tasks related to drug development, helping to accelerate the development, validation, and transition of machine learning models to clinical implementation. To predict the ADME-related properties of compounds accurately and quickly, we used the molecular characterization approaches in TDC together with suitable machine learning models to model the ADME-related datasets, and obtained models with good predictive performance. The models used on the ADME-related datasets and their validation methods are shown in Table 1. In addition, since our previously developed TrimNet performs comparably to state-of-the-art models for toxicity prediction, we used TrimNet to model the toxicity-related dataset.

Table 1. Models used for the ADMET endpoint datasets

Dataset	Model	Validation	Reference
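Log D7.4	ContextPred	5-fold CV	arXiv preprint arXiv:2102.09548, 2021.
Log S	AttentiveFP	5-fold CV	arXiv preprint arXiv:2102.09548, 2021.
HIA	AttrMasking	5-fold CV	arXiv preprint arXiv:2102.09548, 2021.
Pgp inhibitor	AttrMasking	5-fold CV	arXiv preprint arXiv:2102.09548, 2021.
BBB Penetration	ContextPred	5-fold CV	arXiv preprint arXiv:2102.09548, 2021.
CYP2C9 inhibitor	ContextPred	5-fold CV	arXiv preprint arXiv:2102.09548, 2021.
CYP2D6 inhibitor	ContextPred	5-fold CV	arXiv preprint arXiv:2102.09548, 2021.
CYP3A4 inhibitor	ContextPred	5-fold CV	arXiv preprint arXiv:2102.09548, 2021.
CYP2C19 inhibitor	ContextPred	5-fold CV	arXiv preprint arXiv:2102.09548, 2021.
CYP1A2 inhibitor	ContextPred	5-fold CV	arXiv preprint arXiv:2102.09548, 2021.
CYP2D6 substrate	ContextPred	5-fold CV	arXiv preprint arXiv:2102.09548, 2021.
CYP3A4 substrate	CNN	5-fold CV	arXiv preprint arXiv:2102.09548, 2021.
T1/2	AttrMasking	5-fold CV	arXiv preprint arXiv:2102.09548, 2021.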
Tox21	TrimNet	Train/validation early stopping	Briefings in Bioinformatics, 2021, 22(4): bbaa266.

3. Model performance

1) Model performance of target prediction

Table 2. Performance of DMGP framework on the external validation dataset

Hit target number	Top-1 accuracy (%)	Top-5 accuracy (%)	Top-10 accuracy (%)	Top-15 accuracy (%)
1	55.8	84.1	90.7	93
2	-	67.9	77.4	81.7
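The source does not define how the values in Table 2 were computed. One plausible reading, used here purely as an assumption, is that a molecule counts as a success for "hit target number" n when at least n of its known targets appear among the K highest-ranked predictions. This also explains why no top-1 value is reported for n = 2. The function below is a hypothetical sketch under that reading.

```python
def top_k_accuracy(ranked_targets, true_targets, k, min_hits=1):
    """Percentage of molecules with at least min_hits known targets in the top k.

    ranked_targets: one ranked list of target names per molecule.
    true_targets:   one collection of known target names per molecule.
    The success criterion is an assumed interpretation of Table 2.
    """
    successes = 0
    for ranking, truth in zip(ranked_targets, true_targets):
        found = len(set(ranking[:k]) & set(truth))
        if found >= min_hits:
            successes += 1
    return 100.0 * successes / len(ranked_targets)
```

Whether the denominator for n = 2 includes all molecules or only those with at least two known targets is also not stated; the sketch uses all molecules.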

2) Model performance of bioactivity prediction

Table 3. MAE of MSAP framework on structure-activity datasets of small-molecule inhibitors of 56 disease-related targets

Target	RF	SVM	XGBoost	DMPNN	MPNN
11β-HSD1	0.398	0.435	0.449	0.43	0.452
AKT1	0.454	0.453	0.457	0.466	0.565
AKT2	0.325	0.353	0.378	0.355	0.457
AKT3	0.158	0.185	0.196	0.176	0.386
ALK	0.282	0.309	0.337	0.326	0.361
ACE	0.699	-	0.74	0.831	-
AURKA	0.339	0.389	0.395	0.406	0.424
AURKB	0.397	0.426	0.459	-	-
BCL2	0.296	0.304	0.327	0.328	0.465
BRAF	0.32	0.342	0.374	0.366	0.396
BRD4	0.266	0.283	0.298	0.288	0.347
BTK	0.393	0.429	0.443	0.439	-
CCR2	0.314	0.341	0.368	0.387	-
c-Src	0.343	0.387	0.395	0.362	0.439
CDK1	0.295	0.336	0.35	0.332	-
CDK2	0.508	0.568	0.553	-	-
CHK1	0.487	-	0.547	0.539	-
CB1R	0.3	0.371	0.371	0.385	0.445
Cathepsin B	0.346	0.415	0.393	0.404	-
Cathepsin K	0.355	0.396	0.407	0.384	0.522
Cathepsin S	0.324	0.371	0.358	0.368	0.452
DPP-4	0.447	0.5	0.488	0.475	0.532
EGFR	0.405	0.481	0.479	0.457	0.492
FGFR1	0.323	0.335	0.345	0.33	0.403
FGFR2	0.2	0.224	0.25	0.24	0.347
FGFR3	0.192	0.234	0.247	0.239	0.36
FLT3	0.322	0.361	0.38	0.357	0.421
HDAC1	0.388	0.447	0.439	0.405	0.473
HDAC2	0.388	0.464	0.45	0.409	0.503
HDAC6	0.369	0.422	0.404	0.362	0.414
IGF1R	0.277	0.314	0.315	0.328	0.374
JAK1	0.399	0.417	0.423	0.411	0.464
JAK2	0.364	0.396	0.411	0.392	0.428
JAK3	0.338	0.396	0.394	0.399	0.437
MEK1	0.243	0.29	0.286	0.302	0.377
MMP-2	0.484	0.601	0.505	0.505	0.63
MMP-3	0.404	0.433	0.443	0.443	0.542
MMP-9	0.418	0.485	0.447	0.429	0.585
MMP-13	0.489	0.567	0.549	0.581	0.61
MMP-14	0.454	-	0.447	0.455	-
MR	0.365	-	0.387	0.395	-
NAMPT	0.327	0.363	0.334	0.35	0.417
PDGFR-α	0.229	0.312	0.291	0.326	0.366
PDGFR-β	0.297	0.353	0.368	0.344	0.389
PI3K-α	0.338	0.373	0.383	0.379	0.401
PI3K-β	0.368	0.398	0.407	0.416	0.457
PI3K-δ	0.352	0.378	0.39	0.378	0.423
PI3K-γ	0.302	0.333	0.335	0.338	0.368
PKC-θ	0.346	0.378	0.378	-	-
Renin	0.533	0.579	0.592	0.583	-
SYK	0.282	0.306	0.33	0.31	0.336
TNF-α	0.159	0.167	0.19	0.153	-
VEGFR1	0.214	0.252	0.257	0.248	0.304
VEGFR2	0.331	-	0.403	0.369	0.392
VEGFR3	0.236	0.267	0.295	0.323	0.359
ZAP70	0.197	0.252	0.255	0.243	0.407

3) Model performance of ADMET prediction

Table 4. Performance of ADMET prediction models

Property	Task type	AUROC	ACC	SE	SP
HIA	classification	0.976	0.945	0.96	0.896
Pgp inhibitor	classification	0.929	0.846	0.893	0.797
BBB Penetration	classification	0.897	0.871	0.927	0.631
CYP1A2 inhibitor	classification	0.948	0.878	0.861	0.893
CYP2C9 inhibitor	classification	0.919	0.852	0.792	0.88
CYP2C19 inhibitor	classification	0.932	0.86	0.884	0.84
CYP2D6 inhibitor	classification	0.907	0.895	0.635	0.948
CYP2D6 substrate	classification	0.848	0.787	0.684	0.835
CYP3A4 inhibitor	classification	0.921	0.834	0.85	0.822
CYP3A4 substrate	classification	0.641	0.567	0.509	0.65
T1/2	classification	0.763	0.695	0.749	0.621
Tox21	classification	0.856	0.828	0.738	0.836
Property	Task type	MAE	MSE	RMSE	R2
Log D7.4	regression	0.535	0.471	0.686	0.665
Log S	regression	0.789	1.205	1.097	0.771
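For reference, the threshold-based metrics in Table 4 can be computed as in the sketch below. It assumes that ACC, SE, and SP denote accuracy, sensitivity (true positive rate), and specificity (true negative rate), and that R2 is the coefficient of determination; the source does not spell these out. AUROC is omitted because it requires predicted scores rather than class labels. The function names are illustrative.

```python
from math import sqrt


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity from binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "ACC": (tp + tn) / len(pairs),
        "SE": tp / (tp + fn),  # fraction of true positives recovered
        "SP": tn / (tn + fp),  # fraction of true negatives recovered
    }


def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE and coefficient of determination for a regression task."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {"MAE": mae, "MSE": mse, "RMSE": sqrt(mse), "R2": 1 - mse * n / ss_tot}
```

The regression rows of Table 4 are consistent with RMSE being the square root of MSE: for Log D7.4, the square root of 0.471 is approximately 0.686.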