### Datasets

In this work, we evaluate the performance of the WELM-SURF model on four datasets: *enzymes*, *ion channels*, *GPCRs* and *nuclear receptors*. These datasets, defined as the gold standard datasets by Yamanishi [24], can be downloaded from the KEGG BRITE [7], BRENDA [23], SuperTarget [8] and DrugBank [9] databases. The numbers of known drugs for *enzymes*, *ion channels*, *GPCRs* and *nuclear receptors* are 445, 210, 233 and 54, and the numbers of known target proteins are 664, 204, 95 and 26, respectively. After careful screening, 5127 drug-target pairs are known to interact with each other: 2926, 1476, 635 and 90 known interactions involving *enzymes*, *ion channels*, *GPCRs* and *nuclear receptors*, respectively. These known interactions constitute the positive samples for each of the four datasets.

Usually, a bipartite graph is used to represent a drug-target interaction network, where each node represents a drug molecule or a target protein, and each edge describes a true drug-target interaction validated by biological experiments or other methods. As can be seen from the bipartite graph, the number of real drug-target interaction edges is small [25]. Taking the enzyme dataset as an example, there are 295,480 possible connections (445 × 664) in the corresponding bipartite graph, of which only 2926 edges are known drug-target interactions. Therefore, the number of possible negative samples (295,480 − 2926 = 292,554) is significantly larger than the number of positive samples (2926), which leads to a class-imbalance problem. To address this problem, we randomly selected the same number of negative samples as positive samples. Therefore, the numbers of negative samples for the *enzyme*, *ion channel*, *GPCRs* and *nuclear receptor* datasets are 2926, 1476, 635 and 90, respectively. In fact, there may be real drug-target pairs among these negative sample sets. However, given the large scale of the bipartite graph, the number of true interaction pairs mislabeled as negative pairs is expected to be very small.
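The negative-sampling step described above can be sketched as follows; the drug and target identifiers below are hypothetical placeholders, not the actual dataset accessions.

```python
import random

def sample_negatives(drugs, targets, positives, seed=0):
    """Randomly draw as many non-interacting drug-target pairs
    as there are known (positive) interactions."""
    rng = random.Random(seed)
    positive_set = set(positives)
    # All drug-target pairs in the bipartite graph minus the known interactions.
    candidates = [(d, t) for d in drugs for t in targets
                  if (d, t) not in positive_set]
    return rng.sample(candidates, len(positives))

# Toy example with hypothetical identifiers.
drugs = ["D1", "D2", "D3"]
targets = ["T1", "T2"]
positives = [("D1", "T1"), ("D2", "T2")]
negatives = sample_negatives(drugs, targets, positives)
```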

### Feature extraction

#### Drug molecules description

Recently, a number of biological experiments have indicated that drugs with similar chemical structures have similar therapeutic functions. To represent drugs as feature vectors, several kinds of descriptors have been designed, such as molecular substructure fingerprints and constitutional, topological, geometrical and quantum chemical properties. In this paper, molecular substructure fingerprints were used to represent each drug as a feature vector [15]. Each molecular structure is translated into a structural fingerprint by the molecular fingerprint method, so that a drug can be encoded as an 881-dimensional binary vector in which each bit is 1 if the corresponding substructure is present and 0 otherwise.
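A minimal sketch of this encoding, assuming the substructure keys present in a molecule are already known (the key indices below are made up for illustration):

```python
def substructure_fingerprint(present_keys, n_bits=881):
    """Encode a drug as an 881-dimensional binary vector:
    bit i is 1 if substructure key i is present, 0 otherwise."""
    fp = [0] * n_bits
    for key in present_keys:
        fp[key] = 1
    return fp

# Hypothetical example: a drug containing substructure keys 0, 12 and 880.
fp = substructure_fingerprint({0, 12, 880})
```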

#### Position specific scoring matrix (PSSM)

Because proteins are functionally conserved, prediction performance can be improved by using the evolutionary information of protein sequences. The position-specific scoring matrix (PSSM) contains not only the position information of the protein sequence, but also evolutionary information that reflects the conserved function of the protein. In the experiment, each protein sequence was converted into an *L × 20* PSSM using the Position-Specific Iterated BLAST (PSI-BLAST) tool [26], where *L* is the length of the protein sequence. We therefore employed the PSSM in this paper to extract sequence evolutionary information. A diagram of the PSSM is displayed in Fig. 1.

Here, the 20 columns correspond to the 20 different amino acids, and *P*_{ij} represents the score for the *i*_{th} amino acid in the sequence mutating to the *j*_{th} type of amino acid during biological evolution. *P*_{ij} can be positive, zero or negative. A positive *P*_{ij} indicates that the *i*_{th} amino acid can easily mutate to the *j*_{th} amino acid; in practice, a larger *P*_{ij} means a higher mutation probability. Conversely, a negative *P*_{ij} means the mutation probability is small, and a smaller *P*_{ij} indicates a more conserved position. To capture key features from the evolutionary information of protein sequences, we converted each protein sequence into a PSSM using the PSI-BLAST tool. In the experiment, we set PSI-BLAST's e-value to 0.001 and ran three iterations to obtain broadly and highly homologous sequences.
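A sketch of how a PSI-BLAST call with these parameters could be assembled; the input/output file names and the search database are placeholder assumptions, and only the e-value and iteration count come from the text:

```python
# Build the PSI-BLAST command line for generating an ASCII PSSM.
# File names and the database name are hypothetical placeholders.
cmd = [
    "psiblast",
    "-query", "protein.fasta",         # one protein sequence (assumed file)
    "-db", "swissprot",                # assumed search database
    "-evalue", "0.001",                # e-value threshold used in the paper
    "-num_iterations", "3",            # three PSI-BLAST iterations
    "-out_ascii_pssm", "protein.pssm", # write the L x 20 PSSM
]
# The command would then be run with, e.g., subprocess.run(cmd, check=True).
```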

#### Speeded-Up Robust Features (SURF)

The Speeded-Up Robust Features (SURF) [27] feature extraction algorithm is an improvement of the Scale-Invariant Feature Transform (SIFT) algorithm [28, 29] and runs faster than SIFT. SIFT uses differences of Gaussians to approximate the Laplacian of Gaussian (LoG) when building the scale space, whereas SURF approximates the LoG with box filters. The major advantage of SURF is that the convolution with a box filter is easy to compute using the integral image and can be done in parallel at different scales. The SURF algorithm relies on the determinant of the Hessian matrix for both location and scale detection, and includes the following two steps: feature point detection and feature point description.

### Feature point detection

SURF uses continuous Gaussian filters of different scales to process the image and detects scale-invariant feature points through Gaussian differences; it approximates the Gaussian blur by replacing the Gaussian filters of SIFT with square (box) filters. The SURF feature extraction approach converts an image into a set of vectors *I*_{j} ∈ *R*^{d}, *j* = 1, …, *N*, where *N* is the number of images, *I*_{j} = {*f*_{1}, *f*_{2}, …*f*_{m}} and \( {f}_m=\left\{{f}_m^1,{f}_m^2,\dots, {f}_m^d\right\} \), where *m* is the number of local features in each image and *d* = 64 is the SURF feature dimension. Each *f*_{m} represents a SURF local feature; note that *m* is not the same for all images. We want to organize the *I*_{j} into *K* clusters *c* = {*c*_{1}, *c*_{2}, …*c*_{k}}. The similarity criterion is then defined by the following equation:

$$ S\left(x,y\right)=\sum \limits_{i=1}^k\sum \limits_{j=1}^n{a}_i^j\, sim\left({I}_j,{m}_i\right) $$

Where \( x=\left\{{a}_i^j\right\} \) is the assignment matrix, \( {a}_i^j=\left\{\begin{array}{c}1,\kern0.5em \mathrm{if}\ {I}_j\in \mathrm{cluster}\ {c}_i\\ {}0,\kern0.5em \mathrm{otherwise}\end{array}\right. \) with \( {\sum}_{i=1}^k{a}_i^j=1\ \forall j \); *y* = {*m*_{1}, …, *m*_{k}} are the cluster centers, and *sim*(*I*_{j}, *m*_{i}) measures the similarity between the set of local features *I*_{j} and the center *m*_{i}.
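This clustering criterion can be realized with a k-means-style assignment/update loop. A minimal numpy sketch, using Euclidean distance as the similarity measure (an assumption not fixed by the text):

```python
import numpy as np

def kmeans(features, k, n_iter=20, seed=0):
    """Cluster d-dimensional local features into k clusters.
    Returns the cluster centers m_1..m_k and the assignment labels."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each feature joins its nearest center,
        # i.e. exactly one nonzero a_i^j per feature j.
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each center as the mean of its members.
        for i in range(k):
            if np.any(labels == i):
                centers[i] = features[labels == i].mean(axis=0)
    return centers, labels

# Toy example: two well-separated groups of 64-dimensional SURF-like features.
feats = np.vstack([np.zeros((5, 64)), np.ones((5, 64))])
centers, labels = kmeans(feats, k=2)
```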

The square filter can greatly improve the computation speed by using the integral image, which only requires the values at the four corners of the square filter. The determinant of the Hessian matrix represents the change around a pixel. SURF uses Hessian-based blob detection to identify feature points, whose determinant values should be local maxima or minima. In addition, to achieve scale invariance, SURF also computes the determinant at scale σ when detecting feature points. For example, given a point *p* = (*x*, *y*) in the image, the Hessian matrix at scale σ can be represented as follows:

$$ H\left(p,\sigma \right)=\left(\begin{array}{cc}{L}_{xx}\left(p,\sigma \right)& {L}_{xy}\left(p,\sigma \right)\\ {}{L}_{xy}\left(p,\sigma \right)& {L}_{yy}\left(p,\sigma \right)\end{array}\right) $$

Where *L*_{xx}(*p*, σ), *L*_{xy}(*p*, σ) and *L*_{yy}(*p*, σ) are the second-order derivatives of the gray-scale image. The scale space of SURF is not built by repeated Gaussian blurring and downsampling; instead, it is determined by the size of the square filters. The lowest (initial) scale uses a 9 × 9 square filter, which approximates a Gaussian filter with σ = 1.2. The filters at higher scales grow larger and larger, e.g. 15 × 15, 21 × 21, 27 × 27, and so on.

The transformation formula of its scale is as follows:

$$ {\sigma}_{approx}= CurrentFilterSize\times \left(\frac{BaseFilterScale}{BaseFilterSize}\right) $$
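With the 9 × 9 base filter at σ = 1.2, this formula gives the approximate scale of any filter size:

```python
def approx_scale(filter_size, base_size=9, base_scale=1.2):
    """Approximate Gaussian scale of a SURF box filter:
    sigma_approx = filter_size * (base_scale / base_size)."""
    return filter_size * (base_scale / base_size)

# The 9x9 base filter corresponds to sigma = 1.2,
# and the 15x15 filter to sigma = 2.0.
```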

### Feature point description

The SURF descriptor uses the Haar wavelet transform. To ensure the rotation invariance of each feature point, the point is assigned a dominant orientation: SURF computes the Haar wavelet responses in the *X* and *Y* directions within a radius of 6σ around the feature point, sums the corresponding *X* and *Y* wavelet components in each angular interval to obtain a vector, and takes the longest such vector (the one with the largest *X* and *Y* components) as the direction of the feature point. After the direction is selected, the descriptor is built from the surrounding pixels: a window of 20 × 20 pixels around the feature point is divided into 16 sub-regions of 5 × 5 pixels, and within each sub-region the Haar wavelet responses in the *X* and *Y* directions are summed to give ∑*dx*, ∑*dy*, ∑|*dx*| and ∑|*dy*|. With four values for each of the 16 sub-regions, a feature vector of dimension 64 is generated.

In the experiment, we used the SURF method to create feature vectors of dimension 64. Figure 2 shows the flow diagram of our method.

### Weighted extreme learning machine (WELM)

Zong et al. [30] proposed the Weighted Extreme Learning Machine (WELM) based on the Extreme Learning Machine (ELM). To efficiently predict DTIs, we build the WELM model on top of ELM. The network structure of ELM is shown in Fig. 3.

Assume there are *n* training samples \( {\left\{{x}_i,{t}_i\right\}}_{i=1}^n \), where *x*_{i} = {*x*_{i1}, *x*_{i2}, *x*_{i3}, ……*x*_{in}}^{T} ∈ *R*^{n} and *t*_{i} = {*t*_{i1}, *t*_{i2}, *t*_{i3}, ……*t*_{im}}^{T} ∈ *R*^{m}, where *n* is the number of samples and *m* is the number of classes. The output of a feedforward neural network with *L* hidden layer nodes can be expressed as follows:

$$ {\sum}_{h=1}^L{\beta}_hG\left({a}_h,{b}_h,{x}_i\right)={o}_i,\kern0.5em i=1,2,3,\dots \dots, N $$

(5)

Where *β*_{h} is the output weight of the *h*_{th} hidden layer neuron, *G* is the activation function of the hidden layer neurons, *a*_{h} and *b*_{h} are the input weight and bias of the *h*_{th} hidden layer neuron, *x*_{i} is the *i*_{th} input sample, *o*_{i} is the actual output for the *i*_{th} training sample, and *t*_{i} is the expected output of the *i*_{th} training sample. According to the literature [15], for *N* training samples \( {\left\{{x}_i,{t}_i\right\}}_{i=1}^n \), *x*_{i} ∈ *R*^{n}, there exist (*a*_{h}, *b*_{h}) and *β*_{h} such that \( {\sum}_{i=1}^N\left|\left|{o}_i-{t}_i\right|\right|=0 \), i.e. the single-hidden-layer feedforward network (SLFN) can approximate the training set \( {\left\{{x}_i,{t}_i\right\}}_{i=1}^n \), *x*_{i} ∈ *R*^{n} with zero error. Equation (5) can then be written compactly as:

$$ H\beta =T $$

(6)

Where *H* is the hidden layer output matrix, *β* is the output weight matrix of the hidden layer, and *T* is the expected output matrix of the training samples. The output weight of the hidden layer can be expressed as follows:

$$ \hat{\beta}=\left\{\begin{array}{c}{H}^T{\left(\frac{I}{C}+H{H}^T\right)}^{-1}T,N<L\\ {}{\left(\frac{I}{C}+{H}^TH\right)}^{-1}{H}^TT,N\ge L\end{array}\right\} $$

(7)

The output function of ELM can be defined as follow:

$$ f(x)=h(x)\hat{\beta}=\left\{\begin{array}{c}h(x){H}^T{\left(\frac{I}{C}+H{H}^T\right)}^{-1}T,N<L\\ {}h(x){\left(\frac{I}{C}+{H}^TH\right)}^{-1}{H}^TT,N\ge L\end{array}\right\} $$

(8)
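A minimal numpy sketch of ELM training and prediction following Eqs. (7) and (8); the sigmoid activation and the toy data shapes are our assumptions, not prescribed by the paper:

```python
import numpy as np

def elm_train(X, T, L=50, C=1.0, seed=0):
    """Train an ELM: random hidden layer (a_h, b_h), then solve
    beta = (I/C + H^T H)^-1 H^T T in closed form (N >= L case of Eq. 7)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((X.shape[1], L))  # input weights a_h
    b = rng.standard_normal(L)                # biases b_h
    H = 1.0 / (1.0 + np.exp(-(X @ a + b)))    # sigmoid hidden layer outputs
    beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)
    return a, b, beta

def elm_predict(X, a, b, beta):
    """Output function f(x) = h(x) * beta of Eq. (8)."""
    H = 1.0 / (1.0 + np.exp(-(X @ a + b)))
    return H @ beta

# Toy binary problem: separate points around -1 from points around +1.
X = np.vstack([np.full((20, 5), -1.0), np.full((20, 5), 1.0)])
X += np.random.default_rng(1).normal(0, 0.1, X.shape)
T = np.array([[1, 0]] * 20 + [[0, 1]] * 20, dtype=float)  # one-hot targets
a, b, beta = elm_train(X, T, L=20, C=10.0)
pred = elm_predict(X, a, b, beta).argmax(axis=1)
```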

WELM has two weighting strategies [31]. One is automatic weighting, defined as follows:

$$ {w}_1=\frac{1}{Count\left({t}_i\right)} $$

(9)

Where *Count*(*t*_{i}) is the number of samples of class *t*_{i} in the training set. The other strategy sacrifices some classification accuracy on the majority class to improve the classification accuracy on the minority class; it reweights the majority and minority classes toward a 0.618 : 1 ratio (the golden ratio) and is defined as follows:

$$ {w}_2=\left\{\begin{array}{c}\frac{0.618}{Count\left({t}_i\right)},{t}_i\in majority\ class\\ {}\frac{1}{Count\left({t}_i\right)},{t}_i\in minority\ class\end{array}\right\} $$

(10)
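The two weighting schemes in Eqs. (9) and (10) can be computed as follows; a small sketch with illustrative class labels:

```python
from collections import Counter

def welm_weights(labels, golden=False):
    """Per-sample weights for WELM: w = 1/Count(t_i) (Eq. 9), or with the
    majority class additionally scaled by 0.618 (Eq. 10). These values
    form the diagonal of the weighting matrix W."""
    counts = Counter(labels)
    majority = max(counts, key=counts.get)
    weights = []
    for t in labels:
        w = 1.0 / counts[t]
        if golden and t == majority:
            w *= 0.618
        weights.append(w)
    return weights

# Imbalanced toy labels: three majority-class samples, one minority-class sample.
labels = [0, 0, 0, 1]
w1 = welm_weights(labels)               # automatic weighting (Eq. 9)
w2 = welm_weights(labels, golden=True)  # golden-ratio weighting (Eq. 10)
```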

The output weight of the WELM hidden layer can be represented as follows:

$$ \hat{\beta}=\left\{\begin{array}{c}{H}^T{\left(\frac{I}{C}+ WH{H}^T\right)}^{-1} WT,N<L\\ {}{\left(\frac{I}{C}+{H}^T WH\right)}^{-1}{H}^T WT,N\ge L\end{array}\right\} $$

(11)

Where the weighting matrix *W* is an *N* × *N* diagonal matrix whose *N* diagonal elements correspond to the *N* samples. Different weights are assigned to different sample classes, and samples of the same class receive the same weight.

WELM has the advantages of short training time and good generalization ability, and can efficiently perform classification by optimizing the weighted loss function. We therefore used the WELM classifier with the automatic weighting strategy to predict DTIs. The prediction flow diagram of the WELM-SURF model is shown in Fig. 4.

### Performance evaluation

The following measures were used to evaluate the prediction performance of WELM-SURF in this work.

$$ \mathrm{Acc}=\frac{TP+ TN}{TP+ FP+ TN+ FN} $$

(12)

$$ \mathrm{TPR}=\frac{TP}{TP+ FN} $$

(13)

$$ \mathrm{PPV}=\frac{TP}{FP+ TP} $$

(14)

$$ \mathrm{MCC}=\frac{\left( TP\times TN\right)-\left( FP\times FN\right)}{\sqrt{\left( TP+ FN\right)\times \left( TN+ FP\right)\times \left( TP+ FP\right)\times \left( TN+ FN\right)}} $$

(15)

Where Acc is Accuracy, TPR is Sensitivity, PPV is Precision and MCC is the Matthews correlation coefficient. TP and TN are the numbers of truly interacting and truly non-interacting drug-target pairs that are correctly predicted; FP and FN are the numbers of truly non-interacting and truly interacting pairs that are mistakenly predicted. In addition, the Receiver Operating Characteristic (ROC) curve was employed to further assess the prediction performance of WELM-SURF.
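The four measures above can be computed directly from the confusion-matrix counts; a small sketch with made-up counts for illustration:

```python
import math

def evaluation_metrics(tp, tn, fp, fn):
    """Accuracy, Sensitivity (TPR), Precision (PPV) and MCC
    computed from the confusion-matrix counts (Eqs. 12-15)."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    tpr = tp / (tp + fn)
    ppv = tp / (fp + tp)
    mcc = ((tp * tn) - (fp * fn)) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return acc, tpr, ppv, mcc

# Hypothetical counts, not results from the paper.
acc, tpr, ppv, mcc = evaluation_metrics(tp=50, tn=40, fp=10, fn=5)
```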