1. Sequence feature generation

Each protein was described by three blocks of attributes, calculated from commonly used structural and physicochemical properties of its amino acid sequence.

Block 1: Amino acid property distance.

Two physicochemical property distances between the substituted amino acid and the wild-type amino acid were used: the Schneider-Wrede physicochemical distance matrix and the Grantham chemical distance matrix. This block contributes 2 features.

Block 2: Single amino acid and dipeptide composition.

Single amino acid composition is defined as:

$$f(i) = \frac{N_i}{N}, \quad i = 1, 2, \ldots, 20 \tag{1}$$

where $N_i$ is the number of amino acids of type i, and N is the length of the sequence. A total of 20 features are calculated here.
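As an illustration, the single amino acid composition can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: it assumes the composition is the plain fraction N_i / N (not a percentage), and the function name, the residue ordering and the example sequence are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (the ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return {amino acid: N_i / N} for the 20 standard residues."""
    n = len(sequence)
    if n == 0:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    # Residues absent from the sequence get a composition of 0.
    return {aa: counts.get(aa, 0) / n for aa in AMINO_ACIDS}


# Hypothetical 5-residue sequence, used only to exercise the function.
composition = amino_acid_composition("MGSSQ")
```

For the toy sequence above, serine occurs twice in five residues, so its composition is 2/5.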

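Dipeptide composition is defined as:

$$f(i, j) = \frac{N_{ij}}{N - 1}, \quad i, j = 1, 2, \ldots, 20 \tag{2}$$

where $N_{ij}$ is the number of dipeptides composed of amino acid types i and j, and N is the length of the sequence. A total of 400 features are computed here.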
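The dipeptide composition can be sketched in the same way. The snippet below is a minimal sketch under two assumptions: dipeptides are counted as overlapping adjacent pairs, and the counts are normalized by the N - 1 pairs in a sequence of length N (the PROFEAT-style convention). The function name and the example sequence are hypothetical.

```python
from itertools import product

# The 20 standard amino acids in one-letter code (the ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dipeptide_composition(sequence):
    """Return {dipeptide: N_ij / (N - 1)} for all 400 ordered residue pairs."""
    n = len(sequence)
    if n < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    # Slide a window of width 2 along the sequence (overlapping pairs).
    for k in range(n - 1):
        pair = sequence[k:k + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    return {pair: c / (n - 1) for pair, c in counts.items()}


# Hypothetical sequence with the pairs MG, GS, SS and SQ.
dipeptides = dipeptide_composition("MGSSQ")
```

Note that the pair is ordered, so "MG" and "GM" are distinct features, which is what gives 20 x 20 = 400 values.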

Block 3: Geary autocorrelation.

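Geary autocorrelation features are defined as:

$$C(d) = \frac{\dfrac{1}{2(N-d)} \sum_{i=1}^{N-d} \left(P_i - P_{i+d}\right)^2}{\dfrac{1}{N-1} \sum_{i=1}^{N} \left(P_i - \bar{P}\right)^2} \tag{3}$$

where d = 1, 2, 3, …, 30 is the lag of the autocorrelation, N is the length of the sequence, and $P_i$ and $P_{i+d}$ are the property values of the amino acids at positions i and i+d, respectively. $\bar{P}$ is the average value of a particular property of the 20 amino acids.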

For this block, 8 amino acid properties were adopted: (1) hydrophobicity scales, (2) average flexibility indices, (3) polarizability parameter, (4) free energy of solution in water, (5) residue accessible surface area in tripeptide, (6) residue volume, (7) steric parameter, and (8) relative mutability.

For each amino acid property there are 30 autocorrelation features (one per lag), so this block contributes a total of 8 × 30 = 240 features.
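The Geary autocorrelation for one property can be sketched as follows. This is a minimal sketch assuming the standard Geary form: the mean squared difference between residues d positions apart, divided by 2, over the variance-like spread around the mean. The mean is passed in explicitly because the text defines it over the 20 amino acids, whereas some implementations use the mean along the sequence. The function names and the toy values are hypothetical.

```python
def geary_autocorrelation(values, lag, mean):
    """Geary autocorrelation of per-residue property values at one lag.

    values: property value of each residue along the sequence
    lag:    distance d between the paired positions (1 <= d < len(values))
    mean:   reference mean of the property (an assumption of this sketch:
            supplied by the caller rather than computed here)
    """
    n = len(values)
    if not 1 <= lag < n:
        raise ValueError("lag must satisfy 1 <= lag < len(values)")
    numerator = sum(
        (values[i] - values[i + lag]) ** 2 for i in range(n - lag)
    ) / (2 * (n - lag))
    denominator = sum((v - mean) ** 2 for v in values) / (n - 1)
    if denominator == 0:
        raise ValueError("property values have zero spread around the mean")
    return numerator / denominator


def geary_features(values, mean, max_lag=30):
    """Return the autocorrelation for lags 1..max_lag (30 values by default)."""
    return [geary_autocorrelation(values, d, mean) for d in range(1, max_lag + 1)]


# Toy property values (hypothetical, not a real amino acid scale).
toy_values = [1.0, 2.0, 3.0, 4.0]
toy_result = geary_autocorrelation(toy_values, lag=1, mean=2.5)
```

For the toy values the numerator is 3/6 = 0.5 and the denominator is 5/3, giving 0.3. Lag 30 requires a sequence longer than 30 residues; how shorter sequences were handled is not stated in the text.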

Blocks 2 and 3 were inspired by the PROFEAT method. Every amino acid property or property distance in Block 1 and Block 3 was centered and standardized before calculation, i.e. $P' = (P - \bar{P}) / \sigma$, where $\bar{P}$ is the average value of a particular property of the 20 amino acids, and $\sigma = \sqrt{\frac{1}{20} \sum_{i=1}^{20} (P_i - \bar{P})^2}$. In total, 662 sequence features were generated (2 + 20 + 400 + 240).
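The centering and standardization step can be sketched as below. This is a minimal sketch assuming a population standard deviation (division by the number of entries, i.e. 20 for a full amino acid scale). The function name and the three-entry toy table are hypothetical; a real property scale would have one value per amino acid.

```python
import math


def standardize(prop):
    """Center and scale a {amino acid: value} table to zero mean, unit spread."""
    values = list(prop.values())
    mean = sum(values) / len(values)
    # Population standard deviation: divide by the number of entries.
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if sigma == 0:
        raise ValueError("all property values are identical")
    return {aa: (v - mean) / sigma for aa, v in prop.items()}


# Toy table with hypothetical values, used only to exercise the function.
toy_scale = {"A": 1.0, "G": 2.0, "S": 3.0}
standardized = standardize(toy_scale)
```

After this transformation the table has mean 0 and population variance 1, so properties measured in different units become comparable.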

 

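2. Selected features used in SeqSubPred

| Block | Block description | Feature ID | Rank |
|---|---|---|---|
| 1 | Distance of physicochemical properties | MutDis2 | 1 |
| | | MutDis1 | 2 |
| 2 | Single amino acid and dipeptide composition | MG | 4 |
| | | Q | 5 |
| | | S | 11 |
| | | YM | 13 |
| | | DE | 15 |
| | | RV | 18 |
| | | AP | 19 |
| | | SS | 23 |
| | | MF | 27 |
| | | SD | 33 |
| | | IG | 42 |
| 3 | Geary autocorrelation | GAuto_6_26 | 3 |
| | | GAuto_6_13 | 6 |
| | | GAuto_6_20 | 7 |
| | | GAuto_6_17 | 8 |
| | | GAuto_6_29 | 9 |
| | | GAuto_6_25 | 10 |
| | | GAuto_6_28 | 12 |
| | | GAuto_6_11 | 14 |
| | | GAuto_6_23 | 16 |
| | | GAuto_6_2 | 17 |
| | | GAuto_3_17 | 20 |
| | | GAuto_3_20 | 21 |
| | | GAuto_6_19 | 22 |
| | | GAuto_6_10 | 24 |
| | | GAuto_6_1 | 25 |
| | | GAuto_3_11 | 26 |
| | | GAuto_6_8 | 28 |
| | | GAuto_3_25 | 29 |
| | | GAuto_3_26 | 30 |
| | | GAuto_6_16 | 31 |
| | | GAuto_1_16 | 32 |
| | | GAuto_2_4 | 34 |
| | | GAuto_6_14 | 35 |
| | | GAuto_6_22 | 36 |
| | | GAuto_1_2 | 37 |
| | | GAuto_4_26 | 38 |
| | | GAuto_2_27 | 39 |
| | | GAuto_3_2 | 40 |
| | | GAuto_6_12 | 41 |
| | | GAuto_6_7 | 43 |
| | | GAuto_1_19 | 44 |

Note: for GAuto_i_d, i is the index of the amino acid property (i = 1, 2, 3, ..., 8) and d is the lag.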
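To read the Geary feature IDs, the two numeric fields can be decoded as sketched below. This is a minimal sketch that assumes the property index i follows the order in which the 8 properties are listed under Block 3; that mapping is an assumption, and the function name is hypothetical.

```python
# Property names in the order listed under Block 3 (the index mapping is assumed).
GEARY_PROPERTIES = [
    "Hydrophobicity scales",
    "Average flexibility indices",
    "Polarizability parameter",
    "Free energy of solution in water",
    "Residue accessible surface area in tripeptide",
    "Residue volume",
    "Steric parameter",
    "Relative mutability",
]


def decode_geary_id(feature_id):
    """Split 'GAuto_i_d' into (property name, lag d)."""
    prefix, prop_index, lag = feature_id.split("_")
    if prefix != "GAuto":
        raise ValueError("not a Geary autocorrelation feature ID")
    i, d = int(prop_index), int(lag)
    if not (1 <= i <= len(GEARY_PROPERTIES) and 1 <= d <= 30):
        raise ValueError("property index or lag out of range")
    return GEARY_PROPERTIES[i - 1], d


decoded = decode_geary_id("GAuto_6_26")
```

Under this assumed ordering, GAuto_6_26 would be the autocorrelation of property 6 (residue volume) at lag 26.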

 

 

3. Classification method

The Random Forest classifier was implemented using the R package randomForest (v4.5).