1. Sequence feature generation
Each protein was described by 3 blocks of attributes calculated from commonly used structural and physicochemical properties of its amino acid sequence.
Block 1: Amino acid property distance.
Two physicochemical property distances between the wild-type amino acid and the substituted amino acid were used: the Schneider-Wrede physicochemical distance matrix and the Grantham chemical distance matrix.
Block 2: Single amino acid and dipeptide composition.
Single amino acid composition is defined as:
$$f(i) = \frac{N_i}{N} \qquad (1)$$
where N_i is the number of amino acids of type i and N is the length of the sequence. A total of 20 features are calculated here.
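As an illustration of the composition defined above, the following is a minimal Python sketch. The function name `amino_acid_composition`, the constant `AMINO_ACIDS`, and the example sequence are hypothetical and not taken from the original work. Non-standard residue letters, if present, are assumed to count toward N without producing a feature.

```python
from collections import Counter

# The 20 standard amino acids as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return {residue: N_i / N} for the 20 standard amino acids."""
    sequence = sequence.upper()
    n = len(sequence)
    if n == 0:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    # Counter returns 0 for residues that do not occur in the sequence.
    return {aa: counts[aa] / n for aa in AMINO_ACIDS}


# Made-up 7-residue example sequence.
composition = amino_acid_composition("MGSSDEQ")
```

The result always has 20 entries, one per residue type, so it can be used directly as a fixed-length feature vector.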
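Dipeptide composition is defined as:
$$f(i, j) = \frac{N_{ij}}{N - 1} \qquad (2)$$
where N_ij is the number of dipeptides composed of amino acid types i and j, and N - 1 is the number of overlapping dipeptides in a sequence of length N. A total of 400 features are computed here.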
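The dipeptide composition can be sketched in the same way. This is a minimal example, not the original implementation: the function name `dipeptide_composition` is hypothetical, dipeptides are assumed to be ordered and overlapping, and normalising by the N - 1 dipeptides in the sequence is an assumption that follows the usual PROFEAT convention.

```python
from itertools import product

# The 20 standard amino acids as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dipeptide_composition(sequence):
    """Return {dipeptide: N_ij / (N - 1)} for all 400 ordered residue pairs."""
    sequence = sequence.upper()
    n_pairs = len(sequence) - 1
    if n_pairs < 1:
        raise ValueError("sequence must contain at least two residues")
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    # Slide a window of width 2 along the sequence (overlapping dipeptides).
    for k in range(n_pairs):
        pair = sequence[k:k + 2]
        if pair in counts:  # pairs with non-standard letters are skipped
            counts[pair] += 1
    return {pair: c / n_pairs for pair, c in counts.items()}


# Made-up 5-residue example: dipeptides MG, GS, SS, SD.
dipeptides = dipeptide_composition("MGSSD")
```

Keys such as "MG" or "SS" correspond to the dipeptide feature IDs used in the selected-feature table.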
Block 3: Geary autocorrelation.
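Geary autocorrelation features are defined as:
$$C(d) = \frac{\frac{1}{2(N-d)} \sum_{i=1}^{N-d} (P_i - P_{i+d})^2}{\frac{1}{N-1} \sum_{i=1}^{N} (P_i - \bar{P})^2} \qquad (3)$$
where d = 1, 2, 3, ..., 30 is the lag of the autocorrelation, N is the length of the sequence, P_i and P_{i+d} are the property values of the amino acids at positions i and i+d, respectively, and \bar{P} is the average value of a particular property of the 20 amino acids.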
For this block, 8 amino acid properties were adopted: (1) hydrophobicity scales, (2) average flexibility indices, (3) polarizability parameter, (4) free energy of solution in water, (5) residue accessible surface area in tripeptide, (6) residue volume, (7) steric parameter, and (8) relative mutability.
For each amino acid property there are 30 autocorrelation features, one per lag, so this block contributes 8 × 30 = 240 features.
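The 30 Geary features for one property can be sketched as follows. This is a hedged illustration that assumes the standard Geary form (mean squared difference at lag d, halved, divided by the spread around a reference mean). The Kyte-Doolittle hydropathy values stand in for the hydrophobicity scale, which the text does not identify, and the example sequence is made up. Property standardization is omitted for brevity, and the reference mean is taken over the 20 amino acids as the text describes.

```python
# Kyte-Doolittle hydropathy values, used only as a stand-in property scale.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def geary_autocorrelation(values, lag, mean):
    """Geary autocorrelation C(d) of per-residue property values at a given lag."""
    n = len(values)
    if not 1 <= lag < n:
        raise ValueError("lag must satisfy 1 <= lag < len(values)")
    # Numerator: mean squared difference between residues `lag` apart, halved.
    numerator = sum(
        (values[i] - values[i + lag]) ** 2 for i in range(n - lag)
    ) / (2 * (n - lag))
    # Denominator: spread of the property values around the reference mean.
    denominator = sum((v - mean) ** 2 for v in values) / (n - 1)
    return numerator / denominator


sequence = "MGSSDEQRVAPSSMFSDIGYMDERVAPQSSMGQ"  # made-up 33-residue example
scale_mean = sum(KD.values()) / len(KD)  # average over the 20 amino acids
values = [KD[aa] for aa in sequence]
geary_features = [geary_autocorrelation(values, d, scale_mean) for d in range(1, 31)]
```

Because the largest lag is 30, the sequence must contain at least 31 residues. Repeating the last line for each of the 8 properties would give the 240 features of this block.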
Blocks 2 and 3 were inspired by the PROFEAT method. Every amino acid property or property distance in Block 1 and Block 3 was centered and standardized before calculation, i.e. $P' = (P - \bar{P}) / \sigma$, where $\bar{P}$ is the average value of a particular property of the 20 amino acids, and $\sigma = \sqrt{\frac{1}{20} \sum_{k=1}^{20} (P_k - \bar{P})^2}$. In total, 2 + 20 + 400 + 240 = 662 sequence features were generated.
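The centering and standardization step can be sketched as below. The function name `standardize_scale` is hypothetical, and dividing by the population standard deviation (rather than the sample standard deviation) is an assumption based on the usual PROFEAT convention. The four-entry scale is a toy example; a real scale would hold one value for each of the 20 amino acids.

```python
import math


def standardize_scale(scale):
    """Center a {residue: value} scale on its mean and divide by its standard deviation."""
    values = list(scale.values())
    mean = sum(values) / len(values)
    # Population standard deviation: divide by the number of entries, not n - 1.
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if sigma == 0:
        raise ValueError("scale has zero variance and cannot be standardized")
    return {aa: (v - mean) / sigma for aa, v in scale.items()}


# Toy scale with mean 4 and population variance 2.
toy_scale = {"A": 2.0, "C": 4.0, "D": 4.0, "E": 6.0}
standardized = standardize_scale(toy_scale)
```

After this step the scale has mean 0 and unit variance, so properties measured in different units contribute on a comparable footing.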
2. Selected features used in SeqSubPred
| Block | Feature ID | Rank |
| 1: Distance of physicochemical properties | MutDis2 | 1 |
| | MutDis1 | 2 |
| 2: Single peptide and dipeptide composition | MG | 4 |
| | Q | 5 |
| | S | 11 |
| | YM | 13 |
| | DE | 15 |
| | RV | 18 |
| | AP | 19 |
| | SS | 23 |
| | MF | 27 |
| | SD | 33 |
| | IG | 42 |
| 3: Geary autocorrelation | GAuto_6_26 | 3 |
| | GAuto_6_13 | 6 |
| | GAuto_6_20 | 7 |
| | GAuto_6_17 | 8 |
| | GAuto_6_29 | 9 |
| | GAuto_6_25 | 10 |
| | GAuto_6_28 | 12 |
| | GAuto_6_11 | 14 |
| | GAuto_6_23 | 16 |
| | GAuto_6_2 | 17 |
| | GAuto_3_17 | 20 |
| | GAuto_3_20 | 21 |
| | GAuto_6_19 | 22 |
| | GAuto_6_10 | 24 |
| | GAuto_6_1 | 25 |
| | GAuto_3_11 | 26 |
| | GAuto_6_8 | 28 |
| | GAuto_3_25 | 29 |
| | GAuto_3_26 | 30 |
| | GAuto_6_16 | 31 |
| | GAuto_1_16 | 32 |
| | GAuto_2_4 | 34 |
| | GAuto_6_14 | 35 |
| | GAuto_6_22 | 36 |
| | GAuto_1_2 | 37 |
| | GAuto_4_26 | 38 |
| | GAuto_2_27 | 39 |
| | GAuto_3_2 | 40 |
| | GAuto_6_12 | 41 |
| | GAuto_6_7 | 43 |
| | GAuto_1_19 | 44 |
Note: in GAuto_i_d, i is the index of the amino acid property (i = 1, 2, 3, ..., 8) and d is the lag.
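To read the Geary feature IDs in the table, a small parser can help. This is a sketch under one assumption: the property index i follows the order in which the eight properties are listed for Block 3 (so i = 6 would be residue volume). The names `PROPERTIES` and `parse_gauto` are hypothetical.

```python
import re

# Property index -> name, assuming the order in which Block 3 lists them.
PROPERTIES = {
    1: "Hydrophobicity scales",
    2: "Average flexibility indices",
    3: "Polarizability parameter",
    4: "Free energy of solution in water",
    5: "Residue accessible surface area in tripeptide",
    6: "Residue volume",
    7: "Steric parameter",
    8: "Relative mutability",
}

_GAUTO_PATTERN = re.compile(r"^GAuto_([1-8])_(\d{1,2})$")


def parse_gauto(feature_id):
    """Split a feature ID such as 'GAuto_6_26' into (index, property name, lag)."""
    match = _GAUTO_PATTERN.match(feature_id)
    if match is None:
        raise ValueError(f"not a Geary feature ID: {feature_id!r}")
    index, lag = int(match.group(1)), int(match.group(2))
    if not 1 <= lag <= 30:
        raise ValueError("lag must be between 1 and 30")
    return index, PROPERTIES[index], lag


top_geary = parse_gauto("GAuto_6_26")
```

Under this assumed ordering, most of the selected Geary features (those of the form GAuto_6_d) would be autocorrelations of residue volume at different lags.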
3. Classification method
Random Forest was implemented using the R package randomForest v4.5.