1 Feature generation
Each site was represented by a 51-residue long peptide centered at the cysteine of interest as described above. If the upstream or downstream residue number is less than 25, symbol X was used to represent the missing amino acid in the sequence fragment. Following this, each fragment was transformed into five categories of features, as discussed below.

 

I. Amino acid composition

Amino acid composition was incorporate via two groups of features: single amino acid composition in a sequence fragment and K-spaced amino acid pair composition. Single amino acid composition is the fraction of each amino acid type in a sequence fragment. K-spaced amino acid pair composition was first proposed by Chen et al. for the prediction of protein flexible/rigid regions, and been proved useful for the prediction of O-glycosylation sites and palmitoylation sites.The K-spaced amino acid pair compositions were calculated by considering the fraction of amino acid pairs that are separated by k amino acids within a protein sequence fragment (there are 441 possible pairs, e.g. AA, AC, AD, ..., XX). We refer to such a feature vector as .Forinstance,

, where NA3C is the number of occurrences of the AC amino acid pair separated by 3 amino acids and N is the residue length of the peptide. In this work, N was set to 51 and k=0,1,...,9 were jointly considered. In total, 4431 compositional features were generated.

 

II. Autocorrelation of amino acid physicochemical properties

Thirteen amino acid properties were utilized for this feature set. Eight of the properties are (1)the hydrophobicity scale, (2)the average flexibility index, (3) the polarizability parameter, (4)the free energy of solution in water, (5)the residue accessible surface area for a tripeptide, (6)the average volumes of residues, (7)the steric parameter and (8)the relative mutability. The remaining five properties were derived from the literature and assembled by Hu et al. as transformed attributes, including (9) the polarity factor, (10) the secondary structure factor, (11) the molecular volume factor, (12) the codon diversity factor and (13)electrostatic charge factor. All of the amino acid are centralized and standardized before the calculation and the properties for residue X were set to zeros. These features can be accessed in supplementary file S3.

Two autocorrelation calculating algorithms, as Geary autocorrelation and normalized Moreau-Broto autocorrelation were adopted here. In this way, the Geary autocorrelation features are defined as:

where d=1, 2, 3, … , 30 is the lag of the autocorrelation, Pi and Pi+d are the particular property values of the amino acids at position i and i+d, respectively. is the average value of the particular property of 20-normamino acids.

Moreau-Broto autocorrelation features are defined as,where d,Pi and Pi+d are previously defined. Therefore, the normalized Moreau-Broto autocorrelation descriptors are defined as:, where N is the length of peptide.

In summary, this block of features are inspired by both CKSAAP-Palm method and PROFEAT method. There are 13*30*2=780 features in this block.

 

 III. Amino acid position weighted matrices(PWMs)

For a given dataset of fixed-length sequence fragments, each amino acid at each position is associated with its frequency of occurrence. Using this frequency in both palmitoylated and non-palmitoylated sets of fragments in training set, two position-weighted matrices (PWMs) are calculated. Each row of the matrix corresponds to one kind of amino acids and every column corresponds to the position in the peptide. The element in the i-th row and j-th column of the matrixis defined as ,where Nij is the showed up times of ith amino acid in the jth position of peptide and Npep is the number of peptides in the whole dataset, i=1,2,3...21, j=1,2,3...51. Since two PWMs were built, 51*2=102 features were extracted in this group.

After all, 5313 features were generated. In order to reduce redundant information, constant features and highly correlated features were excluded in the feature space. In addition, if any two features shared a correlation coefficient greater than 0.85, one of them was removed randomly. At last, 4959 features remained for the next procedure.

 

 

2. Classification method

RandomForest method was implemented using R package randomForest.

 

3. Loss/gain of palitoylation sites by a variation.