1
Feature generation
Each site was represented by a 51-residue long peptide centered at the
cysteine of interest as described above. If the upstream or downstream
residue number is less than 25, symbol X was used to represent the missing
amino acid in the sequence fragment. Following this, each fragment was
transformed into five categories of features, as discussed below.
I. Amino acid composition
Amino acid composition was incorporate via two groups of features: single
amino acid composition in a sequence fragment and K-spaced amino acid pair
composition. Single amino acid composition is the fraction of each amino
acid type in a sequence fragment. K-spaced amino acid pair composition was
first proposed by Chen et al. for the prediction of protein flexible/rigid
regions, and been proved useful for the prediction of O-glycosylation sites
and palmitoylation sites.The K-spaced amino acid pair compositions were
calculated by considering the fraction of amino acid pairs that are
separated by k amino acids within a protein sequence fragment (there are 441
possible pairs, e.g. AA, AC, AD, ..., XX). We refer to such a feature vector
as .Forinstance,
, where NA3C is the number of occurrences of the AC amino acid pair separated by 3 amino acids and N is the residue length of the peptide. In this work, N was set to 51 and k=0,1,...,9 were jointly considered. In total, 4431 compositional features were generated.
II. Autocorrelation of amino acid physicochemical properties
Thirteen amino acid properties were utilized for this feature set. Eight of the properties are (1)the hydrophobicity scale, (2)the average flexibility index, (3) the polarizability parameter, (4)the free energy of solution in water, (5)the residue accessible surface area for a tripeptide, (6)the average volumes of residues, (7)the steric parameter and (8)the relative mutability. The remaining five properties were derived from the literature and assembled by Hu et al. as transformed attributes, including (9) the polarity factor, (10) the secondary structure factor, (11) the molecular volume factor, (12) the codon diversity factor and (13)electrostatic charge factor. All of the amino acid are centralized and standardized before the calculation and the properties for residue X were set to zeros. These features can be accessed in supplementary file S3.
Two autocorrelation calculating algorithms, as Geary autocorrelation and normalized Moreau-Broto autocorrelation were adopted here. In this way, the Geary autocorrelation features are defined as:
where d=1, 2, 3, … , 30 is the lag of the autocorrelation,
Pi and
Pi+d
are
the particular property values of the amino acids at position i and
i+d, respectively.
is
the average value of the particular property of 20-normamino acids.
Moreau-Broto autocorrelation features are defined as:,where
d,Pi
and
Pi+d
are
previously defined. Therefore, the normalized Moreau-Broto autocorrelation
descriptors are defined as:
,
where N is the length of peptide.
In summary, this block of features are inspired by both CKSAAP-Palm method and PROFEAT method. There are 13*30*2=780 features in this block.
III. Amino acid position weighted matrices(PWMs)
For a given dataset of fixed-length sequence fragments, each amino acid at
each position is associated with its frequency of occurrence. Using this
frequency in both palmitoylated and non-palmitoylated sets of fragments in
training set, two position-weighted matrices (PWMs) are calculated. Each row
of the matrix corresponds to one kind of amino acids and every column
corresponds to the position in the peptide. The element in the i-th row and
j-th column of the matrixis defined as
,where
Nij is
the showed up times of ith amino acid in the jth position of
peptide and Npep is
the number of peptides in the whole dataset, i=1,2,3...21, j=1,2,3...51.
Since two PWMs were built, 51*2=102 features were extracted in this group.
After all, 5313 features were generated. In order to reduce redundant information, constant features and highly correlated features were excluded in the feature space. In addition, if any two features shared a correlation coefficient greater than 0.85, one of them was removed randomly. At last, 4959 features remained for the next procedure.
2. Classification method
RandomForest method was implemented using R package randomForest.
3. Loss/gain of palitoylation sites by a variation.