Table 1 Summary of sequence-based encodings

From: Encodings and models for antimicrobial peptide classification for multi-resistant pathogens

| Encoding | Description | Summary | Used in | Used along with | Main encoding |
| --- | --- | --- | --- | --- | --- |
| Sparse | each amino acid is represented as a one-hot vector of length 20, where every position except one is set to 0 | Density: -; Information: + | [15, 19, 20, 21] | Substitution Matrix, Amino Acid Composition | Sparse encoding |
| Amino Acid Composition | feature vector contains at each position the proportion of an amino acid relative to the sequence length | Density: +; Information: - | [22, 23, 24] | Distance Frequency, Quantitative Matrix, Dipeptide Composition | Amino acid composition |
| Distance Frequency | calculates the distance between amino acids of similar properties and bins the occurrences according to the gap length | Density: +; Information: + | [22] |  | Amino acid composition |
| Quantitative Matrix | encodes the propensity of each amino acid at a position | Density: +; Information: + | [23] |  | Amino acid composition |
| CTD | describes the composition (C), transition (T) and distribution (D) of similar amino acids along the peptide sequence | Density: +; Information: + | [25] |  | Amino acid composition |
| Pseudo-amino Acid Composition (PseAAC) | computes the correlation between different ranges among a pair of amino acids | Density: +; Information: + | [27, 28, 29, 30] | Dipeptide Composition | Pseudo amino acid composition |
| Reduced Amino Acid Alphabet | similar amino acids are grouped together | Density: +; Information: o | [9, 32, 33, 34, 36, 37] | N-gram Model, AAIndexLoc | Reduced amino acid alphabet |
| N-gram Model | occurrences of n-mers for an alphabet of size m, leading to an m^n-dimensional, sparse representation of the initial sequence | Density: -; Information: o | [9] |  | Reduced amino acid alphabet |
| AAIndexLoc | k-nearest neighbor clustering to aggregate amino acids into 5 classes using their amino acid index, i.e., amino acids with the respective highest (T), high (H), medium (M), low (L), and lowest (B) values of a particular physicochemical property are clustered together | Density: o; Information: + | [37] | Dipeptide Composition | Reduced amino acid alphabet |
| Physicochemical Properties | translation of an amino acid to a particular physicochemical property | Density: o; Information: + | [40, 42, 47, 48, 49, 50, 51, 52, 53] | z-descriptor, d-descriptor and many more | Physicochemical properties |
| z-descriptor | derived from the principal components of physicochemical properties by means of partial least squares (PLS) projections; PLS leads to a subset of five final features, capable of describing the 20 proteinogenic as well as 67 additional amino acids | Density: +; Information: + | [42, 44] |  | Physicochemical properties |
| d-descriptor | amino acid sequence is squeezed between the y-axis (N-terminus) and the x-axis (C-terminus) with gradual bending of the single amino acids and subsequent vector summation | Density: +; Information: + | [54] |  | Physicochemical properties |
| Autocorrelation | interdependence between two distant amino acids in a peptide sequence | Density: +; Information: + | [57, 58, 59, 60, 61] |  | Autocorrelation |
| Substitution/Scoring Matrix | provides accepted mutations between amino acid pairs, i.e., sequence alterations with either no or a positive impact on protein function | Density: +; Information: + | [65, 66, 67, 68, 69, 70, 71] | BLOMAP, Sparse, Amino Acid Composition, Dipeptide Composition, PseAAC, AAIndexLoc | Substitution and scoring matrix |
| BLOMAP | uses BLOSUM62 to map distances in a high-dimensional input space, i.e., the substitution matrix, to a lower dimension via the Sammon projection | Density: +; Information: + | [65] |  | Substitution and scoring matrix |
| Fourier Transformation | detects underlying patterns in time series by transforming the time signal to the frequency domain | Density: o; Information: + | [73, 74] |  | Fourier Transformation |
Note: + (good), o (neutral/no declaration), - (bad). For instance, “Density: -” means the encoding results in a high-dimensional feature space, and “Information: +” reflects a representative mapping from the residue sequence to the numerical vector. “o” denotes encodings that are difficult to classify because details are missing in the respective publication, or that can be considered neutral. In general, the classification rests upon the authors' experience and is meant to help researchers quickly grasp suitable encodings. Nevertheless, an encoding rated “-” might still work well for a particular application, and the rating should by no means be regarded as the final evaluation.