Encodings and models for antimicrobial peptide classification for multi-resistant pathogens

Spänig, Sebastian; Heider, Dominik

doi:10.1186/s13040-019-0196-x

BioData Mining

Table 1 Summary of sequence-based encodings


Encoding	Description	Summary	Used in	Used along with	Main Category
Sparse	each amino acid is represented as a one-hot vector of length 20, in which every position except one is set to 0	Density: - Information: +	[15, 19, 20, 21]	Substitution Matrix, Amino Acid Composition	Sparse encoding
Amino Acid Composition	each position of the feature vector holds the proportion of one amino acid relative to the sequence length	Density: + Information: -	[22, 23, 24]	Distance Frequency, Quantitative Matrix, Dipeptide Composition, PseAAC	Amino acid composition
Distance Frequency	calculates the distance between amino acids of similar properties and bins the occurrence according to the gap length	Density: + Information: +	[22]		Amino acid composition
Quantitative Matrix	encodes the propensity of each amino acid at a position	Density: + Information: +	[23]		Amino acid composition
CTD	describes the composition (C), transition (T) and distribution (D) of similar amino acids along the peptide sequence	Density: + Information: +	[25]		Amino acid composition
Pseudo-amino Acid Composition (PseAAC)	computes the correlation between pairs of amino acids at different ranges along the sequence	Density: + Information: +	[27, 28, 29, 30]	Dipeptide Composition	Pseudo amino acid composition
Reduced Amino Acid Alphabet	similar amino acids are grouped together	Density: + Information: o	[9, 32, 33, 34, 36, 37]	N-gram Model, AAIndexLoc	Reduced amino acid alphabet
N-gram Model	counts the occurrences of n-mers over an alphabet of size m, leading to an mⁿ-dimensional, sparse representation of the initial sequence	Density: - Information: o	[9]		Reduced amino acid alphabet
AAIndexLoc	k-nearest neighbor clustering to aggregate amino acids into 5 classes using their amino acid index, i.e., amino acids with the respective highest (T), high (H), medium (M), low (L), and lowest (B) values of a particular physicochemical property are clustered together	Density: o Information: +	[37]	Dipeptide Composition	Reduced amino acid alphabet
Physicochemical Properties	translation of an amino acid to a particular physicochemical property	Density: o Information: +	[40, 42, 47, 48, 49, 50, 51, 52, 53]	z-descriptor, d-descriptor and many more	Physicochemical properties
z-descriptor	derived from the principal components of physicochemical properties by means of partial least squares (PLS) projections; PLS leads to a subset of five final features capable of describing the 20 proteinogenic as well as 67 additional amino acids	Density: + Information: +	[42, 44]		Physicochemical properties
d-descriptor	amino acid sequence is squeezed between the y-axis (N-terminus) and the x-axis (C-terminus) with gradual bending of the single amino acids and subsequent vector summation	Density: + Information: +	[54]		Physicochemical properties
Autocorrelation	interdependence between two distant amino acids in a peptide sequence	Density: + Information: +	[57, 58, 59, 60, 61]		Autocorrelation
Substitution/Scoring Matrix	provides accepted mutations between amino acid pairs, i.e., sequence alterations with either no or positive impact in terms of the protein function	Density: + Information: +	[65, 66, 67, 68, 69, 70, 71]	BLOMAP, Sparse, Amino Acid Composition, Dipeptide Composition, PseAAC, AAIndexLoc	Substitution and scoring matrix
BLOMAP	maps the distances in a high-dimensional input space, i.e., the BLOSUM62 substitution matrix, to a lower dimension using the Sammon projection	Density: + Information: +	[65]		Substitution and scoring matrix
Fourier Transformation	detects underlying patterns in a time series by transforming the time signal into the frequency domain	Density: o Information: +	[73, 74]		Fourier Transformation

+ (good), o (neutral/no declaration), - (bad). For instance, “Density: -” means the encoding results in a high-dimensional feature space, and “Information: +” reflects a representative mapping from the residue sequence to the numerical vector. “o” denotes encodings that are difficult to classify, either because details are missing in the respective publication or because they can be considered neutral. In general, the classification rests upon the authors' experience and is intended to help researchers quickly identify suitable encodings. Nevertheless, an encoding rated “-” might still work well for a particular application, and the rating should by no means be regarded as a final evaluation.
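
To make the density ratings above more concrete, the following is a minimal Python sketch of three of the listed encodings: the sparse (one-hot) encoding, the amino acid composition, and an n-gram model over a reduced alphabet. It is an illustration only, not the implementation used in any of the cited publications. The function names, the example peptide, and in particular the six-group reduced alphabet are assumptions chosen for this sketch.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 proteinogenic amino acids


def sparse_encoding(seq):
    """One-hot encoding: 20 values per residue, exactly one of them set to 1."""
    vec = []
    for aa in seq:
        one_hot = [0] * len(AMINO_ACIDS)
        one_hot[AMINO_ACIDS.index(aa)] = 1
        vec.extend(one_hot)
    return vec


def amino_acid_composition(seq):
    """Proportion of each amino acid relative to the sequence length."""
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


# Hypothetical reduced alphabet (assumption for this sketch): six groups of
# roughly similar residues, each mapped to a single letter.
GROUPS = {
    "a": "AVLIMC",  # aliphatic / hydrophobic
    "r": "FWYH",    # aromatic
    "p": "STNQ",    # polar
    "k": "KR",      # positively charged
    "d": "DE",      # negatively charged
    "g": "GP",      # special backbone conformation
}
GROUP_OF = {aa: letter for letter, members in GROUPS.items() for aa in members}


def ngram_counts(seq, n=2):
    """Counts of all n-mers over the reduced alphabet (m**n dimensions)."""
    reduced = "".join(GROUP_OF[aa] for aa in seq)
    alphabet = sorted(GROUPS)
    keys = ["".join(p) for p in product(alphabet, repeat=n)]
    counts = Counter(reduced[i:i + n] for i in range(len(reduced) - n + 1))
    return [counts[k] for k in keys]


peptide = "KWKLFKKIGAVLKVL"  # arbitrary example sequence
print(len(sparse_encoding(peptide)))         # 20 values per residue
print(len(amino_acid_composition(peptide)))  # always 20, independent of length
print(len(ngram_counts(peptide, n=2)))       # m**n with m = 6 groups, n = 2
```

The sparse vector grows with the peptide length and consists almost entirely of zeros, which corresponds to “Density: -”. The composition vector always has 20 entries but discards residue order, which corresponds to “Information: -”. For the n-gram model the dimension depends only on the alphabet size m and on n, so reducing the alphabet keeps mⁿ manageable.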


ISSN: 1756-0381
