Table 1 Summary of sequence based encodings

From: Encodings and models for antimicrobial peptide classification for multi-resistant pathogens

| Encoding | Description | Summary | Used in | Used along with | Main category |
|---|---|---|---|---|---|
| Sparse | each amino acid is represented as a one-hot vector of length 20, where every position except one is set to 0 | Density: -, Information: + | [15, 19, 20, 21] | Substitution Matrix, Amino Acid Composition | Sparse encoding |
| Amino Acid Composition | feature vector contains at each position the proportion of an amino acid relative to the sequence length | Density: +, Information: - | [22, 23, 24] | Distance Frequency, Quantitative Matrix, Dipeptide Composition, PseAAC | Amino acid composition |
| Distance Frequency | calculates the distance between amino acids of similar properties and bins the occurrences according to the gap length | Density: +, Information: + | [22] |  | Amino acid composition |
| Quantitative Matrix | encodes the propensity of each amino acid at a position | Density: +, Information: + | [23] |  | Amino acid composition |
| CTD | describes the composition (C), transition (T) and distribution (D) of similar amino acids along the peptide sequence | Density: +, Information: + | [25] |  | Amino acid composition |
| Pseudo-amino Acid Composition (PseAAC) | computes the correlation between pairs of amino acids at different ranges along the sequence | Density: +, Information: + | [27, 28, 29, 30] | Dipeptide Composition | Pseudo amino acid composition |
| Reduced Amino Acid Alphabet | similar amino acids are grouped together | Density: +, Information: o | [9, 32, 33, 34, 36, 37] | N-gram Model, AAIndexLoc | Reduced amino acid alphabet |
| N-gram Model | occurrences of n-mers for an alphabet of size m, leading to an m^n-dimensional, sparse representation of the initial sequence | Density: -, Information: o | [9] |  | Reduced amino acid alphabet |
| AAIndexLoc | k-nearest neighbor clustering to aggregate amino acids into 5 classes using their amino acid index, i.e., amino acids with the respective highest (T), high (H), medium (M), low (L), and lowest (B) values of a particular physicochemical property are clustered together | Density: o, Information: + | [37] | Dipeptide Composition | Reduced amino acid alphabet |
| Physicochemical Properties | translation of an amino acid to a particular physicochemical property | Density: o, Information: + | [40, 42, 47, 48, 49, 50, 51, 52, 53] | z-descriptor, d-descriptor and many more | Physicochemical properties |
| z-descriptor | derived from the principal components of physicochemical properties by means of partial least squares (PLS) projections; PLS leads to a subset of five final features, able to describe the 20 proteinogenic as well as 67 additional amino acids | Density: +, Information: + | [42, 44] |  | Physicochemical properties |
| d-descriptor | amino acid sequence is squeezed between the y-axis (N-terminus) and the x-axis (C-terminus), with gradual bending of the single amino acids and subsequent vector summation | Density: +, Information: + | [54] |  | Physicochemical properties |
| Autocorrelation | interdependence between two distant amino acids in a peptide sequence | Density: +, Information: + | [57, 58, 59, 60, 61] |  | Autocorrelation |
| Substitution/Scoring Matrix | provides accepted mutations between amino acid pairs, i.e., sequence alterations with either no or a positive impact on protein function | Density: +, Information: + | [65, 66, 67, 68, 69, 70, 71] | BLOMAP, Sparse, Amino Acid Composition, Dipeptide Composition, PseAAC, AAIndexLoc | Substitution and scoring matrix |
| BLOMAP | uses the Sammon projection to map distances from a high-dimensional input space, i.e., the BLOSUM62 substitution matrix, to a lower dimension | Density: +, Information: + | [65] |  | Substitution and scoring matrix |
| Fourier Transformation | detects underlying patterns in time series by transforming the time signal to the frequency domain | Density: o, Information: + | [73, 74] |  | Fourier Transformation |

  1. + (good), o (neutral/no declaration), - (bad). For instance, “Density: -” means the encoding results in a high-dimensional feature space, and “Information: +” reflects a representative mapping from the residue sequence to the numerical vector. “o” denotes encodings that are difficult to classify because details are missing in the respective publication, or that can be considered neutral. In general, the classification rests upon the authors' experience and is intended to help researchers quickly grasp suitable encodings. Nevertheless, an encoding rated “-” might still work well for a particular application, and the rating should by no means be regarded as a final evaluation.
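
To make the density ratings more concrete, the following minimal Python sketch implements three of the encodings from the table (Sparse, Amino Acid Composition, and the N-gram Model) and compares the resulting vector lengths. It is an illustration only, not the implementation used in the cited studies. The function names, the residue ordering in `AMINO_ACIDS`, and the example peptide string are assumptions made for this sketch, and the n-gram model is shown on the full 20-letter alphabet rather than a reduced one.

```python
from collections import Counter
from itertools import product

# Assumed ordering of the 20 proteinogenic amino acids (any fixed order works).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sparse_encoding(seq):
    """One-hot encode every residue: 20 values per position, exactly one is 1.

    Raises KeyError for residues outside AMINO_ACIDS.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    vector = []
    for aa in seq:
        row = [0] * len(AMINO_ACIDS)
        row[index[aa]] = 1
        vector.extend(row)
    return vector


def amino_acid_composition(seq):
    """Proportion of each amino acid relative to the sequence length."""
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


def ngram_counts(seq, n=2, alphabet=AMINO_ACIDS):
    """Occurrences of every possible n-mer: a vector of length m**n."""
    keys = ["".join(p) for p in product(alphabet, repeat=n)]
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    return [counts[k] for k in keys]


# Arbitrary 12-residue example string, chosen only for illustration.
peptide = "GIGKFLHSAKKF"

print(len(sparse_encoding(peptide)))         # 12 residues * 20 = 240
print(len(amino_acid_composition(peptide)))  # 20, independent of length
print(len(ngram_counts(peptide, n=2)))       # 20**2 = 400, mostly zeros
```

The sparse vector grows with the sequence length and the n-gram vector grows as m^n, which corresponds to the “Density: -” ratings, whereas the amino acid composition stays at 20 values but discards the order of the residues (“Information: -”).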