New neural network classification method for individuals ancestry prediction from SNPs data

Artificial Neural Network (ANN) algorithms have been widely used to analyse genomic data. Single Nucleotide Polymorphisms(SNPs) represent the genetic variations, the most common in the human genome, it has been shown that they are involved in many genetic diseases, and can be used to predict their development. Developing ANN to handle this type of data can be considered as a great success in the medical world. However, the high dimensionality of genomic data and the availability of a limited number of samples can make the learning task very complicated. In this work, we propose a New Neural Network classification method based on input perturbation. The idea is first to use SVD to reduce the dimensionality of the input data and to train a classification network, which prediction errors are then reduced by perturbing the SVD projection matrix. The proposed method has been evaluated on data from individuals with different ancestral origins, the experimental results have shown the effectiveness of the proposed method. Achieving up to 96.23% of classification accuracy, this approach surpasses previous Deep learning approaches evaluated on the same dataset.


Introduction
The human genome contains three billion of base pairs, with only 0.1% difference between individuals [1]. The most common type of genetic variations between individuals is called Single Nucleotide Polymorphism (SNP) [2]. An SNP is a change from one base pair to another, which occurs about once every 1000 bases. Most of these SNPs have no impact on human health. However, many studies have shown that some of these genetic variations have important biological effects and are involved in many human diseases [3,4]. SNPs are commonly used to detect genes associated with the development of a disease within families [5]. In addition, SNPs can also help to predict a person's response to drugs or their susceptibility to develop one or more particular diseases. In genetics, Genome-Wide Association Studies (GWAS) are observational studies that use highthroughput genotyping technologies to identify a set of genetic variants that are associated to a given trait or disease [6], by comparing variants in a group of cases with variants in a group of controls. However, this approach is only optimal for populations from the same ancestry group, as it is challenging to dissociate the variations associated with a disease from those that characterize the genetic of human populations. In this context, numerous machine learning algorithms have been used to classify individuals according to genetic differences that affect their population. Support Vector Machines (SVM) methods have been applied to infer recent genetic ancestry of a subgroup of communities in the USA [7] or coarse ethnicity [8]. However, SVM methods are very sensitive to the choice of kernel and its parameters [9]. Deep learning algorithms, such as Neural Networks have been widely used to analyse genomic data as well as gene expression data to classify certain diseases [10][11][12][13][14][15][16][17][18][19][20]. But, the high dimensionality of genomic data (when the number of input features is several times higher than the number of training examples) makes the learning task very difficult. Indeed, when data is composed of a large number of input features m for a small number of samples n (n << m), the problem of overfitting becomes inevitable. In general, overfitting in machine learning occurs when a model fits well with the training data, but not fit the unseen data. The model learns details and noise in the training data, which negatively impact the performance of the model on new data. One way to avoid the problem of overfitting is to reduce the complexity of the problem by removing features that do not contribute or decrease the accuracy of the model [21]. Different techniques are used to deal with the problem of overfitting. The most wellknown ones are L 1 and L 2 regularizations [22]. The idea of these techniques is to penalize the higher weights in the model by adding extra terms in the loss function. Another commonly used regularization technique, called "Dropout", introduced by Hinton et al. [23] consists of dropping neurons at random (in hidden layers) in each learning round. However, with such difference between the number of features versus the number of samples, it increases the problem of overfitting. To overcome this problem, dimensionality reduction techniques need to be combined with unsupervised learning methods or other data preprocessing techniques.
There are many ways to transform a high-dimensional data to low-dimensional data, Singular Value Decomposition (SVD), Principal Component Analysis (PCA)) and Autoencoder(AE) are the most common dimensional reduction techniques. SVD and PCA are the most popular linear dimensionality reduction techniques. Both attempt to find k orthogonal dimensions in an n-dimensional space, so that k < n. They are related to each other, but PCA uses the covariance matrix of the input data, while SVD is performed on the input matrix itself. The Autoencoder is a Neural Network that tries to reconstruct the input data from their compressed form. Indeed, the Autoencoder is used as a method of non-linear dimensionality reduction, it works by mapping an n-dimensional input data into a k-dimensional data (with k < n).
Recently, ANNs have been used in many works to analyse sequencing data and predict complex diseases using SNPs data [11,[24][25][26][27][28][29]. To analyse SNPs from sequences [16,26,30], many approaches have been proposed to deal with high dimensionality by combining dimensionality reduction techniques, such as unsupervised methods followed by supervised Neural Networks for classification [11,13,[31][32][33]. For instance, Zhou et. al. [11] used a three-step Neural Network to characterise the determinants of Alzheimer's disease. Liu et al. [34] combined Deep Neural Network with an incremental way to select SNPs and multiple Dropouts regularization techniques. Kilicarslan et al. [32] used a hybrid model consisting of Relief and stacked Autoencoder as dimensionality reduction technique followed by Support Vector Machines (SVM) and Convolutional Neural Networks (CNNs) for diagnosis and classification of cancer samples. Khan et al. [35] used PCA and Neural Network to identify relevant genes and classify cancer samples. Fakoor et al. [14] combined PCA with Sparse Autoencoder to improve cancer diagnosis and classification. Romero et al. [33] proposed to reduce the hyperparameters of the classification network by the use of auxiliary networks. Pirmoradi et al. citepir-moradi2020self used Deep Auto-Encoder approach to classify complex diseases from SNPs data. Based on our literature review, Romero et al. are the first to use Deep learning algorithms on SNP data for genetic ancestry prediction task. They constructed a classification network with an optional reconstruction path and proposed two auxiliary Neural Networks to predict the parameters of the first layer of the classification network and its reconstruction path respectively. They proposed several types of embedding techniques to reduce the number of free parameters in the auxiliary networks, such as Random projection(RP), Per class histogram, SNPtoVec, Embedding learnt end-to-end from raw data.
In this work, we propose a New Classification Neural Network based on the perturbation of the input matrix. To address the problem of dimensionality, the training model is constructed in three steps followed by a test phase: (1) use SVD to reduce the dimension of the input data, (2) train a Multi-Layer Perceptron (MLP) to perform classification tasks, (3) perturb the SVD projection matrix in the sense to minimize the training loss function. In the test phase, the test set is multiplied by the perturbed projection matrix to evaluate the performance of the classifier.
The main contribution of this paper, is how the projection matrix is perturbed after the model is trained. This perturbation is inspired by the Targeted Attacks Method, which aims is to change the inputs so that the network classify them into any desired class [36][37][38][39][40]. These inputs are called Adversarial Examples. Previews works on target attacks have been used in image analysis, such as image segmentation [41], face detection [42] or image classification [43]. There are many ways of producing adversarial examples [44][45][46], the most commonly used one is Fast Gradient Sign Method (FGSM) and its variants [40,47]. The proposed approach uses FGSM to perturb the input data iteratively to maximize the probability that each output sample falls into the desired class. Other variants of this method, such as Projected Gradient Descent [45], Basic Iterative Method [47], Boosting FGSM with Momentum [48] and many other gradients based methods, could be used [49][50][51]. For instance, the Projected Gradient Descent is considered as one of the most effective algorithms to generate adversarial samples. However, this method is too time-consuming to be used for training. FGSM is a very simple and fast method of generating adversarial examples [40]. The objective is to obtain a good representation of input features in SVD projection space, which will be obtained after calculating the perturbed input of the training data.
This work is organized as follows: the proposed method and the dataset used are described in "Material and methods" section, the obtained results are reported in "Results" section and the experiments are discussed in "Discussion" section.

Material and methods
The proposed approach uses SVD to reduce the number of free parameters of the classification network. However, others dimensionality reduction techniques could be used. For instance, Per class histogram method [33] is a very simple dimensionality reduction technique. The idea of this technique is to represent each feature (SNP) in the input data by 3 possible values, corresponding to the proportion of ethnic groups having as genotype 0, 1 or 2 respectively. This produces a projection matrix of size m × 78, where m is number of features. Once the input dimension is reduced, a classification network is trained to find the optimal weight matrix. A perturbed projection matrix is then computed by simply solving a linear system as described in the "Description of the model" section.

Data description
1000 Genomes Project set up in 2008 [52], is an international research consortium which aims to produce a detailed catalog of humans genetic variations, from approximately one thousand volunteers from different ethnic groups, with frequencies larger than 1%. It is the first project to sequence the genome of a large number of people from different populations, regions and countries. Data made available to the international community comprises SNP profiles of the volunteers (see Fig. 1a), which is a vector where the coordinates are the values taken in a fixed position in the genome sequence (homozygous reference, heterozygous or homozygous alternate).
At each locus (fixed position in the genome sequence), an SNP is represented by its genotype that takes three possible values for a diploid organism: AA for homozygous reference, AB for heterozygote and BB for homozygous alternate (see Fig. 1b). The homozygous reference corresponds to a locus where the two base pairs inherited from the parents are identical to the one in the reference genome, the heterozygous corresponds to a locus where the two base pairs found are different and homozygous alternate refers to a locus where the two base pairs found are identical and different from the reference base pair.
The dataset taken as input for the model is a matrix X ∈ R 3450×315345 . The rows of the matrix correspond to individuals (1000Genome's volunteers), the columns correspond to SNPs positions, and the elements are 0, 1 or 2 (corresponding to the three possible values taken by an SNP). 3450 is the number of individuals sampled worldwide from 26 population groups from the 5 continents (see Appendices) and 315345 is the number of included features (SNPs positions).
We use a classification Neural Network composed of an input layer, an output layer and two hidden layers with 100 neurons . This neural network is constructed using Keras and Tensorflow open source libraries. Given the input matrix X, the output of the model is a vector Y ∈ R c whose components correspond to the population groups (26 classes in the used example). A relu activation function is used in the two hidden layers followed by a softmax layer to perform ancestry prediction.

Singular value decomposition
Before applying SVD, input data set is divided into two sets, the training set and the test set. SVD takes as input the training set matrix transpose denoted by X T ∈ R m×n (m > n) with rank(X) = r and decomposes it into a product of three matrices [54]; two orthogonal matrices U ∈ R m×m and V ∈ R n×n and a matrix = diag(σ 1 , σ 2 , . . . , σ n ) ∈ R m×n , σ i > 0 for 1 ≤ i ≤ r, σ i = 0 for i ≥ r + 1, such that The first r columns of the orthogonal matrices U and V are, respectively, the right and the left eigenvectors associated with the r nonzero eigenvalues of X T X. U i , V i and i are, respectively, the ith column of U, V and . The diagonal elements of are the nonnegative square roots of the n eigenvalues of X T X.
The dimension of the input matrix X is then reduced by projecting it onto a space spanned by {U 1 , U 2 , . . . , U k }, the top k (k ≤ r) singular vectors of X. Given a set of samples x 1 , x 2 , . . . , x N of dimension m, the projection matrix U k whose columns are formed by the k first singular vectors of X must minimize where P is the projection defined by : The input data in reduced dimension is denoted by X = X U k .

Description of the model
Let's consider a L hidden layers of a Multi-Layer Perceptron(MLP), in which n input training samples X = {x 1 , x 2 , . . . , x n } are labeled, i.e., for each input x i , the corresponding output by the model is known and denoted y i or Y (x i ). Y is a vector that contains all the labels. A MLP can be described as follows: where z l j , b l j and a l j a 0 j = x j , for an input x = (x 1 x 2 . . . x d ) T are the jth hidden unit, bias term and activation function of layer l, respectively. w l ij is the weight that links the ith unit of the (l − 1)th layer to the jth unit of the lth layer. w l j and a (l−1) are, respectively, the incoming weight vector to the jth neuron of layer l and the output vector of (l-1)th layer, φ is any activation function. Learning the model consists in finding all the parameters w j and b j so that the output a L from the model approximates the true output vector y(x), for all training inputs x.. For simplification, we consider that there are no bias terms b l j or simply we consider it as an additional component of w l j and denote by W l the matrix whose columns are the vectors w l j (Fig. 2). Due to the high dimension of the input data, the proposed approach consists to first project the original data onto a lower dimensional space using SVD. Once the dimension of the input data is reduced, a multilayer perceptron (MLP) classification network is constructed in three steps: Step 1 Learning the weight matrix W : First, a classification network(see Fig. 3) is trained to find W * , the optimal weight by solving: Where C W (X , Y ) = ||φ W (X ) − Y || 2 2 andŶ = φ W * (X ). φ W is the output activation function for the weight matrix W. Y represents the true classification labels.

Fig. 3 Classification network before input perturbation(MLP)
Step 2 Input matrix perturbation X : Once the classification network is sufficiently trained, its weight matrix W * is fixed and the training input matrix X is perturbed to find X * solution of the following problem : To perturb the input data, we use an iterative version of FGSM (see Appendices: Fast gradient sign method) that adds a non random noise whose direction is opposed to the gradient of the loss function.
Step 3 Projection matrix perturbation U k : After finding the optimal perturbation X * , we look for a perturbed projection matrix U k * by solving the following linear system : Where X is the original training matrix and V any matrix, with the same size as U k . After the three construction steps , the output of the MLP, iŝ Once U k * is calculated, we project the original test set on the latter to evaluate the performance of the classification network.
It is worth noting that, after recovery of the perturbed inputs, the classification network (see Fig. 4) can be re-trained or tested with the fixed weight matrix W * (in Step 2). From Step 1 and after having solved the system (4), the input matrix X can be perturbed by solving : But the high dimensionality of input data makes the non-linear optimization problem difficult to solve and the results less accurate.

Results
In this section, the obtained results using the proposed method are reported and its performance is compared to that of the once recommended in [33] (the Per class histogram, see Appendices: Thin parameters for fat genomics, Table 2).

Proposed method
In the table below, we summarize the accuracy of the classification with respect to the number of modes (principal components) k chosen between 20 and 1000.   Table 1 represents in the second column (resp. third column) the results obtained by the classification network before (resp. after) input perturbation. After input perturbation, the training model can be evaluated using the fixed weight matrix (in the third column) as well as re-trained (in the last column). It is clear from the above results that input perturbation has significantly reduced misclassification.
To illustrate the effectiveness of the proposed method, we display the confusion matrix of our classification network to see the effect of input perturbation.
In Fig. 4a (before input perturbation), we observe high classification errors between some population groups such as Chinese Dai in Xishuangbanna and the Kinh in Ho Chi Minh City; Indian Telugu in the UK and Sri Lankan Tamil in the UK; or British in England and Scotland and Utah Residents (CEPH) with Northern and Western Ancestry. Figure 4b shows how our approach has reduced these misclassifications, particularly the classification error between the CDX and KHV classes from 0.95% to 0.05%.
However, as the number of modes increases and the classification errors decrease, one can notice throughout our experiences a weak classification error between the British ethnic groups in England, Scotland and Utah Residents (CEPH) with Northern and Western Ancestry, who appear to be genetically very similar (Figs. 5, 6, 7, 8 and 9).

Per class histogram
In Fig. 10, we present confusion matrices obtained by per histogram embedding methods and Per class histogram embedding input perturbation. Perturbing per class embedding input reduced misclassification errors and allowed the classifier to reach 94,49% of accuracy.

Discussion
Deep learning application to high-dimensional genomic data, such as SNPs is more challenging. In order to deal with problems of high dimensionality, many efforts have been made. In [11], the authors proposed to learn the feature presentations using a Neural Network followed by another classification network. Unsupervised clustering or Deep Autoencoder is jointly trained with a classification network [13,32,33,55]. However, these methods are generally applied to datasets with relatively small features where, the computational cost increases linearly with the number of features and they require more training samples to converge. When Autoencoder network was trained jointly with the classification network on the used dataset, the best accuracy obtained was 85.36%. In Fig. 10 Per class histogram confusion matrix addition to the high dimensionality of the data, there is another challenge related to the high genetic similarity between certain population groups. To mitigate these difficulties, the proposed method reduces the dimension of the input data using SVD algorithm. However, the SVD algorithm extracts linear combinations of features from the input data and fails to take into account the genetic similarity between some population groups as shown in Figs. 4a-10a. To improve these results, the SVD projection matrix is modified to minimize the training loss function of the classification network using FGSM algorithm. The FGSM algorithm allowed us to find the best representation of the input features in SVD projection space. This new representation makes the classification network more robust to small variations in the input and takes into account the genetic similarity between different populations, as shown in the last two columns of Table 1 and Figs. 4b-10b. We are not limited to the SVD algorithm, when Per class histogram is used to reduce the dimension of the input data, the proposed perturbation has significantly reduced classification errors.
The proposed method has achieved its best results when the input features were reduced from 300M to 50, which means that the number of free parameters of the classification network has reduced by a factor of 6000. This method outperforms previous work (see Appendices: Thin parameters for fat genomics) in term of accuracy and the number of free parameters required by the model. For future work, we expect to improve this method by using different targeted attacks algorithms with other dimensionality reduction techniques.

Conclusion
In this work, we proposed a New Neural Network method for the prediction of individual ancestry from SNPs data. To deal with the high dimensionality of the SNPs data, our approach first uses SVD to reduce the dimensionality of its inputs, then train a classification network and then reduce prediction errors by perturbing the input data set.
The obtained results showed how input perturbation reduced classification errors despite genetic similarities between some ethnic groups. With such accuracy in the task of predicting genetic ancestry, this method will make it possible to deal with more complex problems in the healthcare field. We therefore, intend to apply our method to gene expression profiles as well as SNPs data in order first to predict and then prevent the development of patients genetic diseases.

Appendices
Fast gradient sign method FGSM ( [40]) : uses the gradient of the loss function to determine in which direction the input data features should be changed to minimize the loss function : is a tunable parameter. Iterative Fast Gradient Sign Method (IFGSM) consists in adding the perturbation iteratively [47]. In our context, given any input training sample z i ( a row of the training input matrix X) and its corresponding one-hot label y c , we pertub it in the direction of the input space which yields to the hightest decrease of the loss function C W * , using the Targeted Iterative Fast Gradient Sign Method (IFGSM) given by the formula :

Thin parameters for fat genomics
We represent in Table 2, different results from [33].

Authors' contributions
The author(s) read and approved the final manuscript.

Funding
This project was partly funded by H3ABioNet, which is supported by the National Institutes of Health Common Fund under grant number U41HG006941. The content of this publication is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.