Visualisation of quadratic discriminant analysis and its application in exploration of microbial interactions

Background When comparing diseased and non-diseased patients in order to discriminate between the aspects associated with the specific disease, it is often observed that the diseased patients have more variability than the non-diseased patients. In such cases Quadratic discriminant analysis is required which is based on the estimation of different covariance structures for the different groups. Having different covariance matrices means the Canonical variate transformation cannot be used to obtain a visual representation of the discrimination and group separation. Results In this paper an alternative method is proposed: combining the different transformations for the different groups into a single representation of the sample points with classification regions. In order to associate the differences in variables with group discrimination, a biplot is produced which include information on the variables, samples and their relationship.


Background
The biplot is a useful graphical method of exploring relationships in data. As the prefix 'bi-' suggests, both the samples and variables of a data matrix is represented in a biplot.
The simplest form of a biplot is the Principal component analysis (PCA) biplot which optimally represents the variation in a data matrix [1]. By representing the variables on a calibrated axes [2], sample values can be read-off the axes to reveal relationships between samples and variables.
Another popular plot is a Canonical variate analysis (CVA) plot representing the optimal linear discrimination between samples from different groups, based on the assumption of equal within group variance [3]. By ensuring an aspect ratio of 1:1 is maintained and adding the original variables as calibrated biplot axes, rather than representing the canonical variates which are a mixture of the original variables, a CVA biplot is obtained. The assumption of equal within class variance allows for a single canonical transformation of all samples in all groups to a single canonical space in which the CVA biplot is constructed.
When different groups of observations have different covariance structures, the canonical transformation is not optimal for group separation. For normally distributed data, the theoretical equivalent of Linear discriminant analysis (LDA) in the presence of different group covariance matrices is Quadratic discriminant analysis (QDA).
Varying covariance structures is often found when comparing diseased to healthy patients. The variables affected by the disease have certain typical values in healthy patients. When disease sets in, the values change and change in different ways for different patients and to a different extent depending on the severity of the disease. The result is that a lot more variability is observed for the diseased patients. In an effort to understand the effect of the disease, the differences in groups are analysed by discriminant analysis. Since the covariance matrices differ, QDA can be used, but a visual representation can shed more light on the exact relationships contributing to the differences between health and disease.
In this paper a QDA biplot is suggested to visually represent the optimal separation based on respiratory pathogens in a cohort of children with suspicion of Pulmonary Tuberculosis (TB) infection. In section 2 the known and established methodology of LDA and Canonical Variate Analysis (CVA) biplots is reviewed. Section 3 deals with QDA and the QDA biplot is introduced in section 4. An example is given in section 5 before, the QDA biplot is applied to the data set of respiratory pathogens in children with TB in section 6.

Linear discriminant analysis
We observe a set of n samples or observations on p variables, represented in the data matrix X:n × p which we can assume without loss of generality is centred around the origin so that 1 ' X = 0 '. Of these observations, n j belong to class j, with a total of J classes observed and X J j¼1 n j ¼ n . The class membership can be represented in a matrix G:n × J with g ij = 1 if sample i belongs to class j and 0 otherwise.
Fisher [4] defined LDA as a transformation that maximises the between class variance relative to the within class variance. This is closely related to CVA and multivariate analysis of variance (MANOVA) where the total variance is decomposed into a between class variance and within class variance part: The CVA biplot is constructed from the first r, usually r = 2, sometimes r = 3, columns of M, denoted by M r and the sample points is given by Z = XM r with class means Z = XM r . For more detail on the construction of the CVA biplot and fitting the biplot axes, see Gower and Hand [2] or Gower, Lubbe and le Roux [5].
Note that no assumption on the distribution of the data is made to derive the canonical transformation. However, if the data is normally distributed, such that X|G = jñ ormal(p, μ j , Σ W ), the discrimination function derived at based on equal prior probability of belonging to each of the classes and equal misclassification costs for all classes is equivalent to Fisher's LDA. The prior probabilities, i.e. the probability of belonging to class j, prior to observing the p variables in X, is denoted by π j =P(G=j)s where the discrete random variable G should not be confused with the indicator matrix G.
It is shown in Appendix A that classification of a sample is to the nearest canonical mean in the CVA biplot when the prior probabilities are equal and for unequal prior probabilities, a quantity of log(π j ) is simply added to the distance to the j-th class mean.

Quadratic discriminant analysis
It is assumed that the samples are random realisations from the underlying probability distributions X|G = j~normal(p, μ j , Σ j ) where the common within class covariance matrix Σ W is now replaced with J covariance matrices Σ j .
Where in LDA a sample is classified to class k where it is shown in Appendix B that classification of a sample is now to class k where Where classification for LDA was in terms of Euclidean distance in the canonical space, (u − u j ) ' (u − u j ), in QDA the classification function is of a similar structure, but now in terms of a function ϕ 2 j x ð Þ.

QDA biplot
First a simplified version is considered. Let J = 2 groups and the prior probabilities be In LDA an observation x is transformed to the canonical space, u ' = x ' M r and will be classified to class 1 if (u − u 1 ) ' (u − u 1 ) < (u − u 2 ) ' (u − u 2 ) and to class 2 otherwise. The equivalent QDA classification rule will be: classify to class 1 if x ð Þ gives a two-dimensional scatter plot with the classification boundary defined by the line y = x. Since QDA is specifically applicable in cases with very different covariance structures, it will often be a feature of this plot that one group is spread out while the other is extremely concentrated, typically close to the decision boundary. A better representation can be obtained by scaling each vector ϕ 2 j x ð Þ to unit standard deviation.
The different dimensions for plotting is already obtained by different transformations, therefore a scaling factor unique to each dimension will not add to the complexity of the representation.
Returning to the problem of J different classes, a different transformation is performed for each group. This creates J 'new' variablesφ 2 j x ð Þ−φ j =s j ; j ¼ 1; …; J . Let these be represented in a matrix Φ:n × J. In order to make a two-dimensional biplot, a principal component analysis on Φ gives the best two-dimensional representation of the J variables from with the transformations. The samples are represented by the first two principal components' scores in the biplot. To construct the classification regions, each point z : 2 × 1 in the biplot space is classified to class k ifφ 2 is obtained through back projection as described in Appendix C. Now the plot provides a representation of the samples and classification regions. The term biplot refers to the simultaneous representation of two features of a data set, usually the samples and the variables. The plot can be enhanced to form a biplot, by adding information on the variables. Already in 1978 Kruskal and Wish [6] suggested a regression method for adding linear relationships between the samples and variables in a two dimensional display. The construction of p>2 variables in the display with biplot axes, rather than vectors is discussed in detail in Gower and Hand [2], Greenacre [7] and Gower, Lubbe and le Roux [5].

An example
To illustrate the QDA biplot a simulated data set will be used. In section 3 it was mentioned that QDA is derived for data from J different normal distributions. Here we will use J=3 groups with different means and covariance matrices and 50 samples in each group. The QDA biplot is given in Figure 1. Since simulated data was used, the features of the data are known and it is clear that these features are well represented in the QDA biplot. We have μ 0 1 ¼ 1 1 1 1 ½ which has the lowest values for variables 2, 3 and 4 than Groups 2 and 3. From the biplot we see that group 1 has lower values for all variables except variable 1. Group 2 has more variation that the other two groups which is consistent with the diagonal values of Σ 2 , and lies between Groups 1 and 3. Orthogonally projecting onto the axes of variables 3 and 4, it is clear that Group 3 has the highest values, consistent with μ 33 = μ 34 = 5. In the example above, the data was simulated from a normal distribution so it is known that the QDA methodology is applicable to the specific data set. However, the application of respiratory pathogens contains only indicator variables with 0 = absence and 1 = presence of the pathogen. Before applying the QDA biplot on this data set, the simulated data set is converted into indicator variables with all values less than the median zero and all values larger than or equal to the median being made one. Categorising the data will lead to a loss of information, but we expect some degree of similarity in location and spread between the normally distributed data set and the indicator variable data set. The degree to which the QDA biplots of the two data sets represent the same location, spread and separation features will give an indication of how well the QDA biplot performs in cases where the data does not follow a normal distribution.
The QDA biplot of the indicator variable data set is given in Figure 2. With four variables which can each only take on one of two values (0 or 1), there is only 2 4 = 16 different response patterns possible. In the simulated data set, 15 of the 16 patterns occurred at least once. All identical patterns will be on the same point in the biplot. For each point the symbol displayed is found by majority vote. The same problem with different response patterns does not occur in the application in section 6 since a total of 15 pathogens yields 16,384 different response patterns.
In the QDA biplot in Figure 2 it is clear that the majority of the samples appear in their correct classification regions. This was also the case in Figure 1. The small differences in variable 2 disappear with the course coding and the three groups appear to be similar on variables 1 and 2. Group 1 has the lowest values (most zero's) for variables 3 and 4 while Group 3 has the highest values (most one's) for variables 3 and 4. Again Group 2 appears to be located between Groups 1 and 3. It is comforting to see that the primary location, spread and separation features of the data set did not change between Figures 1 and 2, although converting the data to indicator variables did lead to a loss of information. Moore [8] evaluates discrimination procedures for binary data. Here the focus is on obtaining a visualisation of how the variables relate to the different groups when separating groups with differences in covariance structure. The comparison of Figures 1 and 2 shows that the biplot remains a useful tool for exploring the variables contributing to differences between groups with unequal covariance matrices.

Application: Distribution of respiratory pathogens in a cohort of children with suspicion of pulmonary tuberculosis infection
In this section the QDA biplot will be illustrated with the data set that inspired the development of the plot. Medical researchers were interested in examining the distribution of respiratory pathogens detected in respiratory specimens from children presenting for care with symptoms suggestive of pulmonary tuberculosis. The children are classified into one of three groups: definite-TB (microbiologically confirmed), non-TB (microbiologically confirmed) and possible-TB (microbiologically excluded). Detailed microbiological methods are published elsewhere (In Press). Among other analyses, QDA was performed on the definite and no-TB groups since the possible-TB patients are actually unclassified members of the former two groups. The principal interest of the researchers is to associate some pathogens with the clinical manifestation of definite-TB and some with no-TB. The QDA biplot is given in Figure 3. The method of orthogonal parallel translation of the biplot axes as detailed in Gower, Lubbe and le Roux [5] was applied to move the biplot axes out of the way of the samples to obtain a clearer plot. Since the primary interest of this analysis is not classification, the biplot is a useful tool to visualize how the variables relate to the definite and no-TB groups. Pathogens 2 to 9 are all associated with TB while pathogens 10, 11, 13 and 14 are associated with the no-TB group. Pathogens 1, 12 and 15 seem to have a mixture of definite and no-TB patients. The spread of the sample points from zero on the left towards higher pathogen values in a triangle shape show that for the definite-TB group some patients have little, if any, of the pathogens while some others have some combination of pathogens 2 to 8. Pathogen 9 is the exception which seems to be negatively correlated with pathogens 2 to 8. Similarly, some no-TB patients have few or no of pathogens 10, 11, 13 or 14 while others have a combination of these.
In a pilot study, the visual aid of the biplot provides an easily understandable aid to which pathogens relate to which of the two groups. Actually a total of 33 pathogens were measured, but those not really contributing to the discrimination between definite-TB and no-TB are not shown here.

Conclusion
In cases where the variance between groups differs, QDA should be applied with the estimation of different covariance matrices for different groups. A transformation based on the optimal classification of samples from normal distributions is suggested to construct a QDA biplot. In the biplot both the samples, with classification regions, and original variables are represented, showing the relationships between different groups and the various variables. Through a simple simulation, it was verified that the main characteristics of the plot remains intact, even if the assumption of normality is not justified.
The QDA biplot is not designed in the first place for optimal classification of samples, as this can be performed algebraically with many software programmes. The main purpose of the QDA biplot is to provide a visual representation of the relationships between samples in a specific group and the variables measured.

Appendix A: Linear Discriminant Analysis
For classification of an object the posterior probability of belonging to each of the J groups is calculated, π j|x = π j f X|G (x|G = j), and the sample is classified to the group with largest posterior probability, arg max j π j x j . The posterior probabilities needs to be estimated from the observed data and for the methodology applied in sections 3 and 4, it is important to look at the log odds of the estimated posterior probabilities. Using the estimates x j and pooled sample covariance matrix S p a sample is classified to class J if logπ jj x π Jj x < 0; j ¼ 1; …; J−1 class k if logπ jj x π Jj x < logπ kj x π Jj x ; j ¼ 1; …; J−1; j≠k where the log odds can be written as This means that a sample is classified to group k where n o and the sample x is classified to the group with largest posterior probability, arg max j log π j À Á − 1 2 ϕ 2 j x ð Þ n o .