 Research
 Open Access
 Published:
Differential coexpression network centrality and machine learning feature selection for identifying susceptibility hubs in networks with scalefree structure
BioData Mining volume 8, Article number: 5 (2015)
Abstract
Background
Biological insights into group differences, such as disease status, have been achieved through differential coexpression analysis of microarray data. Additional understanding of group differences may be achieved by integrating the connectivity structure of the differential coexpression network and pergene differential expression between phenotypic groups. Such a global differential coexpression network strategy may increase sensitivity to detect genegene interactions (or expression epistasis) that may act as candidates for rewiring susceptibility coexpression networks.
Methods
We test two methods for inferring Genetic Association Interaction Networks (GAIN) incorporating both differential coexpression effects and differential expression effects: a generalized linear model (GLM) regression method with interaction effects (reGAIN) and a Fisher test method for correlation differences (dcGAIN). We rank the importance of each gene with complete interaction network centrality (CINC), which integrates each gene’s differential coexpression effects in the GAIN model along with each gene’s individual differential expression measure. We compare these methods with statistical learning methods ReliefF, Random Forests and Lasso. We also develop a mixture model and permutation approach for determining significant importance score thresholds for network centralities, ReliefF and Random Forest. We introduce a novel simulation strategy that generates microarray case–control data with embedded differential coexpression networks and underlying correlation structure based on scalefree or ErdosRenyi (ER) random networks.
Results
Using the network simulation strategy, we find that ReliefF and reGAIN provide the best balance between detecting interactions and main effects, plus reGAIN has the ability to adjust for covariates and model quantitative traits. The dcGAIN approach performs best at finding differential coexpression effects by design but worst for main effects, and it does not adjust for covariates and is limited to dichotomous outcomes. When the underlying network is scale free instead of ER, all interaction network methods have greater power to find differential coexpression effects. We apply these methods to a public microarray study of the differential immune response to influenza vaccine, and we identify effects that suggest a role in influenza vaccine immune response for genes from the PI3K family, which includes genes with known immunodeficiency function, and KLRG1, which is a known marker of senescence.
Background
In coexpression analysis, the correlation between pairs of genes is typically combined into a network model of the correlation structure, which facilitates secondary network analysis such as community structure or centrality [1]. However, the correlation between pairs of genes in a coexpression network typically is assumed to be uniform across all samples (e.g., tissue types, treatment conditions, disease status, etc.). Yet it is often intergroup differences in correlated data that are of biological or clinical interest. For example, a gene coexpression network in microarray data for chronic lymphocytic leukemia using known biomarkers was able to predict treatment outcomes in an independent sample [2]. A differential coexpression network approach that leverages the genetic network information may yield novel biomarkers and improved prediction.
Differential expression methods compute the mean difference between groups for each gene but typically do not incorporate conditional variation from other genes in the data that may help explain the betweengroup variation. Differential coexpression computes the mean pairwise correlation difference between groups [3]. While the change in a gene’s expression may influence the phenotype in isolation or have only a single pairwise interaction with another gene, it is more likely that changes in a gene’s expression will have a cascading effect with the emergence of multiple differential coexpression effects due to the underlying biological network structure. In the current study, we combine differential expression and differential coexpression effects into a single network model and determine the importance of genes based on a network centrality score that models additional phenotypic variation from gene expression data.
The ability to detect susceptibly hubs in differential gene expression networks depends on the properties of the underlying biological network. For example, scalefree networks exhibit a powerlaw distribution and are characterized by having a few hubs and many nodes with low degree [4]. It is known that the targeted mutation of hubs (central proteins) in yeast proteinprotein interaction networks is more likely to be lethal than the mutation of low degree (noncentral) proteins, which is referred to as the centralitylethality rule [5]. A variety of biological networks have displayed evidence of scalefree behavior, such as metabolic networks [6], protein–protein interaction networks [7] and transcriptional networks [8].
For a scalefree biological network the probability of a random mutation occurring to a hub is small relative to nonhubs; thus, hubs may be probabilistically insulated by the presence of many noncentral nodes. Despite this protected status of hubs, there is a potential for hubs to show a differential coexpression effect without themselves being mutated. The potential for this sideeffect can be understood by noting that random mutations are more likely to occur to nonhubs, but a mutated nonhub has a high probability of being connected to a hub and, hence, this hub may show differential coexpression despite not being mutated.
In previous work we developed a variation of PageRank centrality to find susceptibility hubs for epistasis network analysis of rare and common variant data [911]. We utilized this centrality with our epistasis network inference method called regressionbased Genetic Association Interaction Network (reGAIN) that combines main effects and epistatic effects into a network model of a given phenotype. Epistasis is the deviation from the additive effect of DNA variants, but a similar effect can be observed at the expression level, where the phenotypic effect of one gene is modified depending on the expression of another gene [12,13]. Differential coexpression represents an example of this more general “expressionepistasis” effect.
Testing on simulated data is important for validating the proposed differential network methods. However, there is a lack of methods to simulate artificial gene expression data with differential coexpression or expressionepistasis effects while also including a realistic underlying network structure. Thus, we introduce a networkbased simulation algorithm for constructing artificial gene expression data sets to assess the ability of statistical methods to identify significant hubs of differential coexpression. The underlying networks are designed to have degree distributions that may be either scalefree or ErdosRenyi (ER) random.
This network simulation strategy, which uses specific degree distributions and random disruptions to the correlation within the case group, allows us to address the effect of the degree distribution on the ability to detect differential coexpression effects, network transitivity and other network effects. We assess the true and false positive rates under a variety of simulation conditions. We compare the ability to identify genes involved in differential coexpression for different edge inference approaches, including Fisher transformed ztest for differential correlation (dcGAIN) with a ttest on the diagonal and the generalized linear regression model (reGAIN).
Unlike statistical inference testing methods based on analytical nullhypothesis distributions, network centrality scores and ReliefF scores lack an analytical null distribution. Thus, we use permutation and mixture model density estimation to determine critical values of the centrality scores for null model rejection for genes in the network. The mixture model concept is similar to approaches for modeling microarray data pvalues as a mixture of null and alternative densities [14,15]. We also use the mixture model and permutation methods to find statistical thresholds for machine learning comparison methods ReliefF [12,16] and Random Forest [17] importance scores. To understand the role of main effects in the network, we include Lasso as a comparison method [18]. In addition to realistic artificial data, we apply the methods to a seasonal influenza study with pre and postvaccination microarray and antibody response data [19].
Methods
dcGAIN and reGAIN for constructing the interaction network
Prior to centrality analysis for ranking genes, described below, we must construct a matrix that encodes the statistical interaction or differential coexpression between genes. There are multiple ways to calculate the statistical interaction between genes in a genetic association interaction network (GAIN). Here we describe two methods for constructing the matrix elements of the interaction network.
Regression GAIN (reGAIN) using the generalized linear model
To model differential coexpression between transcripts in reGAIN, we use the generalized linear model with a full interaction logistic regression model [20]:
The values of the outcome variable D are the case (D = 1) and control (D = 0) status. The predictor variables are the gene expression levels for genes G_{ i } and G_{ j }. In the reGAIN, we use the standardized coefficient for the multiplicative interaction term, b_{ij}, as the offdiagonal elements of the interaction matrix A in the gene centrality calculation (in Eq. 5). The diagonal elements of the reGAIN matrix are the regression coefficients for a singlegene model.
Differential coexpression GAIN (dcGAIN) using the Fisher Ztest
To model differential coexpression in dcGAIN, we use the Fisher Ztest [21] by the following steps. First the correlation is calculated between pairs of genes i and j for subjects within each phenotype group, where the groups again are specified by D = 1 (cases) and D = 0 (controls):
The withingroup correlation values are Fisher ztransformed:
Finally the following test statistic is computed for the difference of the ztransformed correlation between groups D = 1 of size m_{1} and D = 0 of size m_{0} for genes i and j:
For the dcGAIN offdiagonal elements of A in Eq. (5), we use Z_{ij}, the Fisher Ztest for intergroup difference in correlation between genes i and j. For the dcGAIN diagonal elements of A, we use a ttest for the individual genes.
Interaction network centrality
To compute the importance of genes that show differential coexpression or statistical interaction effects at the expression level (expression epistasis), we use a generalization of a centrality approach that we developed for epistasis networks from GWAS data, called SNPrank [911]. Here we briefly discuss the relevant aspects of the method and modifications for gene expression interaction networks. This algorithm operates on a network, encoded as a weighted matrix, A, with diagonal terms that correspond to the main effect association of the gene on the phenotype and offdiagonal terms that correspond to the interaction effect on the phenotype or differential coexpression. The matrix elements of A are created either with dcGAIN or reGAIN, discussed above, and then we use the Complete Interaction Network Centrality (CINC) method to calculate each gene’s importance. “Complete” refers to the inclusion of main effects and interactions.
The CINC works by solving the following system of N equations for the vector R whose values correspond to the CINC centrality score for the corresponding gene or transcript i:
where N is the number of genes and A is a weighted matrix of size NxN. For reGAIN, the diagonal elements of A correspond to main effect estimates from a single gene logistic model and offdiagonal elements correspond to statistical interactions b_{ij} in Eq. (1). For dcGAIN, offdiagonals are the Z_{ij}, the Fisher Ztest for intergroup difference in correlation between genes i and j, and diagonal elements of A are ttests for the individual genes. The Tr(A) is the trace of the A matrix, k_{j} is the jth element of the weighted degree vector of A (row sums of A), and γ is the socalled damping factor, which we usually assign the value 0.85 based on simulation studies of epistasis networks [11]. The 1/N terms in Eq. 5 give all genes a uniform baseline importance. One can see from this equation that the importance of gene i depends on its main effect (A_{ii}) and the linear combination of importance scores of all of its connections (A_{ij}*R_{j}).
Statistical thresholds for determining significance
A statistical distribution is not known for network centrality scores or ReliefF importance scores for calculation of statistical significance. Thus, we implement two approaches for setting statistical thresholds for significant gene associations from the centrality and importance scores, namely, a permutation approach and a mixture model approach.
Permutation algorithm
The permutation algorithm to determine significance is as follows:

Compute observed data importance scores for all genes.

Permute the data class labels mPerm times and accumulate an array of mPerm scores for each gene.

Find the 95^{th} percentile score threshold for each gene’s permutation score array.

Compare the observed score of each gene (unpermuted data) with its permutation threshold.

Count the gene as a significant association if the gene’s score exceeds the threshold, else gene is nonsignificant
Mixture model algorithm
Due to the computationally intense nature of permutation, we also propose a twomode Gaussian mixture model (GMM) clustering approach to the centrality, ReliefF and Random Forest scores to determine whether a gene’s score comes from the null density or the alternative density. We assume the density of scores comes from a linear combination of Gaussian distributions, which we find is a reasonable reflection of the scores. We use expectation maximization to estimate the parameters for the null and alternative densities. For each gene’s score, we compute the likelihood that the score belongs to the null density and the likelihood that the score belongs to the alternative density. The gene is classified as a significant (or nonsignificant) depending on whether the likelihood for belonging to the alternative is greater (or less) than the null likelihood.
Network simulation strategy
For this study, we develop a strategy for simulating case–control microarray data with differentially coexpressed genes and differentially expressed genes (outlined in Figures 1 and 2). The strategy builds a dataset with a baseline correlation network structure followed by “random attacks” of genes in the disease/cases group to disrupt coexpression of the attacked genes’ connections. Thus, differential coexpression emerges through the random disruption of the underlying correlation structure. The first step involves simulating an initial connectivity, encoded as an undirected adjacency matrix. This adjacency matrix is the wiring of the healthy control coexpression, and it is the starting point for rewiring the disease (cases) coexpression. We constrain the initial adjacency matrix to have one of two degree distributions: scalefree or ER. For the scalefree simulations, we use the preferential attachment algorithm [4]. While there is a great deal of evidence for scalefree networks in biology, it is not clear that this is always the case for coexpression networks. Thus, as an alternative we test network construction and centrality feature selection algorithms for simulated ER networks, which use a uniform probability to determine whether or not nodes are connected. An example of a scalefree network is shown in Figure 1 (step 1) with the corresponding powerlaw degree distribution.
For a given set of simulation parameters, we create 100 replicate data sets. The simulation parameters are the number of permuted genes that cause differential coexpression (n), the total number of genes (N), the sample size (M = cases + controls), the noise (standard deviation) in the correlation between genes, and fold change. We introduce two types of randomness for each replicate: noise in the theoretical correlation structure and the genes selected for permutation. For the total number of genes we use N = 100. Thus, in practice, we assume filtering is performed agnostic to outcome such as low value and low variance. We consider sample sizes (M = cases + controls) from the relatively small 20 (10 cases and 10 controls) to the more moderate M = 40 samples. The amount of Gaussian noise (standard deviation) added to correlated genes ranges from .05 (strong correlation) to 3 (weak correlation). We also generate simulations that contain differential expression (main effects) with fold change up to 2fold. A standard deviation of the gene intensity measurements on the basetwo logarithmic scale value of 0.7 is realistic for genes that are expressed at moderate to high levels [22].
Performance metrics
For simulated data with N total genes and n perturbed genes, we define the true positive rate (TPR) as the number of genes with centrality scores assigned to the high GMM mode that were also perturbed (true positives) divided by the total number of perturbed genes (m positives). Similarly we define the false positive rate (FPR) as the number of genes with centrality scores assigned to the high GMM mode that were unperturbed in the simulated data (false positives) divided by the total number of genes that were unperturbed (Nn). Performance evaluation works similarly for the permutation test approach for identifying significant centrality weights.
Microarray data
We apply ReliefF, Lasso, dcGAIN and reGAIN network construction with CINC to a publicly available influenza vaccine dataset (GEO: GSE29619). We adjust for sex in the reGAIN models, which cannot be done easily with the other methods. In this study, 28 healthy adults were vaccinated during the 2008 influenza season with trivalent inactivated influenza vaccine (TIV) while measuring gene expression levels before and 7 days after vaccination [19]. Antibody titers against the influenza virus were recorded before and 28 days after vaccination. For our statistical analyses, we employed the day 7 versus baseline gene expression change as the predictors and high and low antibody titers at day 28 as the phenotype as defined in Ref. [19]. Similar to the goal of the original study, our application seeks to identify genetic effectors associated with differential immune response. However, while the previous study only analyzed the individual effects of gene expression on the antibody titer phenotype, our combined network approach accounts for both individual and interactive effects when prioritizing genetic effects.
Results
Simulation analysis
The main goal of this study is to test the performance of methods to identify genes that are involved in changes in coexpression between groups. Thus, we compare the true positive rates (TPR) and false positive rates (FPR) for detecting the 10% of genes that were targeted in the simulated differential coexpression data. The models include varying amounts of network correlation noise (standard deviation) and either 20 or 40 samples (balanced cases and controls). We calculated the TPR and FPR for each method across 100 replicates for each model. For scalefree differential network simulations and the permutation method for assessing significance (Figure 3), we find the following TPR order (highest to lowest): dcGAIN + CINC, ReliefF, reGAIN + CINC, Random Forest and Lasso. The FPR for all methods are very low for all methods using permutation (Figure 3). When using GMM to determine significance (Figure 4), we find higher TPR for all methods with the same relative order as with permutation. However, permutation has the advantage of lower FPR compared with the GMM. For simulated ER networks, the methods have lower TPR than their analysis of scalefree differential coexpression networks but also slightly lower FPR (Figures 5 and 6). Lasso has very low TPR for all differential network simulations because the Lasso only includes main effect terms.
In addition to differential coexpression effects, we expect many genes to show individual (main effect) differential expression between groups. Thus, we compare the TPR and FPR of the panel of methods to detect main effects. The simulations are similar to the differential coexpression simulations but instead of varying the correlation noise, we increase the effect size of the 10% of main effect genes up to 2fold. For the permutation testing to determine significance (Figure 7), we find the order of TPR performance is ReliefF, Random Forest, reGAIN + CINC, Lasso, and dcGAIN + CINC. The main effect TPRs for all methods are similar with the exception of dcGAIN + CINC, which is distinctly lower than the others. Using GMM for significance in the main effect simulations (Figure 8), the TPRs are higher than the permutation testing approach (Figure 7) but follow the same trend. As is the case for the interaction simulations with GMM and permutation (Figures 3 and 4), the main effect simulations with GMM tend to have higher FPR (Figure 8) than permutation testing (Figure 7). While dcGAIN + CINC performs best for the interaction simulations, it performs worst for main effects. ReliefF shows the best overall performance at detecting both main effects and interaction effects. As with interaction simulations, these methods show higher TPR when using GMM, but permutation has the advantage of lower FPR. For computational expediency, we simulate only 100 genes with 10 target genes for each replicate because we create 100 replicate data sets for each simulation scenario and we use permutation testing in many cases (Figures 3, 5 and 7). However, we show for real data that all probes can be analyzed, though in practice filtering is recommended.
Microarray analysis of differential immune response to influenza vaccine
We apply the feature selection methods reGAIN + CINC, dcGAIN + CINC, ReliefF and Lasso to a publicly available influenza vaccine dataset (GEO: GSE29619), results in Additional file 1: Table S1. For the interaction network based methods, we first filter the gene expression probes to the top 5,000 probes based on univariate regression pvalues, though genes with no univariate effect could be implicated through only interactive effects. Since this filter is not aggressive, we do not cross validate this step even though it was not agnostic to outcome. After filtering the original probes, we apply dcGAIN and reGAIN to the 5,000 remaining transcripts. After each network was constructed, we apply the CINC centrality method to identify the most significant hubs and other effects in each network.
Significant genes for reGAIN + CINC and ReliefF show enrichment of PI3Krelated pathways. Reactome pathways enriched in ReliefF and reGAIN include “Genes involved in Negative regulation of the PI3K/AKT network” (6.3e4) and “Genes involved in PI3K cascade” (1.72e3). The specific genes found include PIK3R5 and AKT3 – found by both reGAIN and ReliefF – and PTEN and FGF23, found by ReliefF only. The PIK3R5 gene was also selected by Lasso but not dcGAIN, indicating a consensus among methods for the main effect of this gene. This PI3K pathway signature is biologically relevant to differences in influenza vaccine immune response because loci in the PIK3CD gene are associated with an immunodeficiency syndrome that presents with recurrent respiratory infection, increased circulating transitional B cells, and impaired vaccine response. A recent study found a gainoffunction rare variant (nonsynonymous) in PIK3CD for the syndrome and increased levels of phosphorylated AKT protein from patientderived lymphocytes [23].
Differential coexpression hub analysis (Figure 9) reveals that PIK3CD has a strong negative hub effect but very low main effect on immune response. This type of epistasis analysis tool is similar to that used for visualizing interaction effects from double mutant strains of yeast [24]. A negative interaction for the vaccine immune response outcome represents a joint effect that leads to decreased immune response to the vaccine. A negative hub is a gene whose sum of negative interactions outweighs its positive interactions; such genes fall below the null line (Figure 9). In addition to global interaction effects, the positive/negative hub plot also includes main effect information based on size and color of the plot symbol. For example, PIK3CD, which has been shown to affect immune response to vaccination, displays an important effect as a negative differential coexpression hub. However, PIK3CD has a negligible effect by univariate analysis of this microarray study, indicated by the small plot symbol for the gene.
Our simulation results show that dcGAIN was the highest performer at finding interactions but worst at finding main effects. This is corroborated in the data analysis in which the dcGAIN + CINC results have the smallest intersection of genes with Lasso (only 1 out of 17 genes). Whereas ReliefF find 13/17 and reGAIN + CINC find 5/17 main effect genes. Thus, many of the genes for differential immune response to influenza vaccine found by dcGAIN are likely due to substantial interaction effects.
For example, one of the dcGAIN hubs is the killer cell lectinlike receptor G1 (KLRG1), which has been used to define populations of senescent effector CD8 T cells in mice and humans [25]. Additionally, influenza virusspecific CD8 T cells showed a decrease in functionality corresponding to increases in KLRG1 [26]. Moreover, another interaction hub identified by the dcGAIN + CINC approach was the cyclindependent kinase 13 (CDK13) gene, which has been linked to increase viral production [27]. Taken together, these observations indicate that a differential correlation network structure has the power to uncover biological effectors that implicate both general and specific immune processes that other statistical methods do not uncovered. In addition to the PI3K pathway, the novel link to general viral regulator gene (CDK13) and a specific influenza T cell gene using transcriptomics (KLRG1) demonstrate the utility of our proposed framework.
Discussion
Most gene expression analyses focus on identifying genes or sets of genes that show differential expression between phenotypes based on a univariate statistic. However, the coexpression between two genes may be conditional on the phenotypic or biological context. In other words, pairs of genes may be differentially coexpressed, whereby the wiring between two genes in a healthy or homeostatic network switches or is disrupted in a disease or perturbed network. Furthermore, influential gene hubs that discriminate between phenotype may be identified through the agglomeration of the univariate and pairwise interactions in a condition specific gene network model. We used two methods for estimating the edges in these genetic association interaction networks (GAIN): Fisher ztest for differential correlation (dcGAIN) and a GLM regression model approach with genegene interaction terms (reGAIN). We applied our interaction network centrality algorithm (CINC, Eq. 5, a generalization of SNPrank for GWAS) to identify important susceptibility hubs and candidate genes for network rewiring. In addition, we compared network feature selection methods with ReliefF, Random Forests, and Lasso.
In order to assess the effects of correlation structure in a controlled way, we introduced a differential coexpression network simulation strategy that incorporated realistic network structures such as scalefree and ER, and we used random mutation to induce differential coexpression. As expected, Lasso was unable to detect the simulated differential coexpression effects because we did not include interactions in the Lasso. The other methods, which model conditional dependence between genes, tended to have greater power to detect susceptibility genes in the scalefree networks compared with ER. For reGAIN and dcGAIN, this increased power in scalefree networks is partly attributable to CINC centrality; centralities are sensitive to hub effects, and scalefree networks are characterized by hubs, whereas nodes in ER networks have uniform degree on average. Regardless of the underlying network degree distribution, dcGAIN had the highest power to detect the differential coexpression effects, followed by reGAIN and ReliefF – which had very similar performance – and finally Random Forest had the lowest power. We have shown previously that Random Forest is limited in its ability to find genegene interactions [28]; however, it performed reasonably well in these interaction simulations because we limited the number of simulated background genes.
Although our motivation for using network and machine learning approaches was to detect additional variation due to interaction effects, for completeness we also tested these approaches on main effect simulations. While dcGAIN performed best for differential coexpression interactions, it had the lowest power and highest false positive rate than the other methods for main effects. The test statistics for differential correlation on the offdiagonal tended to be larger than the test statistics for the main effect tests on the diagonal in these simulations, even in the absence of differential correlation. It may be that the larger number of offdiagonal differential correlation terms (n (n1)/2 of them) masks the smaller number of main effect terms (n of them) in the CINC statistic. This discrepancy between dcGAIN interaction and main effect detection perhaps may be addressed by bringing dcGAIN into a regression framework. The main effects pose less difficulty for reGAIN, which uses regression coefficients for the diagonal and off diagonal. ReliefF performs best for main effect simulations, which, coupled with its relatively good performance on interactions, suggests ReliefF is a good allpurpose filter.
One of the challenges for modelfree feature selection methods, like network centralities or ReliefF, is determining the statistical significance of feature scores. Thus, we introduced two methods to assess the statistical significance of the CINC centrality scores for dcGAIN and reGAIN and for ReliefF and Random Forest importance scores: a mixture model approach and permutation testing. The mixture model approach tends to give greater power and greater computational speed, but at the expense of more false positives than the permutation approach. On a related note, we do not use permutation when calculating the GLM interaction models of reGAIN, but instead we use the usual pvalues and standardized beta coefficients. It has been shown for genetic data that permutation must be implemented carefully to handle the simultaneous interaction and main effect null hypotheses [29].
The Fisher ztest used in dcGAIN is designed to find correlation differences between groups, which made it better than reGAIN at finding the differential coexpression effects constructed in our simulations. However, the GLM framework used by reGAIN can be used for quantitative traits and any phenotype that can be modeled with an exponential family distribution, including timetoevent phenotypes. Further, the GLM framework can adjust for covariates, like sex, which is known to affect immune response. Thus, reGAIN + CINC provides more modeling flexibility plus it balances interaction and main effects better than dcGAIN.
Our application of these interaction network methods to a microarray study of the differential immune response to influenza vaccine found novel markers that are missed by main effect analysis. From our network centrality and machine learning analysis, we identified PI3K related genes, which have been previously demonstrated to be effectors in human immune responses [23] but would be missed by a univariate analysis of the influenza vaccine microarray study. Similarly, we found a differential coexpression network effect for the KLRG1 gene, which plays a role in immunosenescence in influenza virus specific CD8 T cells [26], and the CDK13 gene, which is associated with viral production [27], but a conventional differential expression analysis does not identify these genes as an effector. Ultimately, our findings implicate novel effectors in viral activity and validate previously identified influenza effects through transcriptomics analyses. The identification of these biomarkers combined with our simulated results demonstrate the advantages of using machine learning and differential coexpression network centrality to augment univariate approaches to identify functional effectors in microarray data.
References
 1.
Langfelder P, Mischel PS, Horvath S. When is hub gene selection better than standard metaanalysis? PLoS One. 2013;8:e61505.
 2.
Zhang J, Xiang Y, Ding L, KeenCircle K, Borlawsky TB, Ozer HG, et al. Using gene coexpression network analysis to predict biomarkers for chronic lymphocytic leukemia. BMC Bioinformatics. 2010;11 Suppl 9:S5.
 3.
de la Fuente A. From ‘differential expression’ to ‘differential networking’  identification of dysfunctional regulatory networks in diseases. Trends Genet. 2010;26:326–33.
 4.
Barabasi AL, Albert R. Emergence of scaling in random networks. Science. 1999;286:509–12.
 5.
He X, Zhang J. Why do hubs tend to be essential in protein networks? PLoS Genet. 2006;2:e88.
 6.
Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL. The largescale organization of metabolic networks. Nature. 2000;407:651–4.
 7.
Jeong H, Mason SP, Barabasi AL, Oltvai ZN. Lethality and centrality in protein networks. Nature. 2001;411:41–2.
 8.
Babu MM, Luscombe NM, Aravind L, Gerstein M, Teichmann SA. Structure and evolution of transcriptional regulatory networks. Curr Opin Struct Biol. 2004;14:283–91.
 9.
Davis NA, Crowe Jr JE, Pajewski NM, McKinney BA. Surfing a genetic association interaction network to identify modulators of antibody response to smallpox vaccine. Genes Immun. 2010;11:630–6.
 10.
Davis NA, Lareau CA, White BC, Pandey A, Wiley G, Montgomery CG, et al. Encore: Genetic Association Interaction Network Centrality Pipeline and Application to SLE Exome Data. Genet Epidemiol. 2013;37:614–21.
 11.
Pandey A, Davis NA, White BC, Pajewski NM, Savitz J, Drevets WC, et al. Epistasis network centrality analysis yields pathway replication across two GWAS cohorts for bipolar disorder. Transl Psychiatry. 2012;2:e154.
 12.
McKinney BA, White BC, Grill DE, Li PW, Kennedy RB, Poland GA, et al. ReliefSeq: a genewise adaptiveK nearestneighbor feature selection tool for finding genegene interactions and main effects in mRNASeq gene expression data. PLoS One. 2013;8:e81527.
 13.
Park S, Lehner B. Epigenetic epistatic interactions constrain the evolution of gene expression. Mol Syst Biol. 2013;9:645.
 14.
Allison DB, Gadbury GL, Moonseong H, Fernandex JR, Lee CK, Prolia TA, et al. A mixture model approach for the analysis of microarray gene expression data. Comput Stat Data Anal. 2002;39:1–20.
 15.
Pounds S, Morris SW. Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of pvalues. Bioinformatics. 2003;19:1236–42.
 16.
Kononenko I. Estimating Attributes: Analysis and Extensions of RELIEF. Eur Conf Mach Learn. 1994;l:171–82.
 17.
Breiman L. Random forests. Mach Learn. 2001;45:5–32.
 18.
Tibshirani R. The lasso method for variable selection in the Cox model. Stat Med. 1997;16:385–95.
 19.
Nakaya HI, Wrammert J, Lee EK, Racioppi L, MarieKunze S, Haining WN, et al. Systems biology of vaccination for seasonal influenza in humans. Nat Immunol. 2011;12:786–95.
 20.
McCullagh P, Nelder JA. Generalized linear models. Monographs on statistics and applied probability. London. New York: Chapman and Hall; 1983.
 21.
Cohen J, Cohen P. Applied multiple regression/correlation analysis for the behavioral sciences. Hillsdale, NJ: L. Erlbaum Associates; 1983.
 22.
Ballman KV. Genetics and genomics: gene expression microarrays. Circulation. 2008;118:1593–7.
 23.
Angulo I, Vadas O, Garcon F, BanhamHall E, Plagnol V, Leahy TR, et al. Phosphoinositide 3kinase delta gene mutation predisposes to respiratory infection and airway damage. Science. 2013;342:866–71.
 24.
Bandyopadhyay S, Mehta M, Kuo D, Sung MK, Chuang R, Jaehnig EJ, et al. Rewiring of genetic networks in response to DNA damage. Science. 2010;330:1385–9.
 25.
Voehringer D, Koschella M, Pircher H. Lack of proliferative capacity of human effector and memory T cells expressing killer cell lectinlike receptor G1 (KLRG1). Blood. 2002;100:3698–702.
 26.
Dolfi DV, Mansfield KD, Polley AM, Doyle SA, Freeman GJ, Pircher H, et al. Increased Tbet is associated with senescence of influenza virusspecific CD8 T cells in aged humans. J Leukoc Biol. 2013;93:825–36.
 27.
Berro R, Pedati C, KehnHall K, Wu W, Klase Z, Even Y, et al. CDK13, a new potential human immunodeficiency virus type 1 inhibitory factor regulating viral mRNA splicing. J Virol. 2008;14:7155–66.
 28.
McKinney BA, Crowe JE, Guo J, Tian D. Capturing the spectrum of interaction effects in genetic association studies by simulated evaporative cooling network analysis. PLoS Genet. 2009;5:e1000432.
 29.
Buzkova P, Lumley T, Rice K. Permutation and parametric bootstrap tests for genegene and geneenvironment interactions. Ann Hum Genet. 2011;75:36–45.
Acknowledgements
We would like to thank Steven H. Kleinstein and Gregory A. Poland for helpful discussions.
Funding
Research reported in this publication was supported by the National Institute of Allergy And Infectious Diseases of the National Institutes of Health, Department of Health and Human Services under award number U01IOFAI89859.
Author information
Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
BAM, BCW, ALO and CL contributed to the design of the study. BAM, ALO and CL wrote the manuscript. BCW, BAM, and CL performed statistical analyses. All authors read and approved the final manuscript.
Additional file
Additional file 1: Table S1.
Significant genes for high versus low HAI in baseline versus day7 gene expression following TIV 2008 influenza vaccination (GSE29619).
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Lareau, C.A., White, B.C., Oberg, A.L. et al. Differential coexpression network centrality and machine learning feature selection for identifying susceptibility hubs in networks with scalefree structure. BioData Mining 8, 5 (2015). https://doi.org/10.1186/s130400150040x
Received:
Accepted:
Published:
Keywords
 Gaussian Mixture Model
 Lasso
 Influenza Vaccine
 True Positive Rate
 Importance Score