We present an approach to identify genomic regions that remain stable despite slight perturbations to the underlying data. We begin by training a regularized machine learning classifier to predict a disease phenotype from genomic regions; we then implement an algorithm based on maximum flow to identify and extract stable features. We apply our technique to autism spectrum disorder (ASD) and validate our findings.
Measuring feature stability
We begin by constructing a binary feature matrix to encode the genome, with rows representing samples and columns corresponding to genomic regions. Genomic regions can encompass a variety of features, such as single nucleotide polymorphisms (SNPs), variants, or genes; in this analysis, we focus specifically on variants, which include both SNPs and short indels. In order to generate a binary representation, we assign each cell a 1 if the sample carries a homozygous alternate or heterozygous variant and a 0 if the base pairs at the site match the reference sequence.
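This encoding can be sketched as follows; the genotype values here are illustrative, assuming calls are given as alternate-allele counts (0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate):

```python
import numpy as np

# Hypothetical genotype matrix: rows are samples, columns are genomic
# regions; entries count alternate alleles at each site.
genotypes = np.array([
    [0, 1, 2],
    [2, 0, 0],
    [0, 0, 1],
])

# Binarize: 1 if the sample carries any alternate allele at the site
# (heterozygous or homozygous alternate), 0 if it matches the reference.
X = (genotypes > 0).astype(int)
print(X)
```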
Then, we create a training set containing 80% of the samples in the dataset. A model with stable features is resistant to slight perturbations in the underlying data, which increases confidence that top-ranked features are correlated with the phenotype [30, 31]. In a machine learning setting, cross-validation can be used to perturb the underlying data used to train a model and thereby obtain stability measurements [32]. To estimate feature stability, we perform k-fold cross validation across the training set with an L1-regularized logistic regression classifier, selecting k = 5 in this work. For each of the five validation folds, we extract the list of all variants with non-zero coefficient scores.
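A minimal sketch of this step, using a small synthetic matrix in place of the real variant data (the solver choice and regularization strength are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Toy stand-in for the binary variant matrix (the real data has
# hundreds of thousands of columns).
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

folds = []  # one set of selected variant indices per validation fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, _ in skf.split(X, y):
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    clf.fit(X[train_idx], y[train_idx])
    # Variants with non-zero coefficients are retained for this fold.
    folds.append(set(np.flatnonzero(clf.coef_[0])))

print([len(s) for s in folds])
```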
Three metrics were used to characterize stability: (1) the Pearson correlation coefficient, which measures the similarity between the coefficient scores assigned to each variant by the logistic regression model, (2) the Kendall-Tau score, which measures the correlation between the ranked variant lists, and (3) the Jaccard similarity index, which measures overlap between two sets of features [32, 33]. Given two lists of features A and B with associated coefficient scores w and v, the Pearson correlation coefficient is computed as \(\left[{\sum}_i\left({w}_i-\overline{w}\right)\left({v}_i-\overline{v}\right)\right]/\sqrt{\sum_i{\left({w}_i-\overline{w}\right)}^2{\sum}_i{\left({v}_i-\overline{v}\right)}^2}\), with values ranging between − 1 (perfect negative correlation) and + 1 (perfect positive correlation). The Kendall-Tau score is computed as (P − Q)/(P + Q), where P represents the number of concordant pairs and Q represents the number of discordant pairs when the features are arranged in rank order; it also ranges between − 1 and 1. The Jaccard index is computed as ∣A ∩ B ∣ / ∣ A ∪ B∣, with values ranging between 0 (no overlap) and 1 (identical). We computed these metrics for all ten pairwise comparisons of the five feature lists. This provides an initial measure of stability prior to implementation of our maximum flow formulation (Fig. 4).
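The three metrics can be computed for a pair of folds as sketched below; the coefficient values are hypothetical, and we make the (illustrative) assumption that variants absent from a fold receive a weight of 0 when aligning the two score vectors:

```python
import numpy as np
from scipy.stats import pearsonr, kendalltau

# Hypothetical coefficient scores from two validation folds, keyed by
# variant identifier.
fold_a = {"v1": 0.9, "v2": 0.5, "v3": 0.1}
fold_b = {"v1": 0.8, "v2": 0.4, "v4": 0.2}

# Align both folds on the union of selected variants, filling 0 for
# variants a fold did not select.
union = sorted(set(fold_a) | set(fold_b))
w = np.array([fold_a.get(key, 0.0) for key in union])
v = np.array([fold_b.get(key, 0.0) for key in union])

pearson = pearsonr(w, v)[0]   # similarity of coefficient scores
tau = kendalltau(w, v)[0]     # agreement of the rank orderings
jaccard = len(set(fold_a) & set(fold_b)) / len(set(fold_a) | set(fold_b))

print(round(pearson, 3), round(tau, 3), round(jaccard, 3))
```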
Maximum flow formulation
In this section, we discuss the maximum flow algorithm, which is implemented as a post-processing step to extract stable variants from the outputs of the regularized machine learning model created in the previous section. Biological datasets often present the characteristic of feature interdependence, which can enable informed dimensionality reduction based on domain knowledge [34]. In this work, we use the presence of LD to group correlated variants identified as predictive by the 5-fold cross validation performed in the previous section. Such an approach ensures that the model remains resistant to perturbations in training data and outputs interpretable results that can be utilized to measure the contribution of individual variants to disease phenotype. Features identified by this method are stable variants that are potentially associated with the manifestation of ASD.
LD is a genetic phenomenon that results in non-random correlation between a group of variants or alleles. A pair of variants is said to be in LD if the observed frequency of a particular haplotype in the genome is higher than the expected frequency [35]. LD is typically measured using the R2 metric, which ranges from 0 to 1, with larger values representing stronger association between variants [36]. It is important to note that measurements of LD in a sample population tend to be sensitive to the presence of rare variants, since limited data affects resulting scores.
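One common genotype-based estimator of R2, shown here as a sketch (the source does not specify which estimator was used), is the squared Pearson correlation of per-sample alternate-allele counts:

```python
import numpy as np

def ld_r2(g1, g2):
    """Estimate LD between two variants as the squared Pearson
    correlation of their per-sample alternate-allele counts (a common
    genotype-based approximation of r^2)."""
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    c = np.corrcoef(g1, g2)[0, 1]
    return c * c

# Two perfectly linked variants (identical genotype vectors) give an
# r^2 of 1; any divergence lowers the score.
a = [0, 1, 2, 0, 1, 2]
b = [0, 1, 2, 0, 1, 2]
print(ld_r2(a, b))
```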
We hypothesize that the presence of LD accounts for model instability; it is likely that the classifier identifies the same predictive regions in each fold yet extracts different variants, creating the appearance of instability as observed in the results from the previous section. We now discuss a novel formulation of the maximum flow algorithm, in which we utilize the presence of LD to identify regions that are consistently present across folds; this results in a set of stable variants that are likely to be correlated with the phenotype.
The 5-fold cross validation performed in the previous section results in five sets of variants with non-zero coefficient scores. We begin by creating a graph G to represent the presence of LD between pairs of these variants. Each node n in the graph is defined by a variant v as well as the fold f in which it occurs, which we represent as the tuple n = (v,f). Consider a pair of nodes n1 = (v1, f1) and n2 = (v2, f2); an edge is drawn between the pair if the following criteria are satisfied: (1) n1 and n2 are present in neighboring folds such that f2 = f1 + 1 and (2) n1 and n2 are in linkage disequilibrium, as indicated by the R2 value between v1 and v2 exceeding 0.8 [36]. As a result of criterion (1), G is necessarily a 5-partite graph. Variants closer together in physical space are more likely to be in LD, so in this analysis, we compute LD within each chromosome; thus, an edge can exist between nodes n1 and n2 only if v1 and v2 are located within the same chromosome. Consequently, G is composed of 23 disjoint subgraphs (Fig. 5).
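The graph construction can be sketched as follows; the per-fold variant lists and the LD lookup table are toy stand-ins for the real cross-validation outputs and R2 computations:

```python
import networkx as nx

# Hypothetical per-fold variant selections from the 5-fold cross
# validation (fold number -> variants with non-zero coefficients).
fold_variants = {1: ["v1", "v2"], 2: ["v1", "v3"], 3: ["v1"], 4: ["v1"], 5: ["v1"]}

def r2(a, b):
    """Toy LD lookup: identical variants are perfectly linked, v1 and
    v3 are in strong LD, and all other pairs are unlinked."""
    if a == b:
        return 1.0
    return {frozenset(("v1", "v3")): 0.9}.get(frozenset((a, b)), 0.0)

G = nx.Graph()
for f in range(1, 6):
    for v in fold_variants[f]:
        G.add_node((v, f))

# Criterion (1): nodes must sit in neighboring folds.
# Criterion (2): their variants must be in LD with r^2 above 0.8.
for f in range(1, 5):
    for v1 in fold_variants[f]:
        for v2 in fold_variants[f + 1]:
            if r2(v1, v2) > 0.8:
                G.add_edge((v1, f), (v2, f + 1))

print(G.number_of_nodes(), G.number_of_edges())
```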
We now utilize maximum flow to identify stable variants across folds [37]. We restructure G = (N,E) into a directed, acyclic flow network L = (N,E) such that it is amenable to the maximum flow formulation, a concept well studied in graph theory. To do so, we add a source node s and a sink node t to the graph. A directed edge is drawn between the source s and all nodes with a fold value of f = 1. Similarly, an edge is drawn between all nodes with a fold value of f = 5 and the sink t. To create the flow network L, we first define the capacity of an edge as the maximum amount of flow that can pass between nodes n1 and n2, defined as a real number c(n1,n2). We also define the flow between nodes n1 and n2 as a real number l limited by the capacity of the edge, such that l(n1,n2) ≤ c(n1,n2). With the exception of the source and sink nodes, the flow entering a node must equal the flow exiting the node. The total flow value of the graph can be computed as the flow that leaves the source node or enters the sink node, represented by the summation \({\sum}_{\left(s,n\right)\in E}l\left(s,n\right)\) = \({\sum}_{\left(n,t\right)\in E}l\left(n,t\right)\). Then, our goal is to maximize the total flow passing from the source to the sink node of a graph with respect to the criteria defined above.
In a k-partite graph, it is possible that the set of unique edges that define the maximum flow can include non-unique nodes; however, we seek to identify the maximum number of stable nodes across folds in order to identify regions in LD that are likely to influence the phenotype. Thus, we must constrain the flow network such that each node can be used at most once within the flow. We do so by further modifying the network L = (N,E): we split each node n = (v,f) into two nodes defined as nin = (v,f,in) and nout = (v,f,out), adding a single edge between nin and nout. As a result of this modification, L doubles in size, becoming a 10-partite graph (excluding the source and sink nodes) (Fig. 6).
All edges in L are assigned a capacity of 1. Then, the flow through L is computed using the Ford-Fulkerson algorithm, an iterative augmenting-path technique that identifies a maximal set of valid paths between the source and sink nodes [38]. When augmenting paths are found via breadth-first search (the Edmonds-Karp variant), the runtime is O(VE²). The resulting maximum flow value equals the number of valid paths through the graph, and the nodes along each flow path from the source to the sink represent a set of regions that remain stable across folds after accounting for the presence of LD.
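A minimal sketch of the node-splitting construction and the flow computation, using networkx's `maximum_flow` in place of a hand-rolled Ford-Fulkerson; the toy graph contains one variant that persists across all five folds (v1) and one that appears only in fold 1 (v2):

```python
import networkx as nx

L = nx.DiGraph()
nodes = [("v1", f) for f in range(1, 6)] + [("v2", 1)]

# Split each node into an "in" and "out" copy joined by a unit-capacity
# edge, so that each node can carry at most one unit of flow.
for v, f in nodes:
    L.add_edge((v, f, "in"), (v, f, "out"), capacity=1)

# Cross-fold LD edges (here only v1 is linked to itself across folds).
for f in range(1, 5):
    L.add_edge(("v1", f, "out"), ("v1", f + 1, "in"), capacity=1)

# The source feeds every fold-1 node; every fold-5 node drains into
# the sink.
for v, f in nodes:
    if f == 1:
        L.add_edge("s", (v, f, "in"), capacity=1)
    if f == 5:
        L.add_edge((v, f, "out"), "t", capacity=1)

flow_value, flow_dict = nx.maximum_flow(L, "s", "t")
print(flow_value)  # the number of stable cross-fold paths
```

Here the single unit of flow corresponds to the one path s → v1 (fold 1) → … → v1 (fold 5) → t; v2 is a dead end and carries no flow.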
Data and preprocessing
We evaluated our methodology on 30x coverage whole genome sequence data collected from 2182 children with ASD and 379 control patients with progressive supranuclear palsy (PSP). The ASD population data was obtained from The Hartwell Foundation’s Autism Research and Technology Initiative (iHART), which has collected whole genome sequence data from 1006 multiplex families, each with at least two children presenting an ASD diagnosis [39]. Although case and control studies on ASD typically assign unaffected family members as controls, subclinical phenotypes of ASD that are often present in family members can suppress diagnostic signal. To overcome this issue, we elected to use a separate outgroup of patients with PSP as the control population. PSP is a neurodegenerative condition that has no etiological overlap with ASD and is generally not heritable; only one gene is currently known to be linked with the condition. We provide further evidence to support the use of this control group in our prior work [19].
In order to limit batch effects due to sequencing methodologies, the PSP and ASD populations were both sequenced at the New York Genome Center with Illumina HiSeq X instruments. There is no overlap between the cohorts.
Previously, we showed that a regularized machine learning model trained on variants in Simple Repeat Sequences (SRS) was able to successfully differentiate ASD patients from the control group with high accuracy, suggesting that variation in SRS may be predictive of the ASD phenotype [19]. We now utilize the methods described in the previous sections to address the issue of feature instability and extract a robust set of SRS variants potentially correlated with the ASD phenotype.
SRS are segments of noncoding DNA that consist of repeating sequences of one to ten base pairs. These regions are highly susceptible to mutations, and unstable expansion of these regions has been linked to more than twenty neurodevelopmental and neurodegenerative conditions [40, 41]. We downloaded a list of chromosomal coordinates for all SRS from the UCSC Genome Browser, which identified 413,380 SRS regions in the human genome [42]. Variants likely to result from batch effects were removed using a genome-wide association test with batch (ASD and PSP) as the phenotype [19]. After preprocessing, variants present in both the ASD and PSP populations were identified, resulting in a final list of 232,193 SRS variants.
We create a binary feature matrix consisting of 2561 rows (corresponding to the 2182 ASD patients and 379 control patients) and 232,193 columns (corresponding to the SRS variants), which serves as input to the logistic regression classifier. To address class imbalance between the case and control populations, we adjusted classifier weights to be inversely proportional to class sizes. We then execute the maximum flow procedure as described in the previous sections and extract the list of variants determined to be stable across folds.
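The class reweighting can be expressed in scikit-learn with `class_weight="balanced"`, which sets each class weight inversely proportional to its frequency; this is one standard realization of the scheme described above (the exact implementation used in the study is not specified here):

```python
from sklearn.linear_model import LogisticRegression

# class_weight="balanced" assigns each class the weight
# n_samples / (n_classes * class_count).
clf = LogisticRegression(penalty="l1", solver="liblinear",
                         class_weight="balanced")

# For 2182 ASD cases and 379 PSP controls, the implied weights are:
n_samples, counts = 2561, {"asd": 2182, "psp": 379}
weights = {k: n_samples / (2 * c) for k, c in counts.items()}
print(weights)
```

The minority PSP class receives a weight several times larger than the ASD class, compensating for the roughly 6:1 imbalance.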
Validation
We hypothesize that the presence of LD will cause the model to identify the same regions in each fold yet extract different variants. To evaluate this hypothesis, we determine whether the variants identified by the classifier across the five cross-validation folds are more stable than expected by random chance. We perform a bootstrap test, randomly selecting five sets of variants from the complete set of 232,193 SRS variants, maintaining sizes equivalent to those of the original folds. We construct a flow network using the procedure defined in the previous section and use this to compute maximum flow. We repeat this process 100 times. If our hypothesis is supported, we expect the random flow networks to be highly disconnected due to low linkage between random variants. If our hypothesis is not supported, the random flow networks will result in connected graphs with flow values similar to those observed in our true network.
Next, we determine if a classifier trained on the variants identified by the maximum flow procedure is more stable than the original regularized machine learning model. To do so, we utilize the regions identified by the maximum flow algorithm to construct new feature matrices. Each flow path from the source to sink includes one to five unique variants in LD; this group of variants defines a region that appears to be stable across multiple validation folds. We construct a binary feature matrix with columns corresponding to the grouped regions. Each grouped network feature is assigned a 1 if at least one of those variants is present in the patient and a 0 otherwise. In order to demonstrate an improvement in feature stability, we perform five-fold cross validation across the training set with a logistic regression model trained on this reduced feature set. We recompute stability metrics (Pearson correlation coefficient, Kendall-Tau score, and Jaccard similarity index) for our reduced variant set and compare our results to the initial stability measurements that we obtained prior to implementation of the maximum flow algorithm.
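The grouping step can be sketched as follows; `X` is a toy binary variant matrix and `paths` lists hypothetical column indices recovered from the flow paths:

```python
import numpy as np

# Toy binary variant matrix: rows are patients, columns are variants.
X = np.array([
    [1, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
])

# Each inner list is one stable region: the variants along one flow
# path from source to sink (one to five linked variants).
paths = [[0, 2], [1, 3]]

# Collapse each path into a single binary feature: 1 if the patient
# carries at least one variant in the group, 0 otherwise.
X_grouped = np.column_stack(
    [X[:, p].any(axis=1).astype(int) for p in paths]
)
print(X_grouped)
```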
In order to characterize the stability improvements afforded by the maximum flow approach, we conduct a series of experiments on simulated data. We use the make_classification function in the scikit-learn library to generate synthetic feature matrices for 2-class classification; the generated datasets consist of clusters of points distributed around vertices of a hypercube, with an approximately equal number of points assigned to each class [43]. This approach results in small effect sizes for individual features, which is reflective of standard genomic datasets. We divide features into 23 groups of arbitrary sizes in order to represent chromosomes, and we simulate linkage disequilibrium by introducing correlation between pairs of features located in the same chromosome. Then, we explore the effect of four parameters on the effectiveness of the maximum flow algorithm: level of feature correlation (percentage of variants in LD), dataset dimensionality (number of features), type of regularization (L1 or Elastic Net), and type of classifier (logistic regression or linear support vector classifier). For each experiment, we select parameters, compute baseline classifier performance and stability metrics, execute the maximum flow algorithm, and recompute performance and stability values with the smaller subset of features. Each experiment is repeated twenty times with random feature matrices.
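A sketch of the simulation setup is shown below. `make_classification` places class clusters at hypercube vertices as described; the chromosome grouping and the particular way correlation is injected (overwriting a feature with a noisy copy of an earlier feature on the same chromosome) are illustrative assumptions, since the source does not specify the mechanism:

```python
import numpy as np
from sklearn.datasets import make_classification

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=20, random_state=0)

# Partition features into 23 "chromosomes" of arbitrary sizes.
chrom_of = rng.integers(0, 23, size=X.shape[1])

# Simulate LD: replace ~30% of features with a noisy copy of another
# feature from the same chromosome, creating correlated pairs.
for j in range(X.shape[1]):
    earlier = np.flatnonzero(
        (chrom_of == chrom_of[j]) & (np.arange(X.shape[1]) < j)
    )
    if earlier.size and rng.random() < 0.3:
        k = rng.choice(earlier)
        X[:, j] = X[:, k] + rng.normal(scale=0.1, size=X.shape[0])

print(X.shape, y.mean())
```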
Finally, we search for and characterize the enrichment of our highest ranked variants in biological roles associated with the autism phenotype.