Functional networks inference from rule-based machine learning models

Background Functional networks play an important role in the analysis of biological processes and systems. The inference of these networks from high-throughput (-omics) data is an area of intense research. So far, the similarity-based inference paradigm (e.g. gene co-expression) has been the most popular approach. It assumes a functional relationship between genes which are expressed at similar levels across different samples. An alternative to this paradigm is the inference of relationships from the structure of machine learning models. These models are able to capture complex relationships between variables, that often are different/complementary to the similarity-based methods. Results We propose a protocol to infer functional networks from machine learning models, called FuNeL. It assumes, that genes used together within a rule-based machine learning model to classify the samples, might also be functionally related at a biological level. The protocol is first tested on synthetic datasets and then evaluated on a test suite of 8 real-world datasets related to human cancer. The networks inferred from the real-world data are compared against gene co-expression networks of equal size, generated with 3 different methods. The comparison is performed from two different points of view. We analyse the enriched biological terms in the set of network nodes and the relationships between known disease-associated genes in a context of the network topology. The comparison confirms both the biological relevance and the complementary character of the knowledge captured by the FuNeL networks in relation to similarity-based methods and demonstrates its potential to identify known disease associations as core elements of the network. Finally, using a prostate cancer dataset as a case study, we confirm that the biological knowledge captured by our method is relevant to the disease and consistent with the specialised literature and with an independent dataset not used in the inference process. Availability The implementation of our network inference protocol is available at: http://ico2s.org/software/funel.html Electronic supplementary material The online version of this article (doi:10.1186/s13040-016-0106-4) contains supplementary material, which is available to authorized users.


Classification accuracy under feature selection
To choose the default percentage of attributes retained in the feature selection procedure, we performed a preliminary analysis using all 8 transcriptomic datasets from the main article. We evaluated how the Bio-HEL classification accuracy changes with the number of selected features (using linear SVM-RFE). The accuracy was measured using a standard 10 cross-fold validation. The full experiment (not reported here) used 100%, 90%, 80%, ..., 10% of the original dataset attributes.
We found that even when only 10% of the attributes are retained, the classification accuracy remains almost unchanged. Specifically, with 10% of the original attributes the accuracy increased for 2 datasets, slightly decreased for 3 datasets and remained unaltered for the other datasets. The exact results are reported in Table S1 below: Table S1: BioHEL classification accuracy for each dataset, in 10-fold cross-validation experiments on the original and reduced set of attributes (before and after the feature selection). Linear SVM-RFE was used to select best 10% of the attributes. Given that we were able to maintain good classification accuracy despite large reduction in number of used attributes, we decided to use 10% attributes as a default setting for the FuNeL feature selection procedure. However, this FuNeL parameter is under the user control and the default setting can be changed.
The running time for the whole pipeline depends on the rule set generation time (execution time of BioHEL), as the optional feature selection stage can be seen as running in constant time. Two main factors that influence the rule set generation time are: (1) the number of attributes and (2) the number of samples.
We performed an execution time analysis of BioHEL using the largest (in terms of number of attributes) Colon-Breast dataset (Chowdary et al., 2006). In the feature selection stage we retained: 20, 200, 2000, 10 000 and 20 000 attributes. From each of these 5 datasets we generated 100 random subsets of 50, 40, 30, 20 and 10 samples. Finally, we ran BioHEL 1000 times to obtain 1000 rule sets for each dataset. Figure S1 shows the running times averaged across 100 000 runs (1000 runs for each of the 100 datasets).
The total execution time of FuNeL configurations C 1 and C 2 is calculated as: T 1 = (rule_sets × t(atts 1 , samples)) + (permutation_runs × t(atts 1 , samples)) (1) where rule_sets is the number of inferred rule sets, permutation_runs is the number of randomised datasets used in the permutation test and t(atts 1 , samples) represents execution time of a single BioHEL run, that linearly depends on the size of a dataset measured in number of attributes and samples.
Configurations C 3 and C 4 require an additional run of BioHEL (step 4), and their total execution time is: where atts 2 is the number of attributes after the permutation test (atts 1 ≤ atts 2 ).
It is important to notice that each run of BioHEL is independent, thus the generation of the rule sets can be trivially parallelised without any extra overhead. Given n computational cores, the total execution times could be reduced to:

Comparison of networks topological properties
The network topology refers to the spatial arrangements of its elements. The analysis of topological properties tells us how different nodes are connected to each other and how their communication paths look like. There are many aspects and characteristics that can be evaluated in a network. For simplicity we report just four metrics: number of nodes, number of edges, clustering coefficient and diameter.
The clustering coefficient is a measure of degree to which nodes in a network tend to cluster together. It expresses the likelihood that any two nodes with a common neighbour are themselves connected. The diameter indicates the maximum distance between two nodes in the network.
We compared the topology of the networks built with two different approaches: co-prediction and co-expression.
For each generated network we calculated the topological properties described above. We report the comparison of FuNeL networks with co-expression networks inferred with different methods. The results considering Pearson are reported in Table S2, ARACNE is reported in Table S3 while MIC analysis is shown in Table S4. Table S2: Topological properties for co-prediction and Pearson co-expression networks generated for all 8 datasets.

Co-prediction Co-expression (SE) Co-expression (SN)
Dataset Cat.  Table S3: Topological properties for co-prediction and ARACNE co-expression networks generated for all 8 datasets.

SE(C 1 ) SE(C 2 ) SE(C 3 ) SE(C 4 ) SN (C 1 ) SN (C 2 ) SN (C 3 ) SN (C 4 )
Leukemia Nodes  421  1480  294  988  1024  1426  1356  1577  422  1480  294  989  Edges  1529  2294  2154  2646  1529  2294  2154  2646  512  2416  327   When analysing FuNeL networks we observed, as expected, that configurations having feature selection (C 1 and C 3 ) lead to networks with a smaller number of nodes than when the original set of attributes is used (C 2 and C 4 ). Furthermore, the second phase of machine learning modeling (C 3 and C 4 ) tends to reduce the number of nodes as it uses a reduced set of attributes as input (only significant nodes and their neighbours from the first training phase), while increasing both clustering coefficient and number of edges.
When comparing FuNeL and co-expression networks we notice that the ARACNE SE counterparts have in general more nodes. The same patter can be found in SE(C 2 and SE(C 4 ) counterparts generated with Pearson and MIC, while it's not true for the SE-networks based on configurations that use feature selection (C 1 and C 2 ). Conversely, SN -networks differ according to the inference method used. In fact ARACNE generated SN counterparts with less edges, while this is true only for SN (C 1 ) and SN (C 3 ) inferred with MIC and Pearson. The clustering coefficient is constantly lower in ARACNE networks than in FuNeL, this is probably due to the pruning phase operated by the method. A similar trend can be noticed for MIC networks with some exceptions (e.g Prostate SN (C 2 ) and SN (C 4 )). A more balanced situation occurs when FuNeL is contrasted with Pearson, in fact networks generated with feature selection (C 1 and C 3 ) have a lower coefficient than their co-expression counterparts. Finally a clear pattern emerge when analysing the diameter of the networks. Co-prediction networks are always more compact than co-expression counterparts having up to 3 time lower diameter for MIC and Pearson and up to 7 time lower for ARACNE.

Enrichment score analysis
In this section we report the network average rankings, based on the Enrichment Score, across the 8 datasets for each inferring method. The networks are ranked between 1 and N (where N = 4 for FuNeL and N = 8 for Pearson, ARACNE and MIC: 4 SE(C i ) + 4 SN (C i )). We considered Gene Ontology terms (biological process (BP), molecular function (MF) and cellular component (CC)) and biological pathways. The last row of each table represents the average rank across different biological categories.

MIC (SE) MIC (SN)
Cat. In here we also report the results of the analysis where we compared each inferring method against FuNeL using the Enrichment Score. The networks are ranked from 1 to 12: 4 C i + 4 SE(C i ) + 4 SN (C i ). In Table S9 we report, for each biological category and for each network, the ranks averaged across the 8 datasets. The row-wise rank is given in brackets and the highest ranks are shown with bold font. The following abbreviations were used for GO categories: biological process (BP), molecular function (MF) and cellular component (CC).

Disease association analysis
In this section we report the network average rankings across the 8 datasets for every inferring method based on the gene-disease association properties: participation in triangular relationship and proximity. We used two sources for the disease associations: Malacards (Rappaport et al., 2013) (a meta-database of human maladies consolidated from 64 independent sources) and manually curated databases (OMIM (Hamosh et al., 2002), Orphanet (Orphanet, 1997), Uniprot (Magrane and Consortium, 2011) and CTD (Davis et al., 2014)). The networks are ranked between 1 and N (where N = 4 for FuNeL and N = 8 for Pearson, ARACNE and MIC: 4 SE(C i ) + 4 SN (C i )). The number of disease-associated genes participating in a triangle is denoted as 1A, 2A and 3A. The last row of each table represents the average rank across different metrics.

MIC (SE) MIC (SN)
Cat. In this section we report the additional results from the analysis performed using the prostate dataset (Singh et al., 2002) as a case study. In particular we show: 1) the overlap of enriched terms between co-prediction and co-expression networks, 2) the overlap between GO terms associated to the hubs of the networks generated with different methods and FuNeL and 3) the average percentages of alteration for key nodes of both co-prediction and co-expression networks in an independent dataset.

Overlap of networks enriched terms
We performed an analysis on the enriched terms of each network to highlight the complementary nature of the co-prediction and the co-expression paradigm. We generated heatmaps showing the unique terms associated only to co-prediction or co-expression networks. The main manuscript includes the comparison between FuNeL and Pearson networks, in here we report the analysis performed considering ARACNE ( Figure S2) and MIC ( Figure S3) networks. For the sake of readability we filtered out the generic GO terms (with depth < 9 in the GO hierarchical structure). Figure S2: Number of non-common enriched GO terms (biological process) for each network configuration (generated from the prostate cancer dataset). On the x-axis we show the 12 investigated networks. On the y-axis we show the names of enriched terms unique to co-prediction or ARACNE co-expression networks. Red terms are associated with co-expression networks, blue with co-prediction. Empty columns indicate networks with no unique terms. Figure S3: Number of non-common enriched GO terms (biological process) for each network configuration (generated from the prostate cancer dataset). On the x-axis we show the 12 investigated networks. On the y-axis we show the names of enriched terms unique to co-prediction or MIC co-expression networks. Red terms are associated with co-expression networks, blue with co-prediction. Empty columns indicate networks with no unique terms.
When comparing ARACNE and FuNeL, we found 16 unique pathways for co-prediction networks and 8 for co-expression. In terms of unique GO terms, the overlap was more balanced, 7 for co-prediction networks and 9 for co-expression networks. C 2 and C 4 , generated without feature selection, had the largest number of unique pathways, while SE(C 2 ) had the highest number of terms for ARACNE. The comparison of FuNeL with MIC generated many empty columns ( Figure S3) for the GO terms because several networks resulted having no unique enriched terms. All the 15 unique GO terms related to MIC were associated to SN (C 2 ) (and with SN (C 4 ) in two cases), conversely FuNeL had more networks sharing the 12 unique terms. Finally, as noticed for in the ARACNE comparison, FuNeL networks are more enriched in biological pathways: 16 against 8 unique terms for MIC co-expression.

Overlap of hubs related terms
We also analysed the gene associated to the hubs of each network in order to compare the biological knowledge associated to them. A node v was considered to be a hub if its degree was at least one standard deviation above the mean network degree, that is if: where d(v) is a degree of the node v, and µ d and σ d are the mean and standard deviation of a network node degree distribution.
To compare the networks, we used the 10 most frequent GO terms (biological processes) shared among each network's hubs. Figure S4, S5 and S6 show the terms-overlap analysis between FuNeL networks and Pearson, ARACNE and MIC respectively. To make this analysis more specific we have discarded the most generic / most common terms (which could be be associated with many genes), we considered only the GO terms situated at level 10 of the GO hierarchy or lower. Blue terms were found only in co-prediction networks, red terms were found only in co-expression networks, and green terms were found in both. In Table S18 we summarise the number of unique and common terms shared between networks created with different approaches. This analysis further highlights the complementary nature of co-prediction and co-expression approach, the terms that are paradigm-specif always outnumber the common ones.

Validation on independent dataset
In this section we report additional informations about the analysis performed using data from the independent prostate cancer study (Taylor et al., 2010) available in the cBioPortal for Cancer Genomics (Cerami et al., 2012). In particular we report the full list of alterations for the topologically important genes analysed in the main article. The Figure S7-S14 show the percentage of altered tumour samples for top 10 hubs (nodes with highest degree) and top 10 central nodes (with highest betweenness centrality) in the best performing networks according to the gene-disease association analysis (using the information from the curated databases). The selected networks are C 2 for FuNeL, SN (C 3 ) for Pearson, SE(C 4 ) for ARACNE and SE(C 2 ) for MIC. For all of them we report the alterations for both hubs and central nodes.

Genetic Alteration
Deep Deletion Missense Mutation mRNA Upregulation mRNA Downregulation Figure S7: Percentage of alterations in tumour samples from an independent cancer genomic study. Genes with highest degree (hubs) in C2 network are shown.

Genetic Alteration
Deep Deletion Missense Mutation mRNA Upregulation mRNA Downregulation Figure S8: Percentage of alterations in tumour samples from an independent cancer genomic study. Genes with highest betweenness centrality (central nodes) in C2 network are shown.

Genetic Alteration
Amplification Deep Deletion mRNA Upregulation mRNA Downregulation Figure S9: Percentage of alterations in tumour samples from an independent cancer genomic study. Genes with highest degree (hubs) in Pearson SN (C3) network are shown.

Genetic Alteration
Amplification mRNA Upregulation mRNA Downregulation Figure S10: Percentage of alterations in tumour samples from an independent cancer genomic study. Genes with highest betweenness centrality (central nodes) in Pearson SN (C3) network are shown.

Genetic Alteration
Amplification mRNA Upregulation mRNA Downregulation Figure S11: Percentage of alterations in tumour samples from an independent cancer genomic study. Genes with highest degree (hubs) in ARACNE SE(C4) network are shown.

Genetic Alteration
Amplification mRNA Upregulation mRNA Downregulation Figure S12: Percentage of alterations in tumour samples from an independent cancer genomic study. Genes with highest betweenness centrality (central nodes) in ARACNE SE(C4) network are shown.

Genetic Alteration
Amplification Deep Deletion mRNA Upregulation mRNA Downregulation Figure S13: Percentage of alterations in tumour samples from an independent cancer genomic study. Genes with highest degree (hubs) in MIC SE(C2) network are shown.