On the evaluation of the fidelity of supervised classifiers in the prediction of chimeric RNAs
© The Author(s) 2016
Received: 1 February 2016
Accepted: 11 October 2016
Published: 2 November 2016
High-throughput sequencing technology and bioinformatics have identified chimeric RNAs (chRNAs), raising the possibility of chRNAs expressing particularly in diseases can be used as potential biomarkers in both diagnosis and prognosis.
The task of discriminating true chRNAs from the false ones poses an interesting Machine Learning (ML) challenge. First of all, the sequencing data may contain false reads due to technical artifacts and during the analysis process, bioinformatics tools may generate false positives due to methodological biases. Moreover, if we succeed to have a proper set of observations (enough sequencing data) about true chRNAs, chances are that the devised model can not be able to generalize beyond it. Like any other machine learning problem, the first big issue is finding the good data to build models. As far as we were concerned, there is no common benchmark data available for chRNAs detection. The definition of a classification baseline is lacking in the related literature too. In this work we are moving towards benchmark data and an evaluation of the fidelity of supervised classifiers in the prediction of chRNAs.
We proposed a modelization strategy that can be used to increase the tools performances in context of chRNA classification based on a simulated data generator, that permit to continuously integrate new complex chimeric events. The pipeline incorporated a genome mutation process and simulated RNA-seq data. The reads within distinct depth were aligned and analysed by CRAC that integrates genomic location and local coverage, allowing biological predictions at the read scale. Additionally, these reads were functionally annotated and aggregated to form chRNAs events, making it possible to evaluate ML methods (classifiers) performance in both levels of reads and events. Ensemble learning strategies demonstrated to be more robust to this classification problem, providing an average AUC performance of 95 % (ACC=94 %, Kappa=0.87 %). The resulting classification models were also tested on real RNA-seq data from a set of twenty-seven patients with acute myeloid leukemia (AML).
KeywordsChimeric RNAs Transcriptomics Classification Ensemble learning Data simulation
A chimeric RNA (chRNA) is a RNA molecule that is made of two or more pieces of RNAs from different loci. Chimeric RNAs are formed from genetic rearrangements like translocation, inversion, deletion or copy number variation. Other studies have highlighted that many other chRNA could be also generated from post-transcriptional mechanisms such as read-through and splicing, transcription of short homologous sequence slippage or from trans-splicing .
High-throughput sequencing technology and bioinformatics have identified chRNA, raising the possibility of chRNAs expressing particularly in diseases can be used as potential biomarkers in both diagnosis and prognosis. However, only a limited number of chimeric transcripts and their associated protein products have been characterised to date, most of them resulting from chromosomal translocation and associated with cancer [2, 3]. For instance the well-known gene fusion in chronic myelogenous leukemia leading to an mRNA transcript that encompass the 5’ end of the BCR gene and the 3’ end of the ABL gene, producing the chRNA BCR–ABL.
Most protocols used to identify chimeric transcripts rely on a reverse transcription step and the reverse transcriptase is known to switch templates, thus creating chimeric artifacts in vitro . Thus, it remains unclear what proportion of putative chimeric transcripts are true, and of these how many are translated. With the rapid evolution of genomic research, next generation sequencing (NGS) has become widely available. Sequences in databases or sequences directly from RNA-seq often contain information that leads to false results when they are analysed with bioinformatics approaches. This is partially due to the fact that the first stage of the available bioinformatic tools for finding chRNA basically rely on mapping, where reads are aligned with respect to the reference genome used by the algorithm. Additionally, the length and depth of sequencing might be inadequate for covering all exon junctions, and given that repeated sequences are present, clearly, the alignment of sequences to specific regions is highly biased to sequencing. Next to mapping, a set of filters based on several technological or biological assumptions are applied to reduce the set of putative chRNAs.
Many tools have been developed to detect chRNAs , though many complications arise when trying to provide a complete benchmark. Each one of these tools have particular constraints regarding installation, parametrization, usability, and database(s) or software(s) dependencies. Liu et al.  provided an overall evaluation of thirteen strategies, suggesting the combination of three of them as a robust solution. In fact, there is a trend to explore combination of algorithms to boost the analysis of chimera detection. However, a good combination of software requires proper criteria of selections based on metrics according to the biological question in order to choose the most suitable ones. In fact, the basic information shared in common by all these softwares is the initial sequenced reads which is distinctively explored by each of them. In other words, each software integrates its own algorithm and different external features and database to define proper margins or “rule(s)” to classify potential chRNAs. If one is able to make such abstraction (or rules) from the reads, it might be an interesting alternative to directly explore models generated from classifiers. In this work, we propose a different angle of discussion, model-oriented rather than comparing software for chRNAs detection which are not necessarily comparable because there are questions-oriented.
The task of discriminating true chRNA from the false ones poses an interesting Machine Learning (ML) challenge. First of all, the sequencing data is highly biased and thus predicting the real signal from the noise can be a hard task. Furthermore, even if we succeed to have a proper set of observations (enough sequencing data) about true chRNAs, chances are that the devised model can not be able to generalize beyond it.
Models generalization is the “holy grail” of machine learning. ML methods must be able to see beyond the data observed in order to generalize beyond it. The famous “no free lunch” theorem of Wolpert says that no learner can beat a random guessing over all possible functions to be learned. Ensemble strategies overcome the no-free-lunch dilemma by combining the outputs of many classifiers assuming that each classifier performs well in particular domains while being sub-optimal in others. Machine learning and ensemble based systems has been applied successfully in several data applications [7, 8]. The main idea behind the ensemble strategy is simple, “rely on the feedback of a bunch of specialists”. Thus, a combination of several models might better estimate the margins boundaries in the hypothesis space, reducing the bias and variance errors. Bias is a learner’s tendency to consistently learn the same wrong thing. Variance is the tendency to learn random things irrespective of the real signal . Empirical and theoretical evidence show that some ensemble learning techniques like bagging act as a variance reduction mechanism, i.e., they reduce the variance component of the error. Boosting strategies reduce both the bias and variance parts of the error. It sounds that the bias error is usually reduced in the early iterations, while variance error decreases in later ones. Indeed, the key component behind ensemble success is the concept of classifier diversity . Classifier diversity can be achieved in several ways. The most known approach is to use different training datasets to train individual classifiers. Such datasets are often obtained through resampling techniques, such as bootstrapping or bagging, where training data subsets are drawn randomly, usually with replacement, from the entire training data.
Most of the available gene fusion-finders methods [5, 10] do not explore the potentials of applying ML techniques to this classification problem. As far as we are aware of, only two gene fusion-finders, namely EricScript  and deFuse , have explored AdaBoost. AdaBoost generates a set of hypotheses and combines them through weighted majority voting of the classes predicted by the individual hypotheses. The hypotheses are generated by training a weak classifier, using instances drawn from an iteratively updated distribution of the training data. This distribution update ensures that instances misclassified by the previous classifier are more likely to be included in the training data of the next classifier. Hence, consecutive classifiers’ training data are geared towards increasingly hard-to-classify instances. As fas as we are concerned, there is no further literature in finding chRNAs using bagging-oriented strategies. Bagging is one of the pioneers in ensemble learning. Diversity in bagging is obtained by using bootstrapped replicas of the training data: different training data subsets are randomly drawn, with replacement, from the entire training data. Each training data subset is employed to train a distinct classifier of the same type. Individual classifiers are then combined by taking a majority vote of their decisions. For any given instance, the class chosen by most classifiers is the ensemble decision. Random Forest (RF) is a variation of the bagging algorithm, and it can be created from individual decision trees, whose certain training parameters vary randomly. Such parameters can be bootstrapped replicas of the training data, as in bagging, but they can also be different feature subsets as in random subspace methods. In this work we have successfully explored RF models in the prediction of chRNAs in both read and event levels.
Like any other machine learning problem, the first big issue is about the data . As far as we are concerned, there is no common benchmark data available. Indeed, even the definition of a classification baseline is lacking in the related literature. In this work we are moving towards a benchmark data and a fair comparison analysis unraveling the role of supervised classifiers in finding chRNAs. All benchmark data and results are freely available at https://sites.google.com/site/alvesrco/chimeres.
Simulated data sets
This procedure modifies the sequence of the input reference genome by introducing random point mutations (SNV), insertions and deletions (or indels), as well as translocations. Substitutions and indels are introduced at random genomic locations at rates chosen by the user. By default, one every 1,000 nucleotides will be substituted (a rate of 0.1 %), while at 1/10,000 positions, an indel will be introduced (a rate of 0.01 %). At a substituted position, the new nucleotide is chosen at random with equal probability among the three other possibilities. For indels, the length is chosen uniformly within a range [1,15] nucleotides and the inserted sequence are chosen randomly. Regarding translocations, whose goal is to generate gene fusions, the exchanged chromosomal locations are chosen within annotated genes. The genome simulator takes a gene annotation file as input to get the positions of all exons. For a translocation, two exons of two different genes are chosen at random with an input probability, then we perform a bidirectional exchange of the 3 prim part of one gene with the 5prim part the other gene and the recripocal, thereby creating two chimeric, or fusion genes. Exons are never splitted by this process, as we hypothesized that fusion genes with disrupted exons are counterselected by evolution. For the sake of simplicity, the simulator generates chimeras using reads encoded on the forward strand only. Expressed chimeric RNAs will be covered by RNA-Seq reads; it is worth noting that both fused genes generated by a translocation are, as any other gene, not necessarily expressed, nor covered by reads. Another alternative for generating artificial gene fusions is by using the FUSIM tool , though the pipeline does not simulate RNA-seq data including chimeras, it only provides their associated sequences. The simulated data generator is packaged in a tools called SimCT (not yet published) that can easily integrate new complex chimeric events.
CRAC has been chosen to perform this experimental study for two main reasons. First of all, it is a splice-aware mapping software that integrates the identification of chimeric junctions with a high confidence ratio. The second reason is that CRAC’s algorithm is based on the study of k−mer support and location profiles which provides for every k−mer of a read its number of occurrences in the whole read collection and all its possible locations on the reference genome. These profiles are of an extremely valuable source of information to extract quantitative features that define chimeric junctions at the read “level” and therefore, suitable for the development of ML models [see Additional file 1].
The chRNAs reads distribution within the progressive sampling
Supervised learning strategies
The challenge of finding chRNAs can be seen as a binary classification problem (positive class: true chimera and negative class: false chimera) . Once having the mutated genomes we can move towards the evaluation of several supervised learning strategies for mining chimeras. The comparison study takes into account distinct supervised learning strategies, namely, one lazy learner (KNN: K-Nearest Neighbors), one eager-learner (SVM: Support Vector Machines) and two ensemble learners (RF: Random Forest and GBM: Grandient Boosting Machines). Each ML strategy has a particular way to generalize the search space.
Nearest-neighbor classifiers are based on learning by analogy, by comparing a given test instance with training instances that are similar to it. Support Vector Machines uses a linear model to implement nonlinear class boundaries. SVM transform the input using a nonlinear mapping, thus, turning the instance space into a new space. Random forest is a well-known ensemble approach for classification tasks proposed by Breiman. Its basis comes from the combination of tree-structured classifiers with the randomness and robustness provided by bagging and random feature selection. Gradient boosting machines consecutively fits new models to provide more accurate estimate of the response variable. Thus, new base learners are built being maximally correlated with the negative gradient of the loss function associated with the entire ensemble. The loss function applied can be arbitrary, so as an example, if the error function is squared-error loss, the learning procedure would result in consecutive error fitting.
Model building and evaluation was carried out with the caret R package . We used the built-in tune() function for resampling, tuning and optimisation of all selected classifiers. We have also adopted a progressive sampling strategy to better understand the impacts of adding more observations to the classifiers performance. We first started using one simulated data set (incorporating a mutation process), and next more data were added sequentially up to having all five genomes. To each run we have set aside one third of the data for the test phase while the remaining two third were used for model building. Training were performed with a 10-fold cross validation scheme along with a tuning grid to each ML technique. Additionally, given that the data split (training/test) were random, we repeated each run three times to check models stability, providing fifteen performance points. As an example r11, r12, and r13 are subsets of the r1 data set (Fig. 2b and d). Models’ performance were average to each run (Fig. 2c).
Towards a classification baseline at the “read” level
Give the absence of a classification baseline for the problem of finding chRNAs, let us assume, essentially, the performance results we got from the first run (r1) of KNN as the initial baseline. The baseline accuracy is also known as the null rate. In machine learning it is also a good practice to start with simple classifiers rather than sophisticated ones . The KNN provided average AUC performance of 88 % (ACC=87 %, Kappa=0.71). Kappa measures how closely the instances labeled by the classifiers matched the data labeled as ground truth, controling for the ACC of a random classifier as measured by the expected accuracy. Thus, the Kappa for one classifier is properly comparable to others Kappa’s classifiers for the same classification problem. We will make use of Kappa as the unbiased metric to compare the selected classifiers along all the progressive sampling strategy.
The general idea behind of a progressive sampling is straightforward. We start with some chRNA reads arising from a individual genome and then, iteratively, build a model, evaluate its performance, and acquire additional chRNA reads from another individual genome. Initially, we were planing to use a simple stop criteria along sampling until getting a performance plateau. Thus, we might be able to stop having more data in the third run (Fig. 2b). However, we decided to have two more additional runs, and, as it can be observed, all models loosed generalization power in the fourth run. All simulations have the same genome simulator parameters and mutation profile and expression profiles, though “randomness” plays quite distinctly to the fourth ones. It might be partially due to the fact that the associated data sets are unbalanced but, every run has about the same proportion of true and false chRNAs (see Table 1). Another hypothesis is that, perhaps, mutations in the fourth genome is biased to patterns in alternative splicing.
The average performance of classifiers at the read level using the top-3 more discriminative features (score_break_length, coefficient_variation, score_is_duplicate)
It is important to mention that a grid of optimisation were applied to all supervised classifiers. Surprisingly, the SVM provided one of the most unstable results, even when compared to KNN. We hypothesise that the search space might be a set of Boolean functions, and given that both RF and GBM employ tree stumps, random perturbations are handled efficiently by ensemble of Booleans rules generated by these classifiers. Thus, in this particular study, indeed, a more powerful learner is better than a less powerful one. Ensemble learning strategies demonstrated to be more robust to this classification problem, providing an average AUC performance of 95 % (ACC=94 %, Kappa=0.87 %) (Fig. 2).
Towards a classification baseline at the “event” level
Classifiers performance at both read and event level
read level (CRAC)
The performance of the classification at the event level, as it might be observed in (Table 3), is not better than a “random guess”. In fact, this is mainly due to the fact that i) aggregation does not take into account a customized classification model to discriminate chimeric events, and ii) chimeric (false positives) reads continue passing through the filters (CracScore), causing trouble to the classification at the event level. Therefore, it is advisable to employ a stacking model, so chimeric reads that are not properly filtered out at the read level, can be, with the advantages of having additional functional information, be enhanced to engineering new features playing interesting role at building a model at the event level.
In order to provide a classification baseline at the event level, we carried out new simulations, avoiding the bias that might be associated while evaluating classification performance at both levels using the same data. In fact, it opens the possibility to explore new covariates, specifically suitable for chRNA predictions at the event level. We have generated six data sets (3 RNA-seq 100bp + 3 RNA-seq 200 bp), corresponding to 1634 chimeric events (1257 = False and 377 =True). A feature set of eleven covariates were carefully designed, though, without loss of generality, only the three most discriminative ones are described: i) coefficient_variation, false chimeric events present higher variability within the junction support than true ones. This is partially due to false localisation inferred wrongly by repetitive regions; ii) fusion_distance, the larger the fusion distance is, the less is the likelihood of the event be a false positive; iii) junction_support, once the support is low one might expect of being a false event, give that the junction support is covered by a few or none reads.
The overall performance of classifiers at the event level using the top-3 more discriminative features (coeficient_variation, fusion_distance, juntion_support)
The fidelity of supervised classifiers in the prediction of chRNAs
We have performed an evaluation using another Human mutated genome, as an independent test set, and the results again are promising for ensemble learners as RF and GBM. We again use our genome simulator, and a total of 80 millions of paired-end reads with read length of 100bp were generated for this new mutated genome. For this last evaluation, we have selected the respective models for each classifier obtained in sampling r5 to predict chRNAs within this new genome (Fig. 2e). Despite small performance differences across ensembles, the RF provided the best overall results (AUC=98 %) (Fig. 2e).
We carried out another evaluation using a generated mutated Drosophila genome (40M reads having 100bp) to check whether the Human-based models could predict efficiently chRNAs in other, let us say, distant organism. As it was expected no classifier were better than a random guessing. Such observation is due to the fact both organisms have a distinct “chimeric profiles” with respect to the number of chromosomes, the arrangements of the exons and evolutionary rate of mutations. Therefore, classification margins boundaries built for Human cannot hold for Drosophila.
Performance of classifiers at both read and event levels using 100 bp
Simulated RNA-Seq 40M reads 100 pb
Performance of classifiers at both read and event levels using 200 bp
Simulated RNA-Seq 40M reads 200 pb
The proposed benchmark highlights the fidelity of supervised classifiers in the problem of chRNAs prediction. Though, it is always important performing biological validation of these events to confirm the biolo gical soundness of all discovered chRNAs.
We proposed a modelization strategy that can be used to increase the tools performances in context of chRNA classification. We improve the models dynamically using a simulated data generator, called SimCT (not yet published), that permit to continuously integrate new complex chimeric events.
We have developed a benchmark pipeline incorporating a mutation genome process and simulated RNA-seq data to evaluate the fidelity of using classifiers to predict chRNAs. The simulated sequencing reads were aligned and annotated by CRAC. CRAC analyzes the RNA-seq data, integrating genomic location and local coverage, allowing biological predictions in one single step which is not available in any other fusion finder tools. Moreover, we enhance chRNAs classification by building models at both levels of read and event. Aggregating reads at the event level allows us to incorporate biological covariates and improving models’ generalization along with the presented experimental study. We envisage focusing on developing models and improving biological annotation at event level which will be addressed in a future work.
Simulated and real RNA-seq data were carefully designed to explore feature engineering and model performance of several classifiers, unraveling the role of ensemble learning strategies in finding chRNAs. As far as we are concerned the present study provides the first common benchmark data, as well as classification baseline for the identification of chRNAs in both levels of reads and events.
Acute myeloid leukemia
Area under the curve
Gene expression Ominibus
Cohen’s kappa coefficient
Single nucleotide variants
Sacha Beaumeunier and Ronnie Alves are fellowships of the IBC (ANR “investissement d’avenir en bioinformatique-projet-IBC); and Jérôme Audoux is fellowship of the FRM (Appel d’offres urgence pour la bioinformatique, projet DBI20131228566).
This work was supported by the French ANR IBC project “investissement d’avenir en bioinformatique-projet-IBC” and FRM “appel d’offres urgence pour la bioinformatique, projet DBI20131228566”.
Availability of data and material
The dataset(s) supporting the conclusions of this article is(are) available in the project’ site repository at https://sites.google.com/site/alvesrco/chimeres.
TC, RA, NP, JA and SB initiated the project. SB, JA, NP and RA performed the analytical work and implementation of the pipeline. SB and RA performed the experimental work and contributed to pipeline testing. TC, FF and AB contributed to biological discussion. All authors contributed to the writing of the paper. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Gingeras TR. Implications of chimaeric non-co-linear transcripts. Nature. 2009; 461(7261):206–11. doi:10.1038/nature08452.View ArticlePubMedPubMed CentralGoogle Scholar
- Yoshihara K, Wang Q, Torres-Garcia W, Zheng S, Vegesna R, Kim H, Verhaak RGW. The landscape and therapeutic relevance of cancer-associated transcript fusions. Oncogene aop(current). 2014. doi:10.1038/onc.2014.406.
- Ding L, Wendl MC, McMichael JF, Raphael BJ. Expanding the computational toolbox for mining cancer genomes. Nat Rev Genet. 2014; 15(8):556–70. doi:10.1038/nrg3767.View ArticlePubMedPubMed CentralGoogle Scholar
- Ren G, Zhang Y, Mao X, Liu X, Mercer E, Marzec J, Ding D, Jiao Y, Qiu Q, Sun Y, Zhang B, Yeste-Velasco M, Chelala C, Berney D, Lu YJ. Transcription-mediated chimeric rnas in prostate cancer: Time to revisit old hypothesis?. OMICS J Integr Biol. 2014; 18(10):615–24. doi:10.1089/omi.2014.0042.View ArticleGoogle Scholar
- Carrara M, Beccuti M, Lazzarato F, Cavallo F, Cordero F, Donatelli S, Calogero RA. State-of-the-art fusion-finder algorithms sensitivity and specificity,. BioMed Res Int. 2013; 2013:1–6. doi:10.1155/2013/340620.View ArticleGoogle Scholar
- Liu S, Tsai WH, Ding Y, Chen R, Fang Z, Huo Z, Kim S, Ma T, Chang TY, Priedigkeit NM, Lee AV, Luo J, Wang HW, Chung IF, Tseng GC. Comprehensive evaluation of fusion transcript detection algorithms and a meta-caller to combine top performing methods in paired-end rna-seq data. Nucleic Acids Res. 2015. doi:10.1093/nar/gkv1234, arxiv http://nar.oxfordjournals.org/content/early/2015/11/17/nar.gkv1234.full.pdf+html.
- Polikar R. Ensemble based systems in decision making. Circ Syst Mag IEEE. 2006; 6(3):21–45. doi:10.1109/mcas.2006.1688199.View ArticleGoogle Scholar
- Larranaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano JA, Armananzas R, Santafe G, Perez A, Robles V. Machine learning in bioinformatics,. Brief Bioinform. 2006; 7(1):86–112. doi:10.1093/bib/bbk007.View ArticlePubMedGoogle Scholar
- Domingos P. A few useful things to know about machine learning. Commun ACM. 2012; 55(10):78–87. doi:10.1145/2347736.2347755.View ArticleGoogle Scholar
- Carrara M, Beccuti M, Cavallo F, Donatelli S, Lazzarato F, Cordero F, Calogero R. State of art fusion-finder algorithms are suitable to detect transcription-induced chimeras in normal tissues?. BMC Bioinformatics. 2013; 14(S-7):2. doi:10.1186/1471-2105-14-S7-S2.View ArticleGoogle Scholar
- Benelli M, Pescucci C, Marseglia G, Severgnini M, Torricelli F, Magi A. Discovering chimeric transcripts in paired-end rna-seq data by using ericscript. Bioinformatics. 2012. doi:10.1093/bioinformatics/bts617 http://bioinformatics.oxfordjournals.org/content/early/2012/10/23/bioinformatics.bts617.full.pdf+html.
- McPherson A, Hormozdiari F, Zayed A, Giuliany R, Ha G, Sun MGF, Griffith M, Heravi Moussavi A, Senz J, Melnyk N, Pacheco M, Marra MA, Hirst M, Nielsen TO, Sahinalp SC, Huntsman D, Shah SP. defuse: an algorithm for gene fusion discovery in tumor rna-seq data. PLoS Comput Biol. 2011; 7(5):1001138. doi:10.1371/journal.pcbi.1001138.View ArticleGoogle Scholar
- Bruno A, Miecznikowski J, Qin M, Wang J, Liu S. Fusim: a software tool for simulating fusion transcripts. BMC Bioinformatics. 2013; 14(1):13. doi:10.1186/1471-2105-14-13.View ArticlePubMedPubMed CentralGoogle Scholar
- Griebel T, Zacher B, Ribeca P, Raineri E, Lacroix V, Guigó R, Sammeth M. Modelling and simulating generic RNA-Seq experiments with the flux simulator,. Nucleic Acids Res. 2012; 40(20):10073–83. doi:10.1093/nar/gks666.View ArticlePubMedPubMed CentralGoogle Scholar
- Philippe N, Salson M, Commes T, Rivals E. CRAC: an integrated approach to the analysis of RNA-seq reads,. Genome Biol. 2013; 14(3):30. doi:10.1186/gb-2013-14-3-r30.View ArticleGoogle Scholar
- Kuhn M. Building Predictive Models in R Using the caret Package. J Stat Softw. 2008; 28(5):1–26.View ArticleGoogle Scholar