Combining DNA methylation and RNA sequencing data of cancer for supervised knowledge extraction

Cappelli, Eleonora; Felici, Giovanni; Weitschek, Emanuel

doi:10.1186/s13040-018-0184-6

Methodology
Open access
Published: 25 October 2018

BioData Mining volume 11, Article number: 22 (2018)

Abstract

Background

In the Next Generation Sequencing (NGS) era, a large amount of biological data is being sequenced, analyzed, and stored in many public databases, whose interoperability is often required to improve accessibility. The combination of heterogeneous NGS genomic data is an open challenge: the analysis of data from different experiments is a fundamental practice for the study of diseases. In this work, we propose to combine DNA methylation and RNA sequencing NGS experiments at gene level for supervised knowledge extraction in cancer.

Methods

We retrieve DNA methylation and RNA sequencing datasets from The Cancer Genome Atlas (TCGA), focusing on the Breast Invasive Carcinoma (BRCA), the Thyroid Carcinoma (THCA), and the Kidney Renal Papillary Cell Carcinoma (KIRP). We combine the RNA sequencing gene expression values with the gene methylation quantity, a new measure that we define to represent the amount of methylation associated with a gene. Additionally, we propose to analyze the combined data through tree- and rule-based classification algorithms (C4.5, Random Forest, RIPPER, and CAMUR).

Results

We extract more than 15,000 classification models (composed of gene sets), which distinguish the tumoral samples from the normal ones with an average accuracy of 95%. From the integrated experiments we obtain about 5000 classification models that consider gene measures from both the RNA sequencing and the DNA methylation experiments.

Conclusions

We compare the sets of genes obtained from the classifications on RNA sequencing and DNA methylation data with the genes obtained from the integration of the two experiments. The comparison yields several genes that are in common among the single experiments and the integrated ones (733 for BRCA, 35 for KIRP, and 861 for THCA) and 509 genes that are in common among the different experiments. Finally, we investigate the possible relationships among the different analyzed tumors by extracting a core set of 13 genes that appear in all tumors. A preliminary functional analysis confirms the relation of part of those genes (5 out of 13 and 279 out of 509) with cancer, suggesting that further studies should focus on the newly identified ones.

Introduction

Next Generation Sequencing (NGS) techniques have revolutionized the sequencing of genomes, producing large quantities of DNA and RNA data [1–4]. This abundance of data allows us to perform analyses on the genetic makeup of human subjects, studying the predisposition to diseases like cancer [5–8]. NGS techniques are not only applied to DNA sequencing [9], but also to other types of experiments, e.g., transcriptome profiling (RNA sequencing) [10, 11], microRNA sequencing (miRNA-seq) [12], protein-DNA interactions (ChIP-Seq) [13], identification of Copy Number Variation (CNV) [14], and characterization of the epigenome or chemical changes in the DNA (DNA methylation) [15–17].

In this work, we focus on DNA methylation and RNA sequencing, as these two NGS experiments have been shown to play an important role in knowledge discovery in cancer [18–25].

DNA methylation is one of the most studied epigenetic changes in human cells. Changes in DNA methylation patterns are crucial in the development of diseases and in many forms of cancer [26–31]. Most NGS methods are based on bisulfite conversion to determine the percentage of methylated cytosines in a CpG island. This measure is called the beta value [32], and is defined as the ratio between the methylated allele intensity and the overall intensity. For more details about the DNA methylation experimental techniques the reader may refer to [17, 33].
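The beta value definition above can be illustrated with a minimal Python sketch. It assumes that the overall intensity is the sum of the methylated and unmethylated intensities; the function name `beta_value` and the input numbers are hypothetical, and real pipelines may add a small constant offset to the denominator, which is omitted here.

```python
def beta_value(methylated: float, unmethylated: float) -> float:
    """Ratio of the methylated intensity to the overall intensity.

    Assumption: overall intensity = methylated + unmethylated.
    """
    total = methylated + unmethylated
    if total <= 0:
        raise ValueError("overall intensity must be positive")
    return methylated / total


# Illustrative intensities (not real measurements).
print(beta_value(300.0, 100.0))  # 0.75
```

A beta value therefore lies between 0 (no methylation detected) and 1 (fully methylated).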

RNA sequencing is a next generation sequencing technique for the analysis of the transcriptome and its quantification. Four main methods for measuring gene expression are used in practice: i) Reads Per Kilobase per Million mapped reads (RPKM) [10]; ii) Fragments Per Kilobase per Million mapped (FPKM) [34]; iii) RNA-Seq by Expectation-Maximization (RSEM) [11, 35]; iv) Transcripts Per Kilobase Million (TPM) [36]. For further details about RNA sequencing, we point the reader to [37], where the authors provide a comprehensive overview of this NGS technique.
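To make two of these measures concrete, the following Python sketch computes RPKM and TPM from raw read counts using their commonly stated definitions. The function names, counts, and gene lengths are hypothetical; FPKM follows the same formula as RPKM with fragments in place of reads, and RSEM is a model-based estimate that is not reproduced here.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of gene length per million mapped reads."""
    total_reads = sum(counts)
    # counts / (length in kb) / (total reads in millions)
    return [c * 1e9 / (total_reads * l) for c, l in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = [c / (l / 1e3) for c, l in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r / scale * 1e6 for r in rates]


# Illustrative toy data: three genes of one sample.
counts = [100, 200, 700]
lengths_bp = [1000, 2000, 1000]

print(rpkm(counts, lengths_bp))
print(tpm(counts, lengths_bp))
```

The two measures differ only in the order of normalization: TPM values of a sample always sum to one million, which is not guaranteed for RPKM.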

In this work, we define NGS data as the information extracted from an NGS experiment (i.e., ChIP sequencing, DNA methylation, DNA sequencing, RNA sequencing), e.g., the counts of the reads that map on a given list of genes in RNA sequencing. We define NGS meta data as the information related to the NGS experiment and the sequenced tissue, e.g., the tissue status (tumoral, normal) or the sequencing depth. We define NGS data integration as the procedure of joining different experiments (possibly extracted from heterogeneous databases) that share common features (e.g., the same disease or patient under study) in order to extract knowledge. The aim of integration is to aggregate genomic data in a unique schema that provides querying capabilities for retrieving data from a multitude of heterogeneous experiments and databases. Data heterogeneity is the first problem of NGS, because the structure of the data differs across experiments and can differ across databases. Therefore, the term integration in NGS data can have different meanings [38]. On the one hand, integration answers the need for a uniform language that facilitates access to different genomic databases. On the other hand, data heterogeneity is caused by the experiment types and by the information that they carry. It is worth noting that non-uniformity of the data schema is present not only when considering different databases, but also when dealing with a single one.

We distinguish four conditions where NGS data integration can be performed: (i) different databases represent the same NGS experiment (e.g., RNA-Seq) with different data schemas; (ii) different experiments (e.g., DNA methylation and RNA-Seq) are stored in distinct databases; in this case there are two different data schemas, because the experiments need different representations, but no standardization of the schemas is defined that allows access to these experiments; (iii) the same problem exists within a single database, which contains different experiments and different data representation schemas. Finally, we consider an ideal case (iv) where a previously defined schema standardization allows the integration of different experiments that come from different databases or from the same database, and also provides interoperability between instances of the same experiment stored with different schemas. An example of this type of standardization is provided by [39] with the Genomic Data Model (GDM), which supports many NGS formats.

In order to have access to the right resources, it is necessary to define standard schemas for these data and to avoid redundant, overlapping information. Several efforts have been made on NGS data formats and standards. The authors of [40] provide the reader with an overview of the most widespread data formats for NGS and describe a set of standardization approaches for them. In [41], the Entrez search and retrieval system, used at the National Center for Biotechnology Information (NCBI) to access distributed heterogeneous data, is described. The authors of [42] present a text search engine to access data resources at the European Bioinformatics Institute (EMBL-EBI) and to help understand the relationship between different data types. Other implementations for bioinformatics data integration include retrieval systems like SRS [43] and integration tools for information fusion such as BioData Server [44]. The integration of genomic data involves multiple fields, i.e., bioinformatics, statistics, data mining, and classification. The open question is whether the integration of different types of NGS experiments offers additional knowledge about a disease like cancer [45].

In this work, we address the issue of combining RNA sequencing and DNA methylation experiments, which have different data schemas containing heterogeneous information. Our aim is to obtain a gene-oriented organization of both experiments, and therefore we define a new measure on DNA methylation data called gene methylation quantity. We combine RNA sequencing and DNA methylation data of The Cancer Genome Atlas (TCGA) [46] and test our method on genomic data related to three types of cancer: Breast Cancer, Kidney Renal Carcinoma, and Thyroid Carcinoma.

Additionally, we analyze the combined data by means of supervised classification algorithms. We extract classification models that distinguish the samples in two classes (tumoral and normal) and that are composed of features representing the genes related to the disease and the different NGS experiments.

In cancer research many computational methods deal with classification problems, e.g., disease characterization, prognosis, treatment response of patients, mutation pathogenicity, biomarker prediction, and sample malignancy. A recent effort has achieved good performance in the assignment of disease subtypes and malignancy labels to melanoma images with convolutional neural networks [47]. Further studies used typical machine learning methods [48], including Adaboost [49] and decision trees [50].

Among them, we focus on a new supervised learning method that is able to extract more knowledge in terms of classification models than state-of-the-art ones, called Classifier with Alternative and MUltiple Rule-based models (CAMUR) [51]. CAMUR is designed to find alternative and equivalent solutions for a classification problem by building multiple rule-based classification models. Standard classifiers tend to extract few rules with a small set of features for discriminating the samples, and interesting features may remain hidden from the researcher. Thanks to an iterative classification procedure based on a feature elimination technique, CAMUR finds a large number of rules related to the classes present in the dataset under study. CAMUR is based on: (i) a rule-based classifier, i.e., RIPPER (Repeated Incremental Pruning to Produce Error Reduction) [52]; (ii) an iterative feature elimination technique; (iii) a repeated classification procedure; (iv) a storage structure for the classification rules. The method iteratively computes a rule-based classification model through the RIPPER algorithm [52], deletes from the dataset the features that are present in the rules, and performs the classification procedure again, until a stopping criterion is met, i.e., the classification performance is below a given threshold or the maximum number of iterations has been reached. CAMUR has been implemented specifically for case-control studies that aim to identify subjects by their outcome status (e.g., tumoral or normal). In these data, the features correspond to the gene expressions of the samples, and the classes to the investigated diseases or conditions (e.g., tumoral, normal). The knowledge extracted by CAMUR consists of a set of rules composed of a given number of genes that might be relevant for a disease. CAMUR also includes an offline tool to analyze and interpret the computed results.

Thus the software consists of two parts: (i) the Multiple Solutions Extractor (MSE), which implements the iterative classification algorithm (i.e., for each iteration it deletes the selected features, performs the classification, and saves the extracted models); (ii) the Multiple Solutions Analyzer (MSA), a graphical tool for analyzing and interpreting the obtained results. CAMUR is available at http://dmb.iasi.cnr.it/camur.php as standalone software; for a comprehensive description we point the reader to [51].
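The iterative procedure described above (classify, remove the features used by the model, classify again until a stopping criterion is met) can be sketched in Python as follows. This is not the CAMUR implementation: a toy single-threshold rule learner (`learn_stump`, a hypothetical helper) stands in for RIPPER, accuracy is measured on the training samples for brevity, and the gene names and values are made up.

```python
def learn_stump(rows, labels, features):
    """Toy stand-in for RIPPER: best rule of the form 'gene > t => class'.

    Returns (accuracy, gene, threshold, class_if_above), or None if no
    candidate threshold exists.
    """
    best = None
    for f in features:
        values = sorted({r[f] for r in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            for above in ("tumoral", "normal"):
                below = "normal" if above == "tumoral" else "tumoral"
                hits = sum((above if r[f] > t else below) == y
                           for r, y in zip(rows, labels))
                acc = hits / len(rows)
                if best is None or acc > best[0]:
                    best = (acc, f, t, above)
    return best


def iterative_models(rows, labels, min_accuracy=0.9, max_iterations=10):
    """Iterative feature elimination: store each model, drop its features."""
    remaining = sorted(rows[0])          # feature (gene) names still available
    models = []                          # storage structure for the models
    for _ in range(max_iterations):      # stop: maximum number of iterations
        if not remaining:
            break
        model = learn_stump(rows, labels, remaining)
        if model is None or model[0] < min_accuracy:
            break                        # stop: performance below threshold
        models.append(model)
        remaining.remove(model[1])       # eliminate the gene used by the rule
    return models


# Illustrative toy dataset: GENE_A and GENE_B both separate the classes.
rows = [
    {"GENE_A": 1.0, "GENE_B": 5.0, "GENE_C": 2.0},
    {"GENE_A": 1.2, "GENE_B": 5.5, "GENE_C": 9.0},
    {"GENE_A": 8.0, "GENE_B": 0.5, "GENE_C": 3.0},
    {"GENE_A": 9.0, "GENE_B": 0.7, "GENE_C": 8.0},
]
labels = ["normal", "normal", "tumoral", "tumoral"]

models = iterative_models(rows, labels)
print([m[1] for m in models])
```

In this toy run a standard classifier would stop after the first rule on GENE_A, whereas the iterative loop also recovers the equivalent rule on GENE_B and then stops because GENE_C alone falls below the accuracy threshold.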

In this work, thanks to the application of machine learning algorithms, we show the advantage of combining DNA methylation and RNA sequencing data, i.e., the increase of extracted knowledge resulting in combinations of genes from both experimental strategies. Finally, we study the three types of cancer and identify sets of relevant genes. Their intersection yields a smaller set of genes that should be considered for further investigation.

Methods

In this section, we discuss the methods used to combine the genomic experiments (RNA sequencing, DNA methylation) and the classification algorithms used to extract knowledge from them. We start by describing the source from which we extract the data.

Data source: the Cancer Genome Atlas

The Cancer Genome Atlas (TCGA) [46] is a project that aims to create a major repository for cancer, including NGS experiments, to improve the ability to diagnose, treat and prevent cancer through a better understanding of the genetic basis of this disease. The TCGA database contains the genomic characterization and analysis of 33 types of cancer. Tissue samples are processed through different types of techniques such as gene expression profiling (i.e., RNA sequencing and microarrays); profiling of methylated DNA (i.e., DNA methylation obtained both with NGS techniques and microarrays); profiling of microRNA (i.e., miRNA sequencing); whole genome sequencing (i.e., DNA sequencing). We rely on the latest TCGA data release available at The Genomic Data Commons platform (http://gdc.cancer.gov/).

In TCGA, each DNA methylation sample is represented by a list of the following fields: gene symbol, chromosome and genomic coordinates (where the methylation occurs), and beta value (methylation value). RNA sequencing data instead contain the RSEM values [11] measured on the considered genes. It is worth noting that our approach handles previously normalized RNA sequencing gene expression data and DNA methylation data containing beta values, and that it can also be applied to DNA methylation and RNA sequencing data of other pathologies.

Data processing and combination

We create data matrices of RNA sequencing and DNA methylation experiments in the following way. Consider n samples (tissues), each one with m features (genes) and a class label (condition), which indicates whether the sample is normal or tumoral. A data matrix is composed of n vectors $F_i = (f_{i,1}, f_{i,2}, \ldots, f_{i,m}, f_{i,c})$, each representing sample $i$, where $f_{i,j} \in \mathbb{R}$; $i = 1, \ldots, n$; $j = 1, \ldots, m$; and $f_{i,c} \in \{\text{normal}, \text{tumoral}\}$. When considering RNA sequencing, the rows represent the samples, the columns represent the genes (except the last one, which holds the class labels), and the items of the matrix contain the RSEM gene expression values for each gene. The structure of this matrix is shown in Table 1. When considering DNA methylation, the rows of the corresponding matrix represent the samples and the columns represent the genes, while the items contain a new measure that represents the quantity of methylation associated with each gene, explained in the following. For DNA methylation, TCGA provides the beta values for each methylated site, so each sample has s methylated sites, l of them belonging to a given gene. To aggregate the methylation quantity at gene level, we consider the sum of the beta values as a measure of the overall intensity of the methylation on a gene. Let $a_{ijh}$ be the methylation quantity (i.e., the beta value) associated with sample $i$, $i = 1, \ldots, n$, gene $j$, $j = 1, \ldots, m$, and methylated site $h$, $h = 1, \ldots, l$. Then we have $b_{i,j} = \sum_{h=1}^{l} a_{ijh}, \forall i,j$. In the following, we refer to this new measure as gene methylation quantity. It is worth noting that we consider only the beta values of CpG sites with a related gene symbol, i.e., the symbol of the gene where the methylation occurs. If a methylation occurs in other genomic regions, it is not considered in our data processing procedure, whose aim is to provide a gene-oriented data organization. In Table 2 we show the structure of the DNA methylation matrix.

A software tool that performs the data extraction and the creation of the matrices is freely available at http://bioinf.iasi.cnr.it/genint. The flowchart that reports the computational steps of the software is depicted in Fig. 1.
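The aggregation of beta values into the gene methylation quantity can be illustrated with a short Python sketch. It is not the authors' software tool: the record layout `(sample, gene symbol, beta value)`, the function names, the gene symbols, and the numbers are assumptions made for illustration, and filling a missing sample-gene pair with 0.0 is a choice of this sketch rather than something stated in the text.

```python
from collections import defaultdict


def gene_methylation_quantity(sites):
    """Sum the beta values of all methylated sites per (sample, gene).

    `sites` is an iterable of (sample_id, gene_symbol, beta_value) records.
    Sites without a gene symbol are skipped, as in the procedure above.
    """
    b = defaultdict(float)
    for sample, gene, beta in sites:
        if not gene:
            continue
        b[(sample, gene)] += beta
    return dict(b)


def to_matrix(b, class_labels, genes):
    """One row per sample: gene methylation quantities, then the class label."""
    return {
        sample: [b.get((sample, g), 0.0) for g in genes] + [label]
        for sample, label in class_labels.items()
    }


# Illustrative records (made-up values, not TCGA data).
sites = [
    ("S1", "BRCA1", 0.25), ("S1", "BRCA1", 0.5), ("S1", "TP53", 0.125),
    ("S1", None, 0.9),  # no gene symbol: ignored
    ("S2", "BRCA1", 0.75), ("S2", "TP53", 0.25), ("S2", "TP53", 0.5),
]
class_labels = {"S1": "tumoral", "S2": "normal"}

b = gene_methylation_quantity(sites)
matrix = to_matrix(b, class_labels, genes=["BRCA1", "TP53"])
print(matrix["S1"])  # [0.75, 0.125, 'tumoral']
print(matrix["S2"])  # [0.75, 0.75, 'normal']
```

Because the measure is a sum rather than a mean, a gene with many methylated sites can reach a quantity larger than 1, even though each individual beta value lies between 0 and 1.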

Table 1 Structure of the RNA sequencing matrix
