- Open Access
- Open Peer Review
Within-sample co-methylation patterns in normal tissues
BioData Miningvolume 12, Article number: 9 (2019)
DNA methylation is an epigenetic event that may regulate gene expression. Because of this regulation role, aberrant DNA methylation is often associated with many diseases. Within-sample DNA co-methylation is the similarity of methylation in nearby cytosine sites of a chromosome. It is important to study co-methylation patterns. However, it is not well studied yet, and it is unclear to us what co-methylation patterns normal DNA samples have. Are the co-methylation patterns of the same tissue across several samples different? Are the co-methylation patterns of various tissues of the same sample different? To answer these questions, we conduct analyses using two sets of data: 3-sample-1-tissue (3S1T) and 1-sample-8-tissue (1S8T).
To study the co-methylation patterns of the two datasets, 3S1T and 1S8T, we investigate the following questions: How often does one methylation state change to other methylation states and how is this change associated with chromosome distance? Based on the 3S1T data, we find there is not significant co-methylation difference among the same spleen tissues of three different samples. However, the analysis results of 1S8T data show that there were significant differences among eight tissues of one sample. For both 3S1T and 1S8T data, we find that the no/low methylation state A and high/full methylation state D tend to remain the same along a chromosome region. We also find that the low/partial methylation state B and partial/high methylation state C tend to change to higher methylation states along a chromosome. Finally, we find that lengths of most co-methylation regions are very short with only a few hundred base pairs. In fact, only a small proportion of methylated regions are longer than 1000 base pairs.
In this paper, we have addressed a few questions regarding within-sample co-methylation patterns in normal tissues. Our statistical analysis results and answers may help researchers to better understand the biological process of DNA methylation. This may pave the way to develop better analysis methods for future methylation research.
Epigenetics is generally understood as the study of heritable changes that are related to gene functions and cannot be explained by changes in DNA sequences. These changes are frequently termed epigenetic events or marks. One main type of epigenetic event is DNA methylation, which is the addition of a methyl group to a 5′ cytosine base . DNA methylation plays an important role in transcription and thus affects gene expression [1, 2]. It also plays a significant role in genomic imprinting, X-chromosome inactivation, and suppression of repetitive element transcription and transposition [3,4,5,6,7]. Therefore, it is important to study different methylation patterns in both normal samples and the samples of complex diseases such as cancers.
In a human genome, DNA methylation often occurs at CpG or CG sites. A CG site is a cytosine base followed by a guanine base in DNA sequences. Because DNA methylation predominantly occurs at CG sites, it is important to study DNA methylation at or near CpG islands, which are regions with more cytosine or CG sites. CpG islands are defined as regions of the DNA sequence > 200 bp long with an average CG content > 50% and an observed-to-expected ratio of CG sites > 0.6 . These islands are important regulatory elements in the genome, and contain the most variation in DNA methylation across different tissues. Additionally, methylation of CpG islands in promoter regions can be associated with long-term silencing of gene expression [2, 9].
Researchers can study segments of methylation over a stretch of neighboring CG sites in the same chromosome region. This specific DNA methylation pattern is called co-methylation . It is also called within-sample (WS) co-methylation. The prefix “co-” in the “WS co-methylation” means local CG sites of one short chromosome region methylate or not methylate together in a single sample. That is, we study how neighboring CG sites are similarly methylated or unmethylated within a short chromosome region of one sample. It has also been observed that the “correlation between methylated cytosines decays as a function of genomic distance between methylated loci”. In fact, in Arabidopsis this correlation is observed for distances up to 5000 nucleotides . For human samples, Eckhardt et al. first report that this correlation or co-methylation is over short distances (≤ 1000 base pairs) and it deteriorates rapidly for distances > 2000 base pairs . However, this study is not conducted for the whole genome. In this paper, we will investigate in detail how long this co-methylation pattern can be in the human genome of normal samples using whole genome bisulfite sequencing data.
Before we introduce our study further, we emphasize that, in other research papers [13,14,15,16,17,18,19,20], co-methylation may mean “between-sample (BS) co-methylation”. The prefix “co-” in the “BS co-methylation” means that genes at different regions, especially different chromosomes, methylate together across multiple samples. For example, “BS co-methylation” may mean that two or more genes (e.g., genes 1 and 2) on different chromosomes are hypermethylated together and their functions are somehow related through a co-methylation (or co-expression) network/module [15, 19]. This co-methylation (or “BS co-methylation”) is similar to the concept of “co-expression” of genes. The above explanations and definitions of WS co-methylation and BS co-methylation is consistent with the ones defined in Peter Francis Hickey’s thesis , which has a thorough review of different co-methylation definitions. It is worth noting that although “WS co-methylation” is about the methylation pattern within a single sample, it is important to study “WS co-methylation” across multiple samples and this is the research focus of this paper. To simplify our writing, in this paper we often use “co-methylation” for “WS co-methylation”.
It is important to study within-sample co-methylation as it explains how DNA methylation is instituted in each genomic region in a chromosome. Deep understanding of within-sample co-methylation patterns can help researchers improve DNA methylation assays and statistical analyses of DNA methylation. Thus, we can better understand other genetic and epigenetic events or patterns. Although it is important to study DNA co-methylation, this specific methylation pattern has not been well studied yet. For example, it is unclear to us what co-methylation patterns normal DNA samples have. In this paper, we will conduct analyses to answer the following two questions regarding co-methylation patterns in normal DNA. First, are the co-methylation patterns of the same tissue across several samples different? Second, are the co-methylation patterns of various tissues of the same sample different? In order to answer these questions, we conduct analyses using two sets of data: one is 3-sample-1-tissue (3S1T), another one is 1-sample-8-tissue (1S8T). To study the co-methylation patterns of these two datasets, we specifically investigate how often one methylation state change to other methylation states and how is this change associated with chromosome distance. In the following sections, we first introduce the data we use. We then explain the two analysis methods we will use to answer the above questions. Finally, we will show our comprehensive analysis results.
To compare co-methylation patterns among different samples, we use a dataset of three distinct samples. The samples are referred to as STL001, STL002, and STL003 and each sample has whole genome bisulfite sequencing (WGBS) data from the spleen tissue. These datasets are obtained from the Roadmap Epigenomics Project . We refer to this dataset as 3-sample-1-tissue, or 3S1T. To compare co-methylation patterns among different tissues of the same sample, we use a dataset of eight distinct tissues of the sample STL001. The eight tissues are respectively referred to as bladder, gastric, lung, psoas, sigmoid colon, small bowel, spleen, and thymus. This dataset is obtained from Roadmap Epigenomics Project . We refer to this dataset as 1-sample-8-tissue, or 1S8T. For both 3S1T and 1S8T data, the raw WGBS reads are preprocessed and aligned using BRAT-bw , a publicly available software package, and the human genome hg19 is used as the reference genome. Because our focus is within-sample co-methylation for nearby sites or regions, not between-sample co-methylation in a whole genome, using one chromosome is sufficient to address the questions of interest. Therefore, we only focus on chromosome 1, as it is the longest chromosome. In fact, we use methylation ratios from all CG sites in chromosome 1 of each dataset.
Each methylation sequencing dataset includes four metrics: chromosome, position, sequencing coverage, and MC ratio, see Table 1. At a specific CG site, the “MC ratio” is defined as the number of reads with methylated cytosines divided by the total number of reads covering that CG site. It is similar to the “beta value” that is commonly used for the methylation signals of Illumina array probes. “MC ratio” ranges from 0 to 1 and is the proportion of sequenced reads of a CG site that are methylated. 0 indicates no methylation and 1 indicates full methylation. We then add a fifth metric—the methylation state of each CG site—to all datasets. “Methylation state” refers to each of our divisions of the MC ratio [0,1] into four intervals of methylation. These methylation states are A, B, C, and D. “A” corresponds to no/low MC ratios of [0, 0.25). “B” corresponds to low/partial MC ratios of [0.25, 0.5). “C” corresponds to partial/high MC ratios of [0.5, 0.75). “D” corresponds to high/full MC ratios of [0.75, 1]. CG sites with a sequencing coverage of less than 3 (i.e., <3X) are labeled “NA”. Note, as for the sequencing coverage, for all datasets (samples/tissues), around 95% of CG sites have at least 1X coverage. Only two samples/tissues have 89 and 91% of CG sites with ≥3X coverage. All other samples/tissues have about 95% of CG sites with ≥3X coverage. Lastly, we calculate the distance between consecutive CG sites and add these values in the final column of our data (Table 1). “Consecutive CG sites” mean two CG sites on the same chromosome and at the same DNA strand, that is, there is no other CG site between them. The first and second CG sites are determined based on the reference genome coordinates on the positive or forward strand. In this paper, we only use the positive or forward strand of the DNA.
We prepare the original data as described above, labeling the methylation state of each CG site and calculating distance between consecutive CG sites, so that we may use them in our two analyses. First, we investigate how frequently methylation states change from one CG site to another along a chromosome. Second, we investigate if this change is related to chromosome distance.
We would like to investigate patterns beyond the scope of single sample or tissue. Do the co-methylation patterns vary across different samples or different tissues? We apply our analyses to the different samples and tissues in two datasets, 3S1T and 1S8T. We compare co-methylation patterns across different samples in 3S1T and compare co-methylation across different tissues in 1S8T to answer these questions. The workflow of our analyses is shown in Fig. 1. Next, we explain in detail each analysis method.
In our first analysis, we analyze how methylation patterns change from one CG site to another. Since each CG site has a methylation state assigned to it according to its methylation level, we pair together consecutive CG sites along the genome to form a “methylation state pair”—the two states corresponding to a pair of CG sites. This allows us to observe how methylation can change along a chromosome region. The first state indicates the initial methylation level, while the second state indicates the terminal methylation level. Does methylation stay the same (e.g., AA, BB, CC, DD), relatively similar (e.g., AB, CD), or change drastically (e.g., AD, DA)? Do certain methylation states tend to change to other methylation states?
For each sample, we also calculate the frequency of the distinct state pairs (e.g., AA, AB, AC, AD). This allows us to compare the co-methylation patterns across different samples or tissues (see Chi-squared test below). We learned from our previous research that co-methylation patterns are related to the physical distance between two CG sites in a state pair. Thus, we calculate the frequency of each type of methylation state-pair at increasing intervals of a short distance of 50 bases, i.e., [0, 50), [50, 100), [100, 150), …, [400, 450), [450, 500), to [500, Inf). We then graph the frequencies of different samples or tissues with the same methylation state pair on the same graph to visually observe differences. We also utilize the chi-squared test to determine whether samples or tissues are significantly different. We first test all samples or tissues at once. If there is a significant difference, we further analyze the data by pairwise comparison to determine which combination of samples may be the cause of the difference. If the p-values are extremely small, we may use or combine approaches below to produce better analysis results.
We calculate the contribution of each methylation state pair and each sample to a Chi-squared statistic. This allows us to determine which sample and which methylation state pair is the most significantly different from the others. We display this information in bar graphs for visualization.
We divide our methylation state-pair count data (i.e., Chi-squared input) by factors of n (n = 10, 100, 1000) to account for large-count issues.
In our second analysis, we investigate how long methylation levels stay the same in a chromosome by analyzing the length of similarly methylated regions. We begin by grouping together consecutive CG sites that have the same methylation state (e.g., AAAA, BBBB, CCCC, or DDDD). These groups are called similarly methylated regions (SMRs). We then calculate the number of CG sites in each group (count) and the number of base pairs the group stretches across (region-length or length). These two values are used as metrics that we use to compare different samples or tissues.
Once we calculate counts and lengths, we summarize the counts and lengths of the four different SMRs: A, B, C, and D. We summarize SMRs with a count of at least two CG sites. We then determine whether SMRs from different samples or tissues with the same methylation level (i.e., AAAA in Bladder and AAAAA in Thymus) have significantly different distributions of lengths or of counts. We begin to analyze SMR count/length summaries across different samples or tissues. These summaries provide a preliminary assessment of whether SMR lengths and counts from different tissues or samples may differ. We then use the Kruskal Wallis test to confirm any significant differences, first comparing all samples or tissues and then conducting pairwise comparisons, if necessary.
3S1T data analysis results
First, we show the results of analyzing if different samples of the same tissue have different co-methylation patterns.
Analysis 1 result of 3S1T data
We have labeled and paired consecutive CG sites in the data to form methylation state pairs. Table 2 displays the number of times each methylation state pair occurs and the relative frequency of methylation state pair. With this summary, we can answer the questions about how methylation changes from one methylation state to another.
For the STL001 spleen data in Table 2, methylation state pair DD occurs the most (89.49%), followed closely by the methylation state pair AA (81.37%). This means that no/low (A) or high/full (D) methylation states in the STL001 spleen sample tends to stay within the same methylation state. For example, when looking at just row A, across columns A, B, C, and D, the no/low methylation state A tends to remain the same methylation state, no/low (A). On the other hand, low/partial methylation state B and partial/high methylation state C tends to change to higher methylation states: C and D. For high/full methylation state D, the methylation tends to remain high/full (D). There is not a large percentage of methylation state change from D to A (only about 1%). The STL002 and STL003 spleen samples follow similar patterns with some variation in the actual values.
Comparing co-methylation patterns among 3 samples of the spleen tissue (i.e., 3S1T data)
Since we have observed some variation in the co-methylation patterns among 3 samples of spleen tissue, we want to determine whether these differences are statistically significant. Firstly, we use a visual depiction to ballpark whether the differences are significant, see Fig. 2. Plots in Fig. 2 do not display an obvious difference in change patterns among the three spleen datasets. The three lines depicting each sample do not vary from each other greatly, except that there is a bit of variation between samples in the higher methylation states, e.g., AD, BD, CD, and DD state pairs.
Next, we use chi-squared tests to study whether there are statistically significant differences for the co-methylation patterns of three samples. The null (H0) and alternative (Ha) hypotheses of the chi-squared tests are listed below. H0: There is no difference for the co-methylation patterns among the spleen tissue of three samples. That is, the methylation state change patterns from each methylation state (e.g., A) to all states (A, B, C, D) are the same among the spleen tissues of three samples. Ha: There are differences for the co-methylation patterns among the spleen tissue of three samples. We use our original count data in Table 2 as the input for each sample to conduct the chi-squared test. We first test all three samples for differences. If the results for all three are significant, we will determine which subset of samples are different: we further test pairs of samples.
The chi-squared test results for all three spleen samples of 3S1T dataset are shown in Table 3. The p-values are very small, almost zero. As indicated above, we then conduct pair-wise comparisons between each two of the three samples and we still obtain extremely small p-values (data not shown), which shows that three spleen samples are statistically different. However, we suspect that there may be some outlying co-methylation patterns that causes our p-value to be extremely small. Therefore, we compare the chi-squared contributions of each of the co-methylation patterns, see Table 4 and Fig. 3. Table 4 shows that in each of the three samples, some methylation state pairs contribute more than others. For example, the DC methylation-state-pair contributions of spleen samples 1, 2, and 3 are significantly higher than other methylation state pairs. They are 4666.12, 1279.32, 1148.63, as shown in the fourth column of the bottom panel of Table 4.
In Fig. 3, the turquoise, dark green, and orange bars on the “by.sample” bar graphs display the primary contributions – how each methylation state pair contributes to the chi-squared statistic. The “by.sample” graphs can also indicate secondary contributions. That is, how much each individual sample contributes to the primary methylation state pair contributions. The blue, red, green and purple “by.methy.state” bar graphs display the primary contributions (per bar) – how each sample contributes to the entire chi-squared test statistic. The graphs also indicate secondary contributions within each bar. That is, how much each methylation state pair contributes to the primary sample contribution. The length of the bar or bar-segment corresponds to the amount of contribution to the overall chi-squared statistic by a specific sample or methylation state. Our analysis results show that for co-methylation patterns with methylation state pairs that start with B, C, and D, sample 1 or STL001 contributes the most (see the green or bright green bars in “B/C/D.contr.by.methy.state”). Thus, in regards to the co-methylation patterns of methylation state changes starting with B, C, and D, STL001 has the patterns that are the most different. However, for the co-methylation pattern of methylation state changes starting with A, sample 3 or STL003 is the most different one, see the last/fourth orange bar in the “A.contr.by.sample” of Fig. 3 top left panel.
Both the 3-sample chi-squared test result (i.e., small p-values in Table 3) and the above “contribution” Fig. 3 and Table 4 show that the three spleen samples have significantly different co-methylation patterns. These results seem to be contradictory with our intuitive findings shown in Fig. 2, that is, these 3 samples are not very different. This discrepancy, especially the very significant p-value may be caused by a large-count or large sample size effect, since chromosome 1 is very long, including 2,284,470 CG sites. Under the assumption that a subset of our chromosome is representative of the entire chromosome, we divide the analysis 1 count data (the left panel of Table 2) by 10s (by 1, 10, 100, and 1000). This helps to lessen the potential effect of a large sample size, as this “dividing” method creates a simulated 100, 10, 1%, and .1% of our data. For example, the STL001 AA count in Table 2 is originally 15,913, but after dividing by 10, it becomes 1591; After dividing by 100, it becomes 159.
Once we divide our method 1 data (the methylation state pair counts) by these factors of 10, we run the chi-squared test on the “modified” count data. We determine whether the p-values remain significant as the data pool gets smaller. If we see that the p-value becomes less significant as the sample size becomes smaller, we may conclude that the original p-values are not accurate due to a large sample size. The results will also tell us whether the methylation-change patterns between our three samples are significantly different. Our data for the chi-squared tests on all three spleen samples are shown in Table 5. We find that the test results become less significant as the sample size decreases. This shows that we do initially get a very small p-value because of the large sample size effect. We can determine that there is not a confirmation of a significant difference in the co-methylation patterns of three spleen samples. The patterns we observed in our Analysis 1 result may be applicable to other spleen samples, if confirmed by future studies with more normal spleen samples.
Analysis 2 results of 3S1T data
To analyze and compare how long SMRs are, we gather 6 summary values: the minimum, 1st Quartile, median, mean, 3rd Quartile, and maximum, for the number of CG sites (count) and distance in base pairs (length) for each SMR in each sample. Note, to avoid the impact of an extreme outlier, we remove the largest count/length and summarize our counts and lengths in Table 6. This table displays the distribution of count and length so that we can visualize and compare the distribution among our three spleen samples. Table 6 shows that the distributions of SMR length or CG-number count skew to the right. Most SMR are very short. The median SMR length is just about 100 to 120 base pairs for methylation state A and 300 to 450 base pairs for state D. The median count among SMRs is about 2 to 6 CG sites. Only a small proportion (< 25%) of the SMRs are longer than 1000 base pairs.
The summary in Table 6 will help us to determine whether there are any significant differences among the three samples. Table 6 shows that the right-skew patterns of the A, B, and C methylation-state count vary among three spleen samples. We also notice that the SMRs of methylation state D are an exception to Cokus et al’s paper , where they observe a correlation between methylated cytosines for distances up to 5000 bases in Arabidopsis. The D regions can reach up to 50,000 bases in a human sample. Next, we investigate more on this question: Are these distribution differences significant? We answer this question using Kruskal Wallis tests.
Comparing the distribution of the count and length of SMRs across three spleen samples
We use the Kruskal Wallis test to analyze whether there is a significant difference between the distribution of count and length of the SMRs for each methylation state of each sample. The null (H0) and alternative (Ha) hypotheses of the Kruskal Wallis test are listed below. H0: The CG count (or length) distribution of SMRs are the same among three samples. That is, there is no difference among the CG count distribution of the three samples. Ha: There are differences for the SMR CG count distribution among the spleen tissue of three samples. We use our original count data, which are summarized in Table 6, as input for each sample to conduct the Kruskal Wallis test. The test results are shown in Table 7. As displayed in this table, most of the p-values are very small, except for two p-values in the STL001vSTL002 comparison. This finding means that the three spleen samples are significantly different regarding the distribution of SMR length and the distribution of SMR CG-site count.
1S8T data analysis results
After addressing whether different samples of the same tissue have significantly different co-methylation patterns using the 3S1T data, we next show the results of investigating if different tissues of the same sample have significantly different co-methylation patterns using the 1S8T data.
Analysis 1 result of 1S8T data
We can analyze this dataset in two ways. First, we look at the overall theme for all eight tissues; we want to figure out which methylation state pair occurs most frequently. As shown in Table 8, for the bladder data, the AA methylation state pair has a percentage of 83.26% and the DD methylation state pair has a percentage of 86.54%. We notice that the different tissues display similar patterns; the AA and DD methylation state pairs have the highest percentage. We find that methylation tends to remain within the same methylation level if it is high or low. On the other hand, medium levels of methylation trend towards higher methylation states. This pattern continues throughout all eight tissues. However, there is some variation or difference among these eight tissues. We will investigate how closely these eight tissues follow the patterns we observed, so we graph percentage data for each tissue at certain distance intervals, see Fig. 4. This figure shows that there are some differences among those 8 tissues, especially for AD, BD, CD, and DD state pairs.
Next, we conduct chi-squared tests to compare the 8 tissues, see the top panel of “divide by 1” in Table 9. When comparing all eight tissues together, we find that there are statistically significant differences in the co-methylation patterns among 8 tissues. To avoid the large-count issue, we divide the input data by factors of 10: 1, 10, 100, and 1000 as we did for the 3S1T data. The division of the data by 100 seems to help the results to be more accurate. As for the division by 1000, the count (or expected count for the chi-square test) is less than 5, so we will not use this result. Instead, we use the results with the division by 100, which still shows there are small p-values, that is, the 8 tissues have significantly different co-methylation patterns.
We also try to break the chi-squared test results down into the contributions that each methylation state and/or tissue contributes to the chi-squared value. We would like to know if any one tissue/methylation state has a bigger effect on the chi-squared test, and if so, whether we should remove that tissue/methylation state as an outlier. We display our Analysis 1 chi-squared contribution using bar graphs, see Fig. 5. In this figure, we can see that the 8th tissue, thymus, contributes most to the difference. We can also see that the 4th tissue, Psoas, contributes occasionally. We remove the 8th tissue, thymus, from our dataset, and re-run the chi-squared test on our data to see if the results are more accurate. Our results are displayed in Table 10. We can see that the p-values for this test remain zero, even though we remove a tissue. So, we try to divide the chi-squared input of the seven tissues by factors of 10, similar to what we have done above, and observe the chi-squared results. The result of dividing by 100 still have significant difference among 7 tissues; however, the results of dividing by 1000 only show a significant difference for D states, but not for A, B, or C states.
Analysis 2 result of 1S8T data
We would like to analyze SMRs. We first group CG sites that are similarly methylated and determine the number or count of CG sites in each SMR and the length (in base pairs) of each SMR. We then create summaries of the CG site number count and length of SMRs, see Table 11. Similar to Table 6 (summary for 3S1T data), Table 11 shows that the distributions of SMR length or CG-number count skew to the right. Most SMRs are very short. The median SMR length is around 100 to 130 base pairs for methylation state A and 300 to 500 base pairs for state D of all tissues except thymus. The median count among SMRs is about 2 to 6 CG sites. Next we determine whether SMRs of the same methylation level have different counts/lengths.
We run the Kruskal Wallis test on all eight tissues to determine if the distribution of counts and distances of each type of SMR are significantly different. Our initial results are displayed below in Table 12. As we can see, the p-values for the count in columns C and D, and the p-values for distance in columns A, C, and D are all nearly zero. The remaining p-values are also extremely small. This shows that those 8 tissues are significantly different.
In summary, in the Result section, we have shown the results of analyzing two datasets, 3S1T and 1S8T, to address a few questions mentioned in the Introduction section.
Our research work is a specific study with a focus on the analysis of within-sample co-methylation patterns in normal DNA samples. The understanding of this specific type of methylation patterns may provide helpful information and insights on other methylation studies. These studies may include the analysis of methylation patterns for specific genes or regions (e.g., long non-coding RNAs ), identifying differential methylation [24,25,26,27], classifying large methylation data [28,29,30], integrating methylation with other data (e.g., gene expression data) [31,32,33], and analyzing pan-cancer DNA methylation [20, 34,35,36,37].
To the best of our knowledge, there is not much research work done on the WS co-methylation patterns of normal DNA samples, although there are a larger number of papers on BS co-methylation. Thus, in this paper, we conduct some preliminary analyses on WS co-methylation for normal samples. However, our research work has certain limitations. First, this study is only conducted on normal samples, but this type of within-sample analysis can be performed on other datasets, for example, cancer methylation data with multiple samples. In fact, we have submitted a paper on comparing the within-sample co-methylation between normal and cancer samples. In that paper, we find that the breast cancer sample’s co-methylation pattern is very different from the patterns reported in this manuscript. But the breast normal sample has a similar pattern as we found in the 1S8T data of the current study. Therefore, co-methylation patterns of other normal samples or other datasets could be very similar to what we showed as we used a relatively large number of different normal tissues in this paper. The second limitation is that our study is only based on basic statistical analysis. More complex probability models (e.g., Hidden Markov Models) and methods (e.g., Bayesian methods or machine learning) may be used to further investigate within-sample co-methylation. Developing a new method based these ideas and methods is beyond the scope of this paper. However, we do plan to implement these models and ideas in other projects in the near future. The third limitation is that we did not investigate the relationship between WS co-methylation and the genomic context and/or the topology of CG sites . It would be biologically meaningful to investigate in detail how WS co-methylation patterns are related to the genomic compartments, e.g., gene body, gene promoter, and CpG islands. To make it easier for other readers to interpret the co-methylation CG sites or patterns, we provide an annotation code (annotation.R) in the Additional File 1 that can report the location of identified CG sites, that is, which gene body or promoter it belongs to. Note, the Additional File 1 includes all R scripts we wrote for the analysis of this paper. In addition, it is also useful to study how these patterns are related to the density of CG sites (or CG clusters) in an entire genome and study how CG sites collaborate to maintain co-methylation patterns [38, 39]. These are all great aspects to investigate further in future research studies and we do plan to conduct research on these topics.
In order to confirm the validity of our findings, we have also conducted our analysis on the same tissues and samples from chromosome 2. We find similar patterns and can make the same conclusion. In order to avoid redundancy, we do not show the results of chromosome 2 data analysis and only show the results of analyzing chromosome 1 in this paper.
In this paper, we have conducted analyses to study the within-sample co-methylation patterns in normal DNA samples. We have investigated if the co-methylation patterns of the same tissue across several samples are different and if the co-methylation patterns of various tissues of the same sample are different. We have used two analyses methods to analyze two datasets, 3S1T and 1S8T. Based on the 3S1T data, we find there is not significant co-methylation difference among the same spleen tissues of three different samples. However, the analysis results of 1S8T data show that there were significant differences among eight tissues of one sample. For both 3S1T and 1S8T data, we find that the no/low methylation state A and high/full methylation state D tend to remain the same along a chromosome region. We also find that the low/partial methylation state B and partial/high methylation state C tend to change to higher methylation states along a chromosome. Furthermore, we find that the distribution of SMR length is skewed to the right and most SMRs are very short (with only a few hundred base pairs) and only a small proportion of SMRs are longer than 1000 base pairs. In this paper, we have addressed a few questions regarding within-sample co-methylation. Our answers and analysis results may help researchers to have a deep understanding of co-methylation patterns and thus to improve DNA methylation assays and statistical analyses for other methylation studies.
Similarly methylated region
Whole genome bisulfite sequencing
Lim DH, Maher E. DNA methylation: a form of epigenetic control of gene expression. Obstet Gynaecol. 2010;12:6.
Jones PA. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat Rev Genet. 2012;13(7):484–92.
Bell AC, Felsenfeld G. Methylation of a CTCF-dependent boundary controls imprinted expression of the Igf2 gene. Nature. 2000;405(6785):482–5.
Goll MG, Bestor TH. Eukaryotic cytosine methyltransferases. Annu Rev Biochem. 2005;74:481–514.
Hark AT, Schoenherr CJ, Katz DJ, Ingram RS, Levorse JM, Tilghman SM. CTCF mediates methylation-sensitive enhancer-blocking activity at the H19/Igf2 locus. Nature. 2000;405(6785):486–9.
Kitazawa S, Kitazawa R, Maeda S. Transcriptional regulation of rat cyclin D1 gene by CpG methylation status in promoter region. J Biol Chem. 1999;274(40):28787–93.
Mancini DN, Singh SM, Archer TK, Rodenhiser DI. Site-specific DNA methylation in the neurofibromatosis (NF1) promoter interferes with binding of CREB and SP1 transcription factors. Oncogene. 1999;18(28):4108–19.
Gardiner-Garden M, Frommer M. CpG islands in vertebrate genomes. J Mol Biol. 1987;196(2):261–82.
Jones PA, Baylin SB. The epigenomics of cancer. Cell. 2007;128(4):683–92.
Schatz P, Dietrich D, Schuster M. Rapid analysis of CpG methylation patterns using RNase T1 cleavage and MALDI-TOF. Nucleic Acids Res. 2004;32(21):1–7.
Cokus SJ, Feng S, Zhang X, Chen Z, Merriman B, Haudenschild CD, Pradhan S, Nelson SF, Pellegrini M, Jacobsen SE. Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature. 2008;452(7184):215–9.
Eckhardt F, Lewin J, Cortese R, Rakyan VK, Attwood J, Burger M, Burton J, Cox TV, Davies R, Down TA, et al. DNA methylation profiling of human chromosomes 6, 20 and 22. Nat Genet. 2006;38(12):1378–85.
Akulenko R, Helms V. DNA co-methylation analysis suggests novel functional associations between gene pairs in breast cancer samples. Hum Mol Genet. 2013;22(15):3016–22.
Busch R, Qiu W, Lasky-Su J, Morrow J, Criner G, DeMeo D. Differential DNA methylation marks and gene comethylation of COPD in African-Americans with COPD exacerbations. Respir Res. 2016;17(1):143.
Ma X, Sun PG, Zhang ZY. An integrative framework for protein interaction network and methylation data to discover epigenetic modules. IEEE/ACM Trans Comput Biol Bioinform. 2018. https://ieeexplore.ieee.org/document/8352852.
Ma X, Yu L, Wang P, Yang X. Discovering DNA methylation patterns for long non-coding RNAs associated with cancer subtypes. Comput Biol Chem. 2017;69:164–70.
Martin TC, Yet I, Tsai PC, Bell JT. coMET: visualisation of regional epigenome-wide association scan results and DNA co-methylation patterns. BMC Bioinf. 2015;16:131.
Wang F, Xu H, Zhao H, Gelernter J, Zhang H. DNA co-methylation modules in postmortem prefrontal cortex tissues of European Australians with alcohol use disorders. Sci Rep. 2016;6:19430.
Yang X, Shao X, Gao L, Zhang S. Systematic DNA methylation analysis of multiple cell lines reveals common and specific patterns within and across tissues of origin. Hum Mol Genet. 2015;24(15):4374–84.
Zhang J, Huang K. Pan-cancer analysis of frequent DNA co-methylation patterns reveals consistent epigenetic landscape changes in multiple cancers. BMC Genomics. 2017;18(Suppl 1:1045.
Hickey PF: The statistical analysis of high-throughput assays for studying DNA methylation: doctoral thesis, The University of Melbourne 2015. https://minerva-access.unimelb.edu.au/handle/11343/55699.
NIH Common Fund Epigenomics Program. http://commonfund.nih.gov/epigenomics.
Harris EY, Ponts N, Le Roch KG, Lonardi S. BRAT-BW: efficient and accurate mapping of bisulfite-treated reads. Bioinformatics. 2012;28(13):1795–6.
Shafi A, Mitrea C, Nguyen T, Draghici S. A survey of the approaches for identifying differential methylation using bisulfite sequencing data. Brief Bioinform. 2017;19(5):737–53.
Sun S, Yu X. HMM-fisher: identifying differential methylation using a hidden Markov model and Fisher's exact test. Stat Appl Genet Mol Biol. 2016;15(1):55–67.
Yu X, Sun S. Comparing a few SNP calling algorithms using low-coverage sequencing data. BMC Bioinf. 2013;14:274.
Yu X, Sun S. HMM-DM: identifying differentially methylated regions using a hidden Markov model. Stat Appl Genet Mol Biol. 2016;15(1):69–81.
Celli F, Cumbo F, Weitschek E. Classification of large DNA methylation datasets for identifying Cancer drivers. Big Data Res. 2018;13:21–8.
Weitschek E, Cumbo F, Cappelli E, Felici G, Bertolazzi P. Classifying big DNA methylation data: a gene-oriented approach. Commun Comput Inform Sci. 2018;903:12.
Zhao Z, Pompili D. Walsh-hadamard transform of DNA methylation profile for the classification of human cancer cells. In: ICBCB '17 proceedings of the 5th international conference on bioinformatics and computational biology; 2017. p. 4.
Cappelli E, Felici G, Weitschek E. Combining DNA methylation and RNA sequencing data of cancer for supervised knowledge extraction. BioData Min. 2018;11:22.
Li C, Lee J, Ding J, Sun S. Integrative analysis of gene expression and methylation data for breast cancer cell lines. BioData Min. 2018;11:13.
Ma X, Liu Z, Zhang Z, Huang X, Tang W. Multiple network algorithm for epigenetic modules via the integration of genome-wide DNA methylation and gene expression data. BMC Bioinformatics. 2017;18(1):72.
Saghafinia S, Mina M, Riggi N, Hanahan D, Ciriello G. Pan-Cancer landscape of aberrant DNA methylation across human tumors. Cell Rep. 2018;25(4):1066–1080 e1068.
Tang B. DMAK: a curated pan-cancer DNA methylation annotation knowledgebase. Bioengineered. 2017;8(2):182–90.
Witte T, Plass C, Gerhauser C. Pan-cancer patterns of DNA methylation. Genome Med. 2014;6(8):66.
Yang X, Gao L, Zhang S. Comparative pan-cancer DNA methylation analysis reveals cancer common and specific patterns. Brief Bioinform. 2017;18(5):761–73.
Lovkvist C, Dodd IB, Sneppen K, Haerter JO. DNA methylation in human epigenomes depends on local topology of CpG sites. Nucleic Acids Res. 2016;44(11):5123–32.
Haerter JO, Loevkvist C, Dodd IB, Sneppen K. Collaboration between CpG sites is needed for stable somatic inheritance of DNA methylation states. Nucleic Acids Res. 2014;42(4):2235–44.
This project was completed with the use of Texas State University facilities and resources. This work was also supported by Dr. Sun’s Texas State University Research Enhancement Program (REP) grant. We appreciate that Brittany Sue Enfield helped us find the data used in this paper when she was working on her honors thesis with Dr. Sun at Texas State University.
This project is supported by the Texas State University Research Enhancement Program.
Availability of data and materials
Datasets used in this paper are publicly available with detailed information shown below. In particular, we provide the web links for the raw sequencing data, on which all SRA (Short Read Archive) files are available, see below. These links are useful for readers like us, who would like to process raw sequencing reads to obtain methylation signals by themselves. Besides the web links of SRA files, we also provide GEO (Gene Expression Omnibus) web links, on which the processed methylation signal datasets (e.g., hg19 version wig format files) are available now or may be available later. Note, for some of the GEO web links, the processed methylation signals are not available yet. When we search the processed methylation signal data, we find that the original investigators who generated these datasets mentioned that “various levels of processed data files will be made available as this project proceeds”. Therefore, readers who are interested in the processed methylation data may check the GEO web sites later. In addition, readers can also visit http://www.encodeproject.org, click “Experiments”, and then select related search items (e.g., WGBS and spleen) to find more information about a specific dataset.
# 3S1T data (3 spleen tissue of STL001, STL002, and STL003)
SRX388737: Whole Genome Shotgun Bisulfite Sequencing of Spleen Cells from Human STL001
SRX388746: Whole Genome Shotgun Bisulfite Sequencing of Spleen Cells from Human STL002
SRX175355: Whole Genome Shotgun Bisulfite Sequencing of Spleen Cells from Human STL003
# 1S8T data (8 tissues of STL001)
SRX213279: Whole Genome Shotgun Bisulfite Sequencing of Bladder Cells from Human STL001
SRX388733: Whole Genome Shotgun Bisulfite Sequencing of Gastric Cells from Human STL001
SRX388734: Whole Genome Shotgun Bisulfite Sequencing of Lung Cells from Human STL001
SRX388735: Whole Genome Shotgun Bisulfite Sequencing of Psoas Cells from Human STL001
SRX175348: Whole Genome Shotgun Bisulfite Sequencing of Sigmoid Colon Cells from Human STL001
SRX175349: Whole Genome Shotgun Bisulfite Sequencing of Small Bowel Cells from Human STL001
SRX388737: Whole Genome Shotgun Bisulfite Sequencing of Spleen Cells from Human STL001
SRX190151: Whole Genome Shotgun Bisulfite Sequencing of Thymus Cells from Human STL001
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
All R scripts for the analysis of this paper. (ZIP 20 kb)