How do alignment programs perform on sequencing data with varying qualities and from repetitive regions?
BioData Mining volume 5, Article number: 6 (2012)
Next-generation sequencing technologies generate a significant number of short reads that are utilized to address a variety of biological questions. However, quite often, sequencing reads tend to have low quality at the 3’ end and are generated from the repetitive regions of a genome. It is unclear how different alignment programs perform under these different cases. In order to investigate this question, we use both real data and simulated data with the above issues to evaluate the performance of four commonly used algorithms: SOAP2, Bowtie, BWA, and Novoalign.
The performance of different alignment algorithms are measured in terms of concordance between any pair of aligners (for real sequencing data without known truth) and the accuracy of simulated read alignment.
Our results show that, for sequencing data with reads that have relatively good quality or that have had low quality bases trimmed off, all four alignment programs perform similarly. We have also demonstrated that trimming off low quality ends markedly increases the number of aligned reads and improves the consistency among different aligners as well, especially for low quality data. However, Novoalign is more sensitive to the improvement of data quality. Trimming off low quality ends significantly increases the concordance between Novoalign and other aligners. As for aligning reads from repetitive regions, our simulation data show that reads from repetitive regions tend to be aligned incorrectly, and suppressing reads with multiple hits can improve alignment accuracy.
This study provides a systematic comparison of commonly used alignment algorithms in the context of sequencing data with varying qualities and from repetitive regions. Our approach can be applied to different sequencing data sets generated from different platforms. It can also be utilized to study the performance of other alignment programs.
The great demand for efficient, inexpensive, and accurate sequencing has driven the development of high-throughput sequencing technologies from automated Sanger sequencing to next-generation sequencing (NGS) over the past several years. Currently, NGS technologies are capable of producing low-cost data on a giga base-pair scale in a single run, which usually includes millions of sequencing reads. This ability makes the NGS technology a powerful platform for various biological applications, such as genetic variant detection by whole-genome or target region resequencing, mRNA and miRNA profiling, whole transcriptome sequencing, ChIP-seq, RIP-seq and DNA methylation studies. The first step of nearly all these applications is to align sequencing reads onto a reference genome. Thus, in order to obtain any further genetic information from sequencing data, the requirement of fast and accurate alignment tools has to be a priority .
In parallel with the rapid growth of new sequencing technologies, many alignment programs [2–20] have been developed, including MAQ, Novoalign (http://www.novocraft.com), SOAP, Bowtie, and BWA. Among all these five aligners, MAQ is the only one that indexes the reads, while all other aligners build indexes on a reference genome. In terms of the indexing algorithms they adopt, MAQ and Novoalign are two alignment programs that build an index with a hash table. To identify inexact matches in short-read alignments, MAQ uses a split strategy while Novoalign adopts an alignment scoring system based on the Needleman-Wunsch algorithm . SOAP2 employs a similar split strategy as MAQ in identification of inexact matches. Instead of using a hash table, SOAP2 adopts the FM-index algorithm  to build an index, which greatly reduces the alignment time for substrings with multiple identical copies. Bowtie and BWA are two other alignment programs developed based on the FM-index method that uses a backtracking strategy to search for inexact matches. These programs serve as relatively efficient and accurate tools in aligning large number of reads, and greatly extend the scale and resolution of sequencing technology applications.
New challenges for alignments have arisen from applying sequencing technologies to address different biological questions. For example, how do reads with various sequencing qualities affect alignment results? How do they deal with the reads that can be mapped to multiple locations on a reference genome? In order to answer these questions, we select four commonly used aligners (SOAP2, Bowtie, BWA, and Novoalign), and conduct a systematic analysis to evaluate the performance of these programs. First, we review and compare the algorithms these alignment programs employ as well as their advantages with respect to the major options they provide. Then, we use two sets of real Illumina sequencing data and two sets of simulated data to study how different alignment programs perform on sequencing data with varying quality and from repetitive regions. The performance is measured in terms of 1) concordance between any pair of the aligners, and 2) accuracy in simulated read alignment. We have demonstrated that, for sequencing data with reads that have relatively good quality or have had the low quality bases trimmed off, all four alignment programs perform similarly. Furthermore, we show that trimming off low quality ends markedly increases the number of aligned reads and improves the consistency among different aligners as well, especially for low quality data. However, Novoalign is more sensitive to the improvement of data quality. As for aligning reads from repetitive regions, our simulated data show that reads from repetitive regions tend to be aligned incorrectly, and suppressing reads with multiple hits can improve alignment accuracy.
Reviewing the features of alignment programs
Hash table and suffix tree are two major indexing algorithms that current alignment programs use. Hash table indexing, which was first introduced into the field of alignment by BLAST , keeps the positions of k-mer query subsequence as keys, and then searches for the exact match of the keys in reference sequences. It consumes less space since it builds an index for positions of sequences instead of the sequences themselves. Among different suffix tree algorithms, FM-index is based on the Burrows-Wheeler transforms (BWT) . BWT is a reversible permutation of characters in a text. It transforms the original character string into a more compressed format, where the same characters are placed side by side as a cluster, rather than in a scatter pattern. Out of the four alignment programs we are interested in, Novoalign adopts a hash table algorithm, while SOAP2, Bowtie, and BWA adopt the FM-index (Table 1).
To find inexact matches, alignment programs allow a certain number of mismatches using different strategies (Table 1). SOAP2 uses a split-read strategy to allow at most two mismatches. A read will be split into three fragments, such that the mismatches can exist in, at most, two of the three fragments at the same time. Bowtie uses a backtracking strategy to perform a depth-first search through the entire space, which stops until the first alignment that satisfies specific criterion is found . Similar to Bowtie, BWA also adopts a backtracking strategy to search for inexact matches. However, the search in BWA is bounded by the lower limit of number of mismatches in the reads. With this limit better estimated, BWA is able to define a smaller search space, and thus make the algorithm more efficient . Moreover, BWA provides a mapping quality score for each read to indicate the Phred-scaled probability of the alignment being incorrect. This mapping quality score incorporates base qualities, number of mismatches, and the repeat structure. The higher the mapping quality score, the more accurate an alignment is. A zero will be assigned if a read is aligned to at least two locations with equal probabilities. On the other hand, Novoalign first finds candidate alignment positions from the reference genome for each read, and calculates alignment scores for these positions using the Needleman-Wunsch algorithm, based on base qualities, the existence of gap, and ambiguous codes (Ns). This alignment score is 10log10(q), where q represents the probability of observing the read sequence given a specific alignment location. This score corresponds to the parameter setting “-t” at the command line when running Novoalign which finds the best alignment with the lower score and any other alignments with similar scores. Because of this alignment-score-based search algorithm, users cannot define the number of allowed mismatches in each alignment, but they can set up a threshold of alignment scores.
We also summarize the major options that the four alignment programs provide (Table 2). SOAP2, Bowtie, BWA, and Novoalign all allow pair-end alignments, enable the identification of the best alignment, and incorporate certain ways of trimming low quality bases (Table 2). There are some characteristics unique to certain aligners. For example, in BWA, the maximum number of allowed mismatches is sensitive to the length of reads. If less than 4% of m-long reads with 2% uniform base error rate have more than k mismatches, then the maximum number of allowed mismatches in these reads is set to be k. Thus, for our simulated data with 50-bp-long reads, k = 3. For the real NGS data with 68-bp-long reads, k = 4 (Table 2). This number may vary depending on the length of reads after trimming. Unlike the other programs, Novoalign does not allow the users to define the number of allowed mismatches. However, this parameter can be set indirectly by defining the threshold of the alignment score. In practice, setting the threshold at ‘-t 60’ will be approximately equivalent to allowing two mismatches at high quality base positions and maybe one mismatch at a low quality position.
Alignment performance evaluation
In order to examine how the four selected alignment programs (i.e., SOAP2, Bowtie, BWA, and Novoalign) perform on real sequencing data with varying quality, we use two single-end Illumina sequencing data sets (S1 and S2). S1 and S2 are sequenced from human colon cancer samples. For each of these two samples, about 3000 exons selected from cancer related genes are captured and sequenced by the Illumina sequencer, with 7,406,247 and 5,398,566 68-bp-long single-end reads generated respectively. We process the reads with ShortRead package inside of Bioconductor (http://www.bioconductor.org) to evaluate the bases quality. The plot of base qualities suggests that S1 has an overall better quality than S2. For example, S2 has more low quality bases at the 3’ end. In particular, the last 10 bases have average quality score less than 20 (Figure 1).
In order to examine how the four selected alignment programs perform on sequencing data obtained from repetitive regions, we simulate two sets of data from human genome 18: 1) 138771 50-bp-long reads are generated from about 3000 exon regions from which the real data sets S1 and S2 are generated, and these 3000 exon regions do not have many repetitive regions; 2) 55018 50-bp-long reads are generated from 218 CpG islands. These 218 CpG islands are selected from 28226 CpG islands along the whole genome, by the criteria that each chosen CpG island must have at least 25% repetitive bases, and these repetitive bases must be at least 50 bp in length. Note that the purpose of this article is not to simulate reads from all repetitive regions in a genome. It is difficult to precisely define repetitive regions in a genome. Therefore, we simply choose to use some CpG islands that have some level of repetitive regions.
In order to mimic the situation that one or two single bases are mismatches due to pure sequencing errors or true novel single nucleotide variants (SNV), we design the following four scenarios in our simulation data:
randomly set one mismatch for each read and let all bases have high quality;
randomly set two mismatches for each read and let all bases have high quality;
randomly set one low quality mismatch for each read and let all other bases have high quality;
randomly set two low quality mismatches for each read and let all other bases have high quality.
and 2) are the cases that the mismatches are likely due to the existence of novel SNVs. 3) and 4) are the cases that the mismatches are likely due to pure sequencing errors. Low quality bases are the ones with Phred quality score ranging from 5-15; high quality bases are the ones with Phred quality score ranging from 30-40. Phred quality score is defined as -10*log10(p), where p is the base-calling error probability.
Trimming, alignment, and evaluation
In order to evaluate how different alignment programs perform in sequencing reads with low quality ends, we use the four alignment algorithms to map S1 and S2 before and after trimming off the low quality bases using BRAT trim , respectively. In particular, we set the parameters of BRAT as trimming from both the 5’ and 3’ ends until reaching a base with quality score higher than 20, and allowing at most two Ns in each read. The length of the trimmed reads is at least 24 bases, and the majority of them are larger than 50 bases. We then perform alignments against the human genome 18 on trimmed and non-trimmed S1 and S2 data using SOAP2, Bowtie , BWA , and Novoalign (http://www.novocraft.com), respectively. For the purpose of this study, we set the parameters in all four alignment programs as follows: 1) At most two mismatches are allowed in SOAP2, Bowtie, and BWA for each alignment. Due to the different alignment searching algorithm that Novoalign uses, we set the parameter t at 60 to allow approximately up to two mismatches, and then choose the alignment reads with no more than two mismatches using the NM tag in the output. 2) Randomly report one alignment for each read, or only report reads with unique alignments. For each alignment result we calculate the percentage of aligned reads. The performance of four alignment programs is measured in terms of concordance between any pair of aligners because no known truth for real sequencing data is available. In particular, for each pair of aligners, aligned reads are assigned into four classes as follows:
Class 1: a read is aligned to the same location by both aligners that we are comparing (e.g., aligner 1 and aligner 2);
Class 2: a read is aligned to different locations by both aligners;
Class 3: a read is only aligned by one of the two aligners (e.g., aligner 1);
Class 4: a read is only aligned by the other aligner in a comparison pair (e.g., aligner 2).
If two alignment algorithms perform similarly, there should be a relatively small number of reads in class 2, 3 and 4 as shown in Figure 2.
In order to evaluate how different alignment algorithms perform on data containing reads generated from regions with more repetitive sequences, we use two simulated data sets. One data set is simulated from the 3000-exon regions that do not have a lot of repetitive bases and the other one is from 218 selected CpG islands that have many repetitive bases. For both simulated data sets, we align these reads using the four selected alignment programs. While aligning these simulated reads, all parameters are set the same as the ones used in real NGS data, except that we allow one mismatch for those data sets with only one mismatch simulated. Because we know the position from which each simulated read is generated, the performance of the four alignment tools is measured in terms of the accuracy of simulated read alignment. We define a true alignment as a situation when a read is aligned back into the same position from which it was generated. In addition, a false alignment is defined as a read that is aligned to other positions rather than the one from which it was generated.
Benchmark of aligners
To assess the speed of index building and reads mapping in these four aligners, we use the non-trimmed S1 data, which has 7.4 million 68-bp-long single-end reads. We align these reads using the human genome 18 as a reference, with at most two mismatches allowed and one alignment randomly reported for each read (Table 3). Novoalign is extremely fast (4.02 min) at index building, while the other three take more than one hour to finish the same job (Table 3). As for reads mapping, SOAP2 and Bowtie have a similar number of reads mapped although SOAP2 takes 6 minutes less than Bowtie. BWA maps 76.12% of all reads, which is slightly more than SOAP2 and Bowtie, within 26.4 minutes. Novoalign, on the other hand, is much more time-consuming. It takes 62.9-minute CPU time to align 73.64% of reads in single-end mode.
Aligners’ performance on sequencing data with different qualities
For the data set S1 that has relatively good quality, all four aligners generally show a good concordance, without trimming off low quality bases. A similar number of reads is aligned by each aligner (Table 4). Over 95% of the reads are assigned into class 1 (i.e., more than 95% of reads are aligned to the same locations by both aligners) when comparing SOAP2, Bowtie, and BWA, pairwise, while Novoalign shows slightly less agreement (84-88%) with the other three aligners (Table 5). However, for the S2 data set that has very low quality bases at many reads, the comparison results are quite different. In non-trimmed data set S2, when Novoalign is compared with any of the other three aligners, less than 50% of reads are assigned to class 1 (i.e., less than 50% of reads are aligned to the same locations by both aligners), but 15% of the reads are assigned to class 2 (i.e., 15% of reads are aligned to different locations by two aligners), and over 30% are assigned to classes 3 or 4 in total (i.e., about 30% of reads are aligned by only one of the two aligners), respectively (Table 6). That means for Novoalign and any other aligner (SOAP2, Bowtie, or BWA), only 50% of all aligned reads are mapped by both of them. This inconsistency of Novoalign’s performance in different data sets might result from the fact that S2 has overall lower quality than the S1 data set (Figure 1). To further investigate the effect of sequencing quality, we trim both S1 and S2 data sets with BRAT trim, and then do alignment using the four aligners.
Performing trimming on NGS data not only cuts off the low quality bases from both ends, but also discards poor quality reads, and thus improving reads’ quality markedly. After trimming, 399,442 (5.4%) and 204,911 (3.8%) reads are discarded from S1 and S2 data, respectively. With slightly fewer reads available for alignment, however, the number of aligned reads is increased by 15-17% in the S1 data, and 34-42% in the S2 data, for all four-alignment programs. This apparent difference in the magnitude of increase indicates that trimming has a greater effect on the S2 data set than on the S1 data set. Another interesting observation is that, Novoalign aligns 42% more reads in trimmed S2 than non-trimmed S2, while this increment in the other three aligners is only about 35%, suggesting that data quality improvement has a larger effect on Novoalign.
By trimming off the reads before alignment, we observe a substantial increase in the number of reads that fall into class 1 in all pair-wise comparisons, in both S2 and S1 data sets (Tables 5, 6, 7, and 8). That is, more reads are aligned to the same locations by the comparison pair. This increase indicates an improved concordance among the four aligners. Moreover, trimming appears to have a greater effect on S2, a data set with lower quality, than the S1 data set. In the pair-wise comparisons between Novoalign and any of the other three aligners for S2 data set, the number of reads assigned to the first class increases almost 3-fold, (1.2 million vs. 3.7 million), while the number of reads that are only aligned by the opponents of Novoalign become markedly less (see class 2 of Table 8), compared to non-trimmed alignments (see class 2 of Table 6). On the other hand, in the S1 data set, trimming only improves the agreement between Novoalign and the other three aligners by 8-10% (Tables 5, 7). This differentiation in the magnitude of concordance improvement, along with the fact that performing trimming leads to a more significant improvement in reads’ quality for S2 data set, further indicates that Novoalign is more sensitive towards the changes in sequencing quality.
Aligners’ performance on reads with multiple alignments
To evaluate these aligners in terms of their performance on reads with multiple alignments, we set the alignment parameters in two different ways: (1) randomly report one alignment for each read and (2) only report the read with a unique position (suppress reads that can be aligned to multiple locations). Compared to the former strategy, the latter discards around 4-10% of aligned reads from S1 and 2.5-8% in S2 (see Table 4).
In pair-wise comparisons among all four aligners, we find that in both S1 and S2 data sets, suppressing multiple alignments decreases the number of reads aligned to different positions (class 2) in all comparison pairs, while the number of reads aligned to same positions (class 1) stays the same (Tables 5, 6, 7, 8). Reads with multiple alignments are more likely to be aligned to different locations by different aligners, due to the difference in alignment strategies these aligners employ, as well as the standards of how to randomly choose one alignment to report. Therefore, the number of reads assigned to class 2 during comparison is reduced by suppressing multiple alignments. Next, we will use simulated data to investigate further.
Aligners’ performance on simulated data
In order to study the four aligners’ performance on reads from repetitive regions, we use the two sets of simulated data as mentioned in the Data Set subsection. One data set is simulated from 3000 exon regions that do not have many repetitive bases. The other data set is from 218 selected CpG islands that have a lot of repetitive bases. In these two simulated data sets, no matter whether the mismatch positions are designed to have high or low quality, all four aligners show a lower false alignment rate in the data set generated from 3000 exon regions (0.7-5%, see Table 9A, B) compared to the data set generated from 218 CpG islands that have more repetitive regions (14-17%, see Table 10A, B). Since the reads from regions with repetitive bases have a much higher probability of being aligned onto multiple locations, we can predict that suppressing multiple hits can help to diminish the false alignments caused by repetitive bases. As expected, the alignment accuracy in CpG island simulation data is substantially improved by suppressing multiple alignments (Table 10A, B).
By assigning mismatches with high quality, we mimic the true novel variants that are more likely to have better quality. By assigning mismatches with low quality, we mimic the pure sequencing errors. In both cases, SOAP2, Bowtie, and BWA are found to have similar false alignment rates no matter whether the alignment report is randomly reporting one alignment or suppressing reads with multiple hits (Tables 9 and 10). However, Novoalign exhibits higher false alignment rates compared to the other three aligners.
Trimming off the low quality ends of reads improves their quality, and thus improves their alignment results. Although the number of reads available for alignments decreases after trimming, we still observe an increase in the number of successfully aligned reads as well as in the concordance among aligners. S1, with a higher mean and a smaller deviation of base quality score, clearly has better quality than S2 (Figure 1). Thus, it is predictable that trimming has a greater effect on the S2 data set than on the S1 data set, which has been shown by our data analysis. Having a lower quality at the 3’ end is a commonly observed problem in single-end sequencing data, especially in the early version of the Illumina sequencer. By trimming, which only takes a few minutes to process for a data set with several million reads, users can benefit greatly. For example, more information can be extracted from the data since more reads will be aligned after trimming. With the improvement in alignment quality and quantity seen here, we recommend trimming prior to any alignment and downstream analysis, especially for poor quality data.
In the better quality data set S1, Novoalign performs similarly to SOAP2, Bowtie, and BWA, no matter which set of parameters we use. However, in the lower quality data set S2, Novoalign shows patterns that are different from the other three aligners. For example, Novoalign aligns more reads than the others and shows a greater increase in the number of aligned reads after trimming (Table 5). This might be due to the differences in alignment algorithms between Novoalign and the others. As we have shown, in SOAP2, Bowtie, and BWA, the alignment strategy is restrained by the number of mismatches allowed. That means users can specify the number of mismatches they prefer for any alignment process to obtain optimal results for their purpose. Unlike the other three, Novoalign uses an alignment score as a criterion. This alignment score is calculated based upon the base qualities, the existence of gaps, and the ambiguous codes for the entire read. For Novoalign, setting the threshold of the alignment score “-t” at 60 in the command line ensures that only the alignments with an alignment score of no more than 60 are reported, which is approximately equivalent to allowing two mismatches in alignment. However, this is only the case when the quality of reads is within a reasonable range. When applying these aligners to poor quality data sets, such as S2, Novoalign may become more sensitive to the quality and therefore show quite different results as compared to SOAP2, Bowtie, and BWA. After trimming off the low quality ends, the quality of reads has been improved. Thus, the Novoalign results become more similar to the others.
Since the alignment results may be sensitive to the choice of the alignment score threshold, especially for the lower quality data S2, we explore the impact of this parameter “-t” in Novoalign by setting it at different values: default, 60, 70, and 75. For both S1 and S2 data sets, ‘default value’ decreases the concordance of Novoalign with other aligners dramatically; using 70 and 75, the concordance of Novoalign and other aligners is similar as the one using 60. Therefore, we conclude that the pattern of lower concordance of Novoalign with others in a poor quality data set is not due to improper parameter choice.
Other than Novoalign, Bowtie also allows users to have the option of considering the qualities of mismatches. It enables users to set the maximum permitted total of quality values at all mismatched positions throughout the entire alignment (i.e., the “-e” option when setting parameters to run Bowtie). To investigate this parameter setting in Bowtie, we both allow 2 mismatches and set the parameter “–e” at 20, 40, 60, and 80 respectively (data not shown). For our data sets, when the “-e” parameter is set at 40, 60 and 80, there are nearly identical results as compared to the output from only setting the number of allowed mismatches at 2 (i.e., “-v 2”). But setting “–e” at 20 shows severe departures from other three aligners. In our data sets, most reads have moderate to good quality scores. However, setting “–e” at 20 only allows extremely low quality mismatched positions, and therefore, rules out the majority of reads with high quality mismatched positions.
Like trimming off the reads, suppressing multiple alignments also improves the consistency among the three aligners (Table 6). Out of the multiple locations of the reference genome that one read can be aligned onto, only one is true. Even though all aligners can choose one alignment for each read, based on a certain standard, there is no guarantee that the one they choose represents the true location. Thus, eliminating all reads having multiple alignments will help improve the accuracy of alignments and also the consistency among the four aligners. Our analysis resulting from the S1 and S2 data sets supports this conclusion. We design one simulated data set that contains many repetitive bases. By eliminating reads with multiple alignments, the false alignment rate decreases to almost 0 for SOAP2, Bowtie, and BWA, and below 9% for Novoalign (Table 10).
In addition to the trimming and initial parameter setting of aligners, we also investigate the impact of filtering the alignments based on the mapping quality score provided in the output files of different aligners. Out of the four aligners, BWA and Novoalign both have a mapping quality score reported for each alignment. For BWA, this score is approximately a Phred-scaled probability of the alignment being incorrect, which takes the values of 37, 25, and any value between 23 and 0. In general, a score of 37 means the read is aligned to a unique position with less than 2 mismatches; a score of 25 means the read is aligned to a unique position with 2 mismatches; a score between 23 and 0 means the read is aligned to multiple locations, such that a lower score means that the mapped location is less accurate (based on BWA source code). For Novoalign, the mapping quality score is the probability of the alignment given the read and genome, which ranges from 0 to 150. Higher scores mean better alignment qualities. To explore the effect of quality score filtering, we checked the mapping quality scores in the untrimmed S1 and S2 data with one alignment reported randomly (Figure 3). The distribution of scores shows that both aligners yield alignments with high mapping quality scores. For Novoalign, the majority of reads have a mapping quality score of 150 (Figure 3A and B), which is the upper limit of the score. While for BWA, the majority of reads have a score of 37 or 25 (Figure 3C and D), which means each of them is explicitly aligned to a unique position with 0 to 2 mismatches. A small fraction of reads have scores between 23 and 0. These reads are generally mapped to multiple locations in the reference genome. Therefore, quality score filtering wouldn’t show much impact on the concordance among aligners in the real data sets. In addition, since SOAP2 and Bowtie do not have alignment quality scores in their respective output files, to ensure a relatively fair comparison, no alignment quality filters are used.
As for the mapability of those target regions that we used in our simulation data, we have checked the mapability using the “Duke uniqueness 35 bp” method provided by the UCSC genome browser for the 218 CpG islands and 3000 exon regions. This Duke method reports a mapability score between 0 and 1, with 1 representing a completely unique sequence. A score of 0.5, 0.3, 0.25, or 0 represents that the sequence occurs twice, three times, four times, or more than four times, respectively. For the 218 CpG islands, 80.09% are completely unique, which means all 35-bp sequences within these islands occur only once in the genome; while 19.91% are not completely unique, which means at least one 35-bp sequence within each of these islands occurs more than once in the genome. The median mapability score of all CpG islands is 1 and the mean is 0.9830. For the 3000 exon regions, 95.40% are completely unique and 4.60% are not completely unique. The median mapability score of all regions is 1 and the mean is 0.9930. Generally speaking, the 3000 exon simulation data has better mapabilty than the 218 CpG island data.
There are different ways to evaluate the current available programs. For example, Ruffalo et al. developed a simulation and evaluation suite to compare a few available aligners only using simulated data . In this article, we focus mainly on comparing them from two specific angles (i.e., using real reads with varying qualities and simulated reads from repetitive regions). Thus, there are a few limitations in the article. First, rather than from the whole human genome, both the real data and the simulated data are from part of it. Second, our sequencing data sets are only from the Illumina sequencer. Third, we mainly use single-end sequencing data without considering pair-end data. Fourth, there are many other great alignment algorithms [2, 4–10, 12, 14, 19, 20] that we did not compare. Although this article has these limitations, our approach is very general, and it can be applied to the pair-end whole genome real and simulated sequencing data as well as data generated from other platforms. It can also be utilized to study the performance of other alignment programs with some minor modifications if necessary.
In order to study how alignment programs perform on data with varying quality and from repetitive regions, we have evaluated the performances of four commonly used alignment programs—SOAP2, Bowtie, BWA, and Novoalign—on two real NGS data sets and two simulated data sets. Our results show that, for sequencing data with reads that have relatively good quality or have had the low quality bases trimmed off, all four alignment programs perform similarly. We have also demonstrated that trimming off low quality ends markedly increases the number of aligned reads and improves the consistency among different aligners, especially for low quality data. However, Novoalign is more sensitive to the improvement of data quality. Trimming off low quality ends increases the concordance between Novoalign and the others significantly. Therefore, the quality of sequencing data has a great impact on alignment result, and we highly recommend assessing sequencing quality first and then trimming off low quality base if necessary. As for aligning reads from repetitive regions, our simulation data shows that reads from repetitive regions tend to be aligned incorrectly, and suppressing reads with multiple hits can improve alignment accuracy.
Li H, Homer N: A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform. 2010, 11 (5): 473-483. 10.1093/bib/bbq015.
Ma B, Tromp J, Li M: PatternHunter: faster and more sensitive homology search. Bioinformatics. 2002, 18 (3): 440-445. 10.1093/bioinformatics/18.3.440.
Li R, Li Y, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program. Bioinformatics. 2008, 24 (5): 713-714. 10.1093/bioinformatics/btn025.
Jiang H, Wong WH: SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics. 2008, 24 (20): 2395-2396. 10.1093/bioinformatics/btn429.
Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008, 18 (11): 1851-1858. 10.1101/gr.078212.108.
Lin H, Zhang Z, Zhang MQ, Ma B, Li M: ZOOM! Zillions of oligos mapped. Bioinformatics. 2008, 24 (21): 2431-2437. 10.1093/bioinformatics/btn416.
Schatz MC: CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics. 2009, 25 (11): 1363-1369. 10.1093/bioinformatics/btp236.
Chen Y, Souaiaia T, Chen T: PerM: efficient mapping of short sequencing reads with periodic full sensitive spaced seeds. Bioinformatics. 2009, 25 (19): 2514-2521. 10.1093/bioinformatics/btp486.
Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, Brudno M: SHRiMP: Accurate Mapping of Short Color-space Reads. PLoS Comput Biol. 2009, 5 (5): e1000386-10.1371/journal.pcbi.1000386.
Weese D, Emde A-K, Rausch T, Döring A, Reinert K: RazerS—fast read mapping with sensitivity control. Genome Res. 2009, 19 (9): 1646-1654. 10.1101/gr.088823.108.
Li R, Yu C, Li Y, Lam T, Yiu S, Kristiansen K, Wang J: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009, 25 (15): 1966-10.1093/bioinformatics/btp336.
Lam TW, Sung WK, Tam SL, Wong CK, Yiu SM: Compressed indexing and local alignment of DNA. Bioinformatics. 2008, 24 (6): 791-797. 10.1093/bioinformatics/btn032.
Li H, Durbin R: Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics. 2010, 26 (5): 589-595. 10.1093/bioinformatics/btp698.
De Bona F, Ossowski S, Schneeberger K, Ratsch G: Optimal spliced alignments of short sequence reads. BMC Bioinforma. 2008, 9 (Suppl 10): O7-10.1186/1471-2105-9-S10-O7.
Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10 (3): R25-10.1186/gb-2009-10-3-r25.
Li H, Durbin R: Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760. 10.1093/bioinformatics/btp324.
Ning Z, Cox AJ, Mullikin JC: SSAHA: A Fast Search Method for Large DNA Databases. Genome Res. 2001, 11 (10): 1725-1729. 10.1101/gr.194201.
Harris E, Ponts N, Levchuk A, Roch K, Lonardi S: BRAT: bisulfite-treated reads analysis tool. Bioinformatics. 2010, 26 (4): 572-10.1093/bioinformatics/btp706.
Alkan C, Kidd JM, Marques-Bonet T, Aksay G, Antonacci F, Hormozdiari F, Kitzman JO, Baker C, Malig M, Mutlu O: Personalized copy number and segmental duplication maps using next-generation sequencing. Nat Genet. 2009, 41 (10): 1061-1067. 10.1038/ng.437.
Hach F, Hormozdiari F, Alkan C, Hormozdiari F, Birol I, Eichler EE, Sahinalp SC: mrsFAST: a cache-oblivious algorithm for short-read mapping. Nat Meth. 2010, 7 (8): 576-577. 10.1038/nmeth0810-576.
Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970, 48 (3): 443-453. 10.1016/0022-2836(70)90057-4.
Ferragina P, Manzini G: Opportunistic data structures with applications. Proceedings of the 41st Annual Symposium on Foundations of Computer Science (FOCS 2000). 2000, IEEE Computer Society, IEEE Computer Society
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410.
Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Technical Report 124. 1994, Digital Equipment Corporation, Palo Alto, CA
Ruffalo M, LaFramboise T, Koyut M: Comparative analysis of algorithms for next-generation sequencing read alignment. Bioinformatics. 2011, 27 (20): 2790-2796. 10.1093/bioinformatics/btr477.
The authors declare no competing interests.
XY performed all statistical analyses and drafted the manuscript. KG, JW, MV, ZW, SM and MA were involved in the data collection and helped in preparation of the manuscript. SS provided suggestions on the project and revised the manuscript. All authors have read and approved the final document.
About this article
Cite this article
Yu, X., Guda, K., Willis, J. et al. How do alignment programs perform on sequencing data with varying qualities and from repetitive regions?. BioData Mining 5, 6 (2012). https://doi.org/10.1186/1756-0381-5-6
- Next generation sequencing
- Sequencing quality