The master index file, sequence.index, which describes all the current 1000 Genomes Project data was down loaded [13]. As of 8 February 2013 there were 47,315 scans available (a further 208 had been withdrawn). They comprised: 39 736 paired-end and 4822 single ended DNA sequence scans plus a further 1611 (paired end) and 938 (single ended) scans which used ABI_SOLID colorspace encoding. 4058 were randomly chosen and down loaded. All the DNA measurements are in fastq format, so they include a quality score per DNA base pair. Each scan contains DNA sequences of the same length. Figure 4 shows the distribution of DNA sequence lengths. Almost all colorspace sequences contain 25, 35 or 50 base pairs, whereas lengths 68, 76, 100 and 101 dominate non-colorspace sequences.
On average: each scan contained 13 million DNA sequences (or pairs of sequences). Even compressed, each file is approximately a gigabyte. (Compression reduces download size by a factor of about 3.1) Paired end scans need two such files. The down load speed was variable, typically between 2.5 106 and 36 106 bytes/second, with a mean of 11 million bytes per second. In total 7547 files were down loaded (6.0 terabytes) containing 51 494 393 834 DNA measurements totalling about 7.5 1012 base pairs.
We then used Bowtie [8] to find those DNA measurements (i.e. DNA sequences or pairs of DNA sequences) which matched one or more of the published Mycoplasma genomes but do not match the reference human genome GRCh37.p5. See Figure 10. We used all of the Mycoplasma genomes available from NCBI (30 in total. See Additional file 1: Table S1). Apart from using multiple threads -p8, Bowtie’s defaults were used through out. The Bowtie EBWT databases for the normal and colorspace Mycoplasma genomes are both 36 MBytes. Despite including 30 species, due to the small size of Mycoplasma genomes, they are both considerably smaller than that for the two for the human reference genome, which are 2.9 GB for both normal and colorspace. The Bowtie EBWT databases and colorspace databases for the human reference genome GRCh37.p5 include all sequences. I.e., as well as chromosomal DNA, they both include human mitochondrial, “unlocalized” and “unplaced” sequences.
Notice (Figure 10) Bowtie is usually faster on single ended rather than paired double ended DNA sequences (mean 28 v. 18 million sequences per hour per CPU). Although downloading and decompressing the files took 37% of the elapsed time, despite using all 8 CPU cores, almost all the remaining 63% of time was used by Bowtie.
Estimating entropy
In statistical mechanics, entropy is the degree of disorder in a system [14]. In information theory this translates to the degree or randomness or incompressibility of data, particularly in transmission of messages [15]. , where p is the probability of a sequence of symbols and we sum over all possible symbols. For replicability, the remainder of this section details how we approximate entropy using actual DNA base counts in finite sequences.
In order to have entropy expressed in bits we use log2.
A reasonable estimate of the compressibility of variable length DNA sequences can be made by considering all loss-less coding schemes of up to four bases. The most efficient coding scheme gives the most compressible output. For example, a long sequence of adenine (AAAAAAAAAAAAAAAAAAAA...) can be recoded as a shorter sequence of 00000..., where 0 is one of the new 256 codes needed to represent AAAA–TTTT. Since the coding is loss-less, the encoded sequence contains the same information and so it has the same entropy.
We approximate probability p by the actual ratio of each symbol to the number of symbols in the string, p=i/l, so . Where l is the length of the encoded string and i is the number of each symbol in it. To get the best estimate, we would have to consider all codings. By using the minimum of all 10 possible codings of length up to four DNA bases, we get a reasonable estimate that can deal exactly with not only runs of single bases up to runs of four repeated bases, but gives reasonable estimates with larger repeating sequences. DNA bases which are unknown (i.e. coded as N) are ignored. We use . Thus the sequence ACGTACGTACGTACGTACGT, which is highly compressible, has an entropy of −(5/5) log2(5/5)=0. Whereas a simple count of number of bases would show A C G and T each occur 5 times (are present in equal numbers) and so incorrectly would say the string has maximal entropy . More sophisticated calculations might consider longer potential coding sequences but then the coding tables would be much larger and eventually their information content could no longer be ignored.
Two base colorspace encoding
Some next generation DNA scanners use a technology which instead of reading DNA sequences one base at a time they use multiple fluorescent dyes to read adjacent (overlapping) pairs of bases. Reduced noise is claimed, since as the pairs overlap, each base is read twice. Data are presented as the initial base followed by transitions from one base type to the next in the sequence (hence needing 4 colours). A potential downside is if an error does occur, the rest of the sequence will be nonsense. Whereas in direct encoding only the erroneous base is effected. It is possible to convert between the two encodings. However because of the different noise characteristics it is usually recommended, as we did, to use tools like Bowtie which can deal with colorspace encoded data directly.
Selecting a high quality sample to confirm with NCBI BLAST
We used NCBI’s Blast [10] program to confirm our Bowtie results. (We used the default parameters provided by the EBI web interface except we request the first 1000 matches, rather than the first 50 matches). Using BLAST on each of the sequences in Table 3 shows each of the seven high quality DNA measurements (see page 39) do, as expected, match one or more species of Mycoplasma and none matches the reference human genome. In a few cases the second pair matches “Homo sapiens clones”, rather than the human reference sequence. Often these are draft sequences and only in one case (ERR013159.14600701) do both ends of DNA pair match the clone. The final column of Table 3 reports an example of one of the Mycoplasma genes which BLAST finds which match the DNA sequence. In the case of paired end DNA measurements, BLAST has been run separately on both end. The reported gene is matched by both ends. (In three cases an example gene has not been chosen because BLAST matches the whole of, a number of, Mycoplasma genomes). Noting the example gene’s similarity, it is tempting to ascribe some biological meaning to the gene, however BLAST effectively searches all the published DNA sequences and so the similarity may well simply reflect a bias in the published sequences. Ribosomal DNA is highly conserved and has been heavily studied as a tree of life phylogenetic marker of evolutionary inheritance, which makes it one of the more frequent genes in today’s DNA sequence databanks.
We take BLAST’s matches and the lack of BLAST matches against the official human reference genome as confirming our Bowtie results. That is, Table 3 suggests samples ERR009050, ERR002459, ERR013159 and ERR022473 appear to have been contaminated with Mycoplasma. However, of these four, only in one (ERR009050) are there more than a few score DNA measurements which Bowtie matches against Mycoplasma.