Mycoplasma contamination in the 1000 Genomes Project
© Langdon; licensee BioMed Central Ltd. 2014
Received: 23 May 2013
Accepted: 19 February 2014
Published: 29 April 2014
In silico biology is increasingly important and is often based on public data. While the problem of contamination is well recognised in microbiology labs, the corresponding problem of database corruption has received less attention.
Mapping 50 billion next generation DNA sequences from The 1000 Genomes Project against published genomes reveals many that match one or more Mycoplasma genomes but are not included in the reference human genome GRCh37.p5. Many of these are of low quality, but NCBI BLAST searches confirm that some high quality, high entropy sequences match Mycoplasma but no human sequences.
It appears at least 7% of 1000G samples are contaminated.
Mycoplasma are tiny bacteria which readily grow in cell culture media. They have small genomes. Contamination of molecular biology laboratories by them is widespread. Their small size makes them hard to detect. Depending upon the medium, Mycoplasma contamination rates of 1% to 15–35% (or even higher) have been reported. Mycoplasma contamination can render cell line gene expression measurements unreliable. Many labs routinely sterilise their equipment to counter it. About 1% of the GeneChip data published in NCBI’s Gene Expression Omnibus (GEO) appear to be contaminated [4, 5]. Indeed wet lab contamination is so widespread that Mycoplasma genes have managed to jump the silicon barrier and get themselves incorporated into international data banks as human genes.
GEO contains gene expression data. Here we start to look for similar contamination in genome studies. The 1000 Genomes Project is an international collaboration which has mapped, in whole or in part, the genomes of more than 2500 individuals and published studies of SNPs and other human genetic variations. We selected The 1000 Genomes Project since it investigates human genetic material, is widely respected, covers many sites with diverse data sources and has made available vast quantities of its raw data.
NextGen scanners are noisy. So, on the assumption that errors are independent, multiple (e.g. 3) scans are typically run. However, non-uniform clusters of errors indicate that they are not independent, and therefore redundant scans may not resolve the problem. Noise may be part of the reason why Bowtie reports that about 30% of sequences fail to align to the human genome. However, some of these unmatched DNA measurements may not be simply due to noise. These are the ones we investigate to see if they could be due to Mycoplasma contamination.
Approximately 8% of The 1000 Genomes Project was selected at random and downloaded.
Solexa data, like that from other nextGen scanners, are inherently noisy. Solexa provides an estimate of the signal to noise ratio (expressed as log10) per base position in each DNA sequence. (For example, a quality of 0.5 (S/N=3.16) means the returned base is more likely than the other 3 combined; see footnote b.) This can easily mount up to several hundred quality values. To stably condense these into a manageable statistic, we ignore the worst and second worst base in each DNA sequence and use the third worst. For paired end data, we use the worst of the two ends.
In our random download of The 1000 Genomes Project, 1944 high quality DNA measurements (i.e. no more than three bases with quality worse than 0.5) match Mycoplasma (with on average three or fewer mismatches) but do not match the reference human genome at all (bottom column 3).
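The quality statistic described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function names are hypothetical, and "worst of the two ends" is assumed to mean the lower of the two per-end third-worst values.

```python
def third_worst(qualities):
    """Return the third-lowest quality value in one DNA sequence.

    If the sequence has fewer than three bases, fall back to the
    highest value available (an assumption made for this sketch).
    """
    ranked = sorted(qualities)
    return ranked[min(2, len(ranked) - 1)]


def measurement_quality(end1, end2=None):
    """Condense per-base qualities of one DNA measurement into one number.

    For paired end data the lower (worse) of the two per-end statistics
    is used; this reading of "worst of the two ends" is an assumption.
    """
    quality = third_worst(end1)
    if end2 is not None:
        quality = min(quality, third_worst(end2))
    return quality


def signal_to_noise(log10_quality):
    """Convert a log10 quality value back to a signal to noise ratio."""
    return 10 ** log10_quality
```

For example, `measurement_quality([0.9, 0.1, 0.2, 0.7, 0.8])` skips the two worst values (0.1 and 0.2) and returns 0.7, and `signal_to_noise(0.5)` gives roughly 3.16, matching the figure quoted in the text.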
High quality, non-repetitive DNA measurements from The 1000 Genomes Project which match one or more published Mycoplasma genomes but which do not match the reference human genome
GCCGTAACTATAACGGTCCTAAGGTAGCGAAATTCCTTGTC E=7×10^-13
S16 23S ribosomal RNA
ACGGTTTTCAAGACCGTTCCCTTCAGCCAGACTTGG E=5×10^-10
CCTGACGGTTTTCAAGACCGTTCCCTTCAGCCAGAC E=5×10^-10
GGATTTCTAGAGTTGATTTACCATATTCTAA E=3×10^-52
AACGTCTCCTGGTAATTTTTTAGGTTTTCT E=3×10^-52
TTTGGTTAGTTTAAAATAACCATCAAAAG E=2×10^-10
AATTCATAACGTGTAATTTGTCTTTCAGGAAC E=3×10^-52
CTTTAATCAAAGAAGTAGTGAACCAAGAAGATATT E=3×10^-6
TATTGCAAATATTGTTTCTAAATGAACAAAAATTCC E=3×10^-9
GCAACACACGTGCTACAATGGTCGGTACAAAGAGA E=3×10^-52
16S ribosomal RNA
AAACCGATCTCAGTTCGGATTGAAGTCTGCAACTCG E=3×10^-52
TCATTTGGGTTCATTACAGAACCTCTAACTGT E=3×10^-52
Ribosomal protein cluster
AATGTTAACTAGGTTATGTTCTTCATTTCCTA E=3×10^-52
Here we have analysed DNA sequences directly, rather than gene expression. While the techniques are totally different, there is still considerable scope for sample contamination, and sequence comparison (Table 2) suggests that at least 7% of the public data provided by The 1000 Genomes Project may have some Mycoplasma contamination. However, the fraction may be higher, both because of overlap in DNA sequence space between the human and Mycoplasma genomes and because low quality data were excluded.
Whilst the problem of contamination of nextGen sequences has been considered before, previous studies, e.g. Jun et al. and Cibulskis et al., have looked at contamination by other members of the same species. Indeed there have been several reports of unexpected personal, i.e. human, DNA in The 1000 Genomes Project public data but no reports of non-human contamination. However, we downloaded and scanned a random sample of more than 50 billion DNA measurements from their FTP site and found tens of thousands which may have come from Mycoplasma contamination. Since some DNA sequences have been conserved by evolution, it is possible the contamination is from similar species.
Once Mycoplasma is suspected, it may be that individual scans can be cleaned up relatively easily, as cross-species contamination is said to be easily detected (page 2601). Indeed a number of commercial Mycoplasma detection tools are based on looking for Mycoplasma genes. However, both current microbiology laboratory practice and bioinformatics typically take the robust approach of removing (deleting) all potentially infected materials. Indeed when The 1000 Genomes Project withdraws nextGen data, it withdraws complete scans. That is, it simply discards information on about a billion DNA bases each time a scan is withdrawn.
Raw data from The 1000 Genomes Project are publicly available and are being used increasingly widely and diversely. Whilst noisy data may be acceptable for use by their original owners, who are aware of their limitations, there is an increasing risk of contaminated data being (ab)used outside the laboratories which initially created them. Indeed with staff turnover there may be risks associated with using what become historical data, as their provenance becomes more cloudy. Independent numerical studies could be done. The size of our sample suggests that (at least for historical data drawn from the same period) they should yield the same results. However, whilst we have established a lower bound for contamination, future studies should be able to calculate it more precisely. For example, by considering redundant scans and clusters it should be possible to isolate the source of the contamination and perhaps also to provide numerical techniques to mitigate its effects on the data. Other studies might also look for other effects and thus extract more scientific knowledge from this valuable resource.
Since Mycoplasma are rampant in modern microbiology laboratories, it is no surprise to find some in parts of the data from The 1000 Genomes Project. We have identified some samples which have a higher than average chance of being contaminated by Mycoplasma. In silico studies should be reinforced by checking the source of the data. We urge each member of The 1000 Genomes Project Consortium (as some are apparently doing), particularly those using single ended colorspace scanners (cf. Table 1), to re-check their procedures. Drexler and Uphoff suggest using at least two detection techniques when checking samples for Mycoplasma.
The master index file, sequence.index, which describes all the current 1000 Genomes Project data, was downloaded. As of 8 February 2013 there were 47 315 scans available (a further 208 had been withdrawn). They comprised 39 736 paired end and 4822 single ended DNA sequence scans, plus a further 1611 (paired end) and 938 (single ended) scans which used ABI_SOLID colorspace encoding. 4058 were randomly chosen and downloaded. All the DNA measurements are in fastq format, so they include a quality score per DNA base. Each scan contains DNA sequences of the same length. Figure 4 shows the distribution of DNA sequence lengths. Almost all colorspace sequences contain 25, 35 or 50 bases, whereas lengths 68, 76, 100 and 101 dominate non-colorspace sequences.
On average, each scan contained 13 million DNA sequences (or pairs of sequences). Even compressed, each file is approximately a gigabyte. (Compression reduces download size by a factor of about 3.1.) Paired end scans need two such files. The download speed was variable, typically between 2.5×10^6 and 36×10^6 bytes/second, with a mean of 11 million bytes per second. In total 7547 files were downloaded (6.0 terabytes) containing 51 494 393 834 DNA measurements totalling about 7.5×10^12 base pairs.
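The download figures above can be cross-checked with simple arithmetic. The following sketch uses only numbers quoted in the text; the variable names are illustrative, and "6.0 terabytes" is assumed to mean 6.0×10^12 bytes.

```python
# Figures quoted in the text
scans_available = 47_315
scans_downloaded = 4_058
files_downloaded = 7_547
measurements = 51_494_393_834
total_bytes = 6.0e12      # "6.0 terabytes", assumed to be decimal terabytes
total_bases = 7.5e12      # "about 7.5×10^12 base pairs"

# Fraction of the available scans that were sampled
fraction_sampled = scans_downloaded / scans_available

# Mean number of DNA measurements per scan (pairs count once)
measurements_per_scan = measurements / scans_downloaded

# Mean compressed file size in gigabytes (10^9 bytes)
gigabytes_per_file = total_bytes / files_downloaded / 1e9

# Mean number of bases per DNA measurement (both ends of a pair together)
bases_per_measurement = total_bases / measurements

print(f"sampled {fraction_sampled:.1%} of scans")
print(f"{measurements_per_scan / 1e6:.1f} million measurements per scan")
print(f"{gigabytes_per_file:.2f} GB per compressed file")
print(f"{bases_per_measurement:.0f} bases per measurement")
```

Under these assumptions the sample is a little under 9% of the scans, with about 12.7 million measurements per scan and about 0.8 GB per compressed file, in line with the rounded values given in the text.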
Notice (Figure 10) that Bowtie is usually faster on single ended than on paired end DNA sequences (mean 28 v. 18 million sequences per hour per CPU). Downloading and decompressing the files took 37% of the elapsed time. Despite using all 8 CPU cores, Bowtie used almost all of the remaining 63%.
In statistical mechanics, entropy is the degree of disorder in a system. In information theory this translates to the degree of randomness or incompressibility of data, particularly in the transmission of messages. Entropy is $S = -\sum p \log p$, where p is the probability of a sequence of symbols and we sum over all possible symbols. For replicability, the remainder of this section details how we approximate entropy using actual DNA base counts in finite sequences.
In order to have entropy expressed in bits we use log2.
A reasonable estimate of the compressibility of variable length DNA sequences can be made by considering all loss-less coding schemes of up to four bases. The most efficient coding scheme gives the most compressible output. For example, a long sequence of adenine (AAAAAAAAAAAAAAAAAAAA...) can be recoded as a shorter sequence of 00000..., where 0 is one of the new 256 codes needed to represent AAAA–TTTT. Since the coding is loss-less, the encoded sequence contains the same information and so it has the same entropy.
We approximate the probability p by the actual ratio of each symbol to the number of symbols in the string, p=i/l, so the entropy is $S = -\sum (i/l) \log_2 (i/l)$, where l is the length of the encoded string and i is the number of times each symbol occurs in it. To get the best estimate, we would have to consider all codings. By using the minimum of all 10 possible codings of length up to four DNA bases, we get a reasonable estimate that deals exactly with runs of single bases up to runs of four repeated bases, and gives reasonable estimates for larger repeating sequences. DNA bases which are unknown (i.e. coded as N) are ignored. We use the smallest of the resulting entropy values. Thus the sequence ACGTACGTACGTACGTACGT, which is highly compressible, has an entropy of −(5/5) log2(5/5)=0, whereas a simple count of the number of bases would show that A, C, G and T each occur 5 times (are present in equal numbers) and so would incorrectly say the string has maximal entropy, $-4 \times (5/20) \log_2 (5/20) = 2$ bits. More sophisticated calculations might consider longer potential coding sequences, but then the coding tables would be much larger and eventually their information content could no longer be ignored.
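The entropy estimate described above might be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: the "10 possible codings" are taken to be code lengths 1 to 4, each with every possible starting offset (1+2+3+4=10), incomplete groups of bases at either end of the sequence are dropped, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def coding_entropy(bases, k, offset):
    """Entropy in bits of `bases` recoded as non-overlapping k-base symbols.

    Symbols start at `offset`; incomplete groups at either end are
    dropped (an assumption of this sketch). Returns infinity if the
    sequence is too short to give any symbol, so min() skips it.
    """
    symbols = [bases[i:i + k] for i in range(offset, len(bases) - k + 1, k)]
    l = len(symbols)
    if l == 0:
        return float("inf")
    counts = Counter(symbols)
    # p = i/l for each distinct symbol, summed as -p*log2(p)
    return sum(-(i / l) * log2(i / l) for i in counts.values())


def sequence_entropy(seq, max_k=4):
    """Minimum entropy over all codings of length 1..max_k and all offsets.

    Unknown bases (N) are ignored. An empty sequence is given entropy 0.
    """
    bases = seq.upper().replace("N", "")
    if not bases:
        return 0.0
    return min(
        coding_entropy(bases, k, offset)
        for k in range(1, max_k + 1)
        for offset in range(k)
    )
```

With this sketch, `sequence_entropy("ACGT" * 5)` is 0 because the length-four coding sees a single repeated symbol, whereas the single-base coding alone, `coding_entropy("ACGT" * 5, 1, 0)`, gives the maximal 2 bits.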
Some next generation DNA scanners, instead of reading DNA sequences one base at a time, use multiple fluorescent dyes to read adjacent (overlapping) pairs of bases. Reduced noise is claimed since, as the pairs overlap, each base is read twice. Data are presented as the initial base followed by transitions from one base type to the next in the sequence (hence needing 4 colours). A potential downside is that if an error does occur, the rest of the sequence will be nonsense, whereas in direct encoding only the erroneous base is affected. It is possible to convert between the two encodings. However, because of the different noise characteristics, it is usually recommended to use tools like Bowtie which can deal with colorspace encoded data directly, as we did.
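To make the conversion between the two encodings concrete, here is a small sketch. It assumes the usual two-base colour scheme in which, with A, C, G and T numbered 0 to 3, the colour of a transition is the exclusive-or of the two base numbers; the function names are hypothetical, and sequences are assumed to be non-empty and free of unknown (N) bases.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3


def encode_colorspace(seq):
    """Encode bases as the initial base followed by one colour per transition."""
    colors = [BASES.index(a) ^ BASES.index(b) for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(str(c) for c in colors)


def decode_colorspace(cs):
    """Decode an initial base plus colours back into a base sequence.

    Each base depends on all earlier colours, so one wrong colour
    changes every base after it.
    """
    current = BASES.index(cs[0])
    decoded = [cs[0]]
    for color in cs[1:]:
        current ^= int(color)
        decoded.append(BASES[current])
    return "".join(decoded)


print(encode_colorspace("ACGT"))   # A131
print(decode_colorspace("A131"))   # ACGT
print(decode_colorspace("A231"))   # AGCA
```

The last line shows the downside mentioned in the text: changing only the first colour (1 to 2) corrupts all three following bases, which is one reason for aligning in colorspace directly rather than decoding first.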
We used NCBI’s BLAST program to confirm our Bowtie results. (We used the default parameters provided by the EBI web interface, except that we requested the first 1000 matches rather than the first 50.) Using BLAST on each of the sequences in Table 3 shows that each of the seven high quality DNA measurements (see page 39) does, as expected, match one or more species of Mycoplasma and that none matches the reference human genome. In a few cases the second end of the pair matches “Homo sapiens clones”, rather than the human reference sequence. Often these are draft sequences, and in only one case (ERR013159.14600701) do both ends of the DNA pair match the clone. The final column of Table 3 reports an example of one of the Mycoplasma genes which BLAST finds to match the DNA sequence. In the case of paired end DNA measurements, BLAST has been run separately on both ends, and the reported gene is matched by both ends. (In three cases an example gene has not been chosen because BLAST matches the whole of a number of Mycoplasma genomes.) Noting the example genes’ similarity, it is tempting to ascribe some biological meaning to the gene. However, BLAST effectively searches all the published DNA sequences, and so the similarity may well simply reflect a bias in the published sequences. Ribosomal DNA is highly conserved and has been heavily studied as a tree of life phylogenetic marker of evolutionary inheritance, which makes it one of the more frequent genes in today’s DNA sequence databanks.
We take BLAST’s matches and the lack of BLAST matches against the official human reference genome as confirming our Bowtie results. That is, Table 3 suggests samples ERR009050, ERR002459, ERR013159 and ERR022473 appear to have been contaminated with Mycoplasma. However, of these four, only in one (ERR009050) are there more than a few score DNA measurements which Bowtie matches against Mycoplasma.
a Some scanners report DNA sequences for both ends of a fragment of DNA. Nonetheless the pair of sequences is considered one “DNA measurement”. See also Figure 2.
b Whilst details depend on the individual manufacturer, essentially each base is allocated a different colour. The brightest colour indicates the base and the quality is estimated from how strong it is compared to the other three colours.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.