SICTIN: Rapid footprinting of massively parallel sequencing data
© Enroth et al; licensee BioMed Central Ltd. 2010
Received: 23 February 2010
Accepted: 13 August 2010
Published: 13 August 2010
Massively parallel sequencing allows for genome-wide hypothesis-free investigation of for instance transcription factor binding sites or histone modifications. Although nucleotide resolution detailed information can easily be generated, biological insight often requires a more general view of patterns (footprints) over distinct genomic features such as transcription start sites, exons or repetitive regions. The construction of these footprints is however a time consuming task.
The presented software generates a binary representation of the signals enabling fast and scalable lookup. This representation allows for footprint generation in mere minutes on a desktop computer. Several different input formats are accepted, e.g. the SAM format, bed-files and the UCSC wiggle track.
Hypothesis-free investigation of genome wide interactions allows for biological data mining at a scale never before seen. Until recently, the main focus of analysis of sequencing data has been targeted on signal patterns around transcriptional start sites which are in manageable numbers. Today, focus is shifting to a wider perspective and numerous genomic features are being studied. To this end, we provide a system allowing for fast querying in the order of hundreds of thousands of features.
Massively parallel sequencing is rapidly becoming the gold standard in hypothesis-free genome-wide studies of for instance transcription factor binding sites [1, 2] histone modifications [3–5] and nucleosome positioning [6, 7]. In such large-scale studies, the general pattern of these events are often sought/monitored around distinct genomic coordinates such as transcription start/end sites or other biological features such as transcription factor binding sites. The creation of gene-centred footprints can be a computationally intensive and time consuming process, e.g., the most recent human ensembl database  lists 23'438 genes corresponding to 140'426 annotated transcripts and 528'281 exons. Furthermore, as standard off-the-shelf desktop computers often come with hundreds of gigabyte of hard drive storage space is not an issue. With this in mind, we designed a simple program suite that stores the fragment count (overlaps) in a binary format suitable for fast access and recovery. To the best of our knowledge, no other program suites for sequence data are tailored against the task of producing footprints. There are several other applications (e.g. [9–11]) for visualizing complementary components such as individual fragments including alignment mismatches against reference sequences, or, that allows for interactive browsing through the data, which SICTIN does not. Since we only store the pileup information and the program suite is designed with access/lookup speed as the primary goal we chose a binary representation of every base pair along the genome. In addition, although this representation introduces overhead disc usage where there are no aligned fragments, this solution makes accessing and footprint calculation fast and easy.
The program suite consists of two major parts, i) the program that creates the binaries, build_binaries, where the fragment overlap counts (pileups) are stored and ii) programs that access the signals at given coordinates. The latter category has two sub-programs each designed to perform slightly different tasks, one which pulls regions specified by start and stop coordinates (access_signal) and one which computes footprints (make_footprint), i.e. average signals over given regions centred on specific locations with respect to transcriptional direction (strand).
Building the binaries
The build_binaries program currently accepts four input formats. Firstly, files in the SAM  format which is becoming the standard format for representing aligned fragments. Secondly, a text file (GFF) with columns specifying sequence, start, stop, orientation and a possible count/score of the read. The user can specify which data is given in each column and what character (or string) that separates the columns in the file. These input files need not be ordered in any way, although the building times are much shorter using sorted input since ordering reduces the time used to position the physical heads on the hard drive by the underlying file system. The program also accepts BED and WIG formatted tracks from the UCSC Genome Browser . In case of BED files, which are 0-based left-closed and right-open, the resulting binary files will be 1-based and closed.
The program creates separate binary files for each reference sequence listed in the input file and for fragments aligned to the sense (forward) and anti-sense (reverse) strand of the reference sequence. If a combined signal - where the fragments are prolonged to a biologically relevant length, e.g. size of sonicated fragments - is desired, an additional file containing the combined signal is created with a prolongation length defined by the user. The user can also choose to store only the start coordinates of the fragments. Finally, a text file containing some basic statistics (high/low coordinates, number of fragments) of the run is also produced. All user changeable parameters are described in Additional file 1, Table S1.
Accessing the binaries
In each created binary file we first store an offset indicating the first (lowest) genomic coordinate of the input fragments and then find a given coordinate by moving a file pointer in the binary file according to this offset. This gives almost constant access to any coordinate in a given sequence i.e. chromosome. We provide two main access modes; looking up specific regions with given start and stop coordinates or providing averaged signals over several coordinates with respect to strand information, i.e. footprints. In the footprint-case, the resulting signals are always given in the transcription-direction of the queried gene/location. The user can also choose to report only regions with counts above a certain threshold or moving all queries by a specified distance. The latter could be used as negative control e.g. to detected transcription factor binding sites.
Build and accessing times
To test the access times, we generated sets of randomly (uniform) distributed queries over the whole genome with a uniform distribution over sequences (chromosomes) in order to enforce many shifts in signal sources. In Figure 1B the average lookup times over 10 replicates using 100 to 250'000 queries are depicted. Even complex queries with 10 replicates - for instance expression classes - of thousands of features can be completed in 15 minutes.
Case study A
In a recent study  we examined the binding patterns for 38 previously published human data sets of nucleosome location and various histone modifications such as methylations and acetylations. Previously, these modifications had been mapped out on and around transcription start sites (TSS). We decided to take this analysis one step further and investigated all binding patterns on and around all exons thereby scaling the number of genomic features by a factor 10. We were also able to rapidly create footprints split both on the length of the exon and on the expression of the exon/gene. In total we produced on average 6 different footprints for each of the 38 factors investigated each based on tens of thousands of genomic locations. Once the binary representation of the signals was in place we could also easily explore other relevant biological questions. In the following two sections the nucleosome data of human resting T-cells from Schones et al. was used.
Micrococcal nuclease specificity
Case study B
The so called mappability, or uniqueness of the genomic sequence itself is an important factor in the alignment process for next generation sequence data. Less unique regions in a genomic sense are likely to receive lower counts since reads that align to too many genomic locations are excluded. We decided to compare the so called mappability to a RNA-seq data set. The mappability used here was constructed from a wig-file representation of the Broad alignability track exported from the UCSC genome browser . This data was generated as part of the ENCODE project . The RNA-seq data was downloaded from the example data collection available from the SOLiD Software Development Community website . The RNA-seq data consisted of around 175 M aligned fragments.
The advent of massively parallel sequencing technologies has opened a field for hypothesis-free investigation of e.g. protein-DNA interaction. In order to facilitate truly exploratory biological data mining we have designed and implemented a system where footprints over genomic features ranging in the order of hundreds of thousands can be rapidly constructed once the binary representation of the signals has been built. Such investigations could include signal over microsatellite repeats, ultra conserved regions or exons. Since the program suite also parses for instance the WIG-format, the analysis can easily be extended to footprinting GC-content or analysing any annotation track from the UCSC repositories .
Availability and requirements
Project name: SICTIN, source code available in 'Additional file 2'
Operating system(s): Platform independent, tested on Windows XP/Vista, Mac OS X Leopard, Linux (CentOS 5)
Programming language: C/C++
Licence: Lesser GNU General Public License (LGPL)
SE, RA and JK were supported by The Knut and Alice Wallenberg Foundation and by The Swedish Foundation for Strategic Research. JK was supported by the Polish Ministry of Science and Higher Education, grant number N301 239536. CW was supported by the Swedish Research Council and the Swedish Cancer Foundation.
- Johnson DS, Mortazavi A, Myers RM, Wold B: Genome-Wide Mapping of in Vivo Protein-DNA Interactions. Science. 2007, 316: 1497-1502. 10.1126/science.1141319.View ArticlePubMedGoogle Scholar
- Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier B, Varhol R, Delaney A: Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nature methods. 2007, 4: 651-657. 10.1038/nmeth1068.View ArticlePubMedGoogle Scholar
- Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K: High-resolution profiling of histone methylations in the human genome. Cell. 2007, 129: 823-837. 10.1016/j.cell.2007.05.009.View ArticlePubMedGoogle Scholar
- Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim TK, Koche RP: Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature. 2007, 448: 553-560. 10.1038/nature06008.View ArticlePubMedPubMed CentralGoogle Scholar
- Wang Z, Zang C, Rosenfeld JA, Schones DE, Barski A, Cuddapah S, Cui K, Roh TY, Peng W, Zhang MQ, Zhao K: Combinatorial patterns of histone acetylations and methylations in the human genome. Nature genetics. 2008, 40: 897-903. 10.1038/ng.154.View ArticlePubMedPubMed CentralGoogle Scholar
- Schones DE, Cui K, Cuddapah S, Roh TY, Barski A, Wang Z, Wei G, Zhao K: Dynamic regulation of nucleosome positioning in the human genome. Cell. 2008, 132: 887-898. 10.1016/j.cell.2008.02.022.View ArticlePubMedGoogle Scholar
- Valouev A, Ichikawa J, Tonthat T, Stuart J, Ranade S, Peckham H, Zeng K, Malek JA, Costa G, McKernan K: A high-resolution, nucleosome position map of C. elegans reveals a lack of universal sequence-dictated positioning. Genome research. 2008, 18: 1051-1063. 10.1101/gr.076463.108.View ArticlePubMedPubMed CentralGoogle Scholar
- Hubbard TJP, Aken BL, Ayling S, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Clarke L: Ensembl 2009. Nucl Acids Res. 2009, 37: D690-697. 10.1093/nar/gkn828.View ArticlePubMedGoogle Scholar
- Hou H, Zhao F, Zhou L, Zhu E, Teng H, Li X, Bao Q, Wu J, Sun Z: MagicViewer: integrated solution for next-generation sequencing data visualization and genetic variation detection and annotation. Nucleic acids research. 2010, 38: W732-W736. 10.1093/nar/gkq302.View ArticlePubMedPubMed CentralGoogle Scholar
- Bao H, Guo H, Wang J, Zhou R, Lu X, Shi S: MapView: visualization of short reads alignment on a desktop computer. Bioinformatics. 2009, 25: 1554-1555. 10.1093/bioinformatics/btp255.View ArticlePubMedGoogle Scholar
- Huang W, Marth G: EagleView: a genome assembly viewer for next-generation sequencing technologies. Genome research. 2008, 18: 1538-1543. 10.1101/gr.076067.108.View ArticlePubMedPubMed CentralGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing Subgroup: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25: 2078-2079. 10.1093/bioinformatics/btp352.View ArticlePubMedPubMed CentralGoogle Scholar
- Kuhn RM, Karolchik D, Zweig AS, Wang T, Smith KE, Rosenbloom KR, Rhead B, Raney BJ, Pohl A, Pheasant M: The UCSC Genome Browser Database: update 2009. Nucl Acids Res. 2009, 37: D755-761. 10.1093/nar/gkn875.View ArticlePubMedGoogle Scholar
- Andersson R, Enroth S, Rada-Iglesias A, Wadelius C, Komorowski J: Nucleosomes are well positioned in exons and carry characteristic histone modifications. Genome research. 2009, 19: 1732-1741. 10.1101/gr.092353.109.View ArticlePubMedPubMed CentralGoogle Scholar
- Dingwall C, Lomonossoff GP, Laskey RA: High sequence specificity of micrococcal nuclease. Nucl Acids Res. 1981, 9: 2659-2674. 10.1093/nar/9.12.2659.View ArticlePubMedPubMed CentralGoogle Scholar
- Tolstorukov MY, Kharchenko PV, Goldman JA, Kingston RE, Park PJ: Comparative analysis of H2A.Z nucleosome organization in the human and yeast genomes. Genome research. 2009, 19: 967-977. 10.1101/gr.084830.108.View ArticlePubMedPubMed CentralGoogle Scholar
- Jiang C, Pugh BF: A compiled and systematic reference map of nucleosome positions across the Saccharomyces cerevisiae genome. Genome Biol. 2009, 10: R109-10.1186/gb-2009-10-10-r109.View ArticlePubMedPubMed CentralGoogle Scholar
- Oberdoerffer S, Moita LF, Neems D, Freitas RP, Hacohen N, Rao A: Regulation of CD45 alternative splicing by heterogeneous ribonucleoprotein, hnRNPLL. Science. 2008, 321: 686-691. 10.1126/science.1157610.View ArticlePubMedPubMed CentralGoogle Scholar
- Rada-Iglesias A, Ameur A, Kapranov P, Enroth S, Komorowski J, Gingeras TR, Wadelius C: Whole-genome maps of USF1 and USF2 binding and histone H3 acetylation reveal new aspects of promoter structure and candidate genes for common human disorders. Genome research. 2008, 18: 380-392. 10.1101/gr.6880908.View ArticlePubMedPubMed CentralGoogle Scholar
- Lin JM, Collins PJ, Trinklein ND, Fu Y, Xi H, Myers RM, Weng Z: Transcription factor binding and modified histones in human bidirectional promoters. Genome research. 2007, 17: 818-827. 10.1101/gr.5623407.View ArticlePubMedPubMed CentralGoogle Scholar
- Birney E, Stamatoyannopoulos J, Dutta A, Guigo R, Gingeras T, Margulies E, Weng Z, Snyder M, Dermitzakis E, Thurman R: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007, 799-816. 10.1038/nature05874.Google Scholar
- SOLiD™ System Human Small RNA Data Set. [http://solidsoftwaretools.com/gf/project/srna/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.