Blurring contact maps of thousands of proteins: what we can learn by reconstructing 3D structure
 Marco Vassura^{1},
 Pietro Di Lena^{1},
 Luciano Margara^{1},
 Maria Mirto^{2},
 Giovanni Aloisio^{2},
 Piero Fariselli^{3} and
 Rita Casadio^{3}
https://doi.org/10.1186/1756-0381-4-1
© Vassura et al; licensee BioMed Central Ltd. 2011
Received: 26 August 2010
Accepted: 13 January 2011
Published: 13 January 2011
Abstract
Background
The present knowledge of protein structures at the atomic level derives from some 60,000 molecules. Yet the exponentially growing set of hypothetical protein sequences comprises some 10 million chains, and this makes protein structure prediction one of the challenging goals of bioinformatics. In this context, the representation of proteins with contact maps is an intermediate step of fold recognition and constitutes the input of contact map predictors. However, contact map representations require fast and reliable methods to reconstruct the specific folding of the protein backbone.
Methods
In this paper, by adopting GRID technology, we benchmark FT-COMAR, our algorithm for 3D reconstruction, on a large set of non-redundant proteins (1716), taking random noise into consideration. This makes our computation the largest ever performed for the task at hand.
Results
We observe the effects of introducing random noise on 3D reconstruction and derive some considerations useful for future implementations. The size of the protein set also allows statistical considerations after grouping by SCOP structural class.
Conclusions
Altogether, our data indicate that the quality of 3D reconstruction is unaffected by deleting up to an average of 75% of the real contacts, while only a few percent of randomly generated contacts in place of non-contacts are sufficient to hamper 3D reconstruction.
Background
A major problem of the genomic era is how to link the protein sequence to the protein structural and functional space. When no template with high sequence homology to the target is found in the Protein Data Bank (PDB), building by homology cannot be safely applied. In these cases the protein structure can be predicted with ab initio methods, whose scoring capability is poor when no conserved structural domain is recognized in the target. Structural features, including conserved structural domains, disulfide bonds, protein secondary structure, residue solvent accessibility and how residues contribute to local stability (contact residues), can to some extent help in constraining the protein 3D structure. Residues are defined to be in contact in the protein structure when they interact within a fixed distance (threshold) that is routinely set at a value ≥ 7 Å. Residue contact prediction has been addressed with different approaches, including statistical and probabilistic methods [1]. In a contact map representation of the protein 3D structure, all the short and long range interactions promoting protein stability emerge to a different extent depending on the threshold value adopted to compute the 2D projection. However, this representation poses first of all the problem of structure reconstruction. Recently it has been shown that the problem of computing a set of 3D coordinates consistent with a given contact map is equivalent to unit-disk-graph realization, which is NP-hard [2]. Other well studied similar problems are structure determination from NMR data [3, 4] and protein conformational freedom [5]. However, the solutions described for those problems are not suited to protein 3D reconstruction, given the different nature of the distance constraints induced by the protein contact map. Several heuristic algorithms have been developed to address the problem specifically [6–11].
Most of these methods were also tested on randomly blurred contact maps, but only on small sets of proteins (in the range of 20-30 chains), and no general conclusion was derived.
In order to address the problem of structure reconstruction we developed COMAR [12] and FT-COMAR 1.0 [13], both performing quite efficiently. With FT-COMAR we could analyze the reconstruction performance on a set of 100 protein contact maps containing random errors [13]. Recently a method was described that focuses on the search for the essential contacts in contact maps for protein 3D reconstruction. The method is however tested only on 12 proteins, and this again hampers large scale statistical considerations [14].
In this paper we analyze the performance of FT-COMAR 2.0, a modified version of FT-COMAR 1.0 in which reconstructed structures satisfy known protein constraints (available on the web [15]). Our tests are performed with GRID technology on a much larger data set (1716 proteins) than in the previous similar analysis from this group (100 proteins, [13]), and after introducing random blurring of the computed maps. From this we derive some conclusions that may help future implementations of methods for 3D reconstruction. We investigate how the reconstruction quality depends on the protein length and on the four major SCOP classes. We also investigate the effect of three types of random errors: general, restricted to contacts, and restricted to non-contacts. We find that the reconstruction quality decreases at increasing protein length and that this is rather independent of the protein structural class. Furthermore, we find that random errors spread over the whole map lead to the same reconstruction performance that is obtained when errors are restricted to non-contacts. On the contrary, random errors on contacts are highly tolerated, and up to 50% of contacts may be wrong without a great loss of 3D reconstruction quality (RMSD ≤ 5 Å). We then address the question of how many correct contacts are needed in order to reconstruct the protein, and we find that only 25% of correct entries are sufficient to obtain a 3D structure with RMSD ≤ 5 Å from the native one. This effect is independent of the protein length and indicates that FT-COMAR can correctly reconstruct the 3D structure even from a small fraction of correct contacts. Prompted by this finding, we develop a filter procedure that, when applied, makes the protein reconstruction independent of the protein length with up to 10% of random errors in the map.
Methods
Data set
The protein dataset was selected from SCOP [16], release 1.67. We removed sequence redundancy by using BLAST [17] and retrieved from PDB only those complete structures whose resolution is <2.5 Å. Our final dataset consists of 1716 protein chains with sequence similarity <25%.

1362 mono domain proteins: 251 all Alpha, 286 all Beta, 376 Alpha/Beta, 332 Alpha+Beta; and 117 in other classes;

354 multidomain proteins: 17 all Alpha, 42 all Beta, 46 Alpha/Beta, 39 Alpha+Beta; and 210 in other classes.
Protein representation and contact maps
where C_{k} ∈ R^{3×n} is obtained by rotating, translating, or mirroring the coordinate set C. Mirroring is needed since the native structure and its topological mirror share the same distance map and thus the same contact map. In this work we consider structures to be similar only when their RMSD value is ≤5 Å.
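The statement that a structure and its mirror image share the same contact map can be checked with a minimal Python sketch. The function names, the toy coordinates, the convention that a contact means a distance of at most t, and the zero diagonal are all assumptions made for illustration; they are not taken from the FT-COMAR implementation.

```python
import math

def contact_map(coords, t):
    """Return the n x n binary contact map of a list of (x, y, z) points.

    Assumption: entry [i][j] is 1 when the Euclidean distance between
    points i and j is at most t (in Angstrom) and 0 otherwise; the
    diagonal is left at 0.
    """
    n = len(coords)
    cm = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= t:
                cm[i][j] = 1
    return cm

def mirror(coords):
    """Reflect every point through the plane x = 0."""
    return [(-x, y, z) for (x, y, z) in coords]

# Toy, hypothetical coordinates (not taken from any real protein).
toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0), (4.0, 5.0, 3.4)]
native_map = contact_map(toy, t=5)
mirror_map = contact_map(mirror(toy), t=5)
same_map = native_map == mirror_map  # mirroring preserves all distances
```

Because a reflection preserves every pairwise distance, the two maps coincide, which is why the comparison with the native structure has to allow mirroring.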
Description of FT-COMAR 2.0
In this section we describe FT-COMAR 2.0, a fault tolerant version of COMAR that generates 3D structures satisfying the backbone constraints.
FT-COMAR(CM ∈ {-1,0,1}^{n×n}, t ∈ N)
//Preprocessing phase: error filtering
1: CM' ← FILTER(CM)
//First phase: initial solution generation
2: C ← FT-RANDOM-PREDICT(CM', t)
//Second phase: refinement
3: C ← FT-CORRECT(CM', C, t)
4: set ε to a strictly positive value
5: while C is not a Cα trace consistent with CM' and ε > 0 do
6:     C ← FT-PERTURBATE(CM', C, t, ε)
7:     C ← FT-CORRECT(CM', C, t)
8:     decrement slightly ε
9: if C is not a Cα trace then C ← Cα-TRACE(CM', C, t)
10: return C
FT-COMAR consists of three phases. In the preprocessing phase, the input contact map is scanned with a filtering procedure (FILTER) in order to mark the unsafe entries. The marked entries are then ignored in the subsequent computations (FT-RANDOM-PREDICT, FT-PERTURBATE and FT-CORRECT). In the first phase (Phase 1), the algorithm generates a random initial set of 3D coordinates C ∈ R^{3×n} (FT-RANDOM-PREDICT) that is the starting point for the refinement procedure. In Phase 2 the algorithm iteratively applies two local correction/perturbation techniques to the current set of coordinates, FT-CORRECT and FT-PERTURBATE. This procedure refines the initial set of coordinates and eventually leads to a new set of coordinates that is completely or almost completely consistent with the given contact map. The refinement continues until the set of coordinates satisfies the protein constraints provided by the input contact map or until a control parameter ε becomes 0. The control parameter ε has an initial positive value and is decremented after each refinement step.
As a final check, the Cα-TRACE function ensures that the reconstructed structure satisfies the backbone constraints, namely the distance between consecutive coordinates, set between 3.5 and 4 Å, and the minimum distance between any pair of coordinates, set to 3.5 Å. The FILTER function identifies unsafe areas of the contact map. The functions FT-RANDOM-PREDICT, FT-CORRECT and FT-PERTURBATE are similar to those of the non fault tolerant version, the only difference being that they neglect the entries of the contact map labelled as unsafe. FT-RANDOM-PREDICT computes the initial solution. When fragments of the protein show a high degree of independence with respect to mutual interactions, FT-RANDOM-PREDICT splits the initial contact map into submatrices. A set of coordinates is then generated separately for each submatrix with an embedding algorithm [3], and the sets of coordinates are merged to give the initial solution. FT-CORRECT moves residues in the reconstructed 3D structure in order to decrease the difference between entries of the computed and input contact maps while preserving identical values. In alternation with FT-CORRECT, FT-PERTURBATE perturbs the residue positions to optimise the overlap of the contact maps. Details on these functions can be found in [12].
In the following we describe FILTER and Cα-TRACE, which are new developments. FILTER searches input contact maps for 'unsafe' areas, namely false entries due to noise. This is implemented by assuming that two residues i, j are in contact if and only if they share a high number of neighbours, i.e. there is a high number of residues which are in contact with both i and j. In our dataset, at the selected contact threshold (12 Å, section 2.2), only 6% of the residue pairs which are in contact share fewer than 10 neighbours, and just 0.7% of the residue pairs which are not in contact share >18 neighbours. Thus our filtering procedure marks the entry CM[i, j] as unsafe (setting CM[i, j] to -1) if:

CM[i, j] = 1 (i and j are in contact) and i, j share <10 neighbours, i.e. residue i is in contact with <10 residues which are also in contact with residue j;

CM[i, j] = 0 (i and j are not in contact) and i, j share >18 neighbours, i.e. residue i is in contact with >18 residues which are also in contact with residue j.
The FILTER output is the contact map with unsafe entries set to -1. These entries are then neglected by FT-COMAR.
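The two filtering rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `shared_neighbours` and `filter_map` are hypothetical, unsafe entries are encoded as -1, the map is assumed symmetric, residues i and j themselves are not counted as shared neighbours, and neighbour counts are always taken on the unfiltered input map. None of these details is confirmed by the text.

```python
def shared_neighbours(cm, i, j):
    """Count residues k (other than i and j) in contact with both i and j."""
    return sum(1 for k in range(len(cm))
               if k != i and k != j and cm[i][k] == 1 and cm[k][j] == 1)

def filter_map(cm, min_shared=10, max_shared=18):
    """Return a copy of cm where suspicious entries are set to -1 (unsafe).

    A contact is unsafe when the pair shares fewer than min_shared
    neighbours; a non-contact is unsafe when the pair shares more than
    max_shared neighbours. Counts use the original map cm throughout.
    """
    n = len(cm)
    out = [row[:] for row in cm]
    for i in range(n):
        for j in range(i + 1, n):
            s = shared_neighbours(cm, i, j)
            if (cm[i][j] == 1 and s < min_shared) or \
               (cm[i][j] == 0 and s > max_shared):
                out[i][j] = out[j][i] = -1
    return out

# Toy map: residues 0..4 all in mutual contact, residue 5 isolated.
n = 6
toy = [[0] * n for _ in range(n)]
for i in range(5):
    for j in range(5):
        if i != j:
            toy[i][j] = 1
toy[1][2] = toy[2][1] = 0   # non-contact inside a dense region
toy[0][5] = toy[5][0] = 1   # contact with no supporting neighbours
# Thresholds are scaled down only because the toy map is tiny.
filtered = filter_map(toy, min_shared=1, max_shared=2)
```

In this toy map the entries (1, 2) and (0, 5) are the only ones marked as unsafe; with the thresholds of the text (10 and 18) the same function applies to maps computed at 12 Å.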
The Cα-TRACE function changes a given set of coordinates to satisfy the following constraints, as derived from the Cα protein representation:

the distance between consecutive coordinates i, i+1 is between 3.5 and 4 Å;

the distance between any pair of coordinates i, j is ≥3.5 Å.
The coordinate refinement is obtained with a correction/perturbation cycle (similarly to the refinement phase of FT-COMAR, section 2.3).
Cα-TRACE(CM ∈ {-1,0,1}^{n×n}, C ∈ R^{3×n}, t ∈ N)
1: set ε to a strictly positive value
2: while C is not a Cα trace consistent with CM and ε > 0 do
3:     C ← FT-PERTURBATE-TRACE(CM, C, t, ε)
4:     C ← FT-CORRECT-TRACE(CM, C, t)
5:     decrement slightly ε
6: if C is not a Cα trace then C ← Cα-TRACE-FIX(C, t)
7: return C
Here FT-CORRECT-TRACE and FT-PERTURBATE-TRACE are similar to FT-CORRECT and FT-PERTURBATE, with the only addition of the Cα-TRACE constraints. FT-CORRECT-TRACE moves residues and FT-PERTURBATE-TRACE refines their mobility. When, after refinement (lines 1-5 of Cα-TRACE), the set of coordinates C is still not a Cα trace, the function Cα-TRACE-FIX imposes the Cα-TRACE constraints while neglecting the original contact map. This is obtained by running Cα-TRACE with an "unsafe" contact map (all entries set to -1).
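The test "C is a Cα trace" used in the pseudocode can be illustrated with a small Python check of the two distance constraints listed above. The function name `is_ca_trace`, its parameters and the toy coordinates are hypothetical; the sketch only verifies the constraints and does not attempt to repair a structure as Cα-TRACE does.

```python
import math

def is_ca_trace(coords, min_bond=3.5, max_bond=4.0, min_pair=3.5):
    """Check the two Cα-trace constraints on a list of (x, y, z) points.

    1. consecutive points are between min_bond and max_bond Angstrom apart;
    2. every pair of points is at least min_pair Angstrom apart.
    """
    n = len(coords)
    for i in range(n - 1):
        d = math.dist(coords[i], coords[i + 1])
        if not (min_bond <= d <= max_bond):
            return False
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < min_pair:
                return False
    return True

# Hypothetical toy chains, in Angstrom.
straight = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
clashing = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (1.0, 2.6, 0.0)]
stretched = [(0.0, 0.0, 0.0), (4.5, 0.0, 0.0)]

ok_straight = is_ca_trace(straight)    # both constraints hold
ok_clashing = is_ca_trace(clashing)    # points 0 and 2 are closer than 3.5
ok_stretched = is_ca_trace(stretched)  # consecutive distance exceeds 4
```

The second chain has valid consecutive distances but folds back onto its first point, so only the pairwise constraint fails; the third chain fails the consecutive-distance constraint.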
Introducing random errors in real contact maps
To evaluate the fault tolerance of FT-COMAR to white noise (i.e. random errors) we introduce three types of random errors:

Err. A random error is generated by flipping a random entry of the native contact map (Figure 2b). To introduce x% errors we generate x errors for every 100 pairs of residues, and the total number of errors is:
$\frac{x}{100}\cdot\frac{n(n-1)}{2}$ (2)

Err1 (errors on contacts). The entry of the contact map is flipped only if it is a contact (Figure 2c). Here x% errors indicate that the total number of errors is:
$\frac{x}{100}\cdot\#\text{contacts}$ (3)

Err0 (errors on non-contacts). Errors are generated as before, but by changing only entries of the contact map that are non-contacts (Figure 2d). Here x% errors indicate that the total number of errors is:
$\frac{x}{100}\left(\frac{n(n-1)}{2}-\#\text{contacts}\right)$ (4)
where n is the protein length.
To test the reconstruction tolerance in the presence of x% random errors, for every native contact map we generate 10 distinct perturbed maps and run our algorithm, which is partially randomized, 10 times on each map, for a total of 100 runs.
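The three error types and the counts in equations (2)-(4) can be sketched as follows. This is an illustrative Python sketch, not the code used for the experiments: the names `upper_pairs` and `perturb` are hypothetical, the map is assumed symmetric with only pairs i < j counted, errors are assumed to be drawn without repetition, and rounding the number of errors to the nearest integer is an assumption.

```python
import random

def upper_pairs(n):
    """All residue pairs (i, j) with i < j; there are n(n-1)/2 of them."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)]

def perturb(cm, x, kind="Err", rng=None):
    """Return a copy of the native map cm with x% random errors.

    kind = "Err"  : flips drawn among all pairs          (equation 2)
    kind = "Err1" : flips drawn among contacts only      (equation 3)
    kind = "Err0" : flips drawn among non-contacts only  (equation 4)
    """
    rng = rng or random.Random()
    pairs = upper_pairs(len(cm))
    if kind == "Err1":
        pairs = [(i, j) for (i, j) in pairs if cm[i][j] == 1]
    elif kind == "Err0":
        pairs = [(i, j) for (i, j) in pairs if cm[i][j] == 0]
    k = round(x / 100 * len(pairs))  # rounding is an assumption
    out = [row[:] for row in cm]
    for i, j in rng.sample(pairs, k):
        out[i][j] = out[j][i] = 1 - cm[i][j]  # flip contact <-> non-contact
    return out

# Toy native map of length 20: residues at sequence distance 1 or 2 in contact.
native = [[1 if 0 < abs(i - j) <= 2 else 0 for j in range(20)] for i in range(20)]
noisy = perturb(native, 10, "Err0", random.Random(0))
```

For this toy map there are 190 pairs, 37 contacts and 153 non-contacts, so 10% errors correspond to 19 flips for Err, 4 for Err1 and 15 for Err0 after rounding. Calling `perturb` 10 times with different seeds would give the 10 perturbed maps per native map described above.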
Computational environment
Testing FT-COMAR is computationally expensive, since several applications must be run to introduce errors in the contact maps, compute the reconstruction and evaluate the performance. Each execution is repeated 100 times, as described in section 2.3, for a total of 12,154,234 jobs. This is a typical example of a parameter sweep application (PSA), i.e. it consists of many loosely coupled tasks that can be executed in parallel [19, 20]. A single execution runs in a time ranging from microseconds to several minutes, depending on the protein length and on the percentage of errors introduced. Here the whole experiment was run by using the LIBI Grid PSE [21]. The average number of jobs running concurrently over the EGEE and SPACI Grid infrastructures was about 120, with a total of 4,500 different worker nodes. In this way the execution time was greatly reduced, from 34.16 years on a typical PC to about three months.
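As a rough consistency check of these figures, the reported wall-clock time can be compared with an idealized estimate. The sketch below assumes perfect parallel speedup over the average number of concurrent jobs and ignores scheduling and data-transfer overheads, which is a simplification and not a description of how the Grid actually behaved.

```python
# Figures quoted in the text.
sequential_years = 34.16   # estimated time on a single typical PC
concurrent_jobs = 120      # average number of jobs running at once

# Idealized wall-clock time, assuming perfect speedup (an assumption).
ideal_months = sequential_years * 12 / concurrent_jobs
```

Under this assumption the estimate is about 3.4 months, in line with the reported "about three months".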
Results and Discussion
Protein structure reconstruction from contact maps with white noise
Protein structure reconstruction from contact maps as a function of the white noise type
Reconstruction of contact maps as function of its partial deletion
Reconstruction of contact maps as function of prefiltered white noise
Computing time
Conclusions
Reconstruction of contact maps is a necessary step of 3D protein reconstruction. The step is particularly relevant when contact maps are predicted. Presently the prediction quality of contact maps is still too low to allow protein reconstruction, and this has been discussed elsewhere [15]. In this work we focus on the effect of white noise on contact map reconstruction, with the specific aim of setting some constraints for future developments. For this reason we undertook a large scale analysis of the effect of random noise on the reconstruction of contact maps with FT-COMAR. Reconstruction quality decreases at increasing protein length and is rather independent of the protein structural class, with the exception of all-alpha proteins, which on average are the most difficult to reconstruct. This can be reconciled with the suggestion that in contact maps long range contacts play a critical role in 3D reconstruction [1, 18] and that all-alpha proteins have fewer long range contacts than the other SCOP classes.
The large scale analysis, which allows more accurate statistics than before, also indicates that 25% of randomly selected entries of the native contact map are enough to correctly reconstruct the protein structure. Considering that introducing random errors quickly degrades the quality of reconstruction, and that this is not due to the random flipping of contacts into non-contacts, we conclude that the correctness of the contacts in the map is more important than their relative abundance. Therefore our large-scale effort validates the concept that wrong contacts make the reconstruction more problematic than missed contacts. Essential contacts for protein reconstruction were described before [14]. In our hands and for FT-COMAR as well, a few key contacts are more conducive to the real or close-to-real protein structure than many noisy contacts. Prompted by this, we developed a simple filtering procedure. Its application, which labels certain blurred areas of the map as "unsafe", greatly improves the quality of the reconstructed structures even for long protein chains. Altogether these findings are landmarks to be considered in developing future 3D reconstruction tools and also predictors of contact maps.
Declarations
Acknowledgements
We thank MIUR for the following grants: PNR 2001-2003 (FIRB art.8) and PNR 2003 projects (FIRB art.8) on Bioinformatics for Genomics and Proteomics, both delivered to RC. All the authors thank the LIBI-Laboratorio Internazionale di BioInformatica.
Authors’ Affiliations
References
 Izarzugaza JM, Graña P, Tress ML, Valencia A, Clarke ND: Assessment of intramolecular contact predictions for CASP7. Proteins. 2007, 69 (Suppl 8): 152-8. 10.1002/prot.21637.
 Breu H, Kirkpatrick DG: Unit disk graph recognition is NP-hard. Computational Geometry. 1998, 9: 3-24. 10.1016/S0925-7721(97)00014-X.
 Havel TF: Distance Geometry: Theory, Algorithms, and Chemical Applications. Encyclopedia of Computational Chemistry. 1998, John Wiley & Sons, Ltd.
 Moré J, Wu Z: Distance geometry optimization for protein structures. Journal on Global Optimization. 1999, 15: 219-234.
 De Groot BL, van Aalten DMF, Scheek RM, Amadei A, Vriend G, Berendsen HJC: Prediction of protein conformational freedom from distance constraints. Proteins. 1997, 29: 240-251. 10.1002/(SICI)1097-0134(199710)29:2<240::AID-PROT11>3.0.CO;2-O.
 Bohr J, Bohr H, Brunak S, Cotterill RMJ, Fredholm H, Lautrup B, Petersen SB: Protein structures from distance inequalities. J Mol Biol. 1993, 231: 861-869. 10.1006/jmbi.1993.1332.
 Galaktionov SG, Marshall GR: Properties of intraglobular contacts in proteins: an approach to prediction of tertiary structure. System Sciences, 1994. Vol. V: Proceedings of the Twenty-Seventh Hawaii International Conference on Biotechnology Computing. 1994, 5: 326-335.
 Pollastri G, Vullo A, Fiasconi P, Baldi P: Modular DAG-RNN Architectures for Assembling Coarse Protein Structures. J Comp Biol. 2006, 13 (3): 631-650. 10.1089/cmb.2006.13.631.
 Vendruscolo M, Kussell E, Domany E: Recovery of protein structure from contact maps. Folding and Design. 1997, 2 (5): 295-306. 10.1016/S1359-0278(97)00041-2.
 Vendruscolo M, Domany E: Protein folding using contact maps. Vitam Horm. 2000, 58: 171-212.
 Chen Y, Ding F, Dokholyan NV: Fidelity of the Protein Structure Reconstruction from Inter-Residue Proximity Constraints. J Phys Chem B. 2007, 111 (25): 7432-7438. 10.1021/jp068963t.
 Vassura M, Margara L, Di Lena P, Medri F, Fariselli P, Casadio R: Reconstruction of 3D Structures From Protein Contact Maps. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2008, 5 (3), July-September. 10.1109/TCBB.2008.27.
 Vassura M, Margara L, Di Lena P, Medri F, Fariselli P, Casadio R: Fault Tolerance for Large Scale Protein 3D Reconstruction from Contact Maps. Seventh International Workshop on Algorithms in Bioinformatics (WABI 2007), Pennsylvania 2007. Springer Verlag Lecture Notes in Bioinformatics. 2007, 4645: 25-37.
 Sathyapriya R, Duarte JM, Stehr H, Filippis I, Lappe M: Defining an Essence of Structure Determining Residue Contacts in Proteins. PLoS Comput Biol. 2009, 5 (12): e1000584. 10.1371/journal.pcbi.1000584.
 Vassura M, Margara L, Di Lena P, Medri F, Fariselli P, Casadio R: FT-COMAR: fault tolerant three-dimensional structure reconstruction from protein contact maps. Bioinformatics. 2008.
 Andreeva A, Howorth D, Brenner SE, Hubbard TJ, Chothia C, Murzin AJ: SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res. 2004, 32 (Database issue): D226-9. 10.1093/nar/gkh039.
 Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-402. 10.1093/nar/25.17.3389.
 Bartoli L, Capriotti E, Fariselli P, Martelli PL, Casadio R: The pros and cons of predicting protein contact maps. Protein Structure Prediction. Edited by: Zaki MJ, Bystroff C. 2008, Humana Press: New York, NY, USA, 199-217.
 Stankovski V, Dubitzky W: Special section: Data mining in grid computing environments. Future Generation Computer Systems. 2007, 23 (1): 31-33. 10.1016/j.future.2006.05.001.
 Sudholt WB, Kim K, Abramson D, Enticott C, Garic S, Kondric C, Nguyen D: Application of grid computing to parameter sweeps optimizations in molecular modelling. Future Generation Computer Systems. 2005, 21 (1): 27-35. 10.1016/j.future.2004.09.010.
 Mirto M, Epicoco I, Fiore S, Cafaro M, Negro A, Tartarini D, Lezzi D, Marra O, Turi A, Ferramosca A, Zara V, Aloisio G, Donvito G, Carota L, Cuscela G, Maggi GP, La Rocca G, Mazzucato M, My S, Selvaggi G, Scioscia G, Leo P, Di Pace L, Pappada' G, Quinto V, Berardi M, Falciano G, Emerson A, Rossi E, Lavorgna G, Vanni A, Bartoli L, Di Lena P, Fariselli P, Fronza R, Margara L, Montanucci L, Martelli PL, Rossi I, Vassura M, Casadio R, Castrignanò T, D'Elia D, Grillo G, Licciulli F, Liuni S, Gisel A, Santamaria M, Vicario S, Saccone C, Anselmo A, Horner D, Mignone F, Pavesi G, Picardi E, Piccolo V, Re M, Zambelli F, Pesole G: The LIBI Grid Platform for Bioinformatics. Handbook of Research on Computational Grid Technologies for Life Sciences, Biomedicine and Healthcare. Edited by: Mario Cannataro, University Magna Graecia of Catanzaro, Italy. 2009, 577-613. ISBN: 9781605663746; Published under Medical Information Science Reference, IGI Global.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.