Mining the entire Protein DataBank for frequent spatially cohesive amino acid patterns

Background: The three-dimensional structure of a protein is an essential aspect of its functionality. Despite the large diversity in protein structures and functionality, it is known that there are common patterns and preferences in the contacts between amino acid residues, or between residues and other biomolecules, such as DNA. The discovery and characterization of these patterns is an important research topic within structural biology as it can give fundamental insight into protein structures and can aid in the prediction of unknown structures.

Results: Here we apply an efficient spatial pattern miner to search for sets of amino acids that occur frequently in close spatial proximity in the protein structures of the Protein DataBank. This allowed us to mine for a new class of amino acid patterns, that we term FreSCOs (Frequent Spatially Cohesive Component sets), which feature synergetic combinations. To demonstrate the relevance of these FreSCOs, they were compared in relation to the thermostability of the protein structure and the interaction preferences of DNA-protein complexes. In both cases, the results matched well with prior investigations using more complex methods on smaller data sets.

Conclusions: The currently characterized protein structures feature a diverse set of frequent amino acid patterns that can be related to the stability of the protein molecular structure and that are independent from protein function or specific conserved domains.

Electronic supplementary material: The online version of this article (doi:10.1186/s13040-015-0038-4) contains supplementary material, which is available to authorized users.

Dear editor,

Thank you very much for reviewing the submission of our paper "Mining the entire Protein DataBank for frequent spatially cohesive amino acid patterns". We would also like to thank the reviewer for his remarks and have changed the manuscript to accommodate his comments.
Below, you will find a complete description of all of the changes made and our responses to each of the reviewer's comments (the reviewer's comments are shown in italic, dark-grey text). In the revised manuscript, all changes are highlighted in red font color.
We thank you for handling our manuscript and hope that the changes we have made can be considered as adequate solutions to the reviewer's comments.

Sincerely yours,
On behalf of the authors,

Pieter Meysman
The work presents an elegant and simple idea, which I believe has great potential. The experimental design seems to be somewhat lacking, however. Here are my suggestions:

Major revisions

0) The manuscript seems to indicate that, unlike in previous publications, the authors have excluded the secondary structure from the labels applied to the residues. Some details on the reasons for this change would be interesting.
The reviewer is correct in his statement that the secondary structure information was included in the conference proceedings where the computational algorithm was introduced. However, at that time the miner had not yet been optimized, and the discovery of the smallest enclosing ball consisting of highly frequent items within a single protein structure took considerable computation time due to the many possible combinations. The secondary structure was therefore appended to reduce the frequency of each item. In the latest version, the algorithm has been considerably improved, as explained in [1], and frequency reduction is no longer necessary. In fact, by removing this information from the mining step, we do not need to exclude any protein structures in PDB where this information is not available or incomplete. In addition, this removes any bias of the mined FreSCOs towards specific secondary structures. A brief summary of this motivation has been added to the main manuscript, which reads:

The utilized Apriori-like algorithm [25] speeds up this pattern search by using several properties of the support and cohesion metrics [23]. As the most efficient version of this algorithm is used for this study, there is no longer a necessity of explicitly including secondary structure information in the miner. In this manner the data set is not limited by any additional annotation being available and allows inclusion of a much larger set of protein structures.
1) Some details on the process to choose the used parameters (0.60 support, 4.5 Å radius) would also shed some light on the whole analysis.
We have added a segment at the start of the Results section explaining our choice of parameters. It reads:

Amino acids in close proximity are defined by a maximum cohesive radius of 4.5 Å for the purposes of this analysis, which corresponds to the distance between the Cα atoms of two residues that are typically considered as interacting [30]. The support of the patterns is set to 0.60, much lower than the frequency of the individual amino acids [see additional file 10], and further reduction of this parameter does not reveal any additional patterns.
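For illustration only, the way these two parameters enter such an analysis can be sketched in a few lines of Python. This is a simplified sketch and not the miner used in the paper: the function names `occurs` and `support` are hypothetical, each residue is assumed to be reduced to a single labelled 3D point, and cohesion is approximated by requiring all matched residues to lie within the given radius of their centroid. That check is stricter than a true smallest-enclosing-ball test, so it may miss some valid occurrences.

```python
from itertools import product
from math import dist


def occurs(structure, pattern, radius=4.5):
    """Return True if `pattern` has a cohesive occurrence in `structure`.

    structure: list of (residue_label, (x, y, z)) tuples (assumed input format).
    pattern:   tuple of residue labels, e.g. ("PHE", "VAL", "LEU").
    radius:    maximum cohesive radius in angstrom.
    """
    # Candidate residue indices for every label in the pattern.
    candidates = [
        [i for i, (label, _) in enumerate(structure) if label == wanted]
        for wanted in pattern
    ]
    for combo in product(*candidates):
        # A residue may not be used twice (matters for repeated labels).
        if len(set(combo)) < len(combo):
            continue
        points = [structure[i][1] for i in combo]
        centroid = tuple(sum(axis) / len(points) for axis in zip(*points))
        # Approximation: ball centred on the centroid, not the smallest ball.
        if max(dist(centroid, p) for p in points) <= radius:
            return True
    return False


def support(structures, pattern, radius=4.5):
    """Fraction of structures that contain at least one cohesive occurrence."""
    hits = sum(occurs(s, pattern, radius) for s in structures)
    return hits / len(structures)
```

Under these assumptions, a pattern would be retained when `support(structures, pattern) >= 0.60`. The brute-force enumeration is exponential in the pattern size and only meant to make the definitions concrete.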
2) The redundancy of the dataset seems to be rather high. Any group of sequences sharing more than 50% identity will have similar structure (Baker & Sali 2001, Science). This will create a redundancy problem with the FreSCOs, causing a potentially biased distribution.
In response to this comment, we have provided more information in the Materials and Methods section about the manner in which we constructed the non-redundant protein structure data set. It reads:

The first is the collection of all structures contained within the RCSB PDB database, obtained on the 3rd of May, 2013 [17]. Only non-redundant protein sequences were retained as annotated by the Vector Alignment Search Tool in the non-redundant PDB chain set at different sequence similarity cut-offs [22]. The largest set of proteins that was considered in this manner, where structures with a BLAST p-value lower than 10^-80 were considered redundant, contains 32 142 protein molecules from a large variety of organisms. As using more stringent redundancy cut-offs had little effect on the resulting patterns [see additional file 10], we chose to use the largest set for increased statistical power.
Furthermore, we have added a supplementary analysis where we reduce the redundancy of our data set and rerun the same analysis. The results for the smaller but less redundant data sets are almost identical to those detailed before. There is very little impact on the cohesion and frequency of the found patterns. Repeating the significance analysis for each data set independently results in mostly the same FreSCOs being significantly more cohesive than the background would suggest. These findings can be found in additional file [X].
3) The limitation of the number of residues to 3 is not clearly explained.
We chose to focus most of our analyses on FreSCOs with three or more amino acids as they are unique to this approach and provide more interesting patterns. Studying single amino acids or paired amino acids can be, and has been, done using other methods. In these cases the FreSCO methodology offers no additional advantages, and any study specifically focused on these types of patterns will likely benefit from using more dedicated approaches, such as the typical protein contact maps. We therefore limit our description of singleton and duplet patterns to a comparison of their information content against that of triplet and higher patterns, as described in the first section of the results. For increased clarity, we have added a sentence at the end of this section, which reads:

Given their increased information content and their intrinsic novelty compared to patterns found with contact map approaches, the triplet (or larger) patterns will be the main focus of the next analyses.
4) Long-range contacts are very briefly described in the last sentence of the "The discovered patterns..." section. They could arguably be, however, some of the most important advantages that a spatial algorithm could have over a sequence-based one. There is almost no mention, however, of the differences between these types of contacts in contrast to short-range ones (<= 6 res?), which are probably much less interesting secondary structure ones.
We agree with the reviewer that one of the key advantages of the spatial mining algorithm that we propose is that it is able to find and describe long-range contacts. To this end, we have added an analysis of the sequence distance between the residues that match the mined FreSCOs in the relevant section. It reads:

Furthermore, many FreSCOs match residues that are in close spatial proximity but at a great distance on the protein sequence (i.e. long-range interactions [33]), even though the pattern miner does not impose such a constraint. The average distance between residues matching FreSCOs is around 20 amino acids along the protein sequence chain for each rule. The distribution of these distances is distinctly bimodal [see additional file 9], with a separation between short-range (less than 6 residues) and long-range (more than 10 residues) interactions. Most FreSCOs have an equal number of short-range and long-range matches, which is to be expected, as the mined patterns do not consider the distance along the protein chain when evaluating best matches.
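The short-range and long-range classification described above can be made concrete with a small sketch. The source does not specify how the sequence distance of a multi-residue match is summarised, so this example assumes pairwise separations between the sequence positions of the matched residues; the function names and the "intermediate" class for separations of 6 to 10 residues are illustrative choices, not taken from the manuscript.

```python
from itertools import combinations


def sequence_separations(positions):
    """Pairwise separations along the chain for one pattern match.

    positions: sequence indices of the residues matched by a pattern.
    """
    return [abs(a - b) for a, b in combinations(positions, 2)]


def classify(separation, short_max=6, long_min=10):
    """Label a separation using the thresholds quoted in the text.

    Short-range: fewer than `short_max` residues apart.
    Long-range:  more than `long_min` residues apart.
    Anything in between is labelled "intermediate" (an assumption).
    """
    if separation < short_max:
        return "short"
    if separation > long_min:
        return "long"
    return "intermediate"
```

For a hypothetical match at sequence positions 12, 15 and 80, `sequence_separations([12, 15, 80])` gives the separations 3, 68 and 65, which are labelled short, long and long respectively.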

5) The presented conclusions about extreme temperature organisms would be highly interesting, but it's hard to understand the "take home message" of this section. The previously mentioned redundancy is probably playing a role in skewing the distribution of mined FreSCOs. Also, the tool would seem rather limited for such an in-depth biological analysis, at least in the way it is currently presented.
We understand the reviewer's concern that the redundancy in our dataset may have impacted this analysis. However, as replied to comment 2, a new analysis has shown that there does not seem to be an impact of redundancy on the mined FreSCOs. In addition, we agree that the presented analysis is far from an in-depth study of the stability of protein structures at higher temperatures. Such an analysis would likely be a paper on its own. The goal here is to discover the functionality of the mined patterns, that is, to answer the question of why we find these FreSCOs and not other amino acid combinations in the protein structures. This is why we start from the patterns mined from the entire data set and not from a specific targeted approach on the high temperature proteins. We have reworked parts of the section in the results to clarify this fact and added a few sentences to the conclusions to underline the "take home message" from these results. It now reads:

In general, comparison to growth temperatures reveals that hydrophilic residues were mostly found to be related with low temperatures, and hydrophobic, acidic and basic residues to high temperatures. Studying these temperature relationships with FreSCOs allows description of their synergistic tendencies among different amino acids and provides some indication of positional context, as was seen for FreSCOs containing glutamate and lysine in combination with hydrophobic residues. Further, the enrichment of specific FreSCOs at higher temperature supports earlier conclusions that the mined patterns play a critical role in protein structure stability.

6) The Pfam based analysis of domains is hard to follow. At first glance, 7 FreSCOs do not seem to be enough to justify the claim that the mined patterns are distributed along the full structure. Some clarifications on this process would be helpful.
We have reworked several key sentences in this section of the results to improve readability. We have also shifted the main conclusions away from the distribution along the entire protein structure, and more towards the non-clustering of all patterns at specific conserved regions. These findings were intended to show that the FreSCOs are not simply the result of conserved regions, for example due to high redundancy in our dataset, but that some actually occur outside of known conserved regions.

Background section:
- the sentence beginning with "For example" is somewhat confusing, in particular the use of the term "protein structure" twice.
This has been changed in the manuscript text.
- On the "Another possible reason" sentence, removing the first "information" and changing the second to "...terms of information..." would improve readability.
This has been changed in the manuscript text.

"Specific patterns correlate" section:
- On the "It is already well known..." sentence, the "well known" and "is known" phrases make the sentence difficult to read.
This has been changed in the manuscript text.
- "One can easily envision..." sentence: remove the second "both".
This has been changed in the manuscript text.

Discretionary revisions
Mentioned as future work, but I'd like to point out that a functional layer of analysis, perhaps based on the Gene Ontology, would greatly improve the impact of the paper.
Following this request by the reviewer, we have added a gene ontology analysis of the mined FreSCOs. Here we found that several gene ontology terms are significantly correlated with the presence of specific FreSCOs. Many of the FreSCO associations can be explained as being functional, such as the hydrophobic FreSCO patterns that are correlated to proteins annotated as present in the cell membrane. A more extended and focused functional analysis using the spatial mining algorithm that we have presented here will likely be able to uncover new patterns missed by previous analyses. However, such an endeavor is left for future work. The new paragraph that has been added to the Results section of the paper reads:

A gene ontology analysis reveals that certain gene ontology terms are highly enriched for specific FreSCOs with a clear functional role [see additional file 8]. For example, protein structures annotated as membrane proteins are highly enriched for FreSCOs consisting of mainly hydrophobic amino acids, such as PHE-VAL-LEU. The majority of significant FreSCO associations arise from the molecular function ontology tree. A large number of FreSCOs are enriched in proteins that bind nucleotides, or have a transferase or oxidoreductase activity. An exhaustive review of all enriched FreSCOs in these gene ontology terms would exceed the scope of this paper. However, for the nucleotide-binding proteins many FreSCOs include arginine, which is known to mediate a large number of nucleotide interactions [35,36].
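The statistical test behind this enrichment analysis is not stated in this response. As one common way to quantify such an association, and purely as an assumption for illustration, a one-sided hypergeometric test can be sketched as follows; the function name and argument names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_total, n_with_term, n_with_pattern, n_both):
    """One-sided hypergeometric p-value P(X >= n_both).

    n_total:        number of protein structures in the data set
    n_with_term:    structures annotated with the gene ontology term
    n_with_pattern: structures containing the pattern of interest
    n_both:         structures with both the term and the pattern
    """
    upper = min(n_with_term, n_with_pattern)
    total = comb(n_total, n_with_pattern)
    # Sum the probabilities of observing an overlap of n_both or larger.
    tail = sum(
        comb(n_with_term, i) * comb(n_total - n_with_term, n_with_pattern - i)
        for i in range(n_both, upper + 1)
    )
    return tail / total
```

A small p-value would suggest that the pattern co-occurs with the term more often than expected by chance. In practice a correction for multiple testing would be needed, since many pattern and term pairs are evaluated.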

- the sentence beginning with "A common..." should read "A common approach to find*ing* such patterns involves *the* transformation..."
This has been changed in the manuscript text.
- the sentence beginning with "Significant research..." may read better as "*A* significant *amount* of research has therefore gone into solving *this problem*..."
This has been changed in the manuscript text.

"Triplet patterns feature..." subsection: