Graph representation of high-dimensional alpha-helical membrane protein data
© Grunert and Labudde; licensee BioMed Central Ltd. 2013
Received: 7 August 2013
Accepted: 26 November 2013
Published: 2 December 2013
In genomics and proteomics, membrane protein analyses have proven important for understanding complex biological processes. Genome-wide investigations of membrane proteins have revealed a large number of short, distinct sequence motifs. The motifs found so far support the understanding of the folded protein in the membrane environment and provide important information about functional or stabilizing properties. Recently, several integrative approaches have been proposed to extract meaningful information from the membrane environment. However, many information-based approaches deliver results that lack adequate visualisation. In high-throughput protein data analysis, such visual outputs play an important role in evaluating high-dimensional protein data, establishing biological relationships and ultimately providing useful information for research.
We evaluated graphs generated from a statistical analysis of consecutive motifs in helical structures of the membrane environment. Our results show that representative motifs with high occurrence in all investigated protein families are of general importance for alpha-helical membrane structure formation. Further, motifs that frequently co-occur with others, acting as so-called “hubs”, lead to the assumption that they constitute important components of helical structures within the membrane. In contrast, consecutive motifs and hubs that show a high occurrence only in certain families can be classified as important for family-specific functional characteristics. In summary, by linking our graphical results from the high-throughput analysis of membrane proteins with databases, we are able to place them in a biological context.
Our results and the corresponding graphical visualisation support the understanding and interpretation of structure-forming and functional motifs of membrane proteins. They are useful for interpreting and refining the results of commonly used approaches. Finally, we show a simple way to visualise high-dimensional protein data in the context of biologically relevant information.
Keywords: Membrane proteins; Motifs; Graph; Architecture
Proteins are the main catalysts, structural elements, signalling messengers and molecular machines of biological tissues, and they are essential for many fundamental biological processes within organisms. Many of these processes depend on membrane proteins, a class of proteins whose molecules are attached to or associated with the membrane of a cell. Membrane proteins accomplish a variety of biological functions, such as signal and energy transduction, nutrient transport, the maintenance of ion concentration, ligand binding and cell adhesion, which underlines their functional importance in many biological processes. Many fundamental cellular processes involve protein–protein interactions, and membrane proteins are no exception. Comprehensively identifying complexes is important for systematically defining protein function, and hints about the function of an unknown protein can be obtained by investigating its interactions with other proteins of known function. Nervous excitation, oxygen supply, energy balance, immune response and the transmission of signals within cells and from cell to cell all rely on membrane proteins. For example, membrane proteins form specific receptors on the cell surface and serve as the communication interface between the cell’s external and internal environment. Hormones and neurotransmitters can bind to these receptors and thereby trigger specific reactions of the cell. Membrane proteins thus play a fundamental role in cellular and physiological processes, and they perform different tasks: they can act as transport proteins, compound molecules, receptors or enzymes. As structural proteins they determine the cell’s design and ultimately the quality of tissues and of the whole body. As ion channels they regulate the ion concentration in the cell and the excitability of nerves and muscles. As transport proteins they handle vitally important substances such as glucose, which is essential for the energy supply of the whole body.
The identification of such protein complexes and interactions is valuable for two reasons. On the one hand, detailed information about the function of an unknown membrane protein can be obtained by analysing its interactions with proteins of known function. On the other hand, biological processes can be understood as a dynamically fluctuating system, so that the biological role of the unknown membrane protein can be defined more precisely [1, 5]. In summary, membrane proteins mediate the transfer of material and information between cells and organ systems. Functionally intact membrane proteins are indispensable for human health, and they are the target of a large number of drugs and pharmacologically active substances. Defective membrane proteins, however, are associated with many known diseases, such as Alzheimer’s, Parkinson’s, diabetes insipidus, hereditary deafness, cystic fibrosis, retinitis pigmentosa and cancer [6–8].
Sequence motif analyses can help to clarify the unsolved problem of how protein folding and sequence homology are related. At the same time, the enormous increase in membrane protein data and protein structures requires methods for handling such high-dimensional biological data. In this work, our novel statistical approach shows which motifs contribute fundamentally as structural or functional sequence parts. Graph visualisations close the gap between high-throughput protein data analysis and evaluation. Here, we reveal functional and structural relationships of sequence motifs. In summary, we inspect structural and functional aspects of sequence motifs in membrane proteins, largely from a computational point of view.
Materials and methods
Used membrane protein family datasets
As a first step of our analysis, different datasets were obtained. Two of them were derived from the Pfam database. The first dataset (DS1) consists of 32 membrane protein families comprising 2511 proteins with domains of unknown function (DUF), as listed below.
[PF09767, PF09834, PF09842, PF09843, PF09852, PF09858, PF09874, PF09877, PF09878, PF09879, PF09880, PF09881, PF09882, PF09900, PF09913, PF09925, PF09945, PF09946, PF09971, PF09972, PF09973, PF09980, PF09990, PF09991, PF09997, PF10002, PF10011, PF10067, PF10080, PF10081, PF10097, PF10101]
The second dataset (DS2) consists of 11 membrane protein families with 15644 proteins and 160 known structures as listed below.
[PF00001, PF00002, PF00003, PF00664, PF00939, PF01490, PF02932, PF05602, PF06472, PF06814, PF10192]
After the datasets had been obtained, non-redundant sequence sets were generated from DS1 and DS2. To avoid misleading statistics caused by identical or highly similar sequences, CD-HIT and BlastClust were applied with thresholds of 25% and 60%, respectively. Further, we determined the helical structures in transmembrane regions of the proteins under investigation using the TMHMM Server v. 2.0. TMHMM predicts intra- and extracellular regions and integral membrane helices from sequence and additionally reports the probability of the prediction for each residue. According to the TMHMM results, a topological state was assigned to each residue. A residue was labelled ‘TM’ if its posterior probability of being part of a membrane helix was greater than 90%, and ‘nTM’ if its posterior probability for the extra- or intracellular prediction was greater than 90%.
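The residue labelling rule can be illustrated with a minimal Python sketch. It assumes the per-residue posterior probabilities have already been parsed into three plain lists (membrane, inside, outside); the function names, the input layout and the handling of residues that pass neither threshold (labelled `None` here) are assumptions made for illustration and are not taken from the TMHMM output format or from the original analysis. The sketch also assumes that ‘nTM’ requires either the inside or the outside posterior on its own to exceed the threshold.

```python
from typing import List, Optional

THRESHOLD = 0.9  # posterior probability cut-off stated in the text


def assign_topology(p_membrane: List[float],
                    p_inside: List[float],
                    p_outside: List[float]) -> List[Optional[str]]:
    """Label each residue 'TM', 'nTM', or None if neither rule applies."""
    states: List[Optional[str]] = []
    for pm, pi, po in zip(p_membrane, p_inside, p_outside):
        if pm > THRESHOLD:
            states.append("TM")
        elif pi > THRESHOLD or po > THRESHOLD:  # assumption: either one suffices
            states.append("nTM")
        else:
            states.append(None)  # assumption: ambiguous residues stay unlabelled
    return states


def tm_segments(sequence: str, states: List[Optional[str]]) -> List[str]:
    """Collect contiguous runs of 'TM' residues as sequence fragments."""
    segments: List[str] = []
    current: List[str] = []
    for residue, state in zip(sequence, states):
        if state == "TM":
            current.append(residue)
        elif current:
            segments.append("".join(current))
            current = []
    if current:
        segments.append("".join(current))
    return segments
```

The ‘TM’ fragments returned by `tm_segments` would then be the only sequence information passed on to motif extraction.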
Sequence motif extraction
Proteins are large biological molecules consisting of one or more chains of the 20 canonical amino acids; they fold into a three-dimensional structure that is determined by the protein sequence (primary structure). In the current work, only ‘TM’ sequence information was used. From it, we extracted short sequence motifs that contribute to building the membrane protein structure in the ‘TM’ environment. Each extracted motif can be written in a generalized, regular-expression-like form XYn, where X and Y are amino acids separated by n-1 highly variable positions. For example, the pattern GxxxG, with three variable positions between the two glycines, is written GG4.
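To make the XYn notation concrete, the following Python sketch enumerates every XYn motif in a ‘TM’ fragment by pairing each residue X with the residue Y found n positions further along. The upper bound `max_n` and the function names are hypothetical choices for this illustration; the source does not state the range of n that was used, and this is not the authors' extraction code.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def extract_motifs(segment: str, max_n: int = 7) -> List[Tuple[int, str]]:
    """Return (start_index, motif) pairs, each motif written as X + Y + n."""
    found: List[Tuple[int, str]] = []
    for i, x in enumerate(segment):
        for n in range(1, max_n + 1):
            j = i + n
            if j >= len(segment):
                break  # Y would fall outside the fragment
            found.append((i, f"{x}{segment[j]}{n}"))
    return found


def count_motifs(segments: Iterable[str], max_n: int = 7) -> Counter:
    """Count how often each XYn motif occurs over all fragments."""
    counts: Counter = Counter()
    for segment in segments:
        counts.update(motif for _, motif in extract_motifs(segment, max_n))
    return counts
```

For instance, `extract_motifs("GAVLG", max_n=4)` includes the pair `(0, "GG4")`, i.e. a GxxxG occurrence starting at the first residue.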
Topology separation and prediction of discriminative motifs
For the later evaluation of frequently occurring motif combinations, we predicted the topology state of all motifs extracted from ‘TM’ sequence information. This prediction identifies motifs that are atypical for the ‘TM’ environment. Our straightforward approach of information extraction and clustering addresses the prediction task by determining the residue conservation at each variable motif position. First, all single motif occurrences were identified in the non-redundant DS1 and DS2. Using the TMHMM predictions, each motif occurrence was assigned to a topology state as described above. Subsequently, all variable positions within each motif occurrence were examined more closely: for each variable position, the relative occurrence of each amino acid at that position of the motif was calculated and related to its natural occurrence. As described previously, the resulting probabilities were converted into log-odds values. The log-odds values of the variable positions were combined into a vector, yielding log-odds profiles (LOPs). Based on these LOPs, we can assign each variable motif position to a topology state and finally predict the topology state of each motif. This approach is discussed in detail elsewhere.
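The log-odds step can be sketched as follows. This is only an illustration of the general idea, relating the observed amino acid frequency at one variable position to a background frequency; the base-2 logarithm, the pseudocount, the function name and the representation of motif occurrences as plain strings are all assumptions, since the exact formula is given in the cited earlier work and not in this text.

```python
import math
from collections import Counter
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_odds_profile(occurrences: List[str],
                     position: int,
                     background: Dict[str, float],
                     pseudocount: float = 1.0) -> List[float]:
    """Log-odds vector (one value per amino acid) for one variable position.

    occurrences: full sequence stretches of one motif, e.g. 'GAVLG' for GG4.
    position:    index of the variable position inside each occurrence.
    background:  assumed natural frequency of each amino acid.
    """
    counts = Counter(occ[position] for occ in occurrences)
    # The pseudocount (an assumption) avoids log(0) for unseen amino acids.
    total = len(occurrences) + pseudocount * len(AMINO_ACIDS)
    profile: List[float] = []
    for aa in AMINO_ACIDS:
        observed = (counts[aa] + pseudocount) / total
        profile.append(math.log2(observed / background[aa]))
    return profile
```

Computing one such vector per variable position, separately for occurrences labelled ‘TM’ and ‘nTM’, would give profiles that can be compared between the two topology states.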
Information extraction and visualisation from motif architectures
Results and discussion
In summary, we could show that membrane protein families are characterized by individual motifs that reflect their structural and functional properties. Considering all data processing steps, including the final visualisation and the linking with biological databases, we are able to build a bridge between graph information and biological context.
In this work, we have shown how to visualise high-dimensional membrane protein data in the form of graph structures and how to close the gap between high-throughput protein data analysis and evaluation. We analysed 32 polytopic membrane protein families with domains of unknown function and 11 membrane protein families consisting of receptor, transporter and neurotransmitter-gated ion-channel proteins. Transmembrane and non-transmembrane sequence regions were predicted using TMHMM. Possible sequence motifs of variable length were extracted from the predicted ‘TM’ regions using a naive text extraction algorithm. Four immediately consecutive sequence motifs were defined as a statistical frame called “motif-architecture”. Subsequently, a large number of motif-architectures were extracted from all ‘TM’ regions and transformed into graph structures, in which motifs are nodes connected to other nodes by weighted edges. The resulting graphs support the understanding and evaluation of frequently occurring consecutive motifs in the investigated protein families. The high occurrence of these architecture motifs suggests that they are relevant for membrane protein folding within the respective protein structure. Motifs atypical for the ‘TM’ region have also emerged, which points to their involvement in defining a protein’s function; in particular, these include motifs that are part of the consensus pattern of retinal binding sites of Pfam receptor families. Finally, hub motifs, which often occur together with others, point to indispensable motifs in helical regions.
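The transformation of motif-architectures into a weighted graph can be sketched in Python as follows. The sketch assumes that each ‘TM’ region is already available as an ordered list of consecutive motifs, that architectures are obtained with a sliding window of four, that edges connect neighbouring motifs within an architecture, and that an edge weight is the number of times that pair was seen. These choices, the hub score (number of distinct neighbours) and all names are illustrative assumptions, not the published procedure.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def architectures(motifs: List[str], size: int = 4) -> List[Tuple[str, ...]]:
    """Slide a window of `size` consecutive motifs over one TM region."""
    return [tuple(motifs[i:i + size]) for i in range(len(motifs) - size + 1)]


def build_graph(regions: List[List[str]], size: int = 4) -> Dict[Edge, int]:
    """Weighted directed edges between neighbouring motifs of each architecture."""
    weights: Dict[Edge, int] = defaultdict(int)
    for motifs in regions:
        for arch in architectures(motifs, size):
            for a, b in zip(arch, arch[1:]):
                weights[(a, b)] += 1
    return dict(weights)


def hub_scores(weights: Dict[Edge, int]) -> Dict[str, int]:
    """Number of distinct neighbours per motif; high values suggest hubs."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for a, b in weights:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {motif: len(others) for motif, others in neighbours.items()}
```

Note that with overlapping windows a motif pair in the interior of a region is counted once per architecture that contains it, so the weights reflect architecture membership rather than raw pair counts.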
Because protein structure is more strongly conserved in evolution than the sequence composition of the folded protein chains, there are individual motifs or characteristic sequence parts that reflect a certain biochemical function of proteins. This means that membrane protein families are characterized by structural and functional motifs. Thus, such families can be compared on the basis of individual sequence motifs.
A concluding evaluation of our results against biological databases confirms this and shows a simple way of bridging the visualisation of membrane protein data and biological context.
The authors would like to thank the Free State of Saxony and the European Social Fund (ESF) for financial support.
- Eisenberg D, Marcotte EM, Xenarios I, Yeates TO: Protein function in the post-genomic era. Nature. 2000, 405 (6788): 823-826. 10.1038/35015694.
- Luckey M: Membrane Structural Biology. 2008, Cambridge University Press.
- Singer SJ, Nicolson GL: The fluid mosaic model of the structure of cell membranes. Science. 1972, 175 (23): 720-731.
- Venkatakrishnan A, Deupi X, Lebon G, Tate CG, Schertler GF, Babu MM: Molecular signatures of G-protein-coupled receptors. Nature. 2013, 494 (7436): 185-194. 10.1038/nature11896.
- Lan N, Montelione GT, Gerstein M: Ontologies for proteomics: towards a systematic definition of structure and function that scales to the genome level. Curr Opin Chem Biol. 2003, 7 (1): 44-54. 10.1016/S1367-5931(02)00020-0.
- Marsico A, Labudde D, Sapra T, Muller DJ, Schroeder M: A novel pattern recognition algorithm to classify membrane protein unfolding pathways with high-throughput single-molecule force spectroscopy. Bioinformatics. 2007, 23 (2): 231-236. 10.1093/bioinformatics/btl293.
- Childers M, Eckel G, Himmel A, Caldwell J: A new model of cystic fibrosis pathology: lack of transport of glutathione and its thiocyanate conjugates. Med Hypotheses. 2007, 68 (1): 101-112. 10.1016/j.mehy.2006.06.020.
- Rowe SM, Miller S, Sorscher EJ: Cystic fibrosis. N Engl J Med. 2005, 352 (19): 1992-2001. 10.1056/NEJMra043184.
- Liu Y, Engelman DM, Gerstein M: Genomic analysis of membrane protein families: abundance and conserved motifs. Genome Biol. 2002, 3 (10): 1-0054.
- Arkin IT: Statistical analysis of predicted transmembrane α-helices. Biochimica et Biophysica Acta (BBA)-Protein Struct Mol Enzymol. 1998, 1429 (1): 113-128. 10.1016/S0167-4838(98)00225-8.
- Senes A, Gerstein M, Engelman DM: Statistical analysis of amino acid patterns in transmembrane helices: the GxxxG motif occurs frequently, and in association with beta-branched residues at neighboring positions. J Mol Biol. 2000, 296 (3): 921-936. 10.1006/jmbi.1999.3488.
- Russ WP, Engelman DM: The GxxxG motif: a framework for transmembrane helix-helix association. J Mol Biol. 2000, 296 (3): 911-919. 10.1006/jmbi.1999.3489.
- Senes A, Engel DE, DeGrado WF: Folding of helical membrane proteins: the role of polar, GxxxG-like and proline motifs. Curr Opin Struct Biol. 2004, 14 (4): 465-479. 10.1016/j.sbi.2004.07.007.
- Grunert S, Heinke F, Labudde D: Structure topology prediction of discriminative sequence motifs in membrane proteins with domains of unknown functions. Struct Biol. 2013, 2013: 10-
- Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, Heger A, Holm L, Sonnhammer ELL, Eddy SR, Bateman A, Finn RD: The Pfam protein families database. Nucleic Acids Res. 2012, 40 (Database issue): 290-301. 10.1093/nar/gkr1065.
- Li W, Godzik A: CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006, 22 (13): 1658-1659. 10.1093/bioinformatics/btl158.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410.
- Sonnhammer EL, von Heijne G, Krogh A: A hidden Markov model for predicting transmembrane helices in protein sequences. Proc Int Conf Intell Syst Mol Biol. 1998, 6: 175-182.
- Schiffer M, Edmundson AB: Use of helical wheels to represent the structures of proteins and to identify segments with helical potential. Biophys J. 1967, 7: 121-135. 10.1016/S0006-3495(67)86579-2.
- Schuster-Böckler B, Schultz J, Rahman S: HMM logos for visualization of protein families. 2004, 10.1186/1471-2105-5-7.
- Sigrist CJ, de Castro E, Cerutti L, Cuche BA, Hulo N, Bridge A, Bougueleret L, Xenarios I: New and continuing developments at PROSITE. Nucleic Acids Res. 2013, 41 (D1): 344-347. 10.1093/nar/gks1067.
- Sigrist CJ, Cerutti L, Hulo N, Gattiker A, Falquet L, Pagni M, Bairoch A, Bucher P: PROSITE: a documented database using patterns and profiles as motif descriptors. Brief Bioinform. 2002, 3 (3): 265-274. 10.1093/bib/3.3.265.
- de Castro E, Sigrist CJ, Gattiker A, Bulliard V, Langendijk-Genevaux PS, Gasteiger E, Bairoch A, Hulo N: ScanProsite: detection of PROSITE signature matches and ProRule-associated functional and structural residues in proteins. Nucleic Acids Res. 2006, 34 (suppl 2): 362-365.
- Sigrist CJ, De Castro E, Langendijk-Genevaux PS, Le Saux, Bairoch A, Hulo N: ProRule: a new database containing functional and structural information on PROSITE profiles. Bioinformatics. 2005, 21 (21): 4060-4066. 10.1093/bioinformatics/bti614.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.