Membrane protein family datasets used
As a first step of our analysis, several datasets were obtained. Two of them were derived from the Pfam database [15]. The first dataset (DS1) consists of 32 membrane protein families, comprising 2511 proteins with domains of unknown function (DUF), as listed below.
[PF09767, PF09834, PF09842, PF09843, PF09852, PF09858, PF09874, PF09877, PF09878, PF09879, PF09880, PF09881, PF09882, PF09900, PF09913, PF09925, PF09945, PF09946, PF09971, PF09972, PF09973, PF09980, PF09990, PF09991, PF09997, PF10002, PF10011, PF10067, PF10080, PF10081, PF10097, PF10101]
The second dataset (DS2) consists of 11 membrane protein families with 15644 proteins and 160 known structures as listed below.
[PF00001, PF00002, PF00003, PF00664, PF00939, PF01490, PF02932, PF05602, PF06472, PF06814, PF10192]
After the datasets had been obtained, non-redundant sequence sets were generated from DS1 and DS2. To avoid misleading statistics caused by identical or highly similar sequences, CD-HIT [16] and BlastClust [17] were applied with threshold settings of 25% and 60%, respectively. Further, we determined the helical transmembrane regions of the proteins under investigation using the TMHMM Server v. 2.0 [18]. TMHMM predicts intra- and extracellular regions and integral membrane helices from the sequence alone, and reports a posterior probability of the prediction for each residue. Based on the TMHMM results, a topological state was assigned to each residue. A residue was assigned as ‘TM’ if its posterior probability of being part of a membrane helix was greater than 90%. If its posterior probability for the extra- or intracellular prediction was greater than 90%, the residue was assigned as ‘nTM’.
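The residue labelling rule above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in this work: the function name `assign_topology` and the input layout (one tuple of inside, membrane and outside posterior probabilities per residue) are assumptions, as is the choice to test the inside and outside probabilities separately and to leave residues that pass neither cutoff unassigned.

```python
from typing import List, Optional, Sequence, Tuple

THRESHOLD = 0.9  # the 90% posterior probability cutoff stated in the text


def assign_topology(
    posteriors: Sequence[Tuple[float, float, float]],
    threshold: float = THRESHOLD,
) -> List[Optional[str]]:
    """Label each residue as 'TM', 'nTM' or None (unassigned).

    Assumed input layout (hypothetical): one tuple
    (p_inside, p_membrane, p_outside) per residue.
    """
    states: List[Optional[str]] = []
    for p_inside, p_membrane, p_outside in posteriors:
        if p_membrane > threshold:
            states.append("TM")
        elif p_inside > threshold or p_outside > threshold:
            # Assumption: inside and outside are tested separately.
            states.append("nTM")
        else:
            # Residues without a confident prediction stay unlabelled.
            states.append(None)
    return states
```

For example, a residue with a membrane probability of 0.98 would be labelled ‘TM’, whereas a residue with probabilities of 0.50, 0.45 and 0.05 would remain unassigned under these assumptions.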
Sequence motif extraction
Proteins are large biological molecules that fold into a three-dimensional structure determined by the protein sequence (primary structure), which consists of one or more chains built from the 20 canonical amino acids. In the current work, only ‘TM’ sequence information was used for the analysis. In this context, short sequence motifs that contribute to building the membrane protein structure in the ‘TM’ environment were extracted. Each extracted motif can be written in a generalized, regular-expression-like form XYn, where X and Y are amino acids separated by n-1 highly variable positions.
A naive text search algorithm was applied for motif extraction (see Figure 2). The algorithm moves a window step by step along the sequence. Starting from the first position, each of the defined window sizes yields a sequence cutout of matching size, and each cutout is transcribed into the form XYn. More specifically, at each ‘TM’ sequence position i the algorithm returns the starting amino acid X and, at position i + n, the ending amino acid Y of the corresponding motif XYn. The result is a duplicate-free list of motifs in XYn form with n = 4 to 7.
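The window-based extraction described above can be illustrated with a short Python sketch. The function name `extract_motifs` and the string encoding of a motif (X, then Y, then n, for example `AF4`) are assumptions made for illustration and do not reproduce the original implementation.

```python
from typing import Iterable, Set


def extract_motifs(tm_sequence: str, distances: Iterable[int] = (4, 5, 6, 7)) -> Set[str]:
    """Collect all XYn motifs of one 'TM' sequence without duplicates.

    For every position i and every distance n, X is the residue at i
    and Y is the residue at i + n, so n - 1 variable positions lie
    between them.
    """
    motifs: Set[str] = set()
    for n in distances:
        # The last valid start position is len(tm_sequence) - n - 1.
        for i in range(len(tm_sequence) - n):
            x = tm_sequence[i]
            y = tm_sequence[i + n]
            motifs.add(f"{x}{y}{n}")  # hypothetical text encoding of XYn
    return motifs


if __name__ == "__main__":
    # Toy sequence, not taken from DS1 or DS2.
    print(sorted(extract_motifs("ACDEFGHIK", distances=(4,))))
```

Using a set keeps the result free of duplicates, which matches the duplicate-free motif list described in the text.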
Topology separation and prediction of discriminative motifs
For the later evaluation of frequently occurring motif combinations, we predicted the topology state of all motifs extracted from ‘TM’ sequence information. The aim of this prediction task is to identify motifs that are atypical for the ‘TM’ environment. We address it with a new, straightforward approach of information extraction and clustering that determines the residue conservation at each variable motif position. First, all occurrences of each motif were identified in the non-redundant DS1 and DS2. Using the TMHMM predictions, each motif occurrence was assigned to a topology state as described above. Subsequently, all variable positions within each motif occurrence were examined more closely. For each variable position of each motif, the relative occurrence of each amino acid was calculated and set in relation to its natural occurrence. As described in [14], the significance of each resulting probability was assessed with a log-odds formula. The log-odds values of the variable positions were transformed into a vector, which yields the logOdd-profiles (LOPs). Based on these LOPs, each variable motif position can be assigned to a topology state, and finally the topology state of each motif can be predicted. This approach is discussed in detail in [14].
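To make the idea of a log-odds profile more concrete, the following Python sketch computes, for every variable position of a motif, the log ratio between the observed relative frequency of each amino acid and a background frequency. The exact formula is given in [14] and is not reproduced here. The base-2 logarithm, the function name `log_odds_profile`, the caller-supplied `background` table and the omission of unobserved amino acids are all assumptions of this sketch.

```python
import math
from collections import Counter
from typing import Dict, List, Mapping, Sequence


def log_odds_profile(
    occurrences: Sequence[str],
    background: Mapping[str, float],
) -> List[Dict[str, float]]:
    """Return one dict of log-odds values per variable motif position.

    occurrences: sequence cutouts of one motif XYn, each of length
                 n + 1, so the interior positions are the variable ones.
    background:  assumed natural frequency of each amino acid.
    """
    if not occurrences:
        raise ValueError("at least one motif occurrence is required")
    length = len(occurrences[0])
    if any(len(occ) != length for occ in occurrences):
        raise ValueError("all occurrences must have the same length")

    profile: List[Dict[str, float]] = []
    for pos in range(1, length - 1):  # skip the fixed X and Y positions
        counts = Counter(occ[pos] for occ in occurrences)
        total = sum(counts.values())
        # Amino acids never observed at this position are left out
        # (assumption; a pseudocount would be an alternative).
        profile.append(
            {aa: math.log2((c / total) / background[aa]) for aa, c in counts.items()}
        )
    return profile
```

Positive values indicate amino acids that are enriched at a variable position relative to the background, and negative values indicate depletion.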
Information extraction and visualisation from motif architectures
Furthermore, for the statistical analysis of frequently occurring consecutive motifs in ‘TM’ regions, a restrictive statistical frame called “motif-architecture” (MA) was defined. In this work, an MA comprises exactly four directly consecutive motifs. The number four follows from the number of residues occupying the ‘TM’ environment and the maximum motif length defined for this work. Directly consecutive means that a motif immediately follows the previous one (Figure 3), without residue gaps between the two. Analysing the ‘TM’ sequence information for MAs produced a result set of MAs, so that a list of MAs can be assigned to each investigated ‘TM’ region. To make established graph algorithms applicable in the further statistical analysis, each MA found was treated as a graph structure (see Figure 4). In general, a graph consists of nodes connected by edges. In an MA, each motif is a node connected to the next node by a weighted edge. The weight of an edge between two nodes depends on how often an edge with the same source and target node occurs across all detected MAs. One main graph per ‘TM’ region was created by merging all graphs from the corresponding ‘TM’ list, which yields as many graphs as there are ‘TM’ regions to be analysed. In the final step, all ‘TM’ graphs were merged into one main graph by the same procedure, with the weight of each already existing edge increased by one. The final main graph contains all motifs as nodes connected by weighted edges. By defining an edge weight threshold, the graph can be reduced by removing weakly weighted edges and keeping the stronger ones. These steps were applied to DS1, DS2 and selected protein families.
This workflow for membrane environment information extraction and transformation is shown in Figure 5.
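The graph construction, merging and threshold-based reduction of motif architectures can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: a graph is represented as a `Counter` mapping (source motif, target motif) pairs to weights, the function names are hypothetical, and weights are summed during merging so that the final weight of an edge equals the number of times it occurs across all MAs. The motif names in the example are invented.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

Edge = Tuple[str, str]


def ma_to_graph(ma: Sequence[str]) -> Counter:
    """Turn one MA (four consecutive motifs) into three weighted edges."""
    if len(ma) != 4:
        raise ValueError("a motif architecture consists of four motifs")
    return Counter(zip(ma, ma[1:]))


def merge_graphs(graphs: Iterable[Counter]) -> Counter:
    """Merge graphs; edges that already exist gain weight (assumed: summed)."""
    merged: Counter = Counter()
    for graph in graphs:
        merged.update(graph)
    return merged


def prune(graph: Counter, min_weight: int) -> Dict[Edge, int]:
    """Keep only edges whose weight reaches the threshold."""
    return {edge: w for edge, w in graph.items() if w >= min_weight}


if __name__ == "__main__":
    # Hypothetical MAs for two 'TM' regions.
    region1 = [("AG4", "GL5", "LV4", "VI6"), ("AG4", "GL5", "LV4", "IL7")]
    region2 = [("AG4", "GL5", "LA6", "AV4")]
    region_graphs = [
        merge_graphs(ma_to_graph(ma) for ma in region) for region in (region1, region2)
    ]
    main_graph = merge_graphs(region_graphs)
    print(prune(main_graph, min_weight=2))
```

A plain edge counter is sufficient for the merging and pruning steps. A dedicated graph library would become useful once path or cluster analyses are run on the reduced main graph.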