Biofilter: overview
As mentioned, Biofilter has three primary analysis modes which each make use of the available biological knowledge in slightly different ways: Filtering, Annotation, and Modeling. For the purpose of annotation and filtering, Biofilter takes a list of loci or a list of regions, and then either filters that list by another list (whether provided as input or using information from LOKI), or annotates the list provided. For the development of models based on existing biological knowledge, the input is a list of loci (whether SNPs or chromosome and base pair locations), or a list of regions (such as genes), and these are first mapped to known protein-coding gene regions by Biofilter. In the case of CNVs, Biofilter by default maps CNVs to genes by considering CNVs with even one base-pair of overlap as mapped to a given gene. The researcher can change the degree of overlap required for CNVs to map to genes according to preference. Connections are then automatically forged between this resultant list of genes, and any instances of these genes within the sources of LOKI. As a result, starting from a list of loci or regions, Biofilter connects that list to additional existing information. Figure 4 provides a diagram of how this filtering, annotating, and modeling with Biofilter works.
We provide here further details and examples of filtering, annotating, and modeling within Biofilter, although it is important to note that annotation, filtering, and modeling with Biofilter are not exclusive, and can be combined to analyse data according to a researcher’s preferences.
Filtering
The most straightforward of Biofilter’s primary functions is, as the name implies, filtering. Given any combination of input data, Biofilter can cross-reference the input data using the relationships stored in the knowledge database to generate a filtered dataset of any supported type (or types). For example, a very straightforward use of Biofilter would be to obtain the list of all genes within a specific data source in LOKI, as visualized using the simulated knowledge database in Figure 5. Another example of filtering with Biofilter would be to take a list of SNPs (such as those covered by a genotyping platform) and a list of genes (such as those thought to be related to a particular phenotype) and then use Biofilter to request the set of SNPs existing within those genes. Biofilter will use LOKI’s knowledge of SNP positions and gene regions to filter the provided SNP list, removing all those that are not located within any of the provided genes. Figure 6 shows an example of this using the simulated knowledge database, and the resultant filtered output.
The output data type does not necessarily have to be the same data type(s) provided as input. For example, a researcher can provide a list of SNPs and a list of groups and request the set of genes that match both lists. In this case, there is no input set of genes to use as a starting point so Biofilter will check all known genes found in the knowledge database. The result is a list of only the genes which include at least one of the specified SNPs, and are a part of at least one of the specified groups.
Finally, filtering is not limited to a single data type: Biofilter can also identify all of the unique combinations of data types which jointly meet the provided criteria. For example, given a list of SNPs and genes, Biofilter can produce a filtered set of SNP-gene pairs. The result is every combination of SNP and gene from the two lists where the SNP is within the gene, or within a user-defined window around the gene.
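The SNP-by-gene filtering described above can be sketched in a few lines of Python. This is only an illustration of the logic under simplifying assumptions, not Biofilter's implementation: the `Snp` and `Gene` records, the function names, and the toy coordinates are all hypothetical, and positions are treated as inclusive base-pair coordinates.

```python
from typing import NamedTuple


class Snp(NamedTuple):
    rs_id: str
    chrom: str
    pos: int


class Gene(NamedTuple):
    symbol: str
    chrom: str
    start: int
    end: int


def snp_gene_pairs(snps, genes, window=0):
    """Return (rs_id, gene symbol) pairs where the SNP lies inside the gene
    region, optionally widened by `window` base pairs on each side."""
    pairs = []
    for snp in snps:
        for gene in genes:
            same_chrom = snp.chrom == gene.chrom
            in_region = gene.start - window <= snp.pos <= gene.end + window
            if same_chrom and in_region:
                pairs.append((snp.rs_id, gene.symbol))
    return pairs


def filter_snps(snps, genes, window=0):
    """Keep only the SNPs located within at least one of the given genes."""
    kept = {rs_id for rs_id, _ in snp_gene_pairs(snps, genes, window)}
    return [snp for snp in snps if snp.rs_id in kept]


# Hypothetical toy data, not taken from the simulated knowledge database.
snps = [Snp("rs1", "1", 150), Snp("rs2", "1", 420), Snp("rs3", "2", 150)]
genes = [Gene("A", "1", 100, 200), Gene("B", "1", 450, 500)]

print(filter_snps(snps, genes))
print(snp_gene_pairs(snps, genes, window=50))
```

With no window only the SNP inside gene A is retained; widening the gene boundaries by 50 base pairs also pairs the second SNP with gene B.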
Annotation
Biofilter can also annotate any of the supported data types with respect to any of the others. Like filtering, the annotations are based on the relationships stored in the knowledge database; unlike filtering, any data which cannot be annotated as requested (such as a SNP which is not located within any gene) will still be included in the output, with the annotation columns of the output simply left blank.
For example, a list of SNPs can be annotated with positions to generate a new list of the SNPs with extra columns containing the chromosome and genomic position for each SNP (if any); we show an example in Figure 7. Any SNP with multiple known positions will be repeated, and any SNP with no known position will have blanks in the added columns.
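The difference between annotation and filtering (unmatched inputs are kept, with blank columns) can be illustrated with a small Python sketch. The function name, the dictionary layout, and the SNP identifiers are assumptions made for this example and do not reflect Biofilter's internal data structures or output format.

```python
def annotate_positions(rs_ids, known_positions):
    """Annotate each SNP with (chromosome, position).

    known_positions maps an rs ID to a list of (chrom, pos) tuples.
    A SNP with several positions is repeated once per position; a SNP
    with no known position is kept with blank annotation columns.
    """
    rows = []
    for rs_id in rs_ids:
        hits = known_positions.get(rs_id, [])
        if not hits:
            rows.append((rs_id, "", ""))
        for chrom, pos in hits:
            rows.append((rs_id, chrom, str(pos)))
    return rows


# Hypothetical toy data: rs2 has two positions, rs9 has none.
known_positions = {"rs1": [("1", 150)], "rs2": [("1", 420), ("3", 77)]}

for row in annotate_positions(["rs1", "rs2", "rs9"], known_positions):
    print("\t".join(row))
```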
Similarly, those same SNPs can be annotated with gene information; the result is similar, except that the added column contains the name of the gene containing the SNP’s position. In this case a blank value can mean two things: either the SNP does not fall within any known gene region, or the SNP has no known position with which to search for gene regions. Figure 8 shows an example of this kind of annotation, using the simulated knowledge database. For another example, a researcher could also annotate a list of gene symbols with SNPs, regions, groups, and sources, using Biofilter.
Annotations can also be generated for combinations of data types, or for data types which were not provided as input. In these cases the annotation will be for the output of a filtering analysis. For example, a researcher could provide a list of SNPs and a list of groups, and then request an annotation of genes to regions. Since no genes were provided as input, Biofilter will first identify all genes which contain at least one of the provided SNPs, that are also part of at least one of the provided groups. This filtered set of genes will then appear in the first column of the annotation output, followed by each gene’s genomic region (if any).
As an example use for region based data, such as copy-number variation data, base pair start-and-stop regions can be provided to Biofilter, and then that data can be annotated with gene information using Biofilter, based on percent of overlap or number of base-pairs overlapped.
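As a rough illustration of overlap-based mapping, the following Python sketch computes the number of overlapping base pairs between a region and a gene and applies both kinds of threshold. The function names and parameters are hypothetical, coordinates are assumed to be inclusive, and the overlap fraction is measured relative to the length of the input region; that choice of denominator is an assumption of this sketch, not something stated here.

```python
def overlap_bp(region_start, region_end, gene_start, gene_end):
    """Number of base pairs shared by two inclusive intervals (0 if disjoint)."""
    return max(0, min(region_end, gene_end) - max(region_start, gene_start) + 1)


def region_maps_to_gene(region, gene, min_bp=1, min_fraction=0.0):
    """Decide whether a (start, end) region maps to a (start, end) gene.

    The default of min_bp=1 mirrors the default described earlier, in which
    even one base pair of overlap is enough to map a CNV to a gene.
    """
    shared = overlap_bp(region[0], region[1], gene[0], gene[1])
    region_length = region[1] - region[0] + 1
    return shared >= min_bp and shared / region_length >= min_fraction


# A region that touches the gene by a single base pair.
print(region_maps_to_gene((100, 200), (200, 300)))
# The same region fails once half of it must overlap the gene.
print(region_maps_to_gene((100, 200), (200, 300), min_fraction=0.5))
```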
Filtering can also be followed by annotation. For example, Biofilter can be used to find the SNPs shared between two input lists and then map those overlapping SNPs to genes, regions, groups, and sources.
Modeling
The last of Biofilter’s primary analysis modes is a little different from filtering and annotation. In addition to simply cross-referencing any given data with the other available prior knowledge, Biofilter can also search for repeated patterns within the prior knowledge that might indicate the potential for important interactions between SNPs or genes.
Any pathway, ontological category, protein family, experimental interaction, or other grouping of genes or proteins represents a relationship between those genes or proteins. Two genes appearing together in more than one grouping are likely to have an important biological relationship, and two genes appearing in multiple groups from several independent sources are even more likely to be biologically related in some way.
Biofilter modeling is “gene-focused”, and can take any combination of input data, map that data to genes, then search LOKI for likely pairwise interaction models. Thus, a list of SNPs can be developed and gene-gene models can be requested from Biofilter; Biofilter will then only consider models in which the genes contain at least one of the specified SNPs. For another example, if SNPxSNP models are requested, Biofilter will take each baseline gene-gene model, separately map the two genes to all applicable SNPs, and then return all possible pairings between those two sets of SNPs.
The resultant models suggested by Biofilter are ranked in order of likelihood, using an “implication index.” This score is simply a combination of two tallies: the number of original data sources which contained the pair, and the number of different groups among those sources. For example, a score of “2-3” indicates that the model appears in three different groups, and those groups originated with two different sources.
As an example, suppose a researcher has provided a list of SNPs consisting of all of the SNPs on the first “chromosome” of the simulated knowledge base in Figure 3. These SNPs are found within two sources and eight pathways shown in Figure 3. The researcher would like to generate pairwise SNP-SNP models using Biofilter. Once the list of SNPs has been supplied, Biofilter will first map the input SNPs to genes. Note in Figure 3 that Gene F does not contain any SNPs, so Gene F will not be included in the resultant gene-gene models, shown in Figure 9. Next, the genes that contain SNPs from the input list will be connected pairwise. Biofilter will determine that genes A and C are found together in three groups across two sources: the light and paint sources contain groups (blue, gray, and cyan) that suggest a relationship between genes A and C, as seen in Figure 9. Thus, this relationship is summarized by the implication score “2-3,” which gives the number of sources followed by the number of groups which support this gene model. Each time the same pairwise model of genes is found in a new source, the left-hand index of the implication score for that pairwise model increases by one; each time it is found in a new group, whether from a new source or from one already counted, the right-hand index increases by one. In the last step, the gene-gene models are broken down into all pairwise combinations of SNPs across the genes within sources light and paint, as seen in Figure 9. Biofilter 2.0 will automatically generate gene models prior to generating SNP models, and there is no need to specify any of these steps separately.
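The modeling steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than Biofilter's actual algorithm: the function names are hypothetical, the source and group labels and the SNP identifiers are invented toy data (they are not the contents of Figure 3), and ranking by source count first and group count second is our assumption about how the two tallies are ordered.

```python
from collections import defaultdict
from itertools import combinations, product


def gene_models(groups, candidate_genes):
    """Score gene pairs by (number of sources, number of groups).

    groups is an iterable of (source, group_name, genes) tuples; only genes
    in candidate_genes (those containing an input SNP) are considered.
    """
    support = defaultdict(set)
    for source, group_name, genes in groups:
        members = sorted(set(genes) & candidate_genes)
        for pair in combinations(members, 2):
            support[pair].add((source, group_name))
    return {
        pair: (len({source for source, _ in hits}), len(hits))
        for pair, hits in support.items()
    }


def snp_models(gene_pair, gene_to_snps):
    """Expand one gene-gene model into all SNP-SNP pairings across the genes."""
    left, right = gene_pair
    return list(product(gene_to_snps[left], gene_to_snps[right]))


# Hypothetical toy knowledge: gene F carries no input SNPs, so it is excluded.
groups = [
    ("source1", "group1", {"A", "C"}),
    ("source1", "group2", {"A", "C", "F"}),
    ("source2", "group3", {"A", "B", "C"}),
]
gene_to_snps = {"A": ["rs1", "rs2"], "B": ["rs3"], "C": ["rs4"]}

scores = gene_models(groups, set(gene_to_snps))
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for (gene_a, gene_b), (n_sources, n_groups) in ranked:
    print(f"{gene_a}-{gene_b}\t{n_sources}-{n_groups}")
print(snp_models(("A", "C"), gene_to_snps))
```

In this toy data the A-C pair is supported by three groups from two sources, so it receives the score “2-3” and is ranked first.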
Using resultant models
A researcher can choose an implication score cutoff of choice, balancing the number of associations to perform with the implication support of models of interest. Then the researcher can use their statistical approach of choice for investigating the significance of the interaction models.
Ambiguity and Biofilter
One of the changes to Biofilter 2.0 is handling ambiguity for genes or groups. Any given gene or group might go by many different names in different contexts, and the new version of Biofilter can accommodate this ambiguity depending on researcher preference. Some names are associated with more than one gene; these names are considered ambiguous. For example, although A1B is an alias of the gene A1BG, it is also an alias of the gene SNTB1 (syntrophin, beta 1). Therefore, if A1B appears in an input gene list file, Biofilter will not inherently recognize which gene was intended for inclusion (A1BG or SNTB1).
It is important to note that SNP annotations to genes will not change from source to source: a SNP identifier will either map to genes (depending on the gene boundaries set by the user) or it will not. The user is provided with feedback indicating which of the SNPs input to Biofilter were not mapped to genes.
When an ambiguous gene or group identifier appears in an input file, Biofilter has two options: include all genes or groups with which the identifier is associated, or none of them. A warning is displayed in either case, and options are also available to generate a detailed report of the ambiguous identifiers. Thus, for the A1B example, the researcher can decide whether to map A1B to both A1BG and SNTB1 and keep both genes in further analyses, or to drop both from further analyses, through the option ALLOW_AMBIGUOUS_GENES. Ambiguous group names are only important if the user wishes to provide an input list of groups in order to limit their analysis. If the user provides an ambiguous group name, Biofilter’s behavior is similar to the case of ambiguous gene names: a warning will be displayed, and Biofilter will either include all groups which match the name or none of them, according to the option ALLOW_AMBIGUOUS_GROUPS.
For the gene identifier data within the prior knowledge sources of LOKI, however, the situation can become more complicated because many sources provide more than one identifier for each member of a group. For example, in a KEGG pathway definition, each gene that makes up the pathway is specified both by its Entrez Gene ID number and by its symbolic abbreviation. If either of the pair of identifiers is connected to more than one gene, or the two identifiers are connected to different genes, then it is impossible for LOKI to know with certainty which gene is supposed to be part of the group.
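The all-or-none treatment of ambiguous input names can be illustrated with the A1B example. In the Python sketch below, the function `resolve_gene_names` and the alias table are hypothetical; only the option name ALLOW_AMBIGUOUS_GENES and the two genes sharing the A1B alias come from the description above, and the `allow_ambiguous` argument merely stands in for that option.

```python
def resolve_gene_names(names, alias_to_genes, allow_ambiguous=False):
    """Map input names to genes, reporting any ambiguous names.

    A name that matches more than one gene is always reported; its genes
    are all kept when allow_ambiguous is True and all dropped otherwise.
    """
    resolved = set()
    ambiguous = []
    for name in names:
        matches = alias_to_genes.get(name, set())
        if len(matches) > 1:
            ambiguous.append(name)
            if not allow_ambiguous:
                continue
        resolved |= matches
    return resolved, ambiguous


# Minimal alias table covering only the example discussed in the text.
alias_to_genes = {
    "A1B": {"A1BG", "SNTB1"},
    "A1BG": {"A1BG"},
    "SNTB1": {"SNTB1"},
}

print(resolve_gene_names(["A1B"], alias_to_genes, allow_ambiguous=False))
print(resolve_gene_names(["A1B"], alias_to_genes, allow_ambiguous=True))
```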
Rather than attempting to compromise on a “one size fits all” approach to this ambiguity, Biofilter supports several different options for interpreting ambiguity. Each of these interpretations comes with a slightly different trade-off between false-positives and false-negatives, and the number of resultant models. The ambiguity interpretation most appropriate to the task can be selected by the researcher at run-time, as Biofilter’s results can change depending on the choice for handling ambiguity.
The most conservative approach is to simply disregard any data which is ambiguous. This ensures that Biofilter will not report any false-positive annotations or models, but true annotations may be missing from the output as a result. This “strict” interpretation is the only one that was supported in earlier versions of Biofilter, and it remains the default mode in Biofilter 2.0.
At the opposite extreme, when there is any doubt about which gene belongs in a group, Biofilter can proceed as if every candidate is a member of the group. This “permissive” approach ensures that no true annotation will be missing from the output, but it will also cause false annotations to be reported.
Between these two extremes, Biofilter also supports two different heuristic strategies for reducing ambiguity. These strategies essentially make an educated guess about what the original data source intended by the set of identifiers it provided. The first heuristic is called “implication” and it rates the likelihood of each potential gene being the intended one by counting the number of identifiers which implicate that gene. The second heuristic, called “quality,” is similar, but considers the number of genes that each identifier refers to as a measure of that identifier’s quality; a high-quality identifier (which refers to only one or two genes) is then given more weight than a low-quality identifier (which refers to many genes).
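One way to make the two heuristics concrete is sketched below in Python. The exact scoring used by Biofilter is not specified here, so this is an assumption-laden illustration: the implication score is taken to be a plain count of identifiers pointing at each gene, and the quality score weights each identifier by one over the number of genes it refers to (that particular weighting is our assumption). All function names, identifiers, and gene labels are hypothetical.

```python
from fractions import Fraction


def implication_scores(identifier_to_genes):
    """Count, for each gene, how many identifiers implicate it."""
    scores = {}
    for genes in identifier_to_genes.values():
        for gene in genes:
            scores[gene] = scores.get(gene, 0) + 1
    return scores


def quality_scores(identifier_to_genes):
    """Like implication, but an identifier naming n genes contributes 1/n
    to each of them (assumed weighting; exact fractions avoid rounding)."""
    scores = {}
    for genes in identifier_to_genes.values():
        for gene in genes:
            scores[gene] = scores.get(gene, 0) + Fraction(1, len(genes))
    return scores


def winners(scores):
    """Return the gene or genes with the highest score."""
    best = max(scores.values(), default=None)
    return {gene for gene, score in scores.items() if score == best}


# A typical case: one specific identifier and one ambiguous identifier.
typical = {"id_1": {"G1"}, "id_2": {"G1", "G2"}}
# A contrived case, built only to show that the heuristics can disagree.
contrived = {"id_a": {"G1"}, "id_b": {"G2", "G3", "G4"}, "id_c": {"G2", "G3", "G4"}}

for case in (typical, contrived):
    print(winners(implication_scores(case)), winners(quality_scores(case)))
```

In the typical case both heuristics select G1. The contrived case is constructed so that two low-quality identifiers outvote one high-quality identifier under implication but not under quality.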
In practice, these two heuristic strategies will often produce the same results; in fact, when using real data from our real prior knowledge sources, we have yet to find a case where they do not reach the same conclusion. It is possible that such a case will arise in the future, however, so we have incorporated these two heuristics into Biofilter 2.0.
The researcher can indicate which heuristics, if any, should be employed to mitigate ambiguity in the prior knowledge database. The permissible values for this option are “implication” or “quality” to employ a specific heuristic strategy, or “no” or “any”. When set to “no”, no attempt is made to reduce ambiguity and all genes implicated by any of the provided identifiers are considered equally likely interpretations. When set to “any” then all heuristics are attempted simultaneously and the winner(s) from each one are added to the group; if both heuristics chose only a single winner but they disagree with each other, then both would be added, although this has never been observed in practice.
In addition to choosing which heuristics to use (if any), a researcher can choose either a “strict” or “permissive” option. When using the strict option, none of the possible genes will be considered a member of the group if there are multiple possibilities. When using the permissive option, the most-likely possibilities will all be included.
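The interaction between the heuristic setting and the strict or permissive option can be summarized in a short Python sketch. This reflects our reading of the behavior described above and is not Biofilter's code: the function `resolve_member`, its parameters, and the gene labels are hypothetical, and the winners of each heuristic are assumed to have been computed beforehand.

```python
def resolve_member(candidates, winners_by_heuristic, heuristic="no", permissive=False):
    """Decide which candidate genes become members of a group.

    heuristic is "no", "implication", "quality", or "any";
    winners_by_heuristic maps each heuristic name to its winning genes.
    """
    if heuristic == "no":
        # No reduction: every implicated gene is equally likely.
        likely = set(candidates)
    elif heuristic == "any":
        # Winners from every heuristic are pooled.
        likely = set().union(*winners_by_heuristic.values())
    else:
        likely = set(winners_by_heuristic[heuristic])
    # Strict mode rejects the member if more than one possibility remains.
    if len(likely) <= 1 or permissive:
        return likely
    return set()


candidates = {"G1", "G2"}
agree = {"implication": {"G1"}, "quality": {"G1"}}
disagree = {"implication": {"G2"}, "quality": {"G1"}}

print(resolve_member(candidates, agree, "no"))
print(resolve_member(candidates, agree, "no", permissive=True))
print(resolve_member(candidates, agree, "implication"))
print(resolve_member(candidates, disagree, "any", permissive=True))
```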
It should also be noted that if the user chooses a heuristic or permissive approach instead of the “strict” default, then some extra (possibly incorrect) annotations or models may be reported as a result of ambiguity, and these will not be differentiated from the other results. If there is any question about the consequences of using ambiguous data, the results can always be compared to the same analysis run in “strict” mode.
We show in Figure 10 an example of ambiguity that is incorporated into the simulated knowledge database, allowing researchers to explore the way output changes when using different ambiguity settings. The simulated knowledge database included with Biofilter contains several examples of potential ambiguity situations, depicted in detail in Additional file 1.
Protein identifiers and ambiguity
So far, our depiction of ambiguity in the knowledge database has implied that groups always contain genes. This allows for the convenient assumption that when we are given more than one identifier for something in a group, we are expecting all of those identifiers to refer to one (and only one) gene.
The reality is, of course, a little more complicated: some sources provide groups that actually contain proteins. In order to make this knowledge compatible with the rest of the prior knowledge, LOKI must translate these protein references into genes, but this breaks that convenient assumption. If a group contains genes then we can reasonably expect each member of the group to be a single gene, but if the group contains proteins, then we must be prepared for a single protein-member to correspond to many genes.
To account for this, LOKI differentiates between identifiers which refer directly to genes (such as symbolic abbreviations or Entrez Gene ID numbers) and identifiers which refer to proteins (such as UniProt ID numbers) that may in turn correspond to many genes.
If any of the identifiers provided for one member of a group is a protein identifier, LOKI disregards any non-protein identifiers. If there is only one protein identifier, then LOKI considers all genes which correspond to that protein to be members of the group, with no ambiguity. If there are multiple protein identifiers then there may be ambiguity if they do not correspond to the same set of genes.
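These rules for a single group member can be written out as a small Python sketch. It is an illustration only: the function `candidate_genes`, the (kind, value) encoding of identifiers, and the lookup tables are assumptions made for this example and do not describe LOKI's schema.

```python
def candidate_genes(identifiers, gene_lookup, protein_lookup):
    """Return (candidate genes, is_ambiguous) for one member of a group.

    identifiers is a list of (kind, value) pairs with kind "gene" or
    "protein". If any protein identifier is present, gene identifiers
    are disregarded.
    """
    protein_ids = [value for kind, value in identifiers if kind == "protein"]
    if protein_ids:
        gene_sets = [protein_lookup.get(pid, set()) for pid in protein_ids]
        # Several protein identifiers are ambiguous only if they disagree.
        ambiguous = any(genes != gene_sets[0] for genes in gene_sets)
    else:
        gene_sets = [gene_lookup.get(value, set()) for _, value in identifiers]
        # Gene identifiers are expected to name exactly one gene in total.
        ambiguous = len(set().union(*gene_sets)) > 1
    return set().union(*gene_sets), ambiguous


# Hypothetical lookup tables.
gene_lookup = {"symbol_x": {"G9"}}
protein_lookup = {"P1": {"G1", "G2"}, "P2": {"G2", "G3"}}

print(candidate_genes([("protein", "P1"), ("gene", "symbol_x")], gene_lookup, protein_lookup))
print(candidate_genes([("protein", "P1"), ("protein", "P2")], gene_lookup, protein_lookup))
```

A single protein identifier yields all of its genes without ambiguity, even though more than one gene is returned; two protein identifiers that map to different gene sets are flagged as ambiguous.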
Since protein identifiers are expected to correspond to multiple genes, the concept of an identifier’s “quality” no longer has meaning; consequently, whenever protein identifiers are involved, the implication and quality heuristic strategies become functionally equivalent. In both cases, a gene’s likelihood of being associated with a group is proportional to the number of protein identifiers which implicated it. When no heuristics are used, then all genes which are implicated by any of the protein identifiers are considered equally likely to belong in the group. The simulated knowledge database included with Biofilter also contains several examples of groups with protein identifiers.