Figure 1 | BioData Mining

Figure 1

From: Caipirini: using gene sets to rank literature

Outline of the implemented document classification (Part I). The implemented methodology is illustrated with an imaginary example: the values presented are not real and serve only to indicate how the process works. The workflow is presented in steps. In this scenario, (0. Input) the user entered mixed identifiers from Ensembl (ENSG), Entrez (ENTRZ), and PubMed (PMID) as input; Sets A and B constitute the training-set abstracts, whereas Set C is the set of abstracts to be classified; in addition to the standard background, only terms of type 'Gene/Protein' are selected. Then, (1. Data Collection) the entered data are retrieved and cleaned up. Note that, in general, some Entrez identifiers may falsely pass as PubMed identifiers (e.g., 90990), and that when there are multiple Ensembl-to-Entrez mappings for the same identifier, all of them are used; neither case is demonstrated in the example. Finally, for the current version, users have suggested that multiple occurrences of entries and overlaps among sets should not be removed, e.g., to cope with imbalanced datasets. After the user has defined the input (Step 0) and the respective abstracts have been collected (Step 1), the extracted data are forwarded for further processing (Step 2, in Figure 2), for SVM training (Step 3, in Figure 3), and for classification (Step 4, in Figure 3); finally, the results are reported to the user (Step 5, in Figure 3).
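
The identifier handling described in Step 1 can be sketched roughly as follows. This is only an illustration under stated assumptions, not Caipirini's actual code: the function names (`classify_identifier`, `collect`), the `ENSG` pattern, the toy mapping table, and the choice to label purely numeric tokens as ambiguous are all hypothetical. The sketch shows the two points the caption raises: every Ensembl-to-Entrez mapping for an identifier is used, and repeated entries are kept rather than deduplicated.

```python
import re

# Assumed pattern for Ensembl gene identifiers (hypothetical simplification).
ENSEMBL_RE = re.compile(r"^ENSG\d+$")


def classify_identifier(token):
    """Return a coarse type for one raw input token."""
    token = token.strip()
    if ENSEMBL_RE.match(token):
        return "ensembl"
    if token.isdigit():
        # A purely numeric token is ambiguous: it could be an Entrez
        # identifier or a PubMed identifier (e.g., 90990).
        return "numeric"
    return "unknown"


def collect(tokens, ensembl_to_entrez):
    """Resolve tokens, using all Ensembl-to-Entrez mappings and keeping duplicates."""
    resolved = []
    for token in tokens:
        token = token.strip()
        kind = classify_identifier(token)
        if kind == "ensembl":
            # One Ensembl identifier may map to several Entrez identifiers;
            # all of them are kept.
            for entrez_id in ensembl_to_entrez.get(token, []):
                resolved.append(("entrez", entrez_id))
        elif kind == "numeric":
            # Left unresolved here; how the real system disambiguates
            # Entrez from PubMed identifiers is not stated in the caption.
            resolved.append(("numeric", token))
        # Unknown tokens are dropped as part of clean-up (an assumption).
    return resolved


# Toy mapping table (invented values, for illustration only).
toy_mapping = {"ENSG00000000001": ["111", "222"]}
print(collect(["ENSG00000000001", "90990", "90990", "foo"], toy_mapping))
```

Repeated entries are deliberately not removed in this sketch, which mirrors the user suggestion mentioned above for coping with imbalanced datasets.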
