- Software article
Distributed retrieval engine for the development of cloud-deployed biological databases
© The Author(s). 2018
- Received: 10 August 2018
- Accepted: 12 October 2018
- Published: 12 November 2018
The integration of cloud resources with federated data retrieval has the potential to improve the maintenance, accessibility and performance of specialized databases in the biomedical field. However, such an integrative approach requires technical expertise in cloud computing, the use of a data retrieval engine and the development of a unified data model that can encapsulate the heterogeneity of biological data. Here, a framework for the development of cloud-based specialized biological databases is proposed. It is powered by a distributed biodata retrieval system that can interface with different data formats and provides an integrated way to explore data. The proposed framework was implemented in Java, with MongoDB as the database manager. Syntactic analysis was based on the open BSON, jsoup, Apache Commons and w3c.dom libraries. The framework is available at http://nbel-lab.com and is distributed under a Creative Commons license.
- Specialized databases
- Federated databases
- Cloud-based databases
The growing rate of biological data generation has produced unprecedented data streams, which regularly reshape our understanding of systems biology and alter our practice in healthcare. As biomedical research has become transdisciplinary, data across numerous levels of granularity and perspectives has to be acquired and integrated. Moreover, for meta-analysis, data is often gathered from multiple archival databases. To cope with the growing rate of biodata generation, new frameworks for data acquisition, classification, storage, retrieval and analysis are continuously being developed.
Among the challenges underlying many such frameworks are the heterogeneity of biological data types and the emergence of new relations between data entities. As a result, on top of the traditional primary and secondary databases, specialized databases were developed. Specialized databases include organism-centered datasets [4, 5], biological pathways and diseases, each with its own data specifications, often curated to serve consortiums or single laboratories. Specialized databases often integrate data from numerous primary, secondary and other specialized databases.
The need to find a piece of data in a vast array of databases led to the development of meta search engines, which are traditionally based on a distributed approach. A distributed search engine is a decentralized service that allocates mining and query generation among numerous edge nodes and integrates the retrieved results in a unified framework, thereby constituting a federated database. For example, the Neuroscience Information Framework (NIF), one of the most important database federations for the neuroscience community, has been cataloging and surveying the neuroscience resource landscape since 2006. NIF currently gives access to over 250 data sources, categorized into subjects ranging from software tools to funding resources. NIF provides a distributed query engine over specialized databases, which are independently created and curated. This type of distributed search among independent databases is enabled through NIF’s DISCO registry tool, with which a Web resource can send either automatic or manual data updates to the NIF system.
One of the most important stepping stones in modern biodata mining is cloud computing, which provides scalable virtualized resources and distributed computing, and enables optimization of cost and computing efficiency. Thus, cloud platforms such as IBM Cloud, Microsoft Azure, Amazon AWS and Google Cloud are routinely adopted by research groups and organizations. Cloud-based federated databases can provide a powerful framework for integrated data-centered research. For example, Todor and colleagues developed ChemCloud, a semantic-web based framework, which integrates specialized local databases with online datasets in the fields of chemistry and pharmacy, aiming at semantic search, semantic data enrichment, ontology-enhanced navigation, machine-generated eLearning trajectories and semantic knowledge discovery over multiple databases. O’Connor and colleagues developed SeqWare, a database framework aimed at handling and querying a wide range of genomic-related data types, utilizing the Hadoop MapReduce environment and the Hadoop HDFS distributed filesystem.
Integrating cloud resources and a federated data retrieval engine in the context of the development of specialized databases holds great promise; however, it is not a trivial task. It requires technical expertise in cloud computing, as well as the development of a unified data model to which different models can be translated. For example, Pareja-Tobes and colleagues developed the Bio4j framework, in which heterogeneous proteomic data is modelled with graphs, stored in a cloud and retrieved using a domain-specific language implemented in Java. An interesting contribution to cloud-based, federated database development is BioCloud Search EnGene (BSE), developed by Dessì and colleagues. BSE is a gene-centric distributed search engine, built upon Google App Engine (GAE). GAE provides a distributed data storage service, which performs distribution, replication and load balancing automatically and supports operations to access objects (i.e. create, read, update, delete) by means of an SQL-like language called GQL.
Here we propose a cloud-based framework for a distributed search engine for biological data. Our framework distributes a query (written in a “Google-like” fashion) among several strategic web-based biological databases, such as NCBI’s datasets and MalaCards, stores the retrieved results in a MongoDB cloud service, and annotates them with the query keywords for future retrieval. Our framework provides a graphical user interface (GUI), with which the user can explore the retrieved data.
The system comprises eight main packages. The API packages include the admin and guest packages, each of which consists of the appropriate setting parameters, a GUI implementation and a communication module. The admin and guest packages make use of specific capabilities, implemented within the ‘hidden’ Persistence, Database and Model packages. The server package also includes a GUI and communication capabilities and makes use of all ‘hidden’ packages, including the Parsers and URL packages. A UML schematic is shown in Additional file 1: Figure S1.
The admin can initiate a query, which is defined by the type of data to be retrieved and comprises an instance of the Database class and a list of Field instances. The Database class includes a reference to a specific type of database (one of the supported databases), and the Field class includes a reference to search fields (e.g. journal or publication date when the database is PubMed). A UML schematic is shown in Additional file 1: Figure S2. Once a query is defined, it is sent to a querybuilder function, where it is translated to the appropriate database-specific structured URL. Following query execution, the retrieved data is analyzed by the appropriate parsers. A UML schematic is shown in Additional file 1: Figure S3. Each parser creates an object for each instance of the results. For example, the PubMed parser creates a list of Article objects, the MalaCards parser creates a list of disease objects, etc. All created objects implement both the Persistable and Serializable interfaces. The Persistable interface provides an encapsulation layer, which unifies all objects that are to be persisted in a cloud. The Serializable interface allows easy streaming of objects over the communication ports. A UML schematic is shown in Additional file 1: Figure S4. Notably, we built a data exploration engine that allows the user to interact with the different visualization tools available in our framework. For example, Structure-derived instances can be visually explored via a direct web-based interface to NCBI’s structure viewer, and Gene-derived instances can be similarly explored via NCBI’s sequence explorer.
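The translation of a query into a database-specific URL can be illustrated with a minimal sketch. It is not the framework's actual code: the class shapes (`Database`, `Field`, `Query`, `QueryBuilder`) and the method `toUrl` are assumptions made for illustration, and only the NCBI E-utilities `esearch` endpoint is covered. The Entrez field tags (`journal`, `pdat`) are standard Entrez search tags.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Subset of the supported databases, with their Entrez database names (illustrative). */
enum Database {
    PUBMED("pubmed"), GENE("gene"), PROTEIN("protein"), STRUCTURE("structure");

    final String entrezName;

    Database(String entrezName) {
        this.entrezName = entrezName;
    }
}

/** A search-field restriction, e.g. journal = "Nature" (hypothetical class shape). */
final class Field {
    final String tag;   // Entrez field tag such as "journal" or "pdat"
    final String value;

    Field(String tag, String value) {
        this.tag = tag;
        this.value = value;
    }
}

/** A query: free-text keywords, a target database and optional field restrictions. */
final class Query {
    final Database database;
    final String keywords;
    final List<Field> fields;

    Query(Database database, String keywords, List<Field> fields) {
        this.database = database;
        this.keywords = keywords;
        this.fields = new ArrayList<>(fields);
    }
}

/** Translates a Query into an E-utilities esearch URL (assumed design, not the original). */
final class QueryBuilder {
    private static final String ESEARCH =
            "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi";

    static String toUrl(Query q) {
        // Combine the keywords with each field restriction: keywords AND value[tag]
        StringBuilder term = new StringBuilder(q.keywords);
        for (Field f : q.fields) {
            term.append(" AND ").append(f.value).append('[').append(f.tag).append(']');
        }
        // URL-encode the search term so that spaces and brackets are transmitted safely
        return ESEARCH + "?db=" + q.database.entrezName
                + "&term=" + URLEncoder.encode(term.toString(), StandardCharsets.UTF_8);
    }
}
```

Under these assumptions, a PubMed query for `cancer` restricted to the journal `Nature` would be translated to `https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=cancer+AND+Nature%5Bjournal%5D`. Databases outside NCBI, such as MalaCards, would need their own URL scheme.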
Persistable objects can be saved to a MongoDB database following tagging and sync, using our MongoDB persistency object. To save data to MongoDB, the user has to create a MongoDB account, establishing a link to a specialized database. Account settings can be configured in our framework using a GUI. This account is used both to save data to the database and to retrieve data from it. Communication between our framework and the cloud-based database is managed by the persistence package. This package consists of a PersistencySetting class, which manages the cloud configuration settings, and a PersistencyAgent. The PersistencyAgent provides a full API to MongoDB, allowing for object storage and retrieval. A UML schematic is shown in Additional file 1: Figure S5. Data is stored as JSON documents. Once data is saved in the cloud, it is available for ‘local’ search by guest users. The MongoDB cloud allows for non-relational data storage, which is more appropriate for heterogeneous biological data. Inside MongoDB, data is organized in collections. Here we initialized five collections, one for each type of retrieved data.
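The idea of tagged, persistable objects organized in per-type collections can be sketched as follows. This is only an in-memory stand-in, assuming a hypothetical shape for the `Persistable` interface (`collectionName`, `toDocument`); the class `InMemoryPersistencyAgent` is invented for illustration, and a real agent would delegate to the MongoDB Java driver rather than to a map.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical Persistable contract: a target collection and a document view of the object. */
interface Persistable {
    String collectionName();

    Map<String, Object> toDocument();
}

/** A retrieved article, tagged with the keywords that were used to find it. */
final class Article implements Persistable, Serializable {
    private static final long serialVersionUID = 1L;

    final String pmid;
    final String title;
    private final Set<String> tags = new LinkedHashSet<>();

    Article(String pmid, String title) {
        this.pmid = pmid;
        this.title = title;
    }

    void addTag(String keyword) {
        tags.add(keyword);
    }

    @Override
    public String collectionName() {
        return "articles";
    }

    @Override
    public Map<String, Object> toDocument() {
        // A JSON-like document: field names mapped to simple values and lists
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("_id", pmid);
        doc.put("title", title);
        doc.put("tags", new ArrayList<>(tags));
        return doc;
    }
}

/** In-memory stand-in for a persistence agent (assumption: not the MongoDB-backed original). */
final class InMemoryPersistencyAgent {
    private final Map<String, List<Map<String, Object>>> collections = new HashMap<>();

    /** Stores the object's document in the collection that matches its type. */
    void save(Persistable p) {
        collections.computeIfAbsent(p.collectionName(), k -> new ArrayList<>())
                .add(p.toDocument());
    }

    /** Returns all documents in a collection that carry the given tag. */
    List<Map<String, Object>> findByTag(String collection, String tag) {
        List<Map<String, Object>> hits = new ArrayList<>();
        List<Map<String, Object>> docs =
                collections.getOrDefault(collection, Collections.emptyList());
        for (Map<String, Object> doc : docs) {
            Object tags = doc.get("tags");
            if (tags instanceof List && ((List<?>) tags).contains(tag)) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

Keeping the document conversion inside each data class means the agent does not need to know the concrete type it stores, which is one plausible reading of the encapsulation layer described above.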
For efficient data retrieval from the cloud, data is retrieved in two phases. First, only partial information is displayed (e.g. name and keywords). Then, upon user request, the entire instance is retrieved for exploration. This two-phase scheme reduces the server workload, since full instances are transferred only when requested. Moreover, to make searching more informative, a tagging mechanism was defined, with which instances of data are tagged with the keywords that were used to retrieve them. When data is displayed for user investigation, it is displayed with these keywords.
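The two-phase retrieval described above can be sketched in a few lines. The class names (`Summary`, `FullRecord`, `TwoPhaseStore`) and the in-memory map are assumptions made for illustration; in a MongoDB-backed implementation, the first phase would plausibly correspond to a query that projects only the name and keyword fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Phase-one view: only the lightweight fields shown in the result list. */
final class Summary {
    final String id;
    final String name;
    final List<String> keywords;

    Summary(String id, String name, List<String> keywords) {
        this.id = id;
        this.name = name;
        this.keywords = new ArrayList<>(keywords);
    }
}

/** Phase-two view: the summary plus the (potentially large) full content. */
final class FullRecord {
    final Summary summary;
    final String body;

    FullRecord(Summary summary, String body) {
        this.summary = summary;
        this.body = body;
    }
}

/** Hypothetical store that separates listing summaries from fetching full records. */
final class TwoPhaseStore {
    private final Map<String, FullRecord> records = new LinkedHashMap<>();

    void put(FullRecord record) {
        records.put(record.summary.id, record);
    }

    /** Phase one: return only the summaries of records tagged with the keyword. */
    List<Summary> search(String keyword) {
        List<Summary> hits = new ArrayList<>();
        for (FullRecord r : records.values()) {
            if (r.summary.keywords.contains(keyword)) {
                hits.add(r.summary);
            }
        }
        return hits;
    }

    /** Phase two: fetch the entire record only when the user asks for it. */
    Optional<FullRecord> fetch(String id) {
        return Optional.ofNullable(records.get(id));
    }
}
```

A client would call `search` to populate the result list and call `fetch` only for the entries the user opens, so the large `body` field is never transferred for results that are not inspected.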
Once the user chooses the ‘save to local’ option, all data is persisted to the MongoDB cloud account and can be monitored via the MongoDB web interface. For example, the user can monitor the average rate of commands, queries, updates, deletes and inserts. A monitoring example is given in Additional file 1: Figure S6. Moreover, the logical size of the database can be traced in real time (Additional file 1: Figure S7), as can the number of established connections (Additional file 1: Figure S8) and data streams (Additional file 1: Figure S9).
The stored data can also be explored through MongoDB’s own tools. In the free-tier subscription, however, data exploration is possible only via the Mongo shell (see the installation manual for installation and connection details). For example, after connecting to your account, you can list the databases available to you with the ‘show dbs’ command.
A detailed description of the MongoDB shell commands is given at https://docs.mongodb.com/v3.4/reference/mongo-shell/. To conclude, we have an integrated environment in which data can be retrieved from multiple databases using our distributed search engine and persisted on a cloud for future exploration and analysis.
The above illustrates the architecture used to derive data from online databases and store the retrieved data in a cloud-based specialized dataset, which makes the framework a powerful, easy-to-use, federated data retrieval engine. However, it is tailored to a specific set of datasets (MalaCards, Gene, Protein, Structure and PubMed), which together hold a tremendous amount of data; this set can be extended as needed. More importantly, most specialized databases, being curated by consortiums and labs, must incorporate data that originated from their own work, and not only from web-based resources. This framework can therefore be extended in two dimensions: (I) supporting a larger set of databases, and (II) incorporating data from lab work into the larger database. Support for an additional database can be implemented in our proposed framework by creating a parser, a query builder and a data object for it. Our design provides general support for databases and query building, encapsulated in the DataBase and Query classes, which are easily extendable (Additional file 1: Figure S2). As our framework includes various parsing libraries for XML, JSON and HTML documents, these can be readily used to support a specific database, as demonstrated for all the databases listed above. Furthermore, as long as the data object implements the Persistable and Serializable interfaces, it can be persisted to the cloud-based database, as illustrated before. Incorporating lab-originated data instances in the database is as easy as encapsulating them in an appropriate data class, as mentioned above, and persisting them. Since the proposed framework is freely distributed to the community via GitHub, we anticipate that the list of supported databases will grow.
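As a rough illustration of the extension path described above, the sketch below adds a parser and a data class for a lab-originated XML export. Everything here is hypothetical: the `ResultParser` interface, the `LabSample` class and the `<samples>/<sample>/<name>` XML layout are assumptions, not part of the published framework. Only the JDK's built-in `org.w3c.dom` parser is used.

```java
import java.io.IOException;
import java.io.Serializable;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical parser contract: raw response text in, typed data objects out. */
interface ResultParser<T> {
    List<T> parse(String raw);
}

/** Hypothetical data class wrapping one lab-originated record. */
final class LabSample implements Serializable {
    private static final long serialVersionUID = 1L;

    final String id;
    final String name;

    LabSample(String id, String name) {
        this.id = id;
        this.name = name;
    }
}

/** Parses an assumed XML layout: <samples><sample id="..."><name>...</name></sample></samples>. */
final class LabSampleXmlParser implements ResultParser<LabSample> {
    @Override
    public List<LabSample> parse(String raw) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(raw)));
            NodeList nodes = doc.getElementsByTagName("sample");
            List<LabSample> samples = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element sample = (Element) nodes.item(i);
                // A sample without a <name> child is kept with an empty name
                Node nameNode = sample.getElementsByTagName("name").item(0);
                String name = nameNode == null ? "" : nameNode.getTextContent().trim();
                samples.add(new LabSample(sample.getAttribute("id"), name));
            }
            return samples;
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalArgumentException("Could not parse lab sample XML", ex);
        }
    }
}
```

In the framework itself, such a data class would additionally implement the Persistable interface so that parsed instances can be stored in the cloud-based database alongside the web-retrieved ones.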
The ever-growing production and coverage of biomedical data introduces challenges in three dimensions: volume (amount of data), variety (types of data) and velocity (speed required for data processing). These three dimensions were popularized as the 3Vs of big data. Accordingly, integrating cloud resources and a federated data retrieval engine in the context of the development of specialized databases has the potential to support the constant development of databases in the biomedical field. Here we propose an extendable, freely distributed framework, allowing for a distributed search among several strategic web-based biological databases, as well as lab-originated data instances, over a cloud-based data center. This framework also provides a graphical user interface, with which the user can explore the retrieved data from both online and cloud-based repositories.
While similar frameworks integrate some aspects of cloud resources with distributed search (such as BioCloud Search EnGene), they primarily focus on one specific arena (EnGene, for example, focuses on genomic data). Our framework (1) is open source, so it can be easily extended to support different niches as well as provide a general framework for biodata retrieval and storage, and (2) provides a bridge to a cloud database provider. It is unique in the sense that it is based on free, community-supported tools and can be extended further if required.
Our framework is distributed under a Creative Commons license. To ensure public access to the framework, the source code was uploaded to GitHub at https://github.com/NBEL-lab/DistCloudBiodata, and it is also accessible via NBEL-lab.com (under ‘software’). As described above, the framework relies on a series of modules that are freely accessible.
The authors wish to thank Tamara Pearlman Tsur for her insightful comments.
This work was supported by a JCT research grant.
Availability of data and materials
At this stage, our framework supports only the Windows operating system. The framework, with additional code examples, is provided on the NBEL-lab.com website (under ‘software’). To ensure public access to the files, the source code was also uploaded to GitHub at https://github.com/NBEL-lab/DistCloudBiodata. As described above, the framework relies on a series of modules that are freely accessible. A tutorial and installation instructions are located in the framework’s directory.
E.E.T. designed the framework and wrote the manuscript; D.B. and I.C. designed the framework and wrote the code. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Ma’ayan A, Rouillard AD, Clark NR, Wang Z, Duan Q, Kou Y. Lean big data integration in systems biology and systems pharmacology. Trends Pharmacol Sci. 2014;35(9):450–60.
- Raghupathi W, Raghupathi V. Big data analytics in healthcare: promise and potential. Health Inf Sci Syst. 2014;2(1):1–10.
- Tsur EE. Rapid development of entity-based data models for bioinformatics with persistence object-oriented design and structured interfaces. BioData Min. 2017;10(1). https://doi.org/10.1186/s13040-017-0130-z.
- Consortium F. The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res. 2003;31(1):172–5.
- Stein L, Sternberg P, Durbin R, Thierry-Mieg J, Spieth J. WormBase: network access to the genome and biology of Caenorhabditis elegans. Nucleic Acids Res. 2001;29(1):82–6.
- Caspi R, Foerster H, Fulcher CA, Kaipa P, Krummenacker M, Latendresse M, Paley S, et al. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res. 2007;36:D623–31.
- Rappaport N, Twik M, Nativ N, Stelzer G, Bahir I, Stein TI, Safran M, Lancet D. MalaCards: a comprehensive automatically-mined database of human diseases. Curr Protoc Bioinformatics. 2014;47(1):1–24.
- Gardner D, Akil H, Ascoli GA, Bowden DM, Bug W, Donohue DE, Goldberg DH, et al. The neuroscience information framework: a data and knowledge environment for neuroscience. Neuroinformatics. 2008;6(3):149–60.
- Marenco LN, Wang R, Bandrowski AE, Grethe JS, Shepherd GM, Miller PL. Extending the NIF DISCO framework to automate complex workflow: coordinating the harvest and integration of data from diverse neuroscience information resources. Front Neuroinform. 2014;8(58).
- Hashem IAT, Yaqoob I, Anuar NB, Mokhtar S, Gani A, Khan SU. The rise of “big data” on cloud computing: review and open research issues. Inf Syst. 2015;47:98–115.
- Zhu J, Fang X, Guo Z, Niu MH, Cao F, Yue S, Liu QY. IBM cloud computing powering a smarter planet. In: IEEE international conference on cloud computing. Berlin: Springer; 2009.
- Armbrust M, Fox A, Griffith R, Joseph AD, Katz R, Konwinski A, Lee G, et al. A view of cloud computing. Commun ACM. 2010;53(4):50–8.
- Bermudez I, Traverso S, Mellia M, Munafo M. Exploring the cloud from passive measurements: the Amazon AWS case. INFOCOM Proc. 2013;230–4.
- Jia X. Google cloud computing platform technology architecture and the impact of its cost. In: Software Engineering (WCSE), Second World Congress on; 2010.
- Todor A, Paschke A, Heineke S. ChemCloud: chemical e-Science information cloud. arXiv preprint arXiv:1012.1645. 2010.
- O’Connor BD, Merriman B, Nelson SF. SeqWare query engine: storing and searching sequence data in the cloud. BMC Bioinformatics. 2010;11(12):S2.
- Pareja-Tobes P, Tobes R, Manrique M, Pareja E, Pareja-Tobes E. Bio4j: a high-performance cloud-enabled graph-based data platform. bioRxiv 016758. 2015.
- Dessì N, Pascariello E, Milia G, Pes B. BioCloud search EnGene: surfing biological data on the cloud. In: International meeting on computational intelligence methods for bioinformatics and biostatistics; 2013.
- NCBI. Entrez programming utilities help. 2009. [Online]. Available: https://www.ncbi.nlm.nih.gov/books/NBK25501/.