Metagenomic surveys of human microbiota are becoming increasingly widespread in academic research as well as in food and pharmaceutical industries and clinical context. Intuitive tools for investigating experimental data are of high interest to researchers.
Knomics-Biota is a web-based resource for exploratory analysis of human gut metagenomes. Users can generate and share analytical reports corresponding to common experimental schemes (like case-control study or paired comparison). Interactive visualizations and statistical analysis are provided in association with the external factors and in the context of thousands of publicly available datasets arranged into thematic collections. The web-service is available at https://biota.knomics.ru.
Knomics-Biota web service is a comprehensive tool for interactive metagenomic data analysis.
The last decade was marked by an explosive growth of experimental data characterizing human-associated microbial communities using metagenomic approach. Previously utilized mainly by the academic community, now metagenomics are used in the industry to assess structure, functions and dynamics of microbiota composition - particularly, to identify the impact of change in diet and medications intake on human microbiota and health. Visual and statistical exploration of important functions of microbiota (like antibiotic resistance  and dietary fiber catabolism [2, 3]) is of particular importance in the global context of publicly collected data. Lower costs and increasing popularity make metagenomics further available to smaller companies and research facilities that often lack dedicated staff bioinformaticians that can perform manual statistical analysis and insight-providing visualization according to state-of-art guidelines [4, 5]. In order to optimize the translation of metagenomic surveys’ results into biomedically important knowledge and advance the global progress in collaborative microbiota research, we developed Knomics-Biota, a web-service for metagenomic data analysis that allows users without advanced skills in bioinformatics and software development to turn their “raw” data into intuitive analytical reports. The datasets can be accompanied with metadata that can include, besides factors like age and clinical status, the factors related to experimental design - distribution between case and control groups, paired correspondence of the samples, etc. After automatic analysis is complete in the cloud, a user is provided with online reports describing all steps of metagenomic analysis - from data quality check and composition profiles to statistical hypothesis testing. Interactive visualization modules allow to explore the interactions between microbiota and factors in detail and propose novel biological hypotheses. Analysis of metabolic potential includes manually curated pathways reflecting gut microbiota functions highly relevant for human health - like synthesis of short-chain fatty acids (SCFAs) and vitamins. It is possible to analyze one’s own data in the context of related precomputed published metagenomes arranged into collections (diet, inflammatory bowel diseases, world populations, etc.). The generated reports can be shared privately with collaborators or publicly and readily to be referred to in scientific publications.
The computational backend of the system is located in the cloud (Additional file 1: Figure S1) and makes use of publicly available software solutions. The front-end interface of the web service is implemented using Yii framework, and interactive visualisations are based on d3js library. The web-service is available at the address: https://biota.knomics.ru. After signing up, a user can upload one’s own metagenomic read sets (obtained using 16S rRNA or “shotgun”/WGS [whole genome] sequencing) accompanied with data description files (metadata).
General logic of Knomics-Biota service includes two components: primary and secondary analysis (Fig. 1). The primary analysis component encompasses basic processing of the reads to obtain microbiota composition profiles. For each of the 16S rRNA and WGS formats, primary analysis component produces feature vectors including relative abundance of microbial taxa at various ranks as well as of gene groups and metabolic pathways according to KEGG Orthology and Enzyme Commission (EC) nomenclatures. Additionally, some functions are analyzed in a dedicated way due to their importance for human health - synthesis of vitamins and SCFAs. These functions are assessed for each sample using curated pathways (Additional file 2: Figure S2).
The primary analysis of 16S rRNA data is performed using QIIME , from reads filtering to defining OTUs (operational taxonomic units). Gene content is predicted using PICRUSt algorithm . WGS data is analyzed using KneadData for quality filtering and HUMAnN  - for taxonomic and functional profiling.
The secondary analysis component implemented in Python v. 3.2 includes statistical analysis of the feature vectors (together with the metadata, if provided) and generating static figures as well as input (in JSON format) for interactive visualization modules. The workflow of the secondary analysis varies depending on the choice of report type by the user (see Fig. 1).
The Basic report is generated initially for any user data. It includes quality check of the “raw” data, assessment of relative abundance of taxa and functional gene groups as well as alpha-diversity. Hierarchical clustering, enterotyping  and metabolic potential prediction are performed. Besides the basic visualizations, interactive modules are provided including heatmap, PCoA (principal coordinates analysis) plot, alpha-diversity plot and co-occurrence network . Each module within Basic and other interactive reports of Knomics-Biota is accompanied with the details of implementation (algorithm and databases used, values of control parameters, etc) so that a user is able to replicate the results independently - as well as to describe the methods in one’s scientific publication.
The bioinformatic algorithms in the secondary analysis include PERMANOVA method for multivariate analysis, regression linear models and U-test for discovering links between microbial features and factors. Outliers are identified using Grubbs’ test and removed from further statistical analysis. Multiple testing adjustment is performed using Benjamini–Hochberg procedure.
Results and discussion
A number of metagenomic analysis pipelines have been developed. They vary in analysis options - by providing only primary “raw” data processing or advanced options as well, allowing different input data formats (16S rRNA sequencing or WGS data). A comparison data is provided in Table 1 highlighting that Knomics-Biota provides a rich repertoire of functions making it superior to alternatives. As seen, only Knomics-Biota and MG-RAST  provide databases of published metagenomes for comparative analysis. Nephele  as well as CosmosID and One Codex platforms provide a similar functionality: “raw” data processing, advanced statistical analysis and visualizations. However, none of them provide interactivity enabling to change parameters of display on-the-fly.
The Knomics-Biota is made free for academic use. For commercial use, special licensing is provided. Time of the free analysis depends on the number of projects in the queue and is likely to change during the evolution of the system, but currently, an analysis of a typical 16S rRNA dataset containing around 100 samples from a single Illumina MiSeq run (as a prevalent input data format) is processed within several hours. Overall, as much as approximately 5000 of 16S rRNA samples can be submitted at once by a user. As for the WGS analysis, due to the high data volume and queue the processing can take longer - for example, around several days for 50–100 WGS metagenomes.
Before starting to upload one’s own data to Knomics-Biota, it is possible to get a glance into the complete set of functions on existing datasets. After logging in anonymously into a demo account, a user is provided with sample analytical reports precomputed for publicly available metagenomic data with meta-data from several large-scale studies examining microbiome in various conditions like colon cancer , inflammatory bowel diseases  and malnutrition  as well as associated with dietary interventions . The list of the external datasets is being regularly updated with newly published metagenomes related to human gut microbiota (as well as other niches).
After signing up and logging in, a user can create a project in his/her account and upload the “raw” data - metagenomic reads in FASTQ format obtained via amplicon (16S rRNA) or WGS. When the uploading process is finished, a user can go on with the analysis - always starting with the Basic report. Unlike the other reports, the Basic report generation does not require neither the metadata nor specification of external context. The report includes the results of quality check, microbiota taxonomic and functional composition profiling and alpha-diversity. Similar existing services often require complex configuration steps from a user, provide only basic analysis functionality  or are highly specialized . After the Basic report has been successfully generated, it is possible to perform advanced analysis. The major report types and their contents are briefly shown in the Fig. 1.
One of the essential functions of Knomics-Biota is the opportunity to analyze user data in the context of thousands of metagenomes from publicly available articles precomputed using the same pipeline. The collection of external datasets is regularly updated. For convenience, they are arranged into collections (contexts) according to their topic. The major microbiota topics include inflammatory bowel diseases (IBD), diet, fecal mass transplantation (FMT), antibiotics, world populations, Parkinson’s disease, and so on. Accordingly, while it is possible to compare one’s own data against all metagenomes in Knomics-Biota database, it is often reasonable to limit the analysis to the relevant context - using the External comparison report (without user metadata) or Meta-analysis report (with user metadata provided). When the analysis is complete, a user is notified via email.
When the information on the membership of each samples in case or control group is uploaded, the corresponding Case-control report becomes available - allowing to compare these datasets statistically and visually - similar to the scenario of External comparison. The functionality of interactive modules is extended to allow comparison of the microbiota composition between the two groups. Statistical analysis is performed to identify the respective significant differences. Besides the basic composition features, gut microbiota-specific characteristics of interest are evaluated and compared between the groups: these include metabolic potential for synthesis of vitamins and SCFAs. Paired analysis report has a workflow similar to a case-control scenario but modified to account for paired type of data (for instance, the metagenomes obtained from the same subjects before and after antibiotic therapy).
A Factor analysis report is generated if metadata with extrinsic/intrinsic factors is provided. The service performs multifactor analysis to identify significant associations between microbiota composition and factors like age, body-mass index (BMI), clinical status, etc. The interactive modules are extended to include controls over the display of these factors aiding in exploratory analysis. Additionally, a separate type - Time series report - is dedicated to the examination of consecutively grouped samples including specific algorithms like taxon stability analysis and visualizations of these points.
To facilitate collaborative research, Knomics-Biota allows to adjust access control. By default, the uploaded data and generated reports are only visible to the user. However, it is possible to share any of the reports globally in view-only mode (using a permanent link) or to share the project privately to collaborators registered in the service.
Knomics-Biota service is a convenient tool for collaborative exploratory analysis of metagenomes in the context of publicly available data. Thematic collections of metagenomes focused on microbiota in specific diseases and of world populations, the impact of dietary and medical interventions are useful for comparative surveys and data validation. Besides gut microbiota, the system is ready for processing metagenomes from an arbitrary environment allowing users with and without expertise in bioinformatics to gain insights into system biology of complex microbial communities.
Yarygin K, Tyakht A, Larin A, Kostryukova E, Kolchenko S, Bitner V, Alexeev D. Abundance profiling of specific gene groups using precomputed gut metagenomes yields novel biological hypotheses. PLoS One. 2017;12(4):e0176154.
We thank Data Laboratory for the development of the interactive modules, Go4ward for the web-site engine development, Dmitry Rodionov and Andrei Osterman (Sanford Burnham Prebys Medical Discovery Institute) for help with the curation of metabolic pathways.
This work was supported by the Fund for Development of the Center for Elaboration and Commercialization of New Technologies ‘Skolkovo’ [# G94/16 to Knomics LLC].
Availability of data and materials
Authors and Affiliations
Research and Development Department, Knomics LLC, Skolkovo Innovation Center, Moscow, Russian Federation
Daria Efimova, Anna Popenko, Anatoly Vasilyev, Ilya Altukhov, Nikita Dovidchenko, Vera Odintsova, Natalya Klimenko, Robert Loshkarev, Maria Pashkova, Anna Elizarova, Viktoriya Voroshilova, Sergei Slavskii, Yury Pekov, Ekaterina Filippova, Tatiana Shashkova, Evgenii Levin & Dmitry Alexeev
Computer Technologies Laboratory, ITMO University, Saint Petersburg, Russian Federation
Alexander Tyakht & Dmitry Alexeev
Faculty of Biological and Medical Physics, Moscow Institute of Physics and Technology (State University), Moscow, Russian Federation
Ilya Altukhov, Maria Pashkova, Anna Elizarova, Viktoriya Voroshilova, Sergei Slavskii, Tatiana Shashkova & Evgenii Levin
Life Sciences Department, Skolkovo Institute of Science and Technology, Moscow, Russian Federation
Biology Department, Lomonosov Moscow State University, Moscow, Russian Federation
Institute of Cytology and Genetics, Novosibirsk State University, Novosibirsk, Russian Federation
Institute of Protein Research, Russian Academy of Sciences, Pushchino, Moscow, 142290, Russia
AT and DA supervised the work. IA, AV, RL and ND designed the architecture of the web service. AV, IA, DE, AT and YP managed the team work. DE, NK, IA, AV, AP, ND, VO, RL, MP, AE, VV, SS, EF, TS and EL developed the software. NK, DE, ND, MP, AE, VV, SS and EL collected, curated and processed the data. AP, AT and DE prepared the manuscript. All authors have read and approved the final manuscript.
Figure S2. Manually curated vitamin biosynthesis pathways used in the analysis. (PDF 1598 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.