Metagenomic resources

The Mazumder lab has developed several open source resources for metagenomic analysis, listed on this page.

GutFeeling Knowledgebase (GFKB)

We have developed a proof-of-concept gut microbiome monitoring system using a sequencing and analysis pipeline implemented during our previous I-Corps award (see below).

From the individuals enrolled in our study, we have collected the following: three separate fecal samples for metagenomic sequencing, anthropometric measurements, a diet history questionnaire, gastrointestinal symptoms questionnaires, perceived stress questionnaires, physical activity questionnaires, and sleep questionnaires. We have also begun analyzing fecal samples from the Human Microbiome Project and their associated metadata. The real value of our prototype will come from integrating these data into a single knowledgebase of comparable samples, all processed with our optimized pipeline.

Objective

The GutFeeling Knowledgebase (GFKB) is a reference database of human gut microbiomes from both healthy individuals and those diagnosed with a disease or condition. The GFKB is generated by the metagenomic analysis pipeline described in our paper (doi: 10.1371/journal.pone.0206484), which includes three tools integrated in the HIVE platform. The aim of this database is to catalog bacterial organisms found within the human digestive tract as the lab continues to conduct metagenomic research on various diseases and conditions (e.g., epilepsy and pre-diabetes). By identifying key organisms found within the gut, we hope to understand how imbalances in their abundance may affect human health. Currently, the HIVE lab has documented over 500 bacterial organisms in this database, and we expect to add more as we proceed with our current project on predicting intervention outcomes for pre-diabetic patients from their relative gut microbiome abundances.

GFKB downloads

Version	Content Files	Format	File Size	Release Notes (Plain Text)	Date Created
v1.0	RLDA KI Analysis	pdf	393KB	N/A	May 14 2021
v1.0	ML MatLab Tutorial	pdf	5.2KB	N/A	January 6 2021
v5.0	GFKB_v5-PreDiabetes.csv	csv	57KB	GutFeeling Knowledge Base Notes v5.0	January 18 2023
v4.0	GutFeelingKnowledgeBase-v4-Master_List.csv	csv	290KB	GutFeeling Knowledge Base Notes v4.0	March 31 2020
v4.0	GutFeelingKnowledgeBase-v4-Epilepsy_Data.csv	csv	99KB	GutFeeling Knowledge Base Notes v4.0	March 31 2020
v3.0	GutFeelingKnowledgeBase-v3.csv	csv	44KB	GutFeeling Knowledge Base Notes v3.0	August 30 2019
v2.6	GutFeelingKnowledgeBase-v2.6.csv	csv	44KB	GutFeeling Knowledge Base Notes v2.6	July 23 2018
v2.6	HumanGutDB-v2.6.fasta-v2.6.csv	fasta	549MB	HumanGutDB v2.6 Notes	July 23 2018
v2.0	GutFeelingKnowledgeBase-v2.0.csv	csv	249KB	GutFeeling Knowledge Base Notes v2.0	2017
v2.0	HumanGutDB-v2.0.fasta	fasta	533MB	HumanGutDB v2.0 Notes	2017
v2.0	blockList-v2.0.csv	csv	16KB	Black List Notes v2.0	2017
v2.0	unalignedContigsGFKB-v2.0.fasta	fasta	3.2GB	Unaligned Contigs GFKB Notes	2017

Filtered NT

The Filtered NT dataset is generated from the complete nucleotide (NT) file provided by NCBI by excluding every sequence that is assigned to an unwanted taxonomy name or to any child taxon of an unwanted name.
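The exclusion rule can be illustrated with a minimal sketch. It is not the lab's actual filter: the function names, the in-memory dictionaries, and the toy taxonomy are all assumptions made for illustration. It also assumes that sequences with no known taxon are dropped, which this page states only in the pipeline description.

```python
from collections import defaultdict


def blocked_taxa(parent_of, unwanted):
    """Return the unwanted taxonomy names plus all of their descendants."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        if child != parent:  # skip a root entry that lists itself as parent
            children[parent].append(child)
    blocked, stack = set(), list(unwanted)
    while stack:
        name = stack.pop()
        if name not in blocked:
            blocked.add(name)
            stack.extend(children[name])
    return blocked


def filter_records(records, taxon_of, parent_of, unwanted):
    """Keep (accession, sequence) pairs whose taxon is known and not blocked."""
    blocked = blocked_taxa(parent_of, unwanted)
    return [
        (acc, seq)
        for acc, seq in records
        if acc in taxon_of and taxon_of[acc] not in blocked
    ]


# Toy taxonomy and records, invented for illustration only.
parent_of = {
    "root": "root",
    "Bacteria": "root",
    "Escherichia coli": "Bacteria",
    "other sequences": "root",
    "synthetic construct": "other sequences",
}
taxon_of = {"acc1": "Escherichia coli", "acc2": "synthetic construct"}
records = [("acc1", "ACGT"), ("acc2", "GGCC"), ("acc3", "TTAA")]

kept = filter_records(records, taxon_of, parent_of, {"other sequences"})
print([acc for acc, _ in kept])
```

Here acc2 is removed because its taxon is a child of an unwanted name, and acc3 is removed because it has no taxon assignment.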

Metagenomics Pipeline

We use a two-step pipeline for metagenomic analysis: CensuScope and HIVE-hexagon. CensuScope is a census-based tool that randomly samples a user-defined number of reads and BLASTs them against a reference database. Our reference database (a filtered version of NT) is the NCBI Nucleotide database with all sequences lacking a clear taxonomic lineage filtered out. All artificial sequences have been removed, either by our automated filter or manually once an artificial sequence is identified during post-analysis processing. Sequences identified by CensuScope are then used as references in HIVE-hexagon alignments. HIVE-hexagon, a k-mer-based aligner, is more sensitive and faster than current standard alignment algorithms, and it reduces computational cost, memory requirements, and processing time.

Publications

Please use one or more of the following for citation(s):

King CH, Desai H, Sylvetsky AC, LoTempio J, Ayanyan S, Carrie J, Crandall K, Fochtman B, Gasparyan L, Gulzar N, Howell P, Issa N, Krampis K, Mishra L, Morizono H, Pisegna JR, Rao S, Ren Y, Simonyan V, Smith K, VedBrat S, Yao M, Mazumder R. Baseline human gut microbiota profile in healthy people and standard reporting template. PLOS ONE 2019. doi: 10.1371/journal.pone.0206484
Shamsaddini A, Pan Y, Johnson WE, Krampis K, Shcheglovitova M, Simonyan V, Zanne A, Mazumder R. Census-based rapid and accurate metagenome taxonomic profiling. BMC Genomics. 2014;15(1):918. PMID: 25232094
Santana-Quintero L, Dingerdissen H, Thierry-Mieg J, Mazumder R, Simonyan V. HIVE-Hexagon: High-Performance, Parallelized Sequence Alignment for Next-Generation Sequencing Data Analysis. PLOS One. 2014;9(6):e99033. PMID: 24918764
Simonyan V and Mazumder R. High-performance Integrated Virtual Environment (HIVE) Tools and Applications for Big Data Analysis. Genes, 2014 Sep 30;5(4): 957-981. PMID: 271953

Funding

Current/past: NSF, Otsuka, MGPC

Acknowledgements

We would like to thank the following individuals for their significant work in curation and annotation of the GFKB:

Stephanie Singleton
Lindsay Hopson
Jiuge (April) Yang
Tyson Dawson
Cameron Sabet
Yukta Chidanandan
Valery Simonyan
Nicole Post
Ben Osborne
Sophie Halkett
Miguel Mazumder

Questions / Comments

If you have any questions or comments regarding GutFeelingKB, please contact Raja Mazumder (mazumder@gwmail.gwu.edu).

CensuScope

CensuScope is a tool for rapidly profiling metagenomic samples. It works by bootstrapping the data and then aggregating the subsamples to estimate sample composition. The tool is many orders of magnitude faster than brute-force alignment against the NT database, and has greater than 99% accuracy for species present at 1% of the composition or higher. Because the tool is so lightweight, the computational resources needed to run it are minimal: a typical consumer laptop can run it, assuming the database to be searched is stored on the laptop. The user can adjust the number of iterations and the number of samples per iteration, or use machine learning to determine how many cycles to run.
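The idea of repeated subsampling followed by aggregation can be sketched as below. This is an illustration of the general technique, not CensuScope's code: `census_profile` and its parameters are hypothetical names, and the `classify` callback stands in for the BLAST search against the reference database. Each iteration here draws reads without replacement; the page does not say how the real tool samples.

```python
import random
from collections import Counter


def census_profile(reads, classify, sample_size, iterations, seed=0):
    """Estimate taxon proportions from repeated random subsamples of reads."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    totals = Counter()
    for _ in range(iterations):
        # Draw one subsample and tally the taxon assigned to each read.
        sample = rng.sample(reads, min(sample_size, len(reads)))
        totals.update(classify(read) for read in sample)
    n = sum(totals.values())
    if n == 0:
        return {}
    # Aggregate across all iterations into relative abundances.
    return {taxon: count / n for taxon, count in totals.items()}


# Toy input: each "read" is already labeled with its taxon, so the
# classifier is the identity function. A real run would call an aligner.
reads = ["Bacteroides"] * 900 + ["Escherichia"] * 100
profile = census_profile(reads, classify=lambda r: r, sample_size=200, iterations=10)
print(profile)
```

Increasing `sample_size` or `iterations` trades run time for a tighter estimate, which corresponds to the user-adjustable settings described above.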

Code repository: https://github.com/GW-HIVE/CensuScope

Publication:

Shamsaddini et al.

slimNT

Because the NCBI nucleotide database ("NT") has grown so large in recent years, it has become difficult to work with. slimNT is an attempt to take a contextually relevant slice of that database, using a hierarchical clustering approach. The steps of this approach are as follows:

1. Take the representative non-viral proteomes at 75% cutoff from PIR
2. Map their proteome IDs to genome accessions and retrieve the genomes
3. If the genus and species are not present in the above list, get the UniProt reference proteome ID
4. Map these proteome IDs to genome accessions and retrieve the genomes
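The selection logic in the steps above can be sketched as follows. This is one reading of the steps, not the slimNT build script: the function names, the dictionary inputs, and the toy IDs and accessions are all hypothetical. In particular, it assumes step 3 means that a UniProt reference proteome is added only when its genus and species are not already covered by a representative proteome.

```python
def select_proteomes(representative, reference):
    """Combine representative proteomes with reference proteomes for
    organisms the representative set does not cover.

    Both inputs map a proteome ID to a (genus, species) tuple.
    """
    selected = dict(representative)
    covered = set(representative.values())
    for proteome_id, organism in reference.items():
        if organism not in covered:
            selected[proteome_id] = organism
            covered.add(organism)
    return selected


def genomes_for(selected, proteome_to_genome):
    """Map proteome IDs to genome accessions; report IDs with no mapping."""
    accessions, unmapped = [], []
    for proteome_id in selected:
        if proteome_id in proteome_to_genome:
            accessions.append(proteome_to_genome[proteome_id])
        else:
            unmapped.append(proteome_id)
    return accessions, unmapped


# Toy IDs and accessions, invented for illustration only.
representative = {"UP_A": ("Escherichia", "coli")}
reference = {
    "UP_B": ("Escherichia", "coli"),      # already covered, so skipped
    "UP_C": ("Bacteroides", "fragilis"),  # not covered, so added
}
selected = select_proteomes(representative, reference)
accessions, unmapped = genomes_for(selected, {"UP_A": "GENOME_1", "UP_C": "GENOME_2"})
print(accessions, unmapped)
```

Retrieving the genome sequences for the resulting accessions is left out, since the page does not describe how that download is performed.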

Current slimNT database:

https://hive.biochemistry.gwu.edu/static/slimNT.fa.gz (32.1GB)

Current slimNT taxonomy database:

https://hive.biochemistry.gwu.edu/static/slimNT.db.gz (16.8GB)
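Given the size of these downloads, it may help to inspect the FASTA file without decompressing it to disk. The following is a minimal sketch that streams a gzipped FASTA file record by record. It assumes slimNT.fa.gz is a standard gzip-compressed FASTA file; the function name `iter_fasta` and the local path in the usage comment are hypothetical.

```python
import gzip


def iter_fasta(path):
    """Yield (header, sequence_length) for each record in a gzipped FASTA file."""
    header, length = None, 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, length
                header, length = line[1:], 0  # start a new record
            elif header is not None:
                length += len(line)
    if header is not None:
        yield header, length


# Example usage, assuming the file has been downloaded to the working directory:
# for header, length in iter_fasta("slimNT.fa.gz"):
#     print(header, length)
```

Because the file is read one line at a time, memory use stays small regardless of the size of the database.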

Publications:

Shamsaddini et al.

Santana-Quintero et al.

Simonyan et al.

Simonyan V, Mazumder R

Funding

LOI_ID#L02496974, NSF_Lineage_Award #1546491