<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://hivelab.biochemistry.gwu.edu/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Christie.woodside</id>
	<title>HIVE Lab - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://hivelab.biochemistry.gwu.edu/wiki/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Christie.woodside"/>
	<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/Special:Contributions/Christie.woodside"/>
	<updated>2026-04-30T21:14:09Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.42.1</generator>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_WIKI&amp;diff=1234</id>
		<title>FDA-ARGOS WIKI</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_WIKI&amp;diff=1234"/>
		<updated>2026-04-07T19:36:09Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{DISPLAYTITLE:&amp;lt;span style=&amp;quot;position:absolute; clip:rect(1px, 1px, 1px, 1px);&amp;quot;&amp;gt;{{FULLPAGENAME}}&amp;lt;/span&amp;gt;}}&lt;br /&gt;
__NOTOC__&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- TOP BANNER --&amp;gt;&lt;br /&gt;
&amp;lt;div id=&amp;quot;argos-topbanner&amp;quot; style=&amp;quot;clear:both; position:relative; box-sizing:border-box; width:100%; margin:1.2em 0 8px; border:1px solid #ddd; background-color:#f9f9f9; color:#000;&amp;quot;&amp;gt;&lt;br /&gt;
  &amp;lt;div style=&amp;quot;margin:0.6em; text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
    &amp;lt;div style=&amp;quot;font-size:170%; padding:.1em;&amp;quot;&amp;gt;Welcome to the FDA-ARGOS Wiki&amp;lt;/div&amp;gt;&lt;br /&gt;
    &amp;lt;div style=&amp;quot;font-size:100%;&amp;quot;&amp;gt;&lt;br /&gt;
      This wiki provides project information, database resources, protocols, publications, and support materials for the FDA-ARGOS initiative.&lt;br /&gt;
    &amp;lt;/div&amp;gt;&lt;br /&gt;
  &amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;clear:both;&amp;quot;&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- INTRODUCTION --&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;margin:10px 0; border:1px solid #CCC; padding:0 12px 12px 12px; box-shadow:0 2px 2px rgba(0,0,0,0.1); background:#f5faff;&amp;quot;&amp;gt;&lt;br /&gt;
  &amp;lt;h2&amp;gt;Introduction&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
FDA-ARGOS database updates may help researchers rapidly validate diagnostic tests and use qualified genetic sequences to support future product development (NCBI BioProject [https://www.ncbi.nlm.nih.gov/bioproject/231221 PRJNA231221]).&lt;br /&gt;
&lt;br /&gt;
Since September 2021, Embleema and George Washington University have been conducting bioinformatic research and system development, focusing on expanding the FDA-ARGOS database. This project expands datasets publicly available in FDA-ARGOS, improves quality control by developing quality matrix tools and scoring approaches that will allow the mining of public sequence databases, and identifies high-quality sequences for upload to the FDA-ARGOS database as regulatory-grade sequences. Building on expansions during the COVID-19 pandemic, this project aims to further improve the utility of the FDA-ARGOS database as a key tool for medical countermeasure development and validation.&lt;br /&gt;
&lt;br /&gt;
For additional details on project information and assembly QC, see:&lt;br /&gt;
* [https://www.fda.gov/emergency-preparedness-and-response/preparedness-research/expanding-next-generation-sequencing-tools-support-pandemic-preparedness-and-response FDA-ARGOS Project Information from FDA Announcements]&lt;br /&gt;
* [https://data.argosdb.org/ ARGOS Database]&lt;br /&gt;
* [[About Argos DataBase]]&lt;br /&gt;
* [https://www.fda.gov/medical-devices/science-and-research-medical-devices/database-reference-grade-microbial-sequences-fda-argos FDA Statement for FDA-ARGOS Database]&lt;br /&gt;
* [[ARGOS Contact Us]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;FDA-ARGOS Initial Phase&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In May 2014, the FDA and collaborators established FDA-ARGOS, a publicly available database of reference-grade microbial sequences. With funding support from the FDA&#039;s Office of Counterterrorism and Emerging Threats (OCET) and the DoD, the FDA-ARGOS team initially collected and sequenced 2,000 microbes, including biothreat microorganisms, common clinical pathogens, and closely related species. At the beginning of the project, the FDA-ARGOS microbial genomes were generated in three general phases:&lt;br /&gt;
&lt;br /&gt;
* Phase 1: collection of a previously identified microbe and nucleic acid extraction.&lt;br /&gt;
* Phase 2: the microbial nucleic acids were sequenced and de novo assembled using Illumina and PacBio sequencing platforms at the Institute for Genome Sciences at the University of Maryland (UMD-IGS).&lt;br /&gt;
* Phase 3: the assembled genomes were vetted by an ID-NGS subject matter expert working group consisting of FDA personnel and collaborators, and the data were then deposited in the NCBI databases.&lt;br /&gt;
&lt;br /&gt;
The FDA-ARGOS genomes meet the quality metrics for reference-grade genomes for regulatory use. FDA-ARGOS reference genomes have been de novo assembled with high depth of base coverage and placed within a pre-established phylogenetic tree. Each microbial isolate in the database is covered at a minimum of 20X over 95 percent of the assembled core genome. Furthermore, sample-specific metadata, raw reads, assemblies, annotation, and details of the bioinformatics pipeline are available.&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- FDA-ARGOS DATABASE --&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;margin:10px 0; border:1px solid #CCC; padding:0 12px 12px 12px; box-shadow:0 2px 2px rgba(0,0,0,0.1); background:#f5faff;&amp;quot;&amp;gt;&lt;br /&gt;
  &amp;lt;h2&amp;gt;FDA-ARGOS Database&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The FDA-ARGOS database ([https://data.argosdb.org/ data.argosdb.org]) is a public database containing quality-controlled reference genomes for diagnostic and regulatory purposes.&lt;br /&gt;
&lt;br /&gt;
The organisms tagged FDA-ARGOS in the database tables can be found in our FDA-ARGOS NCBI BioProject [https://www.ncbi.nlm.nih.gov/bioproject/231221 here]. We have also QC&#039;d and included additional organisms that were publicly available on NCBI; they are logged at [[Additional ARGOS Reviewed Organisms]].&lt;br /&gt;
&lt;br /&gt;
A comprehensive reference table of human pathogenic organisms, including bacteria, viruses, and eukaryotic pathogens, has been created. The table integrates data from multiple curated sources and includes taxonomic identifiers and classification information for downstream analysis and database integration.&lt;br /&gt;
&lt;br /&gt;
* [[Comprehensive Pathogenic Organisms Reference]]&lt;br /&gt;
&lt;br /&gt;
For a visual display of data analytics, view the prototype dashboard [https://studyanalytics.embleema.com/superset/dashboard/argos/?standalone=2&amp;amp;expand_filters=0 here]. This static webpage is regenerated with each data push.&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- FDA-ARGOS FAQS --&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;margin:10px 0; border:1px solid #CCC; padding:0 12px 12px 12px; box-shadow:0 2px 2px rgba(0,0,0,0.1); background:#f5faff;&amp;quot;&amp;gt;&lt;br /&gt;
  &amp;lt;h2&amp;gt;FDA-ARGOS FAQs&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Frequently asked questions about FDA-ARGOS can be found [[FDA-ARGOS FAQs|here]]. If there are any further questions, feel free to [[ARGOS Contact Us|contact us]].&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- ARGOS USAGE TUTORIALS AND PROTOCOLS --&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;margin:10px 0; border:1px solid #CCC; padding:0 12px 12px 12px; box-shadow:0 2px 2px rgba(0,0,0,0.1); background:#f5faff;&amp;quot;&amp;gt;&lt;br /&gt;
  &amp;lt;h2&amp;gt;ARGOS Usage Tutorials and Protocols&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The ARGOS One-Click Pipeline is used to create the QC metrics and results displayed in our data tables on data.argosdb.org. The ARGOS One-Click Pipeline Usage tutorial can be found [[ARGOSQC Usage Tutorial|here]].&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- ROW 4 --&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;display:flex; flex-flow:row wrap; justify-content:space-between; padding:0; margin:0 -5px 0 -5px;&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;div style=&amp;quot;flex:1; margin:5px; min-width:300px; border:1px solid #CCC; padding:0 12px 12px 12px; box-shadow:0 2px 2px rgba(0,0,0,0.1); background:#f5faff;&amp;quot;&amp;gt;&lt;br /&gt;
    &amp;lt;h2&amp;gt;Project Publications&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Sichtig, H., Minogue, T., Yan, Y. &#039;&#039;et al.&#039;&#039; FDA-ARGOS is a database with public quality-controlled reference genomes for diagnostic use and regulatory science. &#039;&#039;Nature Communications&#039;&#039; 10, 3313 (2019). [https://doi.org/10.1038/s41467-019-11306-6 DOI: 10.1038/s41467-019-11306-6]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Coming soon:&#039;&#039; Journal publication highlighting QC metrics and the data dictionary for the ARGOS database.&lt;br /&gt;
  &amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_WIKI&amp;diff=1233</id>
		<title>FDA-ARGOS WIKI</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_WIKI&amp;diff=1233"/>
		<updated>2026-04-07T19:33:58Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{DISPLAYTITLE:&amp;lt;span style=&amp;quot;position:absolute; clip:rect(1px, 1px, 1px, 1px);&amp;quot;&amp;gt;{{FULLPAGENAME}}&amp;lt;/span&amp;gt;}}&lt;br /&gt;
__NOTOC__&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- TOP BANNER --&amp;gt;&lt;br /&gt;
&amp;lt;div id=&amp;quot;argos-topbanner&amp;quot; style=&amp;quot;clear:both; position:relative; box-sizing:border-box; width:100%; margin:1.2em 0 8px; border:1px solid #ddd; background-color:#f9f9f9; color:#000;&amp;quot;&amp;gt;&lt;br /&gt;
  &amp;lt;div style=&amp;quot;margin:0.6em; text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
    &amp;lt;div style=&amp;quot;font-size:170%; padding:.1em;&amp;quot;&amp;gt;Welcome to the FDA-ARGOS Wiki&amp;lt;/div&amp;gt;&lt;br /&gt;
    &amp;lt;div style=&amp;quot;font-size:100%;&amp;quot;&amp;gt;&lt;br /&gt;
      This wiki provides project information, database resources, protocols, publications, and support materials for the FDA-ARGOS initiative.&lt;br /&gt;
    &amp;lt;/div&amp;gt;&lt;br /&gt;
  &amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Introduction ==&lt;br /&gt;
&lt;br /&gt;
FDA-ARGOS database updates may help researchers rapidly validate diagnostic tests and use qualified genetic sequences to support future product development (NCBI BioProject [https://www.ncbi.nlm.nih.gov/bioproject/231221 PRJNA231221]).&lt;br /&gt;
&lt;br /&gt;
Since September 2021, Embleema and George Washington University have been conducting bioinformatic research and system development, focusing on expanding the FDA-ARGOS database. This project expands datasets publicly available in FDA-ARGOS, improves quality control by developing quality matrix tools and scoring approaches that will allow the mining of public sequence databases, and identifies high-quality sequences for upload to the FDA-ARGOS database as regulatory-grade sequences. Building on expansions during the COVID-19 pandemic, this project aims to further improve the utility of the FDA-ARGOS database as a key tool for medical countermeasure development and validation.&lt;br /&gt;
&lt;br /&gt;
For additional details on project information and assembly QC, see:&lt;br /&gt;
&lt;br /&gt;
* [https://www.fda.gov/emergency-preparedness-and-response/preparedness-research/expanding-next-generation-sequencing-tools-support-pandemic-preparedness-and-response FDA-ARGOS Project Information from FDA Announcements]&lt;br /&gt;
* [https://data.argosdb.org/ ARGOS Database]&lt;br /&gt;
* [[About Argos DataBase]]&lt;br /&gt;
* [https://www.fda.gov/medical-devices/science-and-research-medical-devices/database-reference-grade-microbial-sequences-fda-argos FDA Statement for FDA-ARGOS Database]&lt;br /&gt;
* [[ARGOS Contact Us]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;FDA-ARGOS Initial Phase&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In May 2014, the FDA and collaborators established FDA-ARGOS, a publicly available database of reference-grade microbial sequences. With funding support from the FDA&#039;s Office of Counterterrorism and Emerging Threats (OCET) and the DoD, the FDA-ARGOS team initially collected and sequenced 2,000 microbes, including biothreat microorganisms, common clinical pathogens, and closely related species. At the beginning of the project, the FDA-ARGOS microbial genomes were generated in three general phases:&lt;br /&gt;
&lt;br /&gt;
* Phase 1: collection of a previously identified microbe and nucleic acid extraction.&lt;br /&gt;
* Phase 2: the microbial nucleic acids were sequenced and de novo assembled using Illumina and PacBio sequencing platforms at the Institute for Genome Sciences at the University of Maryland (UMD-IGS).&lt;br /&gt;
* Phase 3: the assembled genomes were vetted by an ID-NGS subject matter expert working group consisting of FDA personnel and collaborators, and the data were then deposited in the NCBI databases.&lt;br /&gt;
&lt;br /&gt;
The FDA-ARGOS genomes meet the quality metrics for reference-grade genomes for regulatory use. FDA-ARGOS reference genomes have been de novo assembled with high depth of base coverage and placed within a pre-established phylogenetic tree. Each microbial isolate in the database is covered at a minimum of 20X over 95 percent of the assembled core genome. Furthermore, sample-specific metadata, raw reads, assemblies, annotation, and details of the bioinformatics pipeline are available.&lt;br /&gt;
&lt;br /&gt;
== FDA-ARGOS Database ==&lt;br /&gt;
&lt;br /&gt;
The FDA-ARGOS database ([https://data.argosdb.org/ data.argosdb.org]) is a public database containing quality-controlled reference genomes for diagnostic and regulatory purposes.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The organisms tagged FDA-ARGOS in the database tables can be found in our FDA-ARGOS NCBI BioProject [https://www.ncbi.nlm.nih.gov/bioproject/231221 here]. We have also QC&#039;d and included additional organisms that were publicly available on NCBI; they are logged at [[Additional ARGOS Reviewed Organisms]].&lt;br /&gt;
&lt;br /&gt;
A comprehensive reference table of human pathogenic organisms, including bacteria, viruses, and eukaryotic pathogens, has been created. The table integrates data from multiple curated sources and includes taxonomic identifiers and classification information for downstream analysis and database integration. &lt;br /&gt;
&lt;br /&gt;
* [[Comprehensive Pathogenic Organisms Reference]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For a visual display of data analytics, view the prototype dashboard [https://studyanalytics.embleema.com/superset/dashboard/argos/?standalone=2&amp;amp;expand_filters=0 here]. This static webpage is regenerated with each data push.&lt;br /&gt;
&lt;br /&gt;
== FDA-ARGOS FAQs ==&lt;br /&gt;
Frequently asked questions about FDA-ARGOS can be found [[FDA-ARGOS FAQs|here]]. If there are any further questions, feel free to [[ARGOS Contact Us|contact us]].&lt;br /&gt;
&lt;br /&gt;
== ARGOS Usage Tutorials and Protocols ==&lt;br /&gt;
&lt;br /&gt;
The ARGOS One-Click Pipeline is used to create the QC metrics and results displayed in our data tables on data.argosdb.org. The ARGOS One-Click Pipeline Usage tutorial can be found [[ARGOSQC Usage Tutorial|here]]. &lt;br /&gt;
Resources:&lt;br /&gt;
* [[ARGOSQC Usage Tutorial|ARGOS One-Click Pipeline Usage Tutorial]]&lt;br /&gt;
  &amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- ROW 4 --&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;display:flex; flex-flow:row wrap; justify-content:space-between; padding:0; margin:0 -5px 0 -5px;&amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;div style=&amp;quot;flex:1; margin:5px; min-width:300px; border:1px solid #CCC; padding:0 12px 12px 12px; box-shadow:0 2px 2px rgba(0,0,0,0.1); background:#f5faff;&amp;quot;&amp;gt;&lt;br /&gt;
    &amp;lt;h2&amp;gt;Project Publications&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Sichtig, H., Minogue, T., Yan, Y. &#039;&#039;et al.&#039;&#039; FDA-ARGOS is a database with public quality-controlled reference genomes for diagnostic use and regulatory science. &#039;&#039;Nature Communications&#039;&#039; 10, 3313 (2019). [https://doi.org/10.1038/s41467-019-11306-6 DOI: 10.1038/s41467-019-11306-6]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Coming soon:&#039;&#039; Journal publication highlighting QC metrics and the data dictionary for the ARGOS database.&lt;br /&gt;
  &amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_WIKI&amp;diff=1232</id>
		<title>FDA-ARGOS WIKI</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_WIKI&amp;diff=1232"/>
		<updated>2026-04-07T19:16:47Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: /* FDA-ARGOS Database */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
FDA-ARGOS database updates may help researchers rapidly validate diagnostic tests and use qualified genetic sequences to support future product development (NCBI BioProject [https://www.ncbi.nlm.nih.gov/bioproject/231221 PRJNA231221]).&lt;br /&gt;
&lt;br /&gt;
Since September 2021, Embleema and George Washington University have been conducting bioinformatic research and system development, focusing on expanding the FDA-ARGOS database. This project expands datasets publicly available in FDA-ARGOS, improves quality control by developing quality matrix tools and scoring approaches that will allow the mining of public sequence databases, and identifies high-quality sequences for upload to the FDA-ARGOS database as regulatory-grade sequences. Building on expansions during the COVID-19 pandemic, this project aims to further improve the utility of the FDA-ARGOS database as a key tool for medical countermeasure development and validation.&lt;br /&gt;
&lt;br /&gt;
For additional details on project information and assembly QC, see:&lt;br /&gt;
&lt;br /&gt;
* [https://www.fda.gov/emergency-preparedness-and-response/preparedness-research/expanding-next-generation-sequencing-tools-support-pandemic-preparedness-and-response FDA-ARGOS Project Information from FDA Announcements]&lt;br /&gt;
* [https://data.argosdb.org/ ARGOS Database]&lt;br /&gt;
* [[About Argos DataBase]]&lt;br /&gt;
* [https://www.fda.gov/medical-devices/science-and-research-medical-devices/database-reference-grade-microbial-sequences-fda-argos FDA Statement for FDA-ARGOS Database]&lt;br /&gt;
* [[ARGOS Contact Us]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;FDA-ARGOS Initial Phase&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In May 2014, the FDA and collaborators established FDA-ARGOS, a publicly available database of reference-grade microbial sequences. With funding support from the FDA&#039;s Office of Counterterrorism and Emerging Threats (OCET) and the DoD, the FDA-ARGOS team initially collected and sequenced 2,000 microbes, including biothreat microorganisms, common clinical pathogens, and closely related species. At the beginning of the project, the FDA-ARGOS microbial genomes were generated in three general phases:&lt;br /&gt;
&lt;br /&gt;
* Phase 1: collection of a previously identified microbe and nucleic acid extraction.&lt;br /&gt;
* Phase 2: the microbial nucleic acids were sequenced and de novo assembled using Illumina and PacBio sequencing platforms at the Institute for Genome Sciences at the University of Maryland (UMD-IGS).&lt;br /&gt;
* Phase 3: the assembled genomes were vetted by an ID-NGS subject matter expert working group consisting of FDA personnel and collaborators, and the data were then deposited in the NCBI databases.&lt;br /&gt;
&lt;br /&gt;
The FDA-ARGOS genomes meet the quality metrics for reference-grade genomes for regulatory use. FDA-ARGOS reference genomes have been de novo assembled with high depth of base coverage and placed within a pre-established phylogenetic tree. Each microbial isolate in the database is covered at a minimum of 20X over 95 percent of the assembled core genome. Furthermore, sample-specific metadata, raw reads, assemblies, annotation, and details of the bioinformatics pipeline are available.&lt;br /&gt;
&lt;br /&gt;
== FDA-ARGOS Database ==&lt;br /&gt;
&lt;br /&gt;
The FDA-ARGOS database ([https://data.argosdb.org/ data.argosdb.org]) is a public database containing quality-controlled reference genomes for diagnostic and regulatory purposes.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The organisms tagged FDA-ARGOS in the database tables can be found in our FDA-ARGOS NCBI BioProject [https://www.ncbi.nlm.nih.gov/bioproject/231221 here]. We have also QC&#039;d and included additional organisms that were publicly available on NCBI; they are logged at [[Additional ARGOS Reviewed Organisms]].&lt;br /&gt;
&lt;br /&gt;
A comprehensive reference table of human pathogenic organisms, including bacteria, viruses, and eukaryotic pathogens, has been created. The table integrates data from multiple curated sources and includes taxonomic identifiers and classification information for downstream analysis and database integration. &lt;br /&gt;
&lt;br /&gt;
* [[Comprehensive Pathogenic Organisms Reference]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For a visual display of data analytics, view the prototype dashboard [https://studyanalytics.embleema.com/superset/dashboard/argos/?standalone=2&amp;amp;expand_filters=0 here]. This static webpage is regenerated with each data push.&lt;br /&gt;
&lt;br /&gt;
== FDA-ARGOS FAQs ==&lt;br /&gt;
Frequently asked questions about FDA-ARGOS can be found [[FDA-ARGOS FAQs|here]]. If there are any further questions, feel free to [[ARGOS Contact Us|contact us]].&lt;br /&gt;
&lt;br /&gt;
== ARGOS Usage Tutorials and Protocols ==&lt;br /&gt;
&lt;br /&gt;
The ARGOS One-Click Pipeline is used to create the QC metrics and results displayed in our data tables on data.argosdb.org. The ARGOS One-Click Pipeline Usage tutorial can be found [[ARGOSQC Usage Tutorial|here]]. &lt;br /&gt;
== Project Publications ==&lt;br /&gt;
&lt;br /&gt;
* Sichtig, H., Minogue, T., Yan, Y. &#039;&#039;et al.&#039;&#039; FDA-ARGOS is a database with public quality-controlled reference genomes for diagnostic use and regulatory science. &#039;&#039;Nat Commun&#039;&#039; 10, 3313 (2019). https://doi.org/10.1038/s41467-019-11306-6&lt;br /&gt;
* &#039;&#039;Coming soon:&#039;&#039; Journal publication highlighting our QC metrics and data dictionary for the ARGOS database.&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_WIKI&amp;diff=1231</id>
		<title>FDA-ARGOS WIKI</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_WIKI&amp;diff=1231"/>
		<updated>2026-04-07T19:15:39Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: /* FDA-ARGOS Database */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
FDA-ARGOS database updates may help researchers rapidly validate diagnostic tests and use qualified genetic sequences to support future product development (NCBI BioProject [https://www.ncbi.nlm.nih.gov/bioproject/231221 PRJNA231221]).&lt;br /&gt;
&lt;br /&gt;
Since September 2021, Embleema and George Washington University have been conducting bioinformatic research and system development, focusing on expanding the FDA-ARGOS database. This project expands datasets publicly available in FDA-ARGOS, improves quality control by developing quality matrix tools and scoring approaches that will allow the mining of public sequence databases, and identifies high-quality sequences for upload to the FDA-ARGOS database as regulatory-grade sequences. Building on expansions during the COVID-19 pandemic, this project aims to further improve the utility of the FDA-ARGOS database as a key tool for medical countermeasure development and validation.&lt;br /&gt;
&lt;br /&gt;
For additional details on project information and assembly QC, see:&lt;br /&gt;
&lt;br /&gt;
* [https://www.fda.gov/emergency-preparedness-and-response/preparedness-research/expanding-next-generation-sequencing-tools-support-pandemic-preparedness-and-response FDA-ARGOS Project Information from FDA Announcements]&lt;br /&gt;
* [https://data.argosdb.org/ ARGOS Database]&lt;br /&gt;
* [[About Argos DataBase]]&lt;br /&gt;
* [https://www.fda.gov/medical-devices/science-and-research-medical-devices/database-reference-grade-microbial-sequences-fda-argos FDA Statement for FDA-ARGOS Database]&lt;br /&gt;
* [[ARGOS Contact Us]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;FDA-ARGOS Initial Phase&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In May 2014, the FDA and collaborators established FDA-ARGOS, a publicly available database of reference-grade microbial sequences. With funding support from the FDA&#039;s Office of Counterterrorism and Emerging Threats (OCET) and the DoD, the FDA-ARGOS team initially collected and sequenced 2,000 microbes, including biothreat microorganisms, common clinical pathogens, and closely related species. At the beginning of the project, the FDA-ARGOS microbial genomes were generated in three general phases:&lt;br /&gt;
&lt;br /&gt;
* Phase 1: collection of a previously identified microbe and nucleic acid extraction.&lt;br /&gt;
* Phase 2: the microbial nucleic acids were sequenced and de novo assembled using Illumina and PacBio sequencing platforms at the Institute for Genome Sciences at the University of Maryland (UMD-IGS).&lt;br /&gt;
* Phase 3: the assembled genomes were vetted by an ID-NGS subject matter expert working group consisting of FDA personnel and collaborators, and the data were then deposited in the NCBI databases.&lt;br /&gt;
&lt;br /&gt;
The FDA-ARGOS genomes meet the quality metrics for reference-grade genomes for regulatory use. FDA-ARGOS reference genomes have been de novo assembled with high depth of base coverage and placed within a pre-established phylogenetic tree. Each microbial isolate in the database is covered at a minimum of 20X over 95 percent of the assembled core genome. Furthermore, sample-specific metadata, raw reads, assemblies, annotation, and details of the bioinformatics pipeline are available.&lt;br /&gt;
&lt;br /&gt;
== FDA-ARGOS Database ==&lt;br /&gt;
&lt;br /&gt;
The FDA-ARGOS database ([https://data.argosdb.org/ data.argosdb.org]) is a public database containing quality-controlled reference genomes for diagnostic and regulatory purposes.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The organisms tagged FDA-ARGOS in the database tables can be found in our FDA-ARGOS NCBI BioProject [https://www.ncbi.nlm.nih.gov/bioproject/231221 here]. We have also QC&#039;d and included additional organisms that were publicly available on NCBI; they are logged at [[Additional ARGOS Reviewed Organisms]].&lt;br /&gt;
&lt;br /&gt;
* [[Comprehensive Pathogenic Organisms Reference]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For a visual display of data analytics, view the prototype dashboard [https://studyanalytics.embleema.com/superset/dashboard/argos/?standalone=2&amp;amp;expand_filters=0 here]. This static webpage is regenerated with each data push.&lt;br /&gt;
&lt;br /&gt;
== FDA-ARGOS FAQs ==&lt;br /&gt;
Frequently asked questions about FDA-ARGOS can be found [[FDA-ARGOS FAQs|here]]. If there are any further questions, feel free to [[ARGOS Contact Us|contact us]].&lt;br /&gt;
&lt;br /&gt;
== ARGOS Usage Tutorials and Protocols ==&lt;br /&gt;
&lt;br /&gt;
The ARGOS One-Click Pipeline is used to create the QC metrics and results displayed in our data tables on data.argosdb.org. The ARGOS One-Click Pipeline Usage tutorial can be found [[ARGOSQC Usage Tutorial|here]]. &lt;br /&gt;
== Project Publications ==&lt;br /&gt;
&lt;br /&gt;
* Sichtig, H., Minogue, T., Yan, Y. &#039;&#039;et al.&#039;&#039; FDA-ARGOS is a database with public quality-controlled reference genomes for diagnostic use and regulatory science. &#039;&#039;Nat Commun&#039;&#039; 10, 3313 (2019). https://doi.org/10.1038/s41467-019-11306-6&lt;br /&gt;
* &#039;&#039;Coming soon:&#039;&#039; Journal publication highlighting our QC metrics and data dictionary for the ARGOS database.&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Volunteership_Spring_2026&amp;diff=1142</id>
		<title>Volunteership Spring 2026</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Volunteership_Spring_2026&amp;diff=1142"/>
		<updated>2026-01-26T15:47:31Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: /* Volunteers (TBD) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== 2026 Spring Volunteer Program Details ==&lt;br /&gt;
&lt;br /&gt;
=== Dates ===&lt;br /&gt;
&#039;&#039;&#039;Application Deadline&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
January 9, 2026, noon (email your updated resume and a list of projects in order of preference). Acceptance letters/emails will be sent to candidates no later than the day after the kick-off meeting.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Volunteer Zoom Kick-Off Meeting&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
January 12, 2026 | 11:00 AM to 12:00 PM&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Program Dates: January 2026 – April 2026&#039;&#039;&#039; (13 weeks)&lt;br /&gt;
&lt;br /&gt;
Remote | Hybrid for GW employees and students (Ross Hall 5th floor)&lt;br /&gt;
&lt;br /&gt;
[[Volunteership Fall 2025|Fall 2025 Volunteership]] (Closed)&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Volunteer Expectations ===&lt;br /&gt;
&lt;br /&gt;
# Minimum commitment of 10 hours per week.&lt;br /&gt;
# Progress updates via Slack at least 3 days per week (scrum).&lt;br /&gt;
# 30-minute Zoom meetings (during regular work hours) once a week or every other week with the assigned project point of contact (POC).&lt;br /&gt;
# Attend some lectures or seminars remotely (max 4-5).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Important:&#039;&#039;&#039; If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.&#039;&#039;&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Potential Projects ===&lt;br /&gt;
We are excited to continue our bioinformatics volunteership program in Spring 2026. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email your resume and a ranked list of the projects that interest you most to mazumder_lab@gwu.edu. You can also indicate whether you want to focus on specific areas of interest.&lt;br /&gt;
&lt;br /&gt;
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.&lt;br /&gt;
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.&lt;br /&gt;
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.&lt;br /&gt;
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs to build a corpus for training an LLM to recommend publications with datasets useful for intervention outcome prediction models.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen. &amp;lt;u&amp;gt;We are also looking for individuals who have previously worked with us to take on a coordinator role&amp;lt;/u&amp;gt;.&#039;&#039;&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
==== 1. BiomarkerKB Biocuration Project Ideas ====&lt;br /&gt;
POC: Maria Kim, Cyrus Yeung, Jeet Vora&lt;br /&gt;
&lt;br /&gt;
# Curate biomarkers for a specific disease or for a treatment&lt;br /&gt;
## The student would be doing manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.&lt;br /&gt;
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.&lt;br /&gt;
# Top 50 biomarkers&lt;br /&gt;
## Curate the top 50 biomarkers for biomarkerkb.org.&lt;br /&gt;
## Define what constitutes a top 50 biomarker.&lt;br /&gt;
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.&lt;br /&gt;
# Biocuration of biomarkers from NLP/LLM work&lt;br /&gt;
## Use the biomarkers collected from NLP work.&lt;br /&gt;
## Curate the biomarkers; the data from the NLP work was not provided in the biomarker data model format.&lt;br /&gt;
## While curating the biomarkers, check if data collected from NLP is correct.&lt;br /&gt;
## After completion, the student can start using curated data to work on NLP/LLM methods.&lt;br /&gt;
# Continue working on LLM methods started by volunteers in Fall 2025.&lt;br /&gt;
::: The data is available, along with some preliminary research and work done by previous volunteers in this area.&lt;br /&gt;
&lt;br /&gt;
If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check whether it would be feasible as a project.&lt;br /&gt;
&lt;br /&gt;
==== 2. GlyGen Biocuration Project Ideas ====&lt;br /&gt;
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner&lt;br /&gt;
&lt;br /&gt;
Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because much of the valuable metadata (e.g., species, tissue, disease, and cell line annotations) consists of free-text terms that do not align with the established standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, &amp;quot;human,&amp;quot; &amp;quot;man,&amp;quot; and &amp;quot;h. sapiens&amp;quot; all map to the scientific species name &amp;quot;Homo sapiens.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.&lt;br /&gt;
&lt;br /&gt;
The project involves:&lt;br /&gt;
&lt;br /&gt;
# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.&lt;br /&gt;
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.&lt;br /&gt;
# Finding papers from titles and author lists, which may themselves contain spelling errors.&lt;br /&gt;
# Interacting and discussing with other curators in case terms are mapped differently.&lt;br /&gt;
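&lt;br /&gt;
&#039;&#039;As an illustration of the mapping step, free-text terms can be pre-normalized against a synonym table before a dictionary or ontology lookup. The sketch below is hypothetical (the synonym entries are illustrative, not actual CarbBank/CFG data); real curation would consult resources such as NCBI Taxonomy.&#039;&#039;&lt;br /&gt;

```python
# Hypothetical sketch: pre-normalize free-text metadata terms against a
# small synonym table before attempting a dictionary/ontology lookup.
# The entries below are illustrative only, not actual CarbBank/CFG data;
# real curation would consult resources such as NCBI Taxonomy.
SYNONYMS = {
    "human": "Homo sapiens",
    "man": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
    "mouse": "Mus musculus",
}

# Canonical names, lower-cased once for quick membership checks.
CANONICAL = {v.lower(): v for v in SYNONYMS.values()}


def normalize_species(term):
    """Return the standard species name for a free-text term, or None."""
    key = term.strip().lower()
    if key in CANONICAL:          # already canonical, just fix casing
        return CANONICAL[key]
    return SYNONYMS.get(key)      # known synonym, or None if unmapped
```

&#039;&#039;Terms that fall through as &lt;code&gt;None&lt;/code&gt; are exactly the cases that need manual curation and discussion with other curators.&#039;&#039;&lt;br /&gt;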
&lt;br /&gt;
If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.&lt;br /&gt;
&lt;br /&gt;
==== 3. GlyGen Publication Analysis Project Ideas ====&lt;br /&gt;
POC: Rene Ranzinger and Urnisha Bhuiyan&lt;br /&gt;
&lt;br /&gt;
One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.&lt;br /&gt;
&lt;br /&gt;
The project involves:&lt;br /&gt;
&lt;br /&gt;
# Using the PubMed web API to filter publications based on keywords.&lt;br /&gt;
# Analyzing paper abstracts to identify research institutions and groups that form the community.&lt;br /&gt;
# Filtering the community list to exclude unrelated co-authors.&lt;br /&gt;
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.&lt;br /&gt;
&lt;br /&gt;
A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with a GlyGen project member, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.&lt;br /&gt;
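&lt;br /&gt;
&#039;&#039;The first task above (keyword filtering via the PubMed web API) can be sketched with the NCBI E-utilities esearch endpoint; the keyword and result-count values below are illustrative placeholders:&#039;&#039;&lt;br /&gt;

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities esearch endpoint for querying PubMed.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(keyword, retmax=20):
    """Build an E-utilities esearch URL for a PubMed keyword query."""
    params = {"db": "pubmed", "term": keyword,
              "retmode": "json", "retmax": retmax}
    return EUTILS + "?" + urlencode(params)


def search_pmids(keyword, retmax=20):
    """Return up to `retmax` PMIDs matching the keyword (network call)."""
    with urlopen(build_search_url(keyword, retmax)) as resp:
        data = json.load(resp)
    return data["esearchresult"]["idlist"]
```

&#039;&#039;The returned PMID list would then feed the abstract analysis and co-author filtering steps.&#039;&#039;&lt;br /&gt;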
&lt;br /&gt;
==== 4. PredictMod Machine Learning (ML) Modeling Project ====&lt;br /&gt;
POC: Lori Krammer, Pat McNeely (optional)&lt;br /&gt;
&lt;br /&gt;
Volunteers will conduct ML modeling using publicly available -omics datasets that were previously identified (see [[Recommended Publications for Intervention Outcome Prediction Models]]). This volunteership will involve data harmonization, model training, and pipeline documentation.&lt;br /&gt;
&lt;br /&gt;
Tasks associated with this project include:&lt;br /&gt;
&lt;br /&gt;
# Exploring and understanding the data found in relevant PMIDs that can be used to train intervention outcome prediction models.&lt;br /&gt;
# Preparing the data for model training and model performance evaluation.&lt;br /&gt;
# Testing the modeling tutorial, the PredictMod platform, and associated project tools.&lt;br /&gt;
# Documenting the ML pipeline and testing results.&lt;br /&gt;
Deliverables for this project include:&lt;br /&gt;
&lt;br /&gt;
# ML-ready datasets&lt;br /&gt;
# Trained model scripts&lt;br /&gt;
# Pipeline documentation captured in BioCompute Objects (BCOs) and testing reports&lt;br /&gt;
# Volunteership documentation (final report or weekly progress reports)&lt;br /&gt;
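&lt;br /&gt;
&#039;&#039;Schematically, the harmonize–train–evaluate loop behind these tasks looks like the snippet below. It uses random placeholder data and a nearest-centroid classifier purely for illustration, not the actual PredictMod models or datasets:&#039;&#039;&lt;br /&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data standing in for a harmonized -omics matrix:
# 100 samples x 5 features, with a binary intervention outcome.
X = rng.normal(size=(100, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# 1. Prepare/harmonize: z-score each feature column.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Split into training and held-out test sets.
train, test = slice(0, 80), slice(80, 100)

# 3. "Train" a nearest-centroid classifier (a stand-in for the real model).
centroids = np.stack([X[train][y[train] == c].mean(axis=0) for c in (0, 1)])

# 4. Evaluate: predict the closer centroid and score accuracy.
dists = np.linalg.norm(X[test][:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)
accuracy = float((pred == y[test]).mean())
```

&#039;&#039;Each of the four numbered steps above would be captured in the BCO pipeline documentation deliverable.&#039;&#039;&lt;br /&gt;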
&lt;br /&gt;
Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.&lt;br /&gt;
&lt;br /&gt;
==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; For anyone interested in ARGOS, you may be assigned to another project of your choice. This project is contingent on a contract extension. Please complete your project selection in order of preference.&lt;br /&gt;
&lt;br /&gt;
POC: Christie Woodside&lt;br /&gt;
&lt;br /&gt;
Qualifications: basic-to-intermediate programming skills and familiarity with common bioinformatics platforms and tools.&lt;br /&gt;
&lt;br /&gt;
# Curate and report on currently circulating pathogens to upload to ARGOS&lt;br /&gt;
## The student would work on manual curation of circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports on what was found.&lt;br /&gt;
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.&lt;br /&gt;
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.&lt;br /&gt;
# QC Analysis using HIVE&lt;br /&gt;
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.&lt;br /&gt;
## The results will be added to our ARGOS database.&lt;br /&gt;
# Report Results&lt;br /&gt;
## Defend the pathogens you have selected to be added to the database. Explain their importance and the value they would hold for the scientific community if they were added.&lt;br /&gt;
&lt;br /&gt;
If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check whether it would be feasible as a project for the Spring.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Requirements for Completion ===&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; The following are mandatory. Failure to complete any will result in an incomplete volunteer record.&lt;br /&gt;
&lt;br /&gt;
==== Documentation ====&lt;br /&gt;
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.&lt;br /&gt;
&lt;br /&gt;
==== Written Report ====&lt;br /&gt;
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.&lt;br /&gt;
&lt;br /&gt;
==== Presentation &amp;amp; Slide Submission ====&lt;br /&gt;
Present your work during the last week of the 13-week period.&lt;br /&gt;
&lt;br /&gt;
Slides must be submitted to the POCs.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Completion Certificate ===&lt;br /&gt;
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program. Additional recognition will be given to the top three volunteers with exceptional presentations at the end of the program.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Contact ===&lt;br /&gt;
mazumder_lab@gwu.edu.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Volunteers (TBD) ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+&lt;br /&gt;
!Name&lt;br /&gt;
!Project Assigned&lt;br /&gt;
!POC Assigned&lt;br /&gt;
!Projects Interested&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/diya-kamalabharathy-62557935a/ Diya Kamalabharathy*]&lt;br /&gt;
|PredictMod&lt;br /&gt;
|Lori Krammer; Urnisha Bhuiyan; Rene Ranzinger&lt;br /&gt;
|PredictMod; Glyco web development&lt;br /&gt;
|-&lt;br /&gt;
|Sampurna Chakravorty&lt;br /&gt;
|PredictMod&lt;br /&gt;
|Lori Krammer&lt;br /&gt;
|PredictMod; ARGOS; BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|Vishal Muthusekaran*&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|Maria Kim; Cyrus Yeung; Jeet Vora&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/ashley-tien/ Ashley Tien*]&lt;br /&gt;
|PredictMod&lt;br /&gt;
|Lori Krammer&lt;br /&gt;
|PredictMod&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/conner-cognata/ Conner Cognata]&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|Maria Kim; Cyrus Yeung; Jeet Vora&lt;br /&gt;
|BiomarkerKB; PredictMod; GlyGen biocuration&lt;br /&gt;
|-&lt;br /&gt;
|Venya Gulati&lt;br /&gt;
|ARGOS&lt;br /&gt;
|Christie Woodside&lt;br /&gt;
|ARGOS; PredictMod; BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|Isaac Kim&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|PredictMod; GlyGen biocuration; ARGOS&lt;br /&gt;
|-&lt;br /&gt;
|Yashitha Pobbareddy&lt;br /&gt;
|ARGOS&lt;br /&gt;
|Christie Woodside&lt;br /&gt;
|ARGOS; GlyGen biocuration; BiomarkerKB&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;nowiki&amp;gt;*&amp;lt;/nowiki&amp;gt;Returning volunteer.&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Volunteership_Fall_2025&amp;diff=1134</id>
		<title>Volunteership Fall 2025</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Volunteership_Fall_2025&amp;diff=1134"/>
		<updated>2026-01-14T16:43:08Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: /* 2025 Volunteer Program Details */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 2025 Volunteer Program Details ==&lt;br /&gt;
Click here for Spring 2026 Volunteership: [https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_Spring_2026 Spring 2026 Volunteership]&lt;br /&gt;
&#039;&#039;&#039;Previous Volunteerships&#039;&#039;&#039;&lt;br /&gt;
[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_2025 Summer 2025 Volunteership] (Closed)&lt;br /&gt;
&lt;br /&gt;
=== Dates ===&lt;br /&gt;
&#039;&#039;&#039;Application Deadline&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
August 22, 2025, Noon (email your updated resume and your project choices in order of preference). Acceptance letters/emails will be sent to candidates no later than the day after the kick-off meeting.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Volunteer Zoom Kick-Off Meeting&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
August 25, 2025 | 4:00 to 5:00 PM&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Program Dates: September 1st, 2025 – November 30th, 2025&#039;&#039;&#039; (13 weeks)&lt;br /&gt;
&lt;br /&gt;
Remote | Hybrid for GW employees and students (Ross Hall 5th floor)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Volunteer Expectations ===&lt;br /&gt;
&lt;br /&gt;
# Minimum commitment of 10 hours per week.&lt;br /&gt;
# Progress updates via Slack at least 3 days per week (scrum).&lt;br /&gt;
# Regular Zoom meetings with the assigned project point of contact.&lt;br /&gt;
# Attend some lectures or seminars remotely (max 4-5).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Important:&#039;&#039;&#039; If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.&#039;&#039;&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Potential Projects ===&lt;br /&gt;
We are excited to continue our bioinformatics volunteership program in Fall 2025. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email your resume and a ranked list of the projects that interest you most to mazumder_lab@gwu.edu. You can also indicate whether you want to focus on specific areas of interest.&lt;br /&gt;
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.&lt;br /&gt;
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.&lt;br /&gt;
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.&lt;br /&gt;
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs to build a corpus for training an LLM to recommend publications with datasets useful for intervention outcome prediction models.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.&#039;&#039;&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
==== 1. BiomarkerKB Biocuration Project Ideas ====&lt;br /&gt;
POC: Daniall Masood, Maria Kim&lt;br /&gt;
&lt;br /&gt;
# Curate biomarkers for a specific disease&lt;br /&gt;
## The student would be doing manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.&lt;br /&gt;
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.&lt;br /&gt;
# Top 50 biomarkers&lt;br /&gt;
## Curate the top 50 biomarkers for biomarkerkb.org.&lt;br /&gt;
## Define what constitutes a top 50 biomarker.&lt;br /&gt;
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.&lt;br /&gt;
# Biocuration of biomarkers from NLP/LLM work&lt;br /&gt;
## Use the biomarkers collected from NLP work.&lt;br /&gt;
## Curate the biomarkers; the data from the NLP work was not provided in the biomarker data model format.&lt;br /&gt;
## While curating the biomarkers, check if data collected from NLP is correct.&lt;br /&gt;
## After completion, the student can start using curated data to work on the NLP/LLM method.&lt;br /&gt;
# Curate biomarkers for a treatment&lt;br /&gt;
## See #1 above.&lt;br /&gt;
# Continue working on LLM methods started by volunteers over the summer.&lt;br /&gt;
## The data is available as well as some preliminary research and work done by previous volunteers in this area.&lt;br /&gt;
&lt;br /&gt;
If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check whether it would be feasible as a project for the fall.&lt;br /&gt;
&lt;br /&gt;
==== 2. GlyGen Biocuration Project Ideas ====&lt;br /&gt;
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner&lt;br /&gt;
&lt;br /&gt;
Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because much of the valuable metadata (e.g., species, tissue, disease, and cell line annotations) consists of free-text terms that do not align with the established standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, &amp;quot;human,&amp;quot; &amp;quot;man,&amp;quot; and &amp;quot;h. sapiens&amp;quot; all map to the scientific species name &amp;quot;Homo sapiens.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.&lt;br /&gt;
&lt;br /&gt;
The project involves:&lt;br /&gt;
&lt;br /&gt;
# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.&lt;br /&gt;
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.&lt;br /&gt;
# Finding papers from titles and author lists, which may themselves contain spelling errors.&lt;br /&gt;
# Interacting and discussing with other curators in case terms are mapped differently.&lt;br /&gt;
&lt;br /&gt;
If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.&lt;br /&gt;
&lt;br /&gt;
==== 3. GlyGen Publication Analysis Project Ideas ====&lt;br /&gt;
POC: Rene Ranzinger and Urnisha Bhuiyan&lt;br /&gt;
&lt;br /&gt;
One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.&lt;br /&gt;
&lt;br /&gt;
The project involves:&lt;br /&gt;
&lt;br /&gt;
# Using the PubMed web API to filter publications based on keywords.&lt;br /&gt;
# Analyzing paper abstracts to identify research institutions and groups that form the community.&lt;br /&gt;
# Filtering the community list to exclude unrelated co-authors.&lt;br /&gt;
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.&lt;br /&gt;
&lt;br /&gt;
A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with a GlyGen project member, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.&lt;br /&gt;
&lt;br /&gt;
==== 4. PredictMod Machine Learning Project Ideas ====&lt;br /&gt;
POC: Lori Krammer, Tianyi Wang, Pat McNeely (optional)&lt;br /&gt;
&lt;br /&gt;
Identifying relevant and useful publicly available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet and annotated PDFs for PubMed articles relevant to prostate, lung, and breast cancers, biomarkers, and glycans, focusing on indicators such as condition, intervention, and response.&lt;br /&gt;
&lt;br /&gt;
PMID curation involves:&lt;br /&gt;
&lt;br /&gt;
# Identifying potentially relevant PMIDs that may have publicly available datasets for training intervention outcome prediction models.&lt;br /&gt;
# Curating indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.&lt;br /&gt;
# Reviewing peer curations and resolving annotation conflicts.&lt;br /&gt;
# Preparing a Wikipage to showcase the validated PMIDs.&lt;br /&gt;
&lt;br /&gt;
Interested individuals should reach out to lorikrammer@gwu.edu.&lt;br /&gt;
&lt;br /&gt;
==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====&lt;br /&gt;
&lt;br /&gt;
POC: Christie Woodside, Jonathon Keeney&lt;br /&gt;
&lt;br /&gt;
# Update data tables for more efficient computations&lt;br /&gt;
## The student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires high attention to detail.&lt;br /&gt;
## Additional Work: Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.&lt;br /&gt;
# Curate and report on current pathogens to upload to ARGOS&lt;br /&gt;
## The student would work on manual curation of circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports on what was found.&lt;br /&gt;
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.&lt;br /&gt;
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.&lt;br /&gt;
# QC Analysis using HIVE&lt;br /&gt;
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.&lt;br /&gt;
## The results will be added to our ARGOS database.&lt;br /&gt;
&lt;br /&gt;
If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check whether it would be feasible as a project for the Fall.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Requirements for Completion ===&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; The following are mandatory. Failure to complete any will result in an incomplete volunteer record.&lt;br /&gt;
&lt;br /&gt;
==== Documentation ====&lt;br /&gt;
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.&lt;br /&gt;
&lt;br /&gt;
==== Written Report ====&lt;br /&gt;
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.&lt;br /&gt;
&lt;br /&gt;
==== Presentation &amp;amp; Slide Submission ====&lt;br /&gt;
Present your work during the last week of the 13-week period.&lt;br /&gt;
&lt;br /&gt;
Slides must be submitted to the POCs.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Completion Certificate ===&lt;br /&gt;
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Contact ===&lt;br /&gt;
mazumder_lab@gwu.edu.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Volunteers ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+&lt;br /&gt;
!Name&lt;br /&gt;
!Project Assigned&lt;br /&gt;
!POC Assigned&lt;br /&gt;
!Projects Interested&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/diya-kamalabharathy-62557935a/ Diya Kamalabharathy*]&lt;br /&gt;
|PredictMod&lt;br /&gt;
|Lori Krammer, Tianyi Wang&lt;br /&gt;
|PredictMod; Glyco web development&lt;br /&gt;
|-&lt;br /&gt;
|Harivinay P. Gujjula*&lt;br /&gt;
|GlyGen&lt;br /&gt;
|Rene Ranzinger, Urnisha Bhuiyan, Kate Warner&lt;br /&gt;
|GlyGen&lt;br /&gt;
|-&lt;br /&gt;
|Sparsh Gupta*&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|Daniall Masood, Maria Kim&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|Vishal Muthusekaran&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|Daniall Masood, Maria Kim&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|Miao Wang*&lt;br /&gt;
|ARGOS&lt;br /&gt;
|Christie Woodside, Jonathon Keeney&lt;br /&gt;
|ARGOS&lt;br /&gt;
|-&lt;br /&gt;
|Anika Sikka&lt;br /&gt;
|PredictMod&lt;br /&gt;
|Lori Krammer, Tianyi Wang&lt;br /&gt;
|PredictMod&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/farah-kamila/ Farah Kamila]&lt;br /&gt;
|PredictMod&lt;br /&gt;
|Lori Krammer, Tianyi Wang&lt;br /&gt;
|PredictMod, ARGOS, BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/arhamur-rauf-2a61b3156 Arhamur Rauf]&lt;br /&gt;
|ARGOS&lt;br /&gt;
|Christie Woodside, Jonathon Keeney&lt;br /&gt;
|ARGOS, GlyGen, PredictMod&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/ashley-tien/ Ashley Tien]&lt;br /&gt;
|PredictMod&lt;br /&gt;
|Lori Krammer, Tianyi Wang&lt;br /&gt;
|ARGOS, PredictMod&lt;br /&gt;
|-&lt;br /&gt;
|Namrata Oruganti&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|Daniall Masood, Maria Kim&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;nowiki&amp;gt;*&amp;lt;/nowiki&amp;gt;Returning volunteer.&lt;br /&gt;
&lt;br /&gt;
== Fall Symposium ==&lt;br /&gt;
The Fall symposium will be held virtually.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Date:&#039;&#039;&#039; Nov 26th, 2025 (Wednesday)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Time:&#039;&#039;&#039; 3 - 5 PM&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Zoom Link&#039;&#039;&#039; - https://gwu-edu.zoom.us/j/96518488501?jst=2&lt;br /&gt;
&lt;br /&gt;
=== Agenda (All times are in Eastern Standard Time) ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+&lt;br /&gt;
!Time&lt;br /&gt;
!Project&lt;br /&gt;
!Presentation Title&lt;br /&gt;
!Presenter(s)&lt;br /&gt;
|-&lt;br /&gt;
|3:00-3:10 PM&lt;br /&gt;
| colspan=&amp;quot;2&amp;quot; |Welcome &amp;amp; Introduction&lt;br /&gt;
|Raja Mazumder&lt;br /&gt;
|-&lt;br /&gt;
|3:10-3:35 PM&lt;br /&gt;
|PredictMod&lt;br /&gt;
|&lt;br /&gt;
* 5 min POC (Tianyi &amp;amp; Lori) intro&lt;br /&gt;
* 15 mins - PredictMod: PMID Curation for Intervention Outcome Prediction Models (IOPMs)&lt;br /&gt;
* 5 min QA&lt;br /&gt;
|Diya Kamalabharathy; Anika Sikka; Ashley Tien; Farah Kamila&lt;br /&gt;
|-&lt;br /&gt;
|3:35-4:00 PM&lt;br /&gt;
|GlyGen&lt;br /&gt;
|&lt;br /&gt;
* 5 min POC intro (Urnisha, Rene, Kate)&lt;br /&gt;
* 15 mins - Curation of species metadata using LLM &amp;amp; Visualizing glycomics databases and their features&lt;br /&gt;
* 5 min QA&lt;br /&gt;
|Diya Kamalabharathy; Harivinay P. Gujjula&lt;br /&gt;
|-&lt;br /&gt;
|4:00-4:25 PM&lt;br /&gt;
|ARGOS&lt;br /&gt;
|&lt;br /&gt;
* 5 min POC (Christie) intro&lt;br /&gt;
* 15 mins - Curation of Pathogens and QC Analysis for the ARGOS Project: QC analysis, representative genome selection, and curation of genomes 1 &amp;amp; 2&lt;br /&gt;
* 5 mins QA&lt;br /&gt;
|Miao Wang; Arhamur Rauf&lt;br /&gt;
|-&lt;br /&gt;
|4:25-4:50 PM&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|&lt;br /&gt;
* 5 min POC (Daniall &amp;amp; Maria) intro &lt;br /&gt;
* 15 mins - Leveraging Large Language Models to collect Biomarker data from PubMed Abstracts&lt;br /&gt;
* 5 mins QA&lt;br /&gt;
|Namrata Oruganti; Vishal Muthusekaran; Sparsh Gupta&lt;br /&gt;
|-&lt;br /&gt;
|4:50-5:00 PM&lt;br /&gt;
| colspan=&amp;quot;2&amp;quot; |Remarks&lt;br /&gt;
|Raja Mazumder&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Volunteership_Fall_2025&amp;diff=1099</id>
		<title>Volunteership Fall 2025</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Volunteership_Fall_2025&amp;diff=1099"/>
		<updated>2025-11-26T20:22:00Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: /* Volunteers */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 2025 Volunteer Program Details ==&lt;br /&gt;
&lt;br /&gt;
=== Dates ===&lt;br /&gt;
&#039;&#039;&#039;Application Deadline&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
August 22, 2025, Noon (email your updated resume and your project choices in order of preference). Acceptance letters/emails will be sent to candidates no later than the day after the kick-off meeting.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Volunteer Zoom Kick-Off Meeting&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
August 25, 2025 | 4:00 to 5:00 PM&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Program Dates: September 1st, 2025 – November 30th, 2025&#039;&#039;&#039; (13 weeks)&lt;br /&gt;
&lt;br /&gt;
Remote | Hybrid for GW employees and students (Ross Hall 5th floor)&lt;br /&gt;
&lt;br /&gt;
[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_2025 Summer 2025 Volunteership] (Closed)&lt;br /&gt;
&lt;br /&gt;
[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_Spring_2026 Spring 2026 Volunteership]&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Volunteer Expectations ===&lt;br /&gt;
&lt;br /&gt;
# Minimum commitment of 10 hours per week.&lt;br /&gt;
# Progress updates via Slack at least 3 days per week (scrum).&lt;br /&gt;
# Regular Zoom meetings with the assigned project point of contact.&lt;br /&gt;
# Attend some lectures or seminars remotely (max 4-5).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Important:&#039;&#039;&#039; If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.&#039;&#039;&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Potential Projects ===&lt;br /&gt;
We are excited to continue our bioinformatics volunteership program in Fall 2025. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email your resume and a ranked list of the projects that interest you most to mazumder_lab@gwu.edu. You can also indicate specific areas you would like to focus on.&lt;br /&gt;
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.&lt;br /&gt;
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.&lt;br /&gt;
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.&lt;br /&gt;
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs to build a training dataset for an LLM that recommends publications for intervention outcome prediction.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.&#039;&#039;&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
==== 1. BiomarkerKB Biocuration Project Ideas ====&lt;br /&gt;
POC: Daniall Masood, Maria Kim&lt;br /&gt;
&lt;br /&gt;
# Curate biomarkers for a specific disease&lt;br /&gt;
## The student would perform manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.&lt;br /&gt;
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.&lt;br /&gt;
# Top 50 biomarkers&lt;br /&gt;
## Curate the top 50 biomarkers for biomarkerkb.org.&lt;br /&gt;
## Define what constitutes a top 50 biomarker.&lt;br /&gt;
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.&lt;br /&gt;
# Biocuration of biomarkers from NLP/LLM work&lt;br /&gt;
## Use the biomarkers collected from NLP work.&lt;br /&gt;
## Curate biomarkers; the data from the NLP work was not provided in the biomarker data model format.&lt;br /&gt;
## While curating the biomarkers, check if data collected from NLP is correct.&lt;br /&gt;
## After completion, the student can start using curated data to work on the NLP/LLM method.&lt;br /&gt;
# Curate biomarkers for a treatment&lt;br /&gt;
## See #1 above.&lt;br /&gt;
# Continue working on LLM methods started by volunteers over the summer.&lt;br /&gt;
## The data is available as well as some preliminary research and work done by previous volunteers in this area.&lt;br /&gt;
&lt;br /&gt;
If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check whether it is feasible as a project for the fall.&lt;br /&gt;
&lt;br /&gt;
==== 2. GlyGen Biocuration Project Ideas ====&lt;br /&gt;
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner&lt;br /&gt;
&lt;br /&gt;
Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the established standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, &amp;quot;human,&amp;quot; &amp;quot;man,&amp;quot; and &amp;quot;h. sapiens&amp;quot; all map to the scientific species name &amp;quot;Homo sapiens.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.&lt;br /&gt;
&lt;br /&gt;
The project involves:&lt;br /&gt;
&lt;br /&gt;
# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.&lt;br /&gt;
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.&lt;br /&gt;
# Finding papers based on titles and author lists that may contain spelling errors.&lt;br /&gt;
# Interacting and discussing with other curators in case terms are mapped differently.&lt;br /&gt;
&lt;br /&gt;
If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.&lt;br /&gt;
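The species-synonym examples above (human, man, h. sapiens all mapping to Homo sapiens) can be sketched as a first-pass normalization helper in Python. This is purely illustrative: the synonym table and function name are hypothetical, and real curation maps terms to ontology identifiers (e.g., NCBI Taxonomy IDs) rather than bare strings.

```python
# Minimal sketch of free-text metadata normalization.
# The synonym table below is hypothetical; real curation consults
# standard dictionaries and ontologies (e.g., NCBI Taxonomy).

SPECIES_SYNONYMS = {
    "human": "Homo sapiens",
    "man": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
    "homo sapiens": "Homo sapiens",
    "mouse": "Mus musculus",
}

def normalize_species(term):
    """Map a free-text species term to its canonical name, or None if unmapped."""
    key = " ".join(term.strip().lower().split())  # trim and collapse whitespace
    return SPECIES_SYNONYMS.get(key)

# Terms the table cannot resolve return None.
terms = ["Human", "h. sapiens", "rat"]
mapped = {t: normalize_species(t) for t in terms}
```

Terms left unmapped would be queued for manual review, which is exactly where the curator effort described above comes in.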
&lt;br /&gt;
==== 3. GlyGen Publication Analysis Project Ideas ====&lt;br /&gt;
&lt;br /&gt;
POC: Rene Ranzinger and Urnisha Bhuiyan&lt;br /&gt;
&lt;br /&gt;
One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.&lt;br /&gt;
&lt;br /&gt;
The project involves:&lt;br /&gt;
&lt;br /&gt;
# Using the PubMed web API to filter publications based on keywords.&lt;br /&gt;
# Analyzing paper abstracts to identify research institutions and groups that form the community.&lt;br /&gt;
# Filtering the community list to exclude unrelated co-authors.&lt;br /&gt;
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.&lt;br /&gt;
&lt;br /&gt;
A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with a GlyGen project member, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.&lt;br /&gt;
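The first analysis step (keyword filtering of publications) can be sketched against the NCBI E-utilities esearch endpoint, which backs the PubMed web API. This is a sketch under stated assumptions, not the project codebase: the keyword list is hypothetical, and a production script should add an NCBI API key, rate limiting, and error handling.

```python
import json
import urllib.parse
import urllib.request

# NCBI E-utilities esearch endpoint (returns PMIDs matching a query).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(keywords, retmax=100):
    """Build an esearch URL whose JSON response lists PMIDs matching all keywords."""
    term = " AND ".join(keywords)  # hypothetical boolean query over the keywords
    query = urllib.parse.urlencode({
        "db": "pubmed",
        "term": term,
        "retmode": "json",
        "retmax": retmax,
    })
    return f"{EUTILS}?{query}"

def fetch_pmids(keywords):
    """Fetch matching PMIDs; requires network access."""
    with urllib.request.urlopen(build_esearch_url(keywords)) as resp:
        data = json.load(resp)
    return data["esearchresult"]["idlist"]

url = build_esearch_url(["glycosylation", "database"])
```

build_esearch_url is testable offline; fetch_pmids performs the real network call and would feed the abstract analysis and community-filtering steps that follow.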
&lt;br /&gt;
==== 4. PredictMod Machine Learning Project Ideas ====&lt;br /&gt;
POC: Lori Krammer, Tianyi Wang, Pat McNeely (optional)&lt;br /&gt;
&lt;br /&gt;
Identifying relevant and useful publicly available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet and annotated PDFs for PubMed articles relevant to prostate, lung, and breast cancers, biomarkers, and glycans, and will focus on indicators such as condition, intervention, and response.&lt;br /&gt;
&lt;br /&gt;
PMID curation involves:&lt;br /&gt;
&lt;br /&gt;
# Identifying potentially relevant PMIDs that may have publicly available datasets for training intervention outcome prediction models.&lt;br /&gt;
# Curating indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.&lt;br /&gt;
# Reviewing peer curations and resolving annotation conflicts.&lt;br /&gt;
# Preparing a wiki page to showcase the validated PMIDs.&lt;br /&gt;
&lt;br /&gt;
Interested individuals should reach out to lorikrammer@gwu.edu.&lt;br /&gt;
&lt;br /&gt;
==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====&lt;br /&gt;
&lt;br /&gt;
POC: Christie Woodside, Jonathon Keeney&lt;br /&gt;
&lt;br /&gt;
# Update data tables for more efficient computations&lt;br /&gt;
## The student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires high attention to detail.&lt;br /&gt;
## Additional Work: Requires a Python/shell coding background. The student would run scripts that prepare and format data tables that are pushed to data.argosdb.org; coding knowledge is needed to handle errors, bugs, or other mishaps in the code. This is ongoing work as computations are performed.&lt;br /&gt;
# Curate and report on current pathogens to upload to ARGOS&lt;br /&gt;
## The student would manually curate circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports on what was found.&lt;br /&gt;
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.&lt;br /&gt;
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.&lt;br /&gt;
# QC Analysis using HIVE&lt;br /&gt;
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.&lt;br /&gt;
## The results will be added to our ARGOS database.&lt;br /&gt;
&lt;br /&gt;
If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check whether it is feasible as a project for the fall.&lt;br /&gt;
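As a rough illustration of the table preparation in idea 1 (the column names and validation rules here are hypothetical, not the actual ARGOS data model), a helper script might validate curated rows and emit a tab-separated table before anything is pushed to data.argosdb.org:

```python
import csv
import io

# Hypothetical required columns; the real ARGOS data model defines its own fields.
REQUIRED = ["organism_name", "assembly_id", "sra_run_id"]

def to_tsv(rows):
    """Validate curated rows and render them as a TSV string.

    Rows missing a required field raise ValueError so problems are caught
    before the table is pushed to the data portal.
    """
    for i, row in enumerate(rows):
        missing = [c for c in REQUIRED if not row.get(c)]
        if missing:
            raise ValueError(f"row {i} missing fields: {missing}")
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=REQUIRED, delimiter="\t")
    writer.writeheader()
    writer.writerows({c: row[c] for c in REQUIRED} for row in rows)
    return buf.getvalue()

# Illustrative row only; values are examples, not curated ARGOS records.
rows = [{"organism_name": "Escherichia coli", "assembly_id": "GCA_000005845.2",
         "sra_run_id": "SRR000001"}]
tsv = to_tsv(rows)
```

Failing loudly on missing fields keeps malformed rows out of the submission, which matters when the tables feed downstream computations.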
----&lt;br /&gt;
&lt;br /&gt;
=== Requirements for Completion ===&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; The following are mandatory. Failure to complete any of them will result in an incomplete volunteer record.&lt;br /&gt;
&lt;br /&gt;
==== Documentation ====&lt;br /&gt;
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.&lt;br /&gt;
&lt;br /&gt;
==== Written Report ====&lt;br /&gt;
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.&lt;br /&gt;
&lt;br /&gt;
==== Presentation &amp;amp; Slide Submission ====&lt;br /&gt;
Present your work during the last week of the 13-week period.&lt;br /&gt;
&lt;br /&gt;
Slides must be submitted to the POCs.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Completion Certificate ===&lt;br /&gt;
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Contact ===&lt;br /&gt;
mazumder_lab@gwu.edu.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Volunteers ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+&lt;br /&gt;
!Name&lt;br /&gt;
!Project Assigned&lt;br /&gt;
!POC Assigned&lt;br /&gt;
!Projects Interested&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/diya-kamalabharathy-62557935a/ Diya Kamalabharathy*]&lt;br /&gt;
|PredictMod&lt;br /&gt;
|Lori Krammer, Tianyi Wang&lt;br /&gt;
|PredictMod; Glyco web development&lt;br /&gt;
|-&lt;br /&gt;
|Harivinay P. Gujjula*&lt;br /&gt;
|GlyGen&lt;br /&gt;
|Rene Ranzinger, Urnisha Bhuiyan, Kate Warner&lt;br /&gt;
|GlyGen&lt;br /&gt;
|-&lt;br /&gt;
|Sparsh Gupta*&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|Daniall Masood, Maria Kim&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|Vishal Muthusekaran&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|Daniall Masood, Maria Kim&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|Miao Wang*&lt;br /&gt;
|ARGOS&lt;br /&gt;
|Christie Woodside, Jonathon Keeney&lt;br /&gt;
|ARGOS&lt;br /&gt;
|-&lt;br /&gt;
|Anika Sikka&lt;br /&gt;
|PredictMod&lt;br /&gt;
|Lori Krammer, Tianyi Wang&lt;br /&gt;
|PredictMod&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/farah-kamila/ Farah Kamila]&lt;br /&gt;
|PredictMod&lt;br /&gt;
|Lori Krammer, Tianyi Wang&lt;br /&gt;
|PredictMod, ARGOS, BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/arhamur-rauf-2a61b3156 Arhamur Rauf]&lt;br /&gt;
|ARGOS&lt;br /&gt;
|Christie Woodside, Jonathon Keeney&lt;br /&gt;
|ARGOS, GlyGen, PredictMod&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/ashley-tien/ Ashley Tien]&lt;br /&gt;
|PredictMod&lt;br /&gt;
|Lori Krammer, Tianyi Wang&lt;br /&gt;
|ARGOS, PredictMod&lt;br /&gt;
|-&lt;br /&gt;
|Namrata Oruganti&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|Daniall Masood, Maria Kim&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;nowiki&amp;gt;*&amp;lt;/nowiki&amp;gt;Returning volunteer.&lt;br /&gt;
&lt;br /&gt;
== Fall Symposium ==&lt;br /&gt;
The Fall symposium will be held virtually.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Date:&#039;&#039;&#039; Nov 26th, 2025 (Wednesday)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Time:&#039;&#039;&#039; 3 - 5 PM&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Zoom Link&#039;&#039;&#039; - https://gwu-edu.zoom.us/j/96518488501?jst=2&lt;br /&gt;
&lt;br /&gt;
=== Agenda (All times are in Eastern Standard Time) ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+&lt;br /&gt;
!Time&lt;br /&gt;
!Project&lt;br /&gt;
!Presentation Title&lt;br /&gt;
!Presenter(s)&lt;br /&gt;
|-&lt;br /&gt;
|3:00-3:10 PM&lt;br /&gt;
| colspan=&amp;quot;2&amp;quot; |Welcome &amp;amp; Introduction&lt;br /&gt;
|Raja Mazumder&lt;br /&gt;
|-&lt;br /&gt;
|3:10-3:35 PM&lt;br /&gt;
|PredictMod&lt;br /&gt;
|&lt;br /&gt;
* 5 min POC (Tianyi &amp;amp; Lori) intro&lt;br /&gt;
* 15 mins - PredictMod: PMID Curation for Intervention Outcome Prediction Models (IOPMs)&lt;br /&gt;
* 5 min QA&lt;br /&gt;
|Diya Kamalabharathy; Anika Sikka; Ashley Tien; Farah Kamila&lt;br /&gt;
|-&lt;br /&gt;
|3:35-4:00 PM&lt;br /&gt;
|GlyGen&lt;br /&gt;
|&lt;br /&gt;
* 5 min POC intro (Urnisha, Rene, Kate)&lt;br /&gt;
* 15 mins - Curation of species metadata using LLM &amp;amp; Visualizing glycomics databases and their features&lt;br /&gt;
* 5 min QA&lt;br /&gt;
|Diya Kamalabharathy; Harivinay P. Gujjula&lt;br /&gt;
|-&lt;br /&gt;
|4:00-4:25 PM&lt;br /&gt;
|ARGOS&lt;br /&gt;
|&lt;br /&gt;
* 5 min POC (Christie) intro&lt;br /&gt;
* 15 min - Curation of Pathogens and QC Analysis for the ARGOS Project: QC analysis, representative genome selection, and curation of genomes 1 &amp;amp; 2&lt;br /&gt;
* 5 mins QA&lt;br /&gt;
|Miao Wang; Arhamur Rauf&lt;br /&gt;
|-&lt;br /&gt;
|4:25-4:50 PM&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|&lt;br /&gt;
* 5 min POC (Daniall &amp;amp; Maria) intro &lt;br /&gt;
* 15 mins - Leveraging Large Language Models to collect Biomarker data from PubMed Abstracts&lt;br /&gt;
* 5 mins QA&lt;br /&gt;
|Namrata Oruganti; Vishal Muthusekaran; Sparsh Gupta&lt;br /&gt;
|-&lt;br /&gt;
|4:50-5:00 PM&lt;br /&gt;
| colspan=&amp;quot;2&amp;quot; |Remarks&lt;br /&gt;
|Raja Mazumder&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Volunteership_Fall_2025&amp;diff=1095</id>
		<title>Volunteership Fall 2025</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Volunteership_Fall_2025&amp;diff=1095"/>
		<updated>2025-11-26T17:52:10Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: /* Agenda (All times are in Eastern Standard Time) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 2025 Volunteer Program Details ==&lt;br /&gt;
&lt;br /&gt;
=== Dates ===&lt;br /&gt;
&#039;&#039;&#039;Application Deadline&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
August 22, 2025, Noon (email your updated resume and your projects of interest in order of preference). An acceptance letter/email will be sent to candidates no later than the day after the kick-off meeting.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Volunteer Zoom Kick-Off Meeting&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
August 25, 2025 | 4:00 to 5:00 PM&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Program Dates: September 1st, 2025 – November 30th, 2025&#039;&#039;&#039; (13 weeks)&lt;br /&gt;
&lt;br /&gt;
Remote | Hybrid for GW employees and students (Ross Hall 5th floor)&lt;br /&gt;
&lt;br /&gt;
[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_2025 Summer 2025 Volunteership] (Closed)&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Volunteer Expectations ===&lt;br /&gt;
&lt;br /&gt;
# Minimum commitment of 10 hours per week.&lt;br /&gt;
# Progress updates via Slack at least 3 days per week (scrum).&lt;br /&gt;
# Regular Zoom meetings with the assigned project point of contact.&lt;br /&gt;
# Attend some lectures or seminars remotely (max 4-5).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Important:&#039;&#039;&#039; If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.&#039;&#039;&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Potential Projects ===&lt;br /&gt;
We are excited to continue our bioinformatics volunteership program in Fall 2025. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email your resume and a ranked list of the projects that interest you most to mazumder_lab@gwu.edu. You can also indicate specific areas you would like to focus on.&lt;br /&gt;
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.&lt;br /&gt;
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.&lt;br /&gt;
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.&lt;br /&gt;
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs to build a training dataset for an LLM that recommends publications for intervention outcome prediction.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.&#039;&#039;&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
==== 1. BiomarkerKB Biocuration Project Ideas ====&lt;br /&gt;
POC: Daniall Masood, Maria Kim&lt;br /&gt;
&lt;br /&gt;
# Curate biomarkers for a specific disease&lt;br /&gt;
## The student would perform manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.&lt;br /&gt;
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.&lt;br /&gt;
# Top 50 biomarkers&lt;br /&gt;
## Curate the top 50 biomarkers for biomarkerkb.org.&lt;br /&gt;
## Define what constitutes a top 50 biomarker.&lt;br /&gt;
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.&lt;br /&gt;
# Biocuration of biomarkers from NLP/LLM work&lt;br /&gt;
## Use the biomarkers collected from NLP work.&lt;br /&gt;
## Curate biomarkers; the data from the NLP work was not provided in the biomarker data model format.&lt;br /&gt;
## While curating the biomarkers, check if data collected from NLP is correct.&lt;br /&gt;
## After completion, the student can start using curated data to work on the NLP/LLM method.&lt;br /&gt;
# Curate biomarkers for a treatment&lt;br /&gt;
## See #1 above.&lt;br /&gt;
# Continue working on LLM methods started by volunteers over the summer.&lt;br /&gt;
## The data is available as well as some preliminary research and work done by previous volunteers in this area.&lt;br /&gt;
&lt;br /&gt;
If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check whether it is feasible as a project for the fall.&lt;br /&gt;
&lt;br /&gt;
==== 2. GlyGen Biocuration Project Ideas ====&lt;br /&gt;
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner&lt;br /&gt;
&lt;br /&gt;
Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the established standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, &amp;quot;human,&amp;quot; &amp;quot;man,&amp;quot; and &amp;quot;h. sapiens&amp;quot; all map to the scientific species name &amp;quot;Homo sapiens.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.&lt;br /&gt;
&lt;br /&gt;
The project involves:&lt;br /&gt;
&lt;br /&gt;
# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.&lt;br /&gt;
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.&lt;br /&gt;
# Finding papers based on titles and author lists that may contain spelling errors.&lt;br /&gt;
# Interacting and discussing with other curators in case terms are mapped differently.&lt;br /&gt;
&lt;br /&gt;
If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.&lt;br /&gt;
&lt;br /&gt;
==== 3. GlyGen Publication Analysis Project Ideas ====&lt;br /&gt;
&lt;br /&gt;
POC: Rene Ranzinger and Urnisha Bhuiyan&lt;br /&gt;
&lt;br /&gt;
One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.&lt;br /&gt;
&lt;br /&gt;
The project involves:&lt;br /&gt;
&lt;br /&gt;
# Using the PubMed web API to filter publications based on keywords.&lt;br /&gt;
# Analyzing paper abstracts to identify research institutions and groups that form the community.&lt;br /&gt;
# Filtering the community list to exclude unrelated co-authors.&lt;br /&gt;
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.&lt;br /&gt;
&lt;br /&gt;
A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with a GlyGen project member, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.&lt;br /&gt;
&lt;br /&gt;
==== 4. PredictMod Machine Learning Project Ideas ====&lt;br /&gt;
POC: Lori Krammer, Tianyi Wang, Pat McNeely (optional)&lt;br /&gt;
&lt;br /&gt;
Identifying relevant and useful publicly available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet and annotated PDFs for PubMed articles relevant to prostate, lung, and breast cancers, biomarkers, and glycans, and will focus on indicators such as condition, intervention, and response.&lt;br /&gt;
&lt;br /&gt;
PMID curation involves:&lt;br /&gt;
&lt;br /&gt;
# Identifying potentially relevant PMIDs that may have publicly available datasets for training intervention outcome prediction models.&lt;br /&gt;
# Curating indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.&lt;br /&gt;
# Reviewing peer curations and resolving annotation conflicts.&lt;br /&gt;
# Preparing a wiki page to showcase the validated PMIDs.&lt;br /&gt;
&lt;br /&gt;
Interested individuals should reach out to lorikrammer@gwu.edu.&lt;br /&gt;
&lt;br /&gt;
==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====&lt;br /&gt;
&lt;br /&gt;
POC: Christie Woodside, Jonathon Keeney&lt;br /&gt;
&lt;br /&gt;
# Update data tables for more efficient computations&lt;br /&gt;
## The student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires high attention to detail.&lt;br /&gt;
## Additional Work: Requires a Python/shell coding background. The student would run scripts that prepare and format data tables that are pushed to data.argosdb.org; coding knowledge is needed to handle errors, bugs, or other mishaps in the code. This is ongoing work as computations are performed.&lt;br /&gt;
# Curate and report on current pathogens to upload to ARGOS&lt;br /&gt;
## The student would manually curate circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports on what was found.&lt;br /&gt;
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.&lt;br /&gt;
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.&lt;br /&gt;
# QC Analysis using HIVE&lt;br /&gt;
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.&lt;br /&gt;
## The results will be added to our ARGOS database.&lt;br /&gt;
&lt;br /&gt;
If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check whether it is feasible as a project for the fall.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Requirements for Completion ===&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; The following are mandatory. Failure to complete any of them will result in an incomplete volunteer record.&lt;br /&gt;
&lt;br /&gt;
==== Documentation ====&lt;br /&gt;
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.&lt;br /&gt;
&lt;br /&gt;
==== Written Report ====&lt;br /&gt;
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.&lt;br /&gt;
&lt;br /&gt;
==== Presentation &amp;amp; Slide Submission ====&lt;br /&gt;
Present your work during the last week of the 13-week period.&lt;br /&gt;
&lt;br /&gt;
Slides must be submitted to the POCs.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Completion Certificate ===&lt;br /&gt;
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Contact ===&lt;br /&gt;
mazumder_lab@gwu.edu.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Volunteers ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+&lt;br /&gt;
!Name&lt;br /&gt;
!Project Assigned&lt;br /&gt;
!POC Assigned&lt;br /&gt;
!Projects Interested&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/diya-kamalabharathy-62557935a/ Diya Kamalabharathy*]&lt;br /&gt;
|PredictMod&lt;br /&gt;
|Lori Krammer, Tianyi Wang&lt;br /&gt;
|PredictMod; Glyco web development&lt;br /&gt;
|-&lt;br /&gt;
|Harivinay P. Gujjula*&lt;br /&gt;
|GlyGen&lt;br /&gt;
|Rene Ranzinger, Urnisha Bhuiyan, Kate Warner&lt;br /&gt;
|GlyGen&lt;br /&gt;
|-&lt;br /&gt;
|Sparsh Gupta*&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|Daniall Masood, Maria Kim&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|Vishal Muthusekaran&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|Daniall Masood, Maria Kim&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|Miao Wang*&lt;br /&gt;
|ARGOS&lt;br /&gt;
|Christie Woodside, Jonathon Keeney&lt;br /&gt;
|ARGOS&lt;br /&gt;
|-&lt;br /&gt;
|Anika Sikka&lt;br /&gt;
|PredictMod&lt;br /&gt;
|Lori Krammer, Tianyi Wang&lt;br /&gt;
|PredictMod&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/farah-kamila/ Farah Kamila]&lt;br /&gt;
|PredictMod&lt;br /&gt;
|Lori Krammer, Tianyi Wang&lt;br /&gt;
|PredictMod, ARGOS, BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|Arhamur Rauf&lt;br /&gt;
|ARGOS&lt;br /&gt;
|Christie Woodside, Jonathon Keeney&lt;br /&gt;
|ARGOS, GlyGen, PredictMod&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/ashley-tien/ Ashley Tien]&lt;br /&gt;
|PredictMod&lt;br /&gt;
|Lori Krammer, Tianyi Wang&lt;br /&gt;
|ARGOS, PredictMod&lt;br /&gt;
|-&lt;br /&gt;
|Namrata Oruganti&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|Daniall Masood, Maria Kim&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;nowiki&amp;gt;*&amp;lt;/nowiki&amp;gt;Returning volunteer.&lt;br /&gt;
&lt;br /&gt;
== Fall Symposium ==&lt;br /&gt;
The Fall symposium will be held virtually.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Date:&#039;&#039;&#039; Nov 26th, 2025 (Wednesday)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Time:&#039;&#039;&#039; 3 - 5 PM&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Zoom Link&#039;&#039;&#039; - https://gwu-edu.zoom.us/j/96518488501?jst=2&lt;br /&gt;
&lt;br /&gt;
=== Agenda (All times are in Eastern Standard Time) ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+&lt;br /&gt;
!Time&lt;br /&gt;
!Project&lt;br /&gt;
!Presentation Title&lt;br /&gt;
!Presenter(s)&lt;br /&gt;
|-&lt;br /&gt;
|3:00-3:10 PM&lt;br /&gt;
| colspan=&amp;quot;2&amp;quot; |Welcome &amp;amp; Introduction&lt;br /&gt;
|Raja Mazumder&lt;br /&gt;
|-&lt;br /&gt;
|3:10-3:35 PM&lt;br /&gt;
|PredictMod&lt;br /&gt;
|&lt;br /&gt;
* 5 min POC (Tianyi &amp;amp; Lori) intro&lt;br /&gt;
* 15 mins - PredictMod: PMID Curation for Intervention Outcome Prediction Models (IOPMs)&lt;br /&gt;
* 5 min QA&lt;br /&gt;
|Diya Kamalabharathy; Anika Sikka; Ashley Tien; Farah Kamila&lt;br /&gt;
|-&lt;br /&gt;
|3:35-4:00 PM&lt;br /&gt;
|GlyGen&lt;br /&gt;
|&lt;br /&gt;
* 5 min POC intro&lt;br /&gt;
* 15 mins - Curation of species metadata using LLM &amp;amp; Visualizing glycomics databases and their features&lt;br /&gt;
* 5 min QA&lt;br /&gt;
|Diya Kamalabharathy; Harivinay P. Gujjula&lt;br /&gt;
|-&lt;br /&gt;
|4:00-4:25 PM&lt;br /&gt;
|ARGOS&lt;br /&gt;
|&lt;br /&gt;
* 5 min POC (Christie) intro&lt;br /&gt;
* 15 min - Curation of Pathogens and QC Analysis for the ARGOS Project: QC analysis, representative genome selection, and curation of genomes 1 &amp;amp; 2&lt;br /&gt;
* 5 mins QA&lt;br /&gt;
|Miao Wang; Arhamur Rauf&lt;br /&gt;
|-&lt;br /&gt;
|4:25-4:50 PM&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|&lt;br /&gt;
* 5 min POC (Daniall and Maria) intro &lt;br /&gt;
* 15 mins - Leveraging Large Language Models to collect Biomarker data from PubMed Abstracts&lt;br /&gt;
* 5 mins QA&lt;br /&gt;
|Namrata Oruganti; Vishal Muthusekaran; Sparsh Gupta&lt;br /&gt;
|-&lt;br /&gt;
|4:50-5:00 PM&lt;br /&gt;
| colspan=&amp;quot;2&amp;quot; |Remarks&lt;br /&gt;
|Raja Mazumder&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Volunteership_Spring_2026&amp;diff=1089</id>
		<title>Volunteership Spring 2026</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Volunteership_Spring_2026&amp;diff=1089"/>
		<updated>2025-11-14T21:16:17Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: /* 4. PredictMod Machine Learning Project Ideas */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== 2026 Spring Volunteer Program Details ==&lt;br /&gt;
&lt;br /&gt;
=== Dates ===&lt;br /&gt;
&#039;&#039;&#039;Application Deadline&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
January 9, 2026, Noon (email your updated resume and a list of projects in order of preference). An acceptance letter/email will be sent to candidates no later than the day after the kick-off meeting.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Volunteer Zoom Kick-Off Meeting&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
January 12, 2026 | 4:00 to 5:00 PM&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Program Dates: January 2026 – April 2026&#039;&#039;&#039; (13 weeks)&lt;br /&gt;
&lt;br /&gt;
Remote | Hybrid for GW employees and students (Ross Hall 5th floor)&lt;br /&gt;
&lt;br /&gt;
[[Volunteership Fall 2025|Fall 2025 Volunteership]] (Closed)&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Volunteer Expectations ===&lt;br /&gt;
&lt;br /&gt;
# Minimum commitment of 10 hours per week.&lt;br /&gt;
# Progress updates via Slack at least 3 days per week (scrum).&lt;br /&gt;
# Regular Zoom meetings with the assigned project point of contact.&lt;br /&gt;
# Attend some lectures or seminars remotely (maximum of 4-5).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Important:&#039;&#039;&#039; If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.&#039;&#039;&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Potential Projects ===&lt;br /&gt;
We are excited to continue our bioinformatics volunteership program in Spring 2026. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email your resume and a ranked list of the projects that interest you most to mazumder_lab@gwu.edu. You can also indicate if you want to focus on specific areas that are of interest to you.&lt;br /&gt;
&lt;br /&gt;
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.&lt;br /&gt;
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.&lt;br /&gt;
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.&lt;br /&gt;
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs to build a training dataset for LLM-based recommendation of intervention outcome prediction resources.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.&#039;&#039;&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
==== 1. BiomarkerKB Biocuration Project Ideas ====&lt;br /&gt;
POC: Daniall Masood, Maria Kim&lt;br /&gt;
&lt;br /&gt;
# Curate biomarkers for a specific disease&lt;br /&gt;
## The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.&lt;br /&gt;
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.&lt;br /&gt;
# Top 50 biomarkers&lt;br /&gt;
## Curate the top 50 biomarkers for biomarkerkb.org.&lt;br /&gt;
## Define what constitutes a top 50 biomarker.&lt;br /&gt;
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.&lt;br /&gt;
# Biocuration of biomarkers from NLP/LLM work&lt;br /&gt;
## Use the biomarkers collected from NLP work.&lt;br /&gt;
## Curate biomarkers; the data provided does not follow the biomarker data model.&lt;br /&gt;
## While curating the biomarkers, check if data collected from NLP is correct.&lt;br /&gt;
## After completion, the student can start using curated data to work on the NLP/LLM method.&lt;br /&gt;
# Curate biomarkers for a treatment&lt;br /&gt;
## See #1 above.&lt;br /&gt;
# Continue working on LLM methods started by volunteers over the summer.&lt;br /&gt;
## The data is available as well as some preliminary research and work done by previous volunteers in this area.&lt;br /&gt;
&lt;br /&gt;
If the student has any other ideas, diseases, treatments, or methods they want to focus on, please reach out to daniallmasood@gwu.edu to discuss the idea and check whether it is feasible as a project for the Spring.&lt;br /&gt;
&lt;br /&gt;
==== 2. GlyGen Biocuration Project Ideas ====&lt;br /&gt;
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner&lt;br /&gt;
&lt;br /&gt;
Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because much of the valuable metadata (e.g., species, tissue, disease, and cell line annotations) consists of free-text terms that do not align with the established standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, &amp;quot;human,&amp;quot; &amp;quot;man,&amp;quot; and &amp;quot;h. sapiens&amp;quot; all map to the scientific species name &amp;quot;Homo sapiens.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.&lt;br /&gt;
&lt;br /&gt;
The project involves:&lt;br /&gt;
&lt;br /&gt;
# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.&lt;br /&gt;
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.&lt;br /&gt;
# Finding papers based on titles and author lists that may contain spelling errors.&lt;br /&gt;
# Interacting and discussing with other curators in case terms are mapped differently.&lt;br /&gt;
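&lt;br /&gt;
As a rough illustration of the mapping task described above, free-text terms can be normalized against a small synonym table. This is a minimal sketch: the table below is hypothetical, and real curation consults standard resources such as the NCBI Taxonomy.&lt;br /&gt;

```python
from typing import Optional

# Hypothetical synonym table; a real mapping would be built against standard
# dictionaries and ontologies (e.g., NCBI Taxonomy for species names).
SPECIES_SYNONYMS = {
    "human": "Homo sapiens",
    "man": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
    "mouse": "Mus musculus",
}

def normalize_species(term: str) -> Optional[str]:
    """Return the standard scientific name for a free-text term, if known."""
    return SPECIES_SYNONYMS.get(term.strip().lower())
```

Terms that fail this lookup (misspellings, unusual abbreviations) are exactly the cases that require manual curation and discussion among curators.&lt;br /&gt;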
&lt;br /&gt;
If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.&lt;br /&gt;
&lt;br /&gt;
==== 3. GlyGen Publication Analysis Project Ideas ====&lt;br /&gt;
&lt;br /&gt;
POC: Rene Ranzinger and Urnisha Bhuiyan&lt;br /&gt;
&lt;br /&gt;
One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.&lt;br /&gt;
&lt;br /&gt;
The project involves:&lt;br /&gt;
&lt;br /&gt;
# Using the PubMed web API to filter publications based on keywords.&lt;br /&gt;
# Analyzing paper abstracts to identify research institutions and groups that form the community.&lt;br /&gt;
# Filtering the community list to exclude unrelated co-authors.&lt;br /&gt;
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.&lt;br /&gt;
&lt;br /&gt;
A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.&lt;br /&gt;
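&lt;br /&gt;
The first step above (keyword filtering via the PubMed web API) can be sketched with NCBI's public E-utilities esearch endpoint. This minimal illustration only constructs the query URL; the keyword and result limit are example values.&lt;br /&gt;

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint for PubMed.
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(keyword: str, retmax: int = 100) -> str:
    """Build an esearch URL that returns matching PMIDs as JSON."""
    params = {"db": "pubmed", "term": keyword, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_BASE}?{urlencode(params)}"

url = build_esearch_url("glycomics")
```

Fetching this URL and parsing the returned PMID list (and, later, the abstracts) would build on this starting point.&lt;br /&gt;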
&lt;br /&gt;
==== 4. PredictMod Machine Learning Project Ideas ====&lt;br /&gt;
POC: Lori Krammer, Tianyi Wang, Pat McNeely (optional)&lt;br /&gt;
&lt;br /&gt;
Identifying relevant and useful publicly available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet and annotated PDFs for PubMed articles relevant to prostate, lung, and breast cancers, as well as biomarkers and glycans, and will focus on indicators such as condition, intervention, and response.&lt;br /&gt;
&lt;br /&gt;
PMID curation involves:&lt;br /&gt;
&lt;br /&gt;
# Identifying potentially relevant PMIDs that may have publicly available datasets for training intervention outcome prediction models.&lt;br /&gt;
# Curating indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.&lt;br /&gt;
# Reviewing peer curations and resolving annotation conflicts.&lt;br /&gt;
# Preparing a Wikipage to showcase the validated PMIDs.&lt;br /&gt;
&lt;br /&gt;
Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and weekly 1-2 paragraph reports.&lt;br /&gt;
&lt;br /&gt;
==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note:&#039;&#039; For anyone interested in ARGOS, you may be assigned to another project of your choice. This project is contingent on a contract extension. Please complete your project selection in order of preference.&lt;br /&gt;
&lt;br /&gt;
POC: Christie Woodside, Jonathon Keeney&lt;br /&gt;
&lt;br /&gt;
Qualifications: basic-to-intermediate programming skills and knowledge of common bioinformatics platforms and tools.&lt;br /&gt;
&lt;br /&gt;
# Curate and report on currently circulating pathogens to upload to ARGOS&lt;br /&gt;
## The student would work on manual curation of circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports on findings.&lt;br /&gt;
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.&lt;br /&gt;
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.&lt;br /&gt;
# QC Analysis using HIVE&lt;br /&gt;
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.&lt;br /&gt;
## The results will be added to our ARGOS database.&lt;br /&gt;
# Report Results&lt;br /&gt;
## Defend the pathogens you have selected for addition to the database. Explain their importance and the value they would hold for the scientific community.&lt;br /&gt;
&lt;br /&gt;
If the student has any other ideas or methods they want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check if it will be feasible as a project for the Spring.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Requirements for Completion ===&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; The following are mandatory. Failure to complete any will result in an incomplete volunteer record.&lt;br /&gt;
&lt;br /&gt;
==== Documentation ====&lt;br /&gt;
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.&lt;br /&gt;
&lt;br /&gt;
==== Written Report ====&lt;br /&gt;
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.&lt;br /&gt;
&lt;br /&gt;
==== Presentation &amp;amp; Slide Submission ====&lt;br /&gt;
Present your work during the last week of the 13-week period.&lt;br /&gt;
&lt;br /&gt;
Slides must be submitted to the POCs.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Completion Certificate ===&lt;br /&gt;
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Contact ===&lt;br /&gt;
mazumder_lab@gwu.edu.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Volunteers (TBD) ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+&lt;br /&gt;
!Name&lt;br /&gt;
!Project Assigned&lt;br /&gt;
!POC Assigned&lt;br /&gt;
!Projects Interested&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;nowiki&amp;gt;*&amp;lt;/nowiki&amp;gt;Returning volunteer.&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_FAQs&amp;diff=1052</id>
		<title>FDA-ARGOS FAQs</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_FAQs&amp;diff=1052"/>
		<updated>2025-09-19T17:27:42Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
Back to [[FDA-ARGOS WIKI|Home Page]] for FDA-ARGOS&lt;br /&gt;
&lt;br /&gt;
===== What is the ArgosDB and how is it organized? =====&lt;br /&gt;
ArgosDB was developed as a result of expanded funding for the FDA-ARGOS project, which is described in detail in the [[About Argos DataBase|About]] page of this website. The database stores cross-kingdom QC attributes of clinically relevant organisms organized into respective datasets. The current datasets (as of September 2025) are: ngsQC_ARGOS, ngsQC_ARGOS_extended, assemblyQC_ARGOS, assemblyQC_ARGOS_extended, biosampleMeta_ARGOS, and biosampleMeta_ARGOS_extended. These datasets are associated with core QC protocols, which are documented via BioCompute Objects (BCOs) and organized under their BCO IDs.  &lt;br /&gt;
&lt;br /&gt;
When new QC data has been produced for an organism of interest, the respective dataset(s) is/are appended, which is documented per data release in the FDA-ARGOS GitHub (&amp;lt;nowiki&amp;gt;https://github.com/FDA-ARGOS&amp;lt;/nowiki&amp;gt;) and on the database itself. All datasets are in alignment with the current data dictionary (v1.6.1 as of September 2025), which guides the QC process for that dataset as well as the column headers for a given dataset. Datasets are available as either .tsv or FASTA files if associated with a genome assembly, and all datasets and BCOs are available for download. All data provenance and curation are captured and reproducible via their BCO. Additional datasets include data from the original FDA BioProject, the Data Dictionary, Drug Resistance Mutations, Genome Assemblies (multiple), and a mapping key that assists in linking all the available data via important accessions. &lt;br /&gt;
&lt;br /&gt;
A total of 14 datasets are available as of 09/2025.&lt;br /&gt;
&lt;br /&gt;
===== How can I view or access the previous versions of the data? =====&lt;br /&gt;
First, go to the Release History tab on the ARGOS home page. Next, click &#039;details&#039; on the desired data object. Then select the desired version and data from the version transition dropdown in the top left corner and view metrics such as field count, fields added, fields removed, row count, row count prev, rows count change, ID count, IDs added, and IDs removed.&lt;br /&gt;
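&lt;br /&gt;
For intuition, transition metrics such as IDs added and IDs removed can be understood as simple set differences between two releases. The accession IDs below are made up for illustration; this is not the ARGOS implementation.&lt;br /&gt;

```python
# Hypothetical ID columns from two consecutive dataset releases.
prev_ids = {"SRR001", "SRR002", "SRR003"}
curr_ids = {"SRR002", "SRR003", "SRR004", "SRR005"}

ids_added = curr_ids - prev_ids        # present now, absent before
ids_removed = prev_ids - curr_ids      # present before, absent now
row_count_change = len(curr_ids) - len(prev_ids)
```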
&lt;br /&gt;
===== What does the schema version in the datasets refer to? =====&lt;br /&gt;
The schema relates to the organization of the ARGOS data within the data model. The version is reflective of the FDA-ARGOS data dictionary version that is currently applied to all updated datasets. As of September 2025, the current schema is v1.6.1 and can be found in the [https://github.com/FDA-ARGOS/data.argosdb/tree/main/data_dictionary/v1.6.1 FDA-ARGOS GitHub]. &lt;br /&gt;
&lt;br /&gt;
===== Does Argos have a tutorial on how to use the site? =====&lt;br /&gt;
Yes! Please follow the instructions below on how to navigate the database.&lt;br /&gt;
&lt;br /&gt;
====== How to find and search a dataset ======&lt;br /&gt;
On the data.argosdb.org home page, you can search for a dataset by entering a keyword in Search Datasets. &lt;br /&gt;
&lt;br /&gt;
Keywords can be a BCO ID, an organism name, or even a term that describes a biological process. In the following example, three results appear upon searching for Ebola.  &lt;br /&gt;
To narrow down the results further, select filters on the left sidebar; alternatively, you can browse datasets directly by selecting the relevant filters.&lt;br /&gt;
[[File:Screenshot 2025-09-19 at 1.16.41 PM.png|none|thumb|719x719px|data.argosdb Home Page]]&lt;br /&gt;
[[File:Screenshot 2025-09-19 at 1.17.57 PM.png|none|thumb|722x722px|&#039;Ebola&#039; searched in the search bar of the ARGOS database]]&lt;br /&gt;
&lt;br /&gt;
====== How to select a dataset ======&lt;br /&gt;
Next, to select a dataset, click on view details under DETAILS. Previously released dataset versions are available by clicking the dropdown button.&lt;br /&gt;
[[File:Dataset.png|thumb|722x722px|view of an example dataset after clicking on the &#039;...view details&#039; link on the homepage. The dropdown menu at the top lets you select data versions.|none]] &lt;br /&gt;
&lt;br /&gt;
====== How to view and download the BCO corresponding to a dataset ======&lt;br /&gt;
To download the dataset, click on the DOWNLOADS tab and select the download format for the target dataset. The BCO JSON will be downloaded and automatically opened as a .txt file upon clicking Download BCO. The dataset will be downloaded and automatically opened as either a .tsv or .csv file upon clicking Download dataset file.&lt;br /&gt;
[[File:Bco.png|thumb|720x720px|BCO JSON tab of the dataset.|none]]&lt;br /&gt;
[[File:Screenshot 2025-03-06 at 2.08.02 PM.png|thumb|724x724px|Downloads tab for the dataset. BCO and table can be downloaded here.|none]]&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_FAQs&amp;diff=1051</id>
		<title>FDA-ARGOS FAQs</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_FAQs&amp;diff=1051"/>
		<updated>2025-09-19T17:18:58Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
Back to [[FDA-ARGOS WIKI|Home Page]] for FDA-ARGOS&lt;br /&gt;
&lt;br /&gt;
===== What is the ArgosDB and how is it organized? =====&lt;br /&gt;
ArgosDB was developed as a result of expanded funding for the FDA-ARGOS project, which is described in detail in the [[About Argos DataBase|About]] page of this website. The database stores cross-kingdom QC attributes of clinically relevant organisms organized into respective datasets. The current datasets (as of September 2025) are: ngsQC_ARGOS, ngsQC_ARGOS_extended, assemblyQC_ARGOS, assemblyQC_ARGOS_extended, biosampleMeta_ARGOS, and biosampleMeta_ARGOS_extended. These datasets are associated with core QC protocols, which are documented via Biocompute Objects (BCO) and organized under their BCO IDs.  &lt;br /&gt;
&lt;br /&gt;
When new QC data has been produced for an organism of interest, the respective dataset(s) is/are appended, which is documented per data release in the FDA-ARGOS GitHub (&amp;lt;nowiki&amp;gt;https://github.com/FDA-ARGOS&amp;lt;/nowiki&amp;gt;) and on the database itself. All datasets are in alignment with the current data dictionary (v1.6.1 as of September 2025), which guides the QC process for that dataset as well as the column headers for a given dataset. Datasets are available as either .tsv or FASTA files if associated with a genome assembly, and all datasets and BCOs are available for download. All data provenance and curation are captured and reproducible via their BCO. Additional datasets include data from the original FDA BioProject, the Data Dictionary, Drug Resistance Mutations, Genome Assemblies (multiple), and a mapping key that assists in linking all the available data via important accessions. &lt;br /&gt;
&lt;br /&gt;
A total of 14 datasets are available as of 09/2025.&lt;br /&gt;
&lt;br /&gt;
===== How can I view or access the previous versions of the data? =====&lt;br /&gt;
First, go to the Release History tab on the ARGOS home page. Next, click &#039;details&#039; on the desired data object. Then select the desired version and data from the version transition dropbox on the top left corner and view metrics such as field count, fields added, fields removed, row count, row count prev, rows count change, ID count, IDs added, and IDs removed.&lt;br /&gt;
&lt;br /&gt;
===== What does the schema version in the datasets refer to? =====&lt;br /&gt;
The schema relates to the organization of the ARGOS data within the data model. The version is reflective of the FDA-ARGOS data dictionary version that is currently applied to all updated datasets. As of September 2025, the current schema is v1.6.1 and can be found in the [https://github.com/FDA-ARGOS/data.argosdb/tree/main/data_dictionary/v1.6 FDA-ARGOS GitHub]. &lt;br /&gt;
&lt;br /&gt;
===== Does Argos have a tutorial on how to use the site? =====&lt;br /&gt;
Yes! Please follow the instructions below of how to navigate the DB.   &lt;br /&gt;
&lt;br /&gt;
====== How to find and search a dataset ======&lt;br /&gt;
On data.argosdb.org home page, you can search for a dataset by entering the keyword in Search Datasets. &lt;br /&gt;
&lt;br /&gt;
Keywords can be BCO ID, organism name, or even a term that describes biological processes. In the following example, three results appear upon the search for Ebola.  &lt;br /&gt;
To further narrow down the result, select filters on the left sidebar. Alternatively, users can search datasets by selecting relevant filters on the left sidebar.&lt;br /&gt;
[[File:Screenshot 2025-09-19 at 1.16.41 PM.png|none|thumb|719x719px|data.argosdb Home Page]]&lt;br /&gt;
[[File:Screenshot 2025-09-19 at 1.17.57 PM.png|none|thumb|722x722px|&#039;Ebola&#039; searched in the search bar of the ARGOS database]]&lt;br /&gt;
&lt;br /&gt;
====== How to select a dataset ======&lt;br /&gt;
Next, to select a dataset, click on view details under DETAILS. Previous released versioned datasets are available upon clicking the dropdown button &lt;br /&gt;
[[File:Dataset.png|thumb|722x722px|view of an example dataset after clicking on the &#039;...view details&#039; link on the homepage. The dropdown menu at the top lets you select data versions.|none]] &lt;br /&gt;
&lt;br /&gt;
====== How to view and download the BCO that is corresponding to its dataset ======&lt;br /&gt;
To download the dataset, click on the DOWNLOADS tab and select the download format for the target dataset. BCO JSON will be downloaded and automatically opened as a .txt file upon clicking on Download BCO. Dataset will be downloaded and automatically opened either as .tsv or .csv file upon clicking on Download dataset file.&lt;br /&gt;
[[File:Bco.png|thumb|720x720px|BCO JSON tab of the dataset.|none]]&lt;br /&gt;
[[File:Screenshot 2025-03-06 at 2.08.02 PM.png|thumb|724x724px|Downloads tab for the dataset. BCO and table can be downloaded here.|none]]&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_FAQs&amp;diff=1050</id>
		<title>FDA-ARGOS FAQs</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_FAQs&amp;diff=1050"/>
		<updated>2025-09-19T17:18:41Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
Back to [[FDA-ARGOS WIKI|Home Page]] for FDA-ARGOS&lt;br /&gt;
&lt;br /&gt;
===== What is the ArgosDB and how is it organized? =====&lt;br /&gt;
ArgosDB was developed as a result of expanded funding for the FDA-ARGOS project, which is described in detail in the [[About Argos DataBase|About]] page of this website. The database stores cross-kingdom QC attributes of clinically relevant organisms organized into respective datasets. The current datasets (as of September 2025) are: ngsQC_ARGOS, ngsQC_ARGOS_extended, assemblyQC_ARGOS, assemblyQC_ARGOS_extended, biosampleMeta_ARGOS, and biosampleMeta_ARGOS_extended. These datasets are associated with core QC protocols, which are documented via Biocompute Objects (BCO) and organized under their BCO IDs.  &lt;br /&gt;
&lt;br /&gt;
When new QC data has been produced for an organism of interest, the respective dataset(s) is/are appended, which is documented per data release in the FDA-ARGOS GitHub (&amp;lt;nowiki&amp;gt;https://github.com/FDA-ARGOS&amp;lt;/nowiki&amp;gt;) and on the database itself. All datasets are in alignment with the current data dictionary (v1.6.1 as of September 2025), which guides the QC process for that dataset as well as the column headers for a given dataset. Datasets are available as either .tsv or FASTA files if associated with a genome assembly, and all datasets and BCOs are available for download. All data provenance and curation are captured and reproducible via their BCO. Additional datasets include data from the original FDA BioProject, the Data Dictionary, Drug Resistance Mutations, Genome Assemblies (multiple), and a mapping key that assists in linking all the available data via important accessions. &lt;br /&gt;
&lt;br /&gt;
A total of 14 datasets are available as of 09/2025.&lt;br /&gt;
&lt;br /&gt;
===== How can I view or access the previous versions of the data? =====&lt;br /&gt;
First, go to the Release History tab on the ARGOS home page. Next, click &#039;details&#039; on the desired data object. Then select the desired version and data from the version transition dropbox on the top left corner and view metrics such as field count, fields added, fields removed, row count, row count prev, rows count change, ID count, IDs added, and IDs removed.&lt;br /&gt;
&lt;br /&gt;
===== What does the schema version in the datasets refer to? =====&lt;br /&gt;
The schema relates to the organization of the ARGOS data within the data model. The version is reflective of the FDA-ARGOS data dictionary version that is currently applied to all updated datasets. As of September 2025, the current schema is v1.6.1 and can be found in the [https://github.com/FDA-ARGOS/data.argosdb/tree/main/data_dictionary/v1.6 FDA-ARGOs GitHub]. &lt;br /&gt;
&lt;br /&gt;
===== Does Argos have a tutorial on how to use the site? =====&lt;br /&gt;
Yes! Please follow the instructions below of how to navigate the DB.   &lt;br /&gt;
&lt;br /&gt;
====== How to find and search a dataset ======&lt;br /&gt;
On data.argosdb.org home page, you can search for a dataset by entering the keyword in Search Datasets. &lt;br /&gt;
&lt;br /&gt;
Keywords can be BCO ID, organism name, or even a term that describes biological processes. In the following example, three results appear upon the search for Ebola.  &lt;br /&gt;
To further narrow down the result, select filters on the left sidebar. Alternatively, users can search datasets by selecting relevant filters on the left sidebar.&lt;br /&gt;
[[File:Screenshot 2025-09-19 at 1.16.41 PM.png|none|thumb|719x719px|data.argosdb Home Page]]&lt;br /&gt;
[[File:Screenshot 2025-09-19 at 1.17.57 PM.png|none|thumb|722x722px|&#039;Ebola&#039; searched in the search bar of the ARGOS database]]&lt;br /&gt;
&lt;br /&gt;
====== How to select a dataset ======&lt;br /&gt;
Next, to select a dataset, click on view details under DETAILS. Previous released versioned datasets are available upon clicking the dropdown button &lt;br /&gt;
[[File:Dataset.png|thumb|722x722px|view of an example dataset after clicking on the &#039;...view details&#039; link on the homepage. The dropdown menu at the top lets you select data versions.|none]] &lt;br /&gt;
&lt;br /&gt;
====== How to view and download the BCO that is corresponding to its dataset ======&lt;br /&gt;
To download the dataset, click on the DOWNLOADS tab and select the download format for the target dataset. BCO JSON will be downloaded and automatically opened as a .txt file upon clicking on Download BCO. Dataset will be downloaded and automatically opened either as .tsv or .csv file upon clicking on Download dataset file.&lt;br /&gt;
[[File:Bco.png|thumb|720x720px|BCO JSON tab of the dataset.|none]]&lt;br /&gt;
[[File:Screenshot 2025-03-06 at 2.08.02 PM.png|thumb|724x724px|Downloads tab for the dataset. BCO and table can be downloaded here.|none]]&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=File:Screenshot_2025-09-19_at_1.17.57_PM.png&amp;diff=1049</id>
		<title>File:Screenshot 2025-09-19 at 1.17.57 PM.png</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=File:Screenshot_2025-09-19_at_1.17.57_PM.png&amp;diff=1049"/>
		<updated>2025-09-19T17:18:18Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;ebola search image sept 2025&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=File:Screenshot_2025-09-19_at_1.16.41_PM.png&amp;diff=1048</id>
		<title>File:Screenshot 2025-09-19 at 1.16.41 PM.png</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=File:Screenshot_2025-09-19_at_1.16.41_PM.png&amp;diff=1048"/>
		<updated>2025-09-19T17:17:07Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;data.argosdb Home Page 2025&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=ARGOS_Contact_Us&amp;diff=1047</id>
		<title>ARGOS Contact Us</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=ARGOS_Contact_Us&amp;diff=1047"/>
		<updated>2025-09-19T17:11:51Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;To contact us with any inquiries, &lt;br /&gt;
* email mazumder_lab@gwu.edu&lt;br /&gt;
We would love feedback on our website, data, or suggestions on data to include. &lt;br /&gt;
&lt;br /&gt;
Feel free to reach out to us about the data published in data.argosdb.org or any questions you may have.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Back to [[FDA-ARGOS WIKI|Home Page]] for FDA-ARGOS.&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=ARGOSQC_Usage_Tutorial&amp;diff=1046</id>
		<title>ARGOSQC Usage Tutorial</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=ARGOSQC_Usage_Tutorial&amp;diff=1046"/>
		<updated>2025-09-19T17:10:02Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
Back to [[FDA-ARGOS WIKI|Home Page]] for FDA-ARGOS&lt;br /&gt;
&lt;br /&gt;
One-click pipeline tutorial for the HIVE3 instance. This protocol will guide the user in running single and batch-mode QC computations. HIVE3 is an instance of HIVE not owned by the FDA; it can be directly modified by Vahan or others on our team with permissions.&lt;br /&gt;
&lt;br /&gt;
= Required User Information =&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Protocol Version&#039;&#039;&#039; &lt;br /&gt;
|1.0&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;HIVE Instance&#039;&#039;&#039;&lt;br /&gt;
|3&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;HIVE Link&#039;&#039;&#039;&lt;br /&gt;
|&amp;lt;nowiki&amp;gt;https://hive3.biochemistry.gwu.edu/dna.cgi?cmd=login&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Overview =&lt;br /&gt;
We constructed a one-click QC pipeline that takes user-specified organism information and combines the 3 core ARGOS workflows to produce 5 different result datasets in JSON format (Figure 1). Three of the 5 result JSONs (assemblyQC, ngsQC, and biosampleMeta) have been updated and added to data.argosdb.&lt;br /&gt;
&lt;br /&gt;
To register your account, navigate to the link under “Required User Information”. At the top right there will be a tab labeled “Register”. Fill out the appropriate fields and submit. Once submitted, please email mazumder_lab@gwu.edu so we can verify your account.&lt;br /&gt;
&lt;br /&gt;
The ARGOS pipeline can be accessed via the ‘Projects’ dropdown menu at the upper right of the screen, then under ‘Argos’. The link to the pipeline in HIVE3 is listed in the “Required User Information” section at the beginning of this protocol. &lt;br /&gt;
&lt;br /&gt;
After a successful login, you will be taken to the home page. Use the Projects menu at the top right corner to access the ARGOS pipeline, or use this URL to access the ARGOS QC pipeline on HIVE3:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;https://hive3.biochemistry.gwu.edu/dna.cgi?cmd=argos-alqc&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Screenshot figure 1.png|thumb|667x667px|&#039;&#039;&#039;Figure 1.&#039;&#039;&#039; Input settings page for the ARGOS QC pipeline. |none]]&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;Input values:&#039;&#039; ==&lt;br /&gt;
On the ARGOS pipeline input settings page, the General tab contains the data inputs for both single and batch computations. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Name:&#039;&#039;&#039; Give the computation a name. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Folder:&#039;&#039;&#039; Name the folder where your computations, data, and steps will be stored.&lt;br /&gt;
&lt;br /&gt;
* Can use: _ or - or &amp;amp;, plus letters and numbers&lt;br /&gt;
* Cannot use: / : ; , \ “ ” ‘ ’ &lt;br /&gt;
* You can use / to create a subfolder, but that is not recommended; manually moving the subfolder is best.&lt;br /&gt;
* Example folder: Influenza A (h5n1)&lt;br /&gt;
&lt;br /&gt;
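As a rough illustration of these naming rules, the following Python sketch (hypothetical, not part of HIVE) rejects the forbidden characters. Spaces and parentheses are assumed allowed because the example folder name above uses them; the function name and regex are illustrative only.

```python
import re

# Forbidden characters listed in this protocol: / : ; , \ and straight or
# curly quotes (written here as unicode escapes).
FORBIDDEN = set('/:;,\\"\u201c\u201d\u2018\u2019')

def folder_name_ok(name: str) -> bool:
    """Allow letters, numbers, spaces, parentheses, _ - and & (assumption)."""
    if any(ch in FORBIDDEN for ch in name):
        return False
    return bool(re.fullmatch(r"[\w\s()&-]+", name))

print(folder_name_ok("Influenza A (h5n1)"))  # example folder from the protocol -> True
print(folder_name_ok("bad:name"))            # contains a forbidden ':' -> False
```

This is only a sanity check of the character rules; HIVE itself performs its own validation.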
&#039;&#039;&#039;Reads:&#039;&#039;&#039; Information needed for ngsQC. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;SRR&#039;&#039;&#039;: The SRR accession number. Multiple accessions per organism can be entered, separated by a “,”, or extra fields can be populated by clicking the gray + sign. This tool uses the NCBI SRA fasterq function to grab the fastq files directly from NCBI without the user needing to import them into HIVE.  &lt;br /&gt;
* &#039;&#039;&#039;HIVE reads:&#039;&#039;&#039; Dropdown menu for selecting reads already uploaded into HIVE, either from previous computations or manual uploads. &lt;br /&gt;
* See in Figure 2a&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;NOTE:&#039;&#039;&#039;&#039;&#039; The algorithm will automatically select information already in HIVE rather than pulling it from outside sources. If the reads are already uploaded, you do not need to use the SRR input box; use the HIVE IDs menu instead. (You can still use the SRR box; it will simply search within HIVE.) See the ngsQC Protocol for how to upload SRR information using the external downloader. This external downloader process is the same as in HIVE2 and HIVE1.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Reference:&#039;&#039;&#039; Information used for the assemblyQC portion of the algorithm.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Reference accession:&#039;&#039;&#039; The RefSeq or GenBank accession number from NCBI.&lt;br /&gt;
* &#039;&#039;&#039;Assembly ID&#039;&#039;&#039;: This is the ASSEMBLY accession number from NCBI. &lt;br /&gt;
* &#039;&#039;&#039;HIVE genome:&#039;&#039;&#039; Use the drop down menu to select a reference genome that has already been uploaded into HIVE. &lt;br /&gt;
* See in Figure 2b&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;NOTE:&#039;&#039;&#039;&#039;&#039; The algorithm will automatically select information already in HIVE rather than pulling it from outside sources. If the reference is already uploaded, you do not need to use the assembly ID or reference accession input box, but you can if you want to. See the AssemblyQC Protocol for how to upload assembly information using an external downloader or local upload. This process is the same as in HIVE2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Metadata: &#039;&#039;&#039; Used to grab information necessary to fill out the BiosampleMeta_HIVE document.&lt;br /&gt;
&lt;br /&gt;
* Biosample Accession: The accession number of the BioSample reported as used when creating the assembly; it will be linked to the SRR fastq files used for the ngsQC portion of the algorithm. This step is optional.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Coding Table:&#039;&#039;&#039; Dropdown of genetic codon tables, chosen depending on the organism to be computed. The default is “human, viral (Standard)”.&lt;br /&gt;
&lt;br /&gt;
* Tip: NCBI Taxonomy will list the codon table for each organism on their taxonomy page.&lt;br /&gt;
[[File:Screenshot figure 2 a.png|none|thumb|657x657px|&#039;&#039;&#039;Figure 2.&#039;&#039;&#039; The ARGOS_QC algorithm page input with all NCBI information for a test organism, Salmonella enterica.]]&lt;br /&gt;
[[File:Screenshot fig 2a.png|none|thumb|508x508px|&#039;&#039;&#039;Figure 2 a)&#039;&#039;&#039; The SRR accession field contains both SRR fastqs for the organism that correspond to the biosample.]]&lt;br /&gt;
[[File:Screenshot figure 2b.png|none|thumb|502x502px|&#039;&#039;&#039;Figure 2 b)&#039;&#039;&#039; The &#039;&#039;&#039;Reference Accession&#039;&#039;&#039; is the RefSeq Nucleotide accession number from NCBI, the &#039;&#039;&#039;Assembly ID&#039;&#039;&#039; is the assembly accession number from NCBI, and the Biosample Accession is the Biosample accession number from NCBI.                                         &#039;&#039;&#039;a &amp;amp; b)&#039;&#039;&#039; &#039;&#039;&#039;HIVE IDs&#039;&#039;&#039; are accessions that are already uploaded into HIVE and the algorithm will automatically select this information as opposed to pulling information from outside sources. ]]&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;Where to locate NCBI Information for the inputs:&#039;&#039; ==&lt;br /&gt;
Navigate to the NCBI legacy assembly page for your organism. Here you can find all of the information to be used for your computations.&lt;br /&gt;
[[File:Screenshot fig 3.png|none|thumb|518x518px|&#039;&#039;&#039;Figure 3.&#039;&#039;&#039; The information shown in the NCBI legacy assembly page. The RefSeq assembly accession corresponds to our organism of interest and will be used to fill out the &#039;&#039;&#039;“Assembly ID”&#039;&#039;&#039; input field on the HIVE3 ARGOS_QC input page. Please note that the bioproject matches the accession for the FDA_ARGOS bioproject, and there are 2 sequencing technologies listed, meaning that there will most likely be two SRR submissions that we can find on the SRA page (see Figure 5 and 6). ]]&lt;br /&gt;
[[File:Screenshot fig 4.png|none|thumb|670x670px|&#039;&#039;&#039;Figure 4&#039;&#039;&#039;. The bottom section of the legacy assembly page for our test organism. The column labeled RefSeq sequence lists the DNA RefSeq for our test organism which we will use for the “&#039;&#039;&#039;Reference Accession&#039;&#039;&#039;” field on the HIVE3 ARGOS_QC input page. ]]&lt;br /&gt;
[[File:Screenshot fig 5.png|none|thumb|625x625px|&#039;&#039;&#039;Figure 5.&#039;&#039;&#039; Clicking on the Biosample accession number seen in Figure 3 will redirect you to the NCBI Biosample page for this organism. Under &#039;&#039;&#039;“Related Information”&#039;&#039;&#039;, click &#039;&#039;&#039;“SRA”&#039;&#039;&#039; to navigate to the NCBI SRA page for this biosample.]]&lt;br /&gt;
[[File:Screenshot fig 6.png|none|thumb|634x634px|&#039;&#039;&#039;Figure 6.&#039;&#039;&#039; The SRA page lists different sequencing links, each reported as sequenced on a different platform: Illumina or PacBio. This is common. The rationale behind using different platforms is to gain insight into the assembly from different perspectives and levels. Illumina sequences DNA as many short reads that can be used to create an accurate reconstruction of the genomic sections analyzed by estimating the average/best-fit nucleotide sequence. PacBio is a long-read sequencer that takes “movies” of the DNA sequence as it moves through the instrument, capturing the sequence in one pass from start to finish. The long-read sequence then acts as a map for the accurate short reads that need to be assembled. Therefore, it is important to use all of the links reported in our QC pipeline.]]&lt;br /&gt;
[[File:Screenshot fig 7.png|none|thumb|635x635px|&#039;&#039;&#039;Figure 7.&#039;&#039;&#039; Clicking on the bottom link under the &#039;&#039;&#039;Runs&#039;&#039;&#039; section from the SRA page shown in &#039;&#039;&#039;Figure 6&#039;&#039;&#039; will redirect to the page containing the SRR file. Copy and paste the SRR accession number into the input field in the HIVE3 pipeline labeled “SRR”. You will need to do this for both (or more) SRR accessions. Check that the bioproject, biosample, and organism name all coincide with our test organism.]]&lt;br /&gt;
&lt;br /&gt;
= Single QC Computation =&lt;br /&gt;
A single QC computation will allow for assemblyQC, biosampleQC, and ngsQC to be performed on one organism with one assembly, but can include multiple SRR ids. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1:&#039;&#039;&#039; Input the name of the computation and the name of the folder you wish to store the computation in. If you would like to add to a pre-existing folder, input the exact folder name in the folder input field. It is case sensitive. See input criteria at the beginning of this protocol.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 2:&#039;&#039;&#039; Under the dropdown menu for Reads, select which input you will use. &lt;br /&gt;
&lt;br /&gt;
* For SRRs, type or paste in the SRR ID. If there is more than one SRR ID, click on the gray + sign to populate a new input field, or separate the IDs with a “,”.&lt;br /&gt;
** Troubleshooting: if the computation fails, try removing the spaces between the commas and SRR IDs. No spaces.&lt;br /&gt;
* For HIVE IDs, click on the HIVE ID option from the dropdown menu. Click on the gray dropdown arrow next to HIVE reads. A pop-up window will open, as seen in Figure 9. &lt;br /&gt;
** Click on the IDs that you wish to use in the computation. Use Ctrl + Shift to highlight multiple IDs.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig8.png|none|thumb|700px|&#039;&#039;&#039;Figure 8.&#039;&#039;&#039; Input settings page for the ARGOS QC pipeline filled out for a single QC computation of our test organism, Salmonella enterica. ]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig9.png|none|thumb|700px|&#039;&#039;&#039;Figure 9.&#039;&#039;&#039; Pop-up window for SRR HIVE ID selection. ]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3:&#039;&#039;&#039; Next to Reference, select the input object you would like to use for the computation.&lt;br /&gt;
&lt;br /&gt;
* For Reference Accession, type or paste in the ID you wish to use.&lt;br /&gt;
* For Assembly IDs, type or paste in the ID you wish to use. &lt;br /&gt;
* For HIVE Genomes, refer to Step 2 above on how to select a HIVE ID. It is the same process.&lt;br /&gt;
* Refer to the beginning of this protocol for which IDs can be inputted.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4:&#039;&#039;&#039; Next under BioSample Accessions, paste in the biosample ID you would like to use for the computation. This is optional.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 5:&#039;&#039;&#039; Next, select from the Coding Table dropdown menu the genetic code you would like to use for your computation, if applicable. The default is “human, viral (Standard)”.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 6:&#039;&#039;&#039; Lastly, navigate to the Advanced tab and open the pipeline section. Under the All objects section, select argos-cflow System Application.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 7:&#039;&#039;&#039; Once all of the information has been inputted correctly, click on the big blue Submit button in the middle, as seen in Figure 8.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 8:&#039;&#039;&#039; Once your computation has been submitted, you can click on the Home tab found in the top left to go back to the homepage and see the workflow.&lt;br /&gt;
&lt;br /&gt;
= Batch Mode Computation =&lt;br /&gt;
Batch mode operates on a user-specified ratio of groups. Because the IDs are clustered with semicolons, the pipeline recognizes the IDs between semicolons as one computation, so the ratio is 1:1:1: one cluster of SRRs, to one assembly, to one biosample, per computation. A colorful, highlighted example below displays the syntax for the inputs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;They would be grouped for computations like this example:&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;u&amp;gt;Assembly 1:&amp;lt;/u&amp;gt; &lt;br /&gt;
&lt;br /&gt;
SRR0123456, SRR0123457, SRR0123458, SRR0123459&lt;br /&gt;
&lt;br /&gt;
GCA_0011223344.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654321, SAMN110654322&lt;br /&gt;
&lt;br /&gt;
&amp;lt;u&amp;gt;Assembly 2:&amp;lt;/u&amp;gt; &lt;br /&gt;
&lt;br /&gt;
SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
&lt;br /&gt;
GCA_0011223345.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654323, SAMN110654324&lt;br /&gt;
&lt;br /&gt;
To separate batches, use a semicolon “;” between the IDs. A comma separates individual IDs; a semicolon separates batches. These are inputted in the General tab of the pipeline, the same as for a single computation. Within each field, this is how the above example would look in batch mode:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;u&amp;gt;SRR IDS:&amp;lt;/u&amp;gt; SRR0123456, SRR0123457, SRR0123458, SRR0123459&amp;lt;mark&amp;gt;;&amp;lt;/mark&amp;gt; SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
&lt;br /&gt;
&amp;lt;u&amp;gt;Assembly IDs:&amp;lt;/u&amp;gt; GCA_0011223344.1&amp;lt;mark&amp;gt;;&amp;lt;/mark&amp;gt; GCA_0011223345.1&lt;br /&gt;
&lt;br /&gt;
&amp;lt;u&amp;gt;BioSample Accessions:&amp;lt;/u&amp;gt; SAMN110654321, SAMN110654322&amp;lt;mark&amp;gt;;&amp;lt;/mark&amp;gt; SAMN110654323, SAMN110654324&lt;br /&gt;
&lt;br /&gt;
Notice the semicolon separation in the example above: the commas separate the IDs, and the semicolons separate the batches.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting Note:&#039;&#039;&#039; If your computation fails or there is an error, remove the spaces between the commas and semicolons. This previously threw an error that has since been fixed, but it is worth trying if your computation fails. It would look like:&lt;br /&gt;
&lt;br /&gt;
SRR IDS: SRR0123456,SRR0123457,SRR0123458,SRR0123459&#039;&#039;&#039;;&#039;&#039;&#039;SRR0123451,SRR0123452,SRR0123453,SRR0123454&lt;br /&gt;
&lt;br /&gt;
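As a sketch of how these separators group the inputs, the following Python snippet (hypothetical; not part of the HIVE pipeline) splits each field on semicolons into computations and on commas into IDs. The accession numbers are the placeholder examples from this protocol.

```python
# Illustrative only: mimics how batch-mode fields are grouped.
# Semicolons separate computations; commas separate IDs within one.
def parse_batch_field(field: str) -> list[list[str]]:
    return [
        [item.strip() for item in batch.split(",") if item.strip()]
        for batch in field.split(";")
        if batch.strip()
    ]

srr_field = "SRR0123456,SRR0123457;SRR0123451,SRR0123452"
assembly_field = "GCA_0011223344.1;GCA_0011223345.1"

srr_batches = parse_batch_field(srr_field)
assemblies = parse_batch_field(assembly_field)

# The 1:1 pairing: the first SRR cluster goes with the first assembly, etc.
for srrs, asm in zip(srr_batches, assemblies):
    print(asm, srrs)
```

This also shows why stray spaces around separators are harmless to a stripping parser but, per the troubleshooting note above, have previously caused errors in the pipeline itself.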
&#039;&#039;&#039;Step 1:&#039;&#039;&#039; Navigate to the tab titled Batch (Figure 10). This can be found on the ARGOS input settings page (Figure 1), next to the General tab. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 2:&#039;&#039;&#039; For the parameter “batch service” at the bottom, select batch mode from the dropdown menu. This sets the pipeline to Batch Mode rather than single computations.&lt;br /&gt;
&lt;br /&gt;
[[File:Fig.10.png|none|thumb|700px|&#039;&#039;&#039;Figure 10.&#039;&#039;&#039; Batch mode input settings window.]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 3:&#039;&#039;&#039; Selecting the parameters. Click on the drop down menu next to the text “Parameter list”.&lt;br /&gt;
&lt;br /&gt;
* Use the black plus button next to ‘Parameter List’ to populate an entry field.&lt;br /&gt;
* Select from the dropdown field the correct parameter based on the input field you used in the general input page. This can be seen in Figure 11.&lt;br /&gt;
* For example, if you pasted in SRR IDs, you would choose the parameter SRR IDs. If you used HIVE IDs, you would select HIVE IDs from the dropdown.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Fig11.png|none|thumb|700px|&#039;&#039;&#039;Figure 11.&#039;&#039;&#039; Dropdown menu from the Batch tab displaying the parameter options. Select the batch parameter option based on what the input information is in the general tab. For example, if you entered SRR ids, select SRR IDs from the dropdown. If you input a reference ID, select Reference IDs.]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4:&#039;&#039;&#039; Input the ratio for the batch service. &lt;br /&gt;
&lt;br /&gt;
* For computations in batch mode in the one-click pipeline, the computations are separated by semicolon “;” and the IDs within the computations by a comma “,”. Since the workflow will parse the computations and recognize the IDs between the “;” as one computation, the ratio will be 1:1:1. &lt;br /&gt;
* If the ratio is 1:1:1 then enter the value 1 for each box.&lt;br /&gt;
* One set of SRRs to one assembly to one biosample. &lt;br /&gt;
&#039;&#039;&#039;Step 5:&#039;&#039;&#039; Navigate to the Advanced tab and open the Pipeline section. As shown in Figure 12, click on the dropdown arrow next to Pipeline, and under All Projects, locate the argos-cflow System Application. As shown in Figure 13, select the argos-cflow System Application by clicking on it, then confirm the selection by clicking Select at the bottom of the window. Leave all other settings in the Advanced section at their defaults.&lt;br /&gt;
[[File:Fig12.png|none|thumb|700px|&#039;&#039;&#039;Figure 12.&#039;&#039;&#039; Advanced tab showing the Pipeline dropdown menu where the argos-cflow System Application can be selected.]]&lt;br /&gt;
[[File:Fig13.png|none|thumb|700px|&#039;&#039;&#039;Figure 13.&#039;&#039;&#039; All Projects window displaying available workflows. The argos-cflow System Application is highlighted and can be selected by clicking “Select” at the bottom. ]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 6:&#039;&#039;&#039; Input the information correctly in each field. Navigate back to the input settings page (Figure 1). This is the same page used for single computations; the only difference is the semicolons and commas. The example below shows visually how the information is inputted for batch mode.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;Batch Mode Parameter breakdown:&#039;&#039; ==&lt;br /&gt;
[[File:Batcheg1.png|none|thumb|500px]] &lt;br /&gt;
&lt;br /&gt;
Assembly 1: &lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;border:2px solid #e60000; border-collapse:collapse; margin:0 0 5px 0;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:2px 6px;&amp;quot; | SRR0123456, SRR0123457, SRR0123458, SRR0123459&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;border:2px solid #ff9900; border-collapse:collapse; margin:0 0 5px 0;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:2px 6px;&amp;quot; | GCA_0011223344.1&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;border:2px solid #ffff00; border-collapse:collapse; margin:0 0 5px 0;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:2px 6px;&amp;quot; | SAMN110654321, SAMN110654322&lt;br /&gt;
|}&lt;br /&gt;
Assembly 2: &lt;br /&gt;
&lt;br /&gt;
SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
&lt;br /&gt;
GCA_0011223345.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654323, SAMN110654324&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Batcheg2.png|none|thumb|500px]]&lt;br /&gt;
&lt;br /&gt;
Assembly 1: &lt;br /&gt;
&lt;br /&gt;
SRR0123456, SRR0123457, SRR0123458, SRR0123459&lt;br /&gt;
&lt;br /&gt;
GCA_0011223344.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654321, SAMN110654322&lt;br /&gt;
&lt;br /&gt;
Assembly 2:&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;border:2px solid #00ff00; border-collapse:collapse; margin:0 0 5px 0;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:2px 6px;&amp;quot; | SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;border:2px solid #00ffff; border-collapse:collapse; margin:0 0 5px 0;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:2px 6px;&amp;quot; | GCA_0011223345.1&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| style=&amp;quot;border:2px solid #8000ff; border-collapse:collapse; margin:0 0 5px 0;&amp;quot;&lt;br /&gt;
| style=&amp;quot;padding:2px 6px;&amp;quot; | SAMN110654323, SAMN110654324&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Again, this is very similar to single computations, except that batch mode uses semicolons and commas to separate the IDs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 7:&#039;&#039;&#039; Once all of the input information is complete, hit the blue Submit button. You may exit the Argos pipeline window by hitting “Home” in the top left corner.&lt;br /&gt;
&lt;br /&gt;
= QC Computation Results =&lt;br /&gt;
Once you have submitted your computations, either single or batch, you should see the pipeline workflow in your Inbox or All Objects, as shown in Figure 14. You can also view the pipeline by clicking on the “workflows” tab, also seen in Figure 14.&lt;br /&gt;
[[File:Fig14.png|none|thumb|700px|&#039;&#039;&#039;Figure 14.&#039;&#039;&#039;  The pipeline workflow displayed in the user’s inbox.]]&lt;br /&gt;
&lt;br /&gt;
As the workflow progresses, your computations will be stored in the folder that you named from the beginning of this protocol. To view the contents of the folder, simply click on the plus sign next to the folder or the folder name to open.&lt;br /&gt;
&lt;br /&gt;
Once your computations are complete, the QC outputs will be stored in JSON format under the “&#039;&#039;&#039;Post-Alignment Quality Controls&#039;&#039;&#039;” computation or under the &#039;&#039;&#039;“CFlow”&#039;&#039;&#039; workflow. Post-Alignment QC can be found in the folder you specified for the computation; CFlow is in All Objects. To view the JSONs, click on the name so that it is highlighted blue, then click on the tab in the bottom menu named “Available Downloads”.&lt;br /&gt;
[[File:Fig15.png|none|thumb|700px|&#039;&#039;&#039;Figure 15.&#039;&#039;&#039; The available downloads tab and the 5 JSON files that are the QC outputs.]]&lt;br /&gt;
There will be 5 files reported in JSON format. Click the blue/green download icon next to each file to see the results. The file labeled &#039;&#039;&#039;qcAll.json&#039;&#039;&#039; will have our assemblyQC results. &#039;&#039;&#039;qcNGS.json&#039;&#039;&#039; will have our ngsQC results and &#039;&#039;&#039;biosample.json&#039;&#039;&#039; the biosample information. We currently do not submit qcPos.json or refAnnot.json to the ARGOS DB, but the information is there to better help you understand your computation.&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Projects&amp;diff=998</id>
		<title>Projects</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Projects&amp;diff=998"/>
		<updated>2025-09-08T19:21:23Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{DISPLAYTITLE:&amp;lt;span style=&amp;quot;position: absolute; clip: rect(1px 1px 1px 1px); clip: rect(1px, 1px, 1px, 1px);&amp;quot;&amp;gt;{{FULLPAGENAME}}&amp;lt;/span&amp;gt;}}&lt;br /&gt;
__NOTOC__&lt;br /&gt;
&amp;lt;!-- BANNER ACROSS TOP OF PAGE --&amp;gt;&lt;br /&gt;
&amp;lt;div id=&amp;quot;ggw-topbanner&amp;quot; style=&amp;quot;clear:both; position:relative; box-sizing:border-box; width:100%; margin:1.2em 0 6px; min-width:47em; border:1px solid #ddd; background-color:#f9f9f9; color:#000;&amp;quot;&amp;gt;&lt;br /&gt;
    &amp;lt;div style=&amp;quot;margin:0.4em; text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;div style=&amp;quot;font-size:160%; padding:.1em;&amp;quot;&amp;gt;Current Projects&amp;lt;/div&amp;gt;&lt;br /&gt;
        &amp;lt;/div&amp;gt;&lt;br /&gt;
    &amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;clear: both;&amp;quot;&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;div id=&amp;quot;ggw_row2&amp;quot; style=&amp;quot;display: flex; flex-flow: row wrap; justify-content: space-between; padding: 0; margin: 0 -5px 0 -5px;&amp;quot;&amp;gt;&lt;br /&gt;
    &amp;lt;div style=&amp;quot;flex: 1; margin: 5px; min-width: 210px; border: 1px solid #CCC;	padding: 0 10px 10px 10px; box-shadow: 0 2px 2px rgba(0,0,0,0.1); background: #f5faff;&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;h3&amp;gt;[https://hive.biochemistry.gwu.edu/dna.cgi?cmd=main The High-performance Integrated Virtual Environment (HIVE) platform]&amp;lt;/h3&amp;gt;&lt;br /&gt;
        &amp;lt;div style=&amp;quot;border-top: 1px solid #CCC; padding-top: 0.5em;&amp;quot;&amp;gt;&lt;br /&gt;
HIVE is a cloud-based environment optimized for the storage and analysis of extra-large data, such as biomedical data, clinical data, next-generation sequencing (NGS) data, mass spectrometry files, confocal microscopy images, post-market surveillance data, medical recall data, and many others. HIVE provides secure web access for authorized users to deposit, retrieve, annotate and compute on Big Data, and analyze the outcomes using web user interfaces. [https://docs.google.com/document/d/1F5iq00uKkJfdSsbwanvKOy-nPnwijH56mwbwa_HhzfY/edit?tab=t.0#heading=h.7dlfmngwfzih More here].&lt;br /&gt;
&lt;br /&gt;
The HIVE platform and associated algorithms, such as CensuScope and HIVE-Hexagon, are used to support the metagenomics analysis infrastructure.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
[[GW-HIVE WIKI]]&lt;br /&gt;
&lt;br /&gt;
[[METAGENOMICS WIKI]]&lt;br /&gt;
        &lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
    &amp;lt;/div&amp;gt;&lt;br /&gt;
	&amp;lt;div style=&amp;quot;flex: 1; margin: 5px; min-width: 210px; border: 1px solid #CCC;	padding: 0 10px 10px 10px; box-shadow: 0 2px 2px rgba(0,0,0,0.1); background: #f5faff;&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;h3&amp;gt;[https://data.argosdb.org/ FDA-ARGOS Project (Food and Drug Administration-dAtabase for Regulatory-Grade micrObial Sequences)]&amp;lt;/h3&amp;gt;&lt;br /&gt;
        &amp;lt;div style=&amp;quot;border-top: 1px solid #CCC; padding-top: 0.5em;&amp;quot;&amp;gt;&lt;br /&gt;
The FDA-ARGOS Project (Food and Drug Administration-dAtabase for Regulatory-Grade micrObial Sequences) is a collaborative effort to create a high-quality genomic database for identifying and characterizing microbial pathogens. Developed in partnership with the FDA, University of Maryland, and NCBI, the project provides regulatory-grade genomic data, crucial for public health and diagnostic use. Expanded in 2021 with support from GWU, Temple University, and Embleema, FDA-ARGOS aims to enhance infectious disease research through rigorous quality control protocols. The ArgosDB hosts this data, offering downloadable sequences and reproducible workflows for research and regulatory applications.[https://www.fda.gov/medical-devices/science-and-research-medical-devices/database-reference-grade-microbial-sequences-fda-argos More here].&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
[[FDA-ARGOS WIKI]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
    &amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
    &amp;lt;div style=&amp;quot;flex: 1; margin: 5px; min-width: 210px; border: 1px solid #CCC;	padding: 0 10px 10px 10px; box-shadow: 0 2px 2px rgba(0,0,0,0.1); background: #f5faff;&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;h3&amp;gt;[https://www.biocomputeobject.org/ BioCompute Objects (BCO)]&amp;lt;/h3&amp;gt;&lt;br /&gt;
        &amp;lt;div style=&amp;quot;border-top: 1px solid #CCC; padding-top: 0.5em;&amp;quot;&amp;gt;&lt;br /&gt;
BioCompute is an FDA-funded project to establish a framework for community-based development of standards for harmonization of High-throughput Sequencing (HTS), standardization of data formats, promotion of interoperability, and bioinformatics verification protocols. The BioCompute Object (BCO) was developed in the High-throughput Sequencing Computational Standards for Regulatory Sciences (HTS-CSRS) initiative, together with the BioCompute Objects Portal (BOP), a web portal serving as collaborative ground to encourage the dialogue needed to facilitate interoperability between different bioinformatic pipelines, industries, and developers. HIVE capabilities have been leveraged to support the development of the BCO. The BCO is versatile and adaptable to other common HTS analysis platforms. [https://docs.google.com/document/d/1WQFZm_PFiQXob4NyOKq6y-2ywnbmNoFHSS27fYf3l4Y/edit?tab=t.0#heading=h.bs8eki17tykx More here].&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
[https://wiki.biocomputeobject.org/Main_Page BIOCOMPUTE OBJECTS WIKI]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
    &amp;lt;/div&amp;gt;&lt;br /&gt;
    &amp;lt;div style=&amp;quot;flex: 1; margin: 5px; min-width: 210px; border: 1px solid #CCC;	padding: 0 10px 10px 10px; box-shadow: 0 2px 2px rgba(0,0,0,0.1); background: #f5faff;&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;h3&amp;gt;[https://www.glygen.org/ GlyGen]&amp;lt;/h3&amp;gt;&lt;br /&gt;
	&amp;lt;div style=&amp;quot;border-top: 1px solid #CCC; padding-top: 0.5em;&amp;quot;&amp;gt;&lt;br /&gt;
GlyGen (gly-glycobiology; gen-information), [https://www.glygen.org/&amp;lt;nowiki&amp;gt;] is an advanced glycoinformatics resource developed to facilitate discovery in basic and translational glycobiology research along with enhancing the integration of multidisciplinary information from diverse resources. GlyGen includes knowledge about molecular, biophysical and functional properties of glycans, genes, and proteins organized in pathways and ontologies, plus a rapidly growing body of biological big data related to cancer mutation and expression. GlyGen adopts an innovative user-driven approach for implementing, prioritizing, and disseminating knowledge and tools to address the questions and needs of the glycobiology community. GlyGen is funded by the National Institute of General Medical Sciences under the grant # 1R24GM146616-01 and the National Institutes of Health Office of Strategic Coordination - The Common Fund under the grant # 1OT2OD032092. More information about GlyGen - &amp;lt;/nowiki&amp;gt;https://www.glygen.org/about/ &amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
[https://wiki.glygen.org/Main_Page GlyGen WIKI]&lt;br /&gt;
        &amp;lt;/div&amp;gt;&lt;br /&gt;
    &amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&amp;lt;div id=&amp;quot;ggw_row3&amp;quot; style=&amp;quot;display: flex; flex-flow: row wrap; justify-content: space-between; padding: 0; margin: 0 -5px 0 -5px;&amp;quot;&amp;gt;&lt;br /&gt;
    &amp;lt;div style=&amp;quot;flex: 1; margin: 5px; min-width: 210px; border: 1px solid #CCC;	padding: 0 10px 10px 10px; box-shadow: 0 2px 2px rgba(0,0,0,0.1); background: #f5faff;&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;h3&amp;gt;[https://hivelab.biochemistry.gwu.edu/predictmod PredictMod]&amp;lt;/h3&amp;gt;&lt;br /&gt;
        &amp;lt;div style=&amp;quot;border-top: 1px solid #CCC; padding-top: 0.5em;&amp;quot;&amp;gt;&lt;br /&gt;
PredictMod is an application designed to predict the outcome of an intervention before a patient initiates treatment. Our goal is to provide clinicians with a powerful decision-making tool that enhances clinical understanding of patient-level data. The PredictMod platform utilizes machine learning tools and complex datasets based on electronic health records, gut microbiome, and -omics data to forecast patient outcomes, often in response to treatment for a particular condition. While our primary condition of interest is prediabetes, the tool is designed to be used for a variety of conditions, interventions, and data types. &amp;lt;br&amp;gt; &amp;lt;br&amp;gt;&lt;br /&gt;
[https://hivelab.biochemistry.gwu.edu/wiki/PredictMod PredictMod WIKI]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
    &amp;lt;/div&amp;gt;&lt;br /&gt;
    &amp;lt;div style=&amp;quot;flex: 1; margin: 5px; min-width: 210px; border: 1px solid #CCC;	padding: 0 10px 10px 10px; box-shadow: 0 2px 2px rgba(0,0,0,0.1); background: #f5faff;&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;h3&amp;gt;[[GW-FEAST]]&amp;lt;/h3&amp;gt;&lt;br /&gt;
        &amp;lt;div style=&amp;quot;border-top: 1px solid #CCC; padding-top: 0.5em;&amp;quot;&amp;gt;&lt;br /&gt;
The GW Federated Ecosystems for Analytics and Standardized Technologies (GW-FEAST) project is part of the ARPA-H FEAST performer team initiative that includes academic and industry partners. The goal of the ARPA-H performer teams is “to create bridges across data silos to make health data more accessible and usable”. &amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
[https://hivelab.biochemistry.gwu.edu/wiki/GW-FEAST GW-FEAST WIKI]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
    &amp;lt;/div&amp;gt;&lt;br /&gt;
    &amp;lt;div style=&amp;quot;flex: 1; margin: 5px; min-width: 210px; border: 1px solid #CCC;	padding: 0 10px 10px 10px; box-shadow: 0 2px 2px rgba(0,0,0,0.1); background: #f5faff;&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;h3&amp;gt;[https://hivelab.biochemistry.gwu.edu/biomarker-partnership Biomarker Knowledgebase]&amp;lt;/h3&amp;gt;&lt;br /&gt;
        &amp;lt;div style=&amp;quot;border-top: 1px solid #CCC; padding-top: 0.5em;&amp;quot;&amp;gt;&lt;br /&gt;
The Biomarker Partnership is a CFDE-sponsored project to develop a knowledgebase that will organize and integrate biomarker data from different public sources. The data will be connected to contextual information to provide a novel systems-level view of biomarkers. The motivation for this project is to improve the harmonization and organization of biomarker data by mapping biomarkers from public sources to, and across, CF data elements. This mapping will bridge knowledge across multiple DCCs and biomedical disciplines.&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
[https://wiki.biomarkerkb.org/Main_Page BioMarkerKB WIKI]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
    &amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
{{DISPLAYTITLE:&amp;lt;span style=&amp;quot;position: absolute; clip: rect(1px 1px 1px 1px); clip: rect(1px, 1px, 1px, 1px);&amp;quot;&amp;gt;{{FULLPAGENAME}}&amp;lt;/span&amp;gt;}}&lt;br /&gt;
__NOTOC__&lt;br /&gt;
&amp;lt;!-- BANNER ACROSS TOP OF PAGE --&amp;gt;&lt;br /&gt;
&amp;lt;div id=&amp;quot;ggw-topbanner&amp;quot; style=&amp;quot;clear:both; position:relative; box-sizing:border-box; width:100%; margin:1.2em 0 6px; min-width:47em; border:1px solid #ddd; background-color:#f9f9f9; color:#000;&amp;quot;&amp;gt;&lt;br /&gt;
    &amp;lt;div style=&amp;quot;margin:0.4em; text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;div style=&amp;quot;font-size:160%; padding:.1em;&amp;quot;&amp;gt;Past Projects&amp;lt;/div&amp;gt;&lt;br /&gt;
        &amp;lt;/div&amp;gt;&lt;br /&gt;
    &amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;clear: both;&amp;quot;&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div id=&amp;quot;ggw_row2&amp;quot; style=&amp;quot;display: flex; flex-flow: row wrap; justify-content: space-between; padding: 0; margin: 0 -5px 0 -5px;&amp;quot;&amp;gt;&lt;br /&gt;
    &amp;lt;div style=&amp;quot;flex: 1; margin: 5px; min-width: 210px; border: 1px solid #CCC;	padding: 0 10px 10px 10px; box-shadow: 0 2px 2px rgba(0,0,0,0.1); background: #f5faff;&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;h3&amp;gt;[https://hivelab.tst.biochemistry.gwu.edu/gfkb Gut Microbiome Analytic System (Microbiome)]&amp;lt;/h3&amp;gt;&lt;br /&gt;
        &amp;lt;div style=&amp;quot;border-top: 1px solid #CCC; padding-top: 0.5em;&amp;quot;&amp;gt;&lt;br /&gt;
The HIVE team received NSF funding to develop a Gut Microbiome Monitoring System (GutFeeling), a tool which, when used over time, allows users to adjust their dietary habits (such as consumption of probiotics and prebiotics) and other lifestyle habits and helps restore their normal microbiome. Rapid analysis of the large amount of metagenomic data, a major bottleneck, has been resolved by our group through the development of a novel algorithm and accompanying software called CensuScope. Through analysis of healthy gut microbiome data, we are actively developing a knowledge base (GutFeelingKB) that not only provides a clearer picture of an ideal personalized microbiome but also establishes baseline characteristics for each customer. The Mazumder Lab is collaborating with the Milken School of Public Health and the Kamtek Sequencing Facility to investigate the relationship between bacterial species commonly present in the digestive tract, diet, physical activity, lifestyle habits, and metabolic risk factors. [https://docs.google.com/document/d/18WyVTJrrf-FR0sHt634vO8Lwel-4OQxP9sNar7gYYro/edit?tab=t.0#heading=h.7qbm3f7lky31 More here].&lt;br /&gt;
        &amp;lt;/div&amp;gt;&lt;br /&gt;
    &amp;lt;/div&amp;gt;&lt;br /&gt;
	&lt;br /&gt;
    &amp;lt;div style=&amp;quot;flex: 1; margin: 5px; min-width: 210px; border: 1px solid #CCC;	padding: 0 10px 10px 10px; box-shadow: 0 2px 2px rgba(0,0,0,0.1); background: #f5faff;&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;h3&amp;gt;HIVE-EQAPOL Project on HIVE NGS Data Processing and Analysis&amp;lt;/h3&amp;gt;&lt;br /&gt;
        &amp;lt;div style=&amp;quot;border-top: 1px solid #CCC; padding-top: 0.5em;&amp;quot;&amp;gt;&lt;br /&gt;
For this project, our group works closely with the External Quality Assurance Program Oversight Laboratory (EQAPOL) team to conduct HIV NGS data analysis and to collaborate on analyzing, storing, and tracking HIV NGS data. Reliable identification of strains is critical for developing new assays, validating assay platforms, assisting regulators in evaluating test kits, monitoring HIV drug resistance, and informing vaccine development. The HIVE tools and platform are used for virus identification, recombination analysis, and clone discovery.&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
    &amp;lt;/div&amp;gt;&lt;br /&gt;
    &amp;lt;div style=&amp;quot;flex: 1; margin: 5px; min-width: 210px; border: 1px solid #CCC;	padding: 0 10px 10px 10px; box-shadow: 0 2px 2px rgba(0,0,0,0.1); background: #f5faff;&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;h3&amp;gt;[https://www.oncomx.org/ OncoMX]&amp;lt;/h3&amp;gt;&lt;br /&gt;
	&amp;lt;div style=&amp;quot;border-top: 1px solid #CCC; padding-top: 0.5em;&amp;quot;&amp;gt;&lt;br /&gt;
The OncoMX mission is to create an integrated cancer mutation and expression resource for exploring cancer biomarkers. OncoMX is a collaboration between the George Washington University (GW), NASA&#039;s Jet Propulsion Laboratory (JPL), the Swiss Institute of Bioinformatics (SIB), and the University of Delaware (UD). The core knowledgebase of OncoMX is derived from the actively maintained BioMuta and BioXpress integrated cancer mutation and expression databases. Normal expression data from Bgee and custom text-mining software augment the cancer data to improve functional interpretation of the reported variants and expression profiles. All data are wrapped into the OncoMX database and web portal and mapped to additional functional information from the NCI Early Detection Research Network (EDRN) and Reactome. It is expected that the large-scale integration of cancer data and supporting information provided by OncoMX, with direct community feedback, will benefit cancer research by improving the synthesis of information and may make earlier detection a reality.&lt;br /&gt;
        &amp;lt;/div&amp;gt;&lt;br /&gt;
    &amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;flex: 1; margin: 5px; min-width: 210px; border: 1px solid #CCC;	padding: 0 10px 10px 10px; box-shadow: 0 2px 2px rgba(0,0,0,0.1); background: #f5faff;&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;h3&amp;gt;[https://hive.biochemistry.gwu.edu/dna.cgi?cmd=main Glycoproteomics Characterization Workflow and Data-Analysis Pipeline for Vaccines and Biosimilars]&amp;lt;/h3&amp;gt;&lt;br /&gt;
	&amp;lt;div style=&amp;quot;border-top: 1px solid #CCC; padding-top: 0.5em;&amp;quot;&amp;gt;&lt;br /&gt;
In this FDA-funded project, we are extending High-performance Integrated Virtual Environment (HIVE) capabilities through the development and integration of software tools and datasets for comparative analysis of glycoproteins. Glycomic analysis has many angles and has been extensively reviewed in recent literature. We propose to rely on the independent development of the glycomics field and to incorporate these approaches into the HIVE pipeline as they mature while we develop a standardized glycoinformatics pipeline that will benefit investigators and regulators at the FDA.&lt;br /&gt;
        &amp;lt;/div&amp;gt;&lt;br /&gt;
    &amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
{{DISPLAYTITLE:&amp;lt;span style=&amp;quot;position: absolute; clip: rect(1px 1px 1px 1px); clip: rect(1px, 1px, 1px, 1px);&amp;quot;&amp;gt;{{FULLPAGENAME}}&amp;lt;/span&amp;gt;}}&lt;br /&gt;
__NOTOC__&lt;br /&gt;
&amp;lt;!-- BANNER ACROSS TOP OF PAGE --&amp;gt;&lt;br /&gt;
&amp;lt;div id=&amp;quot;ggw-topbanner&amp;quot; style=&amp;quot;clear:both; position:relative; box-sizing:border-box; width:100%; margin:1.2em 0 6px; min-width:47em; border:1px solid #ddd; background-color:#f9f9f9; color:#000;&amp;quot;&amp;gt;&lt;br /&gt;
    &amp;lt;div style=&amp;quot;margin:0.4em; text-align:center;&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;div style=&amp;quot;font-size:160%; padding:.1em;&amp;quot;&amp;gt;RESOURCES&amp;lt;/div&amp;gt;&lt;br /&gt;
        &amp;lt;/div&amp;gt;&lt;br /&gt;
    &amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;clear: both;&amp;quot;&amp;gt;&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;clear: both;&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;div style=&amp;quot;border-top: 1px solid #CCC; padding-top: 0.5em;&amp;quot;&amp;gt;&amp;lt;div id=&amp;quot;ggw_row3&amp;quot; style=&amp;quot;display: flex; flex-flow: row wrap; justify-content: space-between; padding: 0; margin: 0 -5px 0 -5px;&amp;quot;&amp;gt;&lt;br /&gt;
    &amp;lt;div style=&amp;quot;flex: 1; margin: 5px; min-width: 210px; border: 1px solid #CCC;	padding: 0 10px 10px 10px; box-shadow: 0 2px 2px rgba(0,0,0,0.1); background: #f5faff;&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;h3&amp;gt;[[Tool Resources]]&amp;lt;/h3&amp;gt;&lt;br /&gt;
        &amp;lt;div style=&amp;quot;border-top: 1px solid #CCC; padding-top: 0.5em;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;amp;nbsp;&amp;amp;nbsp;&amp;amp;nbsp;&amp;amp;nbsp;&#039;&#039;&#039;&#039;&#039;Main article:&#039;&#039;&#039; [[Tool Resources]]&#039;&#039;&amp;lt;br&amp;gt;There are a variety of bioinformatic tool resources developed by our team.&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
    &amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;clear: both;&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;div style=&amp;quot;border-top: 1px solid #CCC; padding-top: 0.5em;&amp;quot;&amp;gt;&amp;lt;div id=&amp;quot;ggw_row3&amp;quot; style=&amp;quot;display: flex; flex-flow: row wrap; justify-content: space-between; padding: 0; margin: 0 -5px 0 -5px;&amp;quot;&amp;gt;&lt;br /&gt;
    &amp;lt;div style=&amp;quot;flex: 1; margin: 5px; min-width: 210px; border: 1px solid #CCC;	padding: 0 10px 10px 10px; box-shadow: 0 2px 2px rgba(0,0,0,0.1); background: #f5faff;&amp;quot;&amp;gt;&lt;br /&gt;
        &amp;lt;h3&amp;gt;[[Dataset Resources]]&amp;lt;/h3&amp;gt;&lt;br /&gt;
        &amp;lt;div style=&amp;quot;border-top: 1px solid #CCC; padding-top: 0.5em;&amp;quot;&amp;gt;&lt;br /&gt;
&amp;amp;nbsp;&amp;amp;nbsp;&amp;amp;nbsp;&amp;amp;nbsp;&#039;&#039;&#039;&#039;&#039;Main article:&#039;&#039;&#039; [[Dataset Resources]]&#039;&#039;&amp;lt;br&amp;gt;There are a variety of bioinformatic dataset resources integrated by our team.&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
    &amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_WIKI&amp;diff=991</id>
		<title>FDA-ARGOS WIKI</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_WIKI&amp;diff=991"/>
		<updated>2025-09-08T17:19:14Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: /* Project Publications */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
FDA-ARGOS database updates may help researchers rapidly validate diagnostic tests and use qualified genetic sequences to support future product development. NCBI BioProject [https://www.ncbi.nlm.nih.gov/bioproject/231221 PRJNA231221]&lt;br /&gt;
&lt;br /&gt;
As of September 2021, Embleema and George Washington University have been conducting bioinformatic research and system development, focusing on expanding the FDA-ARGOS database. This project expands datasets publicly available in FDA-ARGOS, improves quality control by developing quality matrix tools and scoring approaches that will allow the mining of public sequence databases, and identifies high-quality sequences for upload to the FDA-ARGOS database as regulatory-grade sequences. Building on expansions during the COVID-19 pandemic, this project aims to further improve the utility of the FDA-ARGOS database as a key tool for medical countermeasure development and validation.&lt;br /&gt;
&lt;br /&gt;
For additional details on project information and assembly QC see&lt;br /&gt;
&lt;br /&gt;
* [https://www.fda.gov/emergency-preparedness-and-response/preparedness-research/expanding-next-generation-sequencing-tools-support-pandemic-preparedness-and-response FDA-ARGOS Project Information from FDA Announcements]&lt;br /&gt;
* [https://data.argosdb.org/ ARGOS Database]&lt;br /&gt;
* [[About Argos DataBase]]&lt;br /&gt;
* [https://www.fda.gov/medical-devices/science-and-research-medical-devices/database-reference-grade-microbial-sequences-fda-argos FDA Statement for FDA-ARGOS Database]&lt;br /&gt;
* [[ARGOS Contact Us]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;FDA-ARGOS Initial Phase&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In May 2014, the FDA and collaborators established a publicly available database for Reference Grade microbial Sequences called FDA-ARGOS. With funding support from FDA&#039;s Office of Counterterrorism and Emerging Threats (OCET) and the DoD, the FDA-ARGOS team initially collected and sequenced 2000 microbes, including biothreat microorganisms, common clinical pathogens, and closely related species. At the beginning of this project, the FDA-ARGOS microbial genomes were generated in 3 phases. Generally:&lt;br /&gt;
&lt;br /&gt;
* Phase 1 entailed collection of a previously identified microbe and nucleic acid extraction.&lt;br /&gt;
* In Phase 2, the microbial nucleic acids were sequenced and de novo assembled using Illumina and PacBio sequencing platforms at the Institute for Genome Sciences at the University of Maryland (UMD-IGS).&lt;br /&gt;
* In Phase 3, the assembled genomes were vetted by an ID-NGS subject matter expert working group consisting of FDA personnel and collaborators, and the data were then deposited in the NCBI databases.&lt;br /&gt;
&lt;br /&gt;
The FDA-ARGOS genomes meet the quality metrics for reference-grade genomes for regulatory use. FDA-ARGOS reference genomes have been de novo assembled with high depth of base coverage and placed within a pre-established phylogenetic tree. Each microbial isolate in the database is covered at a minimum of 20X over 95 percent of the assembled core genome. Furthermore, sample-specific metadata, raw reads, assemblies, annotation, and details of the bioinformatics pipeline are available.&lt;br /&gt;
&lt;br /&gt;
== FDA-ARGOS Database ==&lt;br /&gt;
&lt;br /&gt;
The FDA-ARGOS database ([https://data.argosdb.org/ data.argosdb.org]) is a public database containing quality-controlled reference genomes for diagnostic and regulatory purposes.&lt;br /&gt;
&lt;br /&gt;
* FDA-ARGOS NCBI BioProject can be found [https://www.ncbi.nlm.nih.gov/bioproject/231221 here].&lt;br /&gt;
* [[Additional ARGOS Reviewed Organisms]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For a visual display of data analytics, view the prototype dashboard [https://studyanalytics.embleema.com/superset/dashboard/argos/?standalone=2&amp;amp;expand_filters=0 here]. This is a static webpage that will be updated with each push.&lt;br /&gt;
&lt;br /&gt;
== FDA-ARGOS FAQs ==&lt;br /&gt;
Frequently asked questions about FDA-ARGOS can be found [[FDA-ARGOS FAQs|here]]. If there are any further questions, feel free to [[ARGOS Contact Us|contact us]].&lt;br /&gt;
&lt;br /&gt;
== ARGOS Usage Tutorials and Protocols ==&lt;br /&gt;
&lt;br /&gt;
The ARGOS One-Click Pipeline is used to create the QC metrics and results displayed in our data tables on data.argosdb.org. The ARGOS One-Click Pipeline Usage tutorial can be found [[ARGOSQC Usage Tutorial|here]]. &lt;br /&gt;
== Project Publications ==&lt;br /&gt;
&lt;br /&gt;
*Sichtig, H., Minogue, T., Yan, Y. &#039;&#039;et al.&#039;&#039; FDA-ARGOS is a database with public quality-controlled reference genomes for diagnostic use and regulatory science. &#039;&#039;Nat Commun&#039;&#039; 10, 3313 (2019). https://doi.org/10.1038/s41467-019-11306-6&lt;br /&gt;
*Coming Soon: Journal Publication highlighting our QC metrics and data dictionary for the ARGOS Database&amp;lt;br /&amp;gt;&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_WIKI&amp;diff=990</id>
		<title>FDA-ARGOS WIKI</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_WIKI&amp;diff=990"/>
		<updated>2025-09-08T17:15:58Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: /* FDA-ARGOS Database */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
FDA-ARGOS database updates may help researchers rapidly validate diagnostic tests and use qualified genetic sequences to support future product development. NCBI BioProject [https://www.ncbi.nlm.nih.gov/bioproject/231221 PRJNA231221]&lt;br /&gt;
&lt;br /&gt;
As of September 2021, Embleema and George Washington University have been conducting bioinformatic research and system development, focusing on expanding the FDA-ARGOS database. This project expands datasets publicly available in FDA-ARGOS, improves quality control by developing quality matrix tools and scoring approaches that will allow the mining of public sequence databases, and identifies high-quality sequences for upload to the FDA-ARGOS database as regulatory-grade sequences. Building on expansions during the COVID-19 pandemic, this project aims to further improve the utility of the FDA-ARGOS database as a key tool for medical countermeasure development and validation.&lt;br /&gt;
&lt;br /&gt;
For additional details on project information and assembly QC see&lt;br /&gt;
&lt;br /&gt;
* [https://www.fda.gov/emergency-preparedness-and-response/preparedness-research/expanding-next-generation-sequencing-tools-support-pandemic-preparedness-and-response FDA-ARGOS Project Information from FDA Announcements]&lt;br /&gt;
* [https://data.argosdb.org/ ARGOS Database]&lt;br /&gt;
* [[About Argos DataBase]]&lt;br /&gt;
* [https://www.fda.gov/medical-devices/science-and-research-medical-devices/database-reference-grade-microbial-sequences-fda-argos FDA Statement for FDA-ARGOS Database]&lt;br /&gt;
* [[ARGOS Contact Us]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;FDA-ARGOS Initial Phase&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In May 2014, the FDA and collaborators established a publicly available database for Reference Grade microbial Sequences called FDA-ARGOS. With funding support from FDA&#039;s Office of Counterterrorism and Emerging Threats (OCET) and the DoD, the FDA-ARGOS team initially collected and sequenced 2000 microbes, including biothreat microorganisms, common clinical pathogens, and closely related species. At the beginning of this project, the FDA-ARGOS microbial genomes were generated in 3 phases. Generally:&lt;br /&gt;
&lt;br /&gt;
* Phase 1 entailed collection of a previously identified microbe and nucleic acid extraction.&lt;br /&gt;
* In Phase 2, the microbial nucleic acids were sequenced and de novo assembled using Illumina and PacBio sequencing platforms at the Institute for Genome Sciences at the University of Maryland (UMD-IGS).&lt;br /&gt;
* In Phase 3, the assembled genomes were vetted by an ID-NGS subject matter expert working group consisting of FDA personnel and collaborators, and the data were then deposited in the NCBI databases.&lt;br /&gt;
&lt;br /&gt;
The FDA-ARGOS genomes meet the quality metrics for reference-grade genomes for regulatory use. FDA-ARGOS reference genomes have been de novo assembled with high depth of base coverage and placed within a pre-established phylogenetic tree. Each microbial isolate in the database is covered at a minimum of 20X over 95 percent of the assembled core genome. Furthermore, sample-specific metadata, raw reads, assemblies, annotation, and details of the bioinformatics pipeline are available.&lt;br /&gt;
&lt;br /&gt;
== FDA-ARGOS Database ==&lt;br /&gt;
&lt;br /&gt;
The FDA-ARGOS database ([https://data.argosdb.org/ data.argosdb.org]) is a public database containing quality-controlled reference genomes for diagnostic and regulatory purposes.&lt;br /&gt;
&lt;br /&gt;
* FDA-ARGOS NCBI BioProject can be found [https://www.ncbi.nlm.nih.gov/bioproject/231221 here].&lt;br /&gt;
* [[Additional ARGOS Reviewed Organisms]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For a visual display of data analytics, view the prototype dashboard [https://studyanalytics.embleema.com/superset/dashboard/argos/?standalone=2&amp;amp;expand_filters=0 here]. This is a static webpage that will be updated with each push.&lt;br /&gt;
&lt;br /&gt;
== FDA-ARGOS FAQs ==&lt;br /&gt;
Frequently asked questions about FDA-ARGOS can be found [[FDA-ARGOS FAQs|here]]. If there are any further questions, feel free to [[ARGOS Contact Us|contact us]].&lt;br /&gt;
&lt;br /&gt;
== ARGOS Usage Tutorials and Protocols ==&lt;br /&gt;
&lt;br /&gt;
The ARGOS One-Click Pipeline is used to create the QC metrics and results displayed in our data tables on data.argosdb.org. The ARGOS One-Click Pipeline Usage tutorial can be found [[ARGOSQC Usage Tutorial|here]]. &lt;br /&gt;
== Project Publications ==&lt;br /&gt;
&lt;br /&gt;
*Sichtig, H., Minogue, T., Yan, Y. &#039;&#039;et al.&#039;&#039; FDA-ARGOS is a database with public quality-controlled reference genomes for diagnostic use and regulatory science. &#039;&#039;Nat Commun&#039;&#039; 10, 3313 (2019). https://doi.org/10.1038/s41467-019-11306-6&amp;lt;br /&amp;gt;&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_WIKI&amp;diff=989</id>
		<title>FDA-ARGOS WIKI</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_WIKI&amp;diff=989"/>
		<updated>2025-09-08T17:15:14Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: /* ARGOS Usage Tutorials and Protocols */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
FDA-ARGOS database updates may help researchers rapidly validate diagnostic tests and use qualified genetic sequences to support future product development. NCBI BioProject [https://www.ncbi.nlm.nih.gov/bioproject/231221 PRJNA231221]&lt;br /&gt;
&lt;br /&gt;
As of September 2021, Embleema and George Washington University have been conducting bioinformatic research and system development, focusing on expanding the FDA-ARGOS database. This project expands datasets publicly available in FDA-ARGOS, improves quality control by developing quality matrix tools and scoring approaches that will allow the mining of public sequence databases, and identifies high-quality sequences for upload to the FDA-ARGOS database as regulatory-grade sequences. Building on expansions during the COVID-19 pandemic, this project aims to further improve the utility of the FDA-ARGOS database as a key tool for medical countermeasure development and validation.&lt;br /&gt;
&lt;br /&gt;
For additional details on project information and assembly QC see&lt;br /&gt;
&lt;br /&gt;
* [https://www.fda.gov/emergency-preparedness-and-response/preparedness-research/expanding-next-generation-sequencing-tools-support-pandemic-preparedness-and-response FDA-ARGOS Project Information from FDA Announcements]&lt;br /&gt;
* [https://data.argosdb.org/ ARGOS Database]&lt;br /&gt;
* [[About Argos DataBase]]&lt;br /&gt;
* [https://www.fda.gov/medical-devices/science-and-research-medical-devices/database-reference-grade-microbial-sequences-fda-argos FDA Statement for FDA-ARGOS Database]&lt;br /&gt;
* [[ARGOS Contact Us]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;FDA-ARGOS Initial Phase&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In May 2014, the FDA and collaborators established a publicly available database for Reference Grade microbial Sequences called FDA-ARGOS. With funding support from FDA&#039;s Office of Counterterrorism and Emerging Threats (OCET) and the DoD, the FDA-ARGOS team initially collected and sequenced 2000 microbes, including biothreat microorganisms, common clinical pathogens, and closely related species. At the beginning of this project, the FDA-ARGOS microbial genomes were generated in 3 phases. Generally:&lt;br /&gt;
&lt;br /&gt;
* Phase 1 entailed collection of a previously identified microbe and nucleic acid extraction.&lt;br /&gt;
* In Phase 2, the microbial nucleic acids were sequenced and de novo assembled using Illumina and PacBio sequencing platforms at the Institute for Genome Sciences at the University of Maryland (UMD-IGS).&lt;br /&gt;
* In Phase 3, the assembled genomes were vetted by an ID-NGS subject matter expert working group consisting of FDA personnel and collaborators, and the data were then deposited in the NCBI databases.&lt;br /&gt;
&lt;br /&gt;
The FDA-ARGOS genomes meet the quality metrics for reference-grade genomes for regulatory use. FDA-ARGOS reference genomes have been de novo assembled with high depth of base coverage and placed within a pre-established phylogenetic tree. Each microbial isolate in the database is covered at a minimum of 20X over 95 percent of the assembled core genome. Furthermore, sample-specific metadata, raw reads, assemblies, annotation, and details of the bioinformatics pipeline are available.&lt;br /&gt;
&lt;br /&gt;
== FDA-ARGOS Database ==&lt;br /&gt;
&lt;br /&gt;
The FDA-ARGOS database ([https://data.argosdb.org/ data.argosdb.org]) is a public database containing quality-controlled reference genomes for diagnostic and regulatory purposes.&lt;br /&gt;
&lt;br /&gt;
* FDA-ARGOS NCBI BioProject can be found [https://www.ncbi.nlm.nih.gov/bioproject/231221 here].&lt;br /&gt;
* [[Additional ARGOS Reviewed Organisms]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For a visual display of data analytics, view the prototype dashboard [https://studyanalytics.embleema.com/superset/dashboard/argos/?standalone=2&amp;amp;expand_filters=0 here].&lt;br /&gt;
&lt;br /&gt;
== FDA-ARGOS FAQs ==&lt;br /&gt;
Frequently asked questions about FDA-ARGOS can be found [[FDA-ARGOS FAQs|here]]. If there are any further questions, feel free to [[ARGOS Contact Us|contact us]].&lt;br /&gt;
&lt;br /&gt;
== ARGOS Usage Tutorials and Protocols ==&lt;br /&gt;
&lt;br /&gt;
The ARGOS One-Click Pipeline is used to create the QC metrics and results displayed in our data tables on data.argosdb.org. The ARGOS One-Click Pipeline Usage tutorial can be found [[ARGOSQC Usage Tutorial|here]]. &lt;br /&gt;
== Project Publications ==&lt;br /&gt;
&lt;br /&gt;
*Sichtig, H., Minogue, T., Yan, Y. &#039;&#039;et al.&#039;&#039; FDA-ARGOS is a database with public quality-controlled reference genomes for diagnostic use and regulatory science. &#039;&#039;Nat Commun&#039;&#039; 10, 3313 (2019). https://doi.org/10.1038/s41467-019-11306-6&amp;lt;br /&amp;gt;&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_WIKI&amp;diff=988</id>
		<title>FDA-ARGOS WIKI</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_WIKI&amp;diff=988"/>
		<updated>2025-09-08T17:13:20Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
FDA-ARGOS database updates may help researchers rapidly validate diagnostic tests and use qualified genetic sequences to support future product development. NCBI BioProject [https://www.ncbi.nlm.nih.gov/bioproject/231221 PRJNA231221]&lt;br /&gt;
&lt;br /&gt;
As of September 2021, Embleema and George Washington University have been conducting bioinformatic research and system development, focusing on expanding the FDA-ARGOS database. This project expands datasets publicly available in FDA-ARGOS, improves quality control by developing quality matrix tools and scoring approaches that will allow the mining of public sequence databases, and identifies high-quality sequences for upload to the FDA-ARGOS database as regulatory-grade sequences. Building on expansions during the COVID-19 pandemic, this project aims to further improve the utility of the FDA-ARGOS database as a key tool for medical countermeasure development and validation.&lt;br /&gt;
&lt;br /&gt;
For additional details on project information and assembly QC see&lt;br /&gt;
&lt;br /&gt;
* [https://www.fda.gov/emergency-preparedness-and-response/preparedness-research/expanding-next-generation-sequencing-tools-support-pandemic-preparedness-and-response FDA-ARGOS Project Information from FDA Announcements]&lt;br /&gt;
* [https://data.argosdb.org/ ARGOS Database]&lt;br /&gt;
* [[About Argos DataBase]]&lt;br /&gt;
* [https://www.fda.gov/medical-devices/science-and-research-medical-devices/database-reference-grade-microbial-sequences-fda-argos FDA Statement for FDA-ARGOS Database]&lt;br /&gt;
* [[ARGOS Contact Us]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;FDA-ARGOS Initial Phase&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In May 2014, the FDA and collaborators established a publicly available database of reference-grade microbial sequences called FDA-ARGOS. With funding support from FDA&#039;s Office of Counterterrorism and Emerging Threats (OCET) and the DoD, the FDA-ARGOS team initially collected and sequenced 2,000 microbes, including biothreat microorganisms, common clinical pathogens, and closely related species. At the beginning of this project, the FDA-ARGOS microbial genomes were generated in three phases:&lt;br /&gt;
&lt;br /&gt;
* Phase 1: collection of a previously identified microbe and nucleic acid extraction.&lt;br /&gt;
* Phase 2: sequencing and de novo assembly of the microbial nucleic acids using Illumina and PacBio sequencing platforms at the Institute for Genome Sciences at the University of Maryland (UMD-IGS).&lt;br /&gt;
* Phase 3: vetting of the assembled genomes by an ID-NGS subject matter expert working group of FDA personnel and collaborators, followed by deposition of the data in the NCBI databases.&lt;br /&gt;
&lt;br /&gt;
The FDA-ARGOS genomes meet the quality metrics for reference-grade genomes for regulatory use. FDA-ARGOS reference genomes have been de novo assembled with high depth of base coverage and placed within a pre-established phylogenetic tree. Each microbial isolate in the database is covered at a minimum of 20X over 95 percent of the assembled core genome. Furthermore, sample-specific metadata, raw reads, assemblies, annotation, and details of the bioinformatics pipeline are available.&lt;br /&gt;
&lt;br /&gt;
== FDA-ARGOS Database ==&lt;br /&gt;
&lt;br /&gt;
The FDA-ARGOS database ([https://data.argosdb.org/ data.argosdb.org]) is a public database containing quality-controlled reference genomes for diagnostic and regulatory purposes.&lt;br /&gt;
&lt;br /&gt;
* FDA-ARGOS NCBI BioProject can be found [https://www.ncbi.nlm.nih.gov/bioproject/231221 here].&lt;br /&gt;
* [[Additional ARGOS Reviewed Organisms]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For a visual display of data analytics, view the prototype dashboard [https://studyanalytics.embleema.com/superset/dashboard/argos/?standalone=2&amp;amp;expand_filters=0 here].&lt;br /&gt;
&lt;br /&gt;
== FDA-ARGOS FAQs ==&lt;br /&gt;
Frequently asked questions about FDA-ARGOS can be found [[FDA-ARGOS FAQs|here]]. If there are any further questions, feel free to [[ARGOS Contact Us|contact us]].&lt;br /&gt;
&lt;br /&gt;
== ARGOS Usage Tutorials and Protocols ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Project Publications ==&lt;br /&gt;
&lt;br /&gt;
*Sichtig, H., Minogue, T., Yan, Y. &#039;&#039;et al.&#039;&#039; FDA-ARGOS is a database with public quality-controlled reference genomes for diagnostic use and regulatory science. &#039;&#039;Nat Commun&#039;&#039; 10, 3313 (2019). https://doi.org/10.1038/s41467-019-11306-6&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_WIKI&amp;diff=987</id>
		<title>FDA-ARGOS WIKI</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_WIKI&amp;diff=987"/>
		<updated>2025-09-08T17:12:38Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
FDA-ARGOS database updates may help researchers rapidly validate diagnostic tests and use qualified genetic sequences to support future product development. See NCBI BioProject [https://www.ncbi.nlm.nih.gov/bioproject/231221 PRJNA231221].&lt;br /&gt;
&lt;br /&gt;
As of September 2021, Embleema and George Washington University have been conducting bioinformatic research and system development, focusing on expanding the FDA-ARGOS database. This project expands datasets publicly available in FDA-ARGOS, improves quality control by developing quality matrix tools and scoring approaches that will allow the mining of public sequence databases, and identifies high-quality sequences for upload to the FDA-ARGOS database as regulatory-grade sequences. Building on expansions during the COVID-19 pandemic, this project aims to further improve the utility of the FDA-ARGOS database as a key tool for medical countermeasure development and validation.&lt;br /&gt;
&lt;br /&gt;
For additional details on project information and assembly QC, see:&lt;br /&gt;
&lt;br /&gt;
* [https://www.fda.gov/emergency-preparedness-and-response/preparedness-research/expanding-next-generation-sequencing-tools-support-pandemic-preparedness-and-response FDA-ARGOS Project Information from FDA Announcements]&lt;br /&gt;
* [https://data.argosdb.org/ ARGOS Database]&lt;br /&gt;
* [[About Argos DataBase]]&lt;br /&gt;
* [https://www.fda.gov/medical-devices/science-and-research-medical-devices/database-reference-grade-microbial-sequences-fda-argos FDA Statement for FDA-ARGOS Database]&lt;br /&gt;
* [[ARGOS Contact Us]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;FDA-ARGOS Initial Phase&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In May 2014, the FDA and collaborators established a publicly available database of reference-grade microbial sequences called FDA-ARGOS. With funding support from FDA&#039;s Office of Counterterrorism and Emerging Threats (OCET) and the DoD, the FDA-ARGOS team initially collected and sequenced 2,000 microbes, including biothreat microorganisms, common clinical pathogens, and closely related species. At the beginning of this project, the FDA-ARGOS microbial genomes were generated in three phases:&lt;br /&gt;
&lt;br /&gt;
* Phase 1: collection of a previously identified microbe and nucleic acid extraction.&lt;br /&gt;
* Phase 2: sequencing and de novo assembly of the microbial nucleic acids using Illumina and PacBio sequencing platforms at the Institute for Genome Sciences at the University of Maryland (UMD-IGS).&lt;br /&gt;
* Phase 3: vetting of the assembled genomes by an ID-NGS subject matter expert working group of FDA personnel and collaborators, followed by deposition of the data in the NCBI databases.&lt;br /&gt;
&lt;br /&gt;
The FDA-ARGOS genomes meet the quality metrics for reference-grade genomes for regulatory use. FDA-ARGOS reference genomes have been de novo assembled with high depth of base coverage and placed within a pre-established phylogenetic tree. Each microbial isolate in the database is covered at a minimum of 20X over 95 percent of the assembled core genome. Furthermore, sample-specific metadata, raw reads, assemblies, annotation, and details of the bioinformatics pipeline are available.&lt;br /&gt;
&lt;br /&gt;
== FDA-ARGOS Database ==&lt;br /&gt;
&lt;br /&gt;
The FDA-ARGOS database ([https://data.argosdb.org/ data.argosdb.org]) is a public database containing quality-controlled reference genomes for diagnostic and regulatory purposes.&lt;br /&gt;
&lt;br /&gt;
* FDA-ARGOS NCBI BioProject can be found [https://www.ncbi.nlm.nih.gov/bioproject/231221 here].&lt;br /&gt;
* [[Additional ARGOS Reviewed Organisms]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For a visual display of data analytics, view the prototype dashboard [https://studyanalytics.embleema.com/superset/dashboard/argos/?standalone=2&amp;amp;expand_filters=0 here].&lt;br /&gt;
&lt;br /&gt;
== FDA-ARGOS FAQs ==&lt;br /&gt;
Frequently asked questions about FDA-ARGOS can be found [[FDA-ARGOS FAQs|here]]. If there are any further questions, feel free to [[ARGOS Contact Us|contact us]].&lt;br /&gt;
&lt;br /&gt;
== Project Publications ==&lt;br /&gt;
&lt;br /&gt;
*Sichtig, H., Minogue, T., Yan, Y. &#039;&#039;et al.&#039;&#039; FDA-ARGOS is a database with public quality-controlled reference genomes for diagnostic use and regulatory science. &#039;&#039;Nat Commun&#039;&#039; 10, 3313 (2019). https://doi.org/10.1038/s41467-019-11306-6&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=ARGOS_Contact_Us&amp;diff=986</id>
		<title>ARGOS Contact Us</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=ARGOS_Contact_Us&amp;diff=986"/>
		<updated>2025-09-08T17:11:49Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;To contact us with any inquiries,&lt;br /&gt;
* email mazumder_lab@gwu.edu&lt;br /&gt;
Feel free to reach out about the data published on data.argosdb.org or any other questions you may have.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
To return to the ARGOS Main Page, [[FDA-ARGOS WIKI|click here]].&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_WIKI&amp;diff=985</id>
		<title>FDA-ARGOS WIKI</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_WIKI&amp;diff=985"/>
		<updated>2025-09-08T17:10:44Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: /* Introduction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
FDA-ARGOS database updates may help researchers rapidly validate diagnostic tests and use qualified genetic sequences to support future product development. See NCBI BioProject [https://www.ncbi.nlm.nih.gov/bioproject/231221 PRJNA231221].&lt;br /&gt;
&lt;br /&gt;
As of September 2021, Embleema and George Washington University have been conducting bioinformatic research and system development, focusing on expanding the FDA-ARGOS database. This project expands datasets publicly available in FDA-ARGOS, improves quality control by developing quality matrix tools and scoring approaches that will allow the mining of public sequence databases, and identifies high-quality sequences for upload to the FDA-ARGOS database as regulatory-grade sequences. Building on expansions during the COVID-19 pandemic, this project aims to further improve the utility of the FDA-ARGOS database as a key tool for medical countermeasure development and validation.&lt;br /&gt;
&lt;br /&gt;
For additional details on project information and assembly QC, see:&lt;br /&gt;
&lt;br /&gt;
* [https://www.fda.gov/emergency-preparedness-and-response/preparedness-research/expanding-next-generation-sequencing-tools-support-pandemic-preparedness-and-response FDA-ARGOS Project Information from FDA Announcements]&lt;br /&gt;
* [https://data.argosdb.org/ ARGOS Database]&lt;br /&gt;
* [[About Argos DataBase]]&lt;br /&gt;
* [https://www.fda.gov/medical-devices/science-and-research-medical-devices/database-reference-grade-microbial-sequences-fda-argos FDA Statement for FDA-ARGOS Database]&lt;br /&gt;
* [[ARGOS Contact Us]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;FDA-ARGOS Initial Phase&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In May 2014, the FDA and collaborators established a publicly available database of reference-grade microbial sequences called FDA-ARGOS. With funding support from FDA&#039;s Office of Counterterrorism and Emerging Threats (OCET) and the DoD, the FDA-ARGOS team initially collected and sequenced 2,000 microbes, including biothreat microorganisms, common clinical pathogens, and closely related species. At the beginning of this project, the FDA-ARGOS microbial genomes were generated in three phases:&lt;br /&gt;
&lt;br /&gt;
* Phase 1: collection of a previously identified microbe and nucleic acid extraction.&lt;br /&gt;
* Phase 2: sequencing and de novo assembly of the microbial nucleic acids using Illumina and PacBio sequencing platforms at the Institute for Genome Sciences at the University of Maryland (UMD-IGS).&lt;br /&gt;
* Phase 3: vetting of the assembled genomes by an ID-NGS subject matter expert working group of FDA personnel and collaborators, followed by deposition of the data in the NCBI databases.&lt;br /&gt;
&lt;br /&gt;
The FDA-ARGOS genomes meet the quality metrics for reference-grade genomes for regulatory use. FDA-ARGOS reference genomes have been de novo assembled with high depth of base coverage and placed within a pre-established phylogenetic tree. Each microbial isolate in the database is covered at a minimum of 20X over 95 percent of the assembled core genome. Furthermore, sample-specific metadata, raw reads, assemblies, annotation, and details of the bioinformatics pipeline are available.&lt;br /&gt;
&lt;br /&gt;
== FDA-ARGOS Database ==&lt;br /&gt;
&lt;br /&gt;
The FDA-ARGOS database ([https://data.argosdb.org/ data.argosdb.org]) is a public database containing quality-controlled reference genomes for diagnostic and regulatory purposes.&lt;br /&gt;
&lt;br /&gt;
* FDA-ARGOS NCBI BioProject can be found [https://www.ncbi.nlm.nih.gov/bioproject/231221 here].&lt;br /&gt;
* [[Additional ARGOS Reviewed Organisms]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For a visual display of data analytics, view the prototype dashboard [https://studyanalytics.embleema.com/superset/dashboard/argos/?standalone=2&amp;amp;expand_filters=0 here].&lt;br /&gt;
&lt;br /&gt;
== Project Publications ==&lt;br /&gt;
&lt;br /&gt;
*Sichtig, H., Minogue, T., Yan, Y. &#039;&#039;et al.&#039;&#039; FDA-ARGOS is a database with public quality-controlled reference genomes for diagnostic use and regulatory science. &#039;&#039;Nat Commun&#039;&#039; 10, 3313 (2019). https://doi.org/10.1038/s41467-019-11306-6&lt;br /&gt;
&lt;br /&gt;
== FDA-ARGOS FAQs ==&lt;br /&gt;
Frequently asked questions about FDA-ARGOS can be found [[FDA-ARGOS FAQs|here]]. If there are any further questions, feel free to [[ARGOS Contact Us|contact us]].&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Volunteership_2025&amp;diff=976</id>
		<title>Volunteership 2025</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Volunteership_2025&amp;diff=976"/>
		<updated>2025-08-29T14:12:03Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: /* Volunteers */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;For Fall opportunities, [[Volunteership Fall 2025|click here to view our Fall 2025 Volunteership Program]].&amp;lt;h2&amp;gt;2025 Volunteer Program Details&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h3&amp;gt;Dates&amp;lt;/h3&amp;gt;&lt;br /&gt;
&amp;lt;strong&amp;gt;Volunteer Zoom Kick-Off Meeting&amp;lt;/strong&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
May 27, 2025 | 3:30 to 4:30 PM&lt;br /&gt;
&lt;br /&gt;
&amp;lt;strong&amp;gt;Program Dates: June 2nd, 2025 – July 25th, 2025&amp;lt;/strong&amp;gt; (8 weeks)&amp;lt;br&amp;gt;&lt;br /&gt;
Monday to Friday | Remote | No breaks&lt;br /&gt;
&lt;br /&gt;
&amp;lt;hr&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h3&amp;gt;Volunteer Expectations&amp;lt;/h3&amp;gt;&lt;br /&gt;
&amp;lt;ol&amp;gt;&lt;br /&gt;
  &amp;lt;li&amp;gt;Daily progress updates via Slack (scrum).&amp;lt;/li&amp;gt;&lt;br /&gt;
  &amp;lt;li&amp;gt;Regular Zoom meetings with the assigned project point of contact.&amp;lt;/li&amp;gt;&lt;br /&gt;
  &amp;lt;li&amp;gt;Dedicate 5–6 hours per day to project work, with the remaining time focused on skill development or reading.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ol&amp;gt;&lt;br /&gt;
&amp;lt;p style=&amp;quot;color: red;&amp;quot;&amp;gt;&amp;lt;strong&amp;gt;Important:&amp;lt;/strong&amp;gt; If the scrum is not updated for 2 consecutive days, the candidate will be &amp;lt;u&amp;gt;automatically dropped&amp;lt;/u&amp;gt; from the program.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;hr&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h3&amp;gt;Potential Projects&amp;lt;/h3&amp;gt;&lt;br /&gt;
&amp;lt;ol&amp;gt;&lt;br /&gt;
  &amp;lt;li&amp;gt;BiomarkerKB ([https://biomarkerkb.org biomarkerkb.org]) project: Biomarker curation project. Involves reading papers and collecting biomarkers.&amp;lt;/li&amp;gt;&lt;br /&gt;
  &amp;lt;li&amp;gt;GlyGen ([https://glygen.org glygen.org]) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.&amp;lt;/li&amp;gt;&lt;br /&gt;
  &amp;lt;li&amp;gt;ARGOS ([https://argosdb.org argosdb.org]) project: Analyze genomics data using HIVE to identify reference genome assemblies.&amp;lt;/li&amp;gt;&lt;br /&gt;
  &amp;lt;li&amp;gt;PredictMod ([https://hivelab.biochemistry.gwu.edu/predictmod hivelab.biochemistry.gwu.edu/predictmod]) project: Identify datasets and harmonize them so that they can be used to generate ML models.&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ol&amp;gt;&lt;br /&gt;
&#039;&#039;Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.&#039;&#039;&amp;lt;hr&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== 1. BiomarkerKB Biocuration Project Ideas ====&lt;br /&gt;
POC: Daniall Masood, Maria Kim&lt;br /&gt;
# Curate biomarkers for a specific disease (Alzheimer&#039;s disease)&lt;br /&gt;
## The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.&lt;br /&gt;
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details, using the data collected in the first 4 weeks as training/example data.&lt;br /&gt;
# Top 50 biomarkers&lt;br /&gt;
## Curate the top 50 biomarkers for biomarkerkb.org.&lt;br /&gt;
## Define what constitutes a top 50 biomarker.&lt;br /&gt;
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.&lt;br /&gt;
# Biocuration of biomarkers from NLP/LLM work&lt;br /&gt;
## Use the biomarkers collected from NLP work.&lt;br /&gt;
## Curate biomarkers; the data was not delivered in the biomarker data model format.&lt;br /&gt;
## While curating the biomarkers, check if data collected from NLP is correct.&lt;br /&gt;
## After completion, the student can start using curated data to work on the NLP/LLM method.&lt;br /&gt;
# Curate biomarkers for a treatment&lt;br /&gt;
## See #1 above.&lt;br /&gt;
&lt;br /&gt;
If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check whether it is feasible as a summer project.&lt;br /&gt;
&lt;br /&gt;
==== 2. GlyGen Biocuration Project Ideas ====&lt;br /&gt;
POC: Rene Ranzinger and Urnisha Bhuiyan&lt;br /&gt;
&lt;br /&gt;
Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because much of the valuable metadata (e.g., species, tissue, disease, and cell line annotations) consists of free-text terms that do not align with the established standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, &amp;quot;human,&amp;quot; &amp;quot;man,&amp;quot; and &amp;quot;h. sapiens&amp;quot; all map to the scientific species name &amp;quot;Homo sapiens.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.&lt;br /&gt;
&lt;br /&gt;
The project involves:&lt;br /&gt;
&lt;br /&gt;
# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.&lt;br /&gt;
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.&lt;br /&gt;
# Finding papers based on titles and author lists that may contain spelling errors.&lt;br /&gt;
# Discussing with other curators in cases where terms are mapped differently.&lt;br /&gt;
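The mapping step above can be sketched as a small lookup table. This is a minimal illustration, assuming a hand-built synonym dictionary; the entries and the normalize_species helper are hypothetical, and a real workflow would draw mappings from curators and ontologies such as NCBI Taxonomy.

```python
# Hypothetical synonym table mapping free-text metadata terms to
# standard scientific names (illustration only; real mappings would
# come from curators and standard ontologies).
SPECIES_SYNONYMS = {
    "human": "Homo sapiens",
    "man": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
}

def normalize_species(term):
    # Normalize case and surrounding whitespace before lookup; return
    # None for unmapped terms so a curator can review them manually.
    return SPECIES_SYNONYMS.get(term.strip().lower())
```

Unmapped terms deliberately return None rather than a guess, so ambiguous entries are routed to a human curator instead of being silently mis-mapped.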
&lt;br /&gt;
If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.&lt;br /&gt;
&lt;br /&gt;
==== 3. GlyGen Publication Analysis Project Ideas ====&lt;br /&gt;
&lt;br /&gt;
POC: Rene Ranzinger and Urnisha Bhuiyan&lt;br /&gt;
&lt;br /&gt;
One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.&lt;br /&gt;
&lt;br /&gt;
The project involves:&lt;br /&gt;
&lt;br /&gt;
# Using the PubMed web API to filter publications based on keywords.&lt;br /&gt;
# Analyzing paper abstracts to identify research institutions and groups that form the community.&lt;br /&gt;
# Filtering the community list to exclude unrelated co-authors.&lt;br /&gt;
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.&lt;br /&gt;
&lt;br /&gt;
A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.&lt;br /&gt;
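As a rough illustration of step 1, keyword filtering could be built on the NCBI E-utilities esearch endpoint for PubMed. This is a hedged sketch, not project code: the function names, the Title/Abstract query form, and the dict-with-abstract record format are assumptions for illustration.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint for PubMed.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(keywords, retmax=100):
    # OR the keywords together as Title/Abstract queries and return the
    # full esearch URL; fetching and JSON parsing are left to the caller.
    term = " OR ".join("{}[Title/Abstract]".format(k) for k in keywords)
    query = urlencode({"db": "pubmed", "term": term,
                       "retmode": "json", "retmax": retmax})
    return ESEARCH_URL + "?" + query

def filter_by_abstract(records, keywords):
    # Keep records whose abstract mentions any keyword, case-insensitively;
    # each record is assumed to be a dict with an "abstract" field.
    kws = [k.lower() for k in keywords]
    return [r for r in records
            if any(k in r.get("abstract", "").lower() for k in kws)]
```

Separating URL construction from abstract filtering keeps the network-dependent part small and makes the filtering logic easy to test on downloaded records.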
&lt;br /&gt;
==== 4. PredictMod Machine Learning Project Ideas ====&lt;br /&gt;
POC: Lori Krammer&lt;br /&gt;
&lt;br /&gt;
Data Identification &amp;amp; Curation: &lt;br /&gt;
&lt;br /&gt;
# Identify publicly-available datasets from scientific literature that can be used for intervention outcome prediction models.&lt;br /&gt;
# Curate indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.&lt;br /&gt;
&lt;br /&gt;
Modeling &amp;amp; Integration (for those with experience in programming/ML)&lt;br /&gt;
&lt;br /&gt;
# Conduct data harmonization and pre-processing following established project pipelines to produce an ML-ready dataset and data dictionary.&lt;br /&gt;
# Perform model training and document ML pipeline in a BioCompute Object (BCO).&lt;br /&gt;
# Integrate model into PredictMod platform.&lt;br /&gt;
&lt;br /&gt;
Individuals with a background or interest in machine learning should reach out to lorikrammer@gwu.edu with a potential dataset to determine if it is a feasible project for the summer.&lt;br /&gt;
&lt;br /&gt;
==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====&lt;br /&gt;
&lt;br /&gt;
POC: Christie Woodside, Jonathon Keeney&lt;br /&gt;
&lt;br /&gt;
# Update data tables for more efficient computations&lt;br /&gt;
## The student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work requiring high attention to detail. ~1 week of work.&lt;br /&gt;
## Requires a Python/shell coding background. The student would run scripts that prepare and format data tables pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.&lt;br /&gt;
# Curate and report on current pathogens to upload to ARGOS&lt;br /&gt;
## The student would work on manual curation of circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports on what was found. ~4–10 weeks of work.&lt;br /&gt;
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.&lt;br /&gt;
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.&lt;br /&gt;
&lt;br /&gt;
If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check whether it is feasible as a summer project.&amp;lt;hr&amp;gt;&lt;br /&gt;
&amp;lt;h3&amp;gt;Requirements for Completion&amp;lt;/h3&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&amp;lt;strong&amp;gt;Note:&amp;lt;/strong&amp;gt; The following are &amp;lt;u&amp;gt;mandatory&amp;lt;/u&amp;gt;. Failure to complete any will result in an incomplete volunteer record.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h4&amp;gt;Documentation&amp;lt;/h4&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h4&amp;gt;Written Report&amp;lt;/h4&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h4&amp;gt;Presentation &amp;amp; Slide Submission&amp;lt;/h4&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;Present your work during the last week of the 8-week period.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;Slides must be submitted to the Admin Team and should include:&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
  &amp;lt;li&amp;gt;See Symposium Slides Guidelines below&amp;lt;/li&amp;gt;&amp;lt;/ul&amp;gt;&lt;br /&gt;
Contact the Admin Team to access previously submitted slides.&lt;br /&gt;
&amp;lt;hr&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Completion Certificate ===&lt;br /&gt;
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.&lt;br /&gt;
&amp;lt;hr&amp;gt;&lt;br /&gt;
=== Contact ===&lt;br /&gt;
mazumder_lab@gwu.edu.&lt;br /&gt;
&amp;lt;hr&amp;gt;&lt;br /&gt;
=== Volunteers ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+&lt;br /&gt;
|-&lt;br /&gt;
! Name&lt;br /&gt;
!Project&lt;br /&gt;
!Projects Interested&lt;br /&gt;
|-&lt;br /&gt;
| [https://www.linkedin.com/in/gracesjchong/ Grace Chong]&lt;br /&gt;
|PredictMod&lt;br /&gt;
|&lt;br /&gt;
# PredictMod&lt;br /&gt;
# BiomarkerKB Biocuration&lt;br /&gt;
# GlyGen Biocuration&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/alma-ogunsina-4959072b1/ Alma Ogunsina]&lt;br /&gt;
|Biomarker curation&lt;br /&gt;
|&lt;br /&gt;
# BiomarkerKB&lt;br /&gt;
# ARGOS&lt;br /&gt;
# PredictMod&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/diya-kamalabharathy-62557935a/ Diya Kamalabharathy]&lt;br /&gt;
|PredictMod&lt;br /&gt;
|&lt;br /&gt;
# BiomarkerKB Biocuration&lt;br /&gt;
# PredictMod Machine Learning&lt;br /&gt;
# GlyGen Biocuration&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/harivinay-prasad-reddy-gujjula-a06ba71bb/ Harivinay P. Gujjula]&lt;br /&gt;
|GlyGen curation&lt;br /&gt;
|&lt;br /&gt;
# GlyGen Biocuration&lt;br /&gt;
# BioMarkerKB Biocuration&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;Arhamur Rauf&#039;&#039;&#039;&lt;br /&gt;
|ARGOS&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/miao-wang-88b602290/ Miao Wang]&lt;br /&gt;
|ARGOS&lt;br /&gt;
|&lt;br /&gt;
# BiomarkerKB Biocuration Project Ideas&lt;br /&gt;
# FDA-ARGOS Computation and Pathogen Curation Project&lt;br /&gt;
# PredictMod Machine Learning Project Ideas&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/nahom-gebreselassie-1545ab336/ Nahom Abel]&lt;br /&gt;
|GlyGen curation&lt;br /&gt;
|&lt;br /&gt;
# BiomarkerKB Biocuration&lt;br /&gt;
# GlyGen Biocuration&lt;br /&gt;
# PredictMod&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/kajal-patel-cs/ Kajal Sanjaykumar Patel]&lt;br /&gt;
|GlyGen and PubMed project&lt;br /&gt;
|&lt;br /&gt;
#PredictMod&lt;br /&gt;
#BiomarkerKB&lt;br /&gt;
#GlyGen&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/john-mccaffrey-b8850930a/ John McCaffrey]&lt;br /&gt;
|Biomarker curation&lt;br /&gt;
|&lt;br /&gt;
# PredictMod&lt;br /&gt;
# BiomarkerKB&lt;br /&gt;
# GlyGen Biocuration &lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/nathan-ressom/ Nathan Ressom]&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|&lt;br /&gt;
# PredictMod &lt;br /&gt;
# GlyGen Biocuration&lt;br /&gt;
# BiomarkerKB Biocuration&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/aaron-ressom/ Aaron Ressom] &lt;br /&gt;
|PredictMod&lt;br /&gt;
|&lt;br /&gt;
# BiomarkerKB &lt;br /&gt;
# PredictMod &lt;br /&gt;
# GlyGen&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/akale-kinfe/ Akale Kinfe]&lt;br /&gt;
|Biomarker curation&lt;br /&gt;
|&lt;br /&gt;
# BiomarkerKB Biocuration&lt;br /&gt;
# GlyGen Biocuration&lt;br /&gt;
# ARGOS&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/aise-arpinar-a8bb9b373/?original_referer= Aise Arpinar]&lt;br /&gt;
|GlyGen curation&lt;br /&gt;
|&lt;br /&gt;
# GlyGen Biocuration&lt;br /&gt;
# BiomarkerKB Biocuration&lt;br /&gt;
# GlyGen Publication Analysis&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/piyush-pandey-906b582b5/ Piyush Pandey]&lt;br /&gt;
|Biomarker curation&lt;br /&gt;
|&lt;br /&gt;
# BiomarkerKB Biocuration &lt;br /&gt;
# PredictMod &lt;br /&gt;
# GlyGen Biocuration &lt;br /&gt;
|-&lt;br /&gt;
|[http://www.linkedin.com/in/filmawit-zeru-203272363 Filmawit Zeru]&lt;br /&gt;
|GlycoSiteMiner project&lt;br /&gt;
|&lt;br /&gt;
# BiomarkerKB&lt;br /&gt;
# GlyGen&lt;br /&gt;
# ARGOS&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/mathias-belay-03b51a2a3/ Mathias Belay]&lt;br /&gt;
|Biomarker curation&lt;br /&gt;
|&lt;br /&gt;
# GlyGen&lt;br /&gt;
# PredictMod&lt;br /&gt;
# BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/isaac-kim-b644bb231/ Isaac Kim]&lt;br /&gt;
|Biomarker curation&lt;br /&gt;
|&lt;br /&gt;
# BiomarkerKB&lt;br /&gt;
# PredictMod&lt;br /&gt;
# GlyGen&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/sohana-bahl-6549a2376/ Sohana Bahl]&lt;br /&gt;
|Biomarker curation&lt;br /&gt;
|&lt;br /&gt;
# BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/ana-vohralikova-794a4433a?utm_source=share&amp;amp;utm_campaign=share_via&amp;amp;utm_content=profile&amp;amp;utm_medium=ios_app Ana Vohralikova]&lt;br /&gt;
|Biomarker curation&lt;br /&gt;
|&lt;br /&gt;
# BiomarkerKB Biocuration Project&lt;br /&gt;
# GlyGen Biocuration Project&lt;br /&gt;
# FDA-ARGOS Computation and Pathogen&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;hr&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Symposium Slide Guidelines ===&lt;br /&gt;
&#039;&#039;&#039;Content Clarity&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Keep It Simple:&#039;&#039;&#039; Use concise bullet points instead of long paragraphs. Aim for no more than 6 bullet points per slide.&lt;br /&gt;
* &#039;&#039;&#039;Focus on Key Points:&#039;&#039;&#039; Highlight the main ideas or data you want your audience to remember.&lt;br /&gt;
* &#039;&#039;&#039;Consistent Layout:&#039;&#039;&#039; Use a consistent layout for each slide, including fonts, colors, and background. This helps maintain a professional look.&lt;br /&gt;
* &#039;&#039;&#039;High-Quality Images:&#039;&#039;&#039; Use high-resolution images and graphics to illustrate your points. Avoid using clip art.&lt;br /&gt;
* &#039;&#039;&#039;Readable Fonts:&#039;&#039;&#039; Use easy-to-read fonts (e.g., Arial, Calibri) and ensure font sizes are large enough to be seen from a distance (24 pt or larger for main text).&lt;br /&gt;
* &#039;&#039;&#039;Contrast:&#039;&#039;&#039; Ensure there is high contrast between text and background (e.g., dark text on a light background).&lt;br /&gt;
* &#039;&#039;&#039;Citation:&#039;&#039;&#039; Cite a publication to support the information presented, in proper citation format.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Outline for Symposium presentation&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
1. Introduction&lt;br /&gt;
&lt;br /&gt;
2. Project Descriptions&lt;br /&gt;
&lt;br /&gt;
3. Objectives and Goals&lt;br /&gt;
&lt;br /&gt;
4. Methods, Results, Achievements and Contributions&lt;br /&gt;
&lt;br /&gt;
5. Future Plans&lt;br /&gt;
&lt;br /&gt;
6. Skills and Knowledge Gained&lt;br /&gt;
&lt;br /&gt;
7. Acknowledgments&lt;br /&gt;
&lt;br /&gt;
8. Q&amp;amp;A Session&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Detailed Outline&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1. Introduction:&#039;&#039;&#039;  (1 slide)&lt;br /&gt;
&lt;br /&gt;
  - Briefly introduce yourself.  &lt;br /&gt;
&lt;br /&gt;
  - Add your picture and name to the introduction slide. If it is a group project, add the group picture.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2. Project Descriptions:&#039;&#039;&#039;  (1 slide)&lt;br /&gt;
&lt;br /&gt;
  - Provide context and background information about the project.  &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. Project Objectives and Goals:&#039;&#039;&#039;  (1 slide)&lt;br /&gt;
&lt;br /&gt;
  - Describe the main objectives of the project or initiative.  &lt;br /&gt;
&lt;br /&gt;
  - Discuss any additional goals or desired outcomes.  &lt;br /&gt;
&lt;br /&gt;
  - Explain why these objectives and goals are important.  &lt;br /&gt;
&lt;br /&gt;
  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4. Methods, Results, Achievements and Contributions:&#039;&#039;&#039;  &lt;br /&gt;
&lt;br /&gt;
  -  Highlight the methods/tools used in the project.  &lt;br /&gt;
&lt;br /&gt;
  - Highlight the key results and outcomes of the project.  &lt;br /&gt;
&lt;br /&gt;
  - Discuss the most significant achievements and milestones reached.  &lt;br /&gt;
&lt;br /&gt;
  - Explain how each team member contributed to the project (for group projects).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. Future Plans&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
  - Next steps or future plans for the project&lt;br /&gt;
&lt;br /&gt;
  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;6. Skills and Knowledge Gained:&#039;&#039;&#039;  (1 slide)&lt;br /&gt;
&lt;br /&gt;
  -   Detail any technical skills acquired or improved.  &lt;br /&gt;
&lt;br /&gt;
  - Highlight any soft skills, such as communication or teamwork, that were developed.  &lt;br /&gt;
&lt;br /&gt;
  - Discuss new knowledge gained in specific areas or subjects.  &lt;br /&gt;
&lt;br /&gt;
  -  Share any personal reflections on the experience and what was learned.  &lt;br /&gt;
&lt;br /&gt;
  - Discuss any challenges or obstacles encountered and how they were overcome.  &lt;br /&gt;
&lt;br /&gt;
  - Provide key insights or lessons learned from the project.  &lt;br /&gt;
&lt;br /&gt;
  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;7. Acknowledgments:&#039;&#039;&#039;  (1 slide)&lt;br /&gt;
&lt;br /&gt;
  - Acknowledge the contributions of team members and collaborators.  &lt;br /&gt;
&lt;br /&gt;
  - Recognize the guidance and support of mentors and advisors.&lt;br /&gt;
&lt;br /&gt;
  - Acknowledge the project funding (e.g., CFDE).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8. Q&amp;amp;A Session:&#039;&#039;&#039;  &lt;br /&gt;
&lt;br /&gt;
  - Invite the audience to ask questions and engage in discussion.  &lt;br /&gt;
&lt;br /&gt;
  - Provide clear and thoughtful responses to audience questions.  &lt;br /&gt;
&lt;br /&gt;
  - Offer closing remarks and thank the audience for their participation. &lt;br /&gt;
&lt;br /&gt;
Note: If you have limited presentation time, you can merge a few topics into one.&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Volunteership_Fall_2025&amp;diff=938</id>
		<title>Volunteership Fall 2025</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Volunteership_Fall_2025&amp;diff=938"/>
		<updated>2025-08-20T18:18:22Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: /* 4. PredictMod Machine Learning Project Ideas */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 2025 Volunteer Program Details ==&lt;br /&gt;
&lt;br /&gt;
=== Dates ===&lt;br /&gt;
&#039;&#039;&#039;Application Deadline&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
August 22, 2025, Noon (email your updated resume and a ranked list of project preferences). Acceptance letters/emails will be sent to candidates no later than the day after the kick-off meeting.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Volunteer Zoom Kick-Off Meeting&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
August 25, 2025 | 4:00 to 5:00 PM&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Program Dates: September 1st, 2025 – November 30th, 2025&#039;&#039;&#039; (13 weeks)&lt;br /&gt;
&lt;br /&gt;
Remote | Hybrid for GW employees and students (Ross Hall 5th floor)&lt;br /&gt;
&lt;br /&gt;
[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_2025 Summer 2025 Volunteership] (Closed)&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Volunteer Expectations ===&lt;br /&gt;
&lt;br /&gt;
# Minimum commitment of 10 hours per week.&lt;br /&gt;
# Progress updates via Slack at least 3 days per week (scrum).&lt;br /&gt;
# Regular Zoom meetings with the assigned project point of contact.&lt;br /&gt;
# Attend some lectures or seminars remotely (max 4-5).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Important:&#039;&#039;&#039; If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.&#039;&#039;&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Potential Projects ===&lt;br /&gt;
We are excited to continue our bioinformatics volunteership program in Fall 2025. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email your resume and a ranked list of the projects that interest you most to mazumder_lab@gwu.edu. You can also indicate any specific areas you would like to focus on.&lt;br /&gt;
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.&lt;br /&gt;
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.&lt;br /&gt;
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.&lt;br /&gt;
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs to build a corpus for training an LLM that recommends publications with datasets useful for intervention outcome prediction models.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.&#039;&#039;&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
==== 1. BiomarkerKB Biocuration Project Ideas ====&lt;br /&gt;
POC: Daniall Masood, Maria Kim&lt;br /&gt;
&lt;br /&gt;
# Curate biomarkers for a specific disease&lt;br /&gt;
## The student would be doing manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.&lt;br /&gt;
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.&lt;br /&gt;
# Top 50 biomarkers&lt;br /&gt;
## Curate the top 50 biomarkers for biomarkerkb.org.&lt;br /&gt;
## Define what constitutes a top 50 biomarker.&lt;br /&gt;
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.&lt;br /&gt;
# Biocuration of biomarkers from NLP/LLM work&lt;br /&gt;
## Use the biomarkers collected from NLP work.&lt;br /&gt;
## Curate the biomarkers. The data from the NLP work was not provided in the biomarker data model.&lt;br /&gt;
## While curating the biomarkers, check if data collected from NLP is correct.&lt;br /&gt;
## After completion, the student can start using curated data to work on the NLP/LLM method.&lt;br /&gt;
# Curate biomarkers for a treatment&lt;br /&gt;
## See #1 above.&lt;br /&gt;
# Continue working on LLM methods started by volunteers over the summer.&lt;br /&gt;
## The data is available as well as some preliminary research and work done by previous volunteers in this area.&lt;br /&gt;
&lt;br /&gt;
If the student has any other ideas, diseases, treatments, or methods they want to focus on, please reach out to daniallmasood@gwu.edu to discuss the idea and check if it will be feasible as a project for the Fall.&lt;br /&gt;
&lt;br /&gt;
==== 2. GlyGen Biocuration Project Ideas ====&lt;br /&gt;
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner&lt;br /&gt;
&lt;br /&gt;
Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the established standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, &amp;quot;human,&amp;quot; &amp;quot;man,&amp;quot; and &amp;quot;h. sapiens&amp;quot; all map to the scientific species name &amp;quot;Homo sapiens.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.&lt;br /&gt;
&lt;br /&gt;
The project involves:&lt;br /&gt;
&lt;br /&gt;
# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.&lt;br /&gt;
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.&lt;br /&gt;
# Finding papers based on titles and author lists that may contain spelling errors.&lt;br /&gt;
# Interacting and discussing with other curators in case terms are mapped differently.&lt;br /&gt;
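As a minimal sketch of the term-mapping step, a small synonym table can normalize free-text species terms like those in the example above (the table and function names are hypothetical, not GlyGen&#039;s actual tooling):&lt;br /&gt;

```python
# Hypothetical synonym table for normalizing free-text species terms.
# Real curation maps terms to standard dictionaries/ontologies instead.
SYNONYMS = {
    "human": "Homo sapiens",
    "man": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
}

def normalize_species(term: str) -> str:
    """Return the standard species name for a free-text term, if known.

    Unmapped terms are returned unchanged so a curator can review them.
    """
    return SYNONYMS.get(term.strip().lower(), term)
```

Terms such as &amp;quot;Man&amp;quot; or &amp;quot;H. Sapiens&amp;quot; then resolve to &amp;quot;Homo sapiens&amp;quot;, while anything unrecognized is left for manual review.&lt;br /&gt;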
&lt;br /&gt;
If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.&lt;br /&gt;
&lt;br /&gt;
==== 3. GlyGen Publication Analysis Project Ideas ====&lt;br /&gt;
POC: Rene Ranzinger and Urnisha Bhuiyan&lt;br /&gt;
&lt;br /&gt;
One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.&lt;br /&gt;
&lt;br /&gt;
The project involves:&lt;br /&gt;
&lt;br /&gt;
# Using the PubMed web API to filter publications based on keywords.&lt;br /&gt;
# Analyzing paper abstracts to identify research institutions and groups that form the community.&lt;br /&gt;
# Filtering the community list to exclude unrelated co-authors.&lt;br /&gt;
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.&lt;br /&gt;
&lt;br /&gt;
A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.&lt;br /&gt;
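The keyword-filtering step can be sketched against the NCBI E-utilities esearch endpoint (a minimal illustration, not the project&#039;s actual analysis code):&lt;br /&gt;

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_query(keyword: str, retmax: int = 20) -> str:
    """Build an esearch URL that filters PubMed by a keyword."""
    params = {"db": "pubmed", "term": keyword, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"

def search_pmids(keyword: str, retmax: int = 20) -> list:
    """Return PMIDs matching the keyword (requires network access)."""
    with urlopen(build_query(keyword, retmax)) as resp:
        data = json.load(resp)
    return data["esearchresult"]["idlist"]
```

Abstracts for the returned PMIDs could then be retrieved with the companion efetch endpoint and scanned for institution and group names.&lt;br /&gt;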
&lt;br /&gt;
==== 4. PredictMod Machine Learning Project Ideas ====&lt;br /&gt;
POC: Lori Krammer, Tianyi Wang, Pat McNeely (optional)&lt;br /&gt;
&lt;br /&gt;
Identifying relevant and useful publicly available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet and annotated PDFs for PubMed articles relevant to prostate, lung, breast, and liver cancer, focusing on indicators such as condition, intervention, and response.&lt;br /&gt;
&lt;br /&gt;
PMID curation involves:&lt;br /&gt;
&lt;br /&gt;
# Identifying potentially relevant PMIDs that may have publicly available datasets for training intervention outcome prediction models.&lt;br /&gt;
# Curating indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.&lt;br /&gt;
# Reviewing peer curations and resolving annotation conflicts.&lt;br /&gt;
&lt;br /&gt;
Interested individuals should reach out to lorikrammer@gwu.edu.&lt;br /&gt;
&lt;br /&gt;
==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====&lt;br /&gt;
POC: Christie Woodside, Jonathon Keeney&lt;br /&gt;
&lt;br /&gt;
# Update data tables for more efficient computations&lt;br /&gt;
## The student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires high attention to detail.&lt;br /&gt;
## Additional Work (requires a Python/shell coding background): The student would run the scripts that prepare and format the data tables pushed to data.argosdb.org. Coding knowledge is needed to handle errors or bugs in the code. This is ongoing work as computations are performed.&lt;br /&gt;
# Curate and report on current pathogens to upload to ARGOS&lt;br /&gt;
## The student would manually curate circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports on what was found.&lt;br /&gt;
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.&lt;br /&gt;
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.&lt;br /&gt;
# QC Analysis using HIVE&lt;br /&gt;
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.&lt;br /&gt;
## The results will be added to our ARGOS database.&lt;br /&gt;
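For the table-preparation work in task 1, a formatting script might validate rows against a required column list before release, roughly as below (the column names are made up for illustration; the actual ARGOS pipeline scripts differ):&lt;br /&gt;

```python
import csv
import io

def format_table(rows, columns):
    """Write rows as a TSV with a fixed column order.

    Rows missing a required field raise an error so that gaps are
    caught before anything is pushed to the release server.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, delimiter="\t",
                            extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        missing = [c for c in columns if not row.get(c)]
        if missing:
            raise ValueError(f"row missing required fields: {missing}")
        writer.writerow(row)
    return buf.getvalue()
```

Failing fast on incomplete rows keeps the manually curated sheets and the published tables consistent.&lt;br /&gt;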
&lt;br /&gt;
If the student has any other ideas or methods they want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check if it will be feasible as a project for the Fall.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Requirements for Completion ===&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; The following are mandatory. Failure to complete any will result in an incomplete volunteer record.&lt;br /&gt;
&lt;br /&gt;
==== Documentation ====&lt;br /&gt;
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.&lt;br /&gt;
&lt;br /&gt;
==== Written Report ====&lt;br /&gt;
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.&lt;br /&gt;
&lt;br /&gt;
==== Presentation &amp;amp; Slide Submission ====&lt;br /&gt;
Present your work during the last week of the 13-week period.&lt;br /&gt;
&lt;br /&gt;
Slides must be submitted to the POCs.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Completion Certificate ===&lt;br /&gt;
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Contact ===&lt;br /&gt;
mazumder_lab@gwu.edu.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Volunteers (TBD) ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+&lt;br /&gt;
!Name&lt;br /&gt;
!Project Assigned&lt;br /&gt;
!POC Assigned&lt;br /&gt;
!Projects Interested&lt;br /&gt;
|-&lt;br /&gt;
|Diya Kamalabharathy*&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|PredictMod; Glyco web development&lt;br /&gt;
|-&lt;br /&gt;
|Anika Sikka&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|GlyGen&lt;br /&gt;
|-&lt;br /&gt;
|Akale Kinfe&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|Nahom Abel&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|Harivinay P. Gujjula&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|GlyGen&lt;br /&gt;
|-&lt;br /&gt;
|Sparsh Gupta&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|Mathias Belay&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|Isil Erbasol Serbes&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|PredictMod, BiomarkerKB, ARGOS&lt;br /&gt;
|-&lt;br /&gt;
|Ramtin Mashhoon&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|PredictMod&lt;br /&gt;
|-&lt;br /&gt;
|Anagha Kalle&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|PredictMod&lt;br /&gt;
|-&lt;br /&gt;
|Vishal Muthusekaran&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|Adonay Awet&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|Miao Wang*&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|ARGOS&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;nowiki&amp;gt;*&amp;lt;/nowiki&amp;gt;Returning volunteer.&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Volunteership_Fall_2025&amp;diff=937</id>
		<title>Volunteership Fall 2025</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Volunteership_Fall_2025&amp;diff=937"/>
		<updated>2025-08-20T18:15:48Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: /* Volunteers (TBD) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 2025 Volunteer Program Details ==&lt;br /&gt;
&lt;br /&gt;
=== Dates ===&lt;br /&gt;
&#039;&#039;&#039;Application Deadline&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
August 22, 2025, Noon (email your updated resume and a ranked list of project preferences). Acceptance letters/emails will be sent to candidates no later than the day after the kick-off meeting.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Volunteer Zoom Kick-Off Meeting&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
August 25, 2025 | 4:00 to 5:00 PM&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Program Dates: September 1st, 2025 – November 30th, 2025&#039;&#039;&#039; (13 weeks)&lt;br /&gt;
&lt;br /&gt;
Remote | Hybrid for GW employees and students (Ross Hall 5th floor)&lt;br /&gt;
&lt;br /&gt;
[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_2025 Summer 2025 Volunteership] (Closed)&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Volunteer Expectations ===&lt;br /&gt;
&lt;br /&gt;
# Minimum commitment of 10 hours per week.&lt;br /&gt;
# Progress updates via Slack at least 3 days per week (scrum).&lt;br /&gt;
# Regular Zoom meetings with the assigned project point of contact.&lt;br /&gt;
# Attend some lectures or seminars remotely (max 4-5).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Important:&#039;&#039;&#039; If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.&#039;&#039;&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Potential Projects ===&lt;br /&gt;
We are excited to continue our bioinformatics volunteership program in Fall 2025. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email your resume and a ranked list of the projects that interest you most to mazumder_lab@gwu.edu. You can also indicate any specific areas you would like to focus on.&lt;br /&gt;
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.&lt;br /&gt;
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.&lt;br /&gt;
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.&lt;br /&gt;
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs to build a corpus for training an LLM that recommends publications with datasets useful for intervention outcome prediction models.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.&#039;&#039;&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
==== 1. BiomarkerKB Biocuration Project Ideas ====&lt;br /&gt;
POC: Daniall Masood, Maria Kim&lt;br /&gt;
&lt;br /&gt;
# Curate biomarkers for a specific disease&lt;br /&gt;
## The student would be doing manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.&lt;br /&gt;
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.&lt;br /&gt;
# Top 50 biomarkers&lt;br /&gt;
## Curate the top 50 biomarkers for biomarkerkb.org.&lt;br /&gt;
## Define what constitutes a top 50 biomarker.&lt;br /&gt;
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.&lt;br /&gt;
# Biocuration of biomarkers from NLP/LLM work&lt;br /&gt;
## Use the biomarkers collected from NLP work.&lt;br /&gt;
## Curate the biomarkers. The data from the NLP work was not provided in the biomarker data model.&lt;br /&gt;
## While curating the biomarkers, check if data collected from NLP is correct.&lt;br /&gt;
## After completion, the student can start using curated data to work on the NLP/LLM method.&lt;br /&gt;
# Curate biomarkers for a treatment&lt;br /&gt;
## See #1 above.&lt;br /&gt;
# Continue working on LLM methods started by volunteers over the summer.&lt;br /&gt;
## The data is available as well as some preliminary research and work done by previous volunteers in this area.&lt;br /&gt;
&lt;br /&gt;
If the student has any other ideas, diseases, treatments, or methods they want to focus on, please reach out to daniallmasood@gwu.edu to discuss the idea and check if it will be feasible as a project for the Fall.&lt;br /&gt;
&lt;br /&gt;
==== 2. GlyGen Biocuration Project Ideas ====&lt;br /&gt;
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner&lt;br /&gt;
&lt;br /&gt;
Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the established standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, &amp;quot;human,&amp;quot; &amp;quot;man,&amp;quot; and &amp;quot;h. sapiens&amp;quot; all map to the scientific species name &amp;quot;Homo sapiens.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.&lt;br /&gt;
&lt;br /&gt;
The project involves:&lt;br /&gt;
&lt;br /&gt;
# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.&lt;br /&gt;
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.&lt;br /&gt;
# Finding papers based on titles and author lists that may contain spelling errors.&lt;br /&gt;
# Interacting and discussing with other curators in case terms are mapped differently.&lt;br /&gt;
&lt;br /&gt;
If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.&lt;br /&gt;
&lt;br /&gt;
==== 3. GlyGen Publication Analysis Project Ideas ====&lt;br /&gt;
POC: Rene Ranzinger and Urnisha Bhuiyan&lt;br /&gt;
&lt;br /&gt;
One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.&lt;br /&gt;
&lt;br /&gt;
The project involves:&lt;br /&gt;
&lt;br /&gt;
# Using the PubMed web API to filter publications based on keywords.&lt;br /&gt;
# Analyzing paper abstracts to identify research institutions and groups that form the community.&lt;br /&gt;
# Filtering the community list to exclude unrelated co-authors.&lt;br /&gt;
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.&lt;br /&gt;
&lt;br /&gt;
A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.&lt;br /&gt;
&lt;br /&gt;
==== 4. PredictMod Machine Learning Project Ideas ====&lt;br /&gt;
POC: Lori Krammer, Tianyi Wang, Pat McNeely (optional)&lt;br /&gt;
&lt;br /&gt;
Identifying relevant and useful publicly available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet and annotated PDFs for PubMed articles relevant to prostate, lung, breast, and liver cancer, focusing on indicators such as condition, intervention, and response.&lt;br /&gt;
&lt;br /&gt;
PMID curation involves:&lt;br /&gt;
&lt;br /&gt;
# Identifying potentially relevant PMIDs that may have publicly available datasets for training intervention outcome prediction models.&lt;br /&gt;
# Curating indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.&lt;br /&gt;
# Reviewing peer curations and resolving annotation conflicts.&lt;br /&gt;
&lt;br /&gt;
Interested individuals should reach out to lorikrammer@gwu.edu.&lt;br /&gt;
&lt;br /&gt;
==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====&lt;br /&gt;
POC: Christie Woodside, Jonathon Keeney&lt;br /&gt;
&lt;br /&gt;
# Update data tables for more efficient computations&lt;br /&gt;
## The student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires high attention to detail.&lt;br /&gt;
## Additional Work (requires a Python/shell coding background): The student would run the scripts that prepare and format the data tables pushed to data.argosdb.org. Coding knowledge is needed to handle errors or bugs in the code. This is ongoing work as computations are performed.&lt;br /&gt;
# Curate and report on current pathogens to upload to ARGOS&lt;br /&gt;
## The student would manually curate circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports on what was found.&lt;br /&gt;
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.&lt;br /&gt;
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.&lt;br /&gt;
# QC Analysis using HIVE&lt;br /&gt;
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.&lt;br /&gt;
## The results will be added to our ARGOS database.&lt;br /&gt;
&lt;br /&gt;
If the student has any other ideas or methods they want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss the idea and check if it will be feasible as a project for the Fall.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Requirements for Completion ===&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; The following are mandatory. Failure to complete any will result in an incomplete volunteer record.&lt;br /&gt;
&lt;br /&gt;
==== Documentation ====&lt;br /&gt;
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.&lt;br /&gt;
&lt;br /&gt;
==== Written Report ====&lt;br /&gt;
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.&lt;br /&gt;
&lt;br /&gt;
==== Presentation &amp;amp; Slide Submission ====&lt;br /&gt;
Present your work during the last week of the 13-week period.&lt;br /&gt;
&lt;br /&gt;
Slides must be submitted to the POCs.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Completion Certificate ===&lt;br /&gt;
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Contact ===&lt;br /&gt;
mazumder_lab@gwu.edu.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Volunteers (TBD) ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+&lt;br /&gt;
!Name&lt;br /&gt;
!Project Assigned&lt;br /&gt;
!POC Assigned&lt;br /&gt;
!Projects Interested&lt;br /&gt;
|-&lt;br /&gt;
|Diya Kamalabharathy*&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|PredictMod; Glyco web development&lt;br /&gt;
|-&lt;br /&gt;
|Anika Sikka&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|GlyGen&lt;br /&gt;
|-&lt;br /&gt;
|Akale Kinfe&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|Nahom Abel&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|Harivinay P. Gujjula&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|GlyGen&lt;br /&gt;
|-&lt;br /&gt;
|Sparsh Gupta&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|Mathias Belay&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|Isil Erbasol Serbes&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|PredictMod, BiomarkerKB, ARGOS&lt;br /&gt;
|-&lt;br /&gt;
|Ramtin Mashhoon&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|PredictMod&lt;br /&gt;
|-&lt;br /&gt;
|Anagha Kalle&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|PredictMod&lt;br /&gt;
|-&lt;br /&gt;
|Vishal Muthusekaran&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|Adonay Awet&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|-&lt;br /&gt;
|Miao Wang*&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|ARGOS&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;nowiki&amp;gt;*&amp;lt;/nowiki&amp;gt;Returning volunteer.&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Volunteership_Fall_2025&amp;diff=910</id>
		<title>Volunteership Fall 2025</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Volunteership_Fall_2025&amp;diff=910"/>
		<updated>2025-08-06T14:36:10Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: /* Dates */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 2025 Volunteer Program Details ==&lt;br /&gt;
&lt;br /&gt;
=== Dates ===&lt;br /&gt;
&#039;&#039;&#039;Volunteer Zoom Kick-Off Meeting&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
August 25, 2025 | 4:00 to 5:00 PM&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Program Dates: September 1st, 2025 – November 30th, 2025&#039;&#039;&#039; (13 weeks)&lt;br /&gt;
&lt;br /&gt;
Remote&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Volunteer Expectations ===&lt;br /&gt;
&lt;br /&gt;
# Minimum commitment of 10 hours per week.&lt;br /&gt;
# Progress updates via Slack at least 3 days per week (scrum).&lt;br /&gt;
# Regular Zoom meetings with the assigned project point of contact.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;Important:&#039;&#039;&#039; If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.&#039;&#039;&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Potential Projects ===&lt;br /&gt;
&lt;br /&gt;
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation involving reading papers and collecting biomarkers.&lt;br /&gt;
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.&lt;br /&gt;
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.&lt;br /&gt;
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs for datasets used to train an LLM to recommend publications for intervention outcome prediction.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.&#039;&#039;&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
==== 1. BiomarkerKB Biocuration Project Ideas ====&lt;br /&gt;
POC: Daniall Masood, Maria Kim&lt;br /&gt;
&lt;br /&gt;
# Curate biomarkers for a specific disease (e.g., Alzheimer&#039;s disease)&lt;br /&gt;
## The student would perform manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.&lt;br /&gt;
## The next 4 weeks can be dedicated to developing an LLM-based or other automated process to extract biomarker details, using the data collected in the first 4 weeks as training/example data.&lt;br /&gt;
# Top 50 biomarkers&lt;br /&gt;
## Curate the top 50 biomarkers for biomarkerkb.org.&lt;br /&gt;
## Define what constitutes a top 50 biomarker.&lt;br /&gt;
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.&lt;br /&gt;
# Biocuration of biomarkers from NLP/LLM work&lt;br /&gt;
## Use the biomarkers collected from NLP work.&lt;br /&gt;
## Curate these biomarkers; the data were not originally provided in the biomarker data model format.&lt;br /&gt;
## While curating the biomarkers, verify that the data collected from NLP are correct.&lt;br /&gt;
## After completion, the student can start using curated data to work on the NLP/LLM method.&lt;br /&gt;
# Curate biomarkers for a treatment&lt;br /&gt;
## See #1 above.&lt;br /&gt;
&lt;br /&gt;
If the student has any other ideas, diseases, treatments, or methods they want to focus on, please reach out to daniallmasood@gwu.edu to discuss the idea and check whether it would be feasible as a project for the program.&lt;br /&gt;
&lt;br /&gt;
==== 2. GlyGen Biocuration Project Ideas ====&lt;br /&gt;
POC: Rene Ranzinger and Urnisha Bhuiyan&lt;br /&gt;
&lt;br /&gt;
Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because much of the valuable metadata (e.g., species, tissue, disease, and cell line annotations) consists of free-text terms that do not align with the established standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, &amp;quot;human,&amp;quot; &amp;quot;man,&amp;quot; and &amp;quot;h. sapiens&amp;quot; all map to the scientific species name &amp;quot;Homo sapiens.&amp;quot;&lt;br /&gt;
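The synonym mapping described above can be sketched as a simple lookup-based normalization step. This is a minimal illustration only; the synonym table below is hypothetical example data, not the actual GlyGen mapping.

```python
# Illustrative sketch of free-text species normalization via a synonym table.
# The table below is hypothetical example data, not the actual GlyGen mapping.
SPECIES_SYNONYMS = {
    "human": "Homo sapiens",
    "man": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
}

def normalize_species(term: str):
    """Map a free-text species term to a scientific name, or None if unknown."""
    return SPECIES_SYNONYMS.get(term.strip().lower())
```

Terms that come back as None (misspellings, rare abbreviations, unseen synonyms) are exactly the ones that fall through to manual curation.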
&lt;br /&gt;
The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.&lt;br /&gt;
&lt;br /&gt;
The project involves:&lt;br /&gt;
&lt;br /&gt;
# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.&lt;br /&gt;
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.&lt;br /&gt;
# Finding papers based on titles and author lists that may contain spelling errors.&lt;br /&gt;
# Interacting with other curators to discuss cases where terms are mapped differently.&lt;br /&gt;
&lt;br /&gt;
If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.&lt;br /&gt;
&lt;br /&gt;
==== 3. GlyGen Publication Analysis Project Ideas ====&lt;br /&gt;
&lt;br /&gt;
POC: Rene Ranzinger and Urnisha Bhuiyan&lt;br /&gt;
&lt;br /&gt;
One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.&lt;br /&gt;
&lt;br /&gt;
The project involves:&lt;br /&gt;
&lt;br /&gt;
# Using the PubMed web API to filter publications based on keywords.&lt;br /&gt;
# Analyzing paper abstracts to identify research institutions and groups that form the community.&lt;br /&gt;
# Filtering the community list to exclude unrelated co-authors.&lt;br /&gt;
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.&lt;br /&gt;
&lt;br /&gt;
A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.&lt;br /&gt;
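The first step of the project, filtering publications by keyword through the PubMed web API, can be sketched with the NCBI E-utilities esearch endpoint. This is a rough illustration under stated assumptions: the retmax value and example keywords are arbitrary, and real use should respect NCBI rate limits.

```python
# Sketch: keyword filtering of PubMed publications through the NCBI
# E-utilities esearch endpoint (db=pubmed). Network access is required
# for fetch_pmids(); build_query() only constructs the request URL.
import json
import urllib.parse
import urllib.request

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_query(keywords, retmax=100):
    """Return an esearch URL that ANDs the given keywords together."""
    params = {
        "db": "pubmed",
        "term": " AND ".join(keywords),
        "retmax": retmax,
        "retmode": "json",
    }
    return EUTILS_ESEARCH + "?" + urllib.parse.urlencode(params)

def fetch_pmids(keywords, retmax=100):
    """Return matching PMIDs as a list of strings."""
    with urllib.request.urlopen(build_query(keywords, retmax)) as resp:
        data = json.load(resp)
    return data["esearchresult"]["idlist"]
```

The returned PMIDs would then feed the abstract analysis and co-author filtering steps listed above.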
&lt;br /&gt;
==== 4. PredictMod Machine Learning Project Ideas ====&lt;br /&gt;
POC: Lori Krammer, Pat McNeeley&lt;br /&gt;
&lt;br /&gt;
PMID Curation:&lt;br /&gt;
&lt;br /&gt;
# Identify potentially relevant PMIDs that may have publicly available datasets for training intervention outcome prediction models.&lt;br /&gt;
# Curate indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.&lt;br /&gt;
# Review peer curations and resolve annotation conflicts.&lt;br /&gt;
&lt;br /&gt;
Interested individuals should reach out to lorikrammer@gwu.edu.&lt;br /&gt;
&lt;br /&gt;
==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====&lt;br /&gt;
&lt;br /&gt;
POC: Christie Woodside, Jonathon Keeney&lt;br /&gt;
&lt;br /&gt;
# Update data tables for more efficient computations&lt;br /&gt;
## The student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires high attention to detail.&lt;br /&gt;
## Additional work (requires a Python/shell coding background): The student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed to handle errors, bugs, or other issues in the scripts. This is ongoing work as computations are performed.&lt;br /&gt;
# Curate and report on current pathogens to upload to ARGOS&lt;br /&gt;
## The student would manually curate circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports on findings.&lt;br /&gt;
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.&lt;br /&gt;
## Provide documentation on why the pathogens were selected, why they are important, and how the data were collected.&lt;br /&gt;
# QC Analysis using HIVE&lt;br /&gt;
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.&lt;br /&gt;
## The results will be added to our ARGOS database.&lt;br /&gt;
&lt;br /&gt;
If the student has any other ideas or methods they want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss the idea and check whether it would be feasible as a project for the program.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Requirements for Completion ===&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; The following are mandatory. Failure to complete any of them will result in an incomplete volunteer record.&lt;br /&gt;
&lt;br /&gt;
==== Documentation ====&lt;br /&gt;
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.&lt;br /&gt;
&lt;br /&gt;
==== Written Report ====&lt;br /&gt;
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.&lt;br /&gt;
&lt;br /&gt;
==== Presentation &amp;amp; Slide Submission ====&lt;br /&gt;
Present your work during the last week of the 13-week period.&lt;br /&gt;
&lt;br /&gt;
Slides must be submitted to the POCs.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Completion Certificate ===&lt;br /&gt;
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Contact ===&lt;br /&gt;
mazumder_lab@gwu.edu.&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Volunteers (TBD) ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+&lt;br /&gt;
!Name&lt;br /&gt;
!Project&lt;br /&gt;
!Projects Interested&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;Anika Sikka (tentative)&#039;&#039;&lt;br /&gt;
|&lt;br /&gt;
|GlyGen&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;Akale Kinfe&#039;&#039;&lt;br /&gt;
|&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Symposium_2025&amp;diff=880</id>
		<title>Symposium 2025</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Symposium_2025&amp;diff=880"/>
		<updated>2025-07-17T19:15:40Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: /* Agenda */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The HIVE Lab summer symposium is scheduled for Thursday, July 31, 2025. It is an exciting time for the lab volunteers and interns to present their findings on the projects they worked on for 8 weeks.&lt;br /&gt;
&lt;br /&gt;
[[File:DC.png|center|frame]]&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Program and Information&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Symposium Venue&#039;&#039;&#039; ===&lt;br /&gt;
The HIVE Lab symposium will be held in person at The George Washington University, Washington, DC, with an option to join virtually.&lt;br /&gt;
&lt;br /&gt;
In Person - Ross 647, Ross Hall, School of Health and Medical Sciences, The George Washington University, Washington DC ([https://maps.app.goo.gl/PHQmZacA4hWDvTCh6 MAP])&lt;br /&gt;
&lt;br /&gt;
Virtual - Zoom&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Agenda&#039;&#039;&#039; ==&lt;br /&gt;
All times are in Eastern Time (ET).&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Time (ET)&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;Project&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;Title&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;Presenter&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;10:00am&#039;&#039;&#039;&lt;br /&gt;
| colspan=&amp;quot;2&amp;quot; |                                                                                            &#039;&#039;&#039;Welcome and Introduction&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;Michael Tiemeyer (10 min)&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
| colspan=&amp;quot;4&amp;quot; |                                                                                                                         &#039;&#039;Group 1 Moderator : Nathan Edwards&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|10:10am &lt;br /&gt;
|CFDE&lt;br /&gt;
|Integrating Biocuration and Data Standardization to Generate Machine Learning-Ready Glycan Datasets&lt;br /&gt;
|Ana Jaramillo and Yuxin Zou (20 min)&lt;br /&gt;
|-&lt;br /&gt;
|10:30am&lt;br /&gt;
|CFDE&lt;br /&gt;
|&lt;br /&gt;
|Campbell Ross (15 min)&lt;br /&gt;
|-&lt;br /&gt;
|10:45am&lt;br /&gt;
|CFDE&lt;br /&gt;
|A Graph-Based AI Workflow for Mining Glycan Biomarkers and Related Annotations from Publications&lt;br /&gt;
|Cyrus Chun Hong Au Yeung (15 min)&lt;br /&gt;
|-&lt;br /&gt;
|11:00am&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|&lt;br /&gt;
|Sohana Bahl, Isaac Kim, Sparsh Gupta (15 min)&lt;br /&gt;
|-&lt;br /&gt;
|11:15am&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|&lt;br /&gt;
|Nathan Ressom, Ana Vohralikova, Mathias Belay (15 min)&lt;br /&gt;
|-&lt;br /&gt;
|11:30am&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|&lt;br /&gt;
|John McCaffery, Alma Ogunsina, Akale Kinfe (15 min)&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;11:45am&#039;&#039;&#039;&lt;br /&gt;
| colspan=&amp;quot;2&amp;quot; |&#039;&#039;&#039;Open Q and A&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;All (30 min)&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|12:30pm&lt;br /&gt;
| colspan=&amp;quot;3&amp;quot; |                                                                                                          &#039;&#039;&#039;LUNCH (90 mins)&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
| colspan=&amp;quot;4&amp;quot; |                                                                                                                         &#039;&#039;Group 1 Moderator : Nathan Edwards&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|2:00pm&lt;br /&gt;
|Predictmod AI-READI&lt;br /&gt;
|Robust Classification of Glycemic Health States from Continuous Glucose &lt;br /&gt;
|Nikhil Arethiya (15 min)&lt;br /&gt;
|-&lt;br /&gt;
|2:15pm&lt;br /&gt;
|Predictmod Curation&lt;br /&gt;
|PredictMod: PubMed Curation for Training an LLM for Recommendation&lt;br /&gt;
|Grace Chong, Aaron Ressom, Diya Kamalabharathy (15 min)&lt;br /&gt;
|-&lt;br /&gt;
|2:30pm&lt;br /&gt;
|Argos&lt;br /&gt;
|Curation of Emerging Pathogen Genomes for FDA-ARGOS Database Expansion&lt;br /&gt;
|Miao Wang (15 min)&lt;br /&gt;
|-&lt;br /&gt;
|2:45pm&lt;br /&gt;
|GlyGen&lt;br /&gt;
|GlyGen Biocuration Project&lt;br /&gt;
|Aise Arpinar, Haravinay P. Gujjulla, Nahom Abel (20 min)&lt;br /&gt;
|-&lt;br /&gt;
|3:05pm&lt;br /&gt;
|GlycoSiteMiner&lt;br /&gt;
|&lt;br /&gt;
|(15 min)&lt;br /&gt;
|-&lt;br /&gt;
|3:20pm&lt;br /&gt;
|Glycobiology Web Development&lt;br /&gt;
|A Resource Drill Down and Visualization for the Glyspace Alliance&lt;br /&gt;
|Diya Kamalabharathy (5 min)&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;3:25pm&#039;&#039;&#039;&lt;br /&gt;
| colspan=&amp;quot;2&amp;quot; |&#039;&#039;&#039;Open Q and A&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;All (20 min)&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|3:45pm&lt;br /&gt;
| colspan=&amp;quot;2&amp;quot; |                                                                                       &#039;&#039;&#039;Closing Remarks&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;Raja Mazumder&#039;&#039;&#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Project Description&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
=== GlyGen Project ===&lt;br /&gt;
The GlyGen Biocuration project focuses on integrating legacy, yet valuable, data from the CarbBank and CFG databases into the GlyGen infrastructure. A key challenge is mapping metadata, such as species names and publication references, to standardized dictionaries and ontologies. While most entries have been automatically matched using custom scripts, remaining inconsistencies, including outdated, misspelled, or abbreviated terms, require manual curation using resources such as Google, PubMed, and domain-specific dictionaries and ontologies.&lt;br /&gt;
&lt;br /&gt;
=== BiomarkerKB Biocuration Project ===&lt;br /&gt;
The Biomarker Biocuration project focuses on curating biomarkers from abstracts and publications into the BiomarkerKB data model. A key challenge in curating biomarkers is the vast amount of data spread across various publications. Manual curation requires reading, inferring, and understanding key elements of biomarker data and mapping them to the defined biomarker data model. LLM methodologies will help immensely in recognizing biomarker and condition data, mapping the information found into the data model, and automatically mapping other contextual and standardized data to the model so that the data are AI and machine learning ready.&lt;br /&gt;
&lt;br /&gt;
=== ArgosDB Curation Project ===&lt;br /&gt;
This project focuses on evaluating and curating high-quality genomes of emerging and clinically relevant pathogens, with an emphasis on fungal species. Using public genomic repositories and FDA-ARGOS inclusion criteria, candidate organisms are identified for database expansion to support diagnostic assay development and public health surveillance.&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Symposium_2025&amp;diff=879</id>
		<title>Symposium 2025</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Symposium_2025&amp;diff=879"/>
		<updated>2025-07-17T18:47:19Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: /* Project Description */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The HIVE Lab summer symposium is scheduled for Thursday, July 31, 2025. It is an exciting time for the lab volunteers and interns to present their findings on the projects they worked on for 8 weeks.&lt;br /&gt;
&lt;br /&gt;
[[File:DC.png|center|frame]]&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Program and Information&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
=== &#039;&#039;&#039;Symposium Venue&#039;&#039;&#039; ===&lt;br /&gt;
The HIVE Lab symposium will be held in person at The George Washington University, Washington, DC, with an option to join virtually.&lt;br /&gt;
&lt;br /&gt;
In Person - Ross 647, Ross Hall, School of Health and Medical Sciences, The George Washington University, Washington DC ([https://maps.app.goo.gl/PHQmZacA4hWDvTCh6 MAP])&lt;br /&gt;
&lt;br /&gt;
Virtual - Zoom&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Agenda&#039;&#039;&#039; ==&lt;br /&gt;
All times are in Eastern Time (ET).&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Time (ET)&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;Project&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;Title&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;Presenter&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;10:00am&#039;&#039;&#039;&lt;br /&gt;
| colspan=&amp;quot;2&amp;quot; |                                                                                            &#039;&#039;&#039;Welcome and Introduction&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;Michael Tiemeyer (10 min)&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
| colspan=&amp;quot;4&amp;quot; |                                                                                                                         &#039;&#039;Group 1 Moderator : Nathan Edwards&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|10:10am &lt;br /&gt;
|CFDE&lt;br /&gt;
|Integrating Biocuration and Data Standardization to Generate Machine Learning-Ready Glycan Datasets&lt;br /&gt;
|Ana Jaramillo and Yuxin Zou (20 min)&lt;br /&gt;
|-&lt;br /&gt;
|10:30am&lt;br /&gt;
|CFDE&lt;br /&gt;
|&lt;br /&gt;
|Campbell Ross (15 min)&lt;br /&gt;
|-&lt;br /&gt;
|10:45am&lt;br /&gt;
|CFDE&lt;br /&gt;
|A Graph-Based AI Workflow for Mining Glycan Biomarkers and Related Annotations from Publications&lt;br /&gt;
|Cyrus Chun Hong Au Yeung (15 min)&lt;br /&gt;
|-&lt;br /&gt;
|11:00am&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|&lt;br /&gt;
|Sohana Bahl, Isaac Kim, Sparsh Gupta (15 min)&lt;br /&gt;
|-&lt;br /&gt;
|11:15am&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|&lt;br /&gt;
|Nathan Ressom, Ana Vohralikova, Mathias Belay (15 min)&lt;br /&gt;
|-&lt;br /&gt;
|11:30am&lt;br /&gt;
|BiomarkerKB&lt;br /&gt;
|&lt;br /&gt;
|John McCaffery, Alma Ogunsina, Akale Kinfe (15 min)&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;11:45am&#039;&#039;&#039;&lt;br /&gt;
| colspan=&amp;quot;2&amp;quot; |&#039;&#039;&#039;Open Q and A&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;All (30 min)&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|12:30pm&lt;br /&gt;
| colspan=&amp;quot;3&amp;quot; |                                                                                                          &#039;&#039;&#039;LUNCH (90 mins)&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
| colspan=&amp;quot;4&amp;quot; |                                                                                                                         &#039;&#039;Group 1 Moderator : Nathan Edwards&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|2:00pm&lt;br /&gt;
|Predictmod AI-READI&lt;br /&gt;
|Robust Classification of Glycemic Health States from Continuous Glucose &lt;br /&gt;
|Nikhil Arethiya (15 min)&lt;br /&gt;
|-&lt;br /&gt;
|2:15pm&lt;br /&gt;
|Predictmod Curation&lt;br /&gt;
|PredictMod: PubMed Curation for Training an LLM for Recommendation&lt;br /&gt;
|Grace Chong, Aaron Ressom, Diya Kamalabharathy (15 min)&lt;br /&gt;
|-&lt;br /&gt;
|2:30pm&lt;br /&gt;
|Argos&lt;br /&gt;
|Curation of Emerging Pathogen Genomes for FDA-ARGOS Database Expansion&lt;br /&gt;
|(15 min)&lt;br /&gt;
|-&lt;br /&gt;
|2:45pm&lt;br /&gt;
|GlyGen&lt;br /&gt;
|GlyGen Biocuration Project&lt;br /&gt;
|Aise Arpinar, Haravinay P. Gujjulla, Nahom Abel (20 min)&lt;br /&gt;
|-&lt;br /&gt;
|3:05pm&lt;br /&gt;
|GlycoSiteMiner&lt;br /&gt;
|&lt;br /&gt;
|(15 min)&lt;br /&gt;
|-&lt;br /&gt;
|3:20pm&lt;br /&gt;
|Glycobiology Web Development&lt;br /&gt;
|A Resource Drill Down and Visualization for the Glyspace Alliance&lt;br /&gt;
|Diya Kamalabharathy (5 min)&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;3:25pm&#039;&#039;&#039;&lt;br /&gt;
| colspan=&amp;quot;2&amp;quot; |&#039;&#039;&#039;Open Q and A&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;All (20 min)&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|3:45pm&lt;br /&gt;
| colspan=&amp;quot;2&amp;quot; |                                                                                       &#039;&#039;&#039;Closing Remarks&#039;&#039;&#039;&lt;br /&gt;
|&#039;&#039;&#039;Raja Mazumder&#039;&#039;&#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;&#039;Project Description&#039;&#039;&#039; ==&lt;br /&gt;
&lt;br /&gt;
=== GlyGen Project ===&lt;br /&gt;
The GlyGen Biocuration project focuses on integrating legacy, yet valuable, data from the CarbBank and CFG databases into the GlyGen infrastructure. A key challenge is mapping metadata, such as species names and publication references, to standardized dictionaries and ontologies. While most entries have been automatically matched using custom scripts, remaining inconsistencies, including outdated, misspelled, or abbreviated terms, require manual curation using resources such as Google, PubMed, and domain-specific dictionaries and ontologies.&lt;br /&gt;
&lt;br /&gt;
=== BiomarkerKB Biocuration Project ===&lt;br /&gt;
The Biomarker Biocuration project focuses on curating biomarkers from abstracts and publications into the BiomarkerKB data model. A key challenge in curating biomarkers is the vast amount of data spread across various publications. Manual curation requires reading, inferring, and understanding key elements of biomarker data and mapping them to the defined biomarker data model. LLM methodologies will help immensely in recognizing biomarker and condition data, mapping the information found into the data model, and automatically mapping other contextual and standardized data to the model so that the data are AI and machine learning ready.&lt;br /&gt;
&lt;br /&gt;
=== ArgosDB Curation Project ===&lt;br /&gt;
This project focuses on evaluating and curating high-quality genomes of emerging and clinically relevant pathogens, with an emphasis on fungal species. Using public genomic repositories and FDA-ARGOS inclusion criteria, candidate organisms are identified for database expansion to support diagnostic assay development and public health surveillance.&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Additional_ARGOS_Reviewed_Organisms&amp;diff=856</id>
		<title>Additional ARGOS Reviewed Organisms</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Additional_ARGOS_Reviewed_Organisms&amp;diff=856"/>
		<updated>2025-06-05T16:15:16Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Back to the [[FDA-ARGOS WIKI|Home Page]] for FDA-ARGOS&lt;br /&gt;
&lt;br /&gt;
This page highlights organisms that were selected, reviewed, and added to ARGOSDB that are external to the FDA-ARGOS BioProject PRJNA231221. The QC and metadata information for these organisms can be found in these tables on data.argosdb.org:&lt;br /&gt;
&lt;br /&gt;
- assemblyQC_ARGOS&lt;br /&gt;
&lt;br /&gt;
- biosampleMeta_ARGOS&lt;br /&gt;
&lt;br /&gt;
- ngsQC_ARGOS&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Organism List==&lt;br /&gt;
The organisms listed below can also be found on the [https://data.argosdb.org/ARGOS_000018 ngs_id_list.tsv table] on data.argosdb.org.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Organisms that have been QC&#039;d and reviewed:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
- Influenza A (H3N2) 2022&lt;br /&gt;
&lt;br /&gt;
- Influenza A (H1N1) 2022&lt;br /&gt;
&lt;br /&gt;
- Influenza A (Puerto Rico H1N1) reference genome&lt;br /&gt;
&lt;br /&gt;
- Monkeypox Virus (WRAIR 761)&lt;br /&gt;
&lt;br /&gt;
- Monkeypox Virus (USA2003)&lt;br /&gt;
&lt;br /&gt;
- Human immunodeficiency virus type 1&lt;br /&gt;
&lt;br /&gt;
- Orthomarburgvirus marburgense (isolate Ravn virus/H.sapiens-tc/KEN/1987/Kitum Cave-810040)&lt;br /&gt;
&lt;br /&gt;
- Lake Victoria marburgvirus (Angola2005 strain Ang1379c)&lt;br /&gt;
&lt;br /&gt;
- Marburg virus (Musoke Kenya 1980 for SAMN11077998)&lt;br /&gt;
&lt;br /&gt;
- Marburg virus (Musoke Kenya 1980 for SAMN16357613)&lt;br /&gt;
&lt;br /&gt;
- Lake Victoria marburgvirus (Ci67)&lt;br /&gt;
&lt;br /&gt;
- Sudan ebolavirus (Gulu)&lt;br /&gt;
&lt;br /&gt;
- Severe acute respiratory syndrome coronavirus 2 (Omicron/XBB.1.5 SARS-CoV-2/human/USA/CA-CDC-STM-S85G6U7MH/2023)&lt;br /&gt;
&lt;br /&gt;
- Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2/human/USA/WA-CDC-02982586-001/2020)&lt;br /&gt;
&lt;br /&gt;
- Severe acute respiratory syndrome coronavirus 2 (isolate Wuhan-Hu-1)&lt;br /&gt;
&lt;br /&gt;
- Candidozyma auris&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Organisms that were selected, but have not been QC&#039;d due to missing data information:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
- Middle East respiratory syndrome related coronavirus (EMC/2012)&lt;br /&gt;
&lt;br /&gt;
- Middle East respiratory syndrome related coronavirus (VR-3248SD)&lt;br /&gt;
&lt;br /&gt;
- Middle East respiratory syndrome related coronavirus (IRF0021 MERS JOR)&lt;br /&gt;
&lt;br /&gt;
- Middle East respiratory syndrome related coronavirus (IRF0038 MERS EMC)&lt;br /&gt;
&lt;br /&gt;
- Severe acute respiratory syndrome coronavirus 2 (Delta/B.1.617.2)&lt;br /&gt;
&lt;br /&gt;
- Severe acute respiratory syndrome coronavirus 2 (Omicron/B.1.1.529)&lt;br /&gt;
&lt;br /&gt;
- Severe acute respiratory syndrome coronavirus 2 (Beta/B.1.351)&lt;br /&gt;
&lt;br /&gt;
- Severe acute respiratory syndrome coronavirus 2 (Alpha/B.1.1.7)&lt;br /&gt;
&lt;br /&gt;
- Severe acute respiratory syndrome coronavirus 2 (Gamma/P.1)&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_FAQs&amp;diff=855</id>
		<title>FDA-ARGOS FAQs</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_FAQs&amp;diff=855"/>
		<updated>2025-06-05T16:13:32Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
Back to [[FDA-ARGOS WIKI|Home Page]] for FDA-ARGOS&lt;br /&gt;
&lt;br /&gt;
===== What is the ArgosDB and how is it organized? =====&lt;br /&gt;
ArgosDB was developed as a result of expanded funding for the FDA-ARGOS project, which is described in detail on the [[About Argos DataBase|About]] page of this website. The database stores cross-kingdom QC attributes of clinically relevant organisms organized into respective datasets. The current datasets (as of March 2025) are: ngsQC_ARGOS, ngsQC_ARGOS_unreviewed, assemblyQC_ARGOS, assemblyQC_ARGOS_unreviewed, biosampleMeta_ARGOS, and biosampleMeta_ARGOS_unreviewed. The original four key datasets are ngsQC_HIVE, assemblyQC_HIVE, siteQC_HIVE, and biosampleMeta_HIVE. These datasets are associated with core QC protocols, which are documented via BioCompute Objects (BCOs) and organized under their BCO IDs. &lt;br /&gt;
&lt;br /&gt;
When new QC data are produced for an organism of interest, the respective dataset(s) are appended, which is documented per data release in the FDA-ARGOS GitHub (&amp;lt;nowiki&amp;gt;https://github.com/FDA-ARGOS&amp;lt;/nowiki&amp;gt;). All datasets are in alignment with the current data dictionary (v1.6 as of March 2025), which guides the QC process for each dataset as well as its column headers. Datasets are available as .tsv files, or as FASTA files if associated with a genome assembly, and all datasets and BCOs are available for download. All data provenance and curation are captured and reproducible via their BCOs. Additional datasets include data from the original FDA BioProject, the Data Dictionary, Drug Resistance Mutations, Genome Assemblies (multiple), and a mapping key that assists in linking all the available data via important accessions. &lt;br /&gt;
&lt;br /&gt;
A total of 24 datasets are available as of 03/2025.&lt;br /&gt;
&lt;br /&gt;
===== How can I view or access the previous versions of the data? =====&lt;br /&gt;
First, go to the Release History tab on the ARGOS home page. Next, click details on the desired data object. Then select the desired version and data from the version transition dropdown in the top left corner and view metrics such as field count, fields added, fields removed, row count, row count prev, rows count change, ID count, IDs added, and IDs removed.&lt;br /&gt;
&lt;br /&gt;
===== What does the schema version in the datasets refer to? =====&lt;br /&gt;
The schema refers to the organization of the ARGOS data within the data model. The version reflects the FDA-ARGOS data dictionary version currently applied to all updated datasets. As of March 2025, the current schema is v1.6 and can be found in the [https://github.com/FDA-ARGOS/data.argosdb/tree/main/data_dictionary/v1.6 FDA-ARGOS GitHub]. &lt;br /&gt;
&lt;br /&gt;
===== Does Argos have a tutorial on how to use the site? =====&lt;br /&gt;
Yes! Please follow the basic instructions below on how to navigate the DB:  &lt;br /&gt;
&lt;br /&gt;
====== How to find and search a dataset ======&lt;br /&gt;
On the data.argosdb.org home page, you can search for a dataset by entering a keyword in Search Datasets. &lt;br /&gt;
&lt;br /&gt;
Keywords can be a BCO ID, an organism name, or even a term that describes a biological process. In the following example, three results appear upon searching for ebola.  &lt;br /&gt;
To further narrow down the results, select filters on the left sidebar. Alternatively, users can skip the keyword search and find datasets by selecting relevant filters alone.[[File:Screenshot 2025-03-06 at 1.55.51 PM.png|thumb|719x719px|data.argosdb Home Page|none]][[File:Ebola search.png|thumb|722x722px|&#039;ebola&#039; searched in the search bar of the ARGOS database|none]]&lt;br /&gt;
&lt;br /&gt;
====== How to select a dataset ======&lt;br /&gt;
Next, to select a dataset, click on view details under DETAILS. Previously released dataset versions are available by clicking the dropdown button. &lt;br /&gt;
[[File:Dataset.png|thumb|722x722px|view of an example dataset after clicking on the &#039;...view details&#039; link on the homepage. The dropdown menu at the top lets you select data versions.|none]] &lt;br /&gt;
&lt;br /&gt;
====== How to view and download the BCO corresponding to a dataset ======&lt;br /&gt;
To download the dataset, click on the DOWNLOADS tab and select the download format for the target dataset. The BCO JSON will be downloaded and automatically opened as a .txt file upon clicking Download BCO. The dataset will be downloaded and automatically opened as either a .tsv or .csv file upon clicking Download dataset file.&lt;br /&gt;
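Once downloaded, a .tsv dataset can be inspected programmatically. A minimal sketch with Python's standard csv module; the column names and values below are illustrative only, not actual ARGOS data dictionary fields:

```python
import csv
import io

# Hypothetical excerpt of a downloaded ARGOS .tsv file; a real file would
# be opened from disk after downloading it from the DOWNLOADS tab.
sample = "organism_name\tassembly_id\nSudan ebolavirus\tGCA_000001\n"

def read_argos_tsv(handle):
    """Parse a tab-separated dataset into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))

rows = read_argos_tsv(io.StringIO(sample))
print(rows[0]["organism_name"])  # prints: Sudan ebolavirus
```

The same function works on a file handle opened with open(path, newline="") for a downloaded dataset.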
[[File:Bco.png|thumb|720x720px|BCO JSON tab of the dataset.|none]]&lt;br /&gt;
[[File:Screenshot 2025-03-06 at 2.08.02 PM.png|thumb|724x724px|Downloads tab for the dataset. BCO and table can be downloaded here.|none]]&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Additional_ARGOS_Reviewed_Organisms&amp;diff=854</id>
		<title>Additional ARGOS Reviewed Organisms</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Additional_ARGOS_Reviewed_Organisms&amp;diff=854"/>
		<updated>2025-06-02T15:25:39Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page highlights organisms that were selected, reviewed, and added to ARGOSDB that are external to the FDA-ARGOS BioProject PRJNA231221. The QC and metadata information for these organisms can be found in these tables on data.argosdb.org:&lt;br /&gt;
&lt;br /&gt;
- assemblyQC_ARGOS&lt;br /&gt;
&lt;br /&gt;
- biosampleMeta_ARGOS&lt;br /&gt;
&lt;br /&gt;
- ngsQC_ARGOS&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Organism List==&lt;br /&gt;
The organisms listed below can also be found on the [https://data.argosdb.org/ARGOS_000018 ngs_id_list.tsv table] on data.argosdb.org.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Organisms that have been QC&#039;d and reviewed:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
- Influenza A (H3N2) 2022&lt;br /&gt;
&lt;br /&gt;
- Influenza A (H1N1) 2022&lt;br /&gt;
&lt;br /&gt;
- Influenza A (Puerto Rico H1N1) reference genome&lt;br /&gt;
&lt;br /&gt;
- Monkeypox Virus (WRAIR 761)&lt;br /&gt;
&lt;br /&gt;
- Monkeypox Virus (USA2003)&lt;br /&gt;
&lt;br /&gt;
- Human immunodeficiency virus type 1&lt;br /&gt;
&lt;br /&gt;
- Orthomarburgvirus marburgense (isolate Ravn virus/H.sapiens-tc/KEN/1987/Kitum Cave-810040)&lt;br /&gt;
&lt;br /&gt;
- Lake Victoria marburgvirus (Angola2005 strain Ang1379c)&lt;br /&gt;
&lt;br /&gt;
- Marburg virus (Musoke Kenya 1980 for SAMN11077998)&lt;br /&gt;
&lt;br /&gt;
- Marburg virus (Musoke Kenya 1980 for SAMN16357613)&lt;br /&gt;
&lt;br /&gt;
- Lake Victoria marburgvirus (Ci67)&lt;br /&gt;
&lt;br /&gt;
- Sudan ebolavirus (Gulu)&lt;br /&gt;
&lt;br /&gt;
- Severe acute respiratory syndrome coronavirus 2 (Omicron/XBB.1.5 SARS-CoV-2/human/USA/CA-CDC-STM-S85G6U7MH/2023)&lt;br /&gt;
&lt;br /&gt;
- Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2/human/USA/WA-CDC-02982586-001/2020)&lt;br /&gt;
&lt;br /&gt;
- Severe acute respiratory syndrome coronavirus 2 (isolate Wuhan-Hu-1)&lt;br /&gt;
&lt;br /&gt;
- Candidozyma auris&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Organisms that were selected, but have not been QC&#039;d due to missing data:&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
- Middle East respiratory syndrome related coronavirus (EMC/2012)&lt;br /&gt;
&lt;br /&gt;
- Middle East respiratory syndrome related coronavirus (VR-3248SD)&lt;br /&gt;
&lt;br /&gt;
- Middle East respiratory syndrome related coronavirus (IRF0021 MERS JOR)&lt;br /&gt;
&lt;br /&gt;
- Middle East respiratory syndrome related coronavirus (IRF0038 MERS EMC)&lt;br /&gt;
&lt;br /&gt;
- Severe acute respiratory syndrome coronavirus 2 (Delta/B.1.617.2)&lt;br /&gt;
&lt;br /&gt;
- Severe acute respiratory syndrome coronavirus 2 (Omicron/B.1.1.529)&lt;br /&gt;
&lt;br /&gt;
- Severe acute respiratory syndrome coronavirus 2 (Beta/B.1.351)&lt;br /&gt;
&lt;br /&gt;
- Severe acute respiratory syndrome coronavirus 2 (Alpha/B.1.1.7)&lt;br /&gt;
&lt;br /&gt;
- Severe acute respiratory syndrome coronavirus 2 (Gamma/P.1)&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_WIKI&amp;diff=842</id>
		<title>FDA-ARGOS WIKI</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=FDA-ARGOS_WIKI&amp;diff=842"/>
		<updated>2025-05-14T15:53:08Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: /* FDA-ARGOS Database */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Introduction ==&lt;br /&gt;
&lt;br /&gt;
FDA-ARGOS database updates may help researchers rapidly validate diagnostic tests and use qualified genetic sequences to support future product development. NCBI BioProject [https://www.ncbi.nlm.nih.gov/bioproject/231221 PRJNA231221]&lt;br /&gt;
&lt;br /&gt;
As of September 2021, Embleema and George Washington University have been conducting bioinformatic research and system development, focusing on expanding the FDA-ARGOS database. This project expands datasets publicly available in FDA-ARGOS, improves quality control by developing quality matrix tools and scoring approaches that will allow the mining of public sequence databases, and identifies high-quality sequences for upload to the FDA-ARGOS database as regulatory-grade sequences. Building on expansions during the COVID-19 pandemic, this project aims to further improve the utility of the FDA-ARGOS database as a key tool for medical countermeasure development and validation.&lt;br /&gt;
&lt;br /&gt;
For additional details on project information and assembly QC see&lt;br /&gt;
&lt;br /&gt;
* [https://www.fda.gov/emergency-preparedness-and-response/preparedness-research/expanding-next-generation-sequencing-tools-support-pandemic-preparedness-and-response FDA-ARGOS Project Information]&lt;br /&gt;
* [https://data.argosdb.org/ ARGOS Database]&lt;br /&gt;
* [[About Argos DataBase]]&lt;br /&gt;
* [https://www.fda.gov/medical-devices/science-and-research-medical-devices/database-reference-grade-microbial-sequences-fda-argos FDA Statement for FDA-ARGOS Database]&lt;br /&gt;
* [[ARGOS Contact Us]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;FDA-ARGOS Initial Phase&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In May 2014, the FDA and collaborators established a publicly available database for Reference Grade microbial Sequences called FDA-ARGOS. With funding support from FDA&#039;s Office of Counterterrorism and Emerging Threats (OCET) and the DoD, the FDA-ARGOS team initially collected and sequenced 2000 microbes, including biothreat microorganisms, common clinical pathogens, and closely related species. At the beginning of this project, the FDA-ARGOS microbial genomes were generated in 3 phases. Generally:&lt;br /&gt;
&lt;br /&gt;
* Phase 1 entailed collection of a previously identified microbe and nucleic acid extraction.&lt;br /&gt;
* In Phase 2, the microbial nucleic acids were sequenced and de novo assembled using Illumina and PacBio sequencing platforms at the Institute for Genome Sciences at the University of Maryland (UMD-IGS).&lt;br /&gt;
* In Phase 3, the assembled genomes were vetted by an ID-NGS subject matter expert working group consisting of FDA personnel and collaborators, and the data were then deposited in the NCBI databases.&lt;br /&gt;
&lt;br /&gt;
The FDA-ARGOS genomes meet the quality metrics for reference-grade genomes for regulatory use. FDA-ARGOS reference genomes have been de novo assembled with high depth of base coverage and placed within a pre-established phylogenetic tree. Each microbial isolate in the database is covered at a minimum of 20X over 95 percent of the assembled core genome. Furthermore, sample-specific metadata, raw reads, assemblies, annotation, and details of the bioinformatics pipeline are available.&lt;br /&gt;
&lt;br /&gt;
== FDA-ARGOS Database ==&lt;br /&gt;
&lt;br /&gt;
The FDA-ARGOS database ([https://data.argosdb.org/ data.argosdb.org]) is a public database containing quality-controlled reference genomes for diagnostic and regulatory purposes.&lt;br /&gt;
&lt;br /&gt;
* FDA-ARGOS NCBI BioProject can be found [https://www.ncbi.nlm.nih.gov/bioproject/231221 here].&lt;br /&gt;
* [[Additional ARGOS Reviewed Organisms]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For a visual display of data analytics, view the prototype dashboard [https://studyanalytics.embleema.com/superset/dashboard/argos/?standalone=2&amp;amp;expand_filters=0 here].&lt;br /&gt;
&lt;br /&gt;
== Project Publications ==&lt;br /&gt;
&lt;br /&gt;
*Sichtig, H., Minogue, T., Yan, Y. &#039;&#039;et al.&#039;&#039; FDA-ARGOS is a database with public quality-controlled reference genomes for diagnostic use and regulatory science. &#039;&#039;Nat Commun&#039;&#039; 10, 3313 (2019). https://doi.org/10.1038/s41467-019-11306-6&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== FDA-ARGOS FAQs ==&lt;br /&gt;
Frequently asked questions about FDA-ARGOS can be found [[FDA-ARGOS FAQs|here]]. If there are any further questions, feel free to [[ARGOS Contact Us|contact us]].&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Volunteership_2025&amp;diff=839</id>
		<title>Volunteership 2025</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=Volunteership_2025&amp;diff=839"/>
		<updated>2025-05-13T15:48:11Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;h2&amp;gt;2025 Volunteer Program Details&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h3&amp;gt;Dates&amp;lt;/h3&amp;gt;&lt;br /&gt;
&amp;lt;strong&amp;gt;Volunteer Zoom Kick-Off Meeting&amp;lt;/strong&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
May 26, 2025 | 3:30 to 4:30 PM&lt;br /&gt;
&lt;br /&gt;
&amp;lt;strong&amp;gt;Program Dates: June 2nd, 2025 – July 25th, 2025&amp;lt;/strong&amp;gt; (8 weeks)&amp;lt;br&amp;gt;&lt;br /&gt;
Monday to Friday | Remote | No breaks&lt;br /&gt;
&lt;br /&gt;
&amp;lt;hr&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h3&amp;gt;Volunteer Expectations&amp;lt;/h3&amp;gt;&lt;br /&gt;
&amp;lt;ol&amp;gt;&lt;br /&gt;
  &amp;lt;li&amp;gt;Daily progress updates via Slack (scrum).&amp;lt;/li&amp;gt;&lt;br /&gt;
  &amp;lt;li&amp;gt;Regular Zoom meetings with the assigned project point of contact.&amp;lt;/li&amp;gt;&amp;lt;li&amp;gt;Expected to dedicate 5–6 hours per day to project work, with the remaining time focused on skill development or reading. &amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ol&amp;gt;&lt;br /&gt;
&amp;lt;p style=&amp;quot;color: red;&amp;quot;&amp;gt;&amp;lt;strong&amp;gt;Important:&amp;lt;/strong&amp;gt; If the scrum is not updated for 2 consecutive days, the candidate will be &amp;lt;u&amp;gt;automatically dropped&amp;lt;/u&amp;gt; from the program.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;hr&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h3&amp;gt;Potential Projects&amp;lt;/h3&amp;gt;&lt;br /&gt;
&amp;lt;ol&amp;gt;&lt;br /&gt;
  &amp;lt;li&amp;gt;BiomarkerKB ([https://biomarkerkb.org biomarkerkb.org]) project: Biomarker curation project. Involves reading papers and collecting biomarkers.&amp;lt;/li&amp;gt;&lt;br /&gt;
  &amp;lt;li&amp;gt;GlyGen ([https://glygen.org glygen.org]) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information. &amp;lt;/li&amp;gt;&amp;lt;li&amp;gt;ARGOS ([https://argosdb.org argosdb.org]) project: Analyze genomics data using HIVE to identify reference genome assemblies. &amp;lt;/li&amp;gt;&amp;lt;li&amp;gt;PredictMod ([https://hivelab.biochemistry.gwu.edu/predictmod hivelab.biochemistry.gwu.edu/predictmod]) project: Identify datasets and harmonize them so that they can be used to generate ML models.  &amp;lt;/li&amp;gt;&amp;lt;/ol&amp;gt;&#039;&#039;Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.&#039;&#039;&amp;lt;hr&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h4&amp;gt;1. BiomarkerKB Biocuration Project Ideas&amp;lt;/h4&amp;gt;POC: Daniall Masood, Maria Kim&lt;br /&gt;
# Curate biomarkers for a specific disease (e.g., Alzheimer&#039;s)&lt;br /&gt;
## The student would be doing manual curation for about 4 weeks, with regular check-ins with me to ensure it is being done correctly.&lt;br /&gt;
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.&lt;br /&gt;
# Top 50 biomarkers&lt;br /&gt;
## Curate the top 50 biomarkers for biomarkerkb.org.&lt;br /&gt;
## Define what constitutes a top 50 biomarker.&lt;br /&gt;
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.&lt;br /&gt;
# Biocuration of biomarkers from NLP/LLM work&lt;br /&gt;
## Use the biomarkers collected from NLP work.&lt;br /&gt;
## Curate biomarkers; the data provided was not in the biomarker data model.&lt;br /&gt;
## While curating the biomarkers, check if data collected from NLP is correct.&lt;br /&gt;
## After completion, the student can start using curated data to work on the NLP/LLM method.&lt;br /&gt;
# Curate biomarkers for a treatment&lt;br /&gt;
## See #1 above.&lt;br /&gt;
&lt;br /&gt;
If the student has any other ideas, diseases, treatments, or methods they want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check if it will be feasible as a project for the summer.&lt;br /&gt;
&lt;br /&gt;
==== 2. GlyGen Biocuration Project Ideas ====&lt;br /&gt;
POC: Rene Ranzinger and Urnisha Bhuiyan&lt;br /&gt;
&lt;br /&gt;
Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because much of the valuable metadata (e.g., species, tissue, disease, and cell line annotations) consists of free-text terms that do not align with established standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, &amp;quot;human,&amp;quot; &amp;quot;man,&amp;quot; and &amp;quot;h. sapiens&amp;quot; all map to the scientific species name &amp;quot;Homo sapiens.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.&lt;br /&gt;
&lt;br /&gt;
The project involves:&lt;br /&gt;
&lt;br /&gt;
# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.&lt;br /&gt;
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.&lt;br /&gt;
# Finding papers based on titles and author lists that may contain spelling errors.&lt;br /&gt;
# Interacting and discussing with other curators in case terms are mapped differently.&lt;br /&gt;
&lt;br /&gt;
If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3. GlyGen Publication Analysis Project Ideas&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
POC: Rene Ranzinger and Urnisha Bhuiyan&lt;br /&gt;
&lt;br /&gt;
One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.&lt;br /&gt;
&lt;br /&gt;
The project involves:&lt;br /&gt;
&lt;br /&gt;
# Using the PubMed web API to filter publications based on keywords.&lt;br /&gt;
# Analyzing paper abstracts to identify research institutions and groups that form the community.&lt;br /&gt;
# Filtering the community list to exclude unrelated co-authors.&lt;br /&gt;
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.&lt;br /&gt;
&lt;br /&gt;
A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.&lt;br /&gt;
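The keyword-filtering step of this project can be sketched against NCBI's public E-utilities esearch endpoint. A minimal sketch: the query term is illustrative, and the canned JSON response (with placeholder PMIDs) stands in for a live API call so no network access is needed:

```python
import json
import urllib.parse

# NCBI E-utilities esearch endpoint for PubMed keyword searches.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(term, retmax=20):
    """Build an esearch query URL that returns matching PMIDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urllib.parse.urlencode(params)

url = build_esearch_url("glycoinformatics AND database")

# Canned response in the shape esearch returns with retmode=json;
# the PMIDs here are placeholders, not real search results.
canned = '{"esearchresult": {"count": "2", "idlist": ["12345678", "23456789"]}}'
pmids = json.loads(canned)["esearchresult"]["idlist"]
```

In a real run, the URL would be fetched (e.g., with urllib.request) and the idlist used to retrieve abstracts for the community analysis described above.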
&lt;br /&gt;
==== 4. PredictMod Machine Learning Project Ideas ====&lt;br /&gt;
POC: Lori Krammer&lt;br /&gt;
&lt;br /&gt;
Data Identification &amp;amp; Harmonization: &lt;br /&gt;
&lt;br /&gt;
# Identify publicly available datasets from the scientific literature that can be used for intervention outcome prediction models.&lt;br /&gt;
# Conduct data harmonization and pre-processing following established project pipelines to produce an ML-ready dataset and data dictionary.&lt;br /&gt;
&lt;br /&gt;
Modeling &amp;amp; Integration (for those with experience in programming/ML)&lt;br /&gt;
&lt;br /&gt;
# Perform model training and document ML pipeline in a BioCompute Object (BCO). &lt;br /&gt;
# Integrate model into PredictMod platform.&lt;br /&gt;
&lt;br /&gt;
Individuals with a background or interest in machine learning should reach out to lorikrammer@gwu.edu with a potential dataset to determine if it is a feasible project for the summer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5. FDA-ARGOS Computation and Pathogen Curation Project&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
POC: Christie Woodside, Jonathon Keeney&lt;br /&gt;
&lt;br /&gt;
# Update data tables for more efficient computations&lt;br /&gt;
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This would be manual (but super important) work requiring high attention to detail. ~1 week&#039;s worth of work&lt;br /&gt;
## Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.&lt;br /&gt;
# Curate and report on current pathogens to upload to ARGOS&lt;br /&gt;
## Student would work on manual curation of circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports of what was found. ~4-10 weeks&#039; worth of work&lt;br /&gt;
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.&lt;br /&gt;
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.&lt;br /&gt;
&lt;br /&gt;
If the student has any other ideas or methods they want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check if it will be feasible as a project for the summer.&amp;lt;hr&amp;gt;&lt;br /&gt;
&amp;lt;h3&amp;gt;Requirements for Completion&amp;lt;/h3&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&amp;lt;strong&amp;gt;Note:&amp;lt;/strong&amp;gt; The following are &amp;lt;u&amp;gt;mandatory&amp;lt;/u&amp;gt;. Failure to complete any will result in an incomplete volunteer record.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h4&amp;gt;Documentation&amp;lt;/h4&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h4&amp;gt;Written Report&amp;lt;/h4&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h4&amp;gt;Presentation &amp;amp; Slide Submission&amp;lt;/h4&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;Present your work during the last week of the 8-week period.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;Slides must be submitted to the Admin Team and should include:&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;ul&amp;gt;&lt;br /&gt;
  &amp;lt;li&amp;gt;A title slide with your name, date, and mentor&amp;lt;/li&amp;gt;&lt;br /&gt;
  &amp;lt;li&amp;gt;At least 3 content slides&amp;lt;/li&amp;gt;&lt;br /&gt;
  &amp;lt;li&amp;gt;A final slide with acknowledgements or references&amp;lt;/li&amp;gt;&lt;br /&gt;
&amp;lt;/ul&amp;gt;&lt;br /&gt;
Contact the Admin Team to access previously submitted slides.&lt;br /&gt;
&amp;lt;hr&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Completion Certificate ===&lt;br /&gt;
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.&lt;br /&gt;
&amp;lt;hr&amp;gt;&lt;br /&gt;
=== Contact ===&lt;br /&gt;
mazumder_lab@gwu.edu.&lt;br /&gt;
&amp;lt;hr&amp;gt;&lt;br /&gt;
=== Volunteers ===&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|+&lt;br /&gt;
|-&lt;br /&gt;
! Name&lt;br /&gt;
!Project&lt;br /&gt;
!Projects Interested&lt;br /&gt;
|-&lt;br /&gt;
| [https://www.linkedin.com/in/gracesjchong/ Grace Chong]&lt;br /&gt;
|PredictMod&lt;br /&gt;
|&lt;br /&gt;
# PredictMod&lt;br /&gt;
# BiomarkerKB Biocuration&lt;br /&gt;
# GlyGen Biocuration&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/alma-ogunsina-4959072b1/ Alma Ogunsina]&lt;br /&gt;
|Biomarker curation&lt;br /&gt;
|&lt;br /&gt;
# BiomarkerKB&lt;br /&gt;
# ARGOS&lt;br /&gt;
# PredictMod&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/diya-kamalabharathy-62557935a/ Diya Kamalabharathy]&lt;br /&gt;
|PredictMod&lt;br /&gt;
|&lt;br /&gt;
# BiomarkerKB Biocuration&lt;br /&gt;
# PredictMod Machine Learning&lt;br /&gt;
# GlyGen Biocuration&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/harivinay-prasad-reddy-gujjula-a06ba71bb/ Harivinay P. Gujjula]&lt;br /&gt;
|GlyGen curation&lt;br /&gt;
|&lt;br /&gt;
# GlyGen Biocuration&lt;br /&gt;
# BioMarkerKB Biocuration&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/miao-wang-88b602290/Miao&amp;amp;#x20;Wang Miao Wang]&lt;br /&gt;
|ARGOS&lt;br /&gt;
|&lt;br /&gt;
# BiomarkerKB Biocuration Project Ideas&lt;br /&gt;
# FDA-ARGOS Computation and Pathogen Curation Project&lt;br /&gt;
# PredictMod Machine Learning Project Ideas&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/nahom-gebreselassie-1545ab336/ Nahom Abel]&lt;br /&gt;
|GlyGen curation&lt;br /&gt;
|&lt;br /&gt;
# BiomarkerKB Biocuration&lt;br /&gt;
# GlyGen Biocuration&lt;br /&gt;
# PredictMod&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/kajal-patel-cs/ Kajal Sanjaykumar Patel]&lt;br /&gt;
|GlyGen and PubMed project&lt;br /&gt;
|&lt;br /&gt;
#PredictMod&lt;br /&gt;
#BiomarkerKB&lt;br /&gt;
#GlyGen&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/john-mccaffrey-b8850930a/ John McCaffrey]&lt;br /&gt;
|Biomarker curation&lt;br /&gt;
|&lt;br /&gt;
# PredictMod&lt;br /&gt;
# BiomarkerKB&lt;br /&gt;
# GlyGen Biocuration &lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/nathan-ressom/ Nathan Ressom]&lt;br /&gt;
|ARGOS&lt;br /&gt;
|&lt;br /&gt;
# PredictMod &lt;br /&gt;
# GlyGen Biocuration&lt;br /&gt;
# BiomarkerKB Biocuration&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/aaron-ressom/ Aaron Ressom] &lt;br /&gt;
|PredictMod&lt;br /&gt;
|&lt;br /&gt;
# BiomarkerKB &lt;br /&gt;
# PredictMod &lt;br /&gt;
# GlyGen&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/akale-kinfe/ Akale Kinfe]&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
# BiomarkerKB Biocuration&lt;br /&gt;
# GlyGen Biocuration&lt;br /&gt;
# ARGOS&lt;br /&gt;
|-&lt;br /&gt;
|Aise Arpinar &lt;br /&gt;
|GlyGen curation&lt;br /&gt;
|&lt;br /&gt;
# GlyGen Biocuration&lt;br /&gt;
# BiomarkerKB Biocuration&lt;br /&gt;
# GlyGen Publication Analysis&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/piyush-pandey-906b582b5/ Piyush Pandey]&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
# BiomarkerKB Biocuration &lt;br /&gt;
# PredictMod &lt;br /&gt;
# GlyGen Biocuration &lt;br /&gt;
|-&lt;br /&gt;
|[http://www.linkedin.com/in/filmawit-zeru-203272363 Filmawit Zeru]&lt;br /&gt;
|GlycoSiteMiner project&lt;br /&gt;
|&lt;br /&gt;
# BiomarkerKB&lt;br /&gt;
# GlyGen&lt;br /&gt;
# ARGOS&lt;br /&gt;
|-&lt;br /&gt;
|[https://www.linkedin.com/in/mathias-belay-03b51a2a3/ Mathias Belay]&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
# GlyGen&lt;br /&gt;
# PredictMod&lt;br /&gt;
# BiomarkerKB&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=ARGOSQC_Usage_Tutorial&amp;diff=813</id>
		<title>ARGOSQC Usage Tutorial</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=ARGOSQC_Usage_Tutorial&amp;diff=813"/>
		<updated>2025-04-25T20:11:20Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: /* Where to locate NCBI Information for the inputs: */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
&lt;br /&gt;
HIVE3 one-click pipeline tutorial for the FDA HIVE instance. This protocol will guide the user in running single and batch-mode QC computations. HIVE3 is an instance of HIVE that is not owned by the FDA and can be directly modified by Vahan or others on our team with the appropriate permissions.&lt;br /&gt;
&lt;br /&gt;
= Required User Information =&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Protocol Version&#039;&#039;&#039; &lt;br /&gt;
|1.0&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;HIVE Instance&#039;&#039;&#039;&lt;br /&gt;
|3&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;HIVE Link&#039;&#039;&#039;&lt;br /&gt;
|&amp;lt;nowiki&amp;gt;https://hive3.biochemistry.gwu.edu/dna.cgi?cmd=login&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Overview =&lt;br /&gt;
We constructed a one-click QC pipeline that takes user-specified organism information and combines the 3 core ARGOS workflows to produce 5 different result datasets in JSON format (Figure 1). Three of the 5 result JSONs (assemblyQC, ngsQC, and biosampleMeta) have been updated and added to data.argosdb.&lt;br /&gt;
&lt;br /&gt;
To register your account, navigate to the link under “Required User Information”. At the top right there will be a “register” tab. Fill out the appropriate fields and submit. Once submitted, please email mazumder_lab@gwu.edu so we can verify your account.&lt;br /&gt;
&lt;br /&gt;
The ARGOS pipeline can be accessed via the ‘Projects’ dropdown menu at the upper right of the screen, then under ‘Argos’. The HIVE3 link is listed in the “Required User Information” section at the beginning of this protocol. &lt;br /&gt;
&lt;br /&gt;
After a successful login, you will be navigated to the home page. Use the menu at the top right corner under projects to access the ARGOS pipeline or use this URL to access the ARGOS QC pipeline on HIVE3:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;https://hive3.biochemistry.gwu.edu/dna.cgi?cmd=argos-alqc&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Screenshot figure 1.png|thumb|667x667px|&#039;&#039;&#039;Figure 1.&#039;&#039;&#039; Input settings page for the ARGOS QC pipeline. |none]]&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;Input values:&#039;&#039; ==&lt;br /&gt;
The General tab of the ARGOS Pipeline input settings page is where the data inputs for both single and batch computations are entered. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Name:&#039;&#039;&#039; Give the computation a name. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Folder:&#039;&#039;&#039; Name the folder where your computations, data, and steps will be stored.&lt;br /&gt;
&lt;br /&gt;
* Can use: letters, numbers, _, -, and &amp;amp;&lt;br /&gt;
* Cannot use: / : ; , \ “ ” ‘ ’ &lt;br /&gt;
* A / will create a subfolder, but this is not recommended; manually moving the subfolder afterward is more reliable.&lt;br /&gt;
* Example folder name: Influenza A (h5n1)&lt;br /&gt;
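The character rules above can be captured in a short validation sketch. This is a hypothetical helper, not part of HIVE; the disallowed set is taken directly from the list above, with the slash treated as disallowed because it would create a subfolder:

```python
# Hypothetical pre-check for HIVE folder names, based on the rules above.
# Disallowed: / : ; , \ and the curly quote characters.
DISALLOWED = set('/:;,\\') | set('\u201c\u201d\u2018\u2019')

def is_valid_folder_name(name: str) -> bool:
    """Return True if no disallowed character appears in the name."""
    return not any(ch in DISALLOWED for ch in name)
```

For example, `is_valid_folder_name("Influenza A (h5n1)")` passes, while a name containing a semicolon or slash fails.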
&lt;br /&gt;
&#039;&#039;&#039;Reads:&#039;&#039;&#039; Information needed for ngsQC. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;SRR&#039;&#039;&#039;: The SRR accession number. Multiple accessions per organism can be entered, separated by commas, or by clicking the gray + sign to populate extra fields. This tool uses the NCBI SRA fasterq function to grab the fastq files directly from NCBI, so the user does not need to import them into HIVE.  &lt;br /&gt;
* &#039;&#039;&#039;HIVE reads:&#039;&#039;&#039; A dropdown menu for selecting reads already uploaded into HIVE, either from previous computations or manual uploads. &lt;br /&gt;
* See Figure 2a.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;NOTE:&#039;&#039;&#039;&#039;&#039; The algorithm automatically prefers information already in HIVE over pulling it from outside sources. If the reads are already uploaded, you do not need to use the SRR input box; use the HIVE IDs menu instead. (You can still use the SRR box if you want; it will simply search within HIVE.) See the ngsQC Protocol for how to upload SRR information using the external downloader. The external downloader process is the same as in HIVE1 and HIVE2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Reference:&#039;&#039;&#039; Information used for the assemblyQC portion of the algorithm.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Reference accession:&#039;&#039;&#039; The RefSeq or GenBank nucleotide accession number from NCBI.&lt;br /&gt;
* &#039;&#039;&#039;Assembly ID&#039;&#039;&#039;: The assembly accession number from NCBI. &lt;br /&gt;
* &#039;&#039;&#039;HIVE genome:&#039;&#039;&#039; Use the drop down menu to select a reference genome that has already been uploaded into HIVE. &lt;br /&gt;
* See Figure 2b.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;NOTE:&#039;&#039;&#039;&#039;&#039; The algorithm automatically prefers information already in HIVE over pulling it from outside sources. If the reference is already uploaded, you do not need to use the assembly ID or reference accession input box, but you can if you want to. See the AssemblyQC Protocol for how to upload assembly information using an external downloader or local upload. This process is the same as in HIVE2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Metadata: &#039;&#039;&#039; Used to grab information necessary to fill out the BiosampleMeta_HIVE document.&lt;br /&gt;
&lt;br /&gt;
* Biosample Accession: The accession number of the Biosample reported to have been used when creating the assembly; it will be linked to the SRR fastq files used for the ngsQC portion of the algorithm. This field is optional.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Coding Table:&#039;&#039;&#039; A dropdown of genetic codon tables to be used for your computation, depending on the organism. The default is “human, viral (Standard)”.&lt;br /&gt;
&lt;br /&gt;
* Tip: NCBI Taxonomy will list the codon table for each organism on their taxonomy page.&lt;br /&gt;
[[File:Screenshot figure 2 a.png|none|thumb|657x657px|&#039;&#039;&#039;Figure 2.&#039;&#039;&#039; The ARGOS_QC algorithm page input with all NCBI information for a test organism, Salmonella enterica.]]&lt;br /&gt;
[[File:Screenshot fig 2a.png|none|thumb|508x508px|&#039;&#039;&#039;Figure 2 a)&#039;&#039;&#039; The SRR accession field contains both SRR fastqs for the organism that correspond to the biosample.]]&lt;br /&gt;
[[File:Screenshot figure 2b.png|none|thumb|502x502px|&#039;&#039;&#039;Figure 2 b)&#039;&#039;&#039; The &#039;&#039;&#039;Reference Accession&#039;&#039;&#039; is the RefSeq Nucleotide accession number from NCBI, the &#039;&#039;&#039;Assembly ID&#039;&#039;&#039; is the assembly accession number from NCBI, and the Biosample Accession is the Biosample accession number from NCBI.                                         &#039;&#039;&#039;a &amp;amp; b)&#039;&#039;&#039; &#039;&#039;&#039;HIVE IDs&#039;&#039;&#039; are accessions that are already uploaded into HIVE and the algorithm will automatically select this information as opposed to pulling information from outside sources. ]]&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;Where to locate NCBI Information for the inputs:&#039;&#039; ==&lt;br /&gt;
Navigate to the NCBI legacy assembly page for your organism. Here you can find all of the information to be used for your computations.&lt;br /&gt;
[[File:Screenshot fig 3.png|none|thumb|518x518px|&#039;&#039;&#039;Figure 3.&#039;&#039;&#039; The information shown in the NCBI legacy assembly page. The RefSeq assembly accession corresponds to our organism of interest and will be used to fill out the &#039;&#039;&#039;“Assembly ID”&#039;&#039;&#039; input field on the HIVE3 ARGOS_QC input page. Please note that the bioproject matches the accession for the FDA_ARGOS bioproject, and there are 2 sequencing technologies listed, meaning that there will most likely be two SRR submissions that we can find on the SRA page (see Figure 5 and 6). ]]&lt;br /&gt;
[[File:Screenshot fig 4.png|none|thumb|670x670px|&#039;&#039;&#039;Figure 4&#039;&#039;&#039;. The bottom section of the legacy assembly page for our test organism. The column labeled RefSeq sequence lists the DNA RefSeq for our test organism which we will use for the “&#039;&#039;&#039;Reference Accession&#039;&#039;&#039;” field on the HIVE3 ARGOS_QC input page. ]]&lt;br /&gt;
[[File:Screenshot fig 5.png|none|thumb|625x625px|&#039;&#039;&#039;Figure 5.&#039;&#039;&#039; Clicking on the Biosample accession number seen in Figure 3 will redirect you to the NCBI Biosample page for this organism. Under &#039;&#039;&#039;“Related Information”&#039;&#039;&#039;, click &#039;&#039;&#039;“SRA”&#039;&#039;&#039; to navigate to the NCBI SRA page for this biosample.]]&lt;br /&gt;
[[File:Screenshot fig 6.png|none|thumb|634x634px|&#039;&#039;&#039;Figure 6.&#039;&#039;&#039; The SRA page lists several sequencing runs, each from a different platform, typically Illumina or PacBio. This is common: the platforms provide complementary views of the assembly. Illumina produces many short reads that allow an accurate reconstruction of the genomic regions analyzed by estimating the best-fit nucleotide sequence. PacBio is a long-read sequencer that records “movies” of the DNA molecule as it moves through the instrument, capturing the sequence in one pass from start to finish. The long reads then act as a map onto which the short, accurate reads are assembled. It is therefore important to use all of the runs reported in our QC pipeline.]]&lt;br /&gt;
[[File:Screenshot fig 7.png|none|thumb|635x635px|&#039;&#039;&#039;Figure 7.&#039;&#039;&#039; Clicking on the bottom link under the &#039;&#039;&#039;Runs&#039;&#039;&#039; section from the SRA page shown in &#039;&#039;&#039;Figure 6&#039;&#039;&#039; will redirect to the page containing the SRR file. Copy and paste the SRR accession number into the input field in the HIVE3 pipeline labeled “SRR”. You will need to do this for both (or more) SRR accessions. Check that the bioproject, biosample, and organism name all coincide with our test organism.]]&lt;br /&gt;
&lt;br /&gt;
= Single QC Computation =&lt;br /&gt;
A single QC computation performs assemblyQC, biosampleQC, and ngsQC on one organism with one assembly; it can include multiple SRR IDs. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1:&#039;&#039;&#039; Input the name of the computation and the name of the folder you wish to store the computation in. If you would like to add to a pre-existing folder, input the exact folder name in the folder input field. It is case sensitive. See input criteria at the beginning of this protocol.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 2:&#039;&#039;&#039; Under the dropdown menu for Reads, select which input you will use. &lt;br /&gt;
&lt;br /&gt;
* For SRRs, type or paste in the SRR ID. If there is more than one SRR ID, click on the gray + sign to populate a new input field or separate the IDs with a comma.&lt;br /&gt;
** Troubleshooting: if the computation fails, try removing the spaces between the commas and SRR IDs (no spaces at all).&lt;br /&gt;
* For HIVE IDs, click on the HIVE ID option from the dropdown menu, then click on the gray dropdown arrow next to HIVE reads. A pop-up window will open, as seen in Figure 9. &lt;br /&gt;
** Click on the IDs that you wish to use in the computation. Use Ctrl + Shift to highlight multiple IDs.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 8.&#039;&#039;&#039; Input settings page for the ARGOS QC pipeline filled out for a single QC computation of our test organism, Salmonella enterica. &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 9.&#039;&#039;&#039; Pop-up window for SRR HIVE ID selection. &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&#039;&#039;&#039;Step 3:&#039;&#039;&#039; Next to Reference, select the input object you would like to use for the computation.&lt;br /&gt;
&lt;br /&gt;
* For Reference Accession, type or paste in the id you wish to use.&lt;br /&gt;
* For Assembly IDs, type or paste in the id you wish to use. &lt;br /&gt;
* For HIVE Genomes, refer to Step 2 above on how to select a HIVE id. It is the same process.&lt;br /&gt;
* Refer to the beginning of this protocol for what ids can be inputted.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4:&#039;&#039;&#039; Next under BioSample Accessions, paste in the biosample ID you would like to use for the computation. This is optional.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 5:&#039;&#039;&#039; Lastly, select from the Coding Table dropdown menu the genetic code you would like to use for your computation, if applicable. The default is “human, viral (Standard)”.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 6:&#039;&#039;&#039; Once all of the information has been inputted correctly, click on the big blue Submit button in the middle, as seen in Figure 8.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 7:&#039;&#039;&#039; Once your computation has been submitted, you can click on the Home tab found in the top left to go back to the homepage and see the workflow.&lt;br /&gt;
&lt;br /&gt;
= Batch Mode Computation =&lt;br /&gt;
Batch mode operates on a user-specified ratio of input groups. Because the IDs are clustered with semicolons, the pipeline recognizes the IDs between semicolons as one computation, so the ratio is 1:1:1: one cluster of SRRs to one assembly to one biosample per computation. The example below displays the syntax for the inputs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;They would be grouped for computations like this example:&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Assembly 1: &lt;br /&gt;
&lt;br /&gt;
SRR0123456, SRR0123457, SRR0123458, SRR0123459&lt;br /&gt;
&lt;br /&gt;
GCA_0011223344.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654321, SAMN110654322&lt;br /&gt;
&lt;br /&gt;
Assembly 2: &lt;br /&gt;
&lt;br /&gt;
SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
&lt;br /&gt;
GCA_0011223345.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654323, SAMN110654324&lt;br /&gt;
&lt;br /&gt;
To separate batches, use a semicolon “;” between the IDs. Commas separate individual IDs; semicolons separate batches. These are entered in the General tab of the pipeline, the same as for a single computation. Within each field, this is how the above example would look in batch mode:&lt;br /&gt;
&lt;br /&gt;
SRR IDS: SRR0123456, SRR0123457, SRR0123458, SRR0123459&#039;&#039;&#039;;&#039;&#039;&#039; SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
&lt;br /&gt;
Assembly IDs: GCA_0011223344.1&#039;&#039;&#039;;&#039;&#039;&#039; GCA_0011223345.1&lt;br /&gt;
&lt;br /&gt;
BioSample Accessions: SAMN110654321, SAMN110654322&#039;&#039;&#039;;&#039;&#039;&#039; SAMN110654323, SAMN110654324&lt;br /&gt;
&lt;br /&gt;
Note the semicolon placement in the example above: commas separate the IDs within a batch, and semicolons separate the batches.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting Note:&#039;&#039;&#039; If your computation fails or there is an error, remove the spaces around the commas and semicolons. This previously caused an error; it has since been fixed, but removing spaces is worth trying if your computation fails. The input would look like:&lt;br /&gt;
&lt;br /&gt;
SRR IDS: SRR0123456,SRR0123457,SRR0123458,SRR0123459&#039;&#039;&#039;;&#039;&#039;&#039;SRR0123451,SRR0123452,SRR0123453,SRR0123454&lt;br /&gt;
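The grouping rules can be sketched in a few lines of Python. This is an illustrative parser, not part of the pipeline, and the pipeline&#039;s actual parsing may differ (for example, in how strictly it treats spaces):

```python
def parse_batch_field(field: str) -> list:
    """Split a batch-mode input field into computations.

    Semicolons separate computations (batches); commas separate the IDs
    within one computation. Whitespace around IDs is tolerated here for
    illustration, though the pipeline itself may not tolerate it.
    """
    return [[item.strip() for item in batch.split(",") if item.strip()]
            for batch in field.split(";") if batch.strip()]

# The SRR field from the example groups into two computations:
# parse_batch_field("SRR0123456,SRR0123457;SRR0123451,SRR0123452")
# yields [["SRR0123456", "SRR0123457"], ["SRR0123451", "SRR0123452"]]
```

The same function applies unchanged to the Assembly ID and BioSample Accession fields, which is why the batch ratio stays 1:1:1.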
&lt;br /&gt;
&#039;&#039;&#039;Step 1:&#039;&#039;&#039; Navigate to the tab titled Batch (Figure 10), found on the ARGOS input settings page (Figure 1) next to the General tab. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 2:&#039;&#039;&#039; For the “batch service” parameter at the bottom, select batch mode from the dropdown menu. This sets the pipeline to Batch Mode rather than single computations.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 10.&#039;&#039;&#039; Batch mode input settings window.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&#039;&#039;&#039;Step 3:&#039;&#039;&#039; Select the parameters. Click on the dropdown menu next to the text “Parameter list”.&lt;br /&gt;
&lt;br /&gt;
* Use the black plus button next to ‘Parameter List’ to populate an entry field.&lt;br /&gt;
* Select from the dropdown field the correct parameter based on the input field you used in the general input page. This can be seen in Figure 11.&lt;br /&gt;
* For example, if you pasted in SRR IDs you would choose the parameter SRR IDs. If you used HIVE IDs you would select HIVE IDs from the dropdown.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 11.&#039;&#039;&#039; Dropdown menu from the Batch tab displaying the parameter options. Select the batch parameter option based on what the input information is in the general tab. For example, if you entered SRR ids, select SRR IDs from the dropdown. If you input a reference ID, select Reference IDs.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&#039;&#039;&#039;Step 4:&#039;&#039;&#039; Input the ratio for the batch service. &lt;br /&gt;
&lt;br /&gt;
* For computations in batch mode in the one-click pipeline, the computations are separated by semicolon “;” and the IDs within the computations by a comma “,”. Since the workflow will parse the computations and recognize the IDs between the “;” as one computation, the ratio will be 1:1:1. &lt;br /&gt;
* If the ratio is 1:1:1 then enter the value 1 for each box.&lt;br /&gt;
* One set of SRRs to one assembly to one biosample. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 5:&#039;&#039;&#039; Input the information in each field. Navigate back to the input settings page (Figure 1). This is the same page used for single computations; the only difference is the semicolons and commas. The example below shows how the information is entered for batch mode.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;Batch Mode Parameter breakdown:&#039;&#039; ==&lt;br /&gt;
Assembly 1: &lt;br /&gt;
&lt;br /&gt;
SRR0123456, SRR0123457, SRR0123458, SRR0123459&lt;br /&gt;
&lt;br /&gt;
GCA_0011223344.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654321, SAMN110654322&lt;br /&gt;
&lt;br /&gt;
Assembly 2: &lt;br /&gt;
&lt;br /&gt;
SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
&lt;br /&gt;
GCA_0011223345.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654323, SAMN110654324&lt;br /&gt;
&lt;br /&gt;
Again, this is very similar to a single computation, except that batch mode uses semicolons and commas to separate the IDs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 6:&#039;&#039;&#039; Once all of the input information is complete, hit the blue button Submit. You may exit the Argos pipeline window by hitting “Home” on the top left corner.&lt;br /&gt;
&lt;br /&gt;
= QC Computation Results =&lt;br /&gt;
Once you have submitted your computations, either single or batch, you should see the pipeline workflow in your Inbox or All Objects, as shown in Figure 12. You can also view the pipeline by clicking on the “workflows” tab also seen in Figure 12.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 12.&#039;&#039;&#039;  The pipeline workflow displayed in the user’s inbox.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
As the workflow progresses, your computations will be stored in the folder that you named from the beginning of this protocol. To view the contents of the folder, simply click on the plus sign next to the folder or the folder name to open.&lt;br /&gt;
&lt;br /&gt;
Once your computations are complete, the QC outputs are stored as JSON files from the “&#039;&#039;&#039;Post-Alignment Quality Controls&#039;&#039;&#039;” computation or under the &#039;&#039;&#039;“CFlow”&#039;&#039;&#039; workflow. Post-Alignment QC can be found in the folder you specified for the computation; CFlow can be found in All Objects. To view the JSONs, click on the name so that it is highlighted blue, then click on the “Available Downloads” tab in the bottom menu.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 13.&#039;&#039;&#039; The available downloads tab and the 5 JSON files that are the QC outputs.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
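Once downloaded, the JSON outputs can be inspected locally. A minimal sketch, assuming the files keep the names listed in this section and sit together in one download directory:

```python
import json
from pathlib import Path

# The five QC output filenames described in this section.
OUTPUT_FILES = ["qcAll.json", "qcNGS.json", "biosample.json",
                "qcPos.json", "refAnnot.json"]

def load_qc_results(download_dir: str) -> dict:
    """Load whichever of the five QC JSONs are present in download_dir."""
    results = {}
    for name in OUTPUT_FILES:
        path = Path(download_dir) / name
        if path.exists():
            results[name] = json.loads(path.read_text())
    return results
```

This keeps missing files from raising errors, which is convenient when only some of the outputs have been downloaded.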
There will be 5 files reported in JSON format. Click the blue/green download icon next to each file to see the results. The file labeled &#039;&#039;&#039;qcAll.json&#039;&#039;&#039; contains the assemblyQC results, &#039;&#039;&#039;qcNGS.json&#039;&#039;&#039; contains the ngsQC results, and &#039;&#039;&#039;biosample.json&#039;&#039;&#039; contains the biosample information. We currently do not submit qcPos.json or refAnnot.json to the ARGOS DB, but the information is there to help you better understand your computation.&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=File:Screenshot_fig_7.png&amp;diff=812</id>
		<title>File:Screenshot fig 7.png</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=File:Screenshot_fig_7.png&amp;diff=812"/>
		<updated>2025-04-25T20:10:49Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;fig&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=File:Screenshot_fig_6.png&amp;diff=811</id>
		<title>File:Screenshot fig 6.png</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=File:Screenshot_fig_6.png&amp;diff=811"/>
		<updated>2025-04-25T20:09:25Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;fig&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=File:Screenshot_fig_5.png&amp;diff=810</id>
		<title>File:Screenshot fig 5.png</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=File:Screenshot_fig_5.png&amp;diff=810"/>
		<updated>2025-04-25T20:07:37Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;fig&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=File:Screenshot_fig_4.png&amp;diff=809</id>
		<title>File:Screenshot fig 4.png</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=File:Screenshot_fig_4.png&amp;diff=809"/>
		<updated>2025-04-25T20:05:39Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;fig&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=ARGOSQC_Usage_Tutorial&amp;diff=808</id>
		<title>ARGOSQC Usage Tutorial</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=ARGOSQC_Usage_Tutorial&amp;diff=808"/>
		<updated>2025-04-25T20:03:53Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: /* Where to locate NCBI Information for the inputs: */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
&lt;br /&gt;
HIVE3 one-click pipeline tutorial. This protocol guides the user through running single and batch-mode QC computations. HIVE3 is an instance of HIVE that is not owned by the FDA; it can be modified directly by Vahan or other team members with the appropriate permissions.&lt;br /&gt;
&lt;br /&gt;
= Required User Information =&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Protocol Version&#039;&#039;&#039; &lt;br /&gt;
|1.0&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;HIVE Instance&#039;&#039;&#039;&lt;br /&gt;
|3&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;HIVE Link&#039;&#039;&#039;&lt;br /&gt;
|&amp;lt;nowiki&amp;gt;https://hive3.biochemistry.gwu.edu/dna.cgi?cmd=login&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Overview =&lt;br /&gt;
We constructed a one-click QC pipeline that takes user-specified organism information and combines the three core ARGOS workflows to produce five result datasets in JSON format (Figure 1). Three of the five result JSONs (assemblyQC, ngsQC, and biosampleMeta) have been updated and added to data.argosdb.&lt;br /&gt;
&lt;br /&gt;
To register your account, navigate to the link under “Required User Information”. At the top right there will be a “register” tab. Fill out the appropriate fields and submit. Once submitted, please email mazumder_lab@gwu.edu so we can verify your account.&lt;br /&gt;
&lt;br /&gt;
The ARGOS pipeline can be accessed via the ‘Projects’ dropdown menu at the upper right of the screen, then under ‘Argos’. The HIVE3 link is listed in the “Required User Information” section at the beginning of this protocol. &lt;br /&gt;
&lt;br /&gt;
After a successful login, you will be navigated to the home page. Use the menu at the top right corner under projects to access the ARGOS pipeline or use this URL to access the ARGOS QC pipeline on HIVE3:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;https://hive3.biochemistry.gwu.edu/dna.cgi?cmd=argos-alqc&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Screenshot figure 1.png|thumb|667x667px|&#039;&#039;&#039;Figure 1.&#039;&#039;&#039; Input settings page for the ARGOS QC pipeline. |none]]&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;Input values:&#039;&#039; ==&lt;br /&gt;
The General tab of the ARGOS Pipeline input settings page is where the data inputs for both single and batch computations are entered. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Name:&#039;&#039;&#039; Give the computation a name. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Folder:&#039;&#039;&#039; Name the folder where your computations, data, and steps will be stored.&lt;br /&gt;
&lt;br /&gt;
* Can use: letters, numbers, _, -, and &amp;amp;&lt;br /&gt;
* Cannot use: / : ; , \ “ ” ‘ ’ &lt;br /&gt;
* A / will create a subfolder, but this is not recommended; manually moving the subfolder afterward is more reliable.&lt;br /&gt;
* Example folder name: Influenza A (h5n1)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Reads:&#039;&#039;&#039; Information needed for ngsQC. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;SRR&#039;&#039;&#039;: The SRR accession number. Multiple accessions per organism can be entered, separated by commas, or by clicking the gray + sign to populate extra fields. This tool uses the NCBI SRA fasterq function to grab the fastq files directly from NCBI, so the user does not need to import them into HIVE.  &lt;br /&gt;
* &#039;&#039;&#039;HIVE reads:&#039;&#039;&#039; A dropdown menu for selecting reads already uploaded into HIVE, either from previous computations or manual uploads. &lt;br /&gt;
* See Figure 2a.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;NOTE:&#039;&#039;&#039;&#039;&#039; The algorithm automatically prefers information already in HIVE over pulling it from outside sources. If the reads are already uploaded, you do not need to use the SRR input box; use the HIVE IDs menu instead. (You can still use the SRR box if you want; it will simply search within HIVE.) See the ngsQC Protocol for how to upload SRR information using the external downloader. The external downloader process is the same as in HIVE1 and HIVE2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Reference:&#039;&#039;&#039; Information used for the assemblyQC portion of the algorithm.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Reference accession:&#039;&#039;&#039; The RefSeq or GenBank nucleotide accession number from NCBI.&lt;br /&gt;
* &#039;&#039;&#039;Assembly ID&#039;&#039;&#039;: The assembly accession number from NCBI. &lt;br /&gt;
* &#039;&#039;&#039;HIVE genome:&#039;&#039;&#039; Use the drop down menu to select a reference genome that has already been uploaded into HIVE. &lt;br /&gt;
* See Figure 2b.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;NOTE:&#039;&#039;&#039;&#039;&#039; The algorithm automatically prefers information already in HIVE over pulling it from outside sources. If the reference is already uploaded, you do not need to use the assembly ID or reference accession input box, but you can if you want to. See the AssemblyQC Protocol for how to upload assembly information using an external downloader or local upload. This process is the same as in HIVE2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Metadata: &#039;&#039;&#039; Used to grab information necessary to fill out the BiosampleMeta_HIVE document.&lt;br /&gt;
&lt;br /&gt;
* Biosample Accession: The accession number of the Biosample reported to have been used when creating the assembly; it will be linked to the SRR fastq files used for the ngsQC portion of the algorithm. This field is optional.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Coding Table:&#039;&#039;&#039; A dropdown of genetic codon tables to be used for your computation, depending on the organism. The default is “human, viral (Standard)”.&lt;br /&gt;
&lt;br /&gt;
* Tip: NCBI Taxonomy will list the codon table for each organism on their taxonomy page.&lt;br /&gt;
[[File:Screenshot figure 2 a.png|none|thumb|657x657px|&#039;&#039;&#039;Figure 2.&#039;&#039;&#039; The ARGOS_QC algorithm page input with all NCBI information for a test organism, Salmonella enterica.]]&lt;br /&gt;
[[File:Screenshot fig 2a.png|none|thumb|508x508px|&#039;&#039;&#039;Figure 2 a)&#039;&#039;&#039; The SRR accession field contains both SRR fastqs for the organism that correspond to the biosample.]]&lt;br /&gt;
[[File:Screenshot figure 2b.png|none|thumb|502x502px|&#039;&#039;&#039;Figure 2 b)&#039;&#039;&#039; The &#039;&#039;&#039;Reference Accession&#039;&#039;&#039; is the RefSeq Nucleotide accession number from NCBI, the &#039;&#039;&#039;Assembly ID&#039;&#039;&#039; is the assembly accession number from NCBI, and the Biosample Accession is the Biosample accession number from NCBI.                                         &#039;&#039;&#039;a &amp;amp; b)&#039;&#039;&#039; &#039;&#039;&#039;HIVE IDs&#039;&#039;&#039; are accessions that are already uploaded into HIVE and the algorithm will automatically select this information as opposed to pulling information from outside sources. ]]&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;Where to locate NCBI Information for the inputs:&#039;&#039; ==&lt;br /&gt;
Navigate to the NCBI legacy assembly page for your organism. Here you can find all of the information to be used for your computations.&lt;br /&gt;
[[File:Screenshot fig 3.png|none|thumb|518x518px|&#039;&#039;&#039;Figure 3.&#039;&#039;&#039; The information shown in the NCBI legacy assembly page. The RefSeq assembly accession corresponds to our organism of interest and will be used to fill out the &#039;&#039;&#039;“Assembly ID”&#039;&#039;&#039; input field on the HIVE3 ARGOS_QC input page. Please note that the bioproject matches the accession for the FDA_ARGOS bioproject, and there are 2 sequencing technologies listed, meaning that there will most likely be two SRR submissions that we can find on the SRA page (see Figure 5 and 6). ]]&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 4&#039;&#039;&#039;. The bottom section of the legacy assembly page for our test organism. The column labeled RefSeq sequence lists the DNA RefSeq for our test organism which we will use for the “&#039;&#039;&#039;Reference Accession&#039;&#039;&#039;” field on the HIVE3 ARGOS_QC input page. &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 5.&#039;&#039;&#039; Clicking on the Biosample accession number seen in Figure 3 will redirect you to the NCBI Biosample page for this organism. Under &#039;&#039;&#039;“Related Information”&#039;&#039;&#039;, click &#039;&#039;&#039;“SRA”&#039;&#039;&#039; to navigate to the NCBI SRA page for this biosample.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 6.&#039;&#039;&#039; The SRA page lists several sequencing runs, each from a different platform, typically Illumina or PacBio. This is common: the platforms provide complementary views of the assembly. Illumina produces many short reads that allow an accurate reconstruction of the genomic regions analyzed by estimating the best-fit nucleotide sequence. PacBio is a long-read sequencer that records “movies” of the DNA molecule as it moves through the instrument, capturing the sequence in one pass from start to finish. The long reads then act as a map onto which the short, accurate reads are assembled. It is therefore important to use all of the runs reported in our QC pipeline.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 7.&#039;&#039;&#039; Clicking on the bottom link under the &#039;&#039;&#039;Runs&#039;&#039;&#039; section from the SRA page shown in &#039;&#039;&#039;Figure 6&#039;&#039;&#039; will redirect to the page containing the SRR file. Copy and paste the SRR accession number into the input field in the HIVE3 pipeline labeled “SRR”. You will need to do this for both (or more) SRR accessions. Check that the bioproject, biosample, and organism name all coincide with our test organism.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Single QC Computation =&lt;br /&gt;
A single QC computation performs assemblyQC, biosampleQC, and ngsQC on one organism with one assembly, but can include multiple SRR IDs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1:&#039;&#039;&#039; Input the name of the computation and the name of the folder you wish to store the computation in. If you would like to add to a pre-existing folder, input the exact folder name in the folder input field. It is case sensitive. See input criteria at the beginning of this protocol.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 2:&#039;&#039;&#039; Under the dropdown menu for Reads, select which input you will use. &lt;br /&gt;
&lt;br /&gt;
* For SRRs, type or paste in the SRR ID. If there is more than one SRR ID, click the gray + sign to populate a new input field, or separate the IDs with a comma.&lt;br /&gt;
** Troubleshooting: if the computation fails, try removing the spaces between the commas and the SRR IDs. No spaces.&lt;br /&gt;
* For HIVE IDs, click on the HIVE ID option from the dropdown menu. Click on the gray dropdown menu arrow next to HIVE reads. A pop-up window will open, as seen in Figure 9. &lt;br /&gt;
** Click on the IDs that you wish to use in the computation. Use Ctrl + Shift to highlight multiple IDs.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 8.&#039;&#039;&#039; Input settings page for the ARGOS QC pipeline filled out for a single QC computation of our test organism, Salmonella enterica. &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 9.&#039;&#039;&#039; Pop-up window for SRR HIVE ID selection. &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&#039;&#039;&#039;Step 3:&#039;&#039;&#039; Next to Reference, select the input object you would like to use for the computation.&lt;br /&gt;
&lt;br /&gt;
* For Reference Accession, type or paste in the id you wish to use.&lt;br /&gt;
* For Assembly IDs, type or paste in the id you wish to use. &lt;br /&gt;
* For HIVE Genomes, refer to Step 2 above on how to select a HIVE id. It is the same process.&lt;br /&gt;
* Refer to the beginning of this protocol for what ids can be inputted.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4:&#039;&#039;&#039; Next under BioSample Accessions, paste in the biosample ID you would like to use for the computation. This is optional.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 5:&#039;&#039;&#039; Lastly, select from the Coding Table dropdown menu the genetic code you would like to use for your computation, if applicable. The default is “human, viral (Standard)”.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 6:&#039;&#039;&#039; Once all of the information has been inputted correctly, click on the big blue Submit button in the middle, as seen in Figure 8.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 7:&#039;&#039;&#039; Once your computation has been submitted, you can click on the Home tab found in the top left to go back to the homepage and see the workflow.&lt;br /&gt;
&lt;br /&gt;
= Batch Mode Computation =&lt;br /&gt;
Batch mode operates on a user-specified ratio of groups. Semicolons cluster the IDs, so the pipeline recognizes the IDs between each pair of semicolons as one computation; with commas separating the IDs within a cluster, the ratio for a batch-mode computation is 1:1:1, that is, one cluster of SRRs to one assembly to one biosample per computation. The example below displays the syntax for the inputs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;They would be grouped for computations like this example:&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Assembly 1: &lt;br /&gt;
&lt;br /&gt;
SRR0123456, SRR0123457, SRR0123458, SRR0123459&lt;br /&gt;
&lt;br /&gt;
GCA_0011223344.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654321, SAMN110654322&lt;br /&gt;
&lt;br /&gt;
Assembly 2: &lt;br /&gt;
&lt;br /&gt;
SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
&lt;br /&gt;
GCA_0011223345.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654323, SAMN110654324&lt;br /&gt;
&lt;br /&gt;
To separate batches, place a semicolon “;” between the IDs: commas denote separate IDs, while semicolons denote separate batches. These are entered in the General tab of the pipeline, the same as for a single computation. Within each field, this is how the above example would look in batch mode:&lt;br /&gt;
&lt;br /&gt;
SRR IDS: SRR0123456, SRR0123457, SRR0123458, SRR0123459&#039;&#039;&#039;;&#039;&#039;&#039; SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
&lt;br /&gt;
Assembly IDs: GCA_0011223344.1&#039;&#039;&#039;;&#039;&#039;&#039; GCA_0011223345.1&lt;br /&gt;
&lt;br /&gt;
BioSample Accessions: SAMN110654321, SAMN110654322&#039;&#039;&#039;;&#039;&#039;&#039; SAMN110654323, SAMN110654324&lt;br /&gt;
&lt;br /&gt;
Note the semicolon separation in the example above: the commas separate the IDs, and the semicolons separate the batches.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting Note:&#039;&#039;&#039; If your computation fails or throws an error, remove the spaces around the commas and semicolons. This previously caused an error that has since been fixed, but it is worth trying if your computation fails. The input would then look like:&lt;br /&gt;
&lt;br /&gt;
SRR IDS: SRR0123456,SRR0123457,SRR0123458,SRR0123459&#039;&#039;&#039;;&#039;&#039;&#039;SRR0123451,SRR0123452,SRR0123453,SRR0123454&lt;br /&gt;
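The grouping rule above can be sketched in code. A minimal illustration in Python (the function names are our own, not part of HIVE; the pipeline does this parsing server-side):

```python
def parse_batch_field(field):
    """Split a batch-mode input field into computations.

    Semicolons separate computations (batches); commas separate
    IDs within one computation. Surrounding whitespace is stripped
    defensively, since stray spaces have caused failures.
    """
    return [
        [item.strip() for item in batch.split(",") if item.strip()]
        for batch in field.split(";")
        if batch.strip()
    ]


def check_ratio(*fields):
    """Verify the 1:1:1 grouping: every input field must describe
    the same number of computations."""
    counts = {len(f) for f in fields}
    if len(counts) != 1:
        raise ValueError("fields describe different numbers of computations")
    return counts.pop()
```

For the example above, parse_batch_field yields two batches of four SRR IDs each, and check_ratio confirms the SRR, assembly, and biosample fields all describe two computations.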
&lt;br /&gt;
&#039;&#039;&#039;Step 1:&#039;&#039;&#039; Navigate to the tab titled Batch (Figure 10). It can be found on the ARGOS input settings page (Figure 1), next to the General tab.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 2:&#039;&#039;&#039; For the parameter “batch service” at the bottom, select batch mode from the dropdown menu. This sets the pipeline to Batch Mode rather than single computations.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 10.&#039;&#039;&#039; Batch mode input settings window.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&#039;&#039;&#039;Step 3:&#039;&#039;&#039; Select the parameters. Click on the dropdown menu next to the text “Parameter list”.&lt;br /&gt;
&lt;br /&gt;
* Use the black plus button next to ‘Parameter List’ to populate an entry field.&lt;br /&gt;
* Select from the dropdown field the correct parameter based on the input field you used in the general input page. This can be seen in Figure 11.&lt;br /&gt;
* For example, if you pasted in SRR IDs you would choose the parameter SRR IDs. If you used HIVE IDs you would select HIVE IDs from the dropdown.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 11.&#039;&#039;&#039; Dropdown menu from the Batch tab displaying the parameter options. Select the batch parameter option based on what the input information is in the general tab. For example, if you entered SRR ids, select SRR IDs from the dropdown. If you input a reference ID, select Reference IDs.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&#039;&#039;&#039;Step 4:&#039;&#039;&#039; Input the ratio for the batch service. &lt;br /&gt;
&lt;br /&gt;
* For computations in batch mode in the one-click pipeline, the computations are separated by semicolon “;” and the IDs within the computations by a comma “,”. Since the workflow will parse the computations and recognize the IDs between the “;” as one computation, the ratio will be 1:1:1. &lt;br /&gt;
* If the ratio is 1:1:1 then enter the value 1 for each box.&lt;br /&gt;
* One set of SRRs to one assembly to one biosample. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 5:&#039;&#039;&#039; Input the information correctly in each field. Navigate back to the input settings page (Figure 1). This is the same page used for single computations; the only difference is the semicolons and commas. The example below shows how the information is entered in batch mode.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;Batch Mode Parameter breakdown:&#039;&#039; ==&lt;br /&gt;
Assembly 1: &lt;br /&gt;
&lt;br /&gt;
SRR0123456, SRR0123457, SRR0123458, SRR0123459&lt;br /&gt;
&lt;br /&gt;
GCA_0011223344.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654321, SAMN110654322&lt;br /&gt;
&lt;br /&gt;
Assembly 2: &lt;br /&gt;
&lt;br /&gt;
SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
&lt;br /&gt;
GCA_0011223345.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654323, SAMN110654324&lt;br /&gt;
&lt;br /&gt;
Again, this is very similar to single computations, except that batch mode uses semicolons and commas to separate the IDs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 6:&#039;&#039;&#039; Once all of the input information is complete, hit the blue button Submit. You may exit the Argos pipeline window by hitting “Home” on the top left corner.&lt;br /&gt;
&lt;br /&gt;
= QC Computation Results =&lt;br /&gt;
Once you have submitted your computations, either single or batch, you should see the pipeline workflow in your Inbox or All Objects, as shown in Figure 12. You can also view the pipeline by clicking on the “workflows” tab also seen in Figure 12.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 12.&#039;&#039;&#039;  The pipeline workflow displayed in the user’s inbox.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
As the workflow progresses, your computations will be stored in the folder that you named from the beginning of this protocol. To view the contents of the folder, simply click on the plus sign next to the folder or the folder name to open.&lt;br /&gt;
&lt;br /&gt;
Once your computations are complete, the QC outputs are stored in JSON format under the computation “&#039;&#039;&#039;Post-Alignment Quality Controls&#039;&#039;&#039;” or under the &#039;&#039;&#039;“CFlow”&#039;&#039;&#039; workflow. Post-Alignment QC can be found in the folder you specified for the computation; CFlow is under All Objects. To view the JSONs, click on the name so that it is highlighted blue, then click the tab in the bottom menu named “Available Downloads”.&lt;br /&gt;
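Once downloaded, the JSON outputs can also be inspected programmatically. A minimal sketch in Python (the directory path is hypothetical; the file stems follow the output names used in this protocol, e.g. qcAll, qcNGS):

```python
import json
from pathlib import Path


def load_qc_outputs(results_dir):
    """Load every JSON file downloaded from the 'Available Downloads'
    tab into a dict keyed by file stem, e.g. 'qcAll' or 'qcNGS'."""
    outputs = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        with path.open() as fh:
            outputs[path.stem] = json.load(fh)
    return outputs


# Example (hypothetical folder name):
# outputs = load_qc_outputs("argos_qc_downloads")
# assembly_qc = outputs.get("qcAll")
```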
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 13.&#039;&#039;&#039; The available downloads tab and the 5 JSON files that are the QC outputs.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
There will be 5 files in JSON format. Click the blue/green download icon next to each file to see the results. The file labeled &#039;&#039;&#039;qcAll.json&#039;&#039;&#039; contains the assemblyQC results, &#039;&#039;&#039;qcNGS.json&#039;&#039;&#039; the ngsQC results, and &#039;&#039;&#039;biosample.json&#039;&#039;&#039; the biosample information. We currently do not submit qcPos.json or refAnnot.json to the ARGOS DB, but the information is there to help you better understand your computation.&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=File:Screenshot_fig_3.png&amp;diff=807</id>
		<title>File:Screenshot fig 3.png</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=File:Screenshot_fig_3.png&amp;diff=807"/>
		<updated>2025-04-25T20:03:06Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;fig&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=ARGOSQC_Usage_Tutorial&amp;diff=806</id>
		<title>ARGOSQC Usage Tutorial</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=ARGOSQC_Usage_Tutorial&amp;diff=806"/>
		<updated>2025-04-25T19:58:31Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: /* Input values: */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
&lt;br /&gt;
HIVE3 one-click pipeline tutorial for the FDA HIVE instance. This protocol guides the user in running single and batch-mode QC computations. HIVE3 is an instance of HIVE that is not owned by the FDA and can be directly modified by Vahan or others on our team with permissions.&lt;br /&gt;
&lt;br /&gt;
= Required User Information =&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Protocol Version&#039;&#039;&#039; &lt;br /&gt;
|1.0&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;HIVE Instance&#039;&#039;&#039;&lt;br /&gt;
|3&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;HIVE Link&#039;&#039;&#039;&lt;br /&gt;
|&amp;lt;nowiki&amp;gt;https://hive3.biochemistry.gwu.edu/dna.cgi?cmd=login&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Overview =&lt;br /&gt;
We constructed a QC one-click pipeline that takes user-specified organism information and combines the 3 core ARGOS workflows to produce 5 result datasets in JSON format (Figure 1). Three of the five result JSONs (assemblyQC, ngsQC, and biosampleMeta) have been updated and added to data.argosdb.&lt;br /&gt;
&lt;br /&gt;
To register your account, navigate to the link under the “Required User Information”. At the top right there will be a tab saying “register”. Fill out the appropriate fields and submit. Once submitted, please email mazumder_lab@gwu.edu so we can verify your account.&lt;br /&gt;
&lt;br /&gt;
The ARGOS pipeline can be accessed via the dropdown menu ‘Projects’ at the upper right of the screen, then under ‘Argos’. The link to the pipeline in HIVE3 is given in the “Required User Information” section at the beginning of this protocol.&lt;br /&gt;
&lt;br /&gt;
After a successful login, you will be navigated to the home page. Use the menu at the top right corner under projects to access the ARGOS pipeline or use this URL to access the ARGOS QC pipeline on HIVE3:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;https://hive3.biochemistry.gwu.edu/dna.cgi?cmd=argos-alqc&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Screenshot figure 1.png|thumb|667x667px|&#039;&#039;&#039;Figure 1.&#039;&#039;&#039; Input settings page for the ARGOS QC pipeline. |none]]&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;Input values:&#039;&#039; ==&lt;br /&gt;
On the ARGOS Pipeline input settings page, under the General tab, is where the data inputs for single and batch computations are entered.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Name:&#039;&#039;&#039; Give the computation a name. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Folder:&#039;&#039;&#039; Give the folder where your computations, data, and steps will be stored.&lt;br /&gt;
&lt;br /&gt;
* Can use: _ or - or &amp;amp;, plus letters and numbers&lt;br /&gt;
* Cannot use: / : ; , \ “ ” ‘ ’&lt;br /&gt;
* You can use / to create a subfolder, but that is not recommended; manually moving the subfolder is best.&lt;br /&gt;
* Example folder: Influenza A (h5n1)&lt;br /&gt;
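As an illustration of these naming rules, a small Python check (the function name is our own; HIVE performs its own validation, and spaces and parentheses are assumed allowed based on the example folder name):

```python
import re

# Whitelist based on this protocol's rules: letters, numbers, _, -, &;
# spaces and parentheses are assumed allowed given the example
# folder name "Influenza A (h5n1)". The slash is excluded because
# it silently creates a subfolder.
_ALLOWED = re.compile(r"[A-Za-z0-9 _&()\-]+")


def folder_name_ok(name):
    """Return True if the folder name is non-empty and every
    character is on the whitelist above."""
    return bool(name) and _ALLOWED.fullmatch(name) is not None
```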
&lt;br /&gt;
&#039;&#039;&#039;Reads:&#039;&#039;&#039; information needed for ngsQC &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;SRR&#039;&#039;&#039;: the SRR accession number. There can be multiple per organism: separate them with a “,” or populate extra fields by clicking on the gray + sign. This tool uses the NCBI SRA Fasterq function to grab the fastq files directly from NCBI without the user needing to import them into HIVE.&lt;br /&gt;
* &#039;&#039;&#039;HIVE reads:&#039;&#039;&#039; Drop down menu can select reads already uploaded into HIVE, either from previous computations or manual uploads. &lt;br /&gt;
* See in Figure 2a&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;NOTE:&#039;&#039;&#039;&#039;&#039; The algorithm will automatically prefer information already in HIVE over pulling it from outside sources. If the reads are already uploaded, you do not need to use the SRR input box (use the HIVE IDs menu instead), but you can if you want to; the pipeline will simply search within HIVE first. See the ngsQC Protocol for how to upload SRR information using the external downloader. This external downloader process is the same as in HIVE2 and HIVE1.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Reference:&#039;&#039;&#039; Information used for the assemblyQC portion of the algorithm.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Reference accession:&#039;&#039;&#039; This is the REFSEQ or Genbank accession number from NCBI or Genbank.&lt;br /&gt;
* &#039;&#039;&#039;Assembly ID&#039;&#039;&#039;: This is the ASSEMBLY accession number from NCBI. &lt;br /&gt;
* &#039;&#039;&#039;HIVE genome:&#039;&#039;&#039; Use the drop down menu to select a reference genome that has already been uploaded into HIVE. &lt;br /&gt;
* See in Figure 2b&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;NOTE:&#039;&#039;&#039;&#039;&#039; The algorithm will automatically prefer information already in HIVE over pulling it from outside sources. If the assembly is already uploaded, you do not need to use the Assembly ID or Reference Accession input box, but you can if you want to. See the AssemblyQC Protocol for how to upload assembly information using an external downloader or a local upload. This process is the same as what is done in HIVE2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Metadata: &#039;&#039;&#039; Used to grab information necessary to fill out the BiosampleMeta_HIVE document.&lt;br /&gt;
&lt;br /&gt;
* Biosample Accession: the accession number for the biosample reported to have been used when creating the assembly; it will be linked to the SRR fastq files used for the ngsQC portion of the algorithm. This step is optional.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Coding Table:&#039;&#039;&#039; Dropdown of genetic codon tables to be used for your computation, depending on the organism to be computed. The default is human, viral (Standard).&lt;br /&gt;
&lt;br /&gt;
* Tip: NCBI Taxonomy will list the codon table for each organism on their taxonomy page.&lt;br /&gt;
[[File:Screenshot figure 2 a.png|none|thumb|657x657px|&#039;&#039;&#039;Figure 2.&#039;&#039;&#039; The ARGOS_QC algorithm page input with all NCBI information for a test organism, Salmonella enterica.]]&lt;br /&gt;
[[File:Screenshot fig 2a.png|none|thumb|508x508px|&#039;&#039;&#039;Figure 2 a)&#039;&#039;&#039; The SRR accession field contains both SRR fastqs for the organism that correspond to the biosample.]]&lt;br /&gt;
[[File:Screenshot figure 2b.png|none|thumb|502x502px|&#039;&#039;&#039;Figure 2 b)&#039;&#039;&#039; The &#039;&#039;&#039;Reference Accession&#039;&#039;&#039; is the RefSeq Nucleotide accession number from NCBI, the &#039;&#039;&#039;Assembly ID&#039;&#039;&#039; is the assembly accession number from NCBI, and the Biosample Accession is the Biosample accession number from NCBI.                                         &#039;&#039;&#039;a &amp;amp; b)&#039;&#039;&#039; &#039;&#039;&#039;HIVE IDs&#039;&#039;&#039; are accessions that are already uploaded into HIVE and the algorithm will automatically select this information as opposed to pulling information from outside sources. ]]&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;Where to locate NCBI Information for the inputs:&#039;&#039; ==&lt;br /&gt;
Navigate to the NCBI legacy assembly page for your organism. Here you can find all of the information to be used for your computations.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 3.&#039;&#039;&#039; The information shown on the NCBI legacy assembly page. The RefSeq assembly accession corresponds to our organism of interest and will be used to fill out the &#039;&#039;&#039;“Assembly ID”&#039;&#039;&#039; input field on the HIVE3 ARGOS_QC input page. Note that the bioproject matches the accession for the FDA_ARGOS bioproject, and that two sequencing technologies are listed, meaning there will most likely be two SRR submissions to find on the SRA page (see Figures 5 and 6).&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 4&#039;&#039;&#039;. The bottom section of the legacy assembly page for our test organism. The column labeled RefSeq sequence lists the DNA RefSeq for our test organism which we will use for the “&#039;&#039;&#039;Reference Accession&#039;&#039;&#039;” field on the HIVE3 ARGOS_QC input page. &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 5.&#039;&#039;&#039; Clicking on the Biosample accession number seen in Figure 3 will redirect you to the NCBI Biosample page for this organism. Under &#039;&#039;&#039;“Related Information”&#039;&#039;&#039;, click &#039;&#039;&#039;“SRA”&#039;&#039;&#039; to navigate to the NCBI SRA page for this biosample.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 6.&#039;&#039;&#039; The SRA page lists several sequencing runs, each produced on a different platform (Illumina or PacBio). This is common: the platforms are combined to view the assembly from different perspectives and levels. Illumina sequences DNA as many short reads that can be used to create an accurate reconstruction of the genomic sections analyzed by estimating the best-fit nucleotide sequence. PacBio is a long-read sequencer that takes “movies” of the DNA as it moves through the instrument, capturing each sequence in one pass from start to finish. The long reads then act as a map onto which the accurate short reads are assembled. It is therefore important to use all of the links reported in our QC pipeline.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 7.&#039;&#039;&#039; Clicking on the bottom link under the &#039;&#039;&#039;Runs&#039;&#039;&#039; section from the SRA page shown in &#039;&#039;&#039;Figure 6&#039;&#039;&#039; will redirect to the page containing the SRR file. Copy and paste the SRR accession number into the input field in the HIVE3 pipeline labeled “SRR”. You will need to do this for both (or more) SRR accessions. Check that the bioproject, biosample, and organism name all coincide with our test organism.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Single QC Computation =&lt;br /&gt;
A single QC computation performs assemblyQC, biosampleQC, and ngsQC on one organism with one assembly, but can include multiple SRR IDs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1:&#039;&#039;&#039; Input the name of the computation and the name of the folder you wish to store the computation in. If you would like to add to a pre-existing folder, input the exact folder name in the folder input field. It is case sensitive. See input criteria at the beginning of this protocol.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 2:&#039;&#039;&#039; Under the dropdown menu for Reads, select which input you will use. &lt;br /&gt;
&lt;br /&gt;
* For SRRs, type or paste in the SRR ID. If there is more than one SRR ID, click the gray + sign to populate a new input field, or separate the IDs with a comma.&lt;br /&gt;
** Troubleshooting: if the computation fails, try removing the spaces between the commas and the SRR IDs. No spaces.&lt;br /&gt;
* For HIVE IDs, click on the HIVE ID option from the dropdown menu. Click on the gray dropdown menu arrow next to HIVE reads. A pop-up window will open, as seen in Figure 9. &lt;br /&gt;
** Click on the IDs that you wish to use in the computation. Use Ctrl + Shift to highlight multiple IDs.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 8.&#039;&#039;&#039; Input settings page for the ARGOS QC pipeline filled out for a single QC computation of our test organism, Salmonella enterica. &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 9.&#039;&#039;&#039; Pop-up window for SRR HIVE ID selection. &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&#039;&#039;&#039;Step 3:&#039;&#039;&#039; Next to Reference, select the input object you would like to use for the computation.&lt;br /&gt;
&lt;br /&gt;
* For Reference Accession, type or paste in the id you wish to use.&lt;br /&gt;
* For Assembly IDs, type or paste in the id you wish to use. &lt;br /&gt;
* For HIVE Genomes, refer to Step 2 above on how to select a HIVE id. It is the same process.&lt;br /&gt;
* Refer to the beginning of this protocol for what ids can be inputted.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4:&#039;&#039;&#039; Next under BioSample Accessions, paste in the biosample ID you would like to use for the computation. This is optional.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 5:&#039;&#039;&#039; Lastly, select from the Coding Table dropdown menu the genetic code you would like to use for your computation, if applicable. The default is “human, viral (Standard)”.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 6:&#039;&#039;&#039; Once all of the information has been inputted correctly, click on the big blue Submit button in the middle, as seen in Figure 8.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 7:&#039;&#039;&#039; Once your computation has been submitted, you can click on the Home tab found in the top left to go back to the homepage and see the workflow.&lt;br /&gt;
&lt;br /&gt;
= Batch Mode Computation =&lt;br /&gt;
Batch mode operates on a user-specified ratio of groups. Semicolons cluster the IDs, so the pipeline recognizes the IDs between each pair of semicolons as one computation; with commas separating the IDs within a cluster, the ratio for a batch-mode computation is 1:1:1, that is, one cluster of SRRs to one assembly to one biosample per computation. The example below displays the syntax for the inputs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;They would be grouped for computations like this example:&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Assembly 1: &lt;br /&gt;
&lt;br /&gt;
SRR0123456, SRR0123457, SRR0123458, SRR0123459&lt;br /&gt;
&lt;br /&gt;
GCA_0011223344.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654321, SAMN110654322&lt;br /&gt;
&lt;br /&gt;
Assembly 2: &lt;br /&gt;
&lt;br /&gt;
SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
&lt;br /&gt;
GCA_0011223345.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654323, SAMN110654324&lt;br /&gt;
&lt;br /&gt;
To separate batches, place a semicolon “;” between the IDs: commas denote separate IDs, while semicolons denote separate batches. These are entered in the General tab of the pipeline, the same as for a single computation. Within each field, this is how the above example would look in batch mode:&lt;br /&gt;
&lt;br /&gt;
SRR IDS: SRR0123456, SRR0123457, SRR0123458, SRR0123459&#039;&#039;&#039;;&#039;&#039;&#039; SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
&lt;br /&gt;
Assembly IDs: GCA_0011223344.1&#039;&#039;&#039;;&#039;&#039;&#039; GCA_0011223345.1&lt;br /&gt;
&lt;br /&gt;
BioSample Accessions: SAMN110654321, SAMN110654322&#039;&#039;&#039;;&#039;&#039;&#039; SAMN110654323, SAMN110654324&lt;br /&gt;
&lt;br /&gt;
Note the semicolon separation in the example above: the commas separate the IDs, and the semicolons separate the batches.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting Note:&#039;&#039;&#039; If your computation fails or throws an error, remove the spaces around the commas and semicolons. This previously caused an error that has since been fixed, but it is worth trying if your computation fails. The input would then look like:&lt;br /&gt;
&lt;br /&gt;
SRR IDS: SRR0123456,SRR0123457,SRR0123458,SRR0123459&#039;&#039;&#039;;&#039;&#039;&#039;SRR0123451,SRR0123452,SRR0123453,SRR0123454&lt;br /&gt;
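The grouping rule above can be sketched in code. A minimal illustration in Python (the function names are our own, not part of HIVE; the pipeline does this parsing server-side):

```python
def parse_batch_field(field):
    """Split a batch-mode input field into computations.

    Semicolons separate computations (batches); commas separate
    IDs within one computation. Surrounding whitespace is stripped
    defensively, since stray spaces have caused failures.
    """
    return [
        [item.strip() for item in batch.split(",") if item.strip()]
        for batch in field.split(";")
        if batch.strip()
    ]


def check_ratio(*fields):
    """Verify the 1:1:1 grouping: every input field must describe
    the same number of computations."""
    counts = {len(f) for f in fields}
    if len(counts) != 1:
        raise ValueError("fields describe different numbers of computations")
    return counts.pop()
```

For the example above, parse_batch_field yields two batches of four SRR IDs each, and check_ratio confirms the SRR, assembly, and biosample fields all describe two computations.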
&lt;br /&gt;
&#039;&#039;&#039;Step 1:&#039;&#039;&#039; Navigate to the tab titled Batch (Figure 10). It can be found on the ARGOS input settings page (Figure 1), next to the General tab.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 2:&#039;&#039;&#039; For the parameter “batch service” at the bottom, select batch mode from the dropdown menu. This sets the pipeline to Batch Mode rather than single computations.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 10.&#039;&#039;&#039; Batch mode input settings window.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&#039;&#039;&#039;Step 3:&#039;&#039;&#039; Select the parameters. Click on the dropdown menu next to the text “Parameter list”.&lt;br /&gt;
&lt;br /&gt;
* Use the black plus button next to ‘Parameter List’ to populate an entry field.&lt;br /&gt;
* Select from the dropdown field the correct parameter based on the input field you used in the general input page. This can be seen in Figure 11.&lt;br /&gt;
* For example, if you pasted in SRR IDs you would choose the parameter SRR IDs. If you used HIVE IDs you would select HIVE IDs from the dropdown.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 11.&#039;&#039;&#039; Dropdown menu from the Batch tab displaying the parameter options. Select the batch parameter option based on what the input information is in the general tab. For example, if you entered SRR ids, select SRR IDs from the dropdown. If you input a reference ID, select Reference IDs.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&#039;&#039;&#039;Step 4:&#039;&#039;&#039; Input the ratio for the batch service. &lt;br /&gt;
&lt;br /&gt;
* For computations in batch mode in the one-click pipeline, the computations are separated by semicolon “;” and the IDs within the computations by a comma “,”. Since the workflow will parse the computations and recognize the IDs between the “;” as one computation, the ratio will be 1:1:1. &lt;br /&gt;
* If the ratio is 1:1:1 then enter the value 1 for each box.&lt;br /&gt;
* One set of SRRs to one assembly to one biosample. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 5:&#039;&#039;&#039; Input the information correctly in each field. Navigate back to the input settings page (Figure 1). This is the same page used for single computations; the only difference is the semicolons and commas. The example below shows how the information is entered in batch mode.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;Batch Mode Parameter breakdown:&#039;&#039; ==&lt;br /&gt;
Assembly 1: &lt;br /&gt;
&lt;br /&gt;
SRR0123456, SRR0123457, SRR0123458, SRR0123459&lt;br /&gt;
&lt;br /&gt;
GCA_0011223344.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654321, SAMN110654322&lt;br /&gt;
&lt;br /&gt;
Assembly 2: &lt;br /&gt;
&lt;br /&gt;
SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
&lt;br /&gt;
GCA_0011223345.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654323, SAMN110654324&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;In the input fields, the example above is entered as:&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
SRR IDs: SRR0123456, SRR0123457, SRR0123458, SRR0123459&#039;&#039;&#039;;&#039;&#039;&#039; SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
&lt;br /&gt;
Assembly IDs: GCA_0011223344.1&#039;&#039;&#039;;&#039;&#039;&#039; GCA_0011223345.1&lt;br /&gt;
&lt;br /&gt;
BioSample Accessions: SAMN110654321, SAMN110654322&#039;&#039;&#039;;&#039;&#039;&#039; SAMN110654323, SAMN110654324&lt;br /&gt;
&lt;br /&gt;
Again, this is very similar to a single computation, except that batch mode uses semicolons and commas to separate the IDs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 6:&#039;&#039;&#039; Once all of the input information is complete, click the blue Submit button. You may exit the ARGOS pipeline window by clicking “Home” in the top left corner.&lt;br /&gt;
&lt;br /&gt;
= QC Computation Results =&lt;br /&gt;
Once you have submitted your computations, either single or batch, you should see the pipeline workflow in your Inbox or All Objects, as shown in Figure 12. You can also view the pipeline by clicking on the “Workflows” tab, also shown in Figure 12.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 12.&#039;&#039;&#039;  The pipeline workflow displayed in the user’s inbox.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
As the workflow progresses, your computations are stored in the folder you named at the beginning of this protocol. To view the contents of the folder, click the plus sign next to the folder, or click the folder name to open it.&lt;br /&gt;
&lt;br /&gt;
Once your computations are complete, the QC outputs are stored in JSON format under the computation “&#039;&#039;&#039;Post-Alignment Quality Controls&#039;&#039;&#039;” or under the &#039;&#039;&#039;“CFlow”&#039;&#039;&#039; workflow. Post-Alignment QC can be found in the folder you specified for the computation; CFlow appears in All Objects. To view the JSONs, click on the name so that it is highlighted blue, then click the “Available Downloads” tab in the bottom menu.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 13.&#039;&#039;&#039; The available downloads tab and the 5 JSON files that are the QC outputs.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
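After downloading, the output files can be inspected locally. A minimal Python sketch, assuming only the five file names given in this protocol (their internal structure is not specified here, so the sketch just decodes the JSON):&lt;br /&gt;
&lt;br /&gt;
```python
# Load whichever of the five QC output JSONs are present in a local
# download directory. The file names come from this protocol; the
# structure of each file is left unparsed (we only decode the JSON).
import json
from pathlib import Path

QC_FILES = ["qcAll.json", "qcNGS.json", "biosample.json",
            "qcPos.json", "refAnnot.json"]

def load_qc_outputs(download_dir):
    """Return a dict mapping file name to its decoded JSON content."""
    outputs = {}
    for name in QC_FILES:
        path = Path(download_dir) / name
        if path.is_file():
            outputs[name] = json.loads(path.read_text())
    return outputs
```
&lt;br /&gt;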
Five files are reported in JSON format. Click the blue/green download icon next to each file to view the results. The file labeled &#039;&#039;&#039;qcAll.json&#039;&#039;&#039; contains the assemblyQC results, &#039;&#039;&#039;qcNGS.json&#039;&#039;&#039; the ngsQC results, and &#039;&#039;&#039;biosample.json&#039;&#039;&#039; the biosample information. We currently do not submit qcPos.json or refAnnot.json to the ARGOS DB, but that information is provided to help you better understand your computation.&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=File:Screenshot_figure_2b.png&amp;diff=805</id>
		<title>File:Screenshot figure 2b.png</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=File:Screenshot_figure_2b.png&amp;diff=805"/>
		<updated>2025-04-25T19:57:05Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;fig&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=File:Screenshot_fig_2a.png&amp;diff=804</id>
		<title>File:Screenshot fig 2a.png</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=File:Screenshot_fig_2a.png&amp;diff=804"/>
		<updated>2025-04-25T19:55:37Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;fig&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=File:Screenshot_figure_2_a.png&amp;diff=802</id>
		<title>File:Screenshot figure 2 a.png</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=File:Screenshot_figure_2_a.png&amp;diff=802"/>
		<updated>2025-04-25T19:53:07Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;fig&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=ARGOSQC_Usage_Tutorial&amp;diff=800</id>
		<title>ARGOSQC Usage Tutorial</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=ARGOSQC_Usage_Tutorial&amp;diff=800"/>
		<updated>2025-04-25T19:49:37Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
&lt;br /&gt;
HIVE3 one-click pipeline tutorial for the ARGOS QC pipeline. This protocol guides the user through running single and batch-mode QC computations. HIVE3 is an instance of HIVE that is not owned by the FDA and can be modified directly by Vahan or other team members with permissions.&lt;br /&gt;
&lt;br /&gt;
= Required User Information =&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Protocol Version&#039;&#039;&#039; &lt;br /&gt;
|1.0&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;HIVE Instance&#039;&#039;&#039;&lt;br /&gt;
|3&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;HIVE Link&#039;&#039;&#039;&lt;br /&gt;
|&amp;lt;nowiki&amp;gt;https://hive3.biochemistry.gwu.edu/dna.cgi?cmd=login&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Overview =&lt;br /&gt;
We constructed a QC one-click pipeline that takes user-specified organism information and combines the 3 core ARGOS workflows to produce 5 result datasets in JSON format (Figure 1). 3 of the 5 result JSONs (assemblyQC, ngsQC, and biosampleMeta) have been updated and added to data.argosdb.&lt;br /&gt;
&lt;br /&gt;
To register your account, navigate to the link under “Required User Information”. At the top right there will be a “Register” tab. Fill out the appropriate fields and submit. Once submitted, please email mazumder_lab@gwu.edu so we can verify your account.&lt;br /&gt;
&lt;br /&gt;
The ARGOS pipeline can be accessed via the ‘Projects’ dropdown menu at the upper right of the screen, then under ‘Argos’. The link to the pipeline in HIVE3 is listed in the “Required User Information” section at the beginning of this protocol. &lt;br /&gt;
&lt;br /&gt;
After a successful login, you will be navigated to the home page. Use the menu at the top right corner under projects to access the ARGOS pipeline or use this URL to access the ARGOS QC pipeline on HIVE3:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;https://hive3.biochemistry.gwu.edu/dna.cgi?cmd=argos-alqc&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Screenshot figure 1.png|left|thumb|723x723px|&#039;&#039;&#039;Figure 1.&#039;&#039;&#039; Input settings page for the ARGOS QC pipeline. ]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;Input values:&#039;&#039; ==&lt;br /&gt;
Data inputs for both single and batch computations are entered on the ARGOS pipeline input settings page, under the General tab. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Name:&#039;&#039;&#039; Give the computation a name. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Folder:&#039;&#039;&#039; Name the folder where your computations, data, and steps will be stored.&lt;br /&gt;
&lt;br /&gt;
* Can use: underscores (_), hyphens (-), ampersands (&amp;amp;), letters, and numbers&lt;br /&gt;
* Cannot use: / : ; , \ “ ” ‘ ’ &lt;br /&gt;
* A / will create a subfolder, but that is not recommended; manually moving the subfolder afterward is best.&lt;br /&gt;
* Ex folder: Influenza A (h5n1)&lt;br /&gt;
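A quick validity check for folder names, sketched from the rules above. The forbidden-character list comes from this protocol; treating “/” as disallowed here follows the recommendation against creating subfolders.&lt;br /&gt;
&lt;br /&gt;
```python
# Check a folder name against the characters this protocol forbids.
# "/" technically creates a subfolder but is discouraged, so it is
# rejected here as well. Spaces and parentheses are fine, as the
# example folder name "Influenza A (h5n1)" shows.
FORBIDDEN = set('/:;,\\') | set('“”‘’')

def folder_name_ok(name):
    """Return True if the name avoids every forbidden character."""
    return bool(name) and FORBIDDEN.isdisjoint(name)

assert folder_name_ok("Influenza A (h5n1)")
assert not folder_name_ok("flu/h5n1")   # would create a subfolder
```
&lt;br /&gt;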
&lt;br /&gt;
&#039;&#039;&#039;Reads:&#039;&#039;&#039; information needed for ngsQC &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;SRR&#039;&#039;&#039;: the SRR accession number. Multiple accessions per organism can be entered by separating them with a comma (“,”) or by clicking the gray + sign to populate extra fields. This tool uses the NCBI SRA fasterq function to fetch the FASTQ files directly from NCBI, so the user does not need to import them into HIVE.  &lt;br /&gt;
* &#039;&#039;&#039;HIVE reads:&#039;&#039;&#039; A dropdown menu for selecting reads already uploaded into HIVE, either from previous computations or manual uploads. &lt;br /&gt;
* See Figure 2a&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;NOTE:&#039;&#039;&#039;&#039;&#039; The algorithm automatically selects information already in HIVE rather than pulling it from outside sources. If the reads are already uploaded, you do not need to use the SRR input box; use the HIVE IDs menu instead (the SRR box will still work, but it will simply search within HIVE). See the ngsQC Protocol for how to upload SRR information using the external downloader; this process is the same as in HIVE1 and HIVE2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Reference:&#039;&#039;&#039; Information used for the assemblyQC portion of the algorithm.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Reference accession:&#039;&#039;&#039; the RefSeq or GenBank accession number from NCBI.&lt;br /&gt;
* &#039;&#039;&#039;Assembly ID&#039;&#039;&#039;: the ASSEMBLY accession number from NCBI. &lt;br /&gt;
* &#039;&#039;&#039;HIVE genome:&#039;&#039;&#039; Use the dropdown menu to select a reference genome that has already been uploaded into HIVE. &lt;br /&gt;
* See Figure 2b&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;NOTE:&#039;&#039;&#039;&#039;&#039; The algorithm automatically selects information already in HIVE rather than pulling it from outside sources. If the assembly is already uploaded, you do not need to use the Assembly ID or Reference Accession input box, but you can if you want to. See the AssemblyQC Protocol for how to upload assembly information using an external downloader or local upload; this process is the same as in HIVE2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Metadata: &#039;&#039;&#039; Used to grab information necessary to fill out the BiosampleMeta_HIVE document.&lt;br /&gt;
&lt;br /&gt;
* Biosample Accession: the accession number for the BioSample reported to have been used when creating the assembly; it will be linked to the SRR FASTQ files used for the ngsQC portion of the algorithm. This step is optional.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Coding Table:&#039;&#039;&#039; Dropdown of genetic codon tables to be used for your computation, depending on the organism to be computed. The default is human, viral (Standard).&lt;br /&gt;
&lt;br /&gt;
* Tip: NCBI Taxonomy will list the codon table for each organism on their taxonomy page.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 2.&#039;&#039;&#039; The ARGOS_QC algorithm page input with all NCBI information for a test organism, Salmonella enterica. a) The &#039;&#039;&#039;SRR accession&#039;&#039;&#039; field contains both SRR fastqs for the organism that correspond to the biosample. &lt;br /&gt;
&lt;br /&gt;
b) The &#039;&#039;&#039;Reference Accession&#039;&#039;&#039; is the RefSeq Nucleotide accession number from NCBI, the &#039;&#039;&#039;Assembly ID&#039;&#039;&#039; is the assembly accession number from NCBI, and the Biosample Accession is the Biosample accession number from NCBI.&lt;br /&gt;
&lt;br /&gt;
a &amp;amp; b) &#039;&#039;&#039;HIVE IDs&#039;&#039;&#039; are accessions that are already uploaded into HIVE and the algorithm will automatically select this information as opposed to pulling information from outside sources. &lt;br /&gt;
&lt;br /&gt;
Almost all of this information can be found in the legacy assembly page for this organism, shown in &#039;&#039;&#039;Figure 3, 4, 5, 6, 7&#039;&#039;&#039;. &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;a)&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;b)&#039;&#039;&#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;Where to locate NCBI Information for the inputs:&#039;&#039; ==&lt;br /&gt;
Navigate to the NCBI legacy assembly page for your organism. Here you can find all of the information to be used for your computations.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 3.&#039;&#039;&#039; The information shown in the NCBI legacy assembly page. The RefSeq assembly accession corresponds to our organism of interest and will be used to fill out the &#039;&#039;&#039;“Assembly ID”&#039;&#039;&#039; input field on the HIVE3 ARGOS_QC input page. Please note that the bioproject matches the accession for the FDA_ARGOS bioproject, and there are 2 sequencing technologies listed, meaning that there will most likely be two SRR submissions that we can find on the SRA page (see Figure 5 and 6). &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 4&#039;&#039;&#039;. The bottom section of the legacy assembly page for our test organism. The column labeled RefSeq sequence lists the DNA RefSeq for our test organism which we will use for the “&#039;&#039;&#039;Reference Accession&#039;&#039;&#039;” field on the HIVE3 ARGOS_QC input page. &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 5.&#039;&#039;&#039; Clicking on the Biosample accession number seen in Figure 3 will redirect you to the NCBI Biosample page for this organism. Under &#039;&#039;&#039;“Related Information”&#039;&#039;&#039;, click &#039;&#039;&#039;“SRA”&#039;&#039;&#039; to navigate to the NCBI SRA page for this biosample.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 6.&#039;&#039;&#039; The SRA page lists different sequencing runs. Each run was sequenced on a different platform, either Illumina or PacBio; this is common. Using different platforms provides insight into the assembly at different levels. Illumina sequences DNA as many short reads that can be used to create an accurate reconstruction of the genomic sections analyzed by estimating the average/best-fit nucleotide sequence. PacBio is a long-read sequencer that captures each DNA sequence in one pass from start to finish; the long reads then act as a map for assembling the accurate short reads. Therefore, it is important to use all of the runs reported in our QC pipeline.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 7.&#039;&#039;&#039; Clicking on the bottom link under the &#039;&#039;&#039;Runs&#039;&#039;&#039; section from the SRA page shown in &#039;&#039;&#039;Figure 6&#039;&#039;&#039; will redirect to the page containing the SRR file. Copy and paste the SRR accession number into the input field in the HIVE3 pipeline labeled “SRR”. You will need to do this for both (or more) SRR accessions. Check that the bioproject, biosample, and organism name all coincide with our test organism.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Single QC Computation =&lt;br /&gt;
A single QC computation allows assemblyQC, biosampleQC, and ngsQC to be performed on one organism with one assembly, but it can include multiple SRR IDs. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1:&#039;&#039;&#039; Input the name of the computation and the name of the folder you wish to store the computation in. If you would like to add to a pre-existing folder, input the exact folder name in the folder input field. It is case sensitive. See input criteria at the beginning of this protocol.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 2:&#039;&#039;&#039; Under the dropdown menu for Reads, select which input you will use. &lt;br /&gt;
&lt;br /&gt;
* For SRRs, type or paste in the SRR ID. If there is more than one SRR ID, click on the gray + sign to populate a new input field or separate the IDs with a comma.&lt;br /&gt;
** Troubleshooting: if the computation fails, try removing the spaces between the commas and SRR IDs (no spaces).&lt;br /&gt;
* For HIVE IDs, click on the HIVE ID option from the dropdown menu. Click on the gray dropdown menu arrow next to HIVE reads. A pop-up window will open, as seen in Figure 9. &lt;br /&gt;
** Click on the ids that you wish to use in the computation. Use Ctrl + shift to highlight multiple ids.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 8.&#039;&#039;&#039; Input settings page for the ARGOS QC pipeline filled out for a single QC computation of our test organism, Salmonella enterica. &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 9.&#039;&#039;&#039; Pop-up window for SRR HIVE ID selection. &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&#039;&#039;&#039;Step 3:&#039;&#039;&#039; Next to Reference, select the input object you would like to use for the computation.&lt;br /&gt;
&lt;br /&gt;
* For Reference Accession, type or paste in the id you wish to use.&lt;br /&gt;
* For Assembly IDs, type or paste in the id you wish to use. &lt;br /&gt;
* For HIVE Genomes, refer to Step 2 above on how to select a HIVE id. It is the same process.&lt;br /&gt;
* Refer to the beginning of this protocol for what ids can be inputted.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4:&#039;&#039;&#039; Next under BioSample Accessions, paste in the biosample ID you would like to use for the computation. This is optional.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 5:&#039;&#039;&#039; Lastly, select from the Coding Table dropdown menu the genetic code you would like to use for your computation, if applicable. The default is “human, viral (Standard)”.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 6:&#039;&#039;&#039; Once all of the information has been inputted correctly, click on the big blue Submit button in the middle, as seen in Figure 8.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 7:&#039;&#039;&#039; Once your computation has been submitted, you can click on the Home tab found in the top left to go back to the homepage and see the workflow.&lt;br /&gt;
&lt;br /&gt;
= Batch Mode Computation =&lt;br /&gt;
Batch mode operates on a user-specified ratio of groups. Using semicolons and commas, the ratio for a batch-mode computation is 1:1:1: the pipeline clusters IDs by “;” and recognizes the IDs between semicolons as one computation. One cluster of SRRs, one assembly, and one biosample together make one computation. The example below displays the syntax for the inputs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;They would be grouped for computations like this example:&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Assembly 1: &lt;br /&gt;
&lt;br /&gt;
SRR0123456, SRR0123457, SRR0123458, SRR0123459&lt;br /&gt;
&lt;br /&gt;
GCA_0011223344.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654321, SAMN110654322&lt;br /&gt;
&lt;br /&gt;
Assembly 2: &lt;br /&gt;
&lt;br /&gt;
SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
&lt;br /&gt;
GCA_0011223345.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654323, SAMN110654324&lt;br /&gt;
&lt;br /&gt;
To separate batches, place a semicolon (“;”) between the groups of IDs. Commas separate individual IDs; semicolons separate batches. These are entered in the General tab of the pipeline, the same as for a single computation. Within each field, this is how the above example would look in batch mode:&lt;br /&gt;
&lt;br /&gt;
SRR IDS: SRR0123456, SRR0123457, SRR0123458, SRR0123459&#039;&#039;&#039;;&#039;&#039;&#039; SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
&lt;br /&gt;
Assembly IDs: GCA_0011223344.1&#039;&#039;&#039;;&#039;&#039;&#039; GCA_0011223345.1&lt;br /&gt;
&lt;br /&gt;
BioSample Accessions: SAMN110654321, SAMN110654322&#039;&#039;&#039;;&#039;&#039;&#039; SAMN110654323, SAMN110654324&lt;br /&gt;
&lt;br /&gt;
Notice the semicolon separation matching the example above: the commas separate the IDs, and the semicolons separate the batches.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting Note:&#039;&#039;&#039; If your computation fails or throws an error, remove the spaces around the commas and semicolons. This previously caused an error that has since been fixed, but it is worth trying if your computation fails. It would look like:&lt;br /&gt;
&lt;br /&gt;
SRR IDS: SRR0123456,SRR0123457,SRR0123458,SRR0123459&#039;&#039;&#039;;&#039;&#039;&#039;SRR0123451,SRR0123452,SRR0123453,SRR0123454&lt;br /&gt;
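The separator rules above can be sketched in Python. This is a hypothetical illustration of the syntax described in this protocol; the actual parsing happens server-side in HIVE.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical sketch of the batch-field syntax described in this protocol:
# ";" separates computations, "," separates IDs within a computation.
# The real parsing is done server-side by HIVE; this only illustrates
# why the ratio is 1:1:1 when every field has the same number of groups.

def parse_batch_field(field):
    """Split a batch input field into one list of IDs per computation."""
    return [
        [item.strip() for item in group.split(",") if item.strip()]
        for group in field.split(";")
        if group.strip()
    ]

srrs = parse_batch_field(
    "SRR0123456, SRR0123457, SRR0123458, SRR0123459; "
    "SRR0123451, SRR0123452, SRR0123453, SRR0123454")
assemblies = parse_batch_field("GCA_0011223344.1; GCA_0011223345.1")
biosamples = parse_batch_field(
    "SAMN110654321, SAMN110654322; SAMN110654323, SAMN110654324")

# 1:1:1 ratio: each field yields the same number of groups, and
# group i of every field belongs to computation i.
assert len(srrs) == len(assemblies) == len(biosamples) == 2
computations = list(zip(srrs, assemblies, biosamples))
```
Spaces around the separators are stripped here, which mirrors the troubleshooting tip about removing spaces between IDs.&lt;br /&gt;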
&lt;br /&gt;
&#039;&#039;&#039;Step 1:&#039;&#039;&#039; Navigate to the tab titled Batch (Figure 10). It is found on the ARGOS input settings page (Figure 1), next to the General tab. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 2:&#039;&#039;&#039; For the “batch service” parameter at the bottom, select batch mode from the dropdown menu. This sets the pipeline to batch mode rather than single computations.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 10.&#039;&#039;&#039; Batch mode input settings window.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&#039;&#039;&#039;Step 3:&#039;&#039;&#039; Select the parameters. Click on the dropdown menu next to the text “Parameter list”.&lt;br /&gt;
&lt;br /&gt;
* Use the black plus button next to ‘Parameter List’ to populate an entry field.&lt;br /&gt;
* Select from the dropdown field the parameter that matches the input field you used on the General input page. This can be seen in Figure 11.&lt;br /&gt;
* For example, if you pasted in SRR IDs, choose the parameter SRR IDs; if you used HIVE IDs, select HIVE IDs from the dropdown.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 11.&#039;&#039;&#039; Dropdown menu from the Batch tab displaying the parameter options. Select the batch parameter option based on what the input information is in the general tab. For example, if you entered SRR ids, select SRR IDs from the dropdown. If you input a reference ID, select Reference IDs.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&#039;&#039;&#039;Step 4:&#039;&#039;&#039; Input the ratio for the batch service. &lt;br /&gt;
&lt;br /&gt;
* For batch-mode computations in the one-click pipeline, the computations are separated by a semicolon (“;”) and the IDs within a computation by a comma (“,”). Because the workflow parses the IDs between semicolons as one computation, the ratio is 1:1:1. &lt;br /&gt;
* If the ratio is 1:1:1 then enter the value 1 for each box.&lt;br /&gt;
* One set of SRRs to one assembly to one biosample. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 5:&#039;&#039;&#039; Input the information in each field. Navigate back to the input settings page (Figure 1). This is the same page used for single computations; the only difference is the semicolons and commas. The example below shows how the information is entered in batch mode.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;Batch Mode Parameter breakdown:&#039;&#039; ==&lt;br /&gt;
Assembly 1: &lt;br /&gt;
&lt;br /&gt;
SRR0123456, SRR0123457, SRR0123458, SRR0123459&lt;br /&gt;
&lt;br /&gt;
GCA_0011223344.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654321, SAMN110654322&lt;br /&gt;
&lt;br /&gt;
Assembly 2: &lt;br /&gt;
&lt;br /&gt;
SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
&lt;br /&gt;
GCA_0011223345.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654323, SAMN110654324&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;In the input fields, the example above is entered as:&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
SRR IDs: SRR0123456, SRR0123457, SRR0123458, SRR0123459&#039;&#039;&#039;;&#039;&#039;&#039; SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
&lt;br /&gt;
Assembly IDs: GCA_0011223344.1&#039;&#039;&#039;;&#039;&#039;&#039; GCA_0011223345.1&lt;br /&gt;
&lt;br /&gt;
BioSample Accessions: SAMN110654321, SAMN110654322&#039;&#039;&#039;;&#039;&#039;&#039; SAMN110654323, SAMN110654324&lt;br /&gt;
&lt;br /&gt;
Again, this is very similar to a single computation, except that batch mode uses semicolons and commas to separate the IDs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 6:&#039;&#039;&#039; Once all of the input information is complete, click the blue Submit button. You may exit the ARGOS pipeline window by clicking “Home” in the top left corner.&lt;br /&gt;
&lt;br /&gt;
= QC Computation Results =&lt;br /&gt;
Once you have submitted your computations, either single or batch, you should see the pipeline workflow in your Inbox or All Objects, as shown in Figure 12. You can also view the pipeline by clicking on the “Workflows” tab, also shown in Figure 12.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 12.&#039;&#039;&#039;  The pipeline workflow displayed in the user’s inbox.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
As the workflow progresses, your computations are stored in the folder you named at the beginning of this protocol. To view the contents of the folder, click the plus sign next to the folder, or click the folder name to open it.&lt;br /&gt;
&lt;br /&gt;
Once your computations are complete, the QC outputs are stored in JSON format under the computation “&#039;&#039;&#039;Post-Alignment Quality Controls&#039;&#039;&#039;” or under the &#039;&#039;&#039;“CFlow”&#039;&#039;&#039; workflow. Post-Alignment QC can be found in the folder you specified for the computation; CFlow appears in All Objects. To view the JSONs, click on the name so that it is highlighted blue, then click the “Available Downloads” tab in the bottom menu.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 13.&#039;&#039;&#039; The available downloads tab and the 5 JSON files that are the QC outputs.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
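After downloading, the output files can be inspected locally. A minimal Python sketch, assuming only the five file names given in this protocol (their internal structure is not specified here, so the sketch just decodes the JSON):&lt;br /&gt;
&lt;br /&gt;
```python
# Load whichever of the five QC output JSONs are present in a local
# download directory. The file names come from this protocol; the
# structure of each file is left unparsed (we only decode the JSON).
import json
from pathlib import Path

QC_FILES = ["qcAll.json", "qcNGS.json", "biosample.json",
            "qcPos.json", "refAnnot.json"]

def load_qc_outputs(download_dir):
    """Return a dict mapping file name to its decoded JSON content."""
    outputs = {}
    for name in QC_FILES:
        path = Path(download_dir) / name
        if path.is_file():
            outputs[name] = json.loads(path.read_text())
    return outputs
```
&lt;br /&gt;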
Five files are reported in JSON format. Click the blue/green download icon next to each file to view the results. The file labeled &#039;&#039;&#039;qcAll.json&#039;&#039;&#039; contains the assemblyQC results, &#039;&#039;&#039;qcNGS.json&#039;&#039;&#039; the ngsQC results, and &#039;&#039;&#039;biosample.json&#039;&#039;&#039; the biosample information. We currently do not submit qcPos.json or refAnnot.json to the ARGOS DB, but that information is provided to help you better understand your computation.&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=ARGOSQC_Usage_Tutorial&amp;diff=798</id>
		<title>ARGOSQC Usage Tutorial</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=ARGOSQC_Usage_Tutorial&amp;diff=798"/>
		<updated>2025-04-25T19:49:06Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: /* Overview */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
&lt;br /&gt;
HIVE3 one-click pipeline tutorial for the ARGOS QC pipeline. This protocol guides the user through running single and batch-mode QC computations. HIVE3 is an instance of HIVE that is not owned by the FDA and can be modified directly by Vahan or other team members with permissions.&lt;br /&gt;
&lt;br /&gt;
= Required User Information =&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Protocol Version&#039;&#039;&#039; &lt;br /&gt;
|1.0&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;HIVE Instance&#039;&#039;&#039;&lt;br /&gt;
|3&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;HIVE Link&#039;&#039;&#039;&lt;br /&gt;
|&amp;lt;nowiki&amp;gt;https://hive3.biochemistry.gwu.edu/dna.cgi?cmd=login&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Overview =&lt;br /&gt;
We constructed a QC one-click pipeline that takes user-specified organism information and combines the 3 core ARGOS workflows to produce 5 result datasets in JSON format (Figure 1). 3 of the 5 result JSONs (assemblyQC, ngsQC, and biosampleMeta) have been updated and added to data.argosdb.&lt;br /&gt;
&lt;br /&gt;
To register your account, navigate to the link under “Required User Information”. At the top right there will be a “Register” tab. Fill out the appropriate fields and submit. Once submitted, please email mazumder_lab@gwu.edu so we can verify your account.&lt;br /&gt;
&lt;br /&gt;
The ARGOS pipeline can be accessed via the ‘Projects’ dropdown menu at the upper right of the screen, then under ‘Argos’. The link to the pipeline in HIVE3 is listed in the “Required User Information” section at the beginning of this protocol. &lt;br /&gt;
&lt;br /&gt;
After a successful login, you will be navigated to the home page. Use the menu at the top right corner under projects to access the ARGOS pipeline or use this URL to access the ARGOS QC pipeline on HIVE3:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;https://hive3.biochemistry.gwu.edu/dna.cgi?cmd=argos-alqc&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[File:Screenshot figure 1.png|left|thumb|723x723px|&#039;&#039;&#039;Figure 1.&#039;&#039;&#039; Input settings page for the ARGOS QC pipeline. ]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;Input values:&#039;&#039; ==&lt;br /&gt;
Data inputs for both single and batch computations are entered on the ARGOS pipeline input settings page, under the General tab. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Name:&#039;&#039;&#039; Give the computation a name. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Folder:&#039;&#039;&#039; Name the folder where your computations, data, and steps will be stored.&lt;br /&gt;
&lt;br /&gt;
* Can use: underscores (_), hyphens (-), ampersands (&amp;amp;), letters, and numbers&lt;br /&gt;
* Cannot use: / : ; , \ “ ” ‘ ’ &lt;br /&gt;
* A / will create a subfolder, but that is not recommended; manually moving the subfolder afterward is best.&lt;br /&gt;
* Ex folder: Influenza A (h5n1)&lt;br /&gt;
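A quick validity check for folder names, sketched from the rules above. The forbidden-character list comes from this protocol; treating “/” as disallowed here follows the recommendation against creating subfolders.&lt;br /&gt;
&lt;br /&gt;
```python
# Check a folder name against the characters this protocol forbids.
# "/" technically creates a subfolder but is discouraged, so it is
# rejected here as well. Spaces and parentheses are fine, as the
# example folder name "Influenza A (h5n1)" shows.
FORBIDDEN = set('/:;,\\') | set('“”‘’')

def folder_name_ok(name):
    """Return True if the name avoids every forbidden character."""
    return bool(name) and FORBIDDEN.isdisjoint(name)

assert folder_name_ok("Influenza A (h5n1)")
assert not folder_name_ok("flu/h5n1")   # would create a subfolder
```
&lt;br /&gt;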
&lt;br /&gt;
&#039;&#039;&#039;Reads:&#039;&#039;&#039; information needed for ngsQC &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;SRR&#039;&#039;&#039;: the SRR accession number. Multiple accessions per organism can be entered by separating them with a comma (“,”) or by clicking the gray + sign to populate extra fields. This tool uses the NCBI SRA fasterq function to fetch the FASTQ files directly from NCBI, so the user does not need to import them into HIVE.  &lt;br /&gt;
* &#039;&#039;&#039;HIVE reads:&#039;&#039;&#039; A dropdown menu for selecting reads already uploaded into HIVE, either from previous computations or manual uploads. &lt;br /&gt;
* See Figure 2a&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;NOTE:&#039;&#039;&#039;&#039;&#039; The algorithm automatically selects information already in HIVE rather than pulling it from outside sources. If the reads are already uploaded, you do not need to use the SRR input box; use the HIVE IDs menu instead (the SRR box will still work, but it will simply search within HIVE). See the ngsQC Protocol for how to upload SRR information using the external downloader; this process is the same as in HIVE1 and HIVE2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Reference:&#039;&#039;&#039; Information used for the assemblyQC portion of the algorithm.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Reference accession:&#039;&#039;&#039; The RefSeq or GenBank nucleotide accession number from NCBI.&lt;br /&gt;
* &#039;&#039;&#039;Assembly ID&#039;&#039;&#039;: The Assembly accession number from NCBI.&lt;br /&gt;
* &#039;&#039;&#039;HIVE genome:&#039;&#039;&#039; Use the dropdown menu to select a reference genome that has already been uploaded into HIVE.&lt;br /&gt;
* See Figure 2b.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;NOTE:&#039;&#039;&#039;&#039;&#039; The algorithm will automatically use information already in HIVE rather than pulling it from outside sources. If the reference is already uploaded, you do not need to use the Assembly ID or Reference Accession input box, but you may. See the AssemblyQC Protocol for how to upload assembly information using the external downloader or a local upload; this process is the same as in HIVE2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Metadata: &#039;&#039;&#039; Used to grab information necessary to fill out the BiosampleMeta_HIVE document.&lt;br /&gt;
&lt;br /&gt;
* Biosample Accession: The accession number of the Biosample reported for the assembly; it will be linked to the SRR fastq files used in the ngsQC portion of the algorithm. This field is optional.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Coding Table:&#039;&#039;&#039; Dropdown of genetic codon tables to use for your computation, depending on the organism. The default is “human, viral (Standard)”.&lt;br /&gt;
&lt;br /&gt;
* Tip: NCBI Taxonomy lists the codon table for each organism on its taxonomy page.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 2.&#039;&#039;&#039; The ARGOS_QC algorithm page input with all NCBI information for a test organism, Salmonella enterica. a) The &#039;&#039;&#039;SRR accession&#039;&#039;&#039; field contains both SRR fastqs for the organism that correspond to the biosample. &lt;br /&gt;
&lt;br /&gt;
b) The &#039;&#039;&#039;Reference Accession&#039;&#039;&#039; is the RefSeq Nucleotide accession number from NCBI, the &#039;&#039;&#039;Assembly ID&#039;&#039;&#039; is the assembly accession number from NCBI, and the Biosample Accession is the Biosample accession number from NCBI.&lt;br /&gt;
&lt;br /&gt;
a &amp;amp; b) &#039;&#039;&#039;HIVE IDs&#039;&#039;&#039; are accessions that are already uploaded into HIVE and the algorithm will automatically select this information as opposed to pulling information from outside sources. &lt;br /&gt;
&lt;br /&gt;
Almost all of this information can be found in the legacy assembly page for this organism, shown in &#039;&#039;&#039;Figure 3, 4, 5, 6, 7&#039;&#039;&#039;. &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;a)&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;b)&#039;&#039;&#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;Where to locate NCBI Information for the inputs:&#039;&#039; ==&lt;br /&gt;
Navigate to the NCBI legacy assembly page for your organism. Here you can find all of the information to be used for your computations.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 3.&#039;&#039;&#039; The information shown in the NCBI legacy assembly page. The RefSeq assembly accession corresponds to our organism of interest and will be used to fill out the &#039;&#039;&#039;“Assembly ID”&#039;&#039;&#039; input field on the HIVE3 ARGOS_QC input page. Please note that the bioproject matches the accession for the FDA_ARGOS bioproject, and there are 2 sequencing technologies listed, meaning that there will most likely be two SRR submissions that we can find on the SRA page (see Figure 5 and 6). &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 4&#039;&#039;&#039;. The bottom section of the legacy assembly page for our test organism. The column labeled RefSeq sequence lists the DNA RefSeq for our test organism which we will use for the “&#039;&#039;&#039;Reference Accession&#039;&#039;&#039;” field on the HIVE3 ARGOS_QC input page. &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 5.&#039;&#039;&#039; Clicking on the Biosample accession number seen in Figure 3 will redirect you to the NCBI Biosample page for this organism. Under &#039;&#039;&#039;“Related Information”&#039;&#039;&#039;, click &#039;&#039;&#039;“SRA”&#039;&#039;&#039; to navigate to the NCBI SRA page for this biosample.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 6.&#039;&#039;&#039; The SRA page lists separate sequencing runs, each produced on a different platform, here Illumina and PacBio. This is common: the platforms give complementary views of the assembly. Illumina produces many short reads that allow an accurate reconstruction of the analyzed genomic sections by estimating the best-fit nucleotide sequence. PacBio is a long-read sequencer that takes “movies” of the DNA as it moves through the instrument, capturing each sequence in one pass from start to finish. The long reads then act as a map onto which the accurate short reads are assembled. It is therefore important to use all of the runs listed when running the QC pipeline.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 7.&#039;&#039;&#039; Clicking on the bottom link under the &#039;&#039;&#039;Runs&#039;&#039;&#039; section of the SRA page shown in &#039;&#039;&#039;Figure 6&#039;&#039;&#039; will redirect to the page containing the SRR file. Copy and paste the SRR accession number into the HIVE3 pipeline input field labeled “SRR”. Repeat this for each SRR accession. Check that the bioproject, biosample, and organism name all match our test organism.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Single QC Computation =&lt;br /&gt;
A single QC computation performs assemblyQC, biosampleQC, and ngsQC on one organism with one assembly, but can include multiple SRR IDs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1:&#039;&#039;&#039; Input the name of the computation and the name of the folder you wish to store the computation in. If you would like to add to a pre-existing folder, input the exact folder name in the folder input field. It is case sensitive. See input criteria at the beginning of this protocol.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 2:&#039;&#039;&#039; Under the dropdown menu for Reads, select which input you will use. &lt;br /&gt;
&lt;br /&gt;
* For SRRs, type or paste in the SRR ID. If there is more than one SRR ID, click on the gray + sign to populate a new input field, or separate the IDs with a comma.&lt;br /&gt;
** Troubleshooting: if the computation fails, remove any spaces between the commas and SRR IDs.&lt;br /&gt;
* For HIVE IDs, select the HIVE ID option from the dropdown menu, then click the gray dropdown arrow next to HIVE reads. A pop-up window will open, as seen in Figure 9.&lt;br /&gt;
** Click on the IDs that you wish to use in the computation. Use Ctrl + Shift to highlight multiple IDs.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 8.&#039;&#039;&#039; Input settings page for the ARGOS QC pipeline filled out for a single QC computation of our test organism, Salmonella enterica. &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 9.&#039;&#039;&#039; Pop-up window for SRR HIVE ID selection. &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&#039;&#039;&#039;Step 3:&#039;&#039;&#039; Next to Reference, select the input object you would like to use for the computation.&lt;br /&gt;
&lt;br /&gt;
* For Reference Accession, type or paste in the id you wish to use.&lt;br /&gt;
* For Assembly IDs, type or paste in the id you wish to use. &lt;br /&gt;
* For HIVE Genomes, refer to Step 2 above on how to select a HIVE id. It is the same process.&lt;br /&gt;
* Refer to the beginning of this protocol for what ids can be inputted.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4:&#039;&#039;&#039; Next, under BioSample Accessions, paste in the Biosample ID you would like to use for the computation. This field is optional.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 5:&#039;&#039;&#039; Lastly, select from the Coding Table dropdown menu the genetic code you would like to use for your computation, if applicable. The default is “human, viral (Standard)”.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 6:&#039;&#039;&#039; Once all of the information has been inputted correctly, click on the big blue Submit button in the middle, as seen in Figure 8.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 7:&#039;&#039;&#039; Once your computation has been submitted, you can click on the Home tab found in the top left to go back to the homepage and see the workflow.&lt;br /&gt;
&lt;br /&gt;
= Batch Mode Computation =&lt;br /&gt;
Batch mode operates on a user-specified ratio of input groups, defined with semicolons and commas. The pipeline recognizes the IDs between each pair of semicolons as belonging to one computation, so the ratio for a batch mode computation is 1:1:1: one cluster of SRRs, to one assembly, to one biosample, per computation. The example below displays the syntax for the inputs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;They would be grouped for computations like this example:&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Assembly 1: &lt;br /&gt;
&lt;br /&gt;
SRR0123456, SRR0123457, SRR0123458, SRR0123459&lt;br /&gt;
&lt;br /&gt;
GCA_0011223344.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654321, SAMN110654322&lt;br /&gt;
&lt;br /&gt;
Assembly 2: &lt;br /&gt;
&lt;br /&gt;
SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
&lt;br /&gt;
GCA_0011223345.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654323, SAMN110654324&lt;br /&gt;
&lt;br /&gt;
To separate batches, place a semicolon “;” between the groups of IDs. A comma separates individual IDs, while a semicolon separates batches. These are entered in the General tab of the pipeline, the same as for a single computation. Within each field, this is how the above example would look in batch mode:&lt;br /&gt;
&lt;br /&gt;
SRR IDS: SRR0123456, SRR0123457, SRR0123458, SRR0123459&#039;&#039;&#039;;&#039;&#039;&#039; SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
&lt;br /&gt;
Assembly IDs: GCA_0011223344.1&#039;&#039;&#039;;&#039;&#039;&#039; GCA_0011223345.1&lt;br /&gt;
&lt;br /&gt;
BioSample Accessions: SAMN110654321, SAMN110654322&#039;&#039;&#039;;&#039;&#039;&#039; SAMN110654323, SAMN110654324&lt;br /&gt;
&lt;br /&gt;
Notice the semicolon separation in the example above: the commas separate the IDs, and the semicolons separate the batches.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting Note:&#039;&#039;&#039; If your computation fails or throws an error, remove the spaces around the “,” and “;” characters. Spaces previously caused an error that has since been fixed, but removing them is worth a try if your computation fails. The input would look like:&lt;br /&gt;
&lt;br /&gt;
SRR IDS: SRR0123456,SRR0123457,SRR0123458,SRR0123459&#039;&#039;&#039;;&#039;&#039;&#039;SRR0123451,SRR0123452,SRR0123453,SRR0123454&lt;br /&gt;
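As a sketch of how this syntax is grouped (an illustrative parser, not the actual HIVE code), the semicolon/comma convention can be expressed as:

```python
def parse_batches(field: str):
    """Split a batch-mode input field: ';' separates batches, ',' separates IDs."""
    return [[item.strip() for item in batch.split(',') if item.strip()]
            for batch in field.split(';') if batch.strip()]

srrs = parse_batches("SRR0123456,SRR0123457;SRR0123451,SRR0123452")
assemblies = parse_batches("GCA_0011223344.1;GCA_0011223345.1")
# 1:1 ratio check: one cluster of SRRs per assembly per computation.
print(len(srrs) == len(assemblies))  # True
```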
&lt;br /&gt;
&#039;&#039;&#039;Step 1:&#039;&#039;&#039; Navigate to the tab titled Batch (Figure 10), found on the ARGOS input settings page (Figure 1) next to the General tab.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 2:&#039;&#039;&#039; For the “batch service” parameter at the bottom, select batch mode from the dropdown menu. This sets the pipeline to Batch Mode rather than single computations.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 10.&#039;&#039;&#039; Batch mode input settings window.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&#039;&#039;&#039;Step 3:&#039;&#039;&#039; Selecting the parameters. Click on the drop down menu next to the text “Parameter list”.&lt;br /&gt;
&lt;br /&gt;
* Use the black plus button next to ‘Parameter List’ to populate an entry field.&lt;br /&gt;
* Select from the dropdown field the correct parameter based on the input field you used in the general input page. This can be seen in Figure 11.&lt;br /&gt;
* For example, if you pasted in SRR IDs, you would choose the parameter SRR IDs. If you used HIVE IDs, you would select HIVE IDs from the dropdown.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 11.&#039;&#039;&#039; Dropdown menu from the Batch tab displaying the parameter options. Select the batch parameter option based on what the input information is in the general tab. For example, if you entered SRR ids, select SRR IDs from the dropdown. If you input a reference ID, select Reference IDs.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&#039;&#039;&#039;Step 4:&#039;&#039;&#039; Input the ratio for the batch service. &lt;br /&gt;
&lt;br /&gt;
* For computations in batch mode in the one-click pipeline, the computations are separated by semicolon “;” and the IDs within the computations by a comma “,”. Since the workflow will parse the computations and recognize the IDs between the “;” as one computation, the ratio will be 1:1:1. &lt;br /&gt;
* If the ratio is 1:1:1 then enter the value 1 for each box.&lt;br /&gt;
* One set of SRRs to one assembly to one biosample. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 5:&#039;&#039;&#039; Input the information correctly in each field. Navigate back to the input settings page (Figure 1). This is the same page used for single computations; the only difference is the semicolons and commas. The example below shows how the information is entered in batch mode.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;Batch Mode Parameter breakdown:&#039;&#039; ==&lt;br /&gt;
Assembly 1: &lt;br /&gt;
&lt;br /&gt;
SRR0123456, SRR0123457, SRR0123458, SRR0123459&lt;br /&gt;
&lt;br /&gt;
GCA_0011223344.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654321, SAMN110654322&lt;br /&gt;
&lt;br /&gt;
Assembly 2: &lt;br /&gt;
&lt;br /&gt;
SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
&lt;br /&gt;
GCA_0011223345.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654323, SAMN110654324&lt;br /&gt;
&lt;br /&gt;
Again, this is very similar to single computations, except that batch mode uses semicolons and commas to separate the IDs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 6:&#039;&#039;&#039; Once all of the input information is complete, click the blue Submit button. You may exit the ARGOS pipeline window by clicking “Home” in the top left corner.&lt;br /&gt;
&lt;br /&gt;
= QC Computation Results =&lt;br /&gt;
Once you have submitted your computations, either single or batch, you should see the pipeline workflow in your Inbox or All Objects, as shown in Figure 12. You can also view the pipeline by clicking on the “workflows” tab also seen in Figure 12.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 12.&#039;&#039;&#039;  The pipeline workflow displayed in the user’s inbox.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
As the workflow progresses, your computations will be stored in the folder that you named from the beginning of this protocol. To view the contents of the folder, simply click on the plus sign next to the folder or the folder name to open.&lt;br /&gt;
&lt;br /&gt;
Once your computations are complete, the QC outputs are stored in JSON format under the “&#039;&#039;&#039;Post-Alignment Quality Controls&#039;&#039;&#039;” computation or under the &#039;&#039;&#039;“CFlow”&#039;&#039;&#039; workflow. Post-Alignment QC can be found in the folder you specified for the computation; CFlow is in All Objects. To view the JSONs, click on the name so that it is highlighted blue, then open the “Available Downloads” tab in the bottom menu.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 13.&#039;&#039;&#039; The available downloads tab and the 5 JSON files that are the QC outputs.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
There will be 5 files reported in JSON format. Click the blue/green download icon next to each file to view the results. The file labeled &#039;&#039;&#039;qcAll.json&#039;&#039;&#039; contains the assemblyQC results, &#039;&#039;&#039;qcNGS.json&#039;&#039;&#039; the ngsQC results, and &#039;&#039;&#039;biosample.json&#039;&#039;&#039; the biosample information. We do not currently submit qcPos.json or refAnnot.json to the ARGOS DB, but that information is provided to help you better understand your computation.&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=File:Screenshot_figure_1.png&amp;diff=797</id>
		<title>File:Screenshot figure 1.png</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=File:Screenshot_figure_1.png&amp;diff=797"/>
		<updated>2025-04-25T19:45:26Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Figure 1 for the pipeline&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
	<entry>
		<id>https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=ARGOSQC_Usage_Tutorial&amp;diff=796</id>
		<title>ARGOSQC Usage Tutorial</title>
		<link rel="alternate" type="text/html" href="https://hivelab.biochemistry.gwu.edu/wiki/index.php?title=ARGOSQC_Usage_Tutorial&amp;diff=796"/>
		<updated>2025-04-25T19:43:13Z</updated>

		<summary type="html">&lt;p&gt;Christie.woodside: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
&lt;br /&gt;
HIVE3 one-click pipeline tutorial for the FDA HIVE instance. This protocol guides the user through running single and batch-mode QC computations. HIVE3 is an instance of HIVE, not owned by the FDA, that can be directly modified by Vahan or other team members with permissions.&lt;br /&gt;
&lt;br /&gt;
= Required User Information =&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Protocol Version&#039;&#039;&#039; &lt;br /&gt;
|1.0&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;HIVE Instance&#039;&#039;&#039;&lt;br /&gt;
|3&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;HIVE Link&#039;&#039;&#039;&lt;br /&gt;
|&amp;lt;nowiki&amp;gt;https://hive3.biochemistry.gwu.edu/dna.cgi?cmd=login&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Overview =&lt;br /&gt;
We constructed a one-click QC pipeline that takes user-specified organism information and combines the 3 core ARGOS workflows to produce 5 different result datasets in JSON format (Figure 1). Three of the five result JSONs (assemblyQC, ngsQC, and biosampleMeta) have been updated and added to data.argosdb.&lt;br /&gt;
&lt;br /&gt;
To register your account, navigate to the link under the “Required User Information”. At the top right there will be a tab saying “register”. Fill out the appropriate fields and submit. Once submitted, please email mazumder_lab@gwu.edu so we can verify your account.&lt;br /&gt;
&lt;br /&gt;
The ARGOS pipeline can be accessed via the ‘Projects’ dropdown menu at the upper right of the screen, then under ‘Argos’. The link to the pipeline in HIVE3 is listed in the “Required User Information” section at the beginning of this protocol.&lt;br /&gt;
&lt;br /&gt;
After a successful login, you will be taken to the home page. Use the Projects menu at the top right corner to access the ARGOS pipeline, or use this URL to access the ARGOS QC pipeline on HIVE3:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;https://hive3.biochemistry.gwu.edu/dna.cgi?cmd=argos-alqc&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 1.&#039;&#039;&#039; Input settings page for the ARGOS QC pipeline. &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;Input values:&#039;&#039; ==&lt;br /&gt;
Data inputs for both single and batch computations are entered on the ARGOS Pipeline input settings page, under the General tab. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Name:&#039;&#039;&#039; Give the computation a name. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Folder:&#039;&#039;&#039; Give the folder where your computations, data, and steps will be stored.&lt;br /&gt;
&lt;br /&gt;
* Can use: underscores (_), hyphens (-), ampersands (&amp;amp;), letters, and numbers&lt;br /&gt;
* Cannot use: / : ; , \ “ ” ‘ ’ &lt;br /&gt;
* A / will create a subfolder, but this is not recommended; manually moving the subfolder afterward is best.&lt;br /&gt;
* Example folder name: Influenza A (h5n1)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Reads:&#039;&#039;&#039; Information needed for ngsQC.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;SRR&#039;&#039;&#039;: The SRR accession number. Multiple accessions per organism can be entered, separated by a “,”, or extra fields can be populated by clicking the gray + sign. This tool uses the NCBI SRA fasterq function to fetch the fastq files directly from NCBI without the user needing to import them into HIVE.&lt;br /&gt;
* &#039;&#039;&#039;HIVE reads:&#039;&#039;&#039; Dropdown menu for selecting reads already uploaded into HIVE, either from previous computations or manual uploads.&lt;br /&gt;
* See Figure 2a.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;NOTE:&#039;&#039;&#039;&#039;&#039; The algorithm will automatically use information already in HIVE rather than pulling it from outside sources. If the reads are already uploaded, you do not need the SRR input box; use the HIVE IDs menu instead. You may still use the SRR box, but it will simply search within HIVE. See the ngsQC Protocol for how to upload SRR information using the external downloader; this process is the same as in HIVE1 and HIVE2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Reference:&#039;&#039;&#039; Information used for the assemblyQC portion of the algorithm.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Reference accession:&#039;&#039;&#039; The RefSeq or GenBank nucleotide accession number from NCBI.&lt;br /&gt;
* &#039;&#039;&#039;Assembly ID&#039;&#039;&#039;: The Assembly accession number from NCBI.&lt;br /&gt;
* &#039;&#039;&#039;HIVE genome:&#039;&#039;&#039; Use the dropdown menu to select a reference genome that has already been uploaded into HIVE.&lt;br /&gt;
* See Figure 2b.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;&#039;&#039;NOTE:&#039;&#039;&#039;&#039;&#039; The algorithm will automatically use information already in HIVE rather than pulling it from outside sources. If the reference is already uploaded, you do not need to use the Assembly ID or Reference Accession input box, but you may. See the AssemblyQC Protocol for how to upload assembly information using the external downloader or a local upload; this process is the same as in HIVE2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Metadata: &#039;&#039;&#039; Used to grab information necessary to fill out the BiosampleMeta_HIVE document.&lt;br /&gt;
&lt;br /&gt;
* Biosample Accession: The accession number of the Biosample reported for the assembly; it will be linked to the SRR fastq files used in the ngsQC portion of the algorithm. This field is optional.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Coding Table:&#039;&#039;&#039; Dropdown of genetic codon tables to use for your computation, depending on the organism. The default is “human, viral (Standard)”.&lt;br /&gt;
&lt;br /&gt;
* Tip: NCBI Taxonomy lists the codon table for each organism on its taxonomy page.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 2.&#039;&#039;&#039; The ARGOS_QC algorithm page input with all NCBI information for a test organism, Salmonella enterica. a) The &#039;&#039;&#039;SRR accession&#039;&#039;&#039; field contains both SRR fastqs for the organism that correspond to the biosample. &lt;br /&gt;
&lt;br /&gt;
b) The &#039;&#039;&#039;Reference Accession&#039;&#039;&#039; is the RefSeq Nucleotide accession number from NCBI, the &#039;&#039;&#039;Assembly ID&#039;&#039;&#039; is the assembly accession number from NCBI, and the Biosample Accession is the Biosample accession number from NCBI.&lt;br /&gt;
&lt;br /&gt;
a &amp;amp; b) &#039;&#039;&#039;HIVE IDs&#039;&#039;&#039; are accessions that are already uploaded into HIVE and the algorithm will automatically select this information as opposed to pulling information from outside sources. &lt;br /&gt;
&lt;br /&gt;
Almost all of this information can be found in the legacy assembly page for this organism, shown in &#039;&#039;&#039;Figure 3, 4, 5, 6, 7&#039;&#039;&#039;. &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;a)&#039;&#039;&#039;&lt;br /&gt;
|-&lt;br /&gt;
|&#039;&#039;&#039;b)&#039;&#039;&#039;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;Where to locate NCBI Information for the inputs:&#039;&#039; ==&lt;br /&gt;
Navigate to the NCBI legacy assembly page for your organism. Here you can find all of the information to be used for your computations.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 3.&#039;&#039;&#039; The information shown in the NCBI legacy assembly page. The RefSeq assembly accession corresponds to our organism of interest and will be used to fill out the &#039;&#039;&#039;“Assembly ID”&#039;&#039;&#039; input field on the HIVE3 ARGOS_QC input page. Please note that the bioproject matches the accession for the FDA_ARGOS bioproject, and there are 2 sequencing technologies listed, meaning that there will most likely be two SRR submissions that we can find on the SRA page (see Figure 5 and 6). &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 4&#039;&#039;&#039;. The bottom section of the legacy assembly page for our test organism. The column labeled RefSeq sequence lists the DNA RefSeq for our test organism which we will use for the “&#039;&#039;&#039;Reference Accession&#039;&#039;&#039;” field on the HIVE3 ARGOS_QC input page. &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 5.&#039;&#039;&#039; Clicking on the Biosample accession number seen in Figure 3 will redirect you to the NCBI Biosample page for this organism. Under &#039;&#039;&#039;“Related Information”&#039;&#039;&#039;, click &#039;&#039;&#039;“SRA”&#039;&#039;&#039; to navigate to the NCBI SRA page for this biosample.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 6.&#039;&#039;&#039; The SRA page lists separate sequencing runs, each produced on a different platform, here Illumina and PacBio. This is common: the platforms give complementary views of the assembly. Illumina produces many short reads that allow an accurate reconstruction of the analyzed genomic sections by estimating the best-fit nucleotide sequence. PacBio is a long-read sequencer that takes “movies” of the DNA as it moves through the instrument, capturing each sequence in one pass from start to finish. The long reads then act as a map onto which the accurate short reads are assembled. It is therefore important to use all of the runs listed when running the QC pipeline.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 7.&#039;&#039;&#039; Clicking on the bottom link under the &#039;&#039;&#039;Runs&#039;&#039;&#039; section of the SRA page shown in &#039;&#039;&#039;Figure 6&#039;&#039;&#039; will redirect to the page containing the SRR file. Copy and paste the SRR accession number into the HIVE3 pipeline input field labeled “SRR”. Repeat this for each SRR accession. Check that the bioproject, biosample, and organism name all match our test organism.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Single QC Computation =&lt;br /&gt;
A single QC computation performs assemblyQC, biosampleQC, and ngsQC on one organism with one assembly, but can include multiple SRR IDs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 1:&#039;&#039;&#039; Input the name of the computation and the name of the folder you wish to store the computation in. If you would like to add to a pre-existing folder, input the exact folder name in the folder input field. It is case sensitive. See input criteria at the beginning of this protocol.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 2:&#039;&#039;&#039; Under the dropdown menu for Reads, select which input you will use. &lt;br /&gt;
&lt;br /&gt;
* For SRRs, type or paste in the SRR ID. If there is more than one SRR ID, click on the gray + sign to populate a new input field, or separate the IDs with a comma.&lt;br /&gt;
** Troubleshooting: if the computation fails, remove any spaces between the commas and SRR IDs.&lt;br /&gt;
* For HIVE IDs, select the HIVE ID option from the dropdown menu, then click the gray dropdown arrow next to HIVE reads. A pop-up window will open, as seen in Figure 9.&lt;br /&gt;
** Click on the IDs that you wish to use in the computation. Use Ctrl + Shift to highlight multiple IDs.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 8.&#039;&#039;&#039; Input settings page for the ARGOS QC pipeline filled out for a single QC computation of our test organism, Salmonella enterica. &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 9.&#039;&#039;&#039; Pop-up window for SRR HIVE ID selection. &lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&#039;&#039;&#039;Step 3:&#039;&#039;&#039; Next to Reference, select the input object you would like to use for the computation.&lt;br /&gt;
&lt;br /&gt;
* For Reference Accession, type or paste in the ID you wish to use.&lt;br /&gt;
* For Assembly IDs, type or paste in the ID you wish to use. &lt;br /&gt;
* For HIVE Genomes, refer to Step 2 above; selecting a HIVE ID is the same process.&lt;br /&gt;
* Refer to the beginning of this protocol for which IDs can be used as input.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 4:&#039;&#039;&#039; Next, under BioSample Accessions, paste in the BioSample ID you would like to use for the computation. This is optional.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 5:&#039;&#039;&#039; Lastly, select from the Coding Table dropdown menu the genetic code you would like to use for your computation, if applicable. The default is “human, viral (Standard)”.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 6:&#039;&#039;&#039; Once all of the information has been inputted correctly, click on the big blue Submit button in the middle, as seen in Figure 8.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 7:&#039;&#039;&#039; Once your computation has been submitted, you can click on the Home tab found in the top left to go back to the homepage and see the workflow.&lt;br /&gt;
&lt;br /&gt;
= Batch Mode Computation =&lt;br /&gt;
Batch mode operates on a user-specified ratio of input groups. Commas separate individual IDs and semicolons separate groups: the pipeline recognizes the IDs between semicolons as one computation, so the ratio is 1:1:1, i.e. one cluster of SRRs to one assembly to one BioSample per computation. The example below displays the syntax for the inputs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;They would be grouped for computations like this example:&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Assembly 1: &lt;br /&gt;
&lt;br /&gt;
SRR0123456, SRR0123457, SRR0123458, SRR0123459&lt;br /&gt;
&lt;br /&gt;
GCA_0011223344.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654321, SAMN110654322&lt;br /&gt;
&lt;br /&gt;
Assembly 2: &lt;br /&gt;
&lt;br /&gt;
SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
&lt;br /&gt;
GCA_0011223345.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654323, SAMN110654324&lt;br /&gt;
&lt;br /&gt;
To separate batches, place a semicolon “;” between the IDs: commas denote separate IDs, and semicolons denote separate batches. These are entered in the General tab of the pipeline, the same as for a single computation. Within each field, the example above would look like this in batch mode:&lt;br /&gt;
&lt;br /&gt;
SRR IDS: SRR0123456, SRR0123457, SRR0123458, SRR0123459&#039;&#039;&#039;;&#039;&#039;&#039; SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
&lt;br /&gt;
Assembly IDs: GCA_0011223344.1&#039;&#039;&#039;;&#039;&#039;&#039; GCA_0011223345.1&lt;br /&gt;
&lt;br /&gt;
BioSample Accessions: SAMN110654321, SAMN110654322&#039;&#039;&#039;;&#039;&#039;&#039; SAMN110654323, SAMN110654324&lt;br /&gt;
&lt;br /&gt;
Notice the semicolon separation in the example above: the commas separate the IDs, and the semicolons separate the batches.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Troubleshooting Note:&#039;&#039;&#039; If your computation fails or there is an error, remove the spaces around the commas and semicolons. This previously threw an error that has since been fixed, but it is worth trying if your computation fails. It would look like:&lt;br /&gt;
&lt;br /&gt;
SRR IDS: SRR0123456,SRR0123457,SRR0123458,SRR0123459&#039;&#039;&#039;;&#039;&#039;&#039;SRR0123451,SRR0123452,SRR0123453,SRR0123454&lt;br /&gt;
&lt;br /&gt;
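The comma/semicolon syntax above can be illustrated with a short sketch. This is a hypothetical helper for understanding how the grouping works, not the pipeline's own parsing code:

```python
# Hypothetical illustration of the batch-input syntax: semicolons
# separate batches (computations), commas separate IDs within a batch.
def parse_batch_field(field):
    """Split a batch-mode input field into per-computation ID lists."""
    return [[item.strip() for item in batch.split(",") if item.strip()]
            for batch in field.split(";")]

srr_field = "SRR0123456,SRR0123457;SRR0123451,SRR0123452"
print(parse_batch_field(srr_field))
# [['SRR0123456', 'SRR0123457'], ['SRR0123451', 'SRR0123452']]
```

Note that the same result is produced with or without spaces after the separators, which is why removing spaces is only a troubleshooting step.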
&#039;&#039;&#039;Step 1:&#039;&#039;&#039; Navigate to the tab titled Batch (Figure 10). It can be found on the ARGOS input settings page (Figure 1), next to the General tab. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 2:&#039;&#039;&#039; For the parameter “batch service” at the bottom, select batch mode from the dropdown menu. This sets the pipeline to Batch Mode rather than single computations.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 10.&#039;&#039;&#039; Batch mode input settings window.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&#039;&#039;&#039;Step 3:&#039;&#039;&#039; Select the parameters. Click on the dropdown menu next to the text “Parameter list”.&lt;br /&gt;
&lt;br /&gt;
* Use the black plus button next to ‘Parameter List’ to populate an entry field.&lt;br /&gt;
* Select from the dropdown field the correct parameter based on the input field you used in the General input page. This can be seen in Figure 11.&lt;br /&gt;
* For example, if you pasted in SRR IDs, choose the parameter SRR IDs. If you used HIVE IDs, select HIVE IDs from the dropdown.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 11.&#039;&#039;&#039; Dropdown menu from the Batch tab displaying the parameter options. Select the batch parameter option based on what the input information is in the general tab. For example, if you entered SRR ids, select SRR IDs from the dropdown. If you input a reference ID, select Reference IDs.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
&#039;&#039;&#039;Step 4:&#039;&#039;&#039; Input the ratio for the batch service. &lt;br /&gt;
&lt;br /&gt;
* For computations in batch mode in the one-click pipeline, the computations are separated by semicolon “;” and the IDs within the computations by a comma “,”. Since the workflow will parse the computations and recognize the IDs between the “;” as one computation, the ratio will be 1:1:1. &lt;br /&gt;
* If the ratio is 1:1:1 then enter the value 1 for each box.&lt;br /&gt;
* One set of SRRs to one assembly to one biosample. &lt;br /&gt;
&lt;br /&gt;
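The 1:1:1 pairing described above can be sketched as follows. This is an illustration of how the semicolon-separated groups line up into computations, under the assumption of one SRR cluster, one assembly, and one BioSample group per batch; it is not the pipeline's own code:

```python
# Illustrative sketch of the 1:1:1 batch ratio (not the pipeline's code).
# Each semicolon-separated group of SRRs pairs with one assembly and one
# group of BioSample accessions to form one computation.
srr_groups = [["SRR0123456", "SRR0123457", "SRR0123458", "SRR0123459"],
              ["SRR0123451", "SRR0123452", "SRR0123453", "SRR0123454"]]
assemblies = ["GCA_0011223344.1", "GCA_0011223345.1"]
biosample_groups = [["SAMN110654321", "SAMN110654322"],
                    ["SAMN110654323", "SAMN110654324"]]

# zip() pairs the groups positionally: batch 1 with batch 1, and so on.
computations = list(zip(srr_groups, assemblies, biosample_groups))
for srrs, assembly, biosamples in computations:
    print(assembly, srrs, biosamples)
```

With two groups in each field, this yields two computations, matching the Assembly 1 / Assembly 2 example above.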
&#039;&#039;&#039;Step 5:&#039;&#039;&#039; Input the information correctly in each field. Navigate back to the input settings page (Figure 1). This is the same page used for single computations; the only difference is the semicolons and commas. The example below shows how the information is entered in batch mode.&lt;br /&gt;
&lt;br /&gt;
== &#039;&#039;Batch Mode Parameter breakdown:&#039;&#039; ==&lt;br /&gt;
Assembly 1: &lt;br /&gt;
&lt;br /&gt;
SRR0123456, SRR0123457, SRR0123458, SRR0123459&lt;br /&gt;
&lt;br /&gt;
GCA_0011223344.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654321, SAMN110654322&lt;br /&gt;
&lt;br /&gt;
Assembly 2: &lt;br /&gt;
&lt;br /&gt;
SRR0123451, SRR0123452, SRR0123453, SRR0123454&lt;br /&gt;
&lt;br /&gt;
GCA_0011223345.1&lt;br /&gt;
&lt;br /&gt;
SAMN110654323, SAMN110654324&lt;br /&gt;
&lt;br /&gt;
Again, this is very similar to a single computation, except that batch mode uses semicolons and commas to separate the IDs.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Step 6:&#039;&#039;&#039; Once all of the input information is complete, click the blue Submit button. You may exit the ARGOS pipeline window by clicking “Home” in the top left corner.&lt;br /&gt;
&lt;br /&gt;
= QC Computation Results =&lt;br /&gt;
Once you have submitted your computations, either single or batch, you should see the pipeline workflow in your Inbox or All Objects, as shown in Figure 12. You can also view the pipeline by clicking on the “Workflows” tab, also seen in Figure 12.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 12.&#039;&#039;&#039;  The pipeline workflow displayed in the user’s inbox.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
As the workflow progresses, your computations will be stored in the folder you named at the beginning of this protocol. To view the contents of the folder, click the plus sign next to the folder, or the folder name, to open it.&lt;br /&gt;
&lt;br /&gt;
Once your computations are complete, the QC outputs are stored in JSON format under the “&#039;&#039;&#039;Post-Alignment Quality Controls&#039;&#039;&#039;” computation or the &#039;&#039;&#039;“CFlow”&#039;&#039;&#039; workflow. Post-Alignment QC can be found in the folder you specified for the computation; CFlow can be found in All Objects. To view the JSONs, click on the name so that it is highlighted blue, then click the “Available Downloads” tab in the bottom menu.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|&#039;&#039;&#039;Figure 13.&#039;&#039;&#039; The available downloads tab and the 5 JSON files that are the QC outputs.&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|-&lt;br /&gt;
|&lt;br /&gt;
|}&lt;br /&gt;
There will be 5 files reported in JSON format. Click the blue/green download icon next to each file to view the results. The file labeled &#039;&#039;&#039;qcAll.json&#039;&#039;&#039; contains the assemblyQC results, &#039;&#039;&#039;qcNGS.json&#039;&#039;&#039; contains the ngsQC results, and &#039;&#039;&#039;biosample.json&#039;&#039;&#039; contains the BioSample information. We currently do not submit qcPos.json or refAnnot.json to the ARGOS DB, but that information is provided to help you better understand your computation.&lt;/div&gt;</summary>
		<author><name>Christie.woodside</name></author>
	</entry>
</feed>