HIVE Lab - User contributions [en]

Volunteership Summer 2026

2026-06-02T15:10:32Z

Urnisha.bhuiyan: /* Volunteers */

== 2026 Summer Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

May 15, 2026 | 12:00 PM ET

Please email your updated resume and your projects in order of preference. An acceptance letter/email will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

Date: June 1, 2026 | 11:00 AM to 12:00 PM

'''Program Dates: June 1, 2026 – July 31, 2026''' (9 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[[Volunteership Spring 2026|Spring 2026 Volunteership]]

Presentation slides from the Spring 2026 volunteership symposium are publicly available on [https://zenodo.org/records/20072087 Zenodo] to highlight student research contributions from the program.
----

=== Volunteer Expectations ===

# Minimum commitment of 20 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Volunteers should be responsive to email/Slack communications.
# 30-minute Zoom meetings (during regular work hours) once a week or every other week with the assigned project point of contact (POC).
# Volunteers are expected to attend volunteership events such as a symposium.
# Attend some lectures or seminars remotely (max 4-5).
# This volunteership does not allow for vacation time.

'''''Important:''' '''If the scrum is not updated for 2 consecutive working days,''' '''the candidate will be automatically dropped from the program.'''''
----

=== Volunteership Support ===
Each group has dedicated Points of Contact (PoCs) who are your main resource for questions and guidance.

<u>How to Get Help</u>

''Slack Group Channel''

Use your group Slack channel as the primary place to ask questions and share ideas. This is strongly encouraged so everyone can learn together. Direct messages to PoCs are discouraged.

''Office Hours''

PoCs will host group office hours every two weeks once the program begins. These sessions are a space to ask questions, discuss ideas, and collaborate live.

<u>How to get support</u>

- Use the Slack channel as your first point of contact (if you are not yet in the Slack channel, then email your PoC at mazumder_lab AT gwu.edu)

- Follow up with your PoCs in the group channel

- Come prepared with questions for office hours

- Participate in discussions and support your peers

Our goal is to create an open, collaborative environment where everyone can learn and contribute.
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Summer 2026. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email ''mazumder_lab@gwu.edu'' your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.

# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curating PMIDs for intervention outcome prediction dataset LLM recommendation training.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen. <u>We are also looking for individuals who have previously worked with us to take on a coordinator role</u>.''
----

==== 1. GlycoSiteMiner Curation Project ====
POC: Kate Warner

GlycoSiteMiner (PMID: [https://pubmed.ncbi.nlm.nih.gov/40401984/ 40401984]) is a large language model (LLM)-based tool developed by the GlyGen team to automate a literature-mining pipeline that extracts experimentally validated, protein sequence–specific glycosylation sites from PubMed abstracts. By leveraging natural language processing, GlycoSiteMiner accelerates the identification of glycosylation evidence that would otherwise require extensive manual review.

The objective of this project is to validate these text-mined entries and curate them into structured datasets using GlyTableMaker (https://glygen.ccrc.uga.edu/tablemaker), a companion tool designed to support the deposition of glycans and glycoproteins, assignment of standardized metadata, and generation of high-quality Excel/CSV tables. This process ensures that extracted information is accurate, consistent, and suitable for integration into GlyGen’s knowledgebase.

This opportunity provides hands-on experience in biocuration workflows, including data validation, standardization, and quality control. Participants will deepen their understanding of glycobiology concepts, gain practical experience working with biological databases, and develop skills in evaluating and refining LLM-generated outputs for scientific applications.

==== 2. GlyGen Biocuration Project ====
POC: Rene Ranzinger and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding; however, the data contained within them remains highly valuable to the research community. Integrating these legacy datasets into modern databases or knowledgebases, such as GlyGen, presents a significant challenge because much of the associated metadata (e.g., species, tissue, disease, cell line) is recorded as free-text that does not conform to the standardized dictionaries and ontologies used by current resources.

To address this challenge, this project will leverage large language models (LLMs) to automate the mapping of free-text metadata from legacy databases, specifically CarbBank and CFG, to standardized accessions in authoritative resources such as NCBI Taxonomy, Disease Ontology, and Cellosaurus. The LLM-based workflow will identify and normalize synonyms, abbreviations, and spelling variants (e.g., “human,” “man,” or “h. sapiens” mapped to Homo sapiens), enabling scalable and reproducible metadata harmonization that would otherwise require extensive manual curation. The LLM tasks will be performed using OpenAI resources integrated into the GlyGen curation pipeline. The project involves the development of Python scripts to read and write data, invoke the OpenAI API, and compare results with manually curated data. Another aspect of the work is the development and fine-tuning of a prompt for ChatGPT to ensure that reliable and accurate mappings are produced.

While the mapping process will be largely automated, manual validation will be incorporated as a quality-control step to assess model performance, verify correctness, and identify edge cases requiring refinement. This hybrid approach significantly reduces curator burden while ensuring high-quality, ontology-aligned annotations.
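
The quality-control step described above can be sketched in Python as a comparison between model-proposed mappings and a manually curated reference. This is only an illustrative sketch, not the GlyGen pipeline: the function names, the record layout, and the sample terms are hypothetical, and the LLM call is replaced by a fixed dictionary so that the comparison logic can be shown on its own. The accessions are NCBI Taxonomy IDs (9606 for Homo sapiens, 10090 for Mus musculus, 10116 for Rattus norvegicus).

```python
def normalize(term):
    """Lowercase and collapse whitespace so spelling variants compare equal."""
    return " ".join(term.lower().split())


def evaluate_mappings(predicted, curated):
    """Compare predicted accessions against manually curated ones.

    Both arguments map a free-text term to an accession string.
    Returns (accuracy, mismatches), where mismatches lists
    (term, predicted_accession, curated_accession) for manual review.
    """
    predicted_norm = {normalize(k): v for k, v in predicted.items()}
    mismatches = []
    correct = 0
    for term, expected in curated.items():
        got = predicted_norm.get(normalize(term))
        if got == expected:
            correct += 1
        else:
            mismatches.append((term, got, expected))
    accuracy = correct / len(curated) if curated else 0.0
    return accuracy, mismatches


# Hypothetical model output (a stand-in for responses from an LLM API call).
predicted = {
    "human": "9606",
    "H. sapiens": "9606",
    "man": "9606",
    "mouse": "10116",  # deliberately wrong: 10116 is Rattus norvegicus
}

# Hypothetical manually curated reference for the same terms.
curated = {
    "Human": "9606",
    "h. sapiens": "9606",
    "man": "9606",
    "mouse": "10090",
}

accuracy, mismatches = evaluate_mappings(predicted, curated)
print(f"accuracy: {accuracy:.2f}")
for term, got, expected in mismatches:
    print(f"review: {term!r} predicted {got}, curated {expected}")
```

The mismatch list is the part a curator would look at: it collects the edge cases where the prompt or the reference may need refinement.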

The goal of this effort is to migrate and modernize datasets from CarbBank and CFG, making them interoperable with GlyGen and other contemporary glycoinformatics resources through a scalable, AI-assisted curation strategy.

Two projects:

# Taking predicted sites and curating them using table maker
# Website testing (all volunteers)

For any questions, please contact Rene Ranzinger (rene@ccrc.uga.edu) or Kate Warner (k.warner1@email.gwu.edu).

==== 3. GlyGen Publication Analysis Project ====
POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
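
As a starting point for the first step, a keyword search against PubMed could be sketched with the NCBI E-utilities ESearch endpoint. The sketch below only builds the request URL and parses a response; the helper names, the keyword list, and the sample response are assumptions made for illustration, and no network request is made.

```python
import json
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(keywords, retmax=20):
    """Build an ESearch URL that matches any keyword in the title or abstract."""
    term = " OR ".join(f'"{kw}"[Title/Abstract]' for kw in keywords)
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_pmids(response_text):
    """Extract the list of PMIDs from an ESearch JSON response."""
    data = json.loads(response_text)
    return data["esearchresult"]["idlist"]


# Hypothetical keyword list for illustration.
url = build_esearch_url(["GlyGen", "GlycoSiteMiner"], retmax=5)
print(url)

# Hand-written sample in the shape of an ESearch JSON response (not a real query result).
sample_response = '{"esearchresult": {"count": "1", "idlist": ["40401984"]}}'
print(parse_pmids(sample_response))
```

For a live query, the URL could be fetched with <code>urllib.request.urlopen</code>, keeping within NCBI's usage guidelines on request rates. Author affiliations for the returned PMIDs would need a separate retrieval step.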

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning (ML) Modeling Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct ML modeling using publicly available -omics datasets that were previously identified (see our [[Recommended Publications for Intervention Outcome Prediction Models|Recommended Publications for IOPMs]] page). This volunteership will involve data harmonization, model training, and pipeline documentation.

Tasks associated with this project include:

# Exploring and understanding the data found in relevant PMIDs that can be used to train intervention outcome prediction models.
# Preparing the data for model training and model performance evaluation
# Testing the modeling tutorial, PredictMod platform, and associated project tools
# Documentation of the ML pipeline and testing results
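
To make the data preparation and evaluation tasks concrete, the following is a minimal sketch of a stratified hold-out split with a majority-class baseline, written with the Python standard library only. It is not the PredictMod pipeline: the labels, the split fraction, and the function names are hypothetical, and a real model would replace the baseline.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), keeping class proportions similar."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical outcome labels for 12 samples (8 responders, 4 non-responders).
labels = ["responder"] * 8 + ["non-responder"] * 4
train_idx, test_idx = stratified_split(labels, test_fraction=0.25)

# Majority-class baseline: predict the most common training label for every test sample.
majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
y_test = [labels[i] for i in test_idx]
baseline = accuracy(y_test, [majority] * len(y_test))
print(len(train_idx), len(test_idx), majority, round(baseline, 2))
```

Reporting a simple baseline alongside a trained model gives the testing report a reference point for judging whether the model has learned anything beyond the class balance.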

Deliverables for this project include:

# ML-ready datasets & trained model scripts pushed to GitHub
# Pipeline documentation captured in BioCompute Objects (BCOs) and testing reports
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 5. BioCompute Objects User Research Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct individual audits and user research to improve the human readability of BioCompute Objects (BCOs) and the project documentation. This volunteership will involve user research, prototyping, and documentation.

Tasks associated with the project include:

# Reviewing existing documentation to gain a comprehensive understanding of BioCompute Objects, their relevance to bioinformatics, and key user personas. The volunteer will identify and report gaps in the current documentation.
# Conducting user research to understand pain points and desired outcomes. The volunteer will develop user stories based on interviews with BCO users.
# Prototyping improvements to the BCO documentation and/or portal based on user stories. This could involve visual diagrams, wiki restructuring, or decision logs.

Deliverables will include:

# User research report with user story maps
# BCO documentation improvement plan
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 6. BiomarkerKB Biocuration Project ====
POC: Jeet Vora (primary), Maria Kim, Cyrus Au-Yeung

[https://biomarkerkb.org/about/ BiomarkerKB] is a biomedical knowledgebase project focused on harmonizing and structuring biomarker knowledge from scientific literature and public resources. We are currently recruiting individuals with experience working with LLMs (e.g., Claude, ChatGPT) to support the following tasks:

# '''Validation of existing published biomarkers from scientific literature (JV, MK, CA)'''
#* Review and validate previously reported biomarkers by checking the original literature, confirming evidence support, and standardizing biomarker annotations
#* Assess the evidence strength of biomarkers and identify additional literature to strengthen the support for biomarker claims
# '''Curation of novel biomarkers from scientific literature (MK)'''
#* Curate high-quality biomarkers for a selected disease area, organize the findings into a structured dataset
#* Standardize biomarker representations using controlled vocabularies and ontologies and classify biomarkers by their biomarker types
#* Construct and test-query a disease-specific biomarker knowledge graph (optional)
# '''Electronic Health Records Normal Entity Data Integration (JV)'''
#* Identify relevant EHR data elements (lab tests, diagnoses, procedures)
#* Map entities to standard terminologies (e.g., SNOMED CT, LOINC, ICD codes)
#* Resolve ambiguities and inconsistencies in mapping and clinical terminology
# '''Front-end testing for BiomarkerKB.org (MK, JV)'''
#* Test the BiomarkerKB web interface for functionality and data presentation, and document issues / improvement suggestions for the development team
# '''Benchmarking and LLM-based biomarker extraction (optional*) (CA)'''
#* Construct manually curated biomarker reference sets in the glycobiology domain to support benchmarking of LLM-based knowledge extraction pipelines.
#* Apply an LLM workflow to extract disease-specific biomarkers from literature and compare model outputs against the manually curated benchmark sets

''Note:'' Participation in the benchmarking and LLM-based biomarker extraction subproject depends on sufficient progress in either task 1 or task 2. Volunteers are expected to first complete either validation of an LLM-extracted glycobiology subset or comprehensive curation of a disease-specific biomarker set before beginning this component. Because this volunteership is structured around a 20-hour-per-week commitment, participation in this part is not guaranteed.
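
The comparison of model outputs against a curated benchmark set in task 5 can be summarized with precision, recall, and F1. The sketch below is an assumption about how such a comparison might be scored, not the project's established method; the function name and the placeholder biomarker identifiers are hypothetical.

```python
def precision_recall_f1(extracted, reference):
    """Score a set of extracted items against a manually curated reference set."""
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Placeholder identifiers, not real biomarkers.
reference = {"biomarker_A", "biomarker_B", "biomarker_C", "biomarker_D"}
extracted = {"biomarker_A", "biomarker_B", "biomarker_E"}

precision, recall, f1 = precision_recall_f1(extracted, reference)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact set matching assumes that both sides use the same standardized identifiers, which is one reason the curation tasks above emphasize controlled vocabularies.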

Individuals interested in this opportunity may reach out to Jeet Vora ([mailto:jeetvora@gwu.edu jeetvora@gwu.edu]) for project details.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 9-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program. Additional recognition will be given to the top three volunteers with exceptional presentations at the end of the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers ===
{| class="wikitable sortable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects of Interest
|-
|Sahana Adusumilli
|BiomarkerKB
|Jeet Vora
|BiomarkerKB
|-
|Abhirama Chillara
|BiomarkerKB
|Jeet Vora, Maria Kim
|BiomarkerKB
|-
|Rhea Charles
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML
|-
|Sri Piramanayagam
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML, BiomarkerKB, GlyGen, BCO
|-
|Taylor Dimenna
|GlyGen
|Urnisha Bhuiyan, Rene Ranzinger
|GlyGen Biocuration Project
|-
|Daniel Auerbach
|GlyGen
|Urnisha Bhuiyan, Rene Ranzinger
|GlyGen Publication Analysis Project
|-
|Swapnaneel Chatterjee
|GlyGen
|Urnisha Bhuiyan, Sujeet Kulkarni
|New Project: GlycoChatbot Project
|-
|Neha Rao
|GlyGen
|Kate Warner, Robel Kahsay
|GlycoSiteMiner Curation Project
|-
|John McCaffrey*
|GlyGen
|Kate Warner, Robel Kahsay
|GlycoSiteMiner Curation Project
|-
|Nahom Abel*
|GlyGen
|Kate Warner, Robel Kahsay
|GlycoSiteMiner Curation Project
|-
|Jovanna Aragon
|GlyGen
|Kate Warner, Robel Kahsay
|GlycoSiteMiner Curation Project
|-
|Mathias Belay*
|BCO User Research
|Lori Krammer, Pat McNeely
|BCO, GlycoSiteMiner, BiomarkerKB
|-
|Arjun Agnihothram
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML, BiomarkerKB, GlycoSiteMiner
|-
|Aryan Jagani
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML
|-
|Cynthia Li
|BiomarkerKB
|Maria Kim
|BiomarkerKB, GlyGen, PredictMod
|-
|Vishal Muthusekaran<sup>‡</sup>
|BiomarkerKB
|Cyrus Au-Yeung, Maria Kim
|BiomarkerKB
|-
|Dia Jhaveri<sup>†</sup>
|BiomarkerKB
|Maria Kim
|BiomarkerKB, GlycoSiteMiner, BCO, PredictMod, GlyGen
|-
|Arthur Issler
|BiomarkerKB
|Maria Kim, Jeet Vora
|BiomarkerKB, GlycoSiteMiner, GlyGen
|}
<nowiki>*</nowiki>Returning volunteer.

<sup>†</sup>GW Masters Degree Student

<sup>‡</sup>Not directly involved in the semester curriculum; long-term volunteer.

== Summer 2026 Symposium ==
The Summer symposium will be held virtually.

'''Date:''' TBD

'''Time:''' 4 - 6 PM

'''Zoom Link''' - TBA

=== Agenda (All times are in Eastern Time) ===
{| class="wikitable"
!Time
!Project
!Presentation Title
!Presenter(s)
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
| colspan="2" |
|
|}

Volunteership Summer 2026

2026-05-29T14:32:14Z

Urnisha.bhuiyan: /* Volunteers */

== 2026 Summer Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

May 15, 2026 | 12:00 PM ET

Please email your updated resume and your projects in order of preference. An acceptance letter/email will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

Date: June 1, 2026 | 11:00 AM to 12:00 PM

'''Program Dates: June 1, 2026 – July 31, 2026''' (9 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[[Volunteership Spring 2026|Spring 2026 Volunteership]]

Presentation slides from the Spring 2026 volunteership symposium are publicly available on [https://zenodo.org/records/20072087 Zenodo] to highlight student research contributions from the program.
----

=== Volunteer Expectations ===

# Minimum commitment of 20 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Volunteers should be responsive to email/Slack communications.
# 30-minute Zoom meetings (during regular work hours) once a week or every other week with the assigned project point of contact (POC).
# Volunteers are expected to attend volunteership events such as a symposium.
# Attend some lectures or seminars remotely (max 4-5).
# This volunteership does not allow for vacation time.

'''''Important:''' '''If the scrum is not updated for 2 consecutive working days,''' '''the candidate will be automatically dropped from the program.'''''
----

=== Volunteership Support ===
Each group has dedicated Points of Contact (PoCs) who are your main resource for questions and guidance.

<u>How to Get Help</u>

''Slack Group Channel''

Use your group Slack channel as the primary place to ask questions and share ideas. This is strongly encouraged so everyone can learn together. Direct messages to PoCs are discouraged.

''Office Hours''

PoCs will host group office hours every two weeks once the program begins. These sessions are a space to ask questions, discuss ideas, and collaborate live.

<u>How to get support</u>

- Use the Slack channel as your first point of contact (if you are not yet in the Slack channel, then email your PoC at mazumder_lab AT gwu.edu)

- Follow up with your PoCs in the group channel

- Come prepared with questions for office hours

- Participate in discussions and support your peers

Our goal is to create an open, collaborative environment where everyone can learn and contribute.
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Summer 2026. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email ''mazumder_lab@gwu.edu'' your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.

# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curating PMIDs for intervention outcome prediction dataset LLM recommendation training.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen. <u>We are also looking for individuals who have previously worked with us to take on a coordinator role</u>.''
----

==== 1. GlycoSiteMiner Curation Project ====
POC: Kate Warner

GlycoSiteMiner (PMID: [https://pubmed.ncbi.nlm.nih.gov/40401984/ 40401984]) is a large language model (LLM)-based tool developed by the GlyGen team to automate a literature-mining pipeline that extracts experimentally validated, protein sequence–specific glycosylation sites from PubMed abstracts. By leveraging natural language processing, GlycoSiteMiner accelerates the identification of glycosylation evidence that would otherwise require extensive manual review.

The objective of this project is to validate these text-mined entries and curate them into structured datasets using GlyTableMaker (https://glygen.ccrc.uga.edu/tablemaker), a companion tool designed to support the deposition of glycans and glycoproteins, assignment of standardized metadata, and generation of high-quality Excel/CSV tables. This process ensures that extracted information is accurate, consistent, and suitable for integration into GlyGen’s knowledgebase.

This opportunity provides hands-on experience in biocuration workflows, including data validation, standardization, and quality control. Participants will deepen their understanding of glycobiology concepts, gain practical experience working with biological databases, and develop skills in evaluating and refining LLM-generated outputs for scientific applications.

==== 2. GlyGen Biocuration Project ====
POC: Rene Ranzinger and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding; however, the data contained within them remains highly valuable to the research community. Integrating these legacy datasets into modern databases or knowledgebases, such as GlyGen, presents a significant challenge because much of the associated metadata (e.g., species, tissue, disease, cell line) is recorded as free-text that does not conform to the standardized dictionaries and ontologies used by current resources.

To address this challenge, this project will leverage large language models (LLMs) to automate the mapping of free-text metadata from legacy databases, specifically CarbBank and CFG, to standardized accessions in authoritative resources such as NCBI Taxonomy, Disease Ontology, and Cellosaurus. The LLM-based workflow will identify and normalize synonyms, abbreviations, and spelling variants (e.g., “human,” “man,” or “h. sapiens” mapped to Homo sapiens), enabling scalable and reproducible metadata harmonization that would otherwise require extensive manual curation. The LLM tasks will be performed using OpenAI resources integrated into the GlyGen curation pipeline. The project involves the development of Python scripts to read and write data, invoke the OpenAI API, and compare results with manually curated data. Another aspect of the work is the development and fine-tuning of a prompt for ChatGPT to ensure that reliable and accurate mappings are produced.

While the mapping process will be largely automated, manual validation will be incorporated as a quality-control step to assess model performance, verify correctness, and identify edge cases requiring refinement. This hybrid approach significantly reduces curator burden while ensuring high-quality, ontology-aligned annotations.

The goal of this effort is to migrate and modernize datasets from CarbBank and CFG, making them interoperable with GlyGen and other contemporary glycoinformatics resources through a scalable, AI-assisted curation strategy.

Two projects:

# Taking predicted sites and curating them using table maker
# Website testing (all volunteers)

For any questions, please contact Rene Ranzinger (rene@ccrc.uga.edu) or Kate Warner (k.warner1@email.gwu.edu).

==== 3. GlyGen Publication Analysis Project ====
POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning (ML) Modeling Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct ML modeling using publicly available -omics datasets that were previously identified (see our [[Recommended Publications for Intervention Outcome Prediction Models|Recommended Publications for IOPMs]] page). This volunteership will involve data harmonization, model training, and pipeline documentation.

Tasks associated with this project include:

# Exploring and understanding the data found in relevant PMIDs that can be used to train intervention outcome prediction models.
# Preparing the data for model training and model performance evaluation
# Testing the modeling tutorial, PredictMod platform, and associated project tools
# Documentation of the ML pipeline and testing results

Deliverables for this project include:

# ML-ready datasets & trained model scripts pushed to GitHub
# Pipeline documentation captured in BioCompute Objects (BCOs) and testing reports
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 5. BioCompute Objects User Research Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct individual audits and user research to improve the human readability of BioCompute Objects (BCOs) and the project documentation. This volunteership will involve user research, prototyping, and documentation.

Tasks associated with the project include:

# Reviewing existing documentation to gain a comprehensive understanding of BioCompute Objects, their relevance to bioinformatics, and key user personas. The volunteer will identify and report gaps in the current documentation.
# Conducting user research to understand pain points and desired outcomes. The volunteer will develop user stories based on interviews with BCO users.
# Prototyping improvements to the BCO documentation and/or portal based on user stories. This could involve visual diagrams, wiki restructuring, or decision logs.

Deliverables will include:

# User research report with user story maps
# BCO documentation improvement plan
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 6. BiomarkerKB Biocuration Project ====
POC: Jeet Vora (primary), Maria Kim, Cyrus Au-Yeung

[https://biomarkerkb.org/about/ BiomarkerKB] is a biomedical knowledgebase project focused on harmonizing and structuring biomarker knowledge from scientific literature and public resources. We are currently recruiting individuals with experience working with LLMs (e.g., Claude, ChatGPT) to support the following tasks:

# '''Validation of existing published biomarkers from scientific literature (JV, MK, CA)'''
#* Review and validate previously reported biomarkers by checking the original literature, confirming evidence support, and standardizing biomarker annotations
#* Assess the evidence strength of biomarkers and identify additional literature to strengthen the support for biomarker claims
# '''Curation of novel biomarkers from scientific literature (MK)'''
#* Curate high-quality biomarkers for a selected disease area, organize the findings into a structured dataset
#* Standardize biomarker representations using controlled vocabularies and ontologies and classify biomarkers by their biomarker types
#* Construct and test-query a disease-specific biomarker knowledge graph (optional)
# '''Electronic Health Records Normal Entity Data Integration (JV)'''
#* Identify relevant EHR data elements (lab tests, diagnoses, procedures)
#* Map entities to standard terminologies (e.g., SNOMED CT, LOINC, ICD codes)
#* Resolve ambiguities and inconsistencies in mapping and clinical terminology
# '''Front-end testing for BiomarkerKB.org (MK, JV)'''
#* Test the BiomarkerKB web interface for functionality and data presentation, and document issues / improvement suggestions for the development team
# '''Benchmarking and LLM-based biomarker extraction (optional*) (CA)'''
#* Construct manually curated biomarker reference sets in the glycobiology domain to support benchmarking of LLM-based knowledge extraction pipelines.
#* Apply an LLM workflow to extract disease-specific biomarkers from literature and compare model outputs against the manually curated benchmark sets

''Note:'' Participation in the benchmarking and LLM-based biomarker extraction subproject depends on sufficient progress in either task 1 or task 2. Volunteers are expected to first complete either validation of an LLM-extracted glycobiology subset or comprehensive curation of a disease-specific biomarker set before beginning this component. Because this volunteership is structured around a 20-hour-per-week commitment, participation in this part is not guaranteed.

Individuals interested in this opportunity may reach out to Jeet Vora ([mailto:jeetvora@gwu.edu jeetvora@gwu.edu]) for project details.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 9-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program. Additional recognition will be given to the top three volunteers with exceptional presentations at the end of the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers ===
{| class="wikitable sortable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|Sahana Adusumilli
|BiomarkerKB
|Jeet Vora
|BiomarkerKB
|-
|Abhirama Chillara
|BiomarkerKB
|Jeet Vora, Maria Kim
|BiomarkerKB
|-
|Rhea Charles
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML
|-
|Sri Piramanayagam
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML, BiomarkerKB, GlyGen, BCO
|-
|Taylor Dimenna
|GlyGen
|Urnisha Bhuiyan, Rene Ranzinger
|GlyGen Biocuration Project
|-
|Daniel Auerbach
|GlyGen
|Urnisha Bhuiyan, Rene Ranzinger
|GlyGen Publication Analysis Project
|-
|Swapnaneel Chatterjee
|GlyGen
|Urnisha Bhuiyan, Sujeet Kulkarni
|New Project: GlycoChatbot Project
|-
|Neha Rao
|GlyGen
|Kate Warner, Robel Kahsay
|GlycoSiteMiner Curation Project
|-
|Navya Sinha
|GlyGen
|Kate Warner, Robel Kahsay
|GlycoSiteMiner Curation Project
|-
|John McCaffrey*
|GlyGen
|Kate Warner, Robel Kahsay
|GlycoSiteMiner Curation Project
|-
|Nahom Abel*
|GlyGen
|Kate Warner, Robel Kahsay
|GlycoSiteMiner Curation Project
|-
|Jovanna Aragon
|GlyGen
|Kate Warner, Robel Kahsay
|GlycoSiteMiner Curation Project
|-
|Mathias Belay*
|BCO User Research
|Lori Krammer, Pat McNeely
|BCO, GlycoSiteMiner, BiomarkerKB
|-
|Arjun Agnihothram
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML, BiomarkerKB, GlycoSiteMiner
|-
|Aryan Jagani<sup>†</sup>
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML
|-
|Cynthia Li
|BiomarkerKB
|Maria Kim
|BiomarkerKB, GlyGen, PredictMod
|-
|Vishal Muthusekaran<sup>‡</sup>
|BiomarkerKB
|Cyrus Au Yeung, Maria Kim
|BiomarkerKB
|-
|Dia Jhaveri<sup>†</sup>
|BiomarkerKB
|Maria Kim
|BiomarkerKB, GlycoSiteMiner, BCO, PredictMod, GlyGen
|-
|Arthur Issler
|BiomarkerKB
|Maria Kim, Jeet Vora
|BiomarkerKB, GlycoSiteMiner, GlyGen
|}
<nowiki>*</nowiki>Returning volunteer.

<sup>†</sup>GW Masters Degree Student

<sup>‡</sup>Not directly involved in the semester curriculum; long-term volunteer.

== Summer 2026 Symposium ==
The Summer symposium will be held virtually.

'''Date:''' TBD

'''Time:''' 4 - 6 PM

'''Zoom Link''' - TBA

=== Agenda (All times are in Eastern Time) ===
{| class="wikitable"
!Time
!Project
!Presentation Title
!Presenter(s)
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
| colspan="2" |
|
|}

Volunteership Summer 2026

2026-05-28T15:54:15Z

Urnisha.bhuiyan: /* Volunteers */

== 2026 Summer Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

May 15, 2026 | 12:00 PM ET

Please email your updated resume and your projects in order of preference. An acceptance letter/email will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

Date June 1, 2026 | 11:00 AM to 12:00 PM

'''Program Dates: June 1, 2026 – July 31, 2026''' (9 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[[Volunteership Spring 2026|Spring 2026 Volunteership]]

Presentation slides from the Spring 2026 volunteership symposium are publicly available on [https://zenodo.org/records/20072087 Zenodo] to highlight student research contributions from the program.
----

=== Volunteer Expectations ===

# Minimum commitment of 20 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Volunteers should be responsive to email/Slack communications.
# 30-minute Zoom meetings (during regular work hours) once a week or every other week with the assigned project point of contact (POC).
# Volunteers are expected to attend volunteership events such as a symposium.
# Attend some lectures or seminars remotely (max 4-5).
# This volunteership does not allow for vacation time.

'''''Important:''' '''If the scrum is not updated for 2 consecutive working days,''' '''the candidate will be automatically dropped from the program.'''''
----

=== Volunteership Support ===
Each group has dedicated Points of Contact (PoCs) who are your main resource for questions and guidance.

<u>How to Get Help</u>

''Slack Group Channel''

Use your group Slack channel as the primary place to ask questions and share ideas. This is strongly encouraged so everyone can learn together. Direct messages to PoCs are discouraged.

''Office Hours''

PoCs will host group office hours every two weeks once the program begins. These sessions are a space to ask questions, discuss ideas, and collaborate live.

<u>How to get support</u>

- Use the Slack channel as your first point of contact (if you are not yet in the Slack channel, then email your PoC at mazumder_lab AT gwu.edu)

- Follow up with your PoCs in the group channel

- Come prepared with questions for office hours

- Participate in discussions and support your peers

Our goal is to create an open, collaborative environment where everyone can learn and contribute.
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Summer 2026. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email ''mazumder_lab@gwu.edu'' your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.

# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curating PMIDs for intervention outcome prediction dataset LLM recommendation training.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen. <u>We are also looking for individuals who have previously worked with us to take on a coordinator role</u>.''
----

==== 1. GlycoSiteMiner Curation Project ====
POC: Kate Warner

GlycoSiteMiner (PMID: [https://pubmed.ncbi.nlm.nih.gov/40401984/ 40401984]) is a large language model (LLM)-based tool developed by the GlyGen team to automate a literature-mining pipeline that extracts experimentally validated, protein sequence–specific glycosylation sites from PubMed abstracts. By leveraging natural language processing, GlycoSiteMiner accelerates the identification of glycosylation evidence that would otherwise require extensive manual review.

The objective of this project is to validate these text-mined entries and curate them into structured datasets using GlyTableMaker (https://glygen.ccrc.uga.edu/tablemaker), a companion tool designed to support the deposition of glycans and glycoproteins, assignment of standardized metadata, and generation of high-quality Excel/CSV tables. This process ensures that extracted information is accurate, consistent, and suitable for integration into GlyGen’s knowledgebase.

This opportunity provides hands-on experience in biocuration workflows, including data validation, standardization, and quality control. Participants will deepen their understanding of glycobiology concepts, gain practical experience working with biological databases, and develop skills in evaluating and refining LLM-generated outputs for scientific applications.

==== 2. GlyGen Biocuration Project ====
POC: Rene Ranzinger and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding; however, the data contained within them remains highly valuable to the research community. Integrating these legacy datasets into modern databases or knowledgebases, such as GlyGen, presents a significant challenge because much of the associated metadata (e.g., species, tissue, disease, cell line) is recorded as free-text that does not conform to the standardized dictionaries and ontologies used by current resources.

To address this challenge, this project will leverage large language models (LLMs) to automate the mapping of free-text metadata from legacy databases, specifically CarbBank and CFG, to standardized accessions in authoritative resources such as NCBI Taxonomy, Disease Ontology, and Cellosaurus. The LLM-based workflow will identify and normalize synonyms, abbreviations, and spelling variants (e.g., “human,” “man,” or “h. sapiens” mapped to Homo sapiens), enabling scalable and reproducible metadata harmonization that would otherwise require extensive manual curation. The LLM tasks will be performed using OpenAI resources integrated into the GlyGen curation pipeline. The project involves the development of Python scripts to read and write data, invoke the OpenAI API, and compare results with manually curated data. Another aspect of the work is the development and fine-tuning of a prompt for ChatGPT to ensure that the mapping produced is reliable and accurate.

While the mapping process will be largely automated, manual validation will be incorporated as a quality-control step to assess model performance, verify correctness, and identify edge cases requiring refinement. This hybrid approach significantly reduces curator burden while ensuring high-quality, ontology-aligned annotations.
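The quality-control step described above can be pictured as a small comparison script. The following is a minimal sketch and not the GlyGen pipeline itself: the function name <code>compare_mappings</code>, the dictionary layout, and the example records are hypothetical, and the LLM call is left out so that only the comparison against manually curated data is shown.

```python
def compare_mappings(predicted, curated):
    """Compare LLM-predicted accessions with manually curated ones.

    Both arguments map a free-text term to an accession string.
    Only terms present in the curated set are scored.
    """
    mismatches = {}
    correct = 0
    for term, expected in curated.items():
        got = predicted.get(term)  # None if the model produced no mapping
        if got == expected:
            correct += 1
        else:
            mismatches[term] = (got, expected)
    accuracy = correct / len(curated) if curated else 0.0
    return accuracy, mismatches


# Hypothetical example records (NCBI Taxonomy IDs: 9606 Homo sapiens,
# 10090 Mus musculus, 10116 Rattus norvegicus).
curated = {"human": "9606", "man": "9606", "h. sapiens": "9606", "mouse": "10090"}
predicted = {"human": "9606", "man": "9606", "h. sapiens": "9606", "mouse": "10116"}

accuracy, mismatches = compare_mappings(predicted, curated)
print(f"accuracy: {accuracy:.2f}")
for term, (got, expected) in mismatches.items():
    print(f"review needed: {term!r} -> {got} (curated: {expected})")
```

Terms listed under "review needed" are the ones a curator would inspect by hand, which is where edge cases for prompt refinement are likely to surface.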

The goal of this effort is to migrate and modernize datasets from CarbBank and CFG, making them interoperable with GlyGen and other contemporary glycoinformatics resources through a scalable, AI-assisted curation strategy.

Two projects:

# Taking predicted sites and curating them using table maker
# Website testing (all volunteers)

For any questions, please contact Rene Ranzinger (rene@ccrc.uga.edu) or Kate Warner (k.warner1@email.gwu.edu).

==== 3. GlyGen Publication Analysis Project ====
POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors
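As an illustration of the first step, a keyword query can be sent to PubMed through the NCBI E-utilities ESearch endpoint. The sketch below is one possible starting point, not a prescribed approach: the helper names <code>build_esearch_url</code> and <code>parse_pmids</code> and the chosen keywords are assumptions, and the sample response is hand-written so that no network request is made.

```python
import json
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(keywords, retmax=100):
    """Build an ESearch URL that requires every keyword in the title or abstract."""
    term = " AND ".join(f'"{kw}"[Title/Abstract]' for kw in keywords)
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_pmids(response_text):
    """Extract the list of PMIDs from an ESearch JSON response."""
    data = json.loads(response_text)
    return data.get("esearchresult", {}).get("idlist", [])


# Hypothetical keywords, chosen only for illustration.
url = build_esearch_url(["GlyGen", "glycan"], retmax=20)
print(url)

# Hand-written sample response, used here instead of a live request.
sample = '{"esearchresult": {"count": "1", "idlist": ["40401984"]}}'
print(parse_pmids(sample))
```

The returned PMIDs would then feed the abstract analysis and co-author filtering steps.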

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning (ML) Modeling Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct ML modeling using publicly available -omics datasets that were previously identified (see our [[Recommended Publications for Intervention Outcome Prediction Models|Recommended Publications for IOPMs]] page). This volunteership will involve data harmonization, model training, and pipeline documentation.

Tasks associated with this project include:

# Exploring and understanding the data found in relevant PMIDs that can be used to train intervention outcome prediction models.
# Preparing the data for model training and model performance evaluation
# Testing the modeling tutorial, PredictMod platform, and associated project tools
# Documentation of the ML pipeline and testing results

Deliverables for this project include:

# ML-ready datasets & trained model scripts pushed to GitHub
# Pipeline documentation captured in BioCompute Objects (BCOs) and testing reports
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 5. BioCompute Objects User Research Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct individual audits and user research to improve the human readability of BioCompute Objects (BCOs) and the project documentation. This volunteership will involve user research, prototyping, and documentation.

Tasks associated with the project include:

# Reviewing existing documentation to gain a comprehensive understanding of BioCompute Objects, their relevance to bioinformatics, and key user personas. The volunteer will identify and report gaps in the current documentation.
# Conducting user research to understand pain points and desired outcomes. The volunteer will develop user stories based on interviews with BCO users.
# Prototyping improvements to the BCO documentation and/or portal based on user stories. This could involve visual diagrams, wiki restructuring, or decision logs.

Deliverables will include:

# User research report with user story maps
# BCO documentation improvement plan
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 6. BiomarkerKB Biocuration Project ====
POC: Jeet Vora (primary), Maria Kim, Cyrus Au-Yeung

[https://biomarkerkb.org/about/ BiomarkerKB] is a biomedical knowledgebase project focused on harmonizing and structuring biomarker knowledge from scientific literature and public resources. We are currently recruiting individuals with experience working with LLMs (e.g. Claude, ChatGPT) to support the following tasks:

# '''Validation of existing published biomarkers from scientific literature (JV, MK, CA)'''
#* Review and validate previously reported biomarkers by checking the original literature, confirming evidence support, and standardizing biomarker annotations
#* Assess the evidence strength of biomarkers and identify additional literature to strengthen the support for biomarker claims
# '''Curation of novel biomarkers from scientific literature (MK)'''
#* Curate high-quality biomarkers for a selected disease area, organize the findings into a structured dataset
#* Standardize biomarker representations using controlled vocabularies and ontologies and classify biomarkers by their biomarker types
#* Construct and test-query a disease-specific biomarker knowledge graph (optional)
# '''Electronic Health Records Normal Entity Data Integration (JV)'''
#* Identify relevant EHR data elements (lab tests, diagnoses, procedures)
#* Map entities to standard terminologies (e.g., SNOMED CT, LOINC, ICD codes)
#* Resolve ambiguities and inconsistencies in mapping and clinical terminology
# '''Front-end testing for BiomarkerKB.org (MK, JV)'''
#* Test the BiomarkerKB web interface for functionality and data presentation, and document issues / improvement suggestions for the development team
# '''Benchmarking and LLM-based biomarker extraction (optional*) (CA)'''
#* Construct manually curated biomarker reference sets in the glycobiology domain to support benchmarking of LLM-based knowledge extraction pipelines.
#* Apply an LLM workflow to extract disease-specific biomarkers from literature and compare model outputs against the manually curated benchmark sets
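
For the benchmarking task, one common way to compare model outputs against a manually curated reference set is precision, recall, and F1 over biomarker names. This is a minimal sketch under stated assumptions: the function name <code>score_extraction</code> is hypothetical, names are matched by case-insensitive exact string comparison (a real benchmark would likely need synonym and identifier handling), and the example lists are placeholders rather than project data.

```python
def score_extraction(extracted, reference):
    """Return precision, recall, and F1 for extracted vs. reference biomarkers."""
    extracted_set = {name.strip().lower() for name in extracted}
    reference_set = {name.strip().lower() for name in reference}
    true_positives = len(extracted_set & reference_set)
    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    recall = true_positives / len(reference_set) if reference_set else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Placeholder lists for illustration only.
reference = ["CA19-9", "CEA", "AFP", "PSA"]
extracted = ["ca19-9", "CEA ", "HER2"]

precision, recall, f1 = score_extraction(extracted, reference)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```
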

''Note:'' Participation in the benchmarking and LLM-based biomarker extraction subproject depends on sufficient progress in either task 1 or task 2. Volunteers are expected to first complete either validation of an LLM-extracted glycobiology subset or comprehensive curation of a disease-specific biomarker set before beginning this component. Because this volunteership is structured around a 20-hour-per-week commitment, participation in this part is not guaranteed.

Individuals interested in this opportunity may reach out to Jeet Vora ([mailto:jeetvora@gwu.edu jeetvora@gwu.edu]) for project details.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 9-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program. Additional recognition will be given to the top three volunteers with exceptional presentations at the end of the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers ===
{| class="wikitable sortable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|Sahana Adusumilli
|BiomarkerKB
|Jeet Vora
|BiomarkerKB
|-
|Abhirama Chillara
|BiomarkerKB
|Jeet Vora, Maria Kim
|BiomarkerKB
|-
|Rhea Charles
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML
|-
|Sri Piramanayagam
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML, BiomarkerKB, GlyGen, BCO
|-
|Taylor Dimenna
|GlyGen
|Urnisha Bhuiyan, Rene Ranzinger
|GlyGen Biocuration Project
|-
|Daniel Auerbach
|GlyGen
|Urnisha Bhuiyan, Rene Ranzinger
|GlyGen Publication Analysis Project
|-
|Swapnaneel Chatterjee
|GlyGen
|Urnisha Bhuiyan, Sujeet Kulkarni
|New Project: GlycoChatbot Project
|-
|Neha Rao
|GlyGen
|Kate Warner, Robel Kahsay
|GlycoSiteMiner Curation Project
|-
|Navya Sinha
|GlyGen
|Kate Warner, Robel Kahsay
|GlycoSiteMiner Curation Project
|-
|Nahom Abel*
|GlyGen
|Kate Warner, Robel Kahsay
|GlycoSiteMiner Curation Project
|-
|John McCaffrey
|GlyGen
|Kate Warner, Robel Kahsay
|GlycoSiteMiner Curation Project
|-
|Mathias Belay*
|BCO User Research
|Lori Krammer, Pat McNeely
|BCO, GlycoSiteMiner, BiomarkerKB
|-
|Arjun Agnihothram
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML, BiomarkerKB, GlycoSiteMiner
|-
|Aryan Jagani<sup>†</sup>
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML
|-
|Cynthia Li
|BiomarkerKB
|Maria Kim
|BiomarkerKB, GlyGen, PredictMod
|-
|Vishal Muthusekaran<sup>‡</sup>
|BiomarkerKB
|Cyrus Au Yeung, Maria Kim
|BiomarkerKB
|-
|Dia Jhaveri
|BiomarkerKB
|Maria Kim
|BiomarkerKB, GlycoSiteMiner, BCO, PredictMod, GlyGen
|-
|Arthur Issler
|BiomarkerKB
|Maria Kim, Jeet Vora
|BiomarkerKB, GlycoSiteMiner, GlyGen
|}
<nowiki>*</nowiki>Returning volunteer.

<sup>†</sup>Masters Degree Student

<sup>‡</sup>Not directly involved in the semester curriculum; long-term volunteer.

== Summer 2026 Symposium ==
The Summer symposium will be held virtually.

'''Date:''' TBD

'''Time:''' 4 - 6 PM

'''Zoom Link''' - TBA

=== Agenda (All times are in Eastern Time) ===
{| class="wikitable"
!Time
!Project
!Presentation Title
!Presenter(s)
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
| colspan="2" |
|
|}

Volunteership Summer 2026

2026-05-28T15:47:03Z

Urnisha.bhuiyan: /* Volunteers */

== 2026 Summer Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

May 15, 2026 | 12:00 PM ET

Please email your updated resume and your projects in order of preference. An acceptance letter/email will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

Date TBD | 11:00 AM to 12:00 PM

'''Program Dates: June 1, 2026 – July 31, 2026''' (9 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[[Volunteership Spring 2026|Spring 2026 Volunteership]]

Presentation slides from the Spring 2026 volunteership symposium are publicly available on [https://zenodo.org/records/20072087 Zenodo] to highlight student research contributions from the program.
----

=== Volunteer Expectations ===

# Minimum commitment of 20 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Volunteers should be responsive to email/Slack communications.
# 30-minute Zoom meetings (during regular work hours) once a week or every other week with the assigned project point of contact (POC).
# Volunteers are expected to attend volunteership events such as a symposium.
# Attend some lectures or seminars remotely (max 4-5).
# This volunteership does not allow for vacation time.

'''''Important:''' '''If the scrum is not updated for 2 consecutive working days,''' '''the candidate will be automatically dropped from the program.'''''
----

=== Volunteership Support ===
Each group has dedicated Points of Contact (PoCs) who are your main resource for questions and guidance.

<u>How to Get Help</u>

''Slack Group Channel''

Use your group Slack channel as the primary place to ask questions and share ideas. This is strongly encouraged so everyone can learn together. Direct messages to PoCs are discouraged.

''Office Hours''

PoCs will host group office hours every two weeks once the program begins. These sessions are a space to ask questions, discuss ideas, and collaborate live.

<u>How to get support</u>

- Use the Slack channel as your first point of contact (if you are not yet in the Slack channel, then email your PoC at mazumder_lab AT gwu.edu)

- Follow up with your PoCs in the group channel

- Come prepared with questions for office hours

- Participate in discussions and support your peers

Our goal is to create an open, collaborative environment where everyone can learn and contribute.
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Summer 2026. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email ''mazumder_lab@gwu.edu'' your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.

# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curating PMIDs for intervention outcome prediction dataset LLM recommendation training.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen. <u>We are also looking for individuals who have previously worked with us to take on a coordinator role</u>.''
----

==== 1. GlycoSiteMiner Curation Project ====
POC: Kate Warner

GlycoSiteMiner (PMID: [https://pubmed.ncbi.nlm.nih.gov/40401984/ 40401984]) is a large language model (LLM)-based tool developed by the GlyGen team to automate a literature-mining pipeline that extracts experimentally validated, protein sequence–specific glycosylation sites from PubMed abstracts. By leveraging natural language processing, GlycoSiteMiner accelerates the identification of glycosylation evidence that would otherwise require extensive manual review.

The objective of this project is to validate these text-mined entries and curate them into structured datasets using GlyTableMaker (https://glygen.ccrc.uga.edu/tablemaker), a companion tool designed to support the deposition of glycans and glycoproteins, assignment of standardized metadata, and generation of high-quality Excel/CSV tables. This process ensures that extracted information is accurate, consistent, and suitable for integration into GlyGen’s knowledgebase.

This opportunity provides hands-on experience in biocuration workflows, including data validation, standardization, and quality control. Participants will deepen their understanding of glycobiology concepts, gain practical experience working with biological databases, and develop skills in evaluating and refining LLM-generated outputs for scientific applications.

==== 2. GlyGen Biocuration Project ====
POC: Rene Ranzinger and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding; however, the data contained within them remains highly valuable to the research community. Integrating these legacy datasets into modern databases or knowledgebases, such as GlyGen, presents a significant challenge because much of the associated metadata (e.g., species, tissue, disease, cell line) is recorded as free-text that does not conform to the standardized dictionaries and ontologies used by current resources.

To address this challenge, this project will leverage large language models (LLMs) to automate the mapping of free-text metadata from legacy databases, specifically CarbBank and CFG, to standardized accessions in authoritative resources such as NCBI Taxonomy, Disease Ontology, and Cellosaurus. The LLM-based workflow will identify and normalize synonyms, abbreviations, and spelling variants (e.g., “human,” “man,” or “h. sapiens” mapped to Homo sapiens), enabling scalable and reproducible metadata harmonization that would otherwise require extensive manual curation. The LLM tasks will be performed using OpenAI resources integrated into the GlyGen curation pipeline. The project involves the development of Python scripts to read and write data, invoke the OpenAI API, and compare results with manually curated data. Another aspect of the work is the development and fine-tuning of a prompt for ChatGPT to ensure that the mapping produced is reliable and accurate.

While the mapping process will be largely automated, manual validation will be incorporated as a quality-control step to assess model performance, verify correctness, and identify edge cases requiring refinement. This hybrid approach significantly reduces curator burden while ensuring high-quality, ontology-aligned annotations.

The goal of this effort is to migrate and modernize datasets from CarbBank and CFG, making them interoperable with GlyGen and other contemporary glycoinformatics resources through a scalable, AI-assisted curation strategy.

Two projects:

# Taking predicted sites and curating them using table maker
# Website testing (all volunteers)

For any questions, please contact Rene Ranzinger (rene@ccrc.uga.edu) or Kate Warner (k.warner1@email.gwu.edu).

==== 3. GlyGen Publication Analysis Project ====
POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning (ML) Modeling Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct ML modeling using publicly available -omics datasets that were previously identified (see our [[Recommended Publications for Intervention Outcome Prediction Models|Recommended Publications for IOPMs]] page). This volunteership will involve data harmonization, model training, and pipeline documentation.

Tasks associated with this project include:

# Exploring and understanding the data found in relevant PMIDs that can be used to train intervention outcome prediction models.
# Preparing the data for model training and model performance evaluation
# Testing the modeling tutorial, PredictMod platform, and associated project tools
# Documentation of the ML pipeline and testing results

Deliverables for this project include:

# ML-ready datasets & trained model scripts pushed to GitHub
# Pipeline documentation captured in BioCompute Objects (BCOs) and testing reports
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 5. BioCompute Objects User Research Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct individual audits and user research to improve the human readability of BioCompute Objects (BCOs) and the project documentation. This volunteership will involve user research, prototyping, and documentation.

Tasks associated with the project include:

# Reviewing existing documentation to gain a comprehensive understanding of BioCompute Objects, their relevance to bioinformatics, and key user personas. The volunteer will identify and report gaps in the current documentation.
# Conducting user research to understand pain points and desired outcomes. The volunteer will develop user stories based on interviews with BCO users.
# Prototyping improvements to the BCO documentation and/or portal based on user stories. This could involve visual diagrams, wiki restructuring, or decision logs.

Deliverables will include:

# User research report with user story maps
# BCO documentation improvement plan
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 6. BiomarkerKB Biocuration Project ====
POC: Jeet Vora (primary), Maria Kim, Cyrus Au-Yeung

[https://biomarkerkb.org/about/ BiomarkerKB] is a biomedical knowledgebase project focused on harmonizing and structuring biomarker knowledge from scientific literature and public resources. We are currently recruiting individuals with experience working with LLMs (e.g. Claude, ChatGPT) to support the following tasks:

# '''Validation of existing published biomarkers from scientific literature (JV, MK, CA)'''
#* Review and validate previously reported biomarkers by checking the original literature, confirming evidence support, and standardizing biomarker annotations
#* Assess the evidence strength of biomarkers and identify additional literature to strengthen the support for biomarker claims
# '''Curation of novel biomarkers from scientific literature (MK)'''
#* Curate high-quality biomarkers for a selected disease area, organize the findings into a structured dataset
#* Standardize biomarker representations using controlled vocabularies and ontologies and classify biomarkers by their biomarker types
#* Construct and test-query a disease-specific biomarker knowledge graph (optional)
# '''Electronic Health Records Normal Entity Data Integration (JV)'''
#* Identify relevant EHR data elements (lab tests, diagnoses, procedures)
#* Map entities to standard terminologies (e.g., SNOMED CT, LOINC, ICD codes)
#* Resolve ambiguities and inconsistencies in mapping and clinical terminology
# '''Front-end testing for BiomarkerKB.org (MK, JV)'''
#* Test the BiomarkerKB web interface for functionality and data presentation, and document issues / improvement suggestions for the development team
# '''Benchmarking and LLM-based biomarker extraction (optional*) (CA)'''
#* Construct manually curated biomarker reference sets in the glycobiology domain to support benchmarking of LLM-based knowledge extraction pipelines.
#* Apply an LLM workflow to extract disease-specific biomarkers from literature and compare model outputs against the manually curated benchmark sets

''Note:'' Participation in the benchmarking and LLM-based biomarker extraction subproject depends on sufficient progress in either task 1 or task 2. Volunteers are expected to first complete either validation of an LLM-extracted glycobiology subset or comprehensive curation of a disease-specific biomarker set before beginning this component. Because this volunteership is structured around a 20-hour-per-week commitment, participation in this part is not guaranteed.

Individuals interested in this opportunity may reach out to Jeet Vora ([mailto:jeetvora@gwu.edu jeetvora@gwu.edu]) for project details.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 9-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program. Additional recognition will be given to the top three volunteers with exceptional presentations at the end of the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers ===
{| class="wikitable sortable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|Sahana Adusumilli
|BiomarkerKB
|Jeet Vora
|BiomarkerKB
|-
|Abhirama Chillara
|BiomarkerKB
|Jeet Vora, Maria Kim
|BiomarkerKB
|-
|Rhea Charles
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML
|-
|Sri Piramanayagam
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML, BiomarkerKB, GlyGen, BCO
|-
|Taylor Dimenna
|GlyGen
|Urnisha Bhuiyan, Rene Ranzinger
|GlyGen Biocuration Project
|-
|Daniel Auerbach
|GlyGen
|Urnisha Bhuiyan, Rene Ranzinger
|GlyGen Publication Analysis Project
|-
|Swapnaneel Chatterjee
|GlyGen
|Urnisha Bhuiyan, Rene Ranzinger
|New Project: GlycoChatbot Project
|-
|Neha Rao
|GlyGen
|Kate Warner, Robel Kahsay
|GlycoSiteMiner Curation Project
|-
|Navya Sinha
|GlyGen
|Kate Warner, Robel Kahsay
|GlycoSiteMiner Curation Project
|-
|Nahom Abel*
|GlyGen
|Kate Warner, Robel Kahsay
|GlycoSiteMiner Curation Project
|-
|John McCaffrey
|GlyGen
|Kate Warner, Robel Kahsay
|GlycoSiteMiner Curation Project
|-
|Mathias Belay*
|BCO User Research
|Lori Krammer, Pat McNeely
|BCO, GlycoSiteMiner, BiomarkerKB
|-
|Arjun Agnihothram
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML, BiomarkerKB, GlycoSiteMiner
|-
|Aryan Jagani<sup>†</sup>
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML
|-
|Cynthia Li
|BiomarkerKB
|Maria Kim
|BiomarkerKB, GlyGen, PredictMod
|-
|Vishal Muthusekaran<sup>‡</sup>
|BiomarkerKB
|Cyrus Au-Yeung, Maria Kim
|BiomarkerKB
|-
|Dia Jhaveri
|BiomarkerKB
|Maria Kim
|BiomarkerKB, GlycoSiteMiner, BCO, PredictMod, GlyGen
|-
|Arthur Issler
|BiomarkerKB
|Maria Kim, Jeet Vora
|BiomarkerKB, GlycoSiteMiner, GlyGen
|}
<nowiki>*</nowiki>Returning volunteer.

<sup>†</sup>Master's degree student.

<sup>‡</sup>Not directly involved in the semester curriculum; long-term volunteer.

== Summer 2026 Symposium ==
The Summer symposium will be held virtually.

'''Date:''' TBD

'''Time:''' 4 - 6 PM

'''Zoom Link''' - TBA

=== Agenda (All times are in Eastern Time) ===
{| class="wikitable"
!Time
!Project
!Presentation Title
!Presenter(s)
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
| colspan="2" |
|
|}

Volunteership Summer 2026

2026-05-28T15:25:46Z

Urnisha.bhuiyan: /* Volunteers */

== 2026 Summer Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

May 15, 2026 | 12:00 PM ET

Please email your updated resume and a list of projects in order of preference. An acceptance letter/email will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

Date TBD | 11:00 AM to 12:00 PM

'''Program Dates: June 1, 2026 – July 31, 2026''' (9 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[[Volunteership Spring 2026|Spring 2026 Volunteership]]

Presentation slides from the Spring 2026 volunteership symposium are publicly available on [https://zenodo.org/records/20072087 Zenodo] to highlight student research contributions from the program.
----

=== Volunteer Expectations ===

# Minimum commitment of 20 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Volunteers should be responsive to email/Slack communications.
# 30-minute Zoom meetings (during regular work hours) once a week or every other week with the assigned project point of contact (POC).
# Volunteers are expected to attend volunteership events such as a symposium.
# Attend some lectures or seminars remotely (max 4-5).
# This volunteership does not allow for vacation time.

'''''Important:''' '''If the scrum is not updated for 2 consecutive working days,''' '''the candidate will be automatically dropped from the program.'''''
----

=== Volunteership Support ===
Each group has dedicated Points of Contact (PoCs) who are your main resource for questions and guidance.

<u>How to Get Help</u>

''Slack Group Channel''

Use your group Slack channel as the primary place to ask questions and share ideas. This is strongly encouraged so everyone can learn together. Direct messages to PoCs are discouraged.

''Office Hours''

PoCs will host group office hours every two weeks once the program begins. These sessions are a space to ask questions, discuss ideas, and collaborate live.

<u>How to get support</u>

- Use the Slack channel as your first point of contact (if you are not yet in the Slack channel, then email your PoC at mazumder_lab AT gwu.edu)

- Follow up with your PoCs in the group channel

- Come prepared with questions for office hours

- Participate in discussions and support your peers

Our goal is to create an open, collaborative environment where everyone can learn and contribute.
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Summer 2026. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email ''mazumder_lab@gwu.edu'' your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.

# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curating PMIDs for intervention outcome prediction dataset LLM recommendation training.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen. <u>We are also looking for individuals who have previously worked with us to take on a coordinator role</u>.''
----

==== 1. GlycoSiteMiner Curation Project ====
POC: Kate Warner

GlycoSiteMiner (PMID: [https://pubmed.ncbi.nlm.nih.gov/40401984/ 40401984]) is a large language model (LLM)-based tool developed by the GlyGen team to automate a literature-mining pipeline that extracts experimentally validated, protein sequence–specific glycosylation sites from PubMed abstracts. By leveraging natural language processing, GlycoSiteMiner accelerates the identification of glycosylation evidence that would otherwise require extensive manual review.

The objective of this project is to validate these text-mined entries and curate them into structured datasets using GlyTableMaker (https://glygen.ccrc.uga.edu/tablemaker), a companion tool designed to support the deposition of glycans and glycoproteins, assignment of standardized metadata, and generation of high-quality Excel/CSV tables. This process ensures that extracted information is accurate, consistent, and suitable for integration into GlyGen’s knowledgebase.

This opportunity provides hands-on experience in biocuration workflows, including data validation, standardization, and quality control. Participants will deepen their understanding of glycobiology concepts, gain practical experience working with biological databases, and develop skills in evaluating and refining LLM-generated outputs for scientific applications.

==== 2. GlyGen Biocuration Project ====
POC: Rene Ranzinger and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding; however, the data contained within them remains highly valuable to the research community. Integrating these legacy datasets into modern databases or knowledgebases, such as GlyGen, presents a significant challenge because much of the associated metadata (e.g., species, tissue, disease, cell line) is recorded as free-text that does not conform to the standardized dictionaries and ontologies used by current resources.

To address this challenge, this project will leverage large language models (LLMs) to automate the mapping of free-text metadata from legacy databases, specifically CarbBank and CFG, to standardized accessions in authoritative resources such as NCBI Taxonomy, Disease Ontology, and Cellosaurus. The LLM-based workflow will identify and normalize synonyms, abbreviations, and spelling variants (e.g., “human,” “man,” or “h. sapiens” mapped to Homo sapiens), enabling scalable and reproducible metadata harmonization that would otherwise require extensive manual curation. The LLM tasks will be performed using OpenAI resources integrated into the GlyGen curation pipeline. The project involves the development of Python scripts to read and write data, invoke the OpenAI API, and compare results with manually curated data. Another aspect of the work is the development and fine-tuning of a prompt for ChatGPT to ensure that reliable and accurate mappings are produced.
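
As a rough illustration of the comparison step only, the sketch below scores a set of predicted mappings against manually curated ones. It is a minimal sketch, not the GlyGen pipeline: the function and class names are hypothetical, the example predictions are made up for illustration, and no OpenAI call is made. The only external fact assumed is that NCBI Taxonomy ID 9606 denotes Homo sapiens.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MappingResult:
    """One free-text term with its predicted and curated accession."""
    free_text: str
    predicted: str
    curated: str

    @property
    def matches(self) -> bool:
        # Compare accessions case-insensitively, ignoring stray whitespace.
        return self.predicted.strip().lower() == self.curated.strip().lower()


def compare_mappings(predicted: dict[str, str], curated: dict[str, str]) -> list[MappingResult]:
    """Pair every curated term with the model's prediction ('' if none)."""
    return [
        MappingResult(term, predicted.get(term, ""), accession)
        for term, accession in curated.items()
    ]


def agreement_rate(results: list[MappingResult]) -> float:
    """Fraction of terms where the prediction equals the curated accession."""
    if not results:
        return 0.0
    return sum(r.matches for r in results) / len(results)


# Hypothetical example data: curated NCBI Taxonomy IDs vs. made-up predictions.
curated = {"human": "9606", "man": "9606", "h. sapiens": "9606"}
predicted = {"human": "9606", "man": "9606", "h. sapiens": "9605"}

results = compare_mappings(predicted, curated)
mismatches = [r.free_text for r in results if not r.matches]
print(f"agreement: {agreement_rate(results):.2f}, to review: {mismatches}")
```

Terms listed as mismatches are the natural candidates for manual review or prompt refinement.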

While the mapping process will be largely automated, manual validation will be incorporated as a quality-control step to assess model performance, verify correctness, and identify edge cases requiring refinement. This hybrid approach significantly reduces curator burden while ensuring high-quality, ontology-aligned annotations.

The goal of this effort is to migrate and modernize datasets from CarbBank and CFG, making them interoperable with GlyGen and other contemporary glycoinformatics resources through a scalable, AI-assisted curation strategy.

Two projects:

# Taking predicted sites and curating them using table maker
# Website testing (all volunteers)

For any questions, please contact Rene Ranzinger (rene@ccrc.uga.edu) or Kate Warner (k.warner1@email.gwu.edu).

==== 3. GlyGen Publication Analysis Project ====
POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
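
As a possible starting point for the first step, the sketch below builds a keyword query URL for the NCBI E-utilities <code>esearch</code> endpoint, which is one way to search PubMed programmatically. It is a minimal sketch under assumptions: the keywords and helper names are illustrative and not part of the project specification, and the code only constructs the URL without sending a request.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (returns matching PubMed IDs).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(keywords, match_all=False):
    """Join keywords into a PubMed query restricted to title and abstract."""
    operator = " AND " if match_all else " OR "
    return operator.join(f'"{kw}"[Title/Abstract]' for kw in keywords)


def build_esearch_url(keywords, retmax=100, match_all=False):
    """Return an esearch URL asking for up to `retmax` PubMed IDs as JSON."""
    params = {
        "db": "pubmed",
        "term": build_pubmed_query(keywords, match_all),
        "retmode": "json",
        "retmax": retmax,
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Illustrative keywords only; the real keyword list is up to the project team.
url = build_esearch_url(["GlyGen", "glycoinformatics"])
print(url)
```

The returned PubMed IDs could then be passed to a follow-up request for abstracts and affiliations, which is where the institution analysis in step 2 would begin.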

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning (ML) Modeling Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct ML modeling using publicly available -omics datasets that were previously identified (see our [[Recommended Publications for Intervention Outcome Prediction Models|Recommended Publications for IOPMs]] page). This volunteership will involve data harmonization, model training, and pipeline documentation.

Tasks associated with this project include:

# Exploring and understanding the data found in relevant PMIDs that can be used to train intervention outcome prediction models.
# Preparing the data for model training and model performance evaluation
# Testing the modeling tutorial, PredictMod platform, and associated project tools
# Documentation of the ML pipeline and testing results

Deliverables for this project include:

# ML-ready datasets & trained model scripts pushed to GitHub
# Pipeline documentation captured in BioCompute Objects (BCOs) and testing reports
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 5. BioCompute Objects User Research Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct individual audits and user research to improve the human readability of BioCompute Objects (BCOs) and the project documentation. This volunteership will involve user research, prototyping, and documentation.

Tasks associated with the project include:

# Reviewing existing documentation to gain a comprehensive understanding of BioCompute Objects, their relevance to bioinformatics, and key user personas. The volunteer will identify and report gaps in the current documentation.
# Conducting user research to understand pain points and desired outcomes. The volunteer will develop user stories based on interviews with BCO users.
# Prototyping improvements to the BCO documentation and/or portal based on user stories. This could involve visual diagrams, wiki restructuring, or decision logs.

Deliverables will include:

# User research report with user story maps
# BCO documentation improvement plan
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 6. BiomarkerKB Biocuration Project ====
POC: Jeet Vora (primary), Maria Kim, Cyrus Au-Yeung

[https://biomarkerkb.org/about/ BiomarkerKB] is a biomedical knowledgebase project focused on harmonizing and structuring biomarker knowledge from scientific literature and public resources. We are currently recruiting individuals with experience working with LLMs (e.g. Claude, ChatGPT) to support the following tasks:

# '''Validation of existing published biomarkers from scientific literature (JV, MK, CA)'''
#* Review and validate previously reported biomarkers by checking the original literature, confirming evidence support, and standardizing biomarker annotations
#* Assess the evidence strength of biomarkers and identify additional literature to strengthen the support for biomarker claims
# '''Curation of novel biomarkers from scientific literature (MK)'''
#* Curate high-quality biomarkers for a selected disease area, organize the findings into a structured dataset
#* Standardize biomarker representations using controlled vocabularies and ontologies and classify biomarkers by their biomarker types
#* Construct and test-query a disease-specific biomarker knowledge graph (optional)
# '''Electronic Health Records Normal Entity Data Integration (JV)'''
#* Identify relevant EHR data elements (lab tests, diagnoses, procedures)
#* Map entities to standard terminologies (e.g., SNOMED CT, LOINC, ICD codes)
#* Resolve ambiguities and inconsistencies in mapping and clinical terminology
# '''Front-end testing for BiomarkerKB.org (MK, JV)'''
#* Test the BiomarkerKB web interface for functionality and data presentation, and document issues / improvement suggestions for the development team
# '''Benchmarking and LLM-based biomarker extraction (optional*) (CA)'''
#* Construct manually curated biomarker reference sets in the glycobiology domain to support benchmarking of LLM-based knowledge extraction pipelines.
#* Apply an LLM workflow to extract disease-specific biomarkers from literature and compare model outputs against the manually curated benchmark sets

''Note:'' Participation in the benchmarking and LLM-based biomarker extraction subproject depends on sufficient progress in either task 1 or task 2. Volunteers are expected to first complete either validation of an LLM-extracted glycobiology subset or comprehensive curation of a disease-specific biomarker set before beginning this component. Because this volunteership is structured around a 20-hour-per-week commitment, participation in this part is not guaranteed.

Individuals interested in this opportunity may reach out to Jeet Vora ([mailto:jeetvora@gwu.edu jeetvora@gwu.edu]) for project details.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 9-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program. Additional recognition will be given to the top three volunteers with exceptional presentations at the end of the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers ===
{| class="wikitable sortable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|Sahana Adusumilli
|BiomarkerKB
|Jeet Vora
|BiomarkerKB
|-
|Abhirama Chillara
|BiomarkerKB
|Jeet Vora, Maria Kim
|BiomarkerKB
|-
|Rhea Charles
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML
|-
|Sri Piramanayagam
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML, BiomarkerKB, GlyGen, BCO
|-
|Taylor Dimenna
|GlyGen
|Urnisha Bhuiyan
|GlyGen Biocuration Project
|-
|Daniel Auerbach
|GlyGen
|Urnisha Bhuiyan
|GlyGen Publication Analysis Project
|-
|Swapnaneel Chatterjee
|GlyGen
|Urnisha Bhuiyan
|New Project: GlycoChatbot Project
|-
|Neha Rao
|GlyGen
|Kate Warner
|GlycoSiteMiner Curation Project
|-
|Navya Sinha
|GlyGen
|Kate Warner
|GlycoSiteMiner Curation Project
|-
|Nahom Abel*
|GlyGen
|Kate Warner
|GlycoSiteMiner Curation Project
|-
|John McCaffrey
|GlyGen
|Kate Warner
|GlycoSiteMiner Curation Project
|-
|Mathias Belay*
|BCO User Research
|Lori Krammer, Pat McNeely
|BCO, GlycoSiteMiner, BiomarkerKB
|-
|Arjun Agnihothram
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML, BiomarkerKB, GlycoSiteMiner
|-
|Aryan Jagani<sup>†</sup>
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML
|-
|Cynthia Li
|BiomarkerKB
|Maria Kim
|BiomarkerKB, GlyGen, PredictMod
|-
|Vishal Muthusekaran<sup>‡</sup>
|BiomarkerKB
|Cyrus Au-Yeung, Maria Kim
|BiomarkerKB
|-
|Dia Jhaveri
|BiomarkerKB
|Maria Kim
|BiomarkerKB, GlycoSiteMiner, BCO, PredictMod, GlyGen
|-
|Arthur Issler
|BiomarkerKB
|Maria Kim, Jeet Vora
|BiomarkerKB, GlycoSiteMiner, GlyGen
|}
<nowiki>*</nowiki>Returning volunteer.

<sup>†</sup>Master's degree student.

<sup>‡</sup>Not directly involved in the semester curriculum; long-term volunteer.

== Summer 2026 Symposium ==
The Summer symposium will be held virtually.

'''Date:''' TBD

'''Time:''' 4 - 6 PM

'''Zoom Link''' - TBA

=== Agenda (All times are in Eastern Time) ===
{| class="wikitable"
!Time
!Project
!Presentation Title
!Presenter(s)
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
| colspan="2" |
|
|}

Volunteership Summer 2026

2026-05-22T13:51:49Z

Urnisha.bhuiyan: /* Volunteers */

== 2026 Summer Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

May 15, 2026 | 12:00 PM ET

Please email your updated resume and a list of projects in order of preference. An acceptance letter/email will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

Date TBD | 11:00 AM to 12:00 PM

'''Program Dates: June 1, 2026 – July 31, 2026''' (9 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[[Volunteership Spring 2026|Spring 2026 Volunteership]]

Presentation slides from the Spring 2026 volunteership symposium are publicly available on [https://zenodo.org/records/20072087 Zenodo] to highlight student research contributions from the program.
----

=== Volunteer Expectations ===

# Minimum commitment of 20 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Volunteers should be responsive to email/Slack communications.
# 30-minute Zoom meetings (during regular work hours) once a week or every other week with the assigned project point of contact (POC).
# Volunteers are expected to attend volunteership events such as a symposium.
# Attend some lectures or seminars remotely (max 4-5).
# This volunteership does not allow for vacation time.

'''''Important:''' '''If the scrum is not updated for 2 consecutive working days,''' '''the candidate will be automatically dropped from the program.'''''
----

=== Volunteership Support ===
Each group has dedicated Points of Contact (PoCs) who are your main resource for questions and guidance.

<u>How to Get Help</u>

''Slack Group Channel''

Use your group Slack channel as the primary place to ask questions and share ideas. This is strongly encouraged so everyone can learn together. Direct messages to PoCs are discouraged.

''Office Hours''

PoCs will host group office hours every two weeks once the program begins. These sessions are a space to ask questions, discuss ideas, and collaborate live.

<u>How to get support</u>

- Use the Slack channel as your first point of contact (if you are not yet in the Slack channel, then email your PoC at mazumder_lab AT gwu.edu)

- Follow up with your PoCs in the group channel

- Come prepared with questions for office hours

- Participate in discussions and support your peers

Our goal is to create an open, collaborative environment where everyone can learn and contribute.
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Summer 2026. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email ''mazumder_lab@gwu.edu'' your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.

# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curating PMIDs for intervention outcome prediction dataset LLM recommendation training.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen. <u>We are also looking for individuals who have previously worked with us to take on a coordinator role</u>.''
----

==== 1. GlycoSiteMiner Curation Project ====
POC: Kate Warner

GlycoSiteMiner (PMID: [https://pubmed.ncbi.nlm.nih.gov/40401984/ 40401984]) is a large language model (LLM)-based tool developed by the GlyGen team to automate a literature-mining pipeline that extracts experimentally validated, protein sequence–specific glycosylation sites from PubMed abstracts. By leveraging natural language processing, GlycoSiteMiner accelerates the identification of glycosylation evidence that would otherwise require extensive manual review.

The objective of this project is to validate these text-mined entries and curate them into structured datasets using GlyTableMaker (https://glygen.ccrc.uga.edu/tablemaker), a companion tool designed to support the deposition of glycans and glycoproteins, assignment of standardized metadata, and generation of high-quality Excel/CSV tables. This process ensures that extracted information is accurate, consistent, and suitable for integration into GlyGen’s knowledgebase.

This opportunity provides hands-on experience in biocuration workflows, including data validation, standardization, and quality control. Participants will deepen their understanding of glycobiology concepts, gain practical experience working with biological databases, and develop skills in evaluating and refining LLM-generated outputs for scientific applications.

==== 2. GlyGen Biocuration Project ====
POC: Rene Ranzinger and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding; however, the data contained within them remains highly valuable to the research community. Integrating these legacy datasets into modern databases or knowledgebases, such as GlyGen, presents a significant challenge because much of the associated metadata (e.g., species, tissue, disease, cell line) is recorded as free-text that does not conform to the standardized dictionaries and ontologies used by current resources.

To address this challenge, this project will leverage large language models (LLMs) to automate the mapping of free-text metadata from legacy databases, specifically CarbBank and CFG, to standardized accessions in authoritative resources such as NCBI Taxonomy, Disease Ontology, and Cellosaurus. The LLM-based workflow will identify and normalize synonyms, abbreviations, and spelling variants (e.g., “human,” “man,” or “h. sapiens” mapped to Homo sapiens), enabling scalable and reproducible metadata harmonization that would otherwise require extensive manual curation. The LLM tasks will be performed using OpenAI resources integrated into the GlyGen curation pipeline. The project involves the development of Python scripts to read and write data, invoke the OpenAI API, and compare results with manually curated data. Another aspect of the work is the development and fine-tuning of a prompt for ChatGPT to ensure that reliable and accurate mappings are produced.

While the mapping process will be largely automated, manual validation will be incorporated as a quality-control step to assess model performance, verify correctness, and identify edge cases requiring refinement. This hybrid approach significantly reduces curator burden while ensuring high-quality, ontology-aligned annotations.

The goal of this effort is to migrate and modernize datasets from CarbBank and CFG, making them interoperable with GlyGen and other contemporary glycoinformatics resources through a scalable, AI-assisted curation strategy.

Two projects:

# Taking predicted sites and curating them using table maker
# Website testing (all volunteers)

For any questions, please contact Rene Ranzinger (rene@ccrc.uga.edu) or Kate Warner (k.warner1@email.gwu.edu).

==== 3. GlyGen Publication Analysis Project ====
POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning (ML) Modeling Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct ML modeling using publicly available -omics datasets that were previously identified (see our [[Recommended Publications for Intervention Outcome Prediction Models|Recommended Publications for IOPMs]] page). This volunteership will involve data harmonization, model training, and pipeline documentation.

Tasks associated with this project include:

# Exploring and understanding the data found in relevant PMIDs that can be used to train intervention outcome prediction models.
# Preparing the data for model training and model performance evaluation
# Testing the modeling tutorial, PredictMod platform, and associated project tools
# Documentation of the ML pipeline and testing results

Deliverables for this project include:

# ML-ready datasets & trained model scripts pushed to GitHub
# Pipeline documentation captured in BioCompute Objects (BCOs) and testing reports
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 5. BioCompute Objects User Research Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct individual audits and user research to improve the human readability of BioCompute Objects (BCOs) and the project documentation. This volunteership will involve user research, prototyping, and documentation.

Tasks associated with the project include:

# Reviewing existing documentation to gain a comprehensive understanding of BioCompute Objects, their relevance to bioinformatics, and key user personas. The volunteer will identify and report gaps in the current documentation.
# Conducting user research to understand pain points and desired outcomes. The volunteer will develop user stories based on interviews with BCO users.
# Prototyping improvements to the BCO documentation and/or portal based on user stories. This could involve visual diagrams, wiki restructuring, or decision logs.

Deliverables will include:

# User research report with user story maps
# BCO documentation improvement plan
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 6. BiomarkerKB Biocuration Project ====
POC: Jeet Vora (primary), Maria Kim, Cyrus Au-Yeung

[https://biomarkerkb.org/about/ BiomarkerKB] is a biomedical knowledgebase project focused on harmonizing and structuring biomarker knowledge from scientific literature and public resources. We are currently recruiting individuals with experience working with LLMs (e.g. Claude, ChatGPT) to support the following tasks:

# '''Validation of existing published biomarkers from scientific literature (JV, MK, CA)'''
#* Review and validate previously reported biomarkers by checking the original literature, confirming evidence support, and standardizing biomarker annotations
#* Assess the evidence strength of biomarkers and identify additional literature to strengthen the support for biomarker claims
# '''Curation of novel biomarkers from scientific literature (MK)'''
#* Curate high-quality biomarkers for a selected disease area, organize the findings into a structured dataset
#* Standardize biomarker representations using controlled vocabularies and ontologies and classify biomarkers by their biomarker types
#* Construct and test-query a disease-specific biomarker knowledge graph (optional)
# '''Electronic Health Records Normal Entity Data Integration (JV)'''
#* Identify relevant EHR data elements (lab tests, diagnoses, procedures)
#* Map entities to standard terminologies (e.g., SNOMED CT, LOINC, ICD codes)
#* Resolve ambiguities and inconsistencies in mappings and clinical terminology
# '''Front-end testing for BiomarkerKB.org (MK, JV)'''
#* Test the BiomarkerKB web interface for functionality and data presentation, and document issues / improvement suggestions for the development team
# '''Benchmarking and LLM-based biomarker extraction (optional*) (CA)'''
#* Construct manually curated biomarker reference sets in the glycobiology domain to support benchmarking of LLM-based knowledge extraction pipelines.
#* Apply an LLM workflow to extract disease-specific biomarkers from literature and compare model outputs against the manually curated benchmark sets

''Note:'' Participation in the benchmarking and LLM-based biomarker extraction subproject depends on sufficient progress in either task 1 or task 2. Volunteers are expected to first complete either validation of an LLM-extracted glycobiology subset or comprehensive curation of a disease-specific biomarker set before beginning this component. Because this volunteership is structured around a 20-hour-per-week commitment, participation in this part is not guaranteed.

Individuals interested in this opportunity may reach out to Jeet Vora ([mailto:jeetvora@gwu.edu jeetvora@gwu.edu]) for project details.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 9-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program. Additional recognition will be given to the top three volunteers with exceptional presentations at the end of the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers ===
{| class="wikitable sortable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|Sahana Adusumilli
|BiomarkerKB
|Jeet Vora
|BiomarkerKB
|-
|Abhirama Chillara
|BiomarkerKB
|Jeet Vora, Maria Kim
|BiomarkerKB
|-
|Rhea Charles
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML
|-
|Sri Piramanayagam
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML, BiomarkerKB, GlyGen, BCO
|-
|Taylor Dimenna
|GlyGen
|Urnisha Bhuiyan
|GlyGen Biocuration Project
|-
|Daniel Auerbach
|GlyGen
|Urnisha Bhuiyan
|GlyGen Publication Analysis Project
|-
|Swapnaneel Chatterjee
|GlyGen
|Urnisha Bhuiyan
|New Project: GlycoChatbot Project
|-
|Caleb Hailu
|Pending
|Pending
|GlyGen Biocuration Project
|-
|Nahom Abel*
|GlyGen
|Kate Warner
|GlycoSiteMiner Curation Project
|-
|Mathias Belay*
|BCO User Research
|Lori Krammer, Pat McNeely
|BCO, GlycoSiteMiner, BiomarkerKB
|-
|Arjun Agnihothram
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML, BiomarkerKB, GlycoSiteMiner
|-
|Aryan Jagani
|PredictMod
|Lori Krammer, Pat McNeely
|PredictMod ML
|-
|Cynthia Li
|BiomarkerKB
|Maria Kim
|BiomarkerKB, GlyGen, PredictMod
|-
|Vishal Muthusekaran**
|BiomarkerKB
|Cyrus Au-Yeung, Maria Kim
|BiomarkerKB
|-
|Dia Jhaveri
|BiomarkerKB
|Maria Kim
|BiomarkerKB, GlycoSiteMiner, BCO, PredictMod, GlyGen
|-
|Arthur Issler
|BiomarkerKB
|Maria Kim, Jeet Vora
|BiomarkerKB, GlycoSiteMiner, GlyGen
|}
<nowiki>*</nowiki>Returning volunteer.

<nowiki>**</nowiki>Not directly involved in the semester curriculum; long-term volunteer.

== Summer 2026 Symposium ==
The Summer symposium will be held virtually.

'''Date:''' TBD

'''Time:''' 4 - 6 PM

'''Zoom Link''' - TBA

=== Agenda (All times are in Eastern Time) ===
{| class="wikitable"
!Time
!Project
!Presentation Title
!Presenter(s)
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
| colspan="2" |
|
|}

Volunteership Summer 2026

2026-05-12T20:10:06Z

Urnisha.bhuiyan: /* Volunteers (TBD) */

== 2026 Summer Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

May 15, 2026 | 12:00 PM ET

Please email your updated resume and a list of projects in order of preference. Acceptance letters/emails will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

Date TBD | 11:00 AM to 12:00 PM

'''Program Dates: June 1, 2026 – July 31, 2026''' (9 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[[Volunteership Spring 2026|Spring 2026 Volunteership]]
----

=== Volunteer Expectations ===

# Minimum commitment of 20 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Volunteers should be responsive to email/slack communications.
# 30-minute Zoom meetings (during regular work hours) once a week or every other week with the assigned project point of contact (POC).
# Volunteers are expected to attend volunteership events such as a symposium.
# Attend some lectures or seminars remotely (max 4-5).
# This volunteership does not allow for vacation time.

'''''Important:''' '''If the scrum is not updated for 2 consecutive working days,''' '''the candidate will be automatically dropped from the program.'''''
----

=== Volunteership Support ===
Each group has dedicated Points of Contact (PoCs) who are your main resource for questions and guidance.

<u>How to Get Help</u>

''Slack Group Channel''

Use your group Slack channel as the primary place to ask questions and share ideas. This is strongly encouraged so everyone can learn together. Direct messages to PoCs are discouraged.

''Office Hours''

PoCs will host group office hours every two weeks once the program begins. These sessions are a space to ask questions, discuss ideas, and collaborate live.

<u>How to get support</u>

* Use the Slack channel as your first point of contact (if you are not yet in the Slack channel, email your PoC at mazumder_lab AT gwu.edu)
* Follow up with your PoCs in the group channel
* Come prepared with questions for office hours
* Participate in discussions and support your peers

Our goal is to create an open, collaborative environment where everyone can learn and contribute.
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Summer 2026. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email ''mazumder_lab@gwu.edu'' your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.

# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curating PMIDs for intervention outcome prediction dataset LLM recommendation training.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen. <u>We are also looking for individuals who have previously worked with us to take on a coordinator role</u>.''
----

==== 1. BiomarkerKB Biocuration Project ====
POC: Jeet Vora (primary), Maria Kim, Cyrus Au-Yeung

[https://biomarkerkb.org/about/ BiomarkerKB] is a biomedical knowledgebase project focused on harmonizing and structuring biomarker knowledge from scientific literature and public resources. We are currently recruiting individuals with experience working with LLMs (e.g. Claude, ChatGPT) to support the following tasks:

# '''Validation of existing published biomarkers from scientific literature (JV, MK, CA)'''
#* Review and validate previously reported biomarkers by checking the original literature, confirming evidence support, and standardizing biomarker annotations
#* Assess the evidence strength of biomarkers and identify additional literature to strengthen the support for biomarker claims
# '''Curation of novel biomarkers from scientific literature (MK)'''
#* Curate high-quality biomarkers for a selected disease area, organize the findings into a structured dataset
#* Standardize biomarker representations using controlled vocabularies and ontologies and classify biomarkers by their biomarker types
#* Construct and test-query a disease-specific biomarker knowledge graph (optional)
# '''Electronic Health Records Normal Entity Data Integration (JV)'''
#* Identify relevant EHR data elements (lab tests, diagnoses, procedures)
#* Map entities to standard terminologies (e.g., SNOMED CT, LOINC, ICD codes)
#* Resolve ambiguities and inconsistencies in mappings and clinical terminology
# '''Front-end testing for BiomarkerKB.org (MK, JV)'''
#* Test the BiomarkerKB web interface for functionality and data presentation, and document issues / improvement suggestions for the development team
# '''Benchmarking and LLM-based biomarker extraction (optional*) (CA)'''
#* Construct manually curated biomarker reference sets in the glycobiology domain to support benchmarking of LLM-based knowledge extraction pipelines.
#* Apply an LLM workflow to extract disease-specific biomarkers from literature and compare model outputs against the manually curated benchmark sets

''Note:'' Participation in the benchmarking and LLM-based biomarker extraction subproject depends on sufficient progress in either '''task 1''' or '''task 2'''. Volunteers are expected to first complete either validation of an LLM-extracted glycobiology subset or comprehensive curation of a disease-specific biomarker set before beginning this component. Because this volunteership is structured around a 20-hour-per-week commitment, participation in this part is not guaranteed.

Individuals interested in this opportunity may reach out to Jeet Vora ([mailto:jeetvora@gwu.edu jeetvora@gwu.edu]) for project details.

==== 2. GlycoSiteMiner Curation Project ====

POC: Kate Warner

GlycoSiteMiner (PMID: [https://pubmed.ncbi.nlm.nih.gov/40401984/ 40401984]) is a large language model (LLM)-based tool developed by the GlyGen team to automate a literature-mining pipeline that extracts experimentally validated, protein sequence–specific glycosylation sites from PubMed abstracts. By leveraging natural language processing, GlycoSiteMiner accelerates the identification of glycosylation evidence that would otherwise require extensive manual review.

The objective of this project is to validate these text-mined entries and curate them into structured datasets using GlyTableMaker (https://glygen.ccrc.uga.edu/tablemaker), a companion tool designed to support the deposition of glycans and glycoproteins, assignment of standardized metadata, and generation of high-quality Excel/CSV tables. This process ensures that extracted information is accurate, consistent, and suitable for integration into GlyGen’s knowledgebase.

This opportunity provides hands-on experience in biocuration workflows, including data validation, standardization, and quality control. Participants will deepen their understanding of glycobiology concepts, gain practical experience working with biological databases, and develop skills in evaluating and refining LLM-generated outputs for scientific applications.

==== 3. GlyGen Biocuration Project ====
POC: Rene Ranzinger and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding; however, the data contained within them remains highly valuable to the research community. Integrating these legacy datasets into modern databases or knowledgebases, such as GlyGen, presents a significant challenge because much of the associated metadata (e.g., species, tissue, disease, cell line) is recorded as free-text that does not conform to the standardized dictionaries and ontologies used by current resources.

To address this challenge, this project will leverage large language models (LLMs) to automate the mapping of free-text metadata from legacy databases, specifically CarbBank and CFG, to standardized accessions in authoritative resources such as NCBI Taxonomy, Disease Ontology, and Cellosaurus. The LLM-based workflow will identify and normalize synonyms, abbreviations, and spelling variants (e.g., “human,” “man,” or “h. sapiens” mapped to Homo sapiens), enabling scalable and reproducible metadata harmonization that would otherwise require extensive manual curation. The LLM tasks will be performed using OpenAI resources integrated into the GlyGen curation pipeline. The project involves developing Python scripts to read and write data, invoke the OpenAI API, and compare results with manually curated data. Another aspect of the work is developing and fine-tuning a ChatGPT prompt so that the mapping is reliable and accurate.

While the mapping process will be largely automated, manual validation will be incorporated as a quality-control step to assess model performance, verify correctness, and identify edge cases requiring refinement. This hybrid approach significantly reduces curator burden while ensuring high-quality, ontology-aligned annotations.

The goal of this effort is to migrate and modernize datasets from CarbBank and CFG, making them interoperable with GlyGen and other contemporary glycoinformatics resources through a scalable, AI-assisted curation strategy.

Two projects:

# Taking predicted sites and curating them using table maker
# Website testing (all volunteers)

For any questions, please contact Rene Ranzinger (rene@ccrc.uga.edu) or Kate Warner (k.warner1@email.gwu.edu).

==== 4. GlyGen Publication Analysis Project ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 5. PredictMod Machine Learning (ML) Modeling Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct ML modeling using publicly available -omics datasets that were previously identified (see our [[Recommended Publications for Intervention Outcome Prediction Models|Recommended Publications for IOPMs]] page). This volunteership will involve data harmonization, model training, and pipeline documentation.

Tasks associated with this project include:

# Exploring and understanding the data found in relevant PMIDs that can be used to train intervention outcome prediction models.
# Preparing the data for model training and model performance evaluation
# Testing the modeling tutorial, PredictMod platform, and associated project tools
# Documentation of the ML pipeline and testing results

Deliverables for this project include:

# ML-ready datasets & trained model scripts pushed to GitHub
# Pipeline documentation captured in BioCompute Objects (BCOs) and testing reports
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 6. BioCompute Objects User Research Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct individual audits and user research to improve the human readability of BioCompute Objects (BCOs) and the project documentation. This volunteership will involve user research, prototyping, and documentation.

Tasks associated with the project include:

# Reviewing existing documentation to gain a comprehensive understanding of BioCompute Objects, their relevance to bioinformatics, and key user personas. The volunteer will identify and report gaps in the current documentation.
# Conducting user research to understand pain points and desired outcomes. The volunteer will develop user stories based on interviews with BCO users.
# Prototyping improvements to the BCO documentation and/or portal based on user stories. This could involve visual diagrams, wiki restructuring, or decision logs.

Deliverables will include:

# User research report with user story maps
# BCO documentation improvement plan
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 7. FDA-ARGOS Computation and Pathogen Curation Project ====

This volunteership is currently paused.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 9-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program. Additional recognition will be given to the top three volunteers with exceptional presentations at the end of the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers (TBD) ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|Sahana Adusumilli
|BiomarkerKB
|Jeet Vora
|Review EHR Normal Ranges
|-
|Abhirama Chillara
|BiomarkerKB
|Jeet Vora/Maria
|TBD
|-
|Rhea Charles
|PredictMod
|Lori Krammer
|PredictMod ML
|-
|Sri Piramanayagam
|PredictMod
|Lori Krammer
|PredictMod ML, BiomarkerKB, GlyGen, BCO
|-
|Taylor Dimenna
|GlyGen
|Urnisha Bhuiyan
|GlyGen Biocuration Project
|-
|Daniel Auerbach
|GlyGen
|Urnisha Bhuiyan
|GlyGen Publication Analysis Project
|-
|Nahom Abel
|GlyGen
|Kate Warner
|GlycoSiteMiner Curation Project
|-
|
|
|
|
|-
|
|
|
|
|}
<nowiki>*</nowiki>Returning volunteer.

<nowiki>**</nowiki>Not directly involved in the semester curriculum; long-term volunteer.

== Summer 2026 Symposium ==
The Summer symposium will be held virtually.

'''Date:''' TBD

'''Time:''' 4 - 6 PM

'''Zoom Link''' - TBA

=== Agenda (All times are in Eastern Time) ===
{| class="wikitable"
!Time
!Project
!Presentation Title
!Presenter(s)
|-
|
| colspan="2" |
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
| colspan="2" |
|
|}

Volunteership Summer 2026

2026-04-09T17:26:13Z

Urnisha.bhuiyan: /* 3. GlyGen Biocuration Project */ added links to glygen volunteership

== 2026 Summer Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

May 15, 2026 | 12:00 PM ET

Please email your updated resume and a list of projects in order of preference. Acceptance letters/emails will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

Date TBD | 11:00 AM to 12:00 PM

'''Program Dates: June 1, 2026 – July 31, 2026''' (9 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[[Volunteership Spring 2026|Fall 2025 Volunteership]] (Closed)
----

=== Volunteer Expectations ===

# Minimum commitment of 20 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Volunteers should be responsive to email/slack communications.
# 30-minute Zoom meetings (during regular work hours) once a week or every other week with the assigned project point of contact (POC).
# Volunteers are expected to attend volunteership events such as a symposium.
# Attend some lectures or seminars remotely (max 4-5).
# This volunteership does not allow for vacation time.

'''''Important:''' '''If the scrum is not updated for 2 consecutive working days,''' '''the candidate will be automatically dropped from the program.'''''

----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Summer 2026. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email ''mazumder_lab@gwu.edu'' your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.

# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curating PMIDs for intervention outcome prediction dataset LLM recommendation training.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen. <u>We are also looking for individuals who have previously worked with us to take on a coordinator role</u>.''
----

==== 1. BiomarkerKB Biocuration Project ====
POC: Jeet Vora (primary), Maria Kim, Cyrus Au-Yeung

[https://biomarkerkb.org/about/ BiomarkerKB] is a biomedical knowledgebase project focused on harmonizing and structuring biomarker knowledge from scientific literature and public resources. We are currently recruiting individuals with experience working with LLMs (e.g. Claude, ChatGPT) to support the following tasks:

# '''Validation of existing published biomarkers from scientific literature (JV, MK, CA)'''
#* Review and validate previously reported biomarkers by checking the original literature, confirming evidence support, and standardizing biomarker annotations
#* Assess the evidence strength of biomarkers and identify additional literature to strengthen the support for biomarker claims
# '''Curation of novel biomarkers from scientific literature (MK)'''
#* Curate high-quality biomarkers for a selected disease area, organize the findings into a structured dataset
#* Standardize biomarker representations using controlled vocabularies and ontologies and classify biomarkers by their biomarker types
#* Construct and test-query a disease-specific biomarker knowledge graph (optional)
# '''Electronic Health Records Normal Entity Data Integration (JV)'''
#* Identify relevant EHR data elements (lab tests, diagnoses, procedures)
#* Map entities to standard terminologies (e.g., SNOMED CT, LOINC, ICD codes)
#* Resolve ambiguities and inconsistencies in mappings and clinical terminology
# '''Front-end testing for BiomarkerKB.org (MK, JV)'''
#* Test the BiomarkerKB web interface for functionality and data presentation, and document issues / improvement suggestions for the development team
# '''Benchmarking and LLM-based biomarker extraction (optional*) (CA)'''
#* Construct manually curated biomarker reference sets in the glycobiology domain to support benchmarking of LLM-based knowledge extraction pipelines.
#* Apply an LLM workflow to extract disease-specific biomarkers from literature and compare model outputs against the manually curated benchmark sets

''Note:'' Participation in the benchmarking and LLM-based biomarker extraction subproject depends on sufficient progress in either '''task 1''' or '''task 2'''. Volunteers are expected to first complete either validation of an LLM-extracted glycobiology subset or comprehensive curation of a disease-specific biomarker set before beginning this component. Because this volunteership is structured around a 20-hour-per-week commitment, participation in this part is not guaranteed.

Individuals interested in this opportunity may reach out to Jeet Vora ([mailto:jeetvora@gwu.edu jeetvora@gwu.edu]) for project details.

==== 2. GlycoSiteMiner Curation Project ====

POC: Kate Warner

GlycoSiteMiner (PMID: [https://pubmed.ncbi.nlm.nih.gov/40401984/ 40401984]) is a large language model (LLM)-based tool developed by the GlyGen team to automate a literature-mining pipeline that extracts experimentally validated, protein sequence–specific glycosylation sites from PubMed abstracts. By leveraging natural language processing, GlycoSiteMiner accelerates the identification of glycosylation evidence that would otherwise require extensive manual review.

The objective of this project is to validate these text-mined entries and curate them into structured datasets using GlyTableMaker (https://glygen.ccrc.uga.edu/tablemaker), a companion tool designed to support the deposition of glycans and glycoproteins, assignment of standardized metadata, and generation of high-quality Excel/CSV tables. This process ensures that extracted information is accurate, consistent, and suitable for integration into GlyGen’s knowledgebase.

This opportunity provides hands-on experience in biocuration workflows, including data validation, standardization, and quality control. Participants will deepen their understanding of glycobiology concepts, gain practical experience working with biological databases, and develop skills in evaluating and refining LLM-generated outputs for scientific applications.

==== 3. GlyGen Biocuration Project ====
POC: Rene Ranzinger and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding; however, the data contained within them remains highly valuable to the research community. Integrating these legacy datasets into modern databases or knowledgebases, such as GlyGen, presents a significant challenge because much of the associated metadata (e.g., species, tissue, disease, cell line) is recorded as free-text that does not conform to the standardized dictionaries and ontologies used by current resources.

To address this challenge, this project will leverage large language models (LLMs) to automate the mapping of free-text metadata from legacy databases, specifically CarbBank and CFG, to standardized accessions in authoritative resources such as NCBI Taxonomy, Disease Ontology, and Cellosaurus. The LLM-based workflow will identify and normalize synonyms, abbreviations, and spelling variants (e.g., “human,” “man,” or “h. sapiens” mapped to Homo sapiens), enabling scalable and reproducible metadata harmonization that would otherwise require extensive manual curation. The LLM tasks will be performed using OpenAI resources integrated into the GlyGen curation pipeline. The project involves developing Python scripts to read and write data, invoke the OpenAI API, and compare the results with manually curated data. Another aspect of the work is developing and fine-tuning a ChatGPT prompt so that the mapping is reliable and accurate.
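
As a rough illustration of the scripted part of this workflow, the sketch below builds a prompt for one free-text species term, parses a JSON reply, and compares the predicted accessions with manually curated ones. It is not the GlyGen pipeline: the prompt wording, the function names, and the <code>fake_llm</code> stand-in (used in place of a real OpenAI API call so the example runs offline) are all assumptions made for illustration.

```python
import json
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical prompt; the real project prompt would be developed and tuned separately.
PROMPT_TEMPLATE = (
    "Map the free-text species term below to an NCBI Taxonomy ID.\n"
    'Answer with JSON only, e.g. {{"accession": "9606"}}; '
    'use {{"accession": null}} if unsure.\n'
    "Term: {term}"
)


def build_prompt(term: str) -> str:
    """Fill the prompt template with one legacy metadata term."""
    return PROMPT_TEMPLATE.format(term=term.strip())


def parse_accession(reply: str) -> Optional[str]:
    """Extract the accession from a JSON reply; return None if it is missing or malformed."""
    try:
        value = json.loads(reply).get("accession")
    except (json.JSONDecodeError, AttributeError):
        return None
    return str(value) if value is not None else None


def map_terms(terms: Iterable[str], ask_llm: Callable[[str], str]) -> Dict[str, Optional[str]]:
    """Map each term to an accession using any callable that takes a prompt and returns text."""
    return {term: parse_accession(ask_llm(build_prompt(term))) for term in terms}


def compare_with_manual(
    predicted: Dict[str, Optional[str]], manual: Dict[str, str]
) -> Tuple[float, Dict[str, Tuple[Optional[str], str]]]:
    """Return the agreement rate and the terms where prediction and manual curation differ."""
    disagreements = {
        term: (predicted.get(term), accession)
        for term, accession in manual.items()
        if predicted.get(term) != accession
    }
    agreement = 1 - len(disagreements) / len(manual) if manual else 0.0
    return agreement, disagreements


def fake_llm(prompt: str) -> str:
    """Offline stand-in for the model call (assumption: a real script would call the API here)."""
    term = prompt.rsplit("Term: ", 1)[1].lower()
    known = {"human": "9606", "h. sapiens": "9606"}
    return json.dumps({"accession": known.get(term)})


if __name__ == "__main__":
    # Manually curated reference: all three variants denote Homo sapiens (NCBI Taxonomy ID 9606).
    manual = {"human": "9606", "h. sapiens": "9606", "man": "9606"}
    predicted = map_terms(manual, fake_llm)
    agreement, disagreements = compare_with_manual(predicted, manual)
    print(f"agreement: {agreement:.2f}")
    print("needs manual review:", disagreements)
```

Passing the model call in as a function keeps the comparison logic testable without network access. The terms listed as disagreements are the ones a curator would review by hand.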

While the mapping process will be largely automated, manual validation will be incorporated as a quality-control step to assess model performance, verify correctness, and identify edge cases requiring refinement. This hybrid approach significantly reduces curator burden while ensuring high-quality, ontology-aligned annotations.

The goal of this effort is to migrate and modernize datasets from CarbBank and CFG, making them interoperable with GlyGen and other contemporary glycoinformatics resources through a scalable, AI-assisted curation strategy.

Two projects:

# Taking predicted sites and curating them using GlyTableMaker
# Website testing (all volunteers)

For any questions, please contact Rene Ranzinger (rene@ccrc.uga.edu) or Kate Warner (k.warner1@email.gwu.edu).

==== 4. GlyGen Publication Analysis Project ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
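
A minimal sketch of how the first and last steps might start, assuming the NCBI E-utilities ESearch endpoint is used as the PubMed web API. The helper names, the institution hints, and the sample affiliation strings are made up for illustration. Affiliation strings are assumed to have been extracted from the retrieved records already, and nothing here contacts the network.

```python
from collections import Counter
from typing import Iterable
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint for querying PubMed.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Assumption: a crude list of words that mark the institution part of an affiliation.
INSTITUTION_HINTS = ("University", "Institute", "College", "Hospital")


def build_esearch_url(keywords: Iterable[str], retmax: int = 100) -> str:
    """Build an ESearch URL matching any keyword in the title or abstract."""
    term = " OR ".join(f'"{keyword}"[Title/Abstract]' for keyword in keywords)
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def institution_of(affiliation: str) -> str:
    """Pick the comma-separated part that looks like an institution name."""
    parts = [part.strip() for part in affiliation.split(",") if part.strip()]
    for part in parts:
        if any(hint in part for hint in INSTITUTION_HINTS):
            return part
    return parts[0] if parts else ""


def count_institutions(affiliations: Iterable[str], exclude: Iterable[str] = ()) -> Counter:
    """Count institutions, skipping names on a (hypothetical) exclusion list."""
    excluded = set(exclude)
    names = (institution_of(affiliation) for affiliation in affiliations)
    return Counter(name for name in names if name and name not in excluded)


if __name__ == "__main__":
    print(build_esearch_url(["GlyGen"]))
    # Invented sample strings, not real query results.
    sample = [
        "Department of Biochemistry, George Washington University, Washington, DC, USA",
        "Complex Carbohydrate Research Center, University of Georgia, Athens, GA, USA",
        "Example Pharma Inc., Boston, MA",
    ]
    print(count_institutions(sample, exclude=["Example Pharma Inc."]))
```

In practice the exclusion list would come out of the manual review in step 3, and the simple comma-splitting heuristic would likely need to be replaced by a more careful institution matcher.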

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 5. PredictMod Machine Learning (ML) Modeling Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct ML modeling using publicly-available -omics datasets that were previously identified (see our [[Recommended Publications for Intervention Outcome Prediction Models|Recommended Publications for IOPMs]] page). This volunteership will involve data harmonization, model training, and pipeline documentation.

Tasks associated with this project include:

# Exploring and understanding the data found in relevant PMIDs that can be used to train intervention outcome prediction models.
# Preparing the data for model training and model performance evaluation
# Testing the modeling tutorial, PredictMod platform, and associated project tools
# Documentation of the ML pipeline and testing results

Deliverables for this project include:

# ML-ready datasets
# Trained model scripts
# Pipeline documentation captured in BioCompute Objects (BCOs) and testing reports
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 6. BioCompute Objects User Research Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct individual audits and user research to improve the human readability of BioCompute Objects (BCOs) and the project documentation. This volunteership will involve user research, prototyping, and documentation.

Tasks associated with the project include:

# Reviewing existing documentation to gain a comprehensive understanding of BioCompute Objects, their relevance to bioinformatics, and key user personas. The volunteer will identify and report gaps in the current documentation.
# Conducting user research to understand pain points and desired outcomes. The volunteer will develop user stories based on interviews with BCO users.
# Prototyping improvements to the BCO documentation and/or portal based on user stories. This could involve visual diagrams, wiki restructuring, or decision logs.

Deliverables will include:

# User research report with user story maps
# BCO documentation improvement plan
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 7. FDA-ARGOS Computation and Pathogen Curation Project ====

This volunteership is currently paused.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 9-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program. Additional recognition will be given to the top three volunteers with exceptional presentations at the end of the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers (TBD) ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|Sahana Adusumilli
|BiomarkerKB
|Jeet Vora
|Review EHR Normal Ranges
|-
|Abhirama Chillara
|BiomarkerKB
|Jeet Vora/Maria
|TBD
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|}
<nowiki>*</nowiki>Returning volunteer.

<nowiki>**</nowiki>Not directly involved in the semester curriculum; long-term volunteer.

== Summer 2026 Symposium ==
The Summer symposium will be held virtually.

'''Date:''' TBD

'''Time:''' 4 - 6 PM

'''Zoom Link''' - TBA

=== Agenda (All times are in Eastern Time) ===
{| class="wikitable"
!Time
!Project
!Presentation Title
!Presenter(s)
|-
|
| colspan="2" |
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
| colspan="2" |
|
|}

Volunteership Summer 2026

2026-04-06T14:23:08Z

Urnisha.bhuiyan: /* 3. GlyGen Biocuration Project */

== 2026 Summer Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

May 15, 2026 | 12:00 PM ET

Please email your updated resume and projects in order of preference. An acceptance letter/email will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

Date TBD | 11:00 AM to 12:00 PM

'''Program Dates: June 1, 2026 – July 31, 2026''' (9 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[[Volunteership Spring 2026|Fall 2025 Volunteership]] (Closed)
----

=== Volunteer Expectations ===

# Minimum commitment of 20 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Volunteers should be responsive to email/slack communications.
# 30-minute Zoom meetings (during regular work hours) once a week or every other week with the assigned project point of contact (POC).
# Volunteers are expected to attend volunteership events such as a symposium.
# Attend some lectures or seminars remotely (max 4-5).
# This volunteership does not allow for vacation time.

'''''Important:''' '''If the scrum is not updated for 2 consecutive working days,''' '''the candidate will be automatically dropped from the program.'''''

----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Summer 2026. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email ''mazumder_lab@gwu.edu'' your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.

# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curating PMIDs for intervention outcome prediction dataset LLM recommendation training.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen. <u>We are also looking for individuals who have previously worked with us to take on a coordinator role</u>.''
----

==== 1. BiomarkerKB Biocuration Project ====
POC: Jeet Vora (primary), Maria Kim, Cyrus Au-Yeung

[https://biomarkerkb.org/about/ BiomarkerKB] is a biomedical knowledgebase project focused on harmonizing and structuring biomarker knowledge from scientific literature and public resources. We are currently recruiting individuals with experience working with LLMs (e.g. Claude, ChatGPT) to support the following tasks:

# '''Validation of existing published biomarkers from scientific literature (JV, MK, CA)'''
#* Review and validate previously reported biomarkers by checking the original literature, confirming evidence support, and standardizing biomarker annotations
#* Assess the evidence strength of biomarkers and identify additional literature to strengthen the support for biomarker claims
# '''Curation of novel biomarkers from scientific literature (MK)'''
#* Curate high-quality biomarkers for a selected disease area and organize the findings into a structured dataset
#* Standardize biomarker representations using controlled vocabularies and ontologies and classify biomarkers by their biomarker types
#* Construct and test-query a disease-specific biomarker knowledge graph (optional)
# '''Electronic Health Records Normal Entity Data Integration (JV)'''
#* Identify relevant EHR data elements (lab tests, diagnoses, procedures)
#* Map entities to standard terminologies (e.g., SNOMED CT, LOINC, ICD codes)
#* Resolve ambiguities and inconsistencies in mapping and clinical terminology
# '''Front-end testing for BiomarkerKB.org (MK, JV)'''
#* Test the BiomarkerKB web interface for functionality and data presentation, and document issues / improvement suggestions for the development team
# '''Benchmarking and LLM-based biomarker extraction (optional*) (CA)'''
#* Construct manually curated biomarker reference sets in the glycobiology domain to support benchmarking of LLM-based knowledge extraction pipelines.
#* Apply an LLM workflow to extract disease-specific biomarkers from literature and compare model outputs against the manually curated benchmark sets

''Note:'' Participation in the benchmarking and LLM-based biomarker extraction subproject depends on sufficient progress in either '''task 1''' or '''task 2'''. Volunteers are expected to first complete either validation of an LLM-extracted glycobiology subset or comprehensive curation of a disease-specific biomarker set before beginning this component. Because this volunteership is structured around a 20-hour-per-week commitment, participation in this part is not guaranteed.

Individuals interested in this opportunity may reach out to Jeet Vora ([mailto:jeetvora@gwu.edu jeetvora@gwu.edu]) for project details.

==== 2. GlycoSiteMiner Curation Project ====

POC: Kate Warner

GlycoSiteMiner is a large language model (LLM)-based tool developed by the GlyGen team to automate a literature-mining pipeline that extracts experimentally validated, protein sequence–specific glycosylation sites from PubMed abstracts. By leveraging natural language processing, GlycoSiteMiner accelerates the identification of glycosylation evidence that would otherwise require extensive manual review.

The objective of this project is to validate these text-mined entries and curate them into structured datasets using GlyTableMaker, a companion tool designed to support the deposition of glycans and glycoproteins, assignment of standardized metadata, and generation of high-quality Excel/CSV tables. This process ensures that extracted information is accurate, consistent, and suitable for integration into GlyGen’s knowledgebase.

This opportunity provides hands-on experience in biocuration workflows, including data validation, standardization, and quality control. Participants will deepen their understanding of glycobiology concepts, gain practical experience working with biological databases, and develop skills in evaluating and refining LLM-generated outputs for scientific applications.

==== 3. GlyGen Biocuration Project ====
POC: Rene Ranzinger and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding; however, the data contained within them remains highly valuable to the research community. Integrating these legacy datasets into modern databases or knowledgebases, such as GlyGen, presents a significant challenge because much of the associated metadata (e.g., species, tissue, disease, cell line) is recorded as free-text that does not conform to the standardized dictionaries and ontologies used by current resources.

To address this challenge, this project will leverage large language models (LLMs) to automate the mapping of free-text metadata from legacy databases, specifically CarbBank and CFG, to standardized accessions in authoritative resources such as NCBI Taxonomy, Disease Ontology, and Cellosaurus. The LLM-based workflow will identify and normalize synonyms, abbreviations, and spelling variants (e.g., “human,” “man,” or “h. sapiens” mapped to Homo sapiens), enabling scalable and reproducible metadata harmonization that would otherwise require extensive manual curation. The LLM tasks will be performed using OpenAI resources integrated into the GlyGen curation pipeline. The project involves developing Python scripts to read and write data, invoke the OpenAI API, and compare the results with manually curated data. Another aspect of the work is developing and fine-tuning a ChatGPT prompt so that the mapping is reliable and accurate.

While the mapping process will be largely automated, manual validation will be incorporated as a quality-control step to assess model performance, verify correctness, and identify edge cases requiring refinement. This hybrid approach significantly reduces curator burden while ensuring high-quality, ontology-aligned annotations.

The goal of this effort is to migrate and modernize datasets from CarbBank and CFG, making them interoperable with GlyGen and other contemporary glycoinformatics resources through a scalable, AI-assisted curation strategy.

Two projects:

# Taking predicted sites and curating them using GlyTableMaker
# Website testing (all volunteers)

For any questions, please contact Rene Ranzinger (rene@ccrc.uga.edu) or Kate Warner (k.warner1@email.gwu.edu).

==== 4. GlyGen Publication Analysis Project ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 5. PredictMod Machine Learning (ML) Modeling Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct ML modeling using publicly-available -omics datasets that were previously identified (see our [[Recommended Publications for Intervention Outcome Prediction Models|Recommended Publications for IOPMs]] page). This volunteership will involve data harmonization, model training, and pipeline documentation.

Tasks associated with this project include:

# Exploring and understanding the data found in relevant PMIDs that can be used to train intervention outcome prediction models.
# Preparing the data for model training and model performance evaluation
# Testing the modeling tutorial, PredictMod platform, and associated project tools
# Documentation of the ML pipeline and testing results

Deliverables for this project include:

# ML-ready datasets
# Trained model scripts
# Pipeline documentation captured in BioCompute Objects (BCOs) and testing reports
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 6. BioCompute Objects User Research Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct individual audits and user research to improve the human readability of BioCompute Objects (BCOs) and the project documentation. This volunteership will involve user research, prototyping, and documentation.

Tasks associated with the project include:

# Reviewing existing documentation to gain a comprehensive understanding of BioCompute Objects, their relevance to bioinformatics, and key user personas. The volunteer will identify and report gaps in the current documentation.
# Conducting user research to understand pain points and desired outcomes. The volunteer will develop user stories based on interviews with BCO users.
# Prototyping improvements to the BCO documentation and/or portal based on user stories. This could involve visual diagrams, wiki restructuring, or decision logs.

Deliverables will include:

# User research report with user story maps
# BCO documentation improvement plan
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 7. FDA-ARGOS Computation and Pathogen Curation Project ====

This volunteership is currently paused.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 9-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program. Additional recognition will be given to the top three volunteers with exceptional presentations at the end of the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers (TBD) ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|Sahana Adusumilli
|BiomarkerKB
|Jeet Vora
|Review EHR Normal Ranges
|-
|Abhirama Chillara
|BiomarkerKB
|Jeet Vora/Maria
|TBD
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|}
<nowiki>*</nowiki>Returning volunteer.

<nowiki>**</nowiki>Not directly involved in the semester curriculum; long-term volunteer.

== Summer 2026 Symposium ==
The Summer symposium will be held virtually.

'''Date:''' TBD

'''Time:''' 4 - 6 PM

'''Zoom Link''' - TBA

=== Agenda (All times are in Eastern Time) ===
{| class="wikitable"
!Time
!Project
!Presentation Title
!Presenter(s)
|-
|
| colspan="2" |
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
| colspan="2" |
|
|}

Volunteership Summer 2026

2026-04-06T14:22:12Z

Urnisha.bhuiyan: /* 3. GlyGen Biocuration Project Ideas */

== 2026 Summer Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

May 15, 2026 | 12:00 PM ET

Please email your updated resume and projects in order of preference. An acceptance letter/email will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

Date TBD | 11:00 AM to 12:00 PM

'''Program Dates: June 1, 2026 – July 31, 2026''' (9 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[[Volunteership Spring 2026|Fall 2025 Volunteership]] (Closed)
----

=== Volunteer Expectations ===

# Minimum commitment of 20 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Volunteers should be responsive to email/slack communications.
# 30-minute Zoom meetings (during regular work hours) once a week or every other week with the assigned project point of contact (POC).
# Volunteers are expected to attend volunteership events such as a symposium.
# Attend some lectures or seminars remotely (max 4-5).
# This volunteership does not allow for vacation time.

'''''Important:''' '''If the scrum is not updated for 2 consecutive working days,''' '''the candidate will be automatically dropped from the program.'''''

----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Summer 2026. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email ''mazumder_lab@gwu.edu'' your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.

# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curating PMIDs for intervention outcome prediction dataset LLM recommendation training.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen. <u>We are also looking for individuals who have previously worked with us to take on a coordinator role</u>.''
----

==== 1. BiomarkerKB Biocuration Project ====
POC: Jeet Vora (primary), Maria Kim, Cyrus Au-Yeung

[https://biomarkerkb.org/about/ BiomarkerKB] is a biomedical knowledgebase project focused on harmonizing and structuring biomarker knowledge from scientific literature and public resources. We are currently recruiting individuals with experience working with LLMs (e.g. Claude, ChatGPT) to support the following tasks:

# '''Validation of existing published biomarkers from scientific literature (JV, MK, CA)'''
#* Review and validate previously reported biomarkers by checking the original literature, confirming evidence support, and standardizing biomarker annotations
#* Assess the evidence strength of biomarkers and identify additional literature to strengthen the support for biomarker claims
# '''Curation of novel biomarkers from scientific literature (MK)'''
#* Curate high-quality biomarkers for a selected disease area and organize the findings into a structured dataset
#* Standardize biomarker representations using controlled vocabularies and ontologies and classify biomarkers by their biomarker types
#* Construct and test-query a disease-specific biomarker knowledge graph (optional)
# '''Electronic Health Records Normal Entity Data Integration (JV)'''
#* Identify relevant EHR data elements (lab tests, diagnoses, procedures)
#* Map entities to standard terminologies (e.g., SNOMED CT, LOINC, ICD codes)
#* Resolve ambiguities and inconsistencies in mapping and clinical terminology
# '''Front-end testing for BiomarkerKB.org (MK, JV)'''
#* Test the BiomarkerKB web interface for functionality and data presentation, and document issues / improvement suggestions for the development team
# '''Benchmarking and LLM-based biomarker extraction (optional*) (CA)'''
#* Construct manually curated biomarker reference sets in the glycobiology domain to support benchmarking of LLM-based knowledge extraction pipelines.
#* Apply an LLM workflow to extract disease-specific biomarkers from literature and compare model outputs against the manually curated benchmark sets

''Note:'' Participation in the benchmarking and LLM-based biomarker extraction subproject depends on sufficient progress in either '''task 1''' or '''task 2'''. Volunteers are expected to first complete either validation of an LLM-extracted glycobiology subset or comprehensive curation of a disease-specific biomarker set before beginning this component. Because this volunteership is structured around a 20-hour-per-week commitment, participation in this part is not guaranteed.

Individuals interested in this opportunity may reach out to Jeet Vora ([mailto:jeetvora@gwu.edu jeetvora@gwu.edu]) for project details.

==== 2. GlycoSiteMiner Curation Project ====

GlycoSiteMiner is a large language model (LLM)-based tool developed by the GlyGen team to automate a literature-mining pipeline that extracts experimentally validated, protein sequence–specific glycosylation sites from PubMed abstracts. By leveraging natural language processing, GlycoSiteMiner accelerates the identification of glycosylation evidence that would otherwise require extensive manual review.

The objective of this project is to validate these text-mined entries and curate them into structured datasets using GlyTableMaker, a companion tool designed to support the deposition of glycans and glycoproteins, assignment of standardized metadata, and generation of high-quality Excel/CSV tables. This process ensures that extracted information is accurate, consistent, and suitable for integration into GlyGen’s knowledgebase.

This opportunity provides hands-on experience in biocuration workflows, including data validation, standardization, and quality control. Participants will deepen their understanding of glycobiology concepts, gain practical experience working with biological databases, and develop skills in evaluating and refining LLM-generated outputs for scientific applications.

==== 3. GlyGen Biocuration Project ====
POC: Rene Ranzinger, Kate Warner, and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding; however, the data contained within them remains highly valuable to the research community. Integrating these legacy datasets into modern databases or knowledgebases, such as GlyGen, presents a significant challenge because much of the associated metadata (e.g., species, tissue, disease, cell line) is recorded as free-text that does not conform to the standardized dictionaries and ontologies used by current resources.

To address this challenge, this project will leverage large language models (LLMs) to automate the mapping of free-text metadata from legacy databases, specifically CarbBank and CFG, to standardized accessions in authoritative resources such as NCBI Taxonomy, Disease Ontology, and Cellosaurus. The LLM-based workflow will identify and normalize synonyms, abbreviations, and spelling variants (e.g., “human,” “man,” or “h. sapiens” mapped to Homo sapiens), enabling scalable and reproducible metadata harmonization that would otherwise require extensive manual curation. The LLM tasks will be performed using OpenAI resources integrated into the GlyGen curation pipeline. The project involves developing Python scripts to read and write data, invoke the OpenAI API, and compare the results with manually curated data. Another aspect of the work is developing and fine-tuning a ChatGPT prompt so that the mapping is reliable and accurate.

While the mapping process will be largely automated, manual validation will be incorporated as a quality-control step to assess model performance, verify correctness, and identify edge cases requiring refinement. This hybrid approach significantly reduces curator burden while ensuring high-quality, ontology-aligned annotations.

The goal of this effort is to migrate and modernize datasets from CarbBank and CFG, making them interoperable with GlyGen and other contemporary glycoinformatics resources through a scalable, AI-assisted curation strategy.

Two projects:

# Taking predicted sites and curating them using GlyTableMaker
# Website testing (all volunteers)

For any questions, please contact Rene Ranzinger (rene@ccrc.uga.edu) or Kate Warner (k.warner1@email.gwu.edu).

==== 4. GlyGen Publication Analysis Project ====

POC: Rene Ranzinger, Kate Warner, and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
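As a rough illustration of the first step, the sketch below builds a keyword query, assuming that the "PubMed web API" refers to the NCBI E-utilities <code>esearch</code> endpoint. The helper names (<code>build_pubmed_query</code>, <code>extract_pmids</code>) and the example keywords are hypothetical. The code only constructs the request URL and parses a response that has already been decoded from JSON; it makes no network call.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed to be the "PubMed web API" meant above).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(keywords, retmax=100):
    """Return an esearch URL matching any keyword in the title or abstract."""
    term = " OR ".join(f'"{kw}"[Title/Abstract]' for kw in keywords)
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def extract_pmids(esearch_json):
    """Pull the PMID list out of a decoded esearch JSON response."""
    return esearch_json.get("esearchresult", {}).get("idlist", [])


if __name__ == "__main__":
    # Hypothetical keywords chosen for illustration only.
    print(build_pubmed_query(["GlyGen", "glycan array"], retmax=50))
```

The returned URL can be fetched with any HTTP client, and the decoded JSON passed to <code>extract_pmids</code>.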

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 5. PredictMod Machine Learning (ML) Modeling Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct ML modeling using publicly available -omics datasets that were previously identified (see our [[Recommended Publications for Intervention Outcome Prediction Models|Recommended Publications for IOPMs]] page). This volunteership will involve data harmonization, model training, and pipeline documentation.

Tasks associated with this project include:

# Exploring and understanding the data found in relevant PMIDs that can be used to train intervention outcome prediction models.
# Preparing the data for model training and model performance evaluation.
# Testing the modeling tutorial, PredictMod platform, and associated project tools.
# Documenting the ML pipeline and testing results.

Deliverables for this project include:

# ML-ready datasets
# Trained model scripts
# Pipeline documentation captured in BioCompute Objects (BCOs) and testing reports
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 6. BioCompute Objects User Research Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct individual audits and user research to improve the human readability of BioCompute Objects (BCOs) and the project documentation. This volunteership will involve user research, prototyping, and documentation.

Tasks associated with the project include:

# Reviewing existing documentation to gain a comprehensive understanding of BioCompute Objects, their relevance to bioinformatics, and key user personas. The volunteer will identify and report gaps in the current documentation.
# Conducting user research to understand pain points and desired outcomes. The volunteer will develop user stories based on interviews with BCO users.
# Prototyping improvements to the BCO documentation and/or portal based on user stories. This could involve visual diagrams, wiki restructuring, or decision logs.

Deliverables will include:

# User research report with user story maps
# BCO documentation improvement plan
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 7. FDA-ARGOS Computation and Pathogen Curation Project ====

This volunteership is currently paused.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 9-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program. Additional recognition will be given to the top three volunteers with exceptional presentations at the end of the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers (TBD) ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|Sahana Adusumilli
|BiomarkerKB
|Jeet Vora
|Review EHR Normal Ranges
|-
|Abhirama Chillara
|BiomarkerKB
|Jeet Vora/Maria
|TBD
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|}
<nowiki>*</nowiki>Returning volunteer.

<nowiki>**</nowiki>Not directly involved in the semester curriculum; long-term volunteer.

== Summer 2026 Symposium ==
The Summer symposium will be held virtually.

'''Date:''' TBD

'''Time:''' 4 - 6 PM

'''Zoom Link''' - TBA

=== Agenda (All times are in Eastern Time) ===
{| class="wikitable"
!Time
!Project
!Presentation Title
!Presenter(s)
|-
|
| colspan="2" |
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
| colspan="2" |
|
|}

Volunteership Summer 2026

2026-04-06T14:21:42Z

Urnisha.bhuiyan: /* 2. GlyGen Biocuration Project Ideas */

== 2026 Summer Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

May 15, 2026 | 12:00 PM ET

Please email your updated resume and a list of projects in order of preference. An acceptance letter/email will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

Date TBD | 11:00 AM to 12:00 PM

'''Program Dates: June 1, 2026 – July 31, 2026''' (9 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[[Volunteership Spring 2026|Spring 2026 Volunteership]] (Closed)
----

=== Volunteer Expectations ===

# Minimum commitment of 20 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Volunteers should be responsive to email/Slack communications.
# 30-minute Zoom meetings (during regular work hours) once a week or every other week with the assigned project point of contact (POC).
# Volunteers are expected to attend volunteership events such as a symposium.
# Attend some lectures or seminars remotely (max 4-5).
# This volunteership does not allow for vacation time.

'''''Important:''' '''If the scrum is not updated for 2 consecutive working days,''' '''the candidate will be automatically dropped from the program.'''''

----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Summer 2026. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email ''mazumder_lab@gwu.edu'' your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.

# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs to train an LLM that recommends datasets for intervention outcome prediction.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen. <u>We are also looking for individuals who have previously worked with us to take on a coordinator role</u>.''
----

==== 1. BiomarkerKB Biocuration Project ====
POC: Jeet Vora (primary), Maria Kim, Cyrus Au-Yeung

[https://biomarkerkb.org/about/ BiomarkerKB] is a biomedical knowledgebase project focused on harmonizing and structuring biomarker knowledge from scientific literature and public resources. We are currently recruiting individuals with experience working with LLMs (e.g. Claude, ChatGPT) to support the following tasks:

# '''Validation of existing published biomarkers from scientific literature (JV, MK, CA)'''
#* Review and validate previously reported biomarkers by checking the original literature, confirming evidence support, and standardizing biomarker annotations
#* Assess the evidence strength of biomarkers and identify additional literature to strengthen the support for biomarker claims
# '''Curation of novel biomarkers from scientific literature (MK)'''
#* Curate high-quality biomarkers for a selected disease area and organize the findings into a structured dataset
#* Standardize biomarker representations using controlled vocabularies and ontologies and classify biomarkers by their biomarker types
#* Construct and test-query a disease-specific biomarker knowledge graph (optional)
# '''Electronic Health Records Normal Entity Data Integration (JV)'''
#* Identify relevant EHR data elements (lab tests, diagnoses, procedures)
#* Map entities to standard terminologies (e.g., SNOMED CT, LOINC, ICD codes)
#* Resolve ambiguities and inconsistencies in mapping and clinical terminology
# '''Front-end testing for BiomarkerKB.org (MK, JV)'''
#* Test the BiomarkerKB web interface for functionality and data presentation, and document issues / improvement suggestions for the development team
# '''Benchmarking and LLM-based biomarker extraction (optional*) (CA)'''
#* Construct manually curated biomarker reference sets in the glycobiology domain to support benchmarking of LLM-based knowledge extraction pipelines.
#* Apply an LLM workflow to extract disease-specific biomarkers from literature and compare model outputs against the manually curated benchmark sets

''Note:'' Participation in the benchmarking and LLM-based biomarker extraction subproject depends on sufficient progress in either '''task 1''' or '''task 2'''. Volunteers are expected to first complete either validation of an LLM-extracted glycobiology subset or comprehensive curation of a disease-specific biomarker set before beginning this component. Because this volunteership is structured around a 20-hour-per-week commitment, participation in this part is not guaranteed.

Individuals interested in this opportunity may reach out to Jeet Vora ([mailto:jeetvora@gwu.edu jeetvora@gwu.edu]) for project details.

==== 2. GlycoSiteMiner Curation Project ====

GlycoSiteMiner is a large language model (LLM)-based tool developed by the GlyGen team to automate a literature-mining pipeline that extracts experimentally validated, protein sequence–specific glycosylation sites from PubMed abstracts. By leveraging natural language processing, GlycoSiteMiner accelerates the identification of glycosylation evidence that would otherwise require extensive manual review.

The objective of this project is to validate these text-mined entries and curate them into structured datasets using GlyTableMaker, a companion tool designed to support the deposition of glycans and glycoproteins, assignment of standardized metadata, and generation of high-quality Excel/CSV tables. This process ensures that extracted information is accurate, consistent, and suitable for integration into GlyGen’s knowledgebase.

This opportunity provides hands-on experience in biocuration workflows, including data validation, standardization, and quality control. Participants will deepen their understanding of glycobiology concepts, gain practical experience working with biological databases, and develop skills in evaluating and refining LLM-generated outputs for scientific applications.

==== 3. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger, Kate Warner, and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding; however, the data contained within them remains highly valuable to the research community. Integrating these legacy datasets into modern databases or knowledgebases, such as GlyGen, presents a significant challenge because much of the associated metadata (e.g., species, tissue, disease, cell line) is recorded as free-text that does not conform to the standardized dictionaries and ontologies used by current resources.

To address this challenge, this project will leverage large language models (LLMs) to automate the mapping of free-text metadata from legacy databases, specifically CarbBank and CFG, to standardized accessions in authoritative resources such as NCBI Taxonomy, Disease Ontology, and Cellosaurus. The LLM-based workflow will identify and normalize synonyms, abbreviations, and spelling variants (e.g., “human,” “man,” or “h. sapiens” mapped to Homo sapiens), enabling scalable and reproducible metadata harmonization that would otherwise require extensive manual curation. The LLM tasks will be performed using OpenAI resources integrated into the GlyGen curation pipeline. The project involves developing Python scripts to read and write data, invoke the OpenAI API, and compare the results with manually curated data. Another aspect of the work is developing and fine-tuning a prompt for ChatGPT so that the mapping it produces is reliable and accurate.

While the mapping process will be largely automated, manual validation will be incorporated as a quality-control step to assess model performance, verify correctness, and identify edge cases requiring refinement. This hybrid approach significantly reduces curator burden while ensuring high-quality, ontology-aligned annotations.

The goal of this effort is to migrate and modernize datasets from CarbBank and CFG, making them interoperable with GlyGen and other contemporary glycoinformatics resources through a scalable, AI-assisted curation strategy.

Two projects:

# Curating predicted sites using GlyTableMaker
# Website testing (all volunteers)

For any questions, please contact Rene Ranzinger (rene@ccrc.uga.edu) or Kate Warner (k.warner1@email.gwu.edu).

==== 4. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger, Kate Warner, and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
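The second step could start from a simple tally of institutions, assuming affiliation strings have already been retrieved for each publication. The sketch below is a naive heuristic rather than an established method: the hint words, the function names (<code>guess_institution</code>, <code>tally_institutions</code>), and the sample affiliations are all hypothetical, and real affiliation strings would need more careful parsing.

```python
from collections import Counter

# Hypothetical hint words; a real analysis would need a curated list.
INSTITUTION_HINTS = ("university", "institute", "college", "hospital")


def guess_institution(affiliation):
    """Return the first comma-separated part that looks like an institution, or None."""
    for part in affiliation.split(","):
        part = part.strip()
        if any(hint in part.lower() for hint in INSTITUTION_HINTS):
            return part
    return None


def tally_institutions(affiliations):
    """Count how often each guessed institution appears across affiliation strings."""
    counts = Counter()
    for affiliation in affiliations:
        institution = guess_institution(affiliation)
        if institution is not None:
            counts[institution] += 1
    return counts


if __name__ == "__main__":
    # Made-up affiliation strings for illustration only.
    sample = [
        "Department of Biochemistry, Example University, City, Country",
        "Glycomics Lab, Example University, City, Country",
        "Independent Researcher",
    ]
    print(tally_institutions(sample).most_common())
```

Institutions that appear only once may point to incidental co-authors, which is one possible starting point for the filtering step.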

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 5. PredictMod Machine Learning (ML) Modeling Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct ML modeling using publicly available -omics datasets that were previously identified (see our [[Recommended Publications for Intervention Outcome Prediction Models|Recommended Publications for IOPMs]] page). This volunteership will involve data harmonization, model training, and pipeline documentation.

Tasks associated with this project include:

# Exploring and understanding the data found in relevant PMIDs that can be used to train intervention outcome prediction models.
# Preparing the data for model training and model performance evaluation.
# Testing the modeling tutorial, PredictMod platform, and associated project tools.
# Documenting the ML pipeline and testing results.

Deliverables for this project include:

# ML-ready datasets
# Trained model scripts
# Pipeline documentation captured in BioCompute Objects (BCOs) and testing reports
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 6. BioCompute Objects User Research Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct individual audits and user research to improve the human readability of BioCompute Objects (BCOs) and the project documentation. This volunteership will involve user research, prototyping, and documentation.

Tasks associated with the project include:

# Reviewing existing documentation to gain a comprehensive understanding of BioCompute Objects, their relevance to bioinformatics, and key user personas. The volunteer will identify and report gaps in the current documentation.
# Conducting user research to understand pain points and desired outcomes. The volunteer will develop user stories based on interviews with BCO users.
# Prototyping improvements to the BCO documentation and/or portal based on user stories. This could involve visual diagrams, wiki restructuring, or decision logs.

Deliverables will include:

# User research report with user story maps
# BCO documentation improvement plan
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 7. FDA-ARGOS Computation and Pathogen Curation Project ====

This volunteership is currently paused.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 9-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program. Additional recognition will be given to the top three volunteers with exceptional presentations at the end of the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers (TBD) ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|Sahana Adusumilli
|BiomarkerKB
|Jeet Vora
|Review EHR Normal Ranges
|-
|Abhirama Chillara
|BiomarkerKB
|Jeet Vora/Maria
|TBD
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|}
<nowiki>*</nowiki>Returning volunteer.

<nowiki>**</nowiki>Not directly involved in the semester curriculum; long-term volunteer.

== Summer 2026 Symposium ==
The Summer symposium will be held virtually.

'''Date:''' TBD

'''Time:''' 4 - 6 PM

'''Zoom Link''' - TBA

=== Agenda (All times are in Eastern Time) ===
{| class="wikitable"
!Time
!Project
!Presentation Title
!Presenter(s)
|-
|
| colspan="2" |
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
| colspan="2" |
|
|}

Volunteership Summer 2026

2026-04-02T20:58:59Z

Urnisha.bhuiyan: /* 2. GlyGen Biocuration Project Ideas */

== 2026 Summer Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

Date TBD | 12:00 PM ET

Please email your updated resume and a list of projects in order of preference. An acceptance letter/email will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

Date TBD | 11:00 AM to 12:00 PM

'''Program Dates: June 1, 2026 – July 31, 2026''' (9 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[[Volunteership Spring 2026|Spring 2026 Volunteership]] (Closed)
----

=== Volunteer Expectations ===

# Minimum commitment of 10 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# 30-minute Zoom meetings (during regular work hours) once a week or every other week with the assigned project point of contact (POC).
# Attend some lectures or seminars remotely (max 4-5).

'''''Important:''' If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.''
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Summer 2026. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email ''mazumder_lab@gwu.edu'' your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.

# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs to train an LLM that recommends datasets for intervention outcome prediction.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen. <u>We are also looking for individuals who have previously worked with us to take on a coordinator role</u>.''
----

==== 1. BiomarkerKB Biocuration Project Ideas ====
==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger, Kate Warner, and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding; however, the data contained within them remains highly valuable to the research community. Integrating these legacy datasets into modern databases or knowledgebases, such as GlyGen, presents a significant challenge because much of the associated metadata (e.g., species, tissue, disease, cell line) is recorded as free-text that does not conform to the standardized dictionaries and ontologies used by current resources.

To address this challenge, this project will leverage large language models (LLMs) to automate the mapping of free-text metadata from legacy databases, specifically CarbBank and CFG, to standardized accessions in authoritative resources such as NCBI Taxonomy, Disease Ontology, and Cellosaurus. The LLM-based workflow will identify and normalize synonyms, abbreviations, and spelling variants (e.g., “human,” “man,” or “h. sapiens” mapped to Homo sapiens), enabling scalable and reproducible metadata harmonization that would otherwise require extensive manual curation. The LLM tasks will be performed using OpenAI resources integrated into the GlyGen curation pipeline. The project involves developing Python scripts to read and write data, invoke the OpenAI API, and compare the results with manually curated data. Another aspect of the work is developing and fine-tuning a prompt for ChatGPT so that the mapping it produces is reliable and accurate.

While the mapping process will be largely automated, manual validation will be incorporated as a quality-control step to assess model performance, verify correctness, and identify edge cases requiring refinement. This hybrid approach significantly reduces curator burden while ensuring high-quality, ontology-aligned annotations.

The goal of this effort is to migrate and modernize datasets from CarbBank and CFG, making them interoperable with GlyGen and other contemporary glycoinformatics resources through a scalable, AI-assisted curation strategy.

For any questions, please contact Rene Ranzinger (rene@ccrc.uga.edu) or Kate Warner (k.warner1@email.gwu.edu).

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger, Kate Warner, and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning (ML) Modeling Project ====
POC: Lori Krammer, Pat McNeely (optional)

Volunteers will conduct ML modeling using publicly available -omics datasets that were previously identified (see our [[Recommended Publications for Intervention Outcome Prediction Models|Recommended Publications for IOPMs]] page). This volunteership will involve data harmonization, model training, and pipeline documentation.

Tasks associated with this project include:

# Exploring and understanding the data found in relevant PMIDs that can be used to train intervention outcome prediction models.
# Preparing the data for model training and model performance evaluation.
# Testing the modeling tutorial, PredictMod platform, and associated project tools.
# Documenting the ML pipeline and testing results.

Deliverables for this project include:

# ML-ready datasets
# Trained model scripts
# Pipeline documentation captured in BioCompute Objects (BCOs) and testing reports
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 5. BioCompute Objects User Research Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct individual audits and user research to improve the human readability of BioCompute Objects (BCOs) and the project documentation. This volunteership will involve user research, prototyping, and documentation.

Tasks associated with the project include:

# Reviewing existing documentation to gain a comprehensive understanding of BioCompute Objects, their relevance to bioinformatics, and key user personas. The volunteer will identify and report gaps in the current documentation.
# Conducting user research to understand pain points and desired outcomes. The volunteer will develop user stories based on interviews with BCO users.
# Prototyping improvements to the BCO documentation and/or portal based on user stories. This could involve visual diagrams, wiki restructuring, or decision logs.

Deliverables will include:

# User research report with user story maps
# BCO documentation improvement plan
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 6. FDA-ARGOS Computation and Pathogen Curation Project ====
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 9-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program. Additional recognition will be given to the top three volunteers with exceptional presentations at the end of the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers (TBD) ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|}
<nowiki>*</nowiki>Returning volunteer.

<nowiki>**</nowiki>Not directly involved in the semester curriculum; long-term volunteer.

== Summer 2026 Symposium ==
The Summer symposium will be held virtually.

'''Date:''' TBD

'''Time:''' 4 - 6 PM

'''Zoom Link''' - TBA

=== Agenda (All times are in Eastern Time) ===
{| class="wikitable"
!Time
!Project
!Presentation Title
!Presenter(s)
|-
|
| colspan="2" |
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
| colspan="2" |
|
|}

Volunteership Summer 2026

2026-04-02T20:58:35Z

Urnisha.bhuiyan: /* 2. GlyGen Biocuration Project Ideas */

== 2026 Summer Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

Date TBD | 12:00 PM ET

Please email your updated resume and a list of projects in order of preference. An acceptance letter/email will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

Date TBD | 11:00 AM to 12:00 PM

'''Program Dates: June 1, 2026 – July 31, 2026''' (9 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[[Volunteership Spring 2026|Spring 2026 Volunteership]] (Closed)
----

=== Volunteer Expectations ===

# Minimum commitment of 10 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# 30-minute Zoom meetings (during regular work hours) once a week or every other week with the assigned project point of contact (POC).
# Attend some lectures or seminars remotely (max 4-5).

'''''Important:''' If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.''
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Summer 2026. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email ''mazumder_lab@gwu.edu'' your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.

# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs for training an LLM to recommend publications with datasets suited to intervention outcome prediction.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen. <u>We are also looking for individuals who have previously worked with us to take on a coordinator role</u>.''
----

==== 1. BiomarkerKB Biocuration Project Ideas ====
==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger, Kate Warner, and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding; however, the data contained within them remains highly valuable to the research community. Integrating these legacy datasets into modern databases or knowledgebases, such as GlyGen, presents a significant challenge because much of the associated metadata (e.g., species, tissue, disease, cell line) is recorded as free-text that does not conform to the standardized dictionaries and ontologies used by current resources.

To address this challenge, this project will leverage large language models (LLMs) to automate the mapping of free-text metadata from legacy databases, specifically CarbBank and CFG, to standardized accessions in authoritative resources such as NCBI Taxonomy, Disease Ontology, and Cellosaurus. The LLM-based workflow will identify and normalize synonyms, abbreviations, and spelling variants (e.g., “human,” “man,” or “h. sapiens” mapped to Homo sapiens), enabling scalable and reproducible metadata harmonization that would otherwise require extensive manual curation. The LLM tasks will be performed using OpenAI resources integrated into the GlyGen curation pipeline. The project involves developing Python scripts to read and write data, invoke the OpenAI API, and compare results with manually curated data. Another aspect of the work is developing and fine-tuning a ChatGPT prompt so that the mapping is reliable and accurate.

While the mapping process will be largely automated, manual validation will be incorporated as a quality-control step to assess model performance, verify correctness, and identify edge cases requiring refinement. This hybrid approach significantly reduces curator burden while ensuring high-quality, ontology-aligned annotations.
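
The quality-control step described above could be sketched as a comparison between LLM-proposed accessions and manually curated ones. This is a minimal illustration under stated assumptions, not the GlyGen pipeline: the function name <code>score_mappings</code>, the dictionary layout, and the sample predictions are hypothetical. Only the NCBI Taxonomy IDs (9606 for Homo sapiens, 10090 for Mus musculus) are real identifiers.

```python
def score_mappings(predicted, curated):
    """Compare LLM-proposed accessions with manually curated ones.

    Both arguments map a free-text term to an accession string.
    Returns (accuracy, mismatches), where mismatches lists the terms
    a curator should review.
    """
    shared = sorted(set(predicted) & set(curated))
    mismatches = [t for t in shared if predicted[t] != curated[t]]
    accuracy = (len(shared) - len(mismatches)) / len(shared) if shared else 0.0
    return accuracy, mismatches


# Hypothetical sample data; 9606 and 10090 are the NCBI Taxonomy IDs
# for Homo sapiens and Mus musculus.
curated = {"human": "9606", "man": "9606", "h. sapiens": "9606", "mouse": "10090"}
predicted = {"human": "9606", "man": "9606", "h. sapiens": "9606", "mouse": "9606"}

accuracy, to_review = score_mappings(predicted, curated)
print(f"accuracy={accuracy:.2f}, review={to_review}")
```

Only terms present in both dictionaries are scored, so unmapped terms would need to be tracked separately.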

The goal of this effort is to migrate and modernize datasets from CarbBank and CFG, making them interoperable with GlyGen and other contemporary glycoinformatics resources through a scalable, AI-assisted curation strategy.

For any questions, please contact Rene Ranzinger (rene@ccrc.uga.edu) or Kate Warner (k.warner1@email.gwu.edu).

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger, Kate Warner, and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.
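
The first step in the list above could start from the NCBI E-utilities ESearch endpoint, which returns PubMed IDs for a query. The sketch below is only an illustration: the helper names, the keyword list, and the choice to restrict matches to the Title/Abstract field are assumptions, not the project's actual design.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(keywords, retmax=20):
    """Build an ESearch URL that returns PubMed IDs matching all keywords."""
    term = " AND ".join(f'"{k}"[Title/Abstract]' for k in keywords)
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"


def parse_pmids(payload):
    """Extract the PMID list from an ESearch JSON response body."""
    return json.loads(payload)["esearchresult"]["idlist"]


def search_pubmed(keywords, retmax=20):
    """Run the query over the network (only when called explicitly)."""
    with urlopen(build_search_url(keywords, retmax)) as response:
        return parse_pmids(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical keywords; prints the URL without contacting NCBI.
    print(build_search_url(["GlyGen", "glycan"]))
```

Keeping URL construction and response parsing separate from the network call makes both easy to test offline. NCBI asks clients without an API key to stay under three requests per second, so batch runs should be throttled.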

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning (ML) Modeling Project ====
POC: Lori Krammer, Pat McNeely (optional)

Volunteers will conduct ML modeling using publicly-available -omics datasets that were previously identified (see our [[Recommended Publications for Intervention Outcome Prediction Models|Recommended Publications for IOPMs]] page). This volunteership will involve data harmonization, model training, and pipeline documentation.

Tasks associated with this project include:

# Exploring and understanding the data found in relevant PMIDs that can be used to train intervention outcome prediction models.
# Preparing the data for model training and model performance evaluation
# Testing the modeling tutorial, PredictMod platform, and associated project tools
# Documentation of the ML pipeline and testing results
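
As one illustration of preparing data for model training and performance evaluation, the sketch below holds out a test set while keeping the proportion of each outcome label. It is a generic example under stated assumptions, not the PredictMod pipeline: the function name, the 20% test fraction, and the sample cohort are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_fraction=0.2, seed=0):
    """Split sample IDs into train/test lists, keeping label proportions."""
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical cohort: 10 responders and 10 non-responders.
samples = [f"S{i:02d}" for i in range(20)]
labels = ["responder"] * 10 + ["non-responder"] * 10
train, test = stratified_split(samples, labels)
print(len(train), len(test))
```

Recording the seed and test fraction alongside the dataset makes the split easier to document and reproduce.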

Deliverables for this project include:

# ML-ready datasets
# Trained model scripts
# Pipeline documentation captured in BioCompute Objects (BCOs) and testing reports
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 5. BioCompute Objects User Research Project ====
POC: Lori Krammer, Pat McNeely

Volunteers will conduct individual audits and user research to improve the human readability of BioCompute Objects (BCOs) and the project documentation. This volunteership will involve user research, prototyping, and documentation.

Tasks associated with the project include:

# Reviewing existing documentation to gain a comprehensive understanding of BioCompute Objects, their relevance to bioinformatics, and key user personas. The volunteer will identify and report gaps in the current documentation.
# Conducting user research to understand pain points and desired outcomes. The volunteer will develop user stories based on interviews with BCO users.
# Prototyping improvements to the BCO documentation and/or portal based on user stories. This could involve visual diagrams, wiki restructuring, or decision logs.

Deliverables will include:

# User research report with user story maps
# BCO documentation improvement plan
# Volunteership documentation (final report, progress updates, symposium presentation)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 6. FDA-ARGOS Computation and Pathogen Curation Project ====
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 9-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program. Additional recognition will be given to the top three volunteers with exceptional presentations at the end of the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers (TBD) ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|}
<nowiki>*</nowiki>Returning volunteer.

<nowiki>**</nowiki>Not directly involved in the semester curriculum; long-term volunteer.

== Summer 2026 Symposium ==
The Summer symposium will be held virtually.

'''Date:''' TBD

'''Time:''' 4 - 6 PM

'''Zoom Link''' - TBA

=== Agenda (All times are in Eastern Time) ===
{| class="wikitable"
!Time
!Project
!Presentation Title
!Presenter(s)
|-
|
| colspan="2" |
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
|
|
|
|-
|
| colspan="2" |
|
|}

Volunteership Spring 2026

2026-01-28T13:36:16Z

Urnisha.bhuiyan: /* 2. GlyGen Biocuration Project Ideas */

== 2026 Spring Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

January 9, 2026, Noon (email your updated resume and a list of projects in order of preference). Acceptance letters/emails will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

January 12, 2026 | 11:00 AM to 12:00 PM

'''Program Dates: January 2026 – April 2026''' (13 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[[Volunteership Fall 2025|Fall 2025 Volunteership]] (Closed)
----

=== Volunteer Expectations ===

# Minimum commitment of 10 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# 30-minute Zoom meetings (during regular work hours) once a week or every other week with the assigned project point of contact (POC).
# Attend some lectures or seminars remotely (max 4-5).

'''''Important:''' If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.''
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Spring 2026. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email mazumder_lab@gwu.edu your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.

# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs for training an LLM to recommend publications with datasets suited to intervention outcome prediction.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen. <u>We are also looking for individuals who have previously worked with us to take on a coordinator role</u>.''
----

==== 1. BiomarkerKB Biocuration Project Ideas ====
POC: Maria Kim, Cyrus Yeung, Jeet Vora

# Curate biomarkers for a specific disease or for a treatment
## The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate biomarkers. The data provided does not follow the biomarker data model.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on NLP/LLM methods.
# Continue working on LLM methods started by volunteers in Fall 2025.
::: The data is available as well as some preliminary research and work done by previous volunteers in this area.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check whether it is feasible as a project.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger, Kate Warner, Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding; however, the data contained within them remains highly valuable to the research community. Integrating these legacy datasets into modern databases or knowledgebases, such as GlyGen, presents a significant challenge because much of the associated metadata (e.g., species, tissue, disease, cell line) is recorded as free-text that does not conform to the standardized dictionaries and ontologies used by current resources.

To address this challenge, this project will leverage large language models (LLMs) to automate the mapping of free-text metadata from legacy databases, specifically CarbBank and CFG, to standardized accessions in authoritative resources such as NCBI Taxonomy, Disease Ontology, and Cellosaurus. The LLM-based workflow will identify and normalize synonyms, abbreviations, and spelling variants (e.g., “human,” “man,” or “h. sapiens” mapped to Homo sapiens), enabling scalable and reproducible metadata harmonization that would otherwise require extensive manual curation. The LLM tasks will be performed using OpenAI resources integrated into the GlyGen curation pipeline. The project involves developing Python scripts to read and write data, invoke the OpenAI API, and compare results with manually curated data. Another aspect of the work is developing and fine-tuning a ChatGPT prompt so that the mapping is reliable and accurate.

While the mapping process will be largely automated, manual validation will be incorporated as a quality-control step to assess model performance, verify correctness, and identify edge cases requiring refinement. This hybrid approach significantly reduces curator burden while ensuring high-quality, ontology-aligned annotations.
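
Before any terms are sent to an LLM, trivially different spellings can be collapsed so that each distinct query is issued only once. The sketch below shows one way to do that. The helper names and the normalization rules (lowercasing, whitespace collapsing, stripping trailing punctuation) are assumptions chosen for illustration.

```python
import re
from collections import defaultdict


def normalize(term):
    """Lowercase, trim, collapse whitespace, and drop trailing punctuation."""
    cleaned = re.sub(r"\s+", " ", term.strip().lower())
    return cleaned.rstrip(".,;")


def group_variants(raw_terms):
    """Map each normalized term to the raw spellings that produced it."""
    groups = defaultdict(list)
    for raw in raw_terms:
        groups[normalize(raw)].append(raw)
    return dict(groups)


# Hypothetical free-text values as they might appear in a legacy record.
raw = ["Human", "human ", "H. sapiens", "h.  sapiens", "man", "Man."]
groups = group_variants(raw)
print(len(raw), "raw terms ->", len(groups), "unique queries")
```

Keeping the raw spellings attached to each normalized key lets the resulting accession be written back to every original record.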

The goal of this effort is to migrate and modernize datasets from CarbBank and CFG, making them interoperable with GlyGen and other contemporary glycoinformatics resources through a scalable, AI-assisted curation strategy.

For any questions, please contact Rene Ranzinger (rene@ccrc.uga.edu) or Kate Warner (k.warner1@email.gwu.edu).

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.
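
Identifying the institutions that form the community could begin with a simple count of how many papers mention each institution in their affiliation text. The sketch below assumes the records have already been retrieved and stored as dictionaries with an <code>affiliations</code> list. That layout, the function name, and the sample records are hypothetical.

```python
from collections import Counter


def count_institutions(records, known_institutions):
    """Count papers per institution by substring match on affiliation text.

    Each paper counts at most once per institution.
    """
    counts = Counter()
    for record in records:
        text = " ".join(record["affiliations"]).lower()
        for name in known_institutions:
            if name.lower() in text:
                counts[name] += 1
    return counts


# Hypothetical records; the PMIDs and affiliation strings are made up.
records = [
    {"pmid": "1", "affiliations": ["Complex Carbohydrate Research Center, University of Georgia"]},
    {"pmid": "2", "affiliations": ["The George Washington University", "University of Georgia"]},
    {"pmid": "3", "affiliations": ["Dept. of Biochemistry, The George Washington University"]},
]
institutions = ["University of Georgia", "George Washington University"]
counts = count_institutions(records, institutions)
print(counts.most_common())
```

Substring matching misses abbreviations and alternate spellings, so a real analysis would likely need a curated alias list per institution.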

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning (ML) Modeling Project ====
POC: Lori Krammer, Pat McNeely (optional)

Volunteers will conduct ML modeling using publicly available -omics datasets that were previously identified (see our [[Recommended Publications for Intervention Outcome Prediction Models|Recommended Publications for IOPMs]] page). This volunteership will involve data harmonization, model training, and pipeline documentation.

Tasks associated with this project include:

# Exploring and understanding the data found in relevant PMIDs that can be used to train intervention outcome prediction models.
# Preparing the data for model training and model performance evaluation
# Testing the modeling tutorial, PredictMod platform, and associated project tools
# Documentation of the ML pipeline and testing results
Deliverables for this project include:

# ML-ready datasets
# Trained model scripts
# Pipeline documentation captured in BioCompute Objects (BCOs) and testing reports
# Volunteership documentation (final report or weekly progress reports)

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and a final presentation of your work.

==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====

''Note:'' For anyone interested in ARGOS, you may be assigned to another project of your choice. This project is contingent on a contract extension. Please complete your project selection in order of preference.

POC: Christie Woodside

Qualifications: basic to intermediate programming skills and familiarity with basic bioinformatics platforms and skills.

# Curate and report on currently circulating pathogens to upload to ARGOS
## The student would manually curate circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports on what was found.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.
# QC Analysis using HIVE
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.
## The results will be added to our ARGOS database.
# Report Results
## Defend the pathogens you selected for addition to the database. Explain their importance and the value they would bring to the scientific community.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check whether it is feasible as a project for the Spring.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 13-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program. Additional recognition will be given to the top three volunteers with exceptional presentations at the end of the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers (TBD) ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|[https://www.linkedin.com/in/diya-kamalabharathy-62557935a/ Diya Kamalabharathy*]
|PredictMod
|Lori Krammer; Urnisha Bhuiyan; Rene Ranzinger
|PredictMod; Glyco web development
|-
|Sampurna Chakravorty
|PredictMod
|Lori Krammer
|PredictMod; ARGOS; BiomarkerKB
|-
|Vishal Muthusekaran*
|BiomarkerKB
|Maria Kim; Cyrus Yeung; Jeet Vora
|BiomarkerKB
|-
|[https://www.linkedin.com/in/ashley-tien/ Ashley Tien*]
|PredictMod
|Lori Krammer
|PredictMod
|-
|[https://www.linkedin.com/in/conner-cognata/ Conner Cognata]
|BiomarkerKB
|Maria Kim; Cyrus Yeung; Jeet Vora
|BiomarkerKB; PredictMod; GlyGen biocuration
|-
|Venya Gulati
|ARGOS
|Christie Woodside
|ARGOS; PredictMod; BiomarkerKB
|-
|Isaac Kim
|
|
|PredictMod; GlyGen biocuration; ARGOS
|-
|Yashitha Pobbareddy
|ARGOS
|Christie Woodside
|ARGOS; GlyGen biocuration; BiomarkerKB
|}
<nowiki>*</nowiki>Returning volunteer.

Volunteership Spring 2026

2026-01-05T15:05:45Z

Urnisha.bhuiyan: /* Volunteers (TBD) */

== 2026 Spring Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

January 9, 2026, Noon (email your updated resume and a list of projects in order of preference). Acceptance letters/emails will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

January 12, 2026 | 4:00 to 5:00 PM

'''Program Dates: January 2026 – April 2026''' (13 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[[Volunteership Fall 2025|Fall 2025 Volunteership]] (Closed)
----

=== Volunteer Expectations ===

# Minimum commitment of 10 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# 30-minute Zoom meetings (during regular work hours) once a week or every other week with the assigned project point of contact (POC).
# Attend some lectures or seminars remotely (max 4-5).

'''''Important:''' If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.''
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Spring 2026. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email mazumder_lab@gwu.edu your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.

# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs for training an LLM to recommend publications with datasets suited to intervention outcome prediction.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen. <u>We are also looking for individuals who have previously worked with us to take on a coordinator role</u>.''
----

==== 1. BiomarkerKB Biocuration Project Ideas ====
POC: Maria Kim, Cyrus Yeung, Jeet Vora

# Curate biomarkers for a specific disease
## The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate biomarkers. The data provided does not follow the biomarker data model.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.
# Continue working on LLM methods started by volunteers over the summer.
## The data is available as well as some preliminary research and work done by previous volunteers in this area.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check whether it is feasible as a project for the Spring.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."
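
The limits of automated matching can be seen with a small string-similarity experiment using Python's standard <code>difflib</code> module. The vocabulary list, the function name, and the 0.8 cutoff below are assumptions picked for illustration. They are not part of any GlyGen tooling.

```python
from difflib import get_close_matches

# Hypothetical controlled vocabulary of scientific species names.
SPECIES = ["homo sapiens", "mus musculus", "rattus norvegicus", "bos taurus"]


def match_species(term, cutoff=0.8):
    """Return the vocabulary name most similar to a free-text term, if any."""
    return get_close_matches(term.strip().lower(), SPECIES, n=1, cutoff=cutoff)


print(match_species("Homo sapeins"))  # spelling error
print(match_species("man"))           # synonym
```

The misspelling resolves to "homo sapiens" because the strings are nearly identical, while the synonym "man" returns no match at all. Synonyms share meaning rather than characters, which is why this mapping calls for curator judgment.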

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.
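
The full-text subproject could be prototyped with a whole-word, case-insensitive scan for resource names. In the sketch below, the function name, the sample sentence, and the list of names are illustrative assumptions (GlyTouCan and UniCarbKB are other glycoinformatics resources, included only as examples).

```python
import re
from collections import Counter

# Example resource names; the real list would come from the project team.
RESOURCES = ["GlyGen", "GlyTouCan", "UniCarbKB"]


def count_mentions(text, names=RESOURCES):
    """Count whole-word, case-insensitive mentions of each resource name."""
    counts = Counter()
    for name in names:
        pattern = re.compile(rf"\b{re.escape(name)}\b", re.IGNORECASE)
        counts[name] = len(pattern.findall(text))
    return counts


sample = "Structures were retrieved from GlyTouCan and annotated via GlyGen (glygen.org)."
mentions = count_mentions(sample)
print(mentions)
```

Because the match is case-insensitive, a URL such as glygen.org also counts as a mention. Whether URLs should count is a design decision to settle with the project members.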

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer, Pat McNeely (optional)

Identifying relevant and useful publicly-available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly-available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet + annotated PDFs for PubMed articles relevant to prostate, lung, breast cancers, biomarkers and glycans, and focus on indicators such as condition, intervention, and response.

PMID curation involves:

# Identify potentially relevant PMIDs that may have publicly-available datasets for training intervention outcome prediction models.
# Curate indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.
# Review peer curations and resolve annotation conflicts.
# Prepare a Wikipage to showcase the validated PMIDs.
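
Resolving annotation conflicts first requires finding them. The sketch below compares two curators' annotations field by field, using the condition, intervention, and response indicators named above. The function name, the dictionary layout, and the placeholder PMIDs are hypothetical.

```python
def find_conflicts(curator_a, curator_b, fields=("condition", "intervention", "response")):
    """List (pmid, field, value_a, value_b) where two curators disagree."""
    conflicts = []
    for pmid in sorted(set(curator_a) & set(curator_b)):
        for field in fields:
            a = curator_a[pmid].get(field)
            b = curator_b[pmid].get(field)
            if a != b:
                conflicts.append((pmid, field, a, b))
    return conflicts


# Hypothetical annotations keyed by placeholder PMIDs.
curator_a = {"PMID_1": {"condition": "breast cancer", "intervention": "chemotherapy", "response": "yes"}}
curator_b = {"PMID_1": {"condition": "breast cancer", "intervention": "immunotherapy", "response": "yes"}}
for conflict in find_conflicts(curator_a, curator_b):
    print(conflict)
```

Only PMIDs annotated by both curators are compared. Articles reviewed by a single curator would need a separate check.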

Interested individuals should reach out to lorikrammer@gwu.edu. Please note that this project requires attendance at biweekly meetings and weekly 1-2 paragraph reports.

==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====

''Note:'' For anyone interested in ARGOS, you may be assigned to another project of your choice. This project is contingent on a contract extension. Please complete your project selection in order of preference.

POC: Christie Woodside

Qualifications: basic to intermediate programming skills and familiarity with basic bioinformatics platforms and skills.

# Curate and report on currently circulating pathogens to upload to ARGOS
## The student would manually curate circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports on what was found.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.
# QC Analysis using HIVE
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.
## The results will be added to our ARGOS database.
# Report Results
## Defend the pathogens you selected for addition to the database. Explain their importance and the value they would bring to the scientific community.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check whether it is feasible as a project for the Spring.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 13-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program. Additional recognition will be given to the top three volunteers with exceptional presentations at the end of the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers (TBD) ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|[https://www.linkedin.com/in/diya-kamalabharathy-62557935a/ Diya Kamalabharathy*]
|PredictMod
|Lori Krammer; Urnisha Bhuiyan; Rene Ranzinger
|PredictMod; Glyco web development
|-
|Vishal Bakshi
|PredictMod
|Lori Krammer
|PredictMod
|-
|Sampurna Chakravorty
|PredictMod
|Lori Krammer
|PredictMod; ARGOS; BiomarkerKB
|}
<nowiki>*</nowiki>Returning volunteer.

Volunteership Fall 2025

2025-08-25T13:56:19Z

Urnisha.bhuiyan: /* Volunteers (TBD) */ Changed Anika to PredictMod project

== 2025 Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

August 22, 2025, Noon (email your updated resume and your projects in order of preference). An acceptance letter/email will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

August 25, 2025 | 4:00 to 5:00 PM

'''Program Dates: September 1st, 2025 – November 30th, 2025''' (13 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_2025 Summer 2025 Volunteership] (Closed)
----

=== Volunteer Expectations ===

# Minimum commitment of 10 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Regular Zoom meetings with the assigned project point of contact.
# Attend some lectures or seminars remotely (max 4-5).

'''''Important:''' If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.''
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Fall 2025. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email mazumder_lab@gwu.edu your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs to build a training corpus for an LLM that recommends publications with datasets useful for intervention outcome prediction.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
----

==== 1. BiomarkerKB Biocuration Project Ideas ====
POC: Daniall Masood, Maria Kim

# Curate biomarkers for a specific disease
## The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate biomarkers. The data supplied does not follow the biomarker data model.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.
# Continue working on LLM methods started by volunteers over the summer.
## The data is available as well as some preliminary research and work done by previous volunteers in this area.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss them and check whether they are feasible as a project for the Fall.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.
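The term-mapping step can be pictured with a minimal Python sketch. It assumes a small hand-built synonym table (the <code>SPECIES_SYNONYMS</code> dictionary and both function names are hypothetical and not part of GlyGen); any term that is not found in the table is returned as <code>None</code> so that a curator can review it manually.

```python
import re

# Hypothetical synonym table for illustration; a real table would be built
# from the dictionary or ontology chosen by the project.
SPECIES_SYNONYMS = {
    "human": "Homo sapiens",
    "man": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
    "homo sapiens": "Homo sapiens",
}


def normalize(term: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return re.sub(r"\s+", " ", term.strip().lower())


def map_species(term: str):
    """Return the standard species name, or None if the term needs manual review."""
    return SPECIES_SYNONYMS.get(normalize(term))


# "humna" is a spelling error that the table does not cover, so it maps to None.
for raw in ["Human", " man ", "H.  sapiens", "humna"]:
    print(repr(raw), "->", map_species(raw))
```

Exact lookup only catches synonyms that are already known; misspellings such as "humna" fall through, which is why manual curation is still needed.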

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.
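As a rough illustration of the keyword-filtering step, the sketch below builds a query URL for the NCBI E-utilities ESearch endpoint and reads the PMID list from a JSON response. The helper names, the example keywords, and the sample response are assumptions made for illustration; the sketch does not send a network request, and the actual project may use a different approach.

```python
import json
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(keywords, retmax=20):
    """Build an ESearch URL that matches any keyword in the title or abstract."""
    term = " OR ".join(f'"{kw}"[Title/Abstract]' for kw in keywords)
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_pmids(response_text):
    """Extract the PMID list from an ESearch JSON response."""
    return json.loads(response_text)["esearchresult"]["idlist"]


# Example keywords are placeholders chosen for illustration.
url = build_esearch_url(["GlyGen", "glycomics"])

# Hand-written sample response with placeholder IDs, not real search output.
sample = '{"esearchresult": {"count": "2", "idlist": ["11111111", "22222222"]}}'
pmids = parse_pmids(sample)

print(url)
print(pmids)
```

To run a live query, the URL could be fetched with <code>urllib.request.urlopen</code> and the response body passed to <code>parse_pmids</code>.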

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer, Tianyi Wang, Pat McNeely (optional)

Identifying relevant and useful publicly available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet and annotated PDFs for PubMed articles relevant to prostate, lung, and breast cancers, biomarkers, and glycans, and will focus on indicators such as condition, intervention, and response.

PMID curation involves:

# Identifying potentially relevant PMIDs that may have publicly available datasets for training intervention outcome prediction models.
# Curating indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.
# Reviewing peer curations and resolving annotation conflicts.
# Preparing a wiki page to showcase the validated PMIDs.
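One possible way to surface annotation conflicts before a review meeting is sketched below. It assumes each curator's annotations are held as a dictionary keyed by PMID with the indicator fields named above (condition, intervention, response). The data layout, the function name, and the sample entries are hypothetical and do not describe the project's actual spreadsheet format.

```python
# Indicator fields named in the project description.
FIELDS = ("condition", "intervention", "response")


def find_conflicts(curator_a, curator_b):
    """Return (pmid, field, value_a, value_b) for each disagreement on shared PMIDs."""
    conflicts = []
    for pmid in sorted(curator_a.keys() & curator_b.keys()):
        for field in FIELDS:
            # Compare case-insensitively so "Breast cancer" and "breast cancer" agree.
            a = curator_a[pmid].get(field, "").strip().lower()
            b = curator_b[pmid].get(field, "").strip().lower()
            if a != b:
                conflicts.append((pmid, field, a, b))
    return conflicts


# Made-up annotations with placeholder PMIDs, for illustration only.
curator_a = {
    "00000001": {"condition": "Breast cancer", "intervention": "chemotherapy", "response": "responder"},
    "00000002": {"condition": "lung cancer", "intervention": "immunotherapy", "response": "non-responder"},
}
curator_b = {
    "00000001": {"condition": "breast cancer", "intervention": "radiotherapy", "response": "responder"},
    "00000003": {"condition": "prostate cancer", "intervention": "surgery", "response": "responder"},
}

conflicts = find_conflicts(curator_a, curator_b)
print(conflicts)
```

Only PMIDs annotated by both curators are compared; entries seen by a single curator would still need a second review.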

Interested individuals should reach out to lorikrammer@gwu.edu.

==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====

POC: Christie Woodside, Jonathon Keeney

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires close attention to detail.
## Additional Work: Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would work on manual curation of circulating pathogens to be added to data.argosdb.org. Regular check-ins and reports of what was found.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.
# QC Analysis using HIVE
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.
## The results will be added to our ARGOS database.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss them and check whether they are feasible as a project for the Fall.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 13-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers (TBD) ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|Diya Kamalabharathy*
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod; Glyco web development
|-
|Akale Kinfe*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Nahom Abel*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Harivinay P. Gujjula*
|GlyGen
|Rene Ranzinger, Urnisha Bhuiyan, Kate Warner
|GlyGen
|-
|Sparsh Gupta*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Mathias Belay*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Isil Erbasol Serbes
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod, BiomarkerKB, ARGOS
|-
|Ramtin Mashhoon
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod
|-
|Anagha Kalle
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod
|-
|Vishal Muthusekaran
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Adonay Awet
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Miao Wang*
|
|
|ARGOS
|-
|Anika Sikka
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod
|-
|Farah Kamila
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod, ARGOS, BiomarkerKB
|}
<nowiki>*</nowiki>Returning volunteer.

Volunteership Fall 2025

2025-08-22T15:01:50Z

Urnisha.bhuiyan: /* Volunteers (TBD) */

== 2025 Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

August 22, 2025, Noon (email your updated resume and your projects in order of preference). An acceptance letter/email will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

August 25, 2025 | 4:00 to 5:00 PM

'''Program Dates: September 1st, 2025 – November 30th, 2025''' (13 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_2025 Summer 2025 Volunteership] (Closed)
----

=== Volunteer Expectations ===

# Minimum commitment of 10 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Regular Zoom meetings with the assigned project point of contact.
# Attend some lectures or seminars remotely (max 4-5).

'''''Important:''' If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.''
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Fall 2025. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email mazumder_lab@gwu.edu your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs to build a training corpus for an LLM that recommends publications with datasets useful for intervention outcome prediction.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
----

==== 1. BiomarkerKB Biocuration Project Ideas ====
POC: Daniall Masood, Maria Kim

# Curate biomarkers for a specific disease
## The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate biomarkers. The data supplied does not follow the biomarker data model.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.
# Continue working on LLM methods started by volunteers over the summer.
## The data is available as well as some preliminary research and work done by previous volunteers in this area.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss them and check whether they are feasible as a project for the Fall.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer, Tianyi Wang, Pat McNeely (optional)

Identifying relevant and useful publicly available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet and annotated PDFs for PubMed articles relevant to prostate, lung, and breast cancers, biomarkers, and glycans, and will focus on indicators such as condition, intervention, and response.

PMID curation involves:

# Identifying potentially relevant PMIDs that may have publicly available datasets for training intervention outcome prediction models.
# Curating indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.
# Reviewing peer curations and resolving annotation conflicts.
# Preparing a wiki page to showcase the validated PMIDs.

Interested individuals should reach out to lorikrammer@gwu.edu.

==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====

POC: Christie Woodside, Jonathon Keeney

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires close attention to detail.
## Additional Work: Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would work on manual curation of circulating pathogens to be added to data.argosdb.org. Regular check-ins and reports of what was found.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.
# QC Analysis using HIVE
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.
## The results will be added to our ARGOS database.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss them and check whether they are feasible as a project for the Fall.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 13-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers (TBD) ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|Diya Kamalabharathy*
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod; Glyco web development
|-
|Akale Kinfe*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Nahom Abel*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Harivinay P. Gujjula*
|GlyGen
|Rene Ranzinger, Urnisha Bhuiyan, Kate Warner
|GlyGen
|-
|Sparsh Gupta*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Mathias Belay*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Isil Erbasol Serbes
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod, BiomarkerKB, ARGOS
|-
|Ramtin Mashhoon
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod
|-
|Anagha Kalle
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod
|-
|Vishal Muthusekaran
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Adonay Awet
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Miao Wang*
|
|
|ARGOS
|-
|Anika Sikka
|GlyGen
|Rene Ranzinger, Urnisha Bhuiyan, Kate Warner
|GlyGen
|}
<nowiki>*</nowiki>Returning volunteer.

Volunteership Fall 2025

2025-08-22T14:07:03Z

Urnisha.bhuiyan: /* Contact */

== 2025 Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

August 22, 2025, Noon (email your updated resume and your projects in order of preference). An acceptance letter/email will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

August 25, 2025 | 4:00 to 5:00 PM

'''Program Dates: September 1st, 2025 – November 30th, 2025''' (13 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_2025 Summer 2025 Volunteership] (Closed)
----

=== Volunteer Expectations ===

# Minimum commitment of 10 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Regular Zoom meetings with the assigned project point of contact.
# Attend some lectures or seminars remotely (max 4-5).

'''''Important:''' If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.''
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Fall 2025. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email mazumder_lab@gwu.edu your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs to build a training corpus for an LLM that recommends publications with datasets useful for intervention outcome prediction.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
----

==== 1. BiomarkerKB Biocuration Project Ideas ====
POC: Daniall Masood, Maria Kim

# Curate biomarkers for a specific disease
## The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate biomarkers. The data supplied does not follow the biomarker data model.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.
# Continue working on LLM methods started by volunteers over the summer.
## The data is available as well as some preliminary research and work done by previous volunteers in this area.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss them and check whether they are feasible as a project for the Fall.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer, Tianyi Wang, Pat McNeely (optional)

Identifying relevant and useful publicly available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet and annotated PDFs for PubMed articles relevant to prostate, lung, and breast cancers, biomarkers, and glycans, and will focus on indicators such as condition, intervention, and response.

PMID curation involves:

# Identifying potentially relevant PMIDs that may have publicly available datasets for training intervention outcome prediction models.
# Curating indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.
# Reviewing peer curations and resolving annotation conflicts.
# Preparing a wiki page to showcase the validated PMIDs.

Interested individuals should reach out to lorikrammer@gwu.edu.

==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====

POC: Christie Woodside, Jonathon Keeney

# Update data tables for more efficient computations
## The student would review and enter additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires close attention to detail.
## Additional Work: Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## The student would manually curate circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports on what was found.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.
# QC Analysis using HIVE
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.
## The results will be added to our ARGOS database.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check whether it would be feasible as a project for the Fall.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 13-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers (TBD) ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|Diya Kamalabharathy*
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod; Glyco web development
|-
|Akale Kinfe*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Nahom Abel*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Harivinay P. Gujjula*
|
|Rene Ranzinger, Urnisha Bhuiyan, Kate Warner
|GlyGen
|-
|Sparsh Gupta*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Mathias Belay*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Isil Erbasol Serbes
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod, BiomarkerKB, ARGOS
|-
|Ramtin Mashhoon
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod
|-
|Anagha Kalle
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod
|-
|Vishal Muthusekaran
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Adonay Awet
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Miao Wang*
|
|
|ARGOS
|-
|Anika Sikka
|GlyGen
|Rene Ranzinger, Urnisha Bhuiyan, Kate Warner
|GlyGen
|}
<nowiki>*</nowiki>Returning volunteer.

Volunteership Fall 2025

2025-08-22T14:06:39Z

Urnisha.bhuiyan: /* Volunteers (TBD) */

== 2025 Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

August 22, 2025, Noon (email your updated resume and projects in order of preference). Acceptance letters/emails will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

August 25, 2025 | 4:00 to 5:00 PM

'''Program Dates: September 1st, 2025 – November 30th, 2025''' (13 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_2025 Summer 2025 Volunteership] (Closed)
----

=== Volunteer Expectations ===

# Minimum commitment of 10 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Regular Zoom meetings with the assigned project point of contact.
# Attend some lectures or seminars remotely (max 4-5).

'''''Important:''' If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.''
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Fall 2025. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email mazumder_lab@gwu.edu your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs to train an LLM that recommends publications with datasets useful for intervention outcome prediction.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
----

==== 1. BiomarkerKB Biocuration Project Ideas ====
POC: Daniall Masood, Maria Kim

# Curate biomarkers for a specific disease
## The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate biomarkers. The data provided is not in the biomarker data model format.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.
# Continue working on LLM methods started by volunteers over the summer.
## The data is available as well as some preliminary research and work done by previous volunteers in this area.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check whether it would be feasible as a project for the Fall.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.
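
To make the mapping problem concrete, here is a minimal sketch of how free-text species terms could be pre-sorted before manual curation. It assumes a small hand-made synonym table (the three entries come from the example above) and uses Python's standard <code>difflib</code> for approximate matching. The function names, the cutoff value, and the "exact"/"fuzzy"/"unmapped" labels are hypothetical; any fuzzy or unmapped result would still need review by a curator.

```python
import difflib

# Illustrative synonym table; a real one would be built by curators.
SYNONYMS = {
    "human": "Homo sapiens",
    "man": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
}


def normalize(term):
    """Lower-case the term and collapse repeated whitespace."""
    return " ".join(term.lower().split())


def map_term(term, synonyms=SYNONYMS, cutoff=0.75):
    """Return (standard_name, match_type) for a free-text term.

    match_type is "exact", "fuzzy" (likely spelling error), or
    "unmapped" (left for manual curation).
    """
    key = normalize(term)
    if key in synonyms:
        return synonyms[key], "exact"
    close = difflib.get_close_matches(key, list(synonyms), n=1, cutoff=cutoff)
    if close:
        return synonyms[close[0]], "fuzzy"
    return None, "unmapped"
```

For example, <code>map_term("H. sapiens")</code> is resolved through the table, while a term absent from the table, such as "mouse", is reported as unmapped so that a curator can look it up in the relevant ontology.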

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer, Tianyi Wang, Pat McNeely (optional)

Identifying relevant and useful publicly-available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly-available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet and annotated PDFs for PubMed articles relevant to prostate, lung, and breast cancers, biomarkers, and glycans, with a focus on indicators such as condition, intervention, and response.

PMID curation involves:

# Identifying potentially relevant PMIDs that may have publicly-available datasets for training intervention outcome prediction models.
# Curating indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.
# Reviewing peer curations and resolving annotation conflicts.
# Preparing a wiki page to showcase the validated PMIDs.

Interested individuals should reach out to lorikrammer@gwu.edu.

==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====

POC: Christie Woodside, Jonathon Keeney

# Update data tables for more efficient computations
## The student would review and enter additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires close attention to detail.
## Additional Work: Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## The student would manually curate circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports on what was found.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.
# QC Analysis using HIVE
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.
## The results will be added to our ARGOS database.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check whether it would be feasible as a project for the Fall.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 13-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers (TBD) ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|Diya Kamalabharathy*
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod; Glyco web development
|-
|Akale Kinfe*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Nahom Abel*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Harivinay P. Gujjula*
|
|Rene Ranzinger, Urnisha Bhuiyan, Kate Warner
|GlyGen
|-
|Sparsh Gupta*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Mathias Belay*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Isil Erbasol Serbes
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod, BiomarkerKB, ARGOS
|-
|Ramtin Mashhoon
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod
|-
|Anagha Kalle
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod
|-
|Vishal Muthusekaran
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Adonay Awet
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Miao Wang*
|
|
|ARGOS
|-
|Anika Sikka
|GlyGen
|Rene Ranzinger, Urnisha Bhuiyan, Kate Warner
|GlyGen
|}
<nowiki>*</nowiki>Returning volunteer.

Volunteership Fall 2025

2025-08-22T14:05:33Z

Urnisha.bhuiyan: /* Volunteers (TBD) */

== 2025 Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

August 22, 2025, Noon (email your updated resume and projects in order of preference). Acceptance letters/emails will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

August 25, 2025 | 4:00 to 5:00 PM

'''Program Dates: September 1st, 2025 – November 30th, 2025''' (13 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_2025 Summer 2025 Volunteership] (Closed)
----

=== Volunteer Expectations ===

# Minimum commitment of 10 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Regular Zoom meetings with the assigned project point of contact.
# Attend some lectures or seminars remotely (max 4-5).

'''''Important:''' If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.''
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Fall 2025. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email mazumder_lab@gwu.edu your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs to train an LLM that recommends publications with datasets useful for intervention outcome prediction.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
----

==== 1. BiomarkerKB Biocuration Project Ideas ====
POC: Daniall Masood, Maria Kim

# Curate biomarkers for a specific disease
## The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate biomarkers. The data provided is not in the biomarker data model format.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.
# Continue working on LLM methods started by volunteers over the summer.
## The data is available as well as some preliminary research and work done by previous volunteers in this area.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check whether it would be feasible as a project for the Fall.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer, Tianyi Wang, Pat McNeely (optional)

Identifying relevant and useful publicly-available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly-available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet and annotated PDFs for PubMed articles relevant to prostate, lung, and breast cancers, biomarkers, and glycans, with a focus on indicators such as condition, intervention, and response.

PMID curation involves:

# Identifying potentially relevant PMIDs that may have publicly-available datasets for training intervention outcome prediction models.
# Curating indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.
# Reviewing peer curations and resolving annotation conflicts.
# Preparing a wiki page to showcase the validated PMIDs.

Interested individuals should reach out to lorikrammer@gwu.edu.

==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====

POC: Christie Woodside, Jonathon Keeney

# Update data tables for more efficient computations
## The student would review and enter additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires close attention to detail.
## Additional Work: Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## The student would manually curate circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports on what was found.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.
# QC Analysis using HIVE
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.
## The results will be added to our ARGOS database.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check whether it would be feasible as a project for the Fall.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 13-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers (TBD) ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|Diya Kamalabharathy*
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod; Glyco web development
|-
|Akale Kinfe*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Nahom Abel*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Harivinay P. Gujjula
|
|Rene Ranzinger, Urnisha Bhuiyan, Kate Warner
|GlyGen
|-
|Sparsh Gupta*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Mathias Belay*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Isil Erbasol Serbes
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod, BiomarkerKB, ARGOS
|-
|Ramtin Mashhoon
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod
|-
|Anagha Kalle
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod
|-
|Vishal Muthusekaran
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Adonay Awet
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Miao Wang*
|
|
|ARGOS
|-
|Anika Sikka
|GlyGen
|Rene Ranzinger, Urnisha Bhuiyan, Kate Warner
|GlyGen
|}
<nowiki>*</nowiki>Returning volunteer.

Volunteership Fall 2025

2025-08-14T16:27:45Z

Urnisha.bhuiyan: /* Volunteers (TBD) */ Added Harivinay to volunteer list.

== 2025 Volunteer Program Details ==

=== Dates ===
'''Volunteer Zoom Kick-Off Meeting'''

August 25, 2025 | 4:00 to 5:00 PM

'''Program Dates: September 1st, 2025 – November 30th, 2025''' (13 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_2025 Summer 2025 Volunteership] (Closed)
----

=== Volunteer Expectations ===

# Minimum commitment of 10 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Regular Zoom meetings with the assigned project point of contact.
# Attend some lectures or seminars remotely (max 4-5).

'''''Important:''' If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.''
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Fall 2025. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email mazumder_lab@gwu.edu your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs to train an LLM that recommends publications with datasets useful for intervention outcome prediction.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
----

==== 1. BiomarkerKB Biocuration Project Ideas ====
POC: Daniall Masood, Maria Kim

# Curate biomarkers for a specific disease (Alzheimer's)
## The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate biomarkers. The data provided is not in the biomarker data model format.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check whether it would be feasible as a project for the Fall.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.
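
The mapping step can be illustrated with a minimal Python sketch. It assumes a small hand-built synonym table (hypothetical, seeded only with the "human" / "man" / "h. sapiens" example above) and is not the GlyGen project's actual tooling. Terms that the table cannot resolve are set aside for manual curation.

```python
# Hypothetical synonym table for illustration only; a real project would
# load curated mappings from the relevant dictionaries and ontologies.
SPECIES_SYNONYMS = {
    "human": "Homo sapiens",
    "man": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
}


def normalize_species(term):
    """Return the standard species name for a free-text term, or None."""
    key = " ".join(term.lower().split())  # lowercase and collapse whitespace
    return SPECIES_SYNONYMS.get(key)


def split_mapped(terms):
    """Separate terms into those mapped automatically and those left for curators."""
    mapped, unmapped = {}, []
    for term in terms:
        name = normalize_species(term)
        if name is None:
            unmapped.append(term)  # needs manual curation
        else:
            mapped[term] = name
    return mapped, unmapped
```

Calling `split_mapped(["man", "humna"])` would map "man" and leave the misspelled "humna" in the list for a curator to resolve.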

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.
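
As a starting point for the keyword filtering step, the following Python sketch builds a query for the NCBI E-utilities ESearch endpoint and reads the PMID list from its JSON response. The function names, the choice to AND the keywords, and the restriction to the `[Title/Abstract]` field are assumptions made for illustration, not a prescribed design.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(keywords, retmax=20):
    """Build an ESearch URL that requires every keyword in the title or abstract."""
    term = " AND ".join(f"{kw}[Title/Abstract]" for kw in keywords)
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def extract_pmids(response):
    """Pull the PMID list out of a decoded ESearch JSON response."""
    return response.get("esearchresult", {}).get("idlist", [])


def search_pubmed(keywords, retmax=20):
    """Run the query over the network and return the matching PMIDs."""
    with urlopen(build_esearch_url(keywords, retmax)) as handle:
        return extract_pmids(json.load(handle))
```

For example, `search_pubmed(["GlyGen"])` would return up to 20 PMIDs as strings. Heavier use should respect the request-rate guidance that NCBI publishes for E-utilities.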

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer, Tianyi Wang, Pat McNeely (optional)

Identifying relevant and useful publicly available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet and annotated PDFs for PubMed articles relevant to prostate, lung, breast, and liver cancer, and will focus on indicators such as condition, intervention, and response.

PMID curation involves:

# Identify potentially relevant PMIDs that may have publicly-available datasets for training intervention outcome prediction models.
# Curate indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.
# Review peer curations and resolve annotation conflicts.
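
The peer-review step can be supported by a small script that lists where two curators disagree. The sketch below assumes each curator's annotations are held as a dictionary keyed by PMID, with one value per indicator (condition, intervention, response); that layout is an assumption for illustration, not the project's actual spreadsheet format.

```python
def find_conflicts(curator_a, curator_b):
    """Return {pmid: {field: (value_a, value_b)}} for every disagreement.

    Only PMIDs annotated by both curators are compared.
    """
    conflicts = {}
    for pmid in sorted(curator_a.keys() & curator_b.keys()):
        a, b = curator_a[pmid], curator_b[pmid]
        diff = {
            field: (a.get(field), b.get(field))
            for field in sorted(a.keys() | b.keys())
            if a.get(field) != b.get(field)
        }
        if diff:
            conflicts[pmid] = diff
    return conflicts
```

The output gives curators a short list of fields to discuss instead of requiring a full side-by-side read of both annotation sets.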

Interested individuals should reach out to lorikrammer@gwu.edu.

==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====

POC: Christie Woodside, Jonathon Keeney

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires close attention to detail.
## Additional Work: Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would work on manual curation of circulating pathogens to be added to data.argosdb.org. Regular check-ins and reports of what was found.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.
# QC Analysis using HIVE
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.
## The results will be added to our ARGOS database.

If you have other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss them and check whether they are feasible as a summer project.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 9-week program.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers (TBD) ===
{| class="wikitable"
|+
!Name
!Project Assigned
!Projects Interested
|-
|Diya Kamalabharathy
|
|PredictMod; Glyco web development
|-
|Anika Sikka
|
|GlyGen
|-
|Akale Kinfe
|
|BiomarkerKB
|-
|Nahom Abel
|
|BiomarkerKB
|-
|Harivinay P. Gujjula
|
|GlyGen
|-
|
|
|
|}

Volunteership Fall 2025

2025-08-08T14:24:33Z

Urnisha.bhuiyan: /* Volunteers (TBD) */ Added Nahom to BiomarkerKB

== 2025 Volunteer Program Details ==

=== Dates ===
'''Volunteer Zoom Kick-Off Meeting'''

August 25, 2025 | 4:00 to 5:00 PM

'''Program Dates: September 1st, 2025 – November 30th, 2025''' (13 weeks)

Remote
----

=== Volunteer Expectations ===

# Minimum commitment of 10 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Regular Zoom meetings with the assigned project point of contact.
# Attend some lectures or seminars remotely (max 4-5).

'''''Important:''' If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.''
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Fall 2025. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email mazumder_lab@gwu.edu your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs to train an LLM that recommends publications with datasets useful for intervention outcome prediction.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
----

==== 1. BiomarkerKB Biocuration Project Ideas ====
POC: Daniall Masood, Maria Kim

# Curate biomarkers for a specific disease (Alzheimer's)
## The student would be doing manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate biomarkers. The data supplied is not in the biomarker data model format.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.

If you have other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss them and check whether they are feasible as a project for the fall.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, is challenging because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the standard dictionaries and ontologies used in modern resources. Automated matching of these terms with dictionaries or ontologies is often not possible because of synonyms, spelling errors, and abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.
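
For the step of finding papers whose recorded titles contain spelling errors, approximate string matching can narrow the search before a curator confirms the hit. The sketch below uses `difflib.get_close_matches` from the Python standard library; the function name and the 0.8 similarity cutoff are assumptions chosen for illustration and would need tuning on real data.

```python
import difflib


def match_title(noisy_title, candidate_titles, cutoff=0.8):
    """Return the candidate title closest to a possibly misspelled title.

    Returns None when no candidate reaches the similarity cutoff.
    """
    lowered = {title.lower(): title for title in candidate_titles}
    hits = difflib.get_close_matches(
        noisy_title.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None
```

A title such as "Structre of a sialylated glycan from humna milk" would still match a correctly spelled candidate, while an unrelated string returns `None` and falls back to a manual search.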

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.
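
One possible way to approach the community-identification and filtering steps is to count how many distinct papers mention each institution and keep only those that recur. The Python sketch below assumes each paper is already represented as a dictionary with an `institutions` list (a hypothetical structure) and uses a simple frequency threshold as a stand-in for whatever filter the project adopts.

```python
from collections import Counter


def count_institutions(papers):
    """Count how many distinct papers mention each institution."""
    counts = Counter()
    for paper in papers:
        counts.update(set(paper["institutions"]))  # one vote per paper
    return counts


def core_community(papers, min_papers=2):
    """Keep institutions that appear in at least `min_papers` papers."""
    counts = count_institutions(papers)
    return sorted(name for name, n in counts.items() if n >= min_papers)
```

Institutions that appear in only one paper are the likeliest to be unrelated co-authors, which is why a minimum paper count is used here as a first-pass filter.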

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer, Pat McNeely

Identifying relevant and useful publicly available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet and annotated PDFs for PubMed articles relevant to prostate, lung, breast, and liver cancer, and will focus on indicators such as condition, intervention, and response.

PMID curation involves:

# Identify potentially relevant PMIDs that may have publicly-available datasets for training intervention outcome prediction models.
# Curate indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.
# Review peer curations and resolve annotation conflicts.

Interested individuals should reach out to lorikrammer@gwu.edu.

==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====

POC: Christie Woodside, Jonathon Keeney

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires close attention to detail.
## Additional Work: Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would work on manual curation of circulating pathogens to be added to data.argosdb.org. Regular check-ins and reports of what was found.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.
# QC Analysis using HIVE
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.
## The results will be added to our ARGOS database.

If you have other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss them and check whether they are feasible as a project for the fall.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 13-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers (TBD) ===
{| class="wikitable"
|+
!Name
!Project
!Projects Interested
|-
|''Anika Sikka (tentative)''
|
|
# GlyGen
|-
|''Akale Kinfe''
|
|
# BiomarkerKB
|-
|''Nahom Abel''
|
|
# BiomarkerKB
|}

Symposium 2025

2025-07-17T15:13:53Z

Urnisha.bhuiyan: /* Agenda */ fixed typo

The HIVE Lab symposium is scheduled for Thursday, July 31, 2025. It is an exciting time for the lab volunteers and interns to present their findings on the projects they worked on for 8 weeks.

[[File:DC.png|center|frame]]

== '''Program and Information''' ==

=== '''Symposium Venue''' ===
The HIVE Lab symposium will be held in person at The George Washington University, Washington, DC, with an option to join virtually.

In Person - Ross 647, Ross Hall, School of Health and Medical Sciences, The George Washington University, Washington DC ([https://maps.app.goo.gl/PHQmZacA4hWDvTCh6 MAP])

Virtual - Zoom

== '''Agenda''' ==
All times are in Eastern Time (ET).
{| class="wikitable"
|'''Time (ET)'''
|'''Project'''
|'''Title'''
|'''Presenter'''
|-
|'''10:00am'''
| colspan="2" | '''Welcome and Introduction'''
|'''Michael Tiemeyer (10 min)'''
|-
| colspan="4" | ''Group 1 Moderator : Nathan Edwards''
|-
|10:10am
|CFDE
|Integrating Biocuration and Data Standardization to Generate Machine Learning-Ready Glycan Datasets
|Ana Jaramillo and Yuxin Zou (20 min)
|-
|10:30am
|CFDE
|
|Campbell Ross (15 min)
|-
|10:45am
|CFDE
|A Graph-Based AI Workflow for Mining Glycan Biomarkers and Related Annotations from Publications
|Cyrus Chun Hong Au Yeung (15 min)
|-
|11:00am
|BiomarkerKB
|
|(15 min)
|-
|11:15am
|BiomarkerKB
|
|(15 min)
|-
|11:30am
|BiomarkerKB
|
|(15 min)
|-
|'''11:45am'''
| colspan="2" |'''Open Q and A'''
|'''All (30 min)'''
|-
|12:30pm
| colspan="3" | '''LUNCH (90 mins)'''
|-
| colspan="4" | ''Group 1 Moderator : Nathan Edwards''
|-
|2:00pm
|PredictMod AI-READI
|Robust Classification of Glycemic Health States from Continuous Glucose
|Nikhil Arethiya (15 min)
|-
|2:15pm
|PredictMod Curation
|PredictMod: PubMed Curation for Training an LLM for Recommendation
|Grace Chong, Aaron Ressom, Diya Kamalabharathy (15 min)
|-
|2:30pm
|Argos
|
|(15 min)
|-
|2:45pm
|GlyGen
|GlyGen Biocuration Project
|Aise Arpinar, Haravinay P. Gujjulla, Nahom Abel (20 min)
|-
|3:05pm
|GlycoSiteMineros
|
|(15 min)
|-
|3:20pm
|Glycobiology Web Development
|A Resource Drill Down and Visualization for the Glyspace Alliance
|Diya Kamalabharathy (5 min)
|-
|'''3:25pm'''
| colspan="2" |'''Open Q and A'''
|'''All (20 min)'''
|-
|3:45pm
| colspan="2" | '''Closing Remarks'''
|'''Raja Mazumder'''
|}

== '''Project Description''' ==

=== GlyGen Project ===
The GlyGen Biocuration project focuses on integrating legacy, yet valuable, data from the CarbBank and CFG databases into the GlyGen infrastructure. A key challenge is mapping metadata, such as species names and publication references, to standardized dictionaries and ontologies. While most entries have been automatically matched using custom scripts, remaining inconsistencies, including outdated, misspelled, or abbreviated terms, require manual curation using resources such as Google, PubMed, and domain-specific dictionaries and ontologies.

Symposium 2025

2025-07-17T15:13:17Z

Urnisha.bhuiyan: /* Agenda */ updated presenters for glygen

The HIVE Lab symposium is scheduled for Thursday, July 31, 2025. It is an exciting time for the lab volunteers and interns to present their findings on the projects they worked on for 8 weeks.

[[File:DC.png|center|frame]]

== '''Program and Information''' ==

=== '''Symposium Venue''' ===
The HIVE Lab symposium will be held in person at The George Washington University, Washington, DC, with an option to join virtually.

In Person - Ross 647, Ross Hall, School of Health and Medical Sciences, The George Washington University, Washington DC ([https://maps.app.goo.gl/PHQmZacA4hWDvTCh6 MAP])

Virtual - Zoom

== '''Agenda''' ==
All times are in Eastern Time (ET).
{| class="wikitable"
|'''Time (ET)'''
|'''Project'''
|'''Title'''
|'''Presenter'''
|-
|'''10:00am'''
| colspan="2" | '''Welcome and Introduction'''
|'''Michael Tiemeyer (10 min)'''
|-
| colspan="4" | ''Group 1 Moderator : Nathan Edwards''
|-
|10:10am
|CFDE
|Integrating Biocuration and Data Standardization to Generate Machine Learning-Ready Glycan Datasets
|Ana Jaramillo and Yuxin Zou (20 min)
|-
|10:30am
|CFDE
|
|Campbell Ross (15 min)
|-
|10:45am
|CFDE
|A Graph-Based AI Workflow for Mining Glycan Biomarkers and Related Annotations from Publications
|Cyrus Chun Hong Au Yeung (15 min)
|-
|11:00am
|BiomarkerKB
|
|(15 min)
|-
|11:15am
|BiomarkerKB
|
|(15 min)
|-
|11:30am
|BiomarkerKB
|
|(15 min)
|-
|'''11:45am'''
| colspan="2" |'''Open Q and A'''
|'''All (30 min)'''
|-
|12:30pm
| colspan="3" | '''LUNCH (90 mins)'''
|-
| colspan="4" | ''Group 1 Moderator : Nathan Edwards''
|-
|2:00pm
|PredictMod AI-READI
|Robust Classification of Glycemic Health States from Continuous Glucose
|Nikhil Arethiya (15 min)
|-
|2:15pm
|PredictMod Curation
|PredictMod: PubMed Curation for Training an LLM for Recommendation
|Grace Chong, Aaron Ressom, Diya Kamalabharathy (15 min)
|-
|2:30pm
|Argos
|
|(15 min)
|-
|2:45pm
|GlyGen
|GlyGen Biocuration Project
|(20 min)
|-
|3:05pm
|GlycoSiteMineros
|
|Aise Arpinar, Haravinay P. Gujjulla, Nahom Abel (15 min)
|-
|3:20pm
|Glycobiology Web Development
|A Resource Drill Down and Visualization for the Glyspace Alliance
|Diya Kamalabharathy (5 min)
|-
|'''3:25pm'''
| colspan="2" |'''Open Q and A'''
|'''All (20 min)'''
|-
|3:45pm
| colspan="2" | '''Closing Remarks'''
|'''Raja Mazumder'''
|}

== '''Project Description''' ==

=== GlyGen Project ===
The GlyGen Biocuration project focuses on integrating legacy, yet valuable, data from the CarbBank and CFG databases into the GlyGen infrastructure. A key challenge is mapping metadata, such as species names and publication references, to standardized dictionaries and ontologies. While most entries have been automatically matched using custom scripts, remaining inconsistencies, including outdated, misspelled, or abbreviated terms, require manual curation using resources such as Google, PubMed, and domain-specific dictionaries and ontologies.

Symposium 2025

2025-07-17T15:10:24Z

Urnisha.bhuiyan: /* Agenda */ Changed GlyGen Title

The HIVE Lab symposium is scheduled for Thursday, July 31, 2025. It is an exciting time for the lab volunteers and interns to present their findings on the projects they worked on for 8 weeks.

[[File:DC.png|center|frame]]

== '''Program and Information''' ==

=== '''Symposium Venue''' ===
The HIVE Lab symposium will be held in person at The George Washington University, Washington, DC, with an option to join virtually.

In Person - Ross 647, Ross Hall, School of Health and Medical Sciences, The George Washington University, Washington DC ([https://maps.app.goo.gl/PHQmZacA4hWDvTCh6 MAP])

Virtual - Zoom

== '''Agenda''' ==
All times are in Eastern Time (ET).
{| class="wikitable"
|'''Time (ET)'''
|'''Project'''
|'''Title'''
|'''Presenter'''
|-
|'''10:00am'''
| colspan="2" | '''Welcome and Introduction'''
|'''Michael Tiemeyer (10 min)'''
|-
| colspan="4" | ''Group 1 Moderator : Nathan Edwards''
|-
|10:10am
|CFDE
|Integrating Biocuration and Data Standardization to Generate Machine Learning-Ready Glycan Datasets
|Ana Jaramillo and Yuxin Zou (20 min)
|-
|10:30am
|CFDE
|
|Campbell Ross (15 min)
|-
|10:45am
|CFDE
|A Graph-Based AI Workflow for Mining Glycan Biomarkers and Related Annotations from Publications
|Cyrus Chun Hong Au Yeung (15 min)
|-
|11:00am
|BiomarkerKB
|
|(15 min)
|-
|11:15am
|BiomarkerKB
|
|(15 min)
|-
|11:30am
|BiomarkerKB
|
|(15 min)
|-
|'''11:45am'''
| colspan="2" |'''Open Q and A'''
|'''All (30 min)'''
|-
|12:30pm
| colspan="3" | '''LUNCH (90 mins)'''
|-
| colspan="4" | ''Group 1 Moderator : Nathan Edwards''
|-
|2:00pm
|PredictMod AI-READI
|Robust Classification of Glycemic Health States from Continuous Glucose
|Nikhil Arethiya (15 min)
|-
|2:15pm
|PredictMod Curation
|PredictMod: PubMed Curation for Training an LLM for Recommendation
|Grace Chong, Aaron Ressom, Diya Kamalabharathy (15 min)
|-
|2:30pm
|Argos
|
|(15 min)
|-
|2:45pm
|GlyGen
|GlyGen Biocuration Project
|(20 min)
|-
|3:05pm
|GlycoSiteMineros
|
|(15 min)
|-
|3:20pm
|Glycobiology Web Development
|A Resource Drill Down and Visualization for the Glyspace Alliance
|Diya Kamalabharathy (5 min)
|-
|'''3:25pm'''
| colspan="2" |'''Open Q and A'''
|'''All (20 min)'''
|-
|3:45pm
| colspan="2" | '''Closing Remarks'''
|'''Raja Mazumder'''
|}

== '''Project Description''' ==

=== GlyGen Project ===
The GlyGen Biocuration project focuses on integrating legacy, yet valuable, data from the CarbBank and CFG databases into the GlyGen infrastructure. A key challenge is mapping metadata, such as species names and publication references, to standardized dictionaries and ontologies. While most entries have been automatically matched using custom scripts, remaining inconsistencies, including outdated, misspelled, or abbreviated terms, require manual curation using resources such as Google, PubMed, and domain-specific dictionaries and ontologies.

Volunteership 2025

2025-07-15T14:13:51Z

Urnisha.bhuiyan: /* Volunteers */ Updated Aise's LinkedIn Link

<h2>2025 Volunteer Program Details</h2>

<h3>Dates</h3>
<strong>Volunteer Zoom Kick-Off Meeting</strong><br>
May 27, 2025 | 3:30 to 4:30 PM

<strong>Program Dates: June 2nd, 2025 – July 25th, 2025</strong> (8 weeks)<br>
Monday to Friday | Remote | No breaks

<hr>

<h3>Volunteer Expectations</h3>
<ol>
<li>Daily progress updates via Slack (scrum).</li>
<li>Regular Zoom meetings with the assigned project point of contact.</li>
<li>Expected to dedicate 5–6 hours per day to project work, with the remaining time focused on skill development or reading.</li>
</ol>
<p style="color: red;"><strong>Important:</strong> If the scrum is not updated for 2 consecutive days, the candidate will be <u>automatically dropped</u> from the program.</p>
<hr>

<h3>Potential Projects</h3>
<ol>
<li>BiomarkerKB ([https://biomarkerkb.org biomarkerkb.org]) project: Biomarker curation project. Involves reading papers and collecting biomarkers.</li>
<li>GlyGen ([https://glygen.org glygen.org]) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.</li>
<li>ARGOS ([https://argosdb.org argosdb.org]) project: Analyze genomics data using HIVE to identify reference genome assemblies.</li>
<li>PredictMod ([https://hivelab.biochemistry.gwu.edu/predictmod hivelab.biochemistry.gwu.edu/predictmod]) project: Identify datasets and harmonize them so that they can be used to generate ML models.</li>
</ol>
''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
<hr>

<h4>1. BiomarkerKB Biocuration Project Ideas</h4>
POC: Daniall Masood, Maria Kim

# Curate biomarkers for a specific disease (Alzheimer's)
## The student would be doing manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate biomarkers. The data supplied is not in the biomarker data model format.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.

If you have other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss them and check whether they are feasible as a summer project.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, is challenging because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the standard dictionaries and ontologies used in modern resources. Automated matching of these terms with dictionaries or ontologies is often not possible because of synonyms, spelling errors, and abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.
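
The full-text subproject could begin with a simple whole-word count of resource names. The Python sketch below is a minimal illustration that assumes the full text is already available as a plain string; the function name is hypothetical, and the case-insensitive whole-word matching rule is one reasonable choice among several.

```python
import re
from collections import Counter


def count_resource_mentions(text, resource_names):
    """Count case-insensitive whole-word mentions of each resource name."""
    counts = Counter()
    for name in resource_names:
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        counts[name] = len(pattern.findall(text))
    return counts
```

Passing names such as `["GlyGen", "CarbBank", "CFG"]` returns one count per resource, including zero for names that never appear, which makes the results easy to tabulate across papers.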

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer

Data Identification & Curation:

# Identify publicly-available datasets from scientific literature that can be used for intervention outcome prediction models.
# Curate indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.

Modeling & Integration (for those with experience in programming/ML)

# Conduct data harmonization and pre-processing following established project pipelines to produce an ML-ready dataset and data dictionary.
# Perform model training and document the ML pipeline in a BioCompute Object (BCO).
# Integrate the model into the PredictMod platform.

Individuals with a background or interest in machine learning should reach out to lorikrammer@gwu.edu with a potential dataset to determine if it is a feasible project for the summer.

==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====

POC: Christie Woodside, Jonathon Keeney

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires close attention to detail (about 1 week of work).
## Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would manually curate circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports on what was found. ~4–10 weeks' worth of work.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.
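
When assembly IDs are collected by hand, a quick format check before they go into a data table can catch typos early. The sketch below assumes the IDs are NCBI-style assembly accessions (<code>GCA_</code> or <code>GCF_</code>, nine digits, a dot, and a version number). The function name is hypothetical and this is not one of the ARGOS project scripts.

```python
import re

# Assumed accession shape: GCA_/GCF_ + 9 digits + "." + version.
ASSEMBLY_RE = re.compile(r"^GC[AF]_\d{9}\.\d+$")


def split_valid_assembly_ids(ids):
    """Return (valid, invalid) lists of accessions after trimming whitespace."""
    valid, invalid = [], []
    for raw in ids:
        acc = raw.strip()
        if ASSEMBLY_RE.match(acc):
            valid.append(acc)
        else:
            invalid.append(acc)
    return valid, invalid


valid, invalid = split_valid_assembly_ids(
    [" GCA_000001405.15", "GCF_123.1", "GCF_000005845.2"]
)
print("valid:", valid)
print("needs review:", invalid)
```

The check only confirms the shape of an ID; whether the accession actually exists still has to be verified against the source database.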

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check whether it is feasible as a summer project.<hr>
<h3>Requirements for Completion</h3>
<p><strong>Note:</strong> The following are <u>mandatory</u>. Failure to complete any will result in an incomplete volunteer record.</p>

<h4>Documentation</h4>
<p>All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.</p>

<h4>Written Report</h4>
<p>Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.</p>

<h4>Presentation & Slide Submission</h4>
<p>Present your work during the last week of the 8-week period.</p>
<p>Slides must be submitted to the Admin Team and should include:</p>
<ul>
<li>A title slide with your name, date, and mentor</li>
<li>At least 3 content slides</li>
<li>A final slide with acknowledgements or references</li>
</ul>
Contact the Admin Team to access previously submitted slides.
<hr>

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
<hr>
=== Contact ===
mazumder_lab@gwu.edu.
<hr>
=== Volunteers ===
{| class="wikitable"
|+
|-
! Name
!Project
!Projects Interested
|-
| [https://www.linkedin.com/in/gracesjchong/ Grace Chong]
|PredictMod
|
# PredictMod
# BiomarkerKB Biocuration
# GlyGen Biocuration
|-
|[https://www.linkedin.com/in/alma-ogunsina-4959072b1/ Alma Ogunsina]
|Biomarker curation
|
# BiomarkerKB
# ARGOS
# PredictMod
|-
|[https://www.linkedin.com/in/diya-kamalabharathy-62557935a/ Diya Kamalabharathy]
|PredictMod
|
# BiomarkerKB Biocuration
# PredictMod Machine Learning
# GlyGen Biocuration
|-
|[https://www.linkedin.com/in/harivinay-prasad-reddy-gujjula-a06ba71bb/ Harivinay P. Gujjula]
|GlyGen curation
|
# GlyGen Biocuration
# BiomarkerKB Biocuration
|-
|[https://www.linkedin.com/in/miao-wang-88b602290/ Miao Wang]
|ARGOS
|
# BiomarkerKB Biocuration Project Ideas
# FDA-ARGOS Computation and Pathogen Curation Project
# PredictMod Machine Learning Project Ideas
|-
|[https://www.linkedin.com/in/nahom-gebreselassie-1545ab336/ Nahom Abel]
|GlyGen curation
|
# BiomarkerKB Biocuration
# GlyGen Biocuration
# PredictMod
|-
|[https://www.linkedin.com/in/kajal-patel-cs/ Kajal Sanjaykumar Patel]
|GlyGen and PubMed project
|
#PredictMod
#BiomarkerKB
#GlyGen
|-
|[https://www.linkedin.com/in/john-mccaffrey-b8850930a/ John McCaffrey]
|Biomarker curation
|
# PredictMod
# BiomarkerKB
# GlyGen Biocuration
|-
|[https://www.linkedin.com/in/nathan-ressom/ Nathan Ressom]
|BiomarkerKB
|
# PredictMod
# GlyGen Biocuration
# BiomarkerKB Biocuration
|-
|[https://www.linkedin.com/in/aaron-ressom/ Aaron Ressom]
|PredictMod
|
# BiomarkerKB
# PredictMod
# GlyGen
|-
|[https://www.linkedin.com/in/akale-kinfe/ Akale Kinfe]
|Biomarker curation
|
# BiomarkerKB Biocuration
# GlyGen Biocuration
# ARGOS
|-
|[https://www.linkedin.com/in/aise-arpinar-a8bb9b373/?original_referer= Aise Arpinar]
|GlyGen curation
|
# GlyGen Biocuration
# BiomarkerKB Biocuration
# GlyGen Publication Analysis
|-
|[https://www.linkedin.com/in/piyush-pandey-906b582b5/ Piyush Pandey]
|Biomarker curation
|
# BiomarkerKB Biocuration
# PredictMod
# GlyGen Biocuration
|-
|[http://www.linkedin.com/in/filmawit-zeru-203272363 Filmawit Zeru]
|GlycoSiteMiner project
|
# BiomarkerKB
# GlyGen
# ARGOS
|-
|[https://www.linkedin.com/in/mathias-belay-03b51a2a3/ Mathias Belay]
|Biomarker curation
|
# GlyGen
# PredictMod
# BiomarkerKB
|-
|[https://www.linkedin.com/in/isaac-kim-b644bb231/ Isaac Kim]
|Biomarker curation
|
# BiomarkerKB
# PredictMod
# GlyGen
|-
|Sohana Bahl
|Biomarker curation
|
# BiomarkerKB
|-
|[https://www.linkedin.com/in/ana-vohralikova-794a4433a?utm_source=share&utm_campaign=share_via&utm_content=profile&utm_medium=ios_app Ana Vohralikova]
|Biomarker curation
|
# BiomarkerKB Biocuration Project
# GlyGen Biocuration Project
# FDA-ARGOS Computation and Pathogen Curation Project
|}

Volunteership 2025

2025-05-16T14:32:12Z

Urnisha.bhuiyan:

<h2>2025 Volunteer Program Details</h2>

<h3>Dates</h3>
<strong>Volunteer Zoom Kick-Off Meeting</strong><br>
May 27, 2025 | 3:30 to 4:30 PM

<strong>Program Dates: June 2nd, 2025 – July 25th, 2025</strong> (8 weeks)<br>
Monday to Friday | Remote | No breaks

<hr>

<h3>Volunteer Expectations</h3>
<ol>
<li>Daily progress updates via Slack (scrum).</li>
<li>Regular Zoom meetings with the assigned project point of contact.</li>
<li>Expected to dedicate 5–6 hours per day to project work, with the remaining time focused on skill development or reading.</li>
</ol>
<p style="color: red;"><strong>Important:</strong> If the scrum is not updated for 2 consecutive days, the candidate will be <u>automatically dropped</u> from the program.</p>
<hr>

<h3>Potential Projects</h3>
<ol>
<li>BiomarkerKB ([https://biomarkerkb.org biomarkerkb.org]) project: Biomarker curation project. Involves reading papers and collecting biomarkers.</li>
<li>GlyGen ([https://glygen.org glygen.org]) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.</li>
<li>ARGOS ([https://argosdb.org argosdb.org]) project: Analyze genomics data using HIVE to identify reference genome assemblies.</li>
<li>PredictMod ([https://hivelab.biochemistry.gwu.edu/predictmod hivelab.biochemistry.gwu.edu/predictmod]) project: Identify datasets and harmonize them so that they can be used to generate ML models.</li>
</ol>
''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
<hr>

<h4>1. BiomarkerKB Biocuration Project Ideas</h4>
POC: Daniall Masood, Maria Kim
# Curate biomarkers for a specific disease (Alzheimer's disease)
## The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate the biomarkers; the data provided is not in the biomarker data model format.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check whether it is feasible as a summer project.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."
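
The species example can be pictured as a small lookup table that a curator builds up over time. The sketch below is only an illustration: the table contents and the function name <code>normalize_species</code> are assumptions, and a term that is not in the table is returned as <code>None</code> so that it is flagged for manual curation rather than guessed.

```python
# Illustrative synonym table; a real one would be curated by hand.
SPECIES_SYNONYMS = {
    "human": "Homo sapiens",
    "man": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
}


def normalize_species(term, synonyms=SPECIES_SYNONYMS):
    """Map a free-text species term to a standard name, or None if unknown."""
    # Lowercase and collapse whitespace so "  Human " and "H.  sapiens" match.
    key = " ".join(term.lower().split())
    return synonyms.get(key)


for raw in ["Human", "H.  sapiens", "mouse"]:
    print(repr(raw), "->", normalize_species(raw))
```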

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.
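
The keyword-filtering step could start from a sketch like the one below, which builds a search URL for the NCBI E-utilities <code>esearch</code> endpoint and reads PubMed IDs out of its JSON response. The helper names, the choice of the <code>[Title/Abstract]</code> field tag, and the <code>retmax</code> default are assumptions for illustration. The network call itself is left out; the URL could be fetched with, for example, <code>urllib.request.urlopen</code>.

```python
import json
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(keywords, retmax=20):
    """Build an esearch URL that requires every keyword in title or abstract."""
    term = " AND ".join(f"{kw}[Title/Abstract]" for kw in keywords)
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_pmids(response_text):
    """Extract the list of PubMed IDs from an esearch JSON response body."""
    data = json.loads(response_text)
    return data["esearchresult"]["idlist"]


print(build_esearch_url(["GlyGen", "glycan"]))
```

Keeping URL construction and response parsing in separate functions makes both easy to test without network access.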

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with a GlyGen project member, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer

Data Identification & Harmonization:

# Identify publicly-available datasets from scientific literature that can be used for intervention outcome prediction models.
# Conduct data harmonization and pre-processing following established project pipelines to produce an ML-ready dataset and data dictionary.
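
To make the idea of a data dictionary concrete, the sketch below summarizes each column of a small dataset by inferred type, number of missing values, and one example value. It is an assumption-laden illustration, not the established PredictMod pipeline; the function names and the two-way numeric/text typing are choices made for this example.

```python
def _infer_type(value):
    """Classify a single non-empty value as 'numeric' or 'text'."""
    try:
        float(value)
        return "numeric"
    except (TypeError, ValueError):
        return "text"


def build_data_dictionary(rows):
    """Summarize columns of a list of dict rows (e.g., from csv.DictReader)."""
    dictionary = {}
    for row in rows:
        for name, value in row.items():
            entry = dictionary.setdefault(
                name, {"type": None, "missing": 0, "example": None}
            )
            if value is None or str(value).strip() == "":
                entry["missing"] += 1
                continue
            kind = _infer_type(value)
            if entry["type"] is None:
                entry["type"] = kind
                entry["example"] = value
            elif entry["type"] != kind:
                # Mixed numeric and text values: fall back to text.
                entry["type"] = "text"
    return dictionary


rows = [{"age": "54", "response": "yes"}, {"age": "", "response": "no"}]
for column, info in build_data_dictionary(rows).items():
    print(column, info)
```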

Modeling & Integration (for those with experience in programming/ML)

# Perform model training and document ML pipeline in a BioCompute Object (BCO).
# Integrate model into PredictMod platform.

Individuals with a background or interest in machine learning should reach out to lorikrammer@gwu.edu with a potential dataset to determine if it is a feasible project for the summer.

==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====

POC: Christie Woodside, Jonathon Keeney

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires close attention to detail. ~1 week's worth of work.
## Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would manually curate circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports on what was found. ~4–10 weeks' worth of work.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check whether it is feasible as a summer project.<hr>
<h3>Requirements for Completion</h3>
<p><strong>Note:</strong> The following are <u>mandatory</u>. Failure to complete any will result in an incomplete volunteer record.</p>

<h4>Documentation</h4>
<p>All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.</p>

<h4>Written Report</h4>
<p>Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.</p>

<h4>Presentation & Slide Submission</h4>
<p>Present your work during the last week of the 8-week period.</p>
<p>Slides must be submitted to the Admin Team and should include:</p>
<ul>
<li>A title slide with your name, date, and mentor</li>
<li>At least 3 content slides</li>
<li>A final slide with acknowledgements or references</li>
</ul>
Contact the Admin Team to access previously submitted slides.
<hr>

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
<hr>
=== Contact ===
mazumder_lab@gwu.edu.
<hr>
=== Volunteers ===
{| class="wikitable"
|+
|-
! Name
!Project
!Projects Interested
|-
| [https://www.linkedin.com/in/gracesjchong/ Grace Chong]
|PredictMod (confirmed)
|
# PredictMod
# BiomarkerKB Biocuration
# GlyGen Biocuration
|-
|[https://www.linkedin.com/in/alma-ogunsina-4959072b1/ Alma Ogunsina]
|Biomarker curation
|
# BiomarkerKB
# ARGOS
# PredictMod
|-
|[https://www.linkedin.com/in/diya-kamalabharathy-62557935a/ Diya Kamalabharathy]
|PredictMod (confirmed)
|
# BiomarkerKB Biocuration
# PredictMod Machine Learning
# GlyGen Biocuration
|-
|[https://www.linkedin.com/in/harivinay-prasad-reddy-gujjula-a06ba71bb/ Harivinay P. Gujjula]
|GlyGen curation
|
# GlyGen Biocuration
# BiomarkerKB Biocuration
|-
|[https://www.linkedin.com/in/miao-wang-88b602290/ Miao Wang]
|ARGOS
|
# BiomarkerKB Biocuration Project Ideas
# FDA-ARGOS Computation and Pathogen Curation Project
# PredictMod Machine Learning Project Ideas
|-
|[https://www.linkedin.com/in/nahom-gebreselassie-1545ab336/ Nahom Abel]
|GlyGen curation
|
# BiomarkerKB Biocuration
# GlyGen Biocuration
# PredictMod
|-
|[https://www.linkedin.com/in/kajal-patel-cs/ Kajal Sanjaykumar Patel]
|GlyGen and PubMed project
|
#PredictMod
#BiomarkerKB
#GlyGen
|-
|[https://www.linkedin.com/in/john-mccaffrey-b8850930a/ John McCaffrey]
|Biomarker curation
|
# PredictMod
# BiomarkerKB
# GlyGen Biocuration
|-
|[https://www.linkedin.com/in/nathan-ressom/ Nathan Ressom]
|ARGOS
|
# PredictMod
# GlyGen Biocuration
# BiomarkerKB Biocuration
|-
|[https://www.linkedin.com/in/aaron-ressom/ Aaron Ressom]
|PredictMod (invited)
|
# BiomarkerKB
# PredictMod
# GlyGen
|-
|[https://www.linkedin.com/in/akale-kinfe/ Akale Kinfe]
|Biomarker curation
|
# BiomarkerKB Biocuration
# GlyGen Biocuration
# ARGOS
|-
|Aise Arpinar
|GlyGen curation
|
# GlyGen Biocuration
# BiomarkerKB Biocuration
# GlyGen Publication Analysis
|-
|[https://www.linkedin.com/in/piyush-pandey-906b582b5/ Piyush Pandey]
|Biomarker curation
|
# BiomarkerKB Biocuration
# PredictMod
# GlyGen Biocuration
|-
|[http://www.linkedin.com/in/filmawit-zeru-203272363 Filmawit Zeru]
|GlycoSiteMiner project
|
# BiomarkerKB
# GlyGen
# ARGOS
|-
|[https://www.linkedin.com/in/mathias-belay-03b51a2a3/ Mathias Belay]
|Biomarker curation
|
# GlyGen
# PredictMod
# BiomarkerKB
|-
|Gladys Ndalama
|PredictMod (confirmed)
|
# PredictMod
# GlyGen Biocuration
# BiomarkerKB
|-
|Isaac Kim
|Biomarker curation
|
# BiomarkerKB
# PredictMod
# GlyGen
|}

Volunteership 2025

2025-05-12T13:09:00Z

Urnisha.bhuiyan: /* Volunteers */

<h2>2025 Volunteer Program Details</h2>

<h3>Dates</h3>
<strong>Volunteer Zoom Kick-Off Meeting</strong><br>
May 26, 2025 | 3:30 to 4:30 PM

<strong>Program Dates: June 2nd, 2025 – July 25th, 2025</strong> (8 weeks)<br>
Monday to Friday | Remote | No breaks

<hr>

<h3>Volunteer Expectations</h3>
<ol>
<li>Daily progress updates via Slack (scrum).</li>
<li>Regular Zoom meetings with the assigned project point of contact.</li>
<li>Expected to dedicate 5–6 hours per day to project work, with the remaining time focused on skill development or reading.</li>
</ol>
<p style="color: red;"><strong>Important:</strong> If the scrum is not updated for 2 consecutive days, the candidate will be <u>automatically dropped</u> from the program.</p>
<hr>

<h3>Potential Projects</h3>
<ol>
<li>BiomarkerKB ([https://biomarkerkb.org biomarkerkb.org]) project: Biomarker curation project. Involves reading papers and collecting biomarkers.</li>
<li>GlyGen ([https://glygen.org glygen.org]) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.</li>
<li>ARGOS ([https://argosdb.org argosdb.org]) project: Analyze genomics data using HIVE to identify reference genome assemblies.</li>
<li>PredictMod ([https://hivelab.biochemistry.gwu.edu/predictmod hivelab.biochemistry.gwu.edu/predictmod]) project: Identify datasets and harmonize them so that they can be used to generate ML models.</li>
</ol>
''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
<hr>

<h4>1. BiomarkerKB Biocuration Project Ideas</h4>
POC: Daniall Masood, Maria Kim
# Curate biomarkers for a specific disease (Alzheimer's disease)
## The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate the biomarkers; the data provided is not in the biomarker data model format.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check whether it is feasible as a summer project.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.
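
For the step of finding papers whose recorded titles may contain spelling errors, approximate string matching is one possible aid. The sketch below uses Python's standard <code>difflib</code> module to pick the closest candidate title. The function names and the 0.85 similarity threshold are assumptions for illustration, and any match should still be confirmed by a curator.

```python
from difflib import SequenceMatcher


def _normalize(title):
    """Lowercase and collapse whitespace before comparing titles."""
    return " ".join(title.lower().split())


def title_similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two titles."""
    return SequenceMatcher(None, _normalize(a), _normalize(b)).ratio()


def best_title_match(query, candidates, threshold=0.85):
    """Return the most similar candidate title, or None if below threshold."""
    scored = [(title_similarity(query, c), c) for c in candidates]
    score, title = max(scored, default=(0.0, None))
    return title if score >= threshold else None


candidates = ["Structure of a glycan from yeast", "Unrelated paper on proteins"]
print(best_title_match("Strucure of a glycan from yeast", candidates))
```

A lower threshold finds more candidates but also more false matches, so the value is worth tuning against a handful of titles that have already been resolved by hand.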

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with a GlyGen project member, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer

Data Identification & Harmonization:

# Identify publicly-available datasets from scientific literature that can be used for intervention outcome prediction models.
# Conduct data harmonization and pre-processing following established project pipelines to produce an ML-ready dataset and data dictionary.

Modeling & Integration (for those with experience in programming/ML)

# Perform model training and document ML pipeline in a BioCompute Object (BCO).
# Integrate model into PredictMod platform.

Individuals with a background or interest in machine learning should reach out to lorikrammer@gwu.edu with a potential dataset to determine if it is a feasible project for the summer.

==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====

POC: Christie Woodside

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires close attention to detail. ~1 week's worth of work.
## Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would manually curate circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports on what was found. ~4–10 weeks' worth of work.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check whether it is feasible as a summer project.<hr>
<h3>Requirements for Completion</h3>
<p><strong>Note:</strong> The following are <u>mandatory</u>. Failure to complete any will result in an incomplete volunteer record.</p>

<h4>Documentation</h4>
<p>All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.</p>

<h4>Written Report</h4>
<p>Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.</p>

<h4>Presentation & Slide Submission</h4>
<p>Present your work during the last week of the 8-week period.</p>
<p>Slides must be submitted to the Admin Team and should include:</p>
<ul>
<li>A title slide with your name, date, and mentor</li>
<li>At least 3 content slides</li>
<li>A final slide with acknowledgements or references</li>
</ul>
Contact the Admin Team to access previously submitted slides.
<hr>

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
<hr>
=== Contact ===
mazumder_lab@gwu.edu.
<hr>
=== Volunteers ===
{| class="wikitable"
|+
|-
! Name
!Project
!Projects Interested
|-
| [https://www.linkedin.com/in/gracesjchong/ Grace Chong]
|PredictMod
|
# PredictMod
# BiomarkerKB Biocuration
# GlyGen Biocuration
|-
|[https://www.linkedin.com/in/alma-ogunsina-4959072b1/ Alma Ogunsina]
|Biomarker curation
|
# BiomarkerKB
# ARGOS
# PredictMod
|-
|[https://www.linkedin.com/in/diya-kamalabharathy-62557935a/ Diya Kamalabharathy]
|PredictMod
|
# BiomarkerKB Biocuration
# PredictMod Machine Learning
# GlyGen Biocuration
|-
|[https://www.linkedin.com/in/harivinay-prasad-reddy-gujjula-a06ba71bb/ Harivinay P. Gujjula]
|GlyGen curation
|
# GlyGen Biocuration
# BiomarkerKB Biocuration
|-
|[https://www.linkedin.com/in/miao-wang-88b602290/ Miao Wang]
|ARGOS
|
# BiomarkerKB Biocuration Project Ideas
# FDA-ARGOS Computation and Pathogen Curation Project
# PredictMod Machine Learning Project Ideas
|-
|[https://www.linkedin.com/in/nahom-gebreselassie-1545ab336/ Nahom Abel]
|GlyGen curation
|
# BiomarkerKB Biocuration
# GlyGen Biocuration
# PredictMod
|-
|[https://www.linkedin.com/in/kajal-patel-cs/ Kajal Sanjaykumar Patel]
|GlyGen and PubMed project
|
#PredictMod
#BiomarkerKB
#GlyGen
|-
|[https://www.linkedin.com/in/john-mccaffrey-b8850930a/ John McCaffrey]
|Biomarker curation
|
# PredictMod
# BiomarkerKB
# GlyGen Biocuration
|-
|[https://www.linkedin.com/in/nathan-ressom/ Nathan Ressom]
|ARGOS
|
# PredictMod
# GlyGen Biocuration
# BiomarkerKB Biocuration
|-
|[https://www.linkedin.com/in/aaron-ressom/ Aaron Ressom]
|PredictMod
|
# BiomarkerKB
# PredictMod
# GlyGen
|-
|[https://www.linkedin.com/in/akale-kinfe/ Akale Kinfe]
|
|
# BiomarkerKB Biocuration
# GlyGen Biocuration
# ARGOS
|-
|Aise Arpinar
|GlyGen curation
|
# GlyGen Biocuration
# BiomarkerKB Biocuration
# GlyGen Publication Analysis
|-
|[https://www.linkedin.com/in/piyush-pandey-906b582b5/ Piyush Pandey]
|
|
# BiomarkerKB Biocuration
# PredictMod
# GlyGen Biocuration
|-
|[http://www.linkedin.com/in/filmawit-zeru-203272363 Filmawit Zeru]
|GlycoSiteMiner project
|
# BiomarkerKB
# GlyGen
# ARGOS
|-
|[https://www.linkedin.com/in/mathias-belay-03b51a2a3/ Mathias Belay]
|
|
# GlyGen
# PredictMod
# BiomarkerKB
|}

Volunteership 2025

2025-04-11T16:13:48Z

Urnisha.bhuiyan: /* 2. GlyGen Biocuration Project Ideas */

<h2>2025 Volunteer Program Details</h2>

<h3>Dates</h3>
<p><strong>June 2nd, 2025 – July 25th, 2025</strong> (8 weeks)<br>
Monday to Friday | Remote | No breaks</p>

<hr>

<h3>Volunteer Expectations</h3>
<ol>
<li>Daily progress updates via Slack (scrum).</li>
<li>Regular Zoom meetings with the assigned project point of contact.</li>
<li>Expected to dedicate 5–6 hours per day to project work, with the remaining time focused on skill development or reading.</li>
</ol>
<p style="color: red;"><strong>Important:</strong> If the scrum is not updated for 2 consecutive days, the candidate will be <u>automatically dropped</u> from the program.</p>
<hr>

<h3>Potential Projects</h3>
<ol>
<li>BiomarkerKB ([https://biomarkerkb.org biomarkerkb.org]) project: Biomarker curation project. Involves reading papers and collecting biomarkers.</li>
<li>GlyGen ([https://glygen.org glygen.org]) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.</li>
<li>ARGOS ([https://argosdb.org argosdb.org]) project: Analyze genomics data using HIVE to identify reference genome assemblies.</li>
<li>PredictMod ([https://hivelab.biochemistry.gwu.edu/predictmod hivelab.biochemistry.gwu.edu/predictmod]) project: Identify datasets and harmonize them so that they can be used to generate ML models.</li>
</ol>
''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
<hr>

<h4>1. BiomarkerKB Biocuration Project Ideas</h4>
POC: Daniall Masood
# Curate biomarkers for a specific disease (Alzheimer's disease)
## The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate the biomarkers; the data provided is not in the biomarker data model format.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check whether it is feasible as a summer project.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.
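
The term-mapping step can be illustrated with a minimal Python sketch. It is not the GlyGen curation workflow: the synonym table, the function name <code>map_term</code>, and the 0.8 similarity cutoff are assumptions made for illustration, and any automated suggestion would still need review by a curator.

```python
import difflib

# Illustrative synonym table (an assumption for this sketch); keys are lowercase.
SYNONYMS = {
    "human": "Homo sapiens",
    "man": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
    "homo sapiens": "Homo sapiens",
}


def map_term(raw_term, cutoff=0.8):
    """Suggest a standard name for a free-text term, or None if nothing fits."""
    key = raw_term.strip().lower()
    # Exact match against the known synonyms first.
    if key in SYNONYMS:
        return SYNONYMS[key]
    # Fall back to fuzzy matching to catch simple spelling errors.
    close = difflib.get_close_matches(key, list(SYNONYMS), n=1, cutoff=cutoff)
    return SYNONYMS[close[0]] if close else None


if __name__ == "__main__":
    for term in ["Human", "homo sapeins", "zebrafish"]:
        print(term, "->", map_term(term))
```

Terms that return <code>None</code> are the ones a curator would look up by hand and, where appropriate, add to the synonym table.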

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

'''3. GlyGen Publication Analysis Project Ideas'''

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
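
As a starting point for the first step, the following Python sketch builds a keyword query for the NCBI E-utilities ESearch endpoint and reads the PubMed IDs from its JSON response. The helper names (<code>build_search_url</code>, <code>parse_pmids</code>), the restriction to title/abstract, and the example keyword are assumptions for illustration; the project may choose a different query strategy.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities ESearch endpoint for querying PubMed.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(keywords, max_results=20):
    """Build an ESearch URL matching any keyword in the title or abstract."""
    term = " OR ".join(f'"{k}"[Title/Abstract]' for k in keywords)
    params = {
        "db": "pubmed",
        "term": term,
        "retmode": "json",
        "retmax": max_results,
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_pmids(response_text):
    """Extract the list of PubMed IDs from an ESearch JSON response."""
    data = json.loads(response_text)
    return data["esearchresult"]["idlist"]


if __name__ == "__main__":
    # Example keyword is illustrative; this call requires network access.
    url = build_search_url(["GlyGen"], max_results=5)
    with urlopen(url) as response:
        print(parse_pmids(response.read().decode("utf-8")))
```

The returned IDs could then be passed to a follow-up request for abstracts and author affiliations, which is where the institution and co-author filtering steps would begin.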

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer

Data Identification & Harmonization:

# Identify publicly-available datasets from scientific literature that can be used for intervention outcome prediction models.
# Conduct data harmonization and pre-processing following established project pipelines to produce an ML-ready dataset and data dictionary.

Modeling & Integration (for those with experience in programming/ML)

# Perform model training and document ML pipeline in a BioCompute Object (BCO).
# Integrate model into PredictMod platform.

Individuals with a background or interest in machine learning should reach out to lorikrammer@gwu.edu with a potential dataset to determine if it is a feasible project for the summer.

'''5. FDA-ARGOS Computation and Pathogen Curation Project'''

POC: Christie Woodside

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual work that is very important and requires high attention to detail. ~1 week's worth of work.
## Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would work on manual curation of circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports of what was found. ~4-10 weeks' worth of work.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check whether it is feasible as a project for the summer.<hr>
<h3>Requirements for Completion</h3>
<p><strong>Note:</strong> The following are <u>mandatory</u>. Failure to complete any will result in an incomplete volunteer record.</p>

<h4>Documentation</h4>
<p>All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.</p>

<h4>Written Report</h4>
<p>Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.</p>

<h4>Presentation & Slide Submission</h4>
<p>Present your work during the last week of the 8-week period.</p>
<p>Slides must be submitted to the Admin Team and should include:</p>
<ul>
<li>A title slide with your name, date, and mentor</li>
<li>At least 3 content slides</li>
<li>A final slide with acknowledgements or references</li>
</ul>
Contact the Admin Team to access previously submitted slides.
<hr>

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
<hr>
=== Contact ===
mazumder_lab@gwu.edu.
<hr>
=== Volunteers ===
{| class="wikitable"
|+
|-
! Name
! Skills
! Projects of Interest
|-
| Grace Chong
| Python, Machine Learning, NLP, Analysis & Mathematics
|
# PredictMod
# BiomarkerKB Biocuration
# GlyGen Biocuration
|-
|Alma Ogunsina
|Molecular Biology, Python, ML, and Data Analysis
|
# BiomarkerKB
# ARGOS
# PredictMod
|-
|Diya Kamalabharathy
|Computational Biology, Python Programming, Molecular Biology Techniques, Scientific Writing, Data Analysis
|
# BiomarkerKB Biocuration
# PredictMod Machine Learning
# GlyGen Biocuration
|}

Volunteership 2025

2025-04-11T15:59:13Z

Urnisha.bhuiyan: /* 2. GlyGen Biocuration Project Ideas */

<h2>2025 Volunteer Program Details</h2>

<h3>Dates</h3>
<p><strong>June 2nd, 2025 – July 25th, 2025</strong> (8 weeks)<br>
Monday to Friday | Remote | No breaks</p>

<hr>

<h3>Volunteer Expectations</h3>
<ol>
<li>Daily progress updates via Slack (scrum).</li>
<li>Regular Zoom meetings with the assigned project point of contact.</li><li>Expected to dedicate 5–6 hours per day to project work, with the remaining time focused on skill development or reading. </li>
</ol>
<p style="color: red;"><strong>Important:</strong> If the scrum is not updated for 2 consecutive days, the candidate will be <u>automatically dropped</u> from the program.</p>
<hr>

<h3>Potential Projects</h3>
<ol>
<li>BiomarkerKB ([https://biomarkerkb.org biomarkerkb.org]) project: Biomarker curation project. Involves reading papers and collecting biomarkers.</li>
<li>GlyGen ([https://glygen.org glygen.org]) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information. </li><li>ARGOS ([https://argosdb.org argosdb.org]) project: Analyze genomics data using HIVE to identify reference genome assemblies. </li><li>PredictMod ([https://hivelab.biochemistry.gwu.edu/predictmod hivelab.biochemistry.gwu.edu/predictmod]) project. Identifying datasets and harmonizing them so that they can be used to generate ML models. </li></ol>''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''<hr>

<h4>1. BiomarkerKB Biocuration Project Ideas</h4>POC: Daniall Masood
# Curate biomarkers for a specific disease (Alzheimer's)
## The student would be doing manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate biomarkers. The data provided is not in the biomarker data model format.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check whether it is feasible as a project for the summer.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

'''The project involves:'''

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

'''3. GlyGen Publication Analysis Project Ideas'''

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

'''The project involves:'''

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer

Data Identification & Harmonization:

# Identify publicly-available datasets from scientific literature that can be used for intervention outcome prediction models.
# Conduct data harmonization and pre-processing following established project pipelines to produce an ML-ready dataset and data dictionary.

Modeling & Integration (for those with experience in programming/ML)

# Perform model training and document ML pipeline in a BioCompute Object (BCO).
# Integrate model into PredictMod platform.

Individuals with a background or interest in machine learning should reach out to lorikrammer@gwu.edu with a potential dataset to determine if it is a feasible project for the summer.

'''5. FDA-ARGOS Computation and Pathogen Curation Project'''

POC: Christie Woodside

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual work that is very important and requires high attention to detail. ~1 week's worth of work.
## Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would work on manual curation of circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports of what was found. ~4-10 weeks' worth of work.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check whether it is feasible as a project for the summer.<hr>
<h3>Requirements for Completion</h3>
<p><strong>Note:</strong> The following are <u>mandatory</u>. Failure to complete any will result in an incomplete volunteer record.</p>

<h4>Documentation</h4>
<p>All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.</p>

<h4>Written Report</h4>
<p>Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.</p>

<h4>Presentation & Slide Submission</h4>
<p>Present your work during the last week of the 8-week period.</p>
<p>Slides must be submitted to the Admin Team and should include:</p>
<ul>
<li>A title slide with your name, date, and mentor</li>
<li>At least 3 content slides</li>
<li>A final slide with acknowledgements or references</li>
</ul>
Contact the Admin Team to access previously submitted slides.
<hr>

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
<hr>
=== Contact ===
mazumder_lab@gwu.edu.
<hr>
=== Volunteers ===
{| class="wikitable"
|+
|-
! Name
! Skills
! Projects of Interest
|-
| Grace Chong
| Python, Machine Learning, NLP, Analysis & Mathematics
|
# PredictMod
# BiomarkerKB Biocuration
# GlyGen Biocuration
|-
|Alma Ogunsina
|Molecular Biology, Python, ML, and Data Analysis
|
# BiomarkerKB
# ARGOS
# PredictMod
|-
|Diya Kamalabharathy
|Computational Biology, Python Programming, Molecular Biology Techniques, Scientific Writing, Data Analysis
|
# BiomarkerKB Biocuration
# PredictMod Machine Learning
# GlyGen Biocuration
|}

Volunteership 2025

2025-04-11T15:52:53Z

Urnisha.bhuiyan: /* 2. GlyGen Biocuration Project Ideas */

<h2>2025 Volunteer Program Details</h2>

<h3>Dates</h3>
<p><strong>June 2nd, 2025 – July 25th, 2025</strong> (8 weeks)<br>
Monday to Friday | Remote | No breaks</p>

<hr>

<h3>Volunteer Expectations</h3>
<ol>
<li>Daily progress updates via Slack (scrum).</li>
<li>Regular Zoom meetings with the assigned project point of contact.</li><li>Expected to dedicate 5–6 hours per day to project work, with the remaining time focused on skill development or reading. </li>
</ol>
<p style="color: red;"><strong>Important:</strong> If the scrum is not updated for 2 consecutive days, the candidate will be <u>automatically dropped</u> from the program.</p>
<hr>

<h3>Potential Projects</h3>
<ol>
<li>BiomarkerKB ([https://biomarkerkb.org biomarkerkb.org]) project: Biomarker curation project. Involves reading papers and collecting biomarkers.</li>
<li>GlyGen ([https://glygen.org glygen.org]) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information. </li><li>ARGOS ([https://argosdb.org argosdb.org]) project: Analyze genomics data using HIVE to identify reference genome assemblies. </li><li>PredictMod ([https://hivelab.biochemistry.gwu.edu/predictmod hivelab.biochemistry.gwu.edu/predictmod]) project. Identifying datasets and harmonizing them so that they can be used to generate ML models. </li></ol>''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''<hr>

<h4>1. BiomarkerKB Biocuration Project Ideas</h4>POC: Daniall Masood
# Curate biomarkers for a specific disease (Alzheimer's)
## The student would be doing manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate biomarkers. The data provided is not in the biomarker data model format.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check whether it is feasible as a project for the summer.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the standard dictionaries and ontologies used in modern resources.

Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, &quot;human,&quot; &quot;man,&quot; and &quot;homo sapiens&quot; all map to the scientific species name &quot;Homo sapiens.&quot; The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

'''The project involves:'''

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

'''3. GlyGen Publication Analysis Project Ideas'''

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

'''The project involves:'''

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer

Data Identification & Harmonization:

# Identify publicly-available datasets from scientific literature that can be used for intervention outcome prediction models.
# Conduct data harmonization and pre-processing following established project pipelines to produce an ML-ready dataset and data dictionary.

Modeling & Integration (for those with experience in programming/ML)

# Perform model training and document ML pipeline in a BioCompute Object (BCO).
# Integrate model into PredictMod platform.

Individuals with a background or interest in machine learning should reach out to lorikrammer@gwu.edu with a potential dataset to determine if it is a feasible project for the summer.

'''5. FDA-ARGOS Computation and Pathogen Curation Project'''

POC: Christie Woodside

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual work that is very important and requires high attention to detail. ~1 week's worth of work.
## Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would work on manual curation of circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports of what was found. ~4-10 weeks' worth of work.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check whether it is feasible as a project for the summer.<hr>
<h3>Requirements for Completion</h3>
<p><strong>Note:</strong> The following are <u>mandatory</u>. Failure to complete any will result in an incomplete volunteer record.</p>

<h4>Documentation</h4>
<p>All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.</p>

<h4>Written Report</h4>
<p>Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.</p>

<h4>Presentation & Slide Submission</h4>
<p>Present your work during the last week of the 8-week period.</p>
<p>Slides must be submitted to the Admin Team and should include:</p>
<ul>
<li>A title slide with your name, date, and mentor</li>
<li>At least 3 content slides</li>
<li>A final slide with acknowledgements or references</li>
</ul>
Contact the Admin Team to access previously submitted slides.
<hr>

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
<hr>
=== Contact ===
mazumder_lab@gwu.edu.
<hr>
=== Volunteers ===
{| class="wikitable"
|+
|-
! Name
! Skills
! Projects of Interest
|-
| Grace Chong
| Python, Machine Learning, NLP, Analysis & Mathematics
|
# PredictMod
# BiomarkerKB Biocuration
# GlyGen Biocuration
|-
|Alma Ogunsina
|Molecular Biology, Python, ML, and Data Analysis
|
# BiomarkerKB
# ARGOS
# PredictMod
|-
|Diya Kamalabharathy
|Computational Biology, Python Programming, Molecular Biology Techniques, Scientific Writing, Data Analysis
|
# BiomarkerKB Biocuration
# PredictMod Machine Learning
# GlyGen Biocuration
|}

Volunteership 2025

2025-04-11T14:37:40Z

Urnisha.bhuiyan: /* GlyGen Biocuration Project Ideas */

<h2>2025 Volunteer Program Details</h2>

<h3>Dates</h3>
<p><strong>June 2nd, 2025 – July 25th, 2025</strong> (8 weeks)<br>
Monday to Friday | Remote | No breaks</p>

<hr>

<h3>Volunteer Expectations</h3>
<ol>
<li>Daily progress updates via Slack (scrum).</li>
<li>Regular Zoom meetings with the assigned project point of contact.</li><li>Expected to dedicate 5–6 hours per day to project work, with the remaining time focused on skill development or reading. </li>
</ol>
<p style="color: red;"><strong>Important:</strong> If the scrum is not updated for 2 consecutive days, the candidate will be <u>automatically dropped</u> from the program.</p>
<hr>

<h3>Potential Projects</h3>
<ol>
<li>BiomarkerKB ([https://biomarkerkb.org biomarkerkb.org]) project: Biomarker curation project. Involves reading papers and collecting biomarkers.</li>
<li>GlyGen ([https://glygen.org glygen.org]) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information. </li><li>ARGOS ([https://argosdb.org argosdb.org]) project: Analyze genomics data using HIVE to identify reference genome assemblies. </li><li>PredictMod ([https://hivelab.biochemistry.gwu.edu/predictmod hivelab.biochemistry.gwu.edu/predictmod]) project. Identifying datasets and harmonizing them so that they can be used to generate ML models. </li></ol>''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''<hr>

<h4>BiomarkerKB Biocuration Project Ideas</h4>POC: Daniall Masood
# Curate biomarkers for a specific disease (Alzheimer's)
## The student would be doing manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate biomarkers. The data provided is not in the biomarker data model format.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check whether it is feasible as a project for the summer.

==== GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, &quot;human,&quot; &quot;man,&quot; and &quot;h. sapiens&quot; all map to the scientific species name &quot;Homo sapiens.&quot;

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

'''The project involves:'''

* Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
* Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
* Finding papers based on titles and author lists that may contain spelling errors.
* Interacting and discussing with other curators in case terms are mapped differently.

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

'''GlyGen Publication Analysis Project'''

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data.

We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

'''The project involves:'''

* Using the PubMed web API to filter publications based on keywords.
* Analyzing paper abstracts to identify research institutions and groups that form the community.
* Filtering the community list to exclude unrelated co-authors.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository.

If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer

Data Identification & Harmonization:

# Identify publicly-available datasets from scientific literature that can be used for intervention outcome prediction models.
# Conduct data harmonization and pre-processing following established project pipelines to make ML-ready dataset and data dictionary.

Modeling & Integration (for those with experience in programming/ML)

# Perform model training and document ML pipeline in a BioCompute Object (BCO).
# Integrate model into PredictMod platform.

Individuals with a background or interest in machine learning should reach out to lorikrammer@gwu.edu with a potential dataset to determine if it is a feasible project for the summer.

'''FDA-ARGOS Computation and Pathogen Curation Project'''

POC: Christie Woodside

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires high attention to detail. ~1 week of work.
## Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would work on manual curation of circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports on what was found. ~4-10 weeks of work.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check whether it is feasible as a project for the summer.<hr>
<h3>Requirements for Completion</h3>
<p><strong>Note:</strong> The following are <u>mandatory</u>. Failure to complete any will result in an incomplete volunteer record.</p>

<h4>Documentation</h4>
<p>All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.</p>

<h4>Written Report</h4>
<p>Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.</p>

<h4>Presentation & Slide Submission</h4>
<p>Present your work during the last week of the 8-week period.</p>
<p>Slides must be submitted to the Admin Team and should include:</p>
<ul>
<li>A title slide with your name, date, and mentor</li>
<li>At least 3 content slides</li>
<li>A final slide with acknowledgements or references</li>
</ul>
Contact the Admin Team to access previously submitted slides.
<hr>

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
<hr>
=== Contact ===
mazumder_lab@gwu.edu.
<hr>
=== Volunteers ===
{| class="wikitable"
|+
|-
! Name
! Skills
! Projects of Interest
|-
| Grace Chong
| Python, Machine Learning, NLP, Analysis & Mathematics
|
# PredictMod
# BiomarkerKB Biocuration
# GlyGen Biocuration
|-
|Alma Ogunsina
|Molecular Biology, Python, ML, and Data Analysis
|
# BiomarkerKB
# ARGOS
# PredictMod
|-
|Diya Kamalabharathy
|Computational Biology, Python Programming, Molecular Biology Techniques, Scientific Writing, Data Analysis
|
# BiomarkerKB Biocuration
# PredictMod Machine Learning
# GlyGen Biocuration
|}

Volunteership 2025

2025-04-11T14:32:12Z

Urnisha.bhuiyan: /* GlyGen Biocuration Project Ideas */

<h2>2025 Volunteer Program Details</h2>

<h3>Dates</h3>
<p><strong>June 2nd, 2025 – July 25th, 2025</strong> (8 weeks)<br>
Monday to Friday | Remote | No breaks</p>

<hr>

<h3>Volunteer Expectations</h3>
<ol>
<li>Daily progress updates via Slack (scrum).</li>
<li>Regular Zoom meetings with the assigned project point of contact.</li>
<li>Expected to dedicate 5–6 hours per day to project work, with the remaining time focused on skill development or reading.</li>
</ol>
<p style="color: red;"><strong>Important:</strong> If the scrum is not updated for 2 consecutive days, the candidate will be <u>automatically dropped</u> from the program.</p>
<hr>

<h3>Potential Projects</h3>
<ol>
<li>BiomarkerKB ([https://biomarkerkb.org biomarkerkb.org]) project: Curate biomarkers by reading papers and collecting biomarker data.</li>
<li>GlyGen ([https://glygen.org glygen.org]) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.</li>
<li>ARGOS ([https://argosdb.org argosdb.org]) project: Analyze genomics data using HIVE to identify reference genome assemblies.</li>
<li>PredictMod ([https://hivelab.biochemistry.gwu.edu/predictmod hivelab.biochemistry.gwu.edu/predictmod]) project: Identify datasets and harmonize them so that they can be used to generate ML models.</li>
</ol>
''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
<hr>

<h4>BiomarkerKB Biocuration Project Ideas</h4>
POC: Daniall Masood
# Curate biomarkers for a specific disease (Alzheimer's disease)
## The student would perform manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate the biomarkers; the data supplied by the NLP work does not follow the biomarker data model.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check whether it is feasible as a project for the summer.

==== GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the established standard dictionaries and ontologies used in modern resources.

Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."
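
The kind of term mapping described above can be sketched in a few lines of Python. This is only an illustration, not the GlyGen curation workflow: the synonym table, the function names <code>normalize</code> and <code>map_species</code>, and the 0.8 similarity cutoff are assumptions chosen for the example. It also shows why curators are still needed, since anything the lookup cannot resolve comes back as <code>None</code> for manual review.

```python
import difflib

# Hypothetical synonym table; a real effort would draw on a maintained
# dictionary or ontology rather than a hand-written mapping.
SPECIES_SYNONYMS = {
    "human": "Homo sapiens",
    "man": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
    "homo sapiens": "Homo sapiens",
    "mouse": "Mus musculus",
    "m. musculus": "Mus musculus",
    "mus musculus": "Mus musculus",
}


def normalize(term):
    """Lowercase the term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def map_species(term, cutoff=0.8):
    """Map a free-text species term to a standard name, or None if unresolved."""
    key = normalize(term)
    if key in SPECIES_SYNONYMS:
        return SPECIES_SYNONYMS[key]
    # Fall back to approximate matching to tolerate small spelling errors.
    close = difflib.get_close_matches(key, list(SPECIES_SYNONYMS), n=1, cutoff=cutoff)
    return SPECIES_SYNONYMS[close[0]] if close else None
```

With this table, <code>map_species("Human")</code> and <code>map_species("homo sapeins")</code> both resolve to "Homo sapiens", while an unknown term returns <code>None</code> and would be handed to a curator.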

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

'''The project involves:'''

* Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
* Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
* Finding papers based on titles and author lists that may contain spelling errors.
* Interacting and discussing with other curators in case terms are mapped differently.

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

'''GlyGen Publication Analysis Project'''

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data.

We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

'''The project involves:'''

* Using the PubMed web API to filter publications based on keywords.
* Analyzing paper abstracts to identify research institutions and groups that form the community.
* Filtering the community list to exclude unrelated co-authors.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation.

Source code developed as part of this project will be documented and shared in a public GitHub repository.

If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer

Data Identification & Harmonization:

# Identify publicly-available datasets from scientific literature that can be used for intervention outcome prediction models.
# Conduct data harmonization and pre-processing following established project pipelines to make ML-ready dataset and data dictionary.

Modeling & Integration (for those with experience in programming/ML)

# Perform model training and document ML pipeline in a BioCompute Object (BCO).
# Integrate model into PredictMod platform.

Individuals with a background or interest in machine learning should reach out to lorikrammer@gwu.edu with a potential dataset to determine if it is a feasible project for the summer.

'''FDA-ARGOS Computation and Pathogen Curation Project'''

POC: Christie Woodside

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires high attention to detail. ~1 week of work.
## Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would work on manual curation of circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports on what was found. ~4-10 weeks of work.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check whether it is feasible as a project for the summer.<hr>
<h3>Requirements for Completion</h3>
<p><strong>Note:</strong> The following are <u>mandatory</u>. Failure to complete any will result in an incomplete volunteer record.</p>

<h4>Documentation</h4>
<p>All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.</p>

<h4>Written Report</h4>
<p>Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.</p>

<h4>Presentation & Slide Submission</h4>
<p>Present your work during the last week of the 8-week period.</p>
<p>Slides must be submitted to the Admin Team and should include:</p>
<ul>
<li>A title slide with your name, date, and mentor</li>
<li>At least 3 content slides</li>
<li>A final slide with acknowledgements or references</li>
</ul>
Contact the Admin Team to access previously submitted slides.
<hr>

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
<hr>
=== Contact ===
mazumder_lab@gwu.edu.
<hr>
=== Volunteers ===
{| class="wikitable"
|+
|-
! Name
! Skills
! Projects of Interest
|-
| Grace Chong
| Python, Machine Learning, NLP, Analysis & Mathematics
|
# PredictMod
# BiomarkerKB Biocuration
# GlyGen Biocuration
|-
|Alma Ogunsina
|Molecular Biology, Python, ML, and Data Analysis
|
# BiomarkerKB
# ARGOS
# PredictMod
|-
|Diya Kamalabharathy
|Computational Biology, Python Programming, Molecular Biology Techniques, Scientific Writing, Data Analysis
|
# BiomarkerKB Biocuration
# PredictMod Machine Learning
# GlyGen Biocuration
|}

Volunteership 2025

2025-04-11T14:31:13Z

Urnisha.bhuiyan: /* GlyGen Biocuration Project Ideas */

<h2>2025 Volunteer Program Details</h2>

<h3>Dates</h3>
<p><strong>June 2nd, 2025 – July 25th, 2025</strong> (8 weeks)<br>
Monday to Friday | Remote | No breaks</p>

<hr>

<h3>Volunteer Expectations</h3>
<ol>
<li>Daily progress updates via Slack (scrum).</li>
<li>Regular Zoom meetings with the assigned project point of contact.</li>
<li>Expected to dedicate 5–6 hours per day to project work, with the remaining time focused on skill development or reading.</li>
</ol>
<p style="color: red;"><strong>Important:</strong> If the scrum is not updated for 2 consecutive days, the candidate will be <u>automatically dropped</u> from the program.</p>
<hr>

<h3>Potential Projects</h3>
<ol>
<li>BiomarkerKB ([https://biomarkerkb.org biomarkerkb.org]) project: Curate biomarkers by reading papers and collecting biomarker data.</li>
<li>GlyGen ([https://glygen.org glygen.org]) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.</li>
<li>ARGOS ([https://argosdb.org argosdb.org]) project: Analyze genomics data using HIVE to identify reference genome assemblies.</li>
<li>PredictMod ([https://hivelab.biochemistry.gwu.edu/predictmod hivelab.biochemistry.gwu.edu/predictmod]) project: Identify datasets and harmonize them so that they can be used to generate ML models.</li>
</ol>
''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
<hr>

<h4>BiomarkerKB Biocuration Project Ideas</h4>
POC: Daniall Masood
# Curate biomarkers for a specific disease (Alzheimer's disease)
## The student would perform manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate the biomarkers; the data supplied by the NLP work does not follow the biomarker data model.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check whether it is feasible as a project for the summer.

==== GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the established standard dictionaries and ontologies used in modern resources.

Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

'''The project involves:'''

* Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
* Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
* Finding papers based on titles and author lists that may contain spelling errors.
* Interacting and discussing with other curators in case terms are mapped differently.

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

'''GlyGen Publication Analysis Project'''

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data.

We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

'''The project involves:'''

* Using the PubMed web API to filter publications based on keywords.
* Analyzing paper abstracts to identify research institutions and groups that form the community.
* Filtering the community list to exclude unrelated co-authors.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation.

Source code developed as part of this project will be documented and shared in a public GitHub repository.

If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer

Data Identification & Harmonization:

# Identify publicly-available datasets from scientific literature that can be used for intervention outcome prediction models.
# Conduct data harmonization and pre-processing following established project pipelines to make ML-ready dataset and data dictionary.

Modeling & Integration (for those with experience in programming/ML)

# Perform model training and document ML pipeline in a BioCompute Object (BCO).
# Integrate model into PredictMod platform.

Individuals with a background or interest in machine learning should reach out to lorikrammer@gwu.edu with a potential dataset to determine if it is a feasible project for the summer.

'''FDA-ARGOS Computation and Pathogen Curation Project'''

POC: Christie Woodside

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires high attention to detail. ~1 week of work.
## Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would work on manual curation of circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports on what was found. ~4-10 weeks of work.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check whether it is feasible as a project for the summer.<hr>
<h3>Requirements for Completion</h3>
<p><strong>Note:</strong> The following are <u>mandatory</u>. Failure to complete any will result in an incomplete volunteer record.</p>

<h4>Documentation</h4>
<p>All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.</p>

<h4>Written Report</h4>
<p>Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.</p>

<h4>Presentation & Slide Submission</h4>
<p>Present your work during the last week of the 8-week period.</p>
<p>Slides must be submitted to the Admin Team and should include:</p>
<ul>
<li>A title slide with your name, date, and mentor</li>
<li>At least 3 content slides</li>
<li>A final slide with acknowledgements or references</li>
</ul>
Contact the Admin Team to access previously submitted slides.
<hr>

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
<hr>
=== Contact ===
mazumder_lab@gwu.edu.
<hr>
=== Volunteers ===
{| class="wikitable"
|+
|-
! Name
! Skills
! Projects of Interest
|-
| Grace Chong
| Python, Machine Learning, NLP, Analysis & Mathematics
|
# PredictMod
# BiomarkerKB Biocuration
# GlyGen Biocuration
|-
|Alma Ogunsina
|Molecular Biology, Python, ML, and Data Analysis
|
# BiomarkerKB
# ARGOS
# PredictMod
|-
|Diya Kamalabharathy
|Computational Biology, Python Programming, Molecular Biology Techniques, Scientific Writing, Data Analysis
|
# BiomarkerKB Biocuration
# PredictMod Machine Learning
# GlyGen Biocuration
|}

How to Find and Extract Machine-Usable Data from Scientific Literature

2025-03-05T15:24:54Z

Urnisha.bhuiyan: Created and provided information for ML-ready data mining/extractions

Scientific literature contains vast amounts of valuable data, but much of it is not readily machine-readable or structured for computational analysis. Researchers, data scientists, and clinicians often need to extract information from tables, figures, supplementary materials, or even full-text descriptions. However, identifying and obtaining machine-usable data efficiently requires specific strategies.

This guide provides an overview of best practices for locating structured and semi-structured data within research papers, assessing its usability, and extracting it in a format suitable for analysis. It covers key sources of machine-readable data, common challenges in extraction, and tools that can assist in the process. Whether you're working in bioinformatics, clinical research, or any data-driven scientific field, understanding how to find and extract machine-usable data can enhance reproducibility, meta-analyses, and machine learning applications.

=== Steps for Finding Machine-Usable Data in Scientific Literature ===

# '''Start with Reliable Search Engines''' Use databases like '''PubMed, Google Scholar, or Web of Science''' to locate relevant research articles. These platforms provide access to peer-reviewed studies that may contain structured datasets.
# '''Look for Articles with Readily Available Datasets''' Prioritize publications that explicitly provide datasets in '''CSV, TSV, or other structured formats'''. Given time constraints, selecting a dataset that requires minimal preprocessing is crucial.
#* Check sections such as '''Data Availability, Associated Data, or Supplementary Information''', where authors often include links to downloadable datasets.
# '''Ensure the Study Includes a Response Outcome''' For machine learning applications, the dataset must include a clear '''response outcome to a treatment or intervention'''.
#* This is typically stated explicitly in the paper.
#* Search for keywords like '''"responder" or "non-responder"''' within the text. If these terms are absent, the study likely does not contain the data needed.
# '''Consider Sample Size'''
#* While large sample sizes can be difficult to find, aim for datasets with '''at least 100 samples, patients, or observations''' whenever possible.
#* A larger dataset improves the likelihood of developing a robust and generalizable machine learning model.
# '''Prioritize ML-Ready Datasets'''
#* The ideal dataset should require '''minimal modifications''' before being used for machine learning.
#* Avoid datasets with excessive missing values, inconsistent labeling, or complex preprocessing requirements unless you have the resources to clean and structure them efficiently.
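
The screening steps above can be approximated with a short script. The sketch below is one possible way to do it using only the Python standard library, not an established project pipeline: the function names, the accepted response labels, the tokens counted as missing, and the 5% missing-value limit are all assumptions, while the 100-sample threshold comes from the guidance above.

```python
import csv
import io
import re

MIN_SAMPLES = 100  # threshold suggested in the steps above
RESPONSE_LABELS = {"r", "nr", "responder", "non-responder"}  # assumed labels
MISSING_TOKENS = {"", "na", "nan", "null"}  # assumed markers for missing values


def mentions_response_outcome(text):
    """Return True if the text mentions responders or non-responders."""
    return re.search(r"\b(?:non-?)?responders?\b", text, re.IGNORECASE) is not None


def screen_dataset(csv_text, response_column="Response", max_missing=0.05):
    """Summarize whether a CSV dataset meets the basic ML-readiness checks."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    rows = list(reader)
    # Normalize every cell so missing-value markers compare consistently.
    cells = [(row.get(col) or "").strip().lower() for row in rows for col in columns]
    missing = sum(1 for cell in cells if cell in MISSING_TOKENS)
    missing_fraction = missing / len(cells) if cells else 1.0
    labels = {(row.get(response_column) or "").strip().lower() for row in rows}
    return {
        "n_samples": len(rows),
        "enough_samples": len(rows) >= MIN_SAMPLES,
        # Require at least two distinct, recognized outcome labels.
        "has_response": response_column in columns
        and len(labels) > 1
        and labels <= RESPONSE_LABELS,
        "missing_fraction": missing_fraction,
        "low_missingness": missing_fraction <= max_missing,
    }
```

A dataset that passes all three flags is a reasonable candidate for closer inspection; a failed flag does not rule a study out, but signals how much cleaning it would need.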

=== Example of a Machine-Learning-Ready Dataset ===
A well-structured dataset for machine learning should include clearly labeled samples, a defined response variable, and numerical features that require minimal preprocessing. Below is an example of an ML-ready dataset:
{| class="wikitable"
!Sample ID
!Response
!Feature 1
!Feature 2
!Feature 3
!Feature 4
|-
|S001
|R
|0.0884
|0.0481
|0.0317
|0.0026
|-
|S002
|R
|0.2617
|0.0372
|0.0112
|0.0039
|-
|S003
|NR
|0.2280
|0.0408
|0.0449
|0.0136
|-
|S004
|NR
|0.0893
|0.0519
|0.0041
|0.0014
|-
|S005
|NR
|0.2280
|0.0408
|0.0449
|0.0136
|-
|S006
|NR
|0.2156
|0.2961
|0.0015
|0.0046
|-
|S007
|NR
|0.0241
|0.0175
|0.0030
|0.0108
|}

=== Key Features of an ML-Ready Dataset ===

* '''Unique Sample Identifiers''': Each row has a distinct '''Sample ID''', anonymized to maintain privacy.
* '''Clearly Defined Response Variable''': The '''Response''' column indicates whether a sample is a '''Responder (R)''' or '''Non-Responder (NR)'''.
* '''Numerical Features''': Feature values are already in a numeric format, making them immediately usable for machine learning models.
* '''Minimal Preprocessing Required''': The data does not require extensive cleaning or transformation, ensuring an efficient pipeline for ML training.

This structure facilitates smooth integration into predictive modeling workflows while ensuring reproducibility and interpretability.
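
To show how little work such a table needs before modeling, here is a minimal sketch that reads the first four rows of the example above from CSV text and splits them into sample IDs, a numeric feature matrix, and integer labels. The function name <code>load_ml_ready</code> and the encoding of R as 1 and NR as 0 are assumptions for illustration; any consistent encoding would do.

```python
import csv
import io

# First four rows of the example table, written as CSV text.
EXAMPLE_CSV = """Sample ID,Response,Feature 1,Feature 2,Feature 3,Feature 4
S001,R,0.0884,0.0481,0.0317,0.0026
S002,R,0.2617,0.0372,0.0112,0.0039
S003,NR,0.2280,0.0408,0.0449,0.0136
S004,NR,0.0893,0.0519,0.0041,0.0014
"""

# Assumed encoding: responders become 1, non-responders become 0.
LABEL_TO_INT = {"R": 1, "NR": 0}


def load_ml_ready(csv_text, id_column="Sample ID", response_column="Response"):
    """Split CSV text into sample IDs, feature names, feature rows, and labels."""
    reader = csv.DictReader(io.StringIO(csv_text))
    feature_names = [
        name for name in reader.fieldnames if name not in (id_column, response_column)
    ]
    ids, features, labels = [], [], []
    for row in reader:
        ids.append(row[id_column])
        features.append([float(row[name]) for name in feature_names])
        labels.append(LABEL_TO_INT[row[response_column]])
    return ids, feature_names, features, labels
```

The returned feature rows and labels can be passed to most machine learning libraries as they are, which is the practical meaning of "minimal preprocessing" here.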