HIVE Lab - User contributions [en]

Volunteership Fall 2025

2025-12-02T02:36:24Z

Twang9: /* Fall Symposium */

== 2025 Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

August 22, 2025, Noon (email your updated resume and your preferred projects in ranked order). An acceptance letter/email will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

August 25, 2025 | 4:00 to 5:00 PM

'''Program Dates: September 1st, 2025 – November 30th, 2025''' (13 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_2025 Summer 2025 Volunteership] (Closed)

[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_Spring_2026 Spring 2026 Volunteership]
----

=== Volunteer Expectations ===

# Minimum commitment of 10 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Regular Zoom meetings with the assigned project point of contact.
# Attend some lectures or seminars remotely (max 4-5).

'''''Important:''' If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.''
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Fall 2025. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email mazumder_lab@gwu.edu your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs for training an LLM to recommend publications with datasets useful for intervention outcome prediction.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
----

==== 1. BiomarkerKB Biocuration Project Ideas ====
POC: Daniall Masood, Maria Kim

# Curate biomarkers for a specific disease
## The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details, using the data collected in the first 4 weeks as training/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate biomarkers. The data provided does not follow the biomarker data model.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.
# Continue working on LLM methods started by volunteers over the summer.
## The data is available as well as some preliminary research and work done by previous volunteers in this area.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check whether it is feasible as a project for the Fall.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because much of the valuable metadata (e.g., species, tissue, disease, cell line) annotations are free-text terms that do not align with established standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.
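
As a rough illustration of why the term mapping above needs curator review, the following minimal Python sketch normalizes a free-text species term against a small synonym table and falls back to fuzzy matching for spelling errors. The table <code>SYNONYMS</code>, the function <code>map_species</code>, and the 0.8 cutoff are assumptions made for this example only; they are not part of GlyGen, and a real mapping would target an established dictionary or ontology.

```python
import difflib

# Hypothetical synonym table for illustration; real curation would map
# terms to an established dictionary or ontology instead.
SYNONYMS = {
    "human": "Homo sapiens",
    "man": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
    "homo sapiens": "Homo sapiens",
}


def map_species(term, cutoff=0.8):
    """Return (canonical_name, how) where how is 'exact', 'fuzzy', or 'unmapped'."""
    # Normalize case and collapse repeated whitespace before lookup.
    key = " ".join(term.lower().split())
    if key in SYNONYMS:
        return SYNONYMS[key], "exact"
    # Fall back to approximate matching to catch spelling errors.
    close = difflib.get_close_matches(key, list(SYNONYMS), n=1, cutoff=cutoff)
    if close:
        return SYNONYMS[close[0]], "fuzzy"
    # Anything else is left for a curator to resolve by hand.
    return None, "unmapped"


if __name__ == "__main__":
    for raw in ["Human", "homo sapeins", "mouse"]:
        print(raw, "->", map_species(raw))
```

Fuzzy matches are only suggestions; terms reported as "fuzzy" or "unmapped" would still need to be checked by a curator.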

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.
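
The keyword-filtering step could start from something like the sketch below, which assumes the NCBI E-utilities <code>esearch</code> endpoint is used as the PubMed web API and that results are requested as JSON. The helper names (<code>build_esearch_url</code>, <code>extract_pmids</code>, <code>fetch_pmids</code>) and the choice to search the Title/Abstract field are assumptions for illustration, not the project's actual code.

```python
import json
import urllib.parse
import urllib.request

# NCBI E-utilities search endpoint (assumed here to be the "PubMed web API").
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(keywords, retmax=20):
    """Build an esearch URL matching any keyword in the title or abstract."""
    term = " OR ".join(f'"{kw}"[Title/Abstract]' for kw in keywords)
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return ESEARCH_URL + "?" + urllib.parse.urlencode(params)


def extract_pmids(response):
    """Pull the list of PMIDs out of a decoded esearch JSON response."""
    return response.get("esearchresult", {}).get("idlist", [])


def fetch_pmids(keywords, retmax=20):
    """Run the search over the network and return the matching PMIDs."""
    with urllib.request.urlopen(build_esearch_url(keywords, retmax)) as resp:
        return extract_pmids(json.load(resp))


if __name__ == "__main__":
    # Prints the request URL only; call fetch_pmids() to query NCBI.
    print(build_esearch_url(["GlyGen", "glycomics"], retmax=5))
```

Keeping URL construction and response parsing separate from the network call makes both easy to test offline; the abstracts for the returned PMIDs would be retrieved in a later step.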

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer, Tianyi Wang, Pat McNeely (optional)

Identifying relevant and useful publicly available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet and annotated PDFs for PubMed articles relevant to prostate, lung, and breast cancers, biomarkers, and glycans, and will focus on indicators such as condition, intervention, and response.

PMID curation involves:

# Identify potentially relevant PMIDs that may have publicly-available datasets for training intervention outcome prediction models.
# Curate indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.
# Review peer curations and resolve annotation conflicts.
# Prepare a Wikipage to showcase the validated PMIDs.
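
To make the peer-review step more concrete, here is a minimal Python sketch of how annotation conflicts might be flagged before they are resolved by discussion. It assumes each annotation row is a dictionary with a <code>pmid</code> key plus the three indicator fields named above (condition, intervention, response); the row layout and the function <code>find_conflicts</code> are hypothetical and do not describe the project's actual spreadsheet.

```python
from collections import defaultdict

# Indicator fields named in the project description.
FIELDS = ("condition", "intervention", "response")


def find_conflicts(annotations):
    """Return {pmid: {field: [distinct values]}} where curators disagree."""
    by_pmid = defaultdict(list)
    for row in annotations:
        by_pmid[row["pmid"]].append(row)

    conflicts = {}
    for pmid, rows in by_pmid.items():
        for field in FIELDS:
            # Compare values case-insensitively, ignoring stray whitespace.
            values = {row[field].strip().lower() for row in rows}
            if len(values) > 1:
                conflicts.setdefault(pmid, {})[field] = sorted(values)
    return conflicts


if __name__ == "__main__":
    # Made-up rows with placeholder identifiers, for illustration only.
    rows = [
        {"pmid": "PMID_A", "curator": "c1", "condition": "Lung cancer",
         "intervention": "chemotherapy", "response": "survival"},
        {"pmid": "PMID_A", "curator": "c2", "condition": "lung cancer",
         "intervention": "immunotherapy", "response": "survival"},
    ]
    print(find_conflicts(rows))
```

Only fields with more than one distinct value are reported, so curators can focus their discussion on actual disagreements.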

Interested individuals should reach out to lorikrammer@gwu.edu.

==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====

POC: Christie Woodside, Jonathon Keeney

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires close attention to detail.
## Additional Work: Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would work on manual curation of circulating pathogens to be added to data.argosdb.org. Regular check-ins and reports of what was found.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.
# QC Analysis using HIVE
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.
## The results will be added to our ARGOS database.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check whether it is feasible as a project for the Fall.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 13-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|[https://www.linkedin.com/in/diya-kamalabharathy-62557935a/ Diya Kamalabharathy*]
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod; Glyco web development
|-
|Harivinay P. Gujjula*
|GlyGen
|Rene Ranzinger, Urnisha Bhuiyan, Kate Warner
|GlyGen
|-
|Sparsh Gupta*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Vishal Muthusekaran
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Miao Wang*
|ARGOS
|Christie Woodside, Jonathon Keeney
|ARGOS
|-
|Anika Sikka
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod
|-
|[https://www.linkedin.com/in/farah-kamila/ Farah Kamila]
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod, ARGOS, BiomarkerKB
|-
|[https://www.linkedin.com/in/arhamur-rauf-2a61b3156 Arhamur Rauf]
|ARGOS
|Christie Woodside, Jonathon Keeney
|ARGOS, GlyGen, PredictMod
|-
|[https://www.linkedin.com/in/ashley-tien/ Ashley Tien]
|PredictMod
|Lori Krammer, Tianyi Wang
|ARGOS, PredictMod
|-
|Namrata Oruganti
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|}
<nowiki>*</nowiki>Returning volunteer.

== Fall Symposium ==
The Fall symposium will be held virtually.

'''Date:''' Nov 26th, 2025 (Wednesday)

'''Time:''' 3 - 5 PM

'''Zoom Link''' - https://gwu-edu.zoom.us/j/96518488501?jst=2

=== Agenda (All times are in Eastern Standard Time) ===
{| class="wikitable"
|+
!Time
!Project
!Presentation Title
!Presenter(s)
|-
|3:00-3:10 PM
| colspan="2" |Welcome & Introduction
|Raja Mazumder
|-
|3:10-3:35 PM
|PredictMod
|
* 5 min POC (Tianyi & Lori) intro
* 15 mins - PredictMod: PMID Curation for Intervention Outcome Prediction Models (IOPMs)
* 5 min QA
|Diya Kamalabharathy; Anika Sikka; Ashley Tien; Farah Kamila
|-
|3:35-4:00 PM
|GlyGen
|
* 5 min POC intro (Urnisha, Rene, Kate)
* 15 mins - Curation of species metadata using LLM & Visualizing glycomics databases and their features
* 5 min QA
|Diya Kamalabharathy; Harivinay P. Gujjula
|-
|4:00-4:25 PM
|ARGOS
|
* 5 min POC (Christie) intro
* 15 mins - Curation of Pathogens and QC Analysis for the ARGOS Project: QC analysis, representative genome selection; curation of genomes 1 & 2
* 5 mins QA
|Miao Wang; Arhamur Rauf
|-
|4:25-4:50 PM
|BiomarkerKB
|
* 5 min POC (Daniall & Maria) intro
* 15 mins - Leveraging Large Language Models to collect Biomarker data from PubMed Abstracts
* 5 mins QA
|Namrata Oruganti; Vishal Muthusekaran; Sparsh Gupta
|-
|4:50-5:00 PM
| colspan="2" |Remarks
|Raja Mazumder
|}

Volunteership Fall 2025

2025-12-01T19:51:02Z

Twang9: /* Fall Symposium */

== Fall Symposium ==
The Fall symposium will be held virtually.

'''Date:''' Nov 26th, 2025 (Wednesday)

'''Time:''' 3 - 5 PM

'''Zoom Link''' - https://gwu-edu.zoom.us/j/96518488501?jst=2

'''Recording''': https://gwu-edu.zoom.us/rec/share/PxqcaIac-skXkpyN4kgz98WyxPGATabT5pRRDqV-OPs0XxxtQHLeNmgkt2TLGNBZ.s4PNqHOLinZiRAxd


Volunteership Fall 2025

2025-12-01T19:50:17Z

Twang9:

== 2025 Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

August 22, 2025, Noon (email your updated resume and projects in order of preference). Acceptance letter/email will be sent to candidates latest the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

August 25, 2025 | 4:00 to 5:00 PM

'''Program Dates: September 1st, 2025 – November 30th, 2025''' (13 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_2025 Summer 2025 Volunteership] (Closed)

[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_Spring_2026 Spring 2026 Volunteership]
----

=== Volunteer Expectations ===

# Minimum commitment of 10 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Regular Zoom meetings with the assigned project point of contact.
# Attend some lectures or seminars remotely (max 4-5).

'''''Important:''' If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.''
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Fall 2025. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email mazumder_lab@gwu.edu your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project. Curating PMIDs for intervention outcome prediction dataset LLM recommendation training.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
----

==== 1. BiomarkerKB Biocuration Project Ideas ====
POC: Daniall Masood, Maria Kim

# Curate biomarkers for a specific disease
## The student would be doing manual curation for about 4 weeks, with regular check-ins with me to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate biomarkers. Data provided was not provided in the biomarker data model.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.
# Continue working on LLM methods started by volunteers over the summer.
## The data is available as well as some preliminary research and work done by previous volunteers in this area.

If the student has any other ideas, diseases, treatments, or methods they want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check if it will be feasible as a project for the summer.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because much of the valuable metadata (e.g., species, tissue, disease, cell line) annotations are free-text terms that do not align with established standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.
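
The keyword-filtering step can be sketched with the NCBI E-utilities ESearch endpoint, which is one way to reach PubMed programmatically. The snippet below is an illustration under assumptions: the keyword list, the choice of the title/abstract field tag, and the function names are hypothetical, and the project may use a different client or query strategy.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities ESearch endpoint for searching Entrez databases such as PubMed.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(keywords, field="tiab"):
    """OR the keywords together, each restricted to the title/abstract field."""
    return " OR ".join(f'"{kw}"[{field}]' for kw in keywords)


def build_esearch_url(keywords, retmax=20):
    """Build an ESearch URL that returns up to `retmax` PMIDs as JSON."""
    params = {
        "db": "pubmed",
        "term": build_query(keywords),
        "retmode": "json",
        "retmax": retmax,
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


def fetch_pmids(keywords, retmax=20):
    """Run the search over the network and return the list of matching PMIDs."""
    with urlopen(build_esearch_url(keywords, retmax)) as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]
```

Calling `fetch_pmids(["GlyGen"])` would return PMIDs as strings; abstracts and affiliations for those PMIDs can then be retrieved in a separate request for the community analysis.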

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer, Tianyi Wang, Pat McNeely (optional)

Identifying relevant and useful publicly available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet and annotated PDFs for PubMed articles relevant to prostate, lung, and breast cancers, biomarkers, and glycans, and will focus on indicators such as condition, intervention, and response.

PMID curation involves:

# Identify potentially relevant PMIDs that may have publicly-available datasets for training intervention outcome prediction models.
# Curate indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.
# Review peer curations and resolve annotation conflicts.
# Prepare a Wikipage to showcase the validated PMIDs.

Interested individuals should reach out to lorikrammer@gwu.edu.
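
The peer-review step in the list above can be sketched as a comparison of two curators' annotations for the same PMIDs. This is only an illustration: the dictionary layout and function name are assumptions, the field names follow the indicators mentioned above, and the actual annotation spreadsheet may be organized differently.

```python
# Indicator fields taken from the project description; the data layout is hypothetical.
FIELDS = ("condition", "intervention", "response")


def find_conflicts(curator_a, curator_b, fields=FIELDS):
    """Return {pmid: [fields that differ]} for PMIDs annotated by both curators."""
    conflicts = {}
    for pmid in sorted(curator_a.keys() & curator_b.keys()):
        differing = [
            field
            for field in fields
            if curator_a[pmid].get(field) != curator_b[pmid].get(field)
        ]
        if differing:
            conflicts[pmid] = differing
    return conflicts
```

PMIDs annotated by only one curator are ignored here; they would need a second curation pass before any conflict check.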

==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====

POC: Christie Woodside, Jonathon Keeney

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires close attention to detail.
## Additional Work: Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would work on manual curation of circulating pathogens to be added to data.argosdb.org. Regular check-ins and reports of what was found.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.
# QC Analysis using HIVE
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.
## The results will be added to our ARGOS database.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss them and check whether they would be feasible as a project for the Fall.
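
For the table-review task above, a small format check can catch mistyped IDs before a table is used in computations. The sketch below assumes the assembly IDs are NCBI assembly accessions (for example, the pattern of GCA_000001405.15) and that rows are dictionaries with a hypothetical `assembly_id` column; neither is confirmed by the ARGOS tables themselves.

```python
import re

# Assumed format: NCBI assembly accession, e.g. GCA_000001405.15 or GCF_000001405.40
# (GCA/GCF prefix, underscore, nine digits, dot, version number).
ASSEMBLY_RE = re.compile(r"GC[AF]_\d{9}\.\d+")


def is_valid_assembly_id(value: str) -> bool:
    """Check that the whole value, ignoring surrounding spaces, matches the pattern."""
    return ASSEMBLY_RE.fullmatch(value.strip()) is not None


def find_bad_rows(rows, column="assembly_id"):
    """Return (row_number, value) pairs, 1-based, for rows with a malformed or missing ID."""
    bad = []
    for number, row in enumerate(rows, start=1):
        value = row.get(column, "")
        if not is_valid_assembly_id(value):
            bad.append((number, value))
    return bad
```

A format check of this kind only flags typos; whether an accession actually exists still has to be confirmed against the source database.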
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 13-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|[https://www.linkedin.com/in/diya-kamalabharathy-62557935a/ Diya Kamalabharathy*]
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod; Glyco web development
|-
|Harivinay P. Gujjula*
|GlyGen
|Rene Ranzinger, Urnisha Bhuiyan, Kate Warner
|GlyGen
|-
|Sparsh Gupta*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Vishal Muthusekaran
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Miao Wang*
|ARGOS
|Christie Woodside, Jonathon Keeney
|ARGOS
|-
|Anika Sikka
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod
|-
|[https://www.linkedin.com/in/farah-kamila/ Farah Kamila]
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod, ARGOS, BiomarkerKB
|-
|[https://www.linkedin.com/in/arhamur-rauf-2a61b3156 Arhamur Rauf]
|ARGOS
|Christie Woodside, Jonathon Keeney
|ARGOS, GlyGen, PredictMod
|-
|[https://www.linkedin.com/in/ashley-tien/ Ashley Tien]
|PredictMod
|Lori Krammer, Tianyi Wang
|ARGOS, PredictMod
|-
|Namrata Oruganti
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|}
<nowiki>*</nowiki>Returning volunteer.

== Fall Symposium ==
The Fall symposium will be held virtually.

'''Date:''' Nov 26th, 2025 (Wednesday)

'''Time:''' 3 - 5 PM

'''Zoom Link''' - https://gwu-edu.zoom.us/j/96518488501?jst=2

'''Recording''': <nowiki>https://gwu-edu.zoom.us/rec/share/PxqcaIac-skXkpyN4kgz98WyxPGATabT5pRRDqV-OPs0XxxtQHLeNmgkt2TLGNBZ.s4PNqHOLinZiRAxd</nowiki>

Passcode: W!9J=J61

=== Agenda (All times are in Eastern Standard Time) ===
{| class="wikitable"
|+
!Time
!Project
!Presentation Title
!Presenter(s)
|-
|3:00-3:10 PM
| colspan="2" |Welcome & Introduction
|Raja Mazumder
|-
|3:10-3:35 PM
|PredictMod
|
* 5 min POC (Tianyi & Lori) intro
* 15 mins - PredictMod: PMID Curation for Intervention Outcome Prediction Models (IOPMs)
* 5 min QA
|Diya Kamalabharathy; Anika Sikka; Ashley Tien; Farah Kamila
|-
|3:35-4:00 PM
|GlyGen
|
* 5 min POC intro (Urnisha, Rene, Kate)
* 15 mins - Curation of species metadata using LLM & Visualizing glycomics databases and their features
* 5 min QA
|Diya Kamalabharathy; Harivinay P. Gujjula
|-
|4:00-4:25 PM
|ARGOS
|
* 5 min POC (Christie) intro
* 15 mins - Curation of Pathogens and QC Analysis for the ARGOS Project: QC analysis, representative genome selection, curation of genomes 1 & 2
* 5 mins QA
|Miao Wang; Arhamur Rauf
|-
|4:25-4:50 PM
|BiomarkerKB
|
* 5 min POC (Daniall & Maria) intro
* 15 mins - Leveraging Large Language Models to collect Biomarker data from PubMed Abstracts
* 5 mins QA
|Namrata Oruganti; Vishal Muthusekaran; Sparsh Gupta
|-
|4:50-5:00 PM
| colspan="2" |Remarks
|Raja Mazumder
|}

Volunteership Fall 2025

2025-11-25T16:54:45Z

Twang9: /* Agenda (All times are in Eastern Standard Time) */

== 2025 Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

August 22, 2025, Noon (email your updated resume and projects in order of preference). An acceptance letter/email will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

August 25, 2025 | 4:00 to 5:00 PM

'''Program Dates: September 1st, 2025 – November 30th, 2025''' (13 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_2025 Summer 2025 Volunteership] (Closed)
----

=== Volunteer Expectations ===

# Minimum commitment of 10 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Regular Zoom meetings with the assigned project point of contact.
# Attend some lectures or seminars remotely (max 4-5).

'''''Important:''' If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.''
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Fall 2025. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email mazumder_lab@gwu.edu your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs to train an LLM that recommends publications with datasets for intervention outcome prediction.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
----

==== 1. BiomarkerKB Biocuration Project Ideas ====
POC: Daniall Masood, Maria Kim

# Curate biomarkers for a specific disease
## The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate biomarkers. The data provided is not in the biomarker data model format.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.
# Continue working on LLM methods started by volunteers over the summer.
## The data is available as well as some preliminary research and work done by previous volunteers in this area.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss them and check whether they would be feasible as a project for the Fall.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because much of the valuable metadata (e.g., species, tissue, disease, cell line) annotations are free-text terms that do not align with established standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer, Tianyi Wang, Pat McNeely (optional)

Identifying relevant and useful publicly available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet and annotated PDFs for PubMed articles relevant to prostate, lung, and breast cancers, biomarkers, and glycans, and will focus on indicators such as condition, intervention, and response.

PMID curation involves:

# Identify potentially relevant PMIDs that may have publicly-available datasets for training intervention outcome prediction models.
# Curate indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.
# Review peer curations and resolve annotation conflicts.
# Prepare a Wikipage to showcase the validated PMIDs.

Interested individuals should reach out to lorikrammer@gwu.edu.

==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====

POC: Christie Woodside, Jonathon Keeney

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires close attention to detail.
## Additional Work: Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would work on manual curation of circulating pathogens to be added to data.argosdb.org. Regular check-ins and reports of what was found.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.
# QC Analysis using HIVE
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.
## The results will be added to our ARGOS database.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss them and check whether they would be feasible as a project for the Fall.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 13-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|[https://www.linkedin.com/in/diya-kamalabharathy-62557935a/ Diya Kamalabharathy*]
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod; Glyco web development
|-
|Harivinay P. Gujjula*
|GlyGen
|Rene Ranzinger, Urnisha Bhuiyan, Kate Warner
|GlyGen
|-
|Sparsh Gupta*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Vishal Muthusekaran
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Miao Wang*
|ARGOS
|Christie Woodside, Jonathon Keeney
|ARGOS
|-
|Anika Sikka
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod
|-
|[https://www.linkedin.com/in/farah-kamila/ Farah Kamila]
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod, ARGOS, BiomarkerKB
|-
|Arhamur Rauf
|ARGOS
|Christie Woodside, Jonathon Keeney
|ARGOS, GlyGen, PredictMod
|-
|[https://www.linkedin.com/in/ashley-tien/ Ashley Tien]
|PredictMod
|Lori Krammer, Tianyi Wang
|ARGOS, PredictMod
|-
|Namrata Oruganti
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|}
<nowiki>*</nowiki>Returning volunteer.

== Fall Symposium ==
The Fall symposium will be held virtually.

'''Date:''' Nov 26th, 2025 (Wednesday)

'''Time:''' 3 - 5 PM

'''Zoom Link''' - https://gwu-edu.zoom.us/j/96518488501?jst=2

=== Agenda (All times are in Eastern Standard Time) ===
{| class="wikitable"
|+
!Time
!Project
!Presentation Title
!Presenter(s)
|-
|3 - 3:10 PM
| colspan="2" |Welcome & Introduction
|Raja Mazumder
|-
|3:10 - 3:35 PM
|PredictMod
|
* 5 min POC (Tianyi & Lori) intro
* 15 mins - PredictMod: PMID Curation for Intervention Outcome Prediction Models (IOPMs)
* 5 min QA
|Diya Kamalabharathy; Anika Sikka; Ashley Tien; Farah Kamila
|-
|3:35 - 4:00 PM
|GlyGen
|
* 5 min POC intro
* 15 mins - Curation of species metadata using LLM & Visualizing glycomics databases and their features
* 5 min QA
|Diya Kamalabharathy; Harivinay P. Gujjula
|-
|4:00 - 4:25 PM
|ARGOS
|
* 5 min POC intro
* 15 mins - Curation of Pathogens and QC Analysis for the ARGOS Project: QC analysis, representative genome selection, curation of genomes 1 & 2
* 5 mins QA
|Miao Wang; Arhamur Rauf
|-
|4:25 - 4:50 PM
|BiomarkerKB
|
* 5 min POC (Daniall and Maria) intro
* 15 mins - Leveraging Large Language Models to collect Biomarker data from PubMed Abstracts
* 5 mins QA
|Namrata Oruganti; Vishal Muthusekaran
|-
|4:50 - 5 PM
| colspan="2" |Remarks
|Raja Mazumder
|}

Volunteership Fall 2025

2025-11-25T16:00:30Z

Twang9: /* Agenda (All times are in Eastern Standard Time) */

== 2025 Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

August 22, 2025, Noon (email your updated resume and projects in order of preference). An acceptance letter/email will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

August 25, 2025 | 4:00 to 5:00 PM

'''Program Dates: September 1st, 2025 – November 30th, 2025''' (13 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_2025 Summer 2025 Volunteership] (Closed)
----

=== Volunteer Expectations ===

# Minimum commitment of 10 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Regular Zoom meetings with the assigned project point of contact.
# Attend some lectures or seminars remotely (max 4-5).

'''''Important:''' If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.''
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Fall 2025. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email mazumder_lab@gwu.edu your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs to train an LLM that recommends publications with datasets for intervention outcome prediction.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
----

==== 1. BiomarkerKB Biocuration Project Ideas ====
POC: Daniall Masood, Maria Kim

# Curate biomarkers for a specific disease
## The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate biomarkers. The data provided is not in the biomarker data model format.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.
# Continue working on LLM methods started by volunteers over the summer.
## The data is available as well as some preliminary research and work done by previous volunteers in this area.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss them and check whether they would be feasible as a project for the Fall.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because much of the valuable metadata (e.g., species, tissue, disease, cell line) annotations are free-text terms that do not align with established standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer, Tianyi Wang, Pat McNeely (optional)

Identifying relevant and useful publicly available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet and annotated PDFs for PubMed articles relevant to prostate, lung, and breast cancers, biomarkers, and glycans, and will focus on indicators such as condition, intervention, and response.

PMID curation involves:

# Identify potentially relevant PMIDs that may have publicly-available datasets for training intervention outcome prediction models.
# Curate indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.
# Review peer curations and resolve annotation conflicts.
# Prepare a Wikipage to showcase the validated PMIDs.

Interested individuals should reach out to lorikrammer@gwu.edu.

==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====

POC: Christie Woodside, Jonathon Keeney

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires close attention to detail.
## Additional Work: Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would work on manual curation of circulating pathogens to be added to data.argosdb.org. Regular check-ins and reports of what was found.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.
# QC Analysis using HIVE
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.
## The results will be added to our ARGOS database.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss them and check whether they would be feasible as a project for the Fall.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 13-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|[https://www.linkedin.com/in/diya-kamalabharathy-62557935a/ Diya Kamalabharathy*]
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod; Glyco web development
|-
|Harivinay P. Gujjula*
|GlyGen
|Rene Ranzinger, Urnisha Bhuiyan, Kate Warner
|GlyGen
|-
|Sparsh Gupta*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Vishal Muthusekaran
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Miao Wang*
|ARGOS
|Christie Woodside, Jonathon Keeney
|ARGOS
|-
|Anika Sikka
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod
|-
|[https://www.linkedin.com/in/farah-kamila/ Farah Kamila]
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod, ARGOS, BiomarkerKB
|-
|Arhamur Rauf
|ARGOS
|Christie Woodside, Jonathon Keeney
|ARGOS, GlyGen, PredictMod
|-
|[https://www.linkedin.com/in/ashley-tien/ Ashley Tien]
|PredictMod
|Lori Krammer, Tianyi Wang
|ARGOS, PredictMod
|-
|Namrata Oruganti
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|}
<nowiki>*</nowiki>Returning volunteer.

== Fall Symposium ==
The Fall symposium will be held virtually.

'''Date:''' Nov 26th, 2025 (Wednesday)

'''Time:''' 3 - 5 PM

'''Zoom Link''' - https://gwu-edu.zoom.us/j/96518488501?jst=2

=== Agenda (All times are in Eastern Standard Time) ===
{| class="wikitable"
|+
!Time
!Project
!Presentation Title
!Presenter(s)
|-
|3 - 3:10 PM
| colspan="2" |Welcome & Introduction
|Raja Mazumder
|-
|3:10 - 3:35 PM
|PredictMod
|
* 5 min POC (Tianyi & Lori) intro
* 15 mins - PredictMod: PMID Curation for Intervention Outcome Prediction Models (IOPMs)
* 5 min QA
|Diya Kamalabharathy; Anika Sikka; Ashley Tien; Farah Kamila
|-
|3:35 - 4:00 PM
|GlyGen
|
* 5 min POC intro
* 15 mins - Curation of species metadata using LLM & Visualizing glycomics databases and their features
* 5 min QA
|Diya Kamalabharathy; Harivinay P. Gujjula
|-
|4:00 - 4:25 PM
|ARGOS
|
* 5 min POC intro
* 15 mins - Curation of Pathogens and QC Analysis for the Argos Project: QC analysis, representative genome selection, curation of genomes 1 & 2
* 5 mins QA
|Miao Wang; Arhamur Rauf; Linford
|-
|4:25 - 4:50 PM
|BiomarkerKB
|
* 5 min POC (Daniall and Maria) intro
* 15 mins - Leveraging Large Language Models to collect Biomarker data from PubMed Abstracts
* 5 mins QA
|Namrata Oruganti; Vishal Muthusekaran
|-
|4:50 - 5 PM
| colspan="2" |Remarks
|Raja Mazumder
|}

Volunteership Fall 2025

2025-11-13T17:03:23Z

Twang9: /* Agenda (All times are in Eastern Standard Time) */

== 2025 Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

August 22, 2025, Noon (email your updated resume and projects in order of preference). Acceptance letters/emails will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

August 25, 2025 | 4:00 to 5:00 PM

'''Program Dates: September 1st, 2025 – November 30th, 2025''' (13 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_2025 Summer 2025 Volunteership] (Closed)
----

=== Volunteer Expectations ===

# Minimum commitment of 10 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Regular Zoom meetings with the assigned project point of contact.
# Attend some lectures or seminars remotely (max 4-5).

'''''Important:''' If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.''
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Fall 2025. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email mazumder_lab@gwu.edu your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs to build a training set for an LLM that recommends publications with datasets for intervention outcome prediction.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
----

==== 1. BiomarkerKB Biocuration Project Ideas ====
POC: Daniall Masood, Maria Kim

# Curate biomarkers for a specific disease
## The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate the biomarkers; the data provided is not yet in the biomarker data model format.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.
# Continue working on LLM methods started by volunteers over the summer.
## The data is available as well as some preliminary research and work done by previous volunteers in this area.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss them and check whether they would be feasible as a project for the Fall.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because much of the valuable metadata (e.g., species, tissue, disease, cell line) annotations are free-text terms that do not align with established standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.
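
As a rough illustration of the mapping task above, the sketch below normalizes free-text species labels and looks them up in a small synonym table. The table contents and the names <code>SPECIES_SYNONYMS</code>, <code>normalize_term</code>, and <code>map_species</code> are hypothetical and chosen only for this example; they are not part of GlyGen. Terms that do not match are returned as <code>None</code> so that a curator can review them manually.

```python
import re

# Hypothetical synonym table for illustration only; a real table would be
# built and reviewed by curators against a standard taxonomy.
SPECIES_SYNONYMS = {
    "human": "Homo sapiens",
    "man": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
    "homo sapiens": "Homo sapiens",
}


def normalize_term(raw):
    """Lowercase the term, trim it, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", raw.strip().lower())


def map_species(raw):
    """Return the standard species name, or None if manual curation is needed."""
    return SPECIES_SYNONYMS.get(normalize_term(raw))


if __name__ == "__main__":
    for term in ["Human", "  H.  sapiens ", "humna"]:
        print(repr(term), "->", map_species(term))
```

An exact lookup like this only covers known synonyms. Misspellings such as "humna" fall through to manual review, which is the part of the work the curators are needed for.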

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.
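
The keyword-filtering step listed above could start from something like the following minimal sketch, which assumes the NCBI E-utilities <code>esearch</code> endpoint is the "PubMed web API" being referred to. The helper names <code>build_esearch_url</code> and <code>parse_pmids</code>, the example keywords, and the choice to restrict matches to the <code>[Title/Abstract]</code> field are assumptions made for illustration, not the project's actual method. The sketch only builds the query URL and parses a JSON response; the HTTP request itself is left out.

```python
import json
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(keywords, retmax=20):
    """Build an esearch URL that requires every keyword in the title or abstract."""
    term = " AND ".join(f"{kw}[Title/Abstract]" for kw in keywords)
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_pmids(response_text):
    """Extract the list of PMIDs from an esearch JSON response body."""
    payload = json.loads(response_text)
    return payload.get("esearchresult", {}).get("idlist", [])


if __name__ == "__main__":
    # Hypothetical keywords; print the URL that would be requested.
    print(build_esearch_url(["GlyGen", "glycan"], retmax=5))
```

The returned PMIDs could then be passed to a separate fetch step to retrieve abstracts and affiliations for the community analysis.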

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer, Tianyi Wang, Pat McNeely (optional)

Identifying relevant and useful publicly-available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly-available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet + annotated PDFs for PubMed articles relevant to prostate, lung, breast cancers, biomarkers and glycans, and focus on indicators such as condition, intervention, and response.

PMID curation involves:

# Identify potentially relevant PMIDs that may have publicly-available datasets for training intervention outcome prediction models.
# Curate indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.
# Review peer curations and resolve annotation conflicts.
# Prepare a Wikipage to showcase the validated PMIDs.

Interested individuals should reach out to lorikrammer@gwu.edu.

==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====

POC: Christie Woodside, Jonathon Keeney

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires close attention to detail.
## Additional Work: Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would work on manual curation of circulating pathogens to be added to data.argosdb.org. Regular check-ins and reports of what was found.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.
# QC Analysis using HIVE
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.
## The results will be added to our ARGOS database.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss them and check whether they would be feasible as a project for the Fall.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 13-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|[https://www.linkedin.com/in/diya-kamalabharathy-62557935a/ Diya Kamalabharathy*]
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod; Glyco web development
|-
|Harivinay P. Gujjula*
|GlyGen
|Rene Ranzinger, Urnisha Bhuiyan, Kate Warner
|GlyGen
|-
|Sparsh Gupta*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|[https://www.linkedin.com/in/isil-erbasol-serbes/ Isil Erbasol Serbes]
|BiomarkerKB
|Daniall Masood, Maria Kim
|PredictMod, BiomarkerKB, ARGOS
|-
|Vishal Muthusekaran
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Miao Wang*
|ARGOS
|Christie Woodside, Jonathon Keeney
|ARGOS
|-
|Anika Sikka
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod
|-
|[https://www.linkedin.com/in/farah-kamila/ Farah Kamila]
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod, ARGOS, BiomarkerKB
|-
|Arhamur Rauf
|ARGOS
|Christie Woodside, Jonathon Keeney
|ARGOS, GlyGen, PredictMod
|-
|[https://www.linkedin.com/in/ashley-tien/ Ashley Tien]
|PredictMod
|Lori Krammer, Tianyi Wang
|ARGOS, PredictMod
|-
|Namrata Oruganti
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|}
<nowiki>*</nowiki>Returning volunteer.

== Fall Symposium ==
The Fall symposium will be held virtually.

'''Date:''' Nov 26th, 2025 (Wednesday)

'''Time:''' 3 - 5 PM

'''Zoom Link''' - https://gwu-edu.zoom.us/j/96518488501?jst=2

=== Agenda (All times are in Eastern Standard Time) ===
{| class="wikitable"
|+
!Time
!Project
!Presentation Title
!Presenter(s)
|-
|3 - 3:10 PM
| colspan="2" |Welcome & Introduction
|Raja Mazumder
|-
|3:10 - 3:35 PM
|PredictMod
|
* 5 min POC (Tianyi & Lori) intro
* 15 mins - PredictMod: PubMed Curation for Training an LLM for Recommendation
* 5 min QA
|Diya Kamalabharathy; Anika Sikka; Ashley Tien; Farah Kamila
|-
|3:35 - 4:00 PM
|GlyGen
|
* 5 min POC intro
* 15 mins - Curation of species metadata using LLM & Visualizing glycomics databases and their features
* 5 min QA
|Diya Kamalabharathy; Harivinay P. Gujjula
|-
|4:00 - 4:25 PM
|ARGOS
|
* 5 min POC intro
* 15 mins - Curation of Pathogens and QC Analysis for the Argos Project: QC analysis, representative genome selection, curation of genomes 1 & 2
* 5 mins QA
|Miao Wang; Arhamur Rauf; Linford
|-
|4:25 - 4:50 PM
|BiomarkerKB
|
* 5 min POC (Daniall and Maria) intro
* 15 mins - Leveraging Large Language Models to collect Biomarker data from PubMed Abstracts
* 5 mins QA
|Namrata Oruganti; Vishal Muthusekaran
|-
|4:50 - 5 PM
| colspan="2" |Remarks
|Raja Mazumder
|}

Volunteership Spring 2026

2025-11-13T16:24:57Z

Twang9:

== 2026 Spring Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

January 9, 2026, Noon (email your updated resume and projects in order of preference). Acceptance letters/emails will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

January 12, 2026 | 4:00 to 5:00 PM

'''Program Dates: January 2026 – April 2026''' (13 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[[Volunteership Fall 2025|Fall 2025 Volunteership]] (Closed)
----

=== Volunteer Expectations ===

# Minimum commitment of 10 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Regular Zoom meetings with the assigned project point of contact.
# Attend some lectures or seminars remotely (max 4-5).

'''''Important:''' If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.''
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Spring 2026. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email mazumder_lab@gwu.edu your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.

# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs to build a training set for an LLM that recommends publications with datasets for intervention outcome prediction.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
----

==== 1. BiomarkerKB Biocuration Project Ideas ====
POC: Daniall Masood, Maria Kim

# Curate biomarkers for a specific disease
## The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate the biomarkers; the data provided is not yet in the biomarker data model format.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.
# Continue working on LLM methods started by volunteers over the summer.
## The data is available as well as some preliminary research and work done by previous volunteers in this area.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss them and check whether they would be feasible as a project for the Spring.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because much of the valuable metadata (e.g., species, tissue, disease, cell line) annotations are free-text terms that do not align with established standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer, Tianyi Wang, Pat McNeely (optional)

Identifying relevant and useful publicly-available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly-available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet + annotated PDFs for PubMed articles relevant to prostate, lung, breast cancers, biomarkers and glycans, and focus on indicators such as condition, intervention, and response.

PMID curation involves:

# Identify potentially relevant PMIDs that may have publicly-available datasets for training intervention outcome prediction models.
# Curate indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.
# Review peer curations and resolve annotation conflicts.
# Prepare a Wikipage to showcase the validated PMIDs.

Interested individuals should reach out to lorikrammer@gwu.edu.

==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====

POC: Christie Woodside, Jonathon Keeney

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires close attention to detail.
## Additional Work: Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would work on manual curation of circulating pathogens to be added to data.argosdb.org. Regular check-ins and reports of what was found.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.
# QC Analysis using HIVE
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.
## The results will be added to our ARGOS database.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss them and check whether they would be feasible as a project for the Spring.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 13-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers (TBD) ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|
|
|
|
|}
<nowiki>*</nowiki>Returning volunteer.

Volunteership Fall 2025

2025-11-07T22:19:17Z

Twang9: /* Agenda (All times are in Eastern Standard Time) */

== 2025 Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

August 22, 2025, Noon (email your updated resume and projects in order of preference). Acceptance letters/emails will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

August 25, 2025 | 4:00 to 5:00 PM

'''Program Dates: September 1st, 2025 – November 30th, 2025''' (13 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_2025 Summer 2025 Volunteership] (Closed)
----

=== Volunteer Expectations ===

# Minimum commitment of 10 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Regular Zoom meetings with the assigned project point of contact.
# Attend some lectures or seminars remotely (max 4-5).

'''''Important:''' If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.''
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Fall 2025. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email mazumder_lab@gwu.edu your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs to build a training set for an LLM that recommends publications with datasets for intervention outcome prediction.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
----

==== 1. BiomarkerKB Biocuration Project Ideas ====
POC: Daniall Masood, Maria Kim

# Curate biomarkers for a specific disease
## The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate the biomarkers; the data provided is not yet in the biomarker data model format.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.
# Continue working on LLM methods started by volunteers over the summer.
## The data is available as well as some preliminary research and work done by previous volunteers in this area.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss them and check whether they would be feasible as a project for the Fall.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because much of the valuable metadata (e.g., species, tissue, disease, cell line) annotations are free-text terms that do not align with established standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.
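
As a rough illustration of the first step, the sketch below builds a keyword query for the NCBI E-utilities <code>esearch</code> endpoint and extracts PMIDs from its JSON response. The keyword list, the <code>[Title/Abstract]</code> restriction, and the sample response are assumptions chosen for the example, not the project's actual search strategy.

```python
import json
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(keywords, retmax=20):
    """Build a PubMed search URL matching any keyword in title or abstract."""
    term = " OR ".join(f'"{kw}"[Title/Abstract]' for kw in keywords)
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def extract_pmids(response_text):
    """Return the list of PMIDs from an esearch JSON response body."""
    data = json.loads(response_text)
    return data["esearchresult"]["idlist"]


if __name__ == "__main__":
    # Hypothetical keywords for illustration.
    print(build_esearch_url(["GlyGen", "glycoinformatics"]))

    # Placeholder response with made-up PMIDs, shaped like esearch output.
    sample = '{"esearchresult": {"count": "2", "idlist": ["11111111", "22222222"]}}'
    print(extract_pmids(sample))
```

The URL can be fetched with any HTTP client (for example <code>urllib.request.urlopen</code>); NCBI's usage guidelines on request rates and API keys should be checked before running large queries.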

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer, Tianyi Wang, Pat McNeely (optional)

Identifying relevant and useful publicly-available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly-available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet and annotated PDFs for PubMed articles relevant to prostate, lung, and breast cancers, biomarkers, and glycans, and will focus on indicators such as condition, intervention, and response.

PMID curation involves:

# Identifying potentially relevant PMIDs that may have publicly-available datasets for training intervention outcome prediction models.
# Curating indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.
# Reviewing peer curations and resolving annotation conflicts.
# Preparing a wiki page to showcase the validated PMIDs.
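
One possible way to surface annotation conflicts for peer review is sketched below. It assumes each curator's annotations are stored as a dictionary keyed by PMID, with the condition, intervention, and response indicators as fields; this layout and the sample records are hypothetical and do not reflect the project's actual spreadsheet format.

```python
# Indicator fields named in the project description.
FIELDS = ("condition", "intervention", "response")


def find_conflicts(curator_a, curator_b):
    """List (pmid, field, value_a, value_b) wherever two curators disagree.

    Only PMIDs annotated by both curators are compared.
    """
    conflicts = []
    for pmid in sorted(set(curator_a) & set(curator_b)):
        for field in FIELDS:
            value_a = curator_a[pmid].get(field)
            value_b = curator_b[pmid].get(field)
            if value_a != value_b:
                conflicts.append((pmid, field, value_a, value_b))
    return conflicts


if __name__ == "__main__":
    # Made-up records with placeholder PMIDs, for illustration only.
    a = {"00000001": {"condition": "prostate cancer",
                      "intervention": "radiotherapy",
                      "response": "PSA decline"}}
    b = {"00000001": {"condition": "prostate cancer",
                      "intervention": "chemotherapy",
                      "response": "PSA decline"}}
    for conflict in find_conflicts(a, b):
        print(conflict)
```

The output is only a list of disagreements; resolving them remains a discussion between the curators.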

Interested individuals should reach out to lorikrammer@gwu.edu.

==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====

POC: Christie Woodside, Jonathon Keeney

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but very important work that requires close attention to detail.
## Additional Work: Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would work on manual curation of circulating pathogens to be added to data.argosdb.org. Regular check-ins and reports of what was found.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.
# QC Analysis using HIVE
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.
## The results will be added to our ARGOS database.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss them and check whether they would be feasible as a project for the Fall.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 13-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|[https://www.linkedin.com/in/diya-kamalabharathy-62557935a/ Diya Kamalabharathy*]
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod; Glyco web development
|-
|Harivinay P. Gujjula*
|GlyGen
|Rene Ranzinger, Urnisha Bhuiyan, Kate Warner
|GlyGen
|-
|Sparsh Gupta*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|[https://www.linkedin.com/in/isil-erbasol-serbes/ Isil Erbasol Serbes]
|BiomarkerKB
|Daniall Masood, Maria Kim
|PredictMod, BiomarkerKB, ARGOS
|-
|Vishal Muthusekaran
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Miao Wang*
|ARGOS
|Christie Woodside, Jonathon Keeney
|ARGOS
|-
|Anika Sikka
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod
|-
|[https://www.linkedin.com/in/farah-kamila/ Farah Kamila]
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod, ARGOS, BiomarkerKB
|-
|Arhamur Rauf
|ARGOS
|Christie Woodside, Jonathon Keeney
|ARGOS, GlyGen, PredictMod
|-
|[https://www.linkedin.com/in/ashley-tien/ Ashley Tien]
|PredictMod
|Lori Krammer, Tianyi Wang
|ARGOS, PredictMod
|-
|Namrata Oruganti
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|}
<nowiki>*</nowiki>Returning volunteer.

== Fall Symposium ==
The Fall symposium will be held virtually.

'''Date:''' Nov 26th, 2025 (Wednesday)

'''Time:''' 3 - 5 PM

'''Zoom Link''' - https://gwu-edu.zoom.us/j/96518488501?jst=2

=== Agenda (All times are in Eastern Standard Time) ===
{| class="wikitable"
|+
!Time
!Project
!Presentation Title
!Presenter(s)
|-
|3 - 3:10 PM
| colspan="2" |Welcome & Introduction
|Raja Mazumder
|-
|3:10 - 3:35 PM
|PredictMod
|
* 5 min POC (Tianyi & Lori) intro
* 15 mins - PredictMod: PubMed Curation for Training an LLM for Recommendation
* 5 min QA
|Diya Kamalabharathy; Anika Sikka; Ashley Tien; Farah Kamila
|-
|3:35 - 4:00 PM
|GlyGen
|
* 5 min POC intro
* 15 mins - Curation of species metadata using LLM; Visualizing glycomics databases and their features
* 5 min QA
|Diya Kamalabharathy; Harivinay P. Gujjula
|-
|4:00 - 4:25 PM
|ARGOS
|
* 5 min POC intro
* 15 mins - Curation of Pathogens and QC Analysis for the ARGOS Project; QC analysis, representative genome selection; Curation of genomes 1 & 2
* 5 mins QA
|Miao Wang; Arhamur Rauf; Linford
|-
|4:25 - 4:50 PM
|BiomarkerKB
|
* 5 min POC (Daniall and Maria) intro
* 15 mins - Leveraging Large Language Models to collect Biomarker data from PubMed Abstracts
* 5 mins QA
|Namrata Oruganti; Vishal Muthusekaran
|-
|4:50 - 5 PM
| colspan="2" |Remarks
|Raja Mazumder
|}

Volunteership Fall 2025

2025-11-07T22:11:32Z

Twang9:

== 2025 Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

August 22, 2025, Noon (email your updated resume and projects in order of preference). An acceptance letter/email will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

August 25, 2025 | 4:00 to 5:00 PM

'''Program Dates: September 1st, 2025 – November 30th, 2025''' (13 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_2025 Summer 2025 Volunteership] (Closed)
----

=== Volunteer Expectations ===

# Minimum commitment of 10 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Regular Zoom meetings with the assigned project point of contact.
# Attend some lectures or seminars remotely (max 4-5).

'''''Important:''' If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.''
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Fall 2025. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email mazumder_lab@gwu.edu your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs for training an LLM to recommend publications with datasets for intervention outcome prediction.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
----

==== 1. BiomarkerKB Biocuration Project Ideas ====
POC: Daniall Masood, Maria Kim

# Curate biomarkers for a specific disease
## The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate biomarkers. The data provided does not follow the biomarker data model.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.
# Continue working on LLM methods started by volunteers over the summer.
## The data is available as well as some preliminary research and work done by previous volunteers in this area.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss them and check whether they would be feasible as a project for the Fall.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer, Tianyi Wang, Pat McNeely (optional)

Identifying relevant and useful publicly-available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly-available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet and annotated PDFs for PubMed articles relevant to prostate, lung, and breast cancers, biomarkers, and glycans, and will focus on indicators such as condition, intervention, and response.

PMID curation involves:

# Identifying potentially relevant PMIDs that may have publicly-available datasets for training intervention outcome prediction models.
# Curating indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.
# Reviewing peer curations and resolving annotation conflicts.
# Preparing a wiki page to showcase the validated PMIDs.

Interested individuals should reach out to lorikrammer@gwu.edu.

==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====

POC: Christie Woodside, Jonathon Keeney

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but very important work that requires close attention to detail.
## Additional Work: Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would work on manual curation of circulating pathogens to be added to data.argosdb.org. Regular check-ins and reports of what was found.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.
# QC Analysis using HIVE
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.
## The results will be added to our ARGOS database.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss them and check whether they would be feasible as a project for the Fall.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 13-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|[https://www.linkedin.com/in/diya-kamalabharathy-62557935a/ Diya Kamalabharathy*]
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod; Glyco web development
|-
|Harivinay P. Gujjula*
|GlyGen
|Rene Ranzinger, Urnisha Bhuiyan, Kate Warner
|GlyGen
|-
|Sparsh Gupta*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|[https://www.linkedin.com/in/isil-erbasol-serbes/ Isil Erbasol Serbes]
|BiomarkerKB
|Daniall Masood, Maria Kim
|PredictMod, BiomarkerKB, ARGOS
|-
|Vishal Muthusekaran
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Miao Wang*
|ARGOS
|Christie Woodside, Jonathon Keeney
|ARGOS
|-
|Anika Sikka
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod
|-
|[https://www.linkedin.com/in/farah-kamila/ Farah Kamila]
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod, ARGOS, BiomarkerKB
|-
|Arhamur Rauf
|ARGOS
|Christie Woodside, Jonathon Keeney
|ARGOS, GlyGen, PredictMod
|-
|[https://www.linkedin.com/in/ashley-tien/ Ashley Tien]
|PredictMod
|Lori Krammer, Tianyi Wang
|ARGOS, PredictMod
|-
|Namrata Oruganti
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|}
<nowiki>*</nowiki>Returning volunteer.

== Fall Symposium ==
The Fall symposium will be held virtually.

'''Date:''' Nov 26th, 2025 (Wednesday)

'''Time:''' 3 - 5 PM

'''Zoom Link''' - https://gwu-edu.zoom.us/j/96518488501?jst=2

=== Agenda (All times are in Eastern Standard Time) ===

Volunteership Fall 2025

2025-10-27T15:59:17Z

Twang9: /* Volunteers */ Deleted Ramtin

== 2025 Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

August 22, 2025, Noon (email your updated resume and projects in order of preference). An acceptance letter/email will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

August 25, 2025 | 4:00 to 5:00 PM

'''Program Dates: September 1st, 2025 – November 30th, 2025''' (13 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_2025 Summer 2025 Volunteership] (Closed)
----

=== Volunteer Expectations ===

# Minimum commitment of 10 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Regular Zoom meetings with the assigned project point of contact.
# Attend some lectures or seminars remotely (max 4-5).

'''''Important:''' If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.''
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Fall 2025. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email mazumder_lab@gwu.edu your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs for training an LLM to recommend publications with datasets for intervention outcome prediction.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
----

==== 1. BiomarkerKB Biocuration Project Ideas ====
POC: Daniall Masood, Maria Kim

# Curate biomarkers for a specific disease
## The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate biomarkers. The data provided does not follow the biomarker data model.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.
# Continue working on LLM methods started by volunteers over the summer.
## The data is available as well as some preliminary research and work done by previous volunteers in this area.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss them and check whether they would be feasible as a project for the Fall.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer, Tianyi Wang, Pat McNeely (optional)

Identifying relevant and useful publicly-available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly-available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet and annotated PDFs for PubMed articles relevant to prostate, lung, and breast cancers, biomarkers, and glycans, and will focus on indicators such as condition, intervention, and response.

PMID curation involves:

# Identifying potentially relevant PMIDs that may have publicly-available datasets for training intervention outcome prediction models.
# Curating indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.
# Reviewing peer curations and resolving annotation conflicts.
# Preparing a wiki page to showcase the validated PMIDs.

Interested individuals should reach out to lorikrammer@gwu.edu.

==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====

POC: Christie Woodside, Jonathon Keeney

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but very important work that requires close attention to detail.
## Additional Work: Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would work on manual curation of circulating pathogens to be added to data.argosdb.org. Regular check-ins and reports of what was found.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.
# QC Analysis using HIVE
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.
## The results will be added to our ARGOS database.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss them and check whether they would be feasible as a project for the Fall.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 13-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|[https://www.linkedin.com/in/diya-kamalabharathy-62557935a/ Diya Kamalabharathy*]
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod; Glyco web development
|-
|Nahom Abel*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Harivinay P. Gujjula*
|GlyGen
|Rene Ranzinger, Urnisha Bhuiyan, Kate Warner
|GlyGen
|-
|Sparsh Gupta*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Mathias Belay*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Isil Erbasol Serbes
|BiomarkerKB
|Daniall Masood, Maria Kim
|PredictMod, BiomarkerKB, ARGOS
|-
|Vishal Muthusekaran
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Adonay Awet
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Miao Wang*
|ARGOS
|Christie Woodside, Jonathon Keeney
|ARGOS
|-
|Anika Sikka
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod
|-
|[https://www.linkedin.com/in/farah-kamila/ Farah Kamila]
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod, ARGOS, BiomarkerKB
|-
|Robert Ziebich
|BiomarkerKB
|Daniall Masood, Maria Kim
|PredictMod, BiomarkerKB
|-
|Arhamur Rauf
|ARGOS
|Christie Woodside, Jonathon Keeney
|ARGOS, GlyGen, PredictMod
|-
|[https://www.linkedin.com/in/ashley-tien/ Ashley Tien]
|PredictMod
|Lori Krammer, Tianyi Wang
|ARGOS, PredictMod
|-
|Namrata Oruganti
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|}
<nowiki>*</nowiki>Returning volunteer.

Volunteership Fall 2025

2025-09-16T14:45:35Z

Twang9: /* Volunteers */

== 2025 Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

August 22, 2025, Noon (email your updated resume and projects in order of preference). An acceptance letter/email will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

August 25, 2025 | 4:00 to 5:00 PM

'''Program Dates: September 1st, 2025 – November 30th, 2025''' (13 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_2025 Summer 2025 Volunteership] (Closed)
----

=== Volunteer Expectations ===

# Minimum commitment of 10 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Regular Zoom meetings with the assigned project point of contact.
# Attend some lectures or seminars remotely (max 4-5).

'''''Important:''' If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.''
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Fall 2025. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email mazumder_lab@gwu.edu your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs to train an LLM that recommends publications with datasets useful for intervention outcome prediction.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
----

==== 1. BiomarkerKB Biocuration Project Ideas ====
POC: Daniall Masood, Maria Kim

# Curate biomarkers for a specific disease
## The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate biomarkers. The data provided is not formatted according to the biomarker data model.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.
# Continue working on LLM methods started by volunteers over the summer.
## The data is available as well as some preliminary research and work done by previous volunteers in this area.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check whether it is feasible as a project for the Fall.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because much of the valuable metadata (e.g., species, tissue, disease, cell line) annotations are free-text terms that do not align with established standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."
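The mapping problem described above can be illustrated with a minimal sketch. It assumes a small, hand-curated synonym table (hypothetical, built only from the "Homo sapiens" example in the text) and shows how terms that do not match the table are set aside for manual curation rather than guessed.

```python
import re
from typing import Optional

# Hypothetical curated synonym table: normalized free-text term -> standard name.
# The entries come from the example in the text; a real table would be built by curators.
SPECIES_SYNONYMS = {
    "human": "Homo sapiens",
    "man": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
    "homo sapiens": "Homo sapiens",
}


def normalize(term: str) -> str:
    """Lowercase, trim, and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", term.strip().lower())


def map_species(term: str) -> Optional[str]:
    """Return the standard species name, or None if the term needs manual curation."""
    return SPECIES_SYNONYMS.get(normalize(term))


def split_mapped(terms):
    """Separate terms that map automatically from those left for a curator."""
    mapped, unmapped = {}, []
    for term in terms:
        name = map_species(term)
        if name is None:
            unmapped.append(term)
        else:
            mapped[term] = name
    return mapped, unmapped
```

Misspelled or abbreviated terms that are not in the table end up in the unmapped list, which is the portion that curators would resolve by hand.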

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.
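The keyword filtering step in the list above can be sketched with the NCBI E-utilities ESearch endpoint, which is the public web API for PubMed. This is only a starting point under stated assumptions: the helper names, the choice of the title/abstract field tag, and any keywords passed in are illustrative and not part of the project. The sketch builds the request URL and parses a response without making a network call.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint for PubMed queries.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(keywords, field="tiab"):
    """Join keywords with OR, restricting each to one PubMed search field.

    "tiab" is the PubMed field tag for title/abstract; using it here is an
    assumption about how the filtering would be scoped.
    """
    return " OR ".join(f'"{kw}"[{field}]' for kw in keywords)


def build_esearch_url(keywords, retmax=100):
    """Build an ESearch URL that returns matching PMIDs as JSON."""
    params = {
        "db": "pubmed",
        "term": build_query(keywords),
        "retmode": "json",
        "retmax": retmax,
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


def extract_pmids(esearch_json):
    """Pull the PMID list out of a parsed ESearch JSON response."""
    return esearch_json.get("esearchresult", {}).get("idlist", [])
```

The URL can be fetched with any HTTP client and the decoded JSON passed to `extract_pmids`; NCBI's usage guidelines on request rates apply to real runs.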

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer, Tianyi Wang, Pat McNeely (optional)

Identifying relevant and useful publicly-available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly-available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet and annotated PDFs for PubMed articles relevant to prostate, lung, and breast cancers, biomarkers, and glycans, and will focus on indicators such as condition, intervention, and response.

PMID curation involves:

# Identify potentially relevant PMIDs that may have publicly-available datasets for training intervention outcome prediction models.
# Curate indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.
# Review peer curations and resolve annotation conflicts.
# Prepare a Wikipage to showcase the validated PMIDs.

Interested individuals should reach out to lorikrammer@gwu.edu.

==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====

POC: Christie Woodside, Jonathon Keeney

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires close attention to detail.
## Additional Work: Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would work on manual curation of circulating pathogens to be added to data.argosdb.org. Regular check-ins and reports of what was found.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.
# QC Analysis using HIVE
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.
## The results will be added to our ARGOS database.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check whether it is feasible as a project for the Fall.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 13-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|[https://www.linkedin.com/in/diya-kamalabharathy-62557935a/ Diya Kamalabharathy*]
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod; Glyco web development
|-
|Nahom Abel*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Harivinay P. Gujjula*
|GlyGen
|Rene Ranzinger, Urnisha Bhuiyan, Kate Warner
|GlyGen
|-
|Sparsh Gupta*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Mathias Belay*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Isil Erbasol Serbes
|BiomarkerKB
|Daniall Masood, Maria Kim
|PredictMod, BiomarkerKB, ARGOS
|-
|Ramtin Mashhoon
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod
|-
|Vishal Muthusekaran
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Adonay Awet
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Miao Wang*
|ARGOS
|Christie Woodside, Jonathon Keeney
|ARGOS
|-
|Anika Sikka
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod
|-
|[https://www.linkedin.com/in/farah-kamila/ Farah Kamila]
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod, ARGOS, BiomarkerKB
|-
|Robert Ziebich
|BiomarkerKB
|Daniall Masood, Maria Kim
|PredictMod, BiomarkerKB
|-
|Arhamur Rauf
|ARGOS
|Christie Woodside, Jonathon Keeney
|ARGOS, GlyGen, PredictMod
|-
|[https://www.linkedin.com/in/ashley-tien/ Ashley Tien]
|PredictMod
|Lori Krammer, Tianyi Wang
|ARGOS, PredictMod
|-
|Namrata Oruganti
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|}
<nowiki>*</nowiki>Returning volunteer.

PredictMod Automated Pipeline

2025-08-27T15:32:15Z

Twang9: added file type instruction

← <small>Go Back to the [[PredictMod|PredictMod Home Page]].</small>

== Overview ==
We have created an [https://hivelab.biochemistry.gwu.edu/predictmod/automated-pipeline automated model training pipeline] for researchers who have a significant collection of data without the programming expertise required to create and train such models. Such researchers may use this automated pipeline to directly upload data.

Upon upload of training data, the PredictMod platform performs several consecutive steps. Data are first inspected for general suitability for modelling, with typical errors such as missing values flagged for the user to edit as appropriate.
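The inspection step described above can be sketched as a scan for missing cells. This is a minimal illustration, not the platform's implementation: the set of markers treated as "missing" and the function name are assumptions.

```python
import csv
import io

# Assumed markers for a missing value; the platform's actual rules are not documented here.
MISSING_MARKERS = {"", "na", "nan", "null"}


def find_missing(csv_text):
    """Return (row_number, column_name) pairs for every missing cell.

    Row numbers are 1-based and count data rows only (the header is excluded).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    flagged = []
    for row_number, row in enumerate(reader, start=1):
        for column, value in row.items():
            if column is None:
                continue  # extra cells beyond the header are ignored in this sketch
            if value is None or value.strip().lower() in MISSING_MARKERS:
                flagged.append((row_number, column))
    return flagged
```

A report like this gives the user the exact rows and columns to edit before training is attempted.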

Next, given an appropriately formatted data set with all such errors removed, the platform provides the user with results from several ML and statistical algorithms, offering both additional insight into the data and intervention outcome prediction models for use with future data points. Specifically, the user is given Principal Component Analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) clustering outputs as a means to explore the data for clear patterns or other such low-hanging fruit. The user is also provided with trained models and the corresponding confusion matrices and Receiver Operating Characteristic (ROC) curve analyses for a diverse family of ML models, including Random Forest, Decision Tree Classifiers, Support Vector Machines, Logistic Regression, and Boosting algorithms. The trained models are retained within the PredictMod platform for the user to further inspect, use with additional data points, download, or eventually publish to the broader PredictMod ecosystem.
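For readers unfamiliar with the evaluation outputs mentioned above, the following sketch shows how a confusion matrix and a single ROC point are derived from true and predicted labels. It assumes a binary response with "R" as the positive class and "NR" as the negative class; the function names are illustrative and do not describe the platform's code.

```python
def confusion_matrix(y_true, y_pred, positive="R"):
    """Count true/false positives and negatives for a binary response."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def roc_point(counts):
    """Return (false positive rate, true positive rate) for one confusion matrix."""
    positives = counts["tp"] + counts["fn"]
    negatives = counts["fp"] + counts["tn"]
    tpr = counts["tp"] / positives if positives else 0.0
    fpr = counts["fp"] / negatives if negatives else 0.0
    return fpr, tpr
```

An ROC curve is traced by repeating this calculation while varying the score threshold at which a sample is called positive.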

Additional resources for each algorithm are provided to the user for each resulting model in the form of links to both algorithm descriptions and original source references that describe the underlying algorithms.

== Usage Guide ==

# Log in to the [https://hivelab.biochemistry.gwu.edu/predictmod/ PredictMod Platform]
# Navigate to "More" -> "Automated Pipeline"
# Train a model
## The model needs a name to group resulting models under.
## There needs to be a response column - this is the column in training data containing ground-truth labels (often "R", "NR"; the column is often named "Status" or "Response")
## There can be/should be columns to drop, such as sample names, etc.
## Note: only CSV and Excel files (under 13MB) are supported. Other file types, such as Numbers, are not supported.
# Once trained, click on "New sample with training model"
## Select a family of models as needed
## Choose models to sample
## Upload the new data point
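Before uploading, the training requirements above can be checked locally. The sketch below is a hedged pre-flight check, not part of PredictMod: the accepted extensions are an assumption about what "CSV and Excel" covers, the size limit assumes 1 MB = 1024 * 1024 bytes, and both function names are hypothetical.

```python
import os

# Limits taken from the note above; the extension list is an assumption
# about what "CSV and Excel" means in practice.
ALLOWED_EXTENSIONS = {".csv", ".xls", ".xlsx"}
MAX_BYTES = 13 * 1024 * 1024  # "under 13MB", assuming 1 MB = 1024 * 1024 bytes


def check_upload(filename, size_bytes):
    """Return a list of problems with a file; an empty list means it looks acceptable."""
    problems = []
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or 'none'}")
    if size_bytes >= MAX_BYTES:
        problems.append("file must be under 13 MB")
    return problems


def split_columns(header, response_column, drop_columns=()):
    """Split a header row into feature columns and the response column."""
    if response_column not in header:
        raise ValueError(f"response column {response_column!r} not found")
    excluded = set(drop_columns) | {response_column}
    features = [name for name in header if name not in excluded]
    return features, response_column
```

For example, with a header of `Sample`, `Age`, and `Status`, choosing `Status` as the response and dropping `Sample` leaves `Age` as the only feature column.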

Volunteership Fall 2025

2025-08-21T17:26:22Z

Twang9: /* 4. PredictMod Machine Learning Project Ideas */

== 2025 Volunteer Program Details ==

=== Dates ===
'''Application Deadline'''

August 22, 2025, Noon (email your updated resume and projects in order of preference). Acceptance letters/emails will be sent to candidates no later than the day after the kick-off meeting.

'''Volunteer Zoom Kick-Off Meeting'''

August 25, 2025 | 4:00 to 5:00 PM

'''Program Dates: September 1st, 2025 – November 30th, 2025''' (13 weeks)

Remote | Hybrid for GW employees and students (Ross Hall 5th floor)

[https://hivelab.biochemistry.gwu.edu/wiki/Volunteership_2025 Summer 2025 Volunteership] (Closed)
----

=== Volunteer Expectations ===

# Minimum commitment of 10 hours per week.
# Progress updates via Slack at least 3 days per week (scrum).
# Regular Zoom meetings with the assigned project point of contact.
# Attend some lectures or seminars remotely (max 4-5).

'''''Important:''' If the scrum is not updated for 2 consecutive working days, the candidate will be automatically dropped from the program.''
----

=== Potential Projects ===
We are excited to continue our bioinformatics volunteership program in Fall 2025. This program offers students the opportunity to work on bioinformatics projects supported by agencies such as the NIH, ARPA-H, and FDA. Participants will gain exposure to a variety of activities within a bioinformatics lab, including data analysis, computational biology, and genomics. If you are interested, please email mazumder_lab@gwu.edu your resume and a ranked list of the projects that interest you most. You can also indicate if you want to focus on specific areas that are of interest to you.
# BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
# GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
# ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
# PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Curate PMIDs to train an LLM that recommends publications with datasets useful for intervention outcome prediction.

''Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.''
----

==== 1. BiomarkerKB Biocuration Project Ideas ====
POC: Daniall Masood, Maria Kim

# Curate biomarkers for a specific disease
## The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
## The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details with data collected in the first 4 weeks as training data/example data.
# Top 50 biomarkers
## Curate the top 50 biomarkers for biomarkerkb.org.
## Define what constitutes a top 50 biomarker.
## Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
# Biocuration of biomarkers from NLP/LLM work
## Use the biomarkers collected from NLP work.
## Curate biomarkers. The data provided is not formatted according to the biomarker data model.
## While curating the biomarkers, check if data collected from NLP is correct.
## After completion, the student can start using curated data to work on the NLP/LLM method.
# Curate biomarkers for a treatment
## See #1 above.
# Continue working on LLM methods started by volunteers over the summer.
## The data is available as well as some preliminary research and work done by previous volunteers in this area.

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check whether it is feasible as a project for the Fall.

==== 2. GlyGen Biocuration Project Ideas ====
POC: Rene Ranzinger, Urnisha Bhuiyan, Kate Warner

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because much of the valuable metadata (e.g., species, tissue, disease, cell line) annotations are free-text terms that do not align with established standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

# Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
# Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
# Finding papers based on titles and author lists that may contain spelling errors.
# Interacting and discussing with other curators in case terms are mapped differently.

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

==== 3. GlyGen Publication Analysis Project Ideas ====

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data. We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

# Using the PubMed web API to filter publications based on keywords.
# Analyzing paper abstracts to identify research institutions and groups that form the community.
# Filtering the community list to exclude unrelated co-authors.
# Prioritizing papers identified by GlycoSiteMiner for curation via TableMaker.

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation. Source code developed as part of this project will be documented and shared in a public GitHub repository. If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

==== 4. PredictMod Machine Learning Project Ideas ====
POC: Lori Krammer, Tianyi Wang, Pat McNeely (optional)

Identifying relevant and useful publicly-available datasets for machine learning is currently a resource-intensive task. This curation project aims to develop a corpus for training an AI model to recommend PMIDs with publicly-available datasets useful for intervention outcome prediction models. The corpus will include an annotation spreadsheet and annotated PDFs for PubMed articles relevant to prostate, lung, and breast cancers, biomarkers, and glycans, and will focus on indicators such as condition, intervention, and response.

PMID curation involves:

# Identify potentially relevant PMIDs that may have publicly-available datasets for training intervention outcome prediction models.
# Curate indicators of useful ML publications that could be used to train an LLM to recommend relevant publications for cancer modeling.
# Review peer curations and resolve annotation conflicts.
# Prepare a Wikipage to showcase the validated PMIDs.
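The peer-review step in the list above can be supported by a simple comparison of two curators' annotations. The sketch below assumes each curator's work is available as a mapping from PMID to indicator values; that data layout, the function name, and the placeholder values in any example call are hypothetical. The indicator names come from the project description.

```python
# Indicator fields named in the project description.
INDICATORS = ("condition", "intervention", "response")


def find_conflicts(curator_a, curator_b):
    """Compare two curators' annotations and list the disagreements.

    Each argument maps a PMID to a dict of indicator values. Only PMIDs
    annotated by both curators are compared.
    """
    conflicts = []
    for pmid in sorted(set(curator_a) & set(curator_b)):
        for field in INDICATORS:
            value_a = curator_a[pmid].get(field)
            value_b = curator_b[pmid].get(field)
            if value_a != value_b:
                conflicts.append((pmid, field, value_a, value_b))
    return conflicts
```

Each returned tuple names the PMID, the indicator in dispute, and both values, which gives reviewers a short list to resolve instead of re-reading every annotation.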

Interested individuals should reach out to lorikrammer@gwu.edu.

==== 5. FDA-ARGOS Computation and Pathogen Curation Project ====

POC: Christie Woodside, Jonathon Keeney

# Update data tables for more efficient computations
## Student would review and input additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires close attention to detail.
## Additional Work: Requires Python/shell coding background. Student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed in case of errors, bugs, or other mishaps in the code. Ongoing work as computations are performed.
# Curate and report on current pathogens to upload to ARGOS
## Student would work on manual curation of circulating pathogens to be added to data.argosdb.org. Regular check-ins and reports of what was found.
## Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
## Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.
# QC Analysis using HIVE
## Analyze the curated pathogens using our QC ARGOS one-click pipeline.
## The results will be added to our ARGOS database.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check whether it is feasible as a project for the Fall.
----

=== Requirements for Completion ===
'''Note:''' The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

==== Documentation ====
All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

==== Written Report ====
Submit a 1–2 page summary of your tasks and accomplishments to the Admin during the final week of your program.

==== Presentation & Slide Submission ====
Present your work during the last week of the 13-week period.

Slides must be submitted to the POCs.
----

=== Completion Certificate ===
A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.
----

=== Contact ===
mazumder_lab@gwu.edu.
----

=== Volunteers (TBD) ===
{| class="wikitable"
|+
!Name
!Project Assigned
!POC Assigned
!Projects Interested
|-
|Diya Kamalabharathy*
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod; Glyco web development
|-
|Anika Sikka
|
|
|GlyGen
|-
|Akale Kinfe*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Nahom Abel*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Harivinay P. Gujjula
|
|
|GlyGen
|-
|Sparsh Gupta*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Mathias Belay*
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Isil Erbasol Serbes
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod, BiomarkerKB, ARGOS
|-
|Ramtin Mashhoon
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod
|-
|Anagha Kalle
|PredictMod
|Lori Krammer, Tianyi Wang
|PredictMod
|-
|Vishal Muthusekaran
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Adonay Awet
|BiomarkerKB
|Daniall Masood, Maria Kim
|BiomarkerKB
|-
|Miao Wang*
|
|
|ARGOS
|}
<nowiki>*</nowiki>Returning volunteer.

GW-FEAST

2025-03-04T21:24:56Z

Twang9: /* GW-FEAST Project Architecture */

Federated Ecosystems for Analytics and Standardized Technologies ([https://hivelab.biochemistry.gwu.edu/gw-feast FEAST]) is a cloud-based, agile bioinformatics and data analysis platform under development through the ARPA-H Biomedical Data Fabric (BDF) toolbox program. The project is led by [https://dnahive.com DNA-HIVE]; other funded collaborators include Cornell University, Vanderbilt University, Georgetown University, the European Bioinformatics Institute, and Kaiser Permanente. Our team is responsible for the GW instance of FEAST (GW-FEAST) and for co-leading the project with DNA-HIVE. This project is part of the ARPA-H FEAST performer team initiative to create bridges across data silos and make health data more accessible and usable.

Several hospitals and cancer centers will have a FEAST platform, which enables cross-site data analysis without the need to export or transform the data. Currently, large chunks of data are used by insurance companies, pharmaceutical companies, and others for research and development purposes. The FEAST platform, which is particularly strong with noisy, real-world data, aims to enable more precise data selection for research use while preserving patient privacy. When clinical data is submitted to the suite of tools, submission is handled via the HL7 FHIR protocol, ensuring only authorized parties ever have access to protected data. Models that provide update mechanisms such as online training will be updated appropriately without retaining any personally identifiable information (PII). Thus, these tools support federated data sets and training without ever retaining clinical PII within the system. All services run as independent, containerized microservices in Docker containers.

[https://drive.google.com/file/d/1iv9VmFhNbd-5iwSwDMLVumCFnFN84cl8/view?usp=drive_link FEAST Video]

=== GW-FEAST Project Architecture ===
[[File:GW-FEAST_architecture_v1.1.jpg|center|frameless|950x950px]]

This figure depicts the current GW instance of FEAST and is subject to change throughout the life of the project.

==== [[GW-FEAST Data Access Portal]] ====
GW-FEAST data is access-controlled. To gain access, please email mazumder_lab@gwu.edu.
Users who have access should connect to the GW VPN and then go to the [https://feast.mgpc.biochemistry.gwu.edu/dsviewer '''data access portal'''] and log in.