Volunteership 2025: Difference between revisions

Revision as of 14:31, 11 April 2025

2025 Volunteer Program Details

Dates

June 2nd, 2025 – July 25th, 2025 (8 weeks)
Monday to Friday | Remote | No breaks

Volunteer Expectations

Daily progress updates via Slack (scrum).
Regular Zoom meetings with the assigned project point of contact.
Expected to dedicate 5–6 hours per day to project work, with the remaining time focused on skill development or reading.

Important: If the scrum is not updated for 2 consecutive days, the candidate will be automatically dropped from the program.

Potential Projects

BiomarkerKB (biomarkerkb.org) project: Biomarker curation project. Involves reading papers and collecting biomarkers.
GlyGen (glygen.org) project: Review glycomics and glycoproteomics data and curate tissue, disease, and other related information.
ARGOS (argosdb.org) project: Analyze genomics data using HIVE to identify reference genome assemblies.
PredictMod (hivelab.biochemistry.gwu.edu/predictmod) project: Identify datasets and harmonize them so that they can be used to generate ML models.

Note: Individuals involved in the above projects with a background in programming and/or machine learning may also undertake additional tasks to support the development of ML models, which can be integrated into PredictMod or used to enhance AI/ML-ready datasets within GlyGen.

BiomarkerKB Biocuration Project Ideas

POC: Daniall Masood

Curate biomarkers for a specific disease (Alzheimer's disease)
1. The student would do manual curation for about 4 weeks, with regular check-ins with the POC to ensure it is being done correctly.
2. The next 4 weeks can be dedicated to developing an LLM or an automated process to extract biomarker details, using the data collected in the first 4 weeks as training/example data.
Top 50 biomarkers
1. Curate the top 50 biomarkers for biomarkerkb.org.
2. Define what constitutes a top 50 biomarker.
3. Begin curating biomarkers from different sources and papers by collecting fields mentioned in the data model, as well as collecting cross-references.
Biocuration of biomarkers from NLP/LLM work
1. Use the biomarkers collected from NLP work.
2. Curate the biomarkers. The data was not provided in the format of the biomarker data model.
3. While curating the biomarkers, check if data collected from NLP is correct.
4. After completion, the student can start using curated data to work on the NLP/LLM method.
Curate biomarkers for a treatment
1. See the first project idea above ("Curate biomarkers for a specific disease").
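
The check described in the NLP/LLM idea above (verifying NLP-collected data while curating) could be sketched roughly as follows. This is only an illustration: the record layout and the field names ("biomarker" as the matching key, plus whatever other fields a record carries) are hypothetical and are not taken from the BiomarkerKB data model.

```python
def compare_records(nlp_records, curated_records, key="biomarker"):
    """Compare NLP-extracted records to manually curated ones, field by field.

    Both inputs are lists of dicts. The `key` field name is a hypothetical
    identifier used only to pair up records.
    """
    curated_by_key = {r[key].lower(): r for r in curated_records}
    report = {"confirmed": [], "corrected": [], "unreviewed": []}
    for rec in nlp_records:
        curated = curated_by_key.get(rec[key].lower())
        if curated is None:
            # No curated counterpart yet, so nothing can be said about it.
            report["unreviewed"].append(rec[key])
            continue
        # Fields where the NLP value differs from the curated value.
        diffs = sorted(f for f in rec if f != key and rec[f] != curated.get(f))
        if diffs:
            report["corrected"].append((rec[key], diffs))
        else:
            report["confirmed"].append(rec[key])
    return report


def accuracy(report):
    """Share of reviewed NLP records that needed no correction."""
    reviewed = len(report["confirmed"]) + len(report["corrected"])
    return len(report["confirmed"]) / reviewed if reviewed else None
```

A report of this kind shows which fields the NLP output tends to get wrong, which may help when the curated data is later reused for the NLP/LLM method.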

If you have any other ideas, diseases, treatments, or methods you want to focus on, please reach out to daniallmasood@gwu.edu to discuss your idea and check whether it is feasible as a project for the summer.

GlyGen Biocuration Project Ideas

POC: Rene Ranzinger and Urnisha Bhuiyan

Over the last three decades, numerous glycomics database projects have been initiated to collect valuable information about glycans, proteins, and their interactions. Some of these databases have been discontinued due to the end of project funding. However, the data within these databases remains highly valuable to the community. Integrating these datasets into modern databases or knowledgebases, such as GlyGen, presents a challenge because many of the valuable metadata annotations (e.g., species, tissue, disease, cell line) are free-text terms that do not align with the standard dictionaries and ontologies used in modern resources. Automated matching of this information with dictionaries or ontologies is often not possible due to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h. sapiens" all map to the scientific species name "Homo sapiens."
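
To illustrate why manual curation is needed, here is a minimal sketch of what automated matching can and cannot do. It assumes a small hand-written synonym table (hypothetical, containing only the species example from the text) and uses Python's standard `difflib` module for approximate matching. Terms that match neither exactly nor approximately are flagged for a curator.

```python
import difflib

# Hypothetical synonym table built from the example in the text. A real
# table would be derived from a standard dictionary or ontology.
SYNONYMS = {
    "human": "Homo sapiens",
    "man": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
    "homo sapiens": "Homo sapiens",
}


def normalize(term):
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(term.lower().split())


def map_term(term, cutoff=0.8):
    """Return (standard_name, method); (None, "manual") means a curator must decide."""
    key = normalize(term)
    if key in SYNONYMS:
        return SYNONYMS[key], "exact"
    # Approximate match to catch simple spelling errors.
    close = difflib.get_close_matches(key, list(SYNONYMS), n=1, cutoff=cutoff)
    if close:
        return SYNONYMS[close[0]], "fuzzy"
    return None, "manual"
```

Fuzzy matches are only suggestions. A misspelled term can be close to the wrong entry, so a curator would still be expected to confirm them.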

The GlyGen project aims to make datasets from two older databases (CarbBank, CFG) accessible by migrating the data and metadata into our database. For this project, we are seeking curators with a medical or biology background who are interested in helping map metadata terms from these old databases to standard dictionaries and ontologies.

The project involves:

Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old database.
Mapping identified terms to corresponding dictionaries and ontologies using the webpages and search interfaces of these projects.
Finding papers based on titles and author lists that may contain spelling errors.
Interacting and discussing with other curators in case terms are mapped differently.

If you have any other ideas or methods you would like to focus on, please reach out to rene@ccrc.uga.edu to discuss them.

GlyGen Publication Analysis Project

POC: Rene Ranzinger and Urnisha Bhuiyan

One of the challenges for any bioinformatics project is understanding the size of its community, how well the project serves this community, and how widely its software/database is used. A potential solution is to analyze PubMed publication data.

We are seeking applicants with programming skills (in Python or Java) to perform this analysis.

The project involves:

Using the PubMed web API to filter publications based on keywords.
Analyzing paper abstracts to identify research institutions and groups that form the community.
Filtering the community list to exclude unrelated co-authors.
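
As a starting point for the first step, the following sketch builds a keyword query for the NCBI E-utilities ESearch endpoint and parses its JSON reply. The function names and the choice to search the `[Title/Abstract]` field are assumptions made for illustration, not project requirements.

```python
import json
import urllib.parse
import urllib.request

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(keywords, retmax=20):
    """Build an ESearch URL matching any of the keywords in title or abstract."""
    term = " OR ".join(f'"{kw}"[Title/Abstract]' for kw in keywords)
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return ESEARCH_URL + "?" + urllib.parse.urlencode(params)


def parse_esearch(payload):
    """Extract the total hit count and the list of PMIDs from an ESearch JSON reply."""
    result = payload["esearchresult"]
    return int(result["count"]), list(result["idlist"])


def search_pubmed(keywords, retmax=20):
    """Run the query over the network and return (count, pmids)."""
    with urllib.request.urlopen(build_esearch_url(keywords, retmax)) as resp:
        return parse_esearch(json.load(resp))
```

For example, `search_pubmed(["GlyGen"])` would return the number of matching records and up to 20 PMIDs. Abstracts and author affiliations for those PMIDs would then need a separate fetch step.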

A subproject will involve analyzing the full text of papers (when available) for keywords or resource and database names. The results of the analysis will be discussed with GlyGen project members, who will suggest changes and improvements to the analysis and data presentation.
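
The full-text subproject could begin with something as simple as counting whole-word mentions of resource names. The sketch below is one possible approach. The name list is only an example, reusing names that appear on this page, and the function name is hypothetical.

```python
import re
from collections import Counter

# Example names only; the real list would be agreed on with the project team.
RESOURCES = ["GlyGen", "CarbBank", "CFG"]


def count_mentions(text, names=RESOURCES):
    """Count case-insensitive, whole-word mentions of each name in the text."""
    counts = Counter()
    for name in names:
        pattern = r"\b" + re.escape(name) + r"\b"
        hits = re.findall(pattern, text, flags=re.IGNORECASE)
        if hits:
            counts[name] = len(hits)
    return counts
```

Short names such as "CFG" may also match unrelated abbreviations, so counts of this kind would need manual spot checks before being reported.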

Source code developed as part of this project will be documented and shared in a public GitHub repository.

If you have any other ideas or methods you would like to explore, please reach out to rene@ccrc.uga.edu to discuss them.

PredictMod Machine Learning Project Ideas

POC: Lori Krammer

Data Identification & Harmonization:

Identify publicly-available datasets from scientific literature that can be used for intervention outcome prediction models.
Conduct data harmonization and pre-processing following established project pipelines to produce an ML-ready dataset and data dictionary.
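
To give a sense of what harmonization and a data dictionary can look like in practice, here is a minimal, generic sketch. It is not the established PredictMod pipeline. The naming convention (lowercase snake_case) and the dictionary fields (`column`, `type`, `missing`) are assumptions chosen for illustration.

```python
import re


def harmonize_name(name):
    """Convert a raw column header to lowercase snake_case."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_")
    return cleaned.lower()


def infer_type(values):
    """Guess 'numeric', 'text', or 'empty' from a column's non-missing values."""
    present = [v for v in values if v not in ("", None)]
    try:
        for v in present:
            float(v)
    except (TypeError, ValueError):
        return "text"
    return "numeric" if present else "empty"


def harmonize(rows):
    """Rename columns and build a simple data dictionary from a list of dict rows."""
    clean_rows = [{harmonize_name(k): v for k, v in row.items()} for row in rows]
    columns = list(clean_rows[0]) if clean_rows else []
    dictionary = [
        {
            "column": col,
            "type": infer_type([r.get(col) for r in clean_rows]),
            "missing": sum(1 for r in clean_rows if r.get(col) in ("", None)),
        }
        for col in columns
    ]
    return clean_rows, dictionary
```

The rows could come from `csv.DictReader`. A real data dictionary would normally also record units, allowed values, and a description for each column.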

Modeling & Integration (for those with experience in programming/ML)

Perform model training and document the ML pipeline in a BioCompute Object (BCO).
Integrate the model into the PredictMod platform.

Individuals with a background or interest in machine learning should reach out to lorikrammer@gwu.edu with a potential dataset to determine if it is a feasible project for the summer.

FDA-ARGOS Computation and Pathogen Curation Project

POC: Christie Woodside

Update data tables for more efficient computations
1. The student would review and enter additional data and IDs in the tables/sheets used to perform computations. This is manual but important work that requires close attention to detail (~1 week of work).
2. Requires a Python/shell coding background. The student would run scripts that prepare and format data tables that are pushed to data.argosdb.org. Coding knowledge is needed to handle errors or bugs in the code. This work is ongoing as computations are performed.
Curate and report on current pathogens to upload to ARGOS
1. The student would manually curate circulating pathogens to be added to data.argosdb.org, with regular check-ins and reports on what was found (~4-10 weeks of work).
2. Locate assembly IDs, reads, and metagenomic information for these pathogens to be used in computations and deposited into data.argosdb.org.
3. Provide documentation on why they were curated, why they are important, how they were selected, and how data was collected.

If you have any other ideas or methods you want to focus on, please reach out to christie.woodside@email.gwu.edu to discuss your idea and check whether it is feasible as a project for the summer.

Requirements for Completion

Note: The following are mandatory. Failure to complete any will result in an incomplete volunteer record.

Documentation

All volunteers must maintain adequate documentation of their work, including written protocols and scripts submitted to GitHub.

Written Report

Submit a 1–2 page summary of your tasks and accomplishments to the Admin Team during the final week of your program.

Presentation & Slide Submission

Present your work during the last week of the 8-week period.

Slides must be submitted to the Admin Team and should include:

A title slide with your name, date, and mentor
At least 3 content slides
A final slide with acknowledgements or references

Contact the Admin Team to access previously submitted slides.

Completion Certificate

A certificate of completion and a letter of recommendation will be provided to all participants who successfully complete the program.

Contact

mazumder_lab@gwu.edu.

Volunteers


Name	Skills	Projects of Interest
Grace Chong	Python, Machine Learning, NLP, Analysis & Mathematics	PredictMod, BiomarkerKB Biocuration, GlyGen Biocuration
Alma Ogunsina	Molecular Biology, Python, ML, and Data Analysis	BiomarkerKB, ARGOS, PredictMod
Diya Kamalabharathy	Computational Biology, Python Programming, Molecular Biology Techniques, Scientific Writing, Data Analysis	BiomarkerKB Biocuration, PredictMod Machine Learning, GlyGen Biocuration

@@ Line 41: / Line 41: @@
 POC: Rene Ranzinger and Urnisha Bhuiyan
-Using TableMaker in GlyGen, individuals will curate glycomics and glycoproteomics data from previous database resources that are now defunct. There might also be biocuration projects that inolve curating papers.
+Over the last three decades, numerous glycomics database projects have been initiated to collect
+valuable information about glycans, proteins, and their interactions. Some of these databases
+have been discontinued due to the end of project funding. However, the data within these
+databases remains highly valuable to the community. Integrating these datasets into modern
+databases or knowledgebases, such as GlyGen, presents a challenge because much of the
+valuable metadata (e.g., species, tissue, disease, cell line) annotations are free-text terms that do
+not align with established standard dictionaries and ontologies used in modern resources.
+Automated matching of this information with dictionaries or ontologies is often not possible due
+to the use of synonyms, spelling errors, or abbreviations. For example, "human," "man," and "h.
+sapiens" all map to the scientific species name "Homo sapiens."
+The GlyGen project aims to make datasets from two older databases (CarbBank, CFG)
+accessible by migrating the data and metadata into our database. For this project, we are seeking
+curators with a medical or biology background who are interested in helping map metadata terms
+from these old databases to standard dictionaries and ontologies.
+'''The project involves:'''
+* Using internet resources (e.g., Google, Wikipedia) to identify terms used in the old
+* database.
+* Mapping identified terms to corresponding dictionaries and ontologies using the
+* webpages and search interfaces of these projects.
+* Finding papers based on titles and author lists that may contain spelling errors.
+* Interacting and discussing with other curators in case terms are mapped differently.
+If you have any other ideas or methods you would like to focus on, please reach out to
+[[rene@ccrc.uga.edu]] to discuss them.
+'''GlyGen Publication Analysis Project'''
+POC: Rene Ranzinger and Urnisha Bhuiyan
+One of the challenges for any bioinformatics project is understanding the size of its community,
+how well the project serves this community, and how widely its software/database is used. A
+potential solution is to analyze PubMed publication data.
+We are seeking applicants with programming skills (in Python or Java) to perform this analysis.
+The project involves:
+* Using the PubMed web API to filter publications based on keywords.
+* Analyzing paper abstracts to identify research institutions and groups that form the
+* community.
+* Filtering the community list to exclude unrelated co-authors.
+A subproject will involve analyzing the full text of papers (when available) for keywords or
+resource and database names. The results of the analysis will be discussed with GlyGen project
+member who will suggest changes and improvements to the analysis and data presentation.
+Source code developed as part of this project will be documented and shared in a public GitHub
+repository.
+If you have any other ideas or methods you would like to explore, please reach out to
+rene@ccrc.uga.edu to discuss them.
 ==== PredictMod Machine Learning Project Ideas ====
