BioMuta pipeline README

From HIVE Lab
Revision as of 16:23, 2 October 2024 by Hivelabwikiadmin (Added sharepoint link)
Under construction

This article is still under construction and should not be nominated for deletion.

Originally updated by Ned Cauley (August 2022); currently maintained by Maria Kim (September 2024).

This page will contain an updated version of this BioMuta documentation page.

Description

The BioMuta pipeline gathers mutation data from various sources and combines them into a single dataset with a common field structure.

The sources included in BioMuta are:

BioMuta gathers mutation data for the following cancers:

  • Urinary Bladder Cancer (DOID:11054)
  • Breast Cancer (DOID:1612)
  • Colorectal Cancer (DOID:9256)
  • Esophageal Cancer (DOID:5041)
  • Head and Neck Cancer (DOID:11934)
  • Kidney Cancer (DOID:263)
  • Liver Cancer (DOID:3571)
  • Lung Cancer (DOID:1324)
  • Prostate Cancer (DOID:10283)
  • Stomach Cancer (DOID:10534)
  • Thyroid Gland Cancer (DOID:1781)
  • Uterine Cancer (DOID:363)
  • Cervical Cancer (DOID:4362)
  • Brain Cancer (DOID:1319)
  • Hematologic Cancer (DOID:2531)
  • Adrenal Gland Cancer (DOID:3953)
  • Pancreatic Cancer (DOID:1793)
  • Ovarian Cancer (DOID:2394)
  • Skin Cancer (DOID:4159)
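
For scripts that need to label records by cancer type, the list above can be kept as a simple lookup table. The following is a minimal sketch only: the names DOID_TO_CANCER and cancer_name are hypothetical and are not taken from the BioMuta repository.

```python
# Lookup table transcribed from the cancer list above (DOID -> cancer name).
DOID_TO_CANCER = {
    "DOID:11054": "Urinary Bladder Cancer",
    "DOID:1612": "Breast Cancer",
    "DOID:9256": "Colorectal Cancer",
    "DOID:5041": "Esophageal Cancer",
    "DOID:11934": "Head and Neck Cancer",
    "DOID:263": "Kidney Cancer",
    "DOID:3571": "Liver Cancer",
    "DOID:1324": "Lung Cancer",
    "DOID:10283": "Prostate Cancer",
    "DOID:10534": "Stomach Cancer",
    "DOID:1781": "Thyroid Gland Cancer",
    "DOID:363": "Uterine Cancer",
    "DOID:4362": "Cervical Cancer",
    "DOID:1319": "Brain Cancer",
    "DOID:2531": "Hematologic Cancer",
    "DOID:3953": "Adrenal Gland Cancer",
    "DOID:1793": "Pancreatic Cancer",
    "DOID:2394": "Ovarian Cancer",
    "DOID:4159": "Skin Cancer",
}


def cancer_name(doid):
    """Return the cancer name for a DOID, or None if it is not in the list."""
    # Normalize whitespace and case so "doid:1612 " also matches.
    return DOID_TO_CANCER.get(doid.strip().upper())
```

A lookup that returns None for unknown identifiers lets a caller decide whether to skip or report records outside the covered cancer types.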

Running the Pipeline

To run the BioMuta pipeline, download the scripts from the HIVE Lab GitHub repo: GW HIVE BioMuta Repository.

Pipeline Overview

Step 1: Download

In the download step, mutation lists are downloaded from each source. Refer to each source below for download instructions.

Download: TCGA

Annotated variant files are downloaded from the ISB-CGC BigQuery repository.

Field descriptions for the BigQuery output are available in field_names_descriptions.csv. Additional field descriptions are available in the GDC docs.

The list of studies used in TCGA can be found here: List of TCGA studies.

Gain access to data

There are two parts to obtaining data from TCGA:

1. Primary TCGA data

2. TCGA controlled-access data

  • Hosted at dbGaP. For information on how to get access, see Sharepoint.

Run the downloader R script using RStudio

Required script: TCGA_mutation_download.R

Run each line one after the other, instead of the whole script at once.

Loading the library with library(bigrquery) and then calling bq_project_query() (later in the script) opens a browser window to log in with Google credentials.

  • Use the Google account that is registered for an ISB-CGC project and has dbGaP authorization.
  • After logging in, a token will be saved so that you can log in through RStudio instead.

This script will download all mutation data for TCGA.

Since the downloaded file is very large, there might be issues running this script. If this is the case, run the following scripts in the tcga folder:

These scripts will download a set of the TCGA studies, so that the downloaded file size is smaller.
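
The idea of downloading a subset of studies at a time can be sketched as follows. This is an illustration in Python, not the R scripts from the repository: the function names, the table name used in the usage note, and the project_short_name column are assumptions, and the study names are only examples.

```python
import re


def batch_studies(studies, batch_size):
    """Split a list of study names into consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [studies[i:i + batch_size] for i in range(0, len(studies), batch_size)]


def build_query(table, studies):
    """Build a SQL query restricted to the given studies.

    The column name project_short_name is an assumption about the table layout.
    """
    for study in studies:
        # Reject anything that is not a plain identifier before quoting it.
        if not re.fullmatch(r"[A-Za-z0-9_-]+", study):
            raise ValueError("unexpected study name: {!r}".format(study))
    quoted = ", ".join("'{}'".format(study) for study in studies)
    return "SELECT * FROM `{}` WHERE project_short_name IN ({})".format(table, quoted)
```

For example, batch_studies(["TCGA-BRCA", "TCGA-LUAD", "TCGA-COAD"], 2) yields two batches, and each batch can be passed to build_query with a placeholder table such as "my-project.my_dataset.mutations" to produce one smaller query per batch.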

Additional information can be found here: TCGA Additional Information.

Download: CIViC

A VCF for the monthly release of accepted variants was downloaded from: https://civicdb.org/releases/main
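
Once downloaded, a VCF can be read line by line. The sketch below assumes only the standard VCF layout (header lines starting with "#", then tab-separated CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO columns). The function names are hypothetical, and the specific INFO keys present in the CIViC file are not assumed.

```python
def parse_vcf_line(line):
    """Parse one VCF data line into a dictionary of its fixed fields."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, _qual, _filter, info = fields[:8]
    info_dict = {}
    for entry in info.split(";"):
        if not entry or entry == ".":
            continue
        # INFO entries are either KEY=value pairs or bare flags.
        key, sep, value = entry.partition("=")
        info_dict[key] = value if sep else True
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),  # ALT may list several alleles
        "info": info_dict,
    }


def read_vcf(lines):
    """Yield parsed records, skipping header and blank lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_vcf_line(line)
```

Because read_vcf accepts any iterable of lines, it can be used directly on an open file handle, for example list(read_vcf(open(path))) for a small file.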

Download: COSMIC

There are three COSMIC mutation datasets for coding mutations:

  • COSMIC Complete Mutation Data (Targeted Screens)
    A tab-separated table of the complete curated COSMIC dataset (targeted screens) from the current release. It includes all coding point mutations and the negative data set.
  • COSMIC Mutation Data (Genome Screens)
    A tab-separated table of coding point mutations from genome-wide screens (including whole exome sequencing).
  • COSMIC Mutations Data
    A tab-separated table of all COSMIC coding point mutations from targeted and genome-wide screens from the current release.

The COSMIC Mutations Data set was chosen because it combines both the Targeted and Genome Screens.

Downloaded File: COSMIC_SNPs_June_2022.tsv

NOTE: Downloading the mutation datasets requires a COSMIC login. An account can be created for free with an academic email address.
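
Because the downloaded COSMIC file is a tab-separated table, it can be streamed row by row instead of being loaded into memory at once. The following is a minimal sketch: the function names are hypothetical, and the column names and values in the example are made up rather than taken from the COSMIC file.

```python
import csv
import io


def iter_tsv_rows(handle):
    """Yield each row of a tab-separated file as a dict keyed by the header."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield row


def filter_rows(rows, column, wanted):
    """Yield only the rows whose value in `column` is one of `wanted`."""
    wanted = set(wanted)
    for row in rows:
        if row.get(column) in wanted:
            yield row


# Tiny in-memory example with made-up column names and values.
example = io.StringIO("gene\tsite\nG1\tbreast\nG2\tlung\n")
selected = list(filter_rows(iter_tsv_rows(example), "site", {"breast"}))
```

For a real file, the handle would come from open(path, newline="", encoding="utf-8") in place of the in-memory example.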

Step 2: Convert

In the convert step, all resources are formatted to the BioMuta standard for both data and field structure.
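
One way to picture this step is a per-source mapping from original field names to a shared set of fields. The sketch below is illustrative only: the common field names, the source labels, and the source field names are all hypothetical and do not reflect the actual BioMuta standard.

```python
# Hypothetical common field structure (not the actual BioMuta standard).
COMMON_FIELDS = ["chr", "pos", "ref", "alt", "source"]

# Hypothetical per-source mappings: source field name -> common field name.
FIELD_MAPS = {
    "source_a": {"Chromosome": "chr", "Start": "pos", "Ref": "ref", "Alt": "alt"},
    "source_b": {"CHROM": "chr", "POS": "pos", "REF": "ref", "ALT": "alt"},
}


def convert_record(record, source, field_maps=FIELD_MAPS):
    """Rename a source record's fields to the common field structure."""
    mapping = field_maps[source]
    # Start from empty values so every converted record has the same fields.
    converted = {field: "" for field in COMMON_FIELDS}
    for source_field, common_field in mapping.items():
        if source_field in record:
            converted[common_field] = record[source_field]
    converted["source"] = source
    return converted
```

Fields that a source does not provide stay empty, and source fields that have no mapping are dropped, so every converted record carries exactly the same set of fields.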

Step 3: Combine

In the combine step, all resources are combined into a master dataset.
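
As a rough illustration of combining, the converted datasets can be concatenated after checking that every record has the same fields. This sketch assumes plain concatenation; whether the actual pipeline also deduplicates or merges records is not stated here, and the function name is hypothetical.

```python
def combine_datasets(datasets, fields):
    """Concatenate converted datasets into one list of records.

    Raises ValueError if any record's fields differ from `fields`.
    """
    expected = set(fields)
    combined = []
    for records in datasets:
        for record in records:
            if set(record) != expected:
                raise ValueError("record fields do not match the common structure")
            combined.append(record)
    return combined
```

Checking the fields before appending means a source that skipped the convert step fails early instead of producing a master dataset with inconsistent columns.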