BioMuta pipeline README
This page will contain an updated version of this BioMuta documentation page.
Description
The BioMuta pipeline gathers mutation data from several sources and combines them into a single dataset with a common field structure.
The sources included in BioMuta are:
- The Cancer Genome Atlas (TCGA)
- Clinical Interpretation of Variants in Cancer (CIViC)
- Catalogue of Somatic Mutations in Cancer (COSMIC)
- International Cancer Genome Consortium (ICGC) (retired)
BioMuta gathers mutation data for the following cancers:
- Urinary Bladder Cancer (DOID:11054)
- Breast Cancer (DOID:1612)
- Colorectal Cancer (DOID:9256)
- Esophageal Cancer (DOID:5041)
- Head and Neck Cancer (DOID:11934)
- Kidney Cancer (DOID:263)
- Liver Cancer (DOID:3571)
- Lung Cancer (DOID:1324)
- Prostate Cancer (DOID:10283)
- Stomach Cancer (DOID:10534)
- Thyroid Gland Cancer (DOID:1781)
- Uterine Cancer (DOID:363)
- Cervical Cancer (DOID:4362)
- Brain Cancer (DOID:1319)
- Hematologic Cancer (DOID:2531)
- Adrenal Gland Cancer (DOID:3953)
- Pancreatic Cancer (DOID:1793)
- Ovarian Cancer (DOID:2394)
- Skin Cancer (DOID:4159)
Running the Pipeline
To run the BioMuta pipeline, download the scripts from the HIVE Lab GitHub repo: GW HIVE BioMuta Repository.
Pipeline Overview
Step 1: Download
In the download step, mutation lists are downloaded from each source. Refer to the section for each individual source below for download instructions.
Download: TCGA
Gain access to data
There are two parts to obtaining data from TCGA:
1. Primary TCGA data
- Available at NCI Genomic Data Commons.
- Accessible through the ISB-CGC BigQuery repository (more info TBA).
2. TCGA controlled-access data
- Hosted at dbGaP. For access, see SharePoint (link TBA).
Run the downloader R script using RStudio
Required script: TCGA_mutation_download.R
Run the script line by line rather than all at once.
Loading the bigrquery package with library(bigrquery) and then calling bq_project_query() (later in the script) opens a browser window to log in with Google credentials.
- Use the Google account that is registered for an ISB-CGC project and has dbGaP authorization.
- After logging in, a token is saved so that you can log in through RStudio instead.
This script downloads all mutation data for TCGA.
Because the downloaded file is very large, the script may run into problems. If it does, run the following scripts in the tcga folder instead:
- TCGA_mutation_download_part1.R
- TCGA_mutation_download_part2.R
- TCGA_mutation_download_part3.R
- TCGA_mutation_download_part4.R
Each of these scripts downloads a subset of the TCGA studies, so that each downloaded file is smaller.
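The idea of splitting one large download into several smaller ones can be sketched as follows. This is a minimal Python illustration, not the logic of the R scripts listed above: the function name split_into_batches and the study labels are hypothetical, and how the four part scripts actually divide the TCGA studies is not described here.

```python
from typing import List


def split_into_batches(studies: List[str], n_batches: int) -> List[List[str]]:
    """Split a list of study labels into n_batches contiguous, near-equal groups."""
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    # Every batch gets `base` items; the first `extra` batches get one more.
    base, extra = divmod(len(studies), n_batches)
    batches = []
    start = 0
    for i in range(n_batches):
        size = base + (1 if i < extra else 0)
        batches.append(studies[start:start + size])
        start += size
    return batches


# Hypothetical study labels, for illustration only.
example_studies = ["study_%02d" % i for i in range(1, 11)]

if __name__ == "__main__":
    for part, batch in enumerate(split_into_batches(example_studies, 4), start=1):
        print("part", part, batch)
```

Each batch could then be passed to its own download run, which keeps any single output file smaller.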
Download: CIViC
Download: COSMIC
Step 2: Convert
In the convert step, the data from every source is reformatted to the BioMuta standard, for both data values and field structure.
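A conversion of this kind can be sketched as a per-source mapping from source column names to a shared set of fields. The Python below is only an illustration under stated assumptions: the common field names, the source column names, and the function convert_record are all hypothetical, and the real BioMuta field structure may differ.

```python
from typing import Dict

# Hypothetical common field structure; the actual BioMuta fields may differ.
COMMON_FIELDS = ["source", "gene", "chromosome", "position", "ref", "alt", "doid"]

# Hypothetical mappings from each source's column names to the common fields.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "tcga": {
        "gene_symbol": "gene",
        "chrom": "chromosome",
        "start": "position",
        "ref_allele": "ref",
        "alt_allele": "alt",
    },
    "cosmic": {
        "GENE": "gene",
        "CHR": "chromosome",
        "POS": "position",
        "REF": "ref",
        "ALT": "alt",
    },
}


def convert_record(source: str, raw: Dict[str, str], doid: str) -> Dict[str, str]:
    """Map one raw source record onto the common field structure."""
    mapping = FIELD_MAPS[source]
    # Start with every common field present so all outputs share one layout.
    record = {field: "" for field in COMMON_FIELDS}
    record["source"] = source
    record["doid"] = doid
    for raw_name, common_name in mapping.items():
        if raw_name in raw:
            record[common_name] = raw[raw_name]
    return record
```

Because every converted record carries the same keys, records from different sources can later be placed in one table without further reshaping.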
Step 3: Combine
In the combine step, the converted data from all sources is merged into a single master dataset.
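One way such a merge could work is sketched below in Python. It assumes records already share a common field structure, and it assumes that a mutation reported by several sources should appear once with all of its sources listed. Both the key fields and that de-duplication rule are assumptions made for illustration, not documented BioMuta behavior, and combine_records is a hypothetical name.

```python
from typing import Dict, List, Tuple

# Hypothetical fields that identify the same mutation across sources.
KEY_FIELDS = ("chromosome", "position", "ref", "alt", "doid")


def combine_records(*datasets: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Merge converted datasets into one list, one row per distinct mutation."""
    merged: Dict[Tuple[str, ...], Dict[str, str]] = {}
    sources: Dict[Tuple[str, ...], List[str]] = {}
    for dataset in datasets:
        for record in dataset:
            key = tuple(record[field] for field in KEY_FIELDS)
            if key not in merged:
                # First time this mutation is seen: keep a copy of the record.
                merged[key] = dict(record)
                sources[key] = [record["source"]]
            elif record["source"] not in sources[key]:
                # Seen before: only remember the additional source.
                sources[key].append(record["source"])
    combined = []
    for key, record in merged.items():
        record["source"] = ";".join(sources[key])
        combined.append(record)
    return combined
```

Rows come out in the order in which each mutation was first seen, and the source field lists every contributing source separated by semicolons.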