BioMuta pipeline README
This page will contain an updated version of this BioMuta documentation page.
Description
The BioMuta pipeline gathers mutation data from several sources and combines them into a single dataset with a common field structure.
The sources included in BioMuta are:
- The Cancer Genome Atlas (TCGA)
- Clinical Interpretation of Variants in Cancer (CIViC)
- Catalogue of Somatic Mutations in Cancer (COSMIC)
- International Cancer Genome Consortium (ICGC) (retired)
BioMuta gathers mutation data for the following cancers:
- Urinary Bladder Cancer (DOID:11054)
- Breast Cancer (DOID:1612)
- Colorectal Cancer (DOID:9256)
- Esophageal Cancer (DOID:5041)
- Head and Neck Cancer (DOID:11934)
- Kidney Cancer (DOID:263)
- Liver Cancer (DOID:3571)
- Lung Cancer (DOID:1324)
- Prostate Cancer (DOID:10283)
- Stomach Cancer (DOID:10534)
- Thyroid Gland Cancer (DOID:1781)
- Uterine Cancer (DOID:363)
- Cervical Cancer (DOID:4362)
- Brain Cancer (DOID:1319)
- Hematologic Cancer (DOID:2531)
- Adrenal Gland Cancer (DOID:3953)
- Pancreatic Cancer (DOID:1793)
- Ovarian Cancer (DOID:2394)
- Skin Cancer (DOID:4159)
Running the Pipeline
To run the BioMuta pipeline, download the scripts from the HIVE Lab GitHub repo: GW HIVE BioMuta Repository.
Pipeline Overview
Step 1: Download
In the downloader step, mutation lists will be downloaded from each source. Refer to each individual source below for downloading instructions.
Download: TCGA
Annotated variant files are downloaded from the ISB-CGC Big Query repository.
Field descriptions for the Big Query output are available in field_names_descriptions.csv. Additional field descriptions are available in the GDC docs.
The list of studies used in TCGA can be found here: List of TCGA studies.
Gain access to data
There are two parts to obtaining data from TCGA:
1. Primary TCGA data
- Available at NCI Genomic Data Commons.
- Accessible through the ISB-CGC Big Query repository (more info TBA). For complete documentation, see the ISB-CGC Read the Docs pages.
2. TCGA controlled-access data
- Hosted at dbGaP. For information on how to get access, see Sharepoint.
Run downloader R script using R Studio
Required script: TCGA_mutation_download.R
Run each line one after the other, instead of the whole script at once.
After loading the library with library(bigrquery), the first call to bq_project_query() (later in the script) opens a browser window to log in with Google credentials.
- Use the Google account that is registered for an ISB-CGC project and has dbGaP authorization.
- After logging in, a token is saved so that later sessions can authenticate directly from R Studio.
This script will download all mutation data for TCGA.
Since the downloaded file is very large, there might be issues running this script. If this is the case, run the following scripts in the tcga folder:
- TCGA_mutation_download_part1.R
- TCGA_mutation_download_part2.R
- TCGA_mutation_download_part3.R
- TCGA_mutation_download_part4.R
These scripts will download a set of the TCGA studies, so that the downloaded file size is smaller.
Additional information can be found here: TCGA Additional Information.
Download: CIViC
A VCF for the monthly release of accepted variants was downloaded from: https://civicdb.org/releases/main
Download: COSMIC
There are three COSMIC mutation datasets for coding mutations:
- COSMIC Complete Mutation Data (Targeted Screens)
- A tab separated table of the complete curated COSMIC dataset (targeted screens) from the current release. It includes all coding point mutations, and the negative data set.
- COSMIC Mutation Data (Genome Screens)
- A tab separated table of coding point mutations from genome wide screens (including whole exome sequencing).
- COSMIC Mutations Data
- A tab separated table of all COSMIC coding point mutations from targeted and genome wide screens from the current release.
The COSMIC Mutations Data set was chosen because it combines both the Targeted and Genome Screens.
Downloaded File: COSMIC_SNPs_June_2022.tsv
NOTE: Downloading the mutation datasets requires a COSMIC login. An account can be created for free with an academic email address.
Fields
The COSMIC dataset contains a large number of fields, many of which were filtered out in order to speed up processing in subsequent steps.
A ‘simplified’ version of the file was created by selecting specific columns from the original downloaded file with the command line tool **awk**.
Fields in Simplified Version
Field Name | Example |
---|---|
Accession Number | ENST00000404621.5 |
Sample name | H_LV-3334-1316090 |
Primary site | breast |
Mutation CDS | c.644C>G |
Mutation AA | p.S215* |
Mutation genome position | 12:124466234-124466234 |
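The original column selection was done with awk. As a rough equivalent, the following Python sketch keeps only the columns listed in the simplified-version table. The function name `simplify_cosmic` is hypothetical, and the exact header spelling in a given COSMIC release is an assumption, so the script fails loudly if a column is missing.

```python
import csv
import io

# Column names taken from the simplified-version table; the exact header
# spelling in a given COSMIC release is an assumption.
KEEP = [
    "Accession Number",
    "Sample name",
    "Primary site",
    "Mutation CDS",
    "Mutation AA",
    "Mutation genome position",
]


def simplify_cosmic(src, dst, keep=KEEP):
    """Copy only the columns in `keep` from one tab-separated stream to another."""
    reader = csv.DictReader(src, delimiter="\t")
    missing = [name for name in keep if name not in (reader.fieldnames or [])]
    if missing:
        raise KeyError(f"columns not found in header: {missing}")
    writer = csv.DictWriter(
        dst,
        fieldnames=keep,
        delimiter="\t",
        extrasaction="ignore",  # silently drop every column not in `keep`
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


# Small in-memory example; a real run would pass open file handles instead.
example = io.StringIO(
    "Gene name\tAccession Number\tSample name\tPrimary site\t"
    "Mutation CDS\tMutation AA\tMutation genome position\n"
    "GENE1\tENST00000404621.5\tH_LV-3334-1316090\tbreast\t"
    "c.644C>G\tp.S215*\t12:124466234-124466234\n"
)
result = io.StringIO()
simplify_cosmic(example, result)
```

Streaming row by row keeps memory use flat, which matters for a file as large as the full COSMIC download.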
All Fields from COSMIC and Field Descriptions
From 'File Description' drop down menu below 'Cosmic Mutation Data' (on downloads page):
Field Name | Description |
---|---|
Gene name | The gene name for which the data has been curated. |
Accession Number | The transcript identifier of the gene. |
Gene CDS length | Length of the gene (base pair) counts. |
HGNC id | If gene is in HGNC, this id helps linking it to HGNC. |
Sample name | Sample id, Id tumor A assigned. |
Primary Site | The primary tissue/cancer from which the sample originates. |
Site Subtype 1 | Further sub classification (level 1) of the sample’s tissue. |
Download: ICGC
A VCF for release 28 was downloaded from https://dcc.icgc.org/releases/release_28/Summary
Downloaded File: simple_somatic_mutation.aggregated.vcf.gz
Step 2: Convert
In the convert step, all resources are formatted to the BioMuta standard for both data and field structure.
For each resource, a dedicated script converts the raw format provided by the resource to a format aligned with past versions of the BioMuta pipeline.
With a common format, all resources can then be combined into a master dataset.
See the individual resource pages for details on the conversion:
Convert: TCGA
Scripts
- process_tcga_download.py
Procedure
Run process_tcga_download.py
Summary
The python script `process_tcga_download.py` will take the output of the TCGA download step and:
- Map the data to:
  * uniprot accession
  * doid parent terms
- Rename fields
- Reformat fields:
  * amino acid change and position
  * chromosome id
- Filter out unnecessary fields
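To illustrate the "reformat fields" step, here is a minimal sketch of splitting an amino acid change into its parts and normalizing a chromosome id. It assumes the short one-letter notation seen elsewhere on this page (for example `p.S215*`) and an optional `chr` prefix on chromosome names. Both function names are hypothetical, and the actual script may handle more notations.

```python
import re

# Matches simple one-letter substitutions such as "p.S215*" or "p.R175H".
# Assumption: this short form is what the input holds; other notations
# (frameshifts, indels, three-letter codes) are not handled here.
_AA_CHANGE = re.compile(r"^p\.([A-Z*])(\d+)([A-Z*])$")


def split_aa_change(value):
    """Return (ref_aa, position, alt_aa), or None if the pattern does not match."""
    match = _AA_CHANGE.match(value.strip())
    if match is None:
        return None
    ref_aa, position, alt_aa = match.groups()
    return ref_aa, int(position), alt_aa


def normalize_chromosome(value):
    """Strip an optional 'chr' prefix, e.g. 'chr12' -> '12'."""
    value = value.strip()
    return value[3:] if value.lower().startswith("chr") else value


print(split_aa_change("p.S215*"))     # ('S', 215, '*')
print(normalize_chromosome("chr12"))  # 12
```

Returning `None` for unrecognized notations lets the caller decide whether to skip or log such rows rather than guessing at a position.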
Script Specifications
The script must be called from the command line and takes specific command line arguments.
Input
- -i: A path to the input csv to reformat
- -m: A path to the folder containing all mappings
- -d: A path to the tcga study to doid mapping file
- -e: A path to the ENSP to uniprot mapping file
- -o: A path to the output folder
Output
- A data report comparing new AA sites to old AA sites for Biomuta
Usage
- `python process_tcga_download.py -h`
*Gives a description of the necessary commands*
- `python process_tcga_download.py -i <path/input_file.csv> -m <path/> -d <doid_mapping.csv> -e <ensp_mapping.csv> -o <path/>`
*Runs the script with the given input TCGA CSV and outputs a formatted CSV*
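For orientation, a command line interface with the flags listed above could be declared with `argparse` roughly as follows. This is only a sketch: the flags come from this page, but the `dest` names, the help strings, and the `build_parser` function are hypothetical and may not match the real script.

```python
import argparse


def build_parser():
    """Declare the five flags documented for the TCGA convert script."""
    parser = argparse.ArgumentParser(
        description="Reformat a TCGA download CSV (illustrative sketch)."
    )
    # dest names below are hypothetical; only the flags come from the README.
    parser.add_argument("-i", dest="input_csv", required=True,
                        help="path to the input CSV to reformat")
    parser.add_argument("-m", dest="mapping_dir", required=True,
                        help="path to the folder containing all mappings")
    parser.add_argument("-d", dest="doid_mapping", required=True,
                        help="TCGA study to DOID mapping file")
    parser.add_argument("-e", dest="ensp_mapping", required=True,
                        help="ENSP to uniprot mapping file")
    parser.add_argument("-o", dest="output_dir", required=True,
                        help="path to the output folder")
    return parser


# Parsing an explicit list makes the example independent of sys.argv.
args = build_parser().parse_args(
    ["-i", "tcga.csv", "-m", "mapping/", "-d", "tcga_doid_mapping.csv",
     "-e", "human_protein_transcriptlocus.csv", "-o", "out/"]
)
print(args.input_csv, args.output_dir)
```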
Additional Notes
All the mapping files are available in the repository folder: `pipeline/convert_step2/mapping`
The mapping files used for converting TCGA are:
DOID:
- `tcga_doid_mapping.csv`
TCGA Projects were mapped to DOID parent terms using the following table (generated from previous Biomuta mapping):
DO_slim_id | DO_slim_name | TCGA_project |
---|---|---|
DOID:5041 | esophageal cancer | TCGA-ESCA |
DOID:2531 | hematologic cancer | TCGA-DLBC |
DOID:9256 | colorectal cancer | TCGA-READ |
DOID:1319 | brain cancer | TCGA-GBM |
DOID:1319 | brain cancer | TCGA-LGG |
DOID:1781 | thyroid cancer | TCGA-THCA |
DOID:11054 | urinary bladder cancer | TCGA-BLCA |
DOID:363 | uterine cancer | TCGA-UCEC |
DOID:169 | neuroendocrine tumor | TCGA-PCPG |
DOID:4362 | cervical cancer | TCGA-CESC |
DOID:363 | uterine cancer | TCGA-UCS |
DOID:3277 | thymus cancer | TCGA-THYM |
DOID:3571 | liver cancer | TCGA-LIHC |
DOID:11934 | head and neck cancer | TCGA-HNSC |
DOID:2174 | ocular cancer | TCGA-UVM |
DOID:4159 | skin cancer | TCGA-SKCM |
DOID:9256 | colorectal cancer | TCGA-COAD |
DOID:3953 | adrenal gland cancer | TCGA-ACC |
DOID:1793 | pancreatic cancer | TCGA-PAAD |
DOID:2994 | germ cell cancer | TCGA-TGCT |
DOID:1324 | lung cancer | TCGA-LUSC |
DOID:1790 | malignant mesothelioma | TCGA-MESO |
DOID:2394 | ovarian cancer | TCGA-OV |
DOID:1115 | sarcoma | TCGA-SARC |
DOID:263 | kidney cancer | TCGA-KIRP |
DOID:263 | kidney cancer | TCGA-KICH |
DOID:10534 | stomach cancer | TCGA-STAD |
DOID:2531 | hematologic cancer | TCGA-LAML |
DOID:10283 | prostate cancer | TCGA-PRAD |
DOID:1324 | lung cancer | TCGA-LUAD |
DOID:1612 | breast cancer | TCGA-BRCA |
DOID:263 | kidney cancer | TCGA-KIRC |
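The table above amounts to a lookup from TCGA project to DOID parent term. A minimal sketch of loading it is shown below. It assumes `tcga_doid_mapping.csv` is comma-separated and uses the same three column headers as the table, which this page does not confirm. `load_doid_mapping` is a hypothetical helper.

```python
import csv
import io


def load_doid_mapping(handle):
    """Build {TCGA_project: (DO_slim_id, DO_slim_name)} from a CSV stream.

    Assumes the header names match the table in this README.
    A project listed twice simply keeps its last row.
    """
    mapping = {}
    for row in csv.DictReader(handle):
        mapping[row["TCGA_project"]] = (row["DO_slim_id"], row["DO_slim_name"])
    return mapping


# In-memory sample with two rows copied from the table above.
sample = io.StringIO(
    "DO_slim_id,DO_slim_name,TCGA_project\n"
    "DOID:1612,breast cancer,TCGA-BRCA\n"
    "DOID:263,kidney cancer,TCGA-KIRC\n"
)
doid_by_project = load_doid_mapping(sample)
print(doid_by_project["TCGA-BRCA"])  # ('DOID:1612', 'breast cancer')
```

Keying on the project rather than the DOID reflects that several TCGA projects share one parent term (for example the kidney cancer projects).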
Uniprot Accession:
- `human_protein_transcriptlocus.csv`
Peptide ID (starts with ENSP) was mapped to uniprot isoform accession.
- Mapping was NOT performed to the uniprot canonical accession, because this caused an issue in the final dataset: a mutation for the same canonical accession would be listed with different amino acid changes.
Convert: CIViC
Scripts
- genomic liftover > convert_civic_vcf.py > map_civic_csv.py
Procedure
Perform liftover of mutations from GRCh37 to GRCh38
Summary
The most recent data release for CIViC is aligned to the GRCh37 human reference genome. For this update, we are using the human reference genome GRCh38.
To convert coordinates between the two reference genomes, we use a ‘liftover’ tool to remap the genomic coordinates. The CIViC file is very small, so we can use the Ensembl online liftover tool (Assembly Converter): https://useast.ensembl.org/Homo_sapiens/Tools/AssemblyConverter?db=core
Run the downloaded VCF through the tool with the default parameters (change the file type to VCF).
Redownload the transformed VCF and use that VCF for the next step.
Run convert_civic_vcf.py
Summary
The python script `convert_civic_vcf.py` will convert the VCF formatted file to a CSV file.
With the VCF format, each mutation line in the file can contain multiple annotations and annotation-specific information.
The output CSV format will contain only one annotation per line with associated annotation-specific information.
In order to know how the information for the mutation and annotation fields are structured, a schema describing the fields is provided to the script.
Example Line Transformation
Input VCF lines
mutation A info | mutation A annotation 1 info | mutation A annotation 2 info
mutation B info | mutation B annotation 1 info | mutation B annotation 2 info | mutation B annotation 3 info
Output CSV lines
mutation A info,annotation 1 info
mutation A info,annotation 2 info
mutation B info,annotation 1 info
mutation B info,annotation 2 info
mutation B info,annotation 3 info
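The line transformation above can be sketched in a few lines of Python. This is not the actual `convert_civic_vcf.py`: it assumes annotations are stored VEP-style in a `CSQ` INFO key, separated by commas with `|` between subfields, and it ignores the schema file. The function name `expand_annotations` is hypothetical.

```python
def expand_annotations(vcf_line, info_key="CSQ"):
    """Yield one list of output fields per annotation on a VCF data line.

    Assumption: annotations sit in the INFO column under `info_key`,
    comma-separated, with '|' between annotation subfields.
    """
    columns = vcf_line.rstrip("\n").split("\t")
    # Fixed VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO
    chrom, pos, var_id, ref, alt = columns[:5]
    info = columns[7]
    annotations = []
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == info_key:
            annotations = value.split(",")
    for annotation in annotations:
        # Repeat the mutation fields once per annotation.
        yield [chrom, pos, var_id, ref, alt] + annotation.split("|")


line = "1\t100\tvar1\tA\tG\t.\t.\tGN=X;CSQ=G|missense|ENST1,G|synonymous|ENST2"
for fields in expand_annotations(line):
    print(",".join(fields))
```

Each yielded row repeats the mutation-level fields, which matches the "one annotation per line" layout of the output CSV. A line without the annotation key yields no rows.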
Script Specifications
The script must be called from the command line and takes specific command line arguments:
Input
- -i: A path to the CIVIC VCF file
- -p: A prefix used for naming the output files
- -o: A path to the output folder, where the mutation data CSV will go
Output
- A CSV file with mutation data
Usage
python convert_civic_vcf.py -h
Gives a description of the necessary commands
python convert_civic_vcf.py -i <path/input_file.vcf> -s <path/schema.json> -o <path/>
Runs the script with the given input VCF and outputs a CSV file.
Run map_civic_csv.py
Summary
The python script `map_civic_csv.py` will take the CSV output of `convert_civic_vcf.py` and:
- Map the data to:
  * uniprot accessions
  * doid parent terms
- Rename fields
- Reformat fields:
  * amino acid change and position
  * chromosome id
  * genomic location
  * nucleotide change
- Remove indels
- Transform NA values
Script Specifications
The script must be called from the command line and takes specific command line arguments:
Input
- -i: A path to the CIVIC CSV file
- -m: A path to the folder containing mapping files
- -d: The name of the doid mapping file
- -e: The name of the ensp to uniprot accession mapping file
- -o: A path to the output folder
Output
- A CSV file with mutation data mapped to doid terms and uniprot accessions
Usage
python map_civic_csv.py -h
Gives a description of the necessary commands
python map_civic_csv.py -i <path/input_file.csv> -m <path/mapping_folder> -d <doid_mapping_file_name> -e <ensp_mapping_file_name> -o <path/>
Runs the script with the given input CSV and outputs a CSV with mutation mapped to doid terms and uniprot accessions.
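Two of the cleanup operations performed by this script, removing indels and transforming NA values, could look roughly like the sketch below. The column names `ref` and `alt`, the set of NA spellings, and the replacement value are all assumptions made for illustration. The real script may use different names and rules.

```python
def is_snv(ref, alt):
    """True only for a single-base substitution (anything else counts as an indel here)."""
    bases = set("ACGT")
    return len(ref) == 1 and len(alt) == 1 and ref in bases and alt in bases


def clean_rows(rows, na_values=("NA", "N/A", ""), replacement=None):
    """Drop indel rows and replace NA-like strings with `replacement`.

    `rows` is an iterable of dicts with hypothetical 'ref' and 'alt' keys.
    """
    cleaned = []
    for row in rows:
        if not is_snv(row["ref"], row["alt"]):
            continue  # remove indels and other non-SNV records
        cleaned.append(
            {key: (replacement if value in na_values else value)
             for key, value in row.items()}
        )
    return cleaned


rows = [
    {"ref": "A", "alt": "G", "gene": "NA"},
    {"ref": "AT", "alt": "A", "gene": "GENE1"},  # deletion, dropped
]
print(clean_rows(rows))  # [{'ref': 'A', 'alt': 'G', 'gene': None}]
```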
Step 3: Combine
In the combine step, all converted resources are combined into a master dataset.
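Because every resource shares one field structure after the convert step, combining them can be as simple as concatenating the converted CSV files and recording where each row came from. The sketch below assumes that approach. The field names in `COMMON_FIELDS`, the extra `source` column, and the `combine` function are hypothetical and are not taken from the BioMuta scripts.

```python
import csv
import io

# Hypothetical common field names, for illustration only.
COMMON_FIELDS = ["uniprot_ac", "aa_pos", "ref_aa", "alt_aa", "do_id"]


def combine(sources, out, fields=COMMON_FIELDS):
    """Write all rows from `sources` ({name: CSV stream}) to `out`.

    Adds a 'source' column; columns outside `fields` are dropped and
    missing ones are left empty. Returns the number of rows written.
    """
    writer = csv.DictWriter(
        out,
        fieldnames=["source"] + list(fields),
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    count = 0
    for name, handle in sources.items():
        for row in csv.DictReader(handle):
            writer.writerow({**row, "source": name})
            count += 1
    return count


tcga = io.StringIO("uniprot_ac,aa_pos,ref_aa,alt_aa,do_id\nP00001-1,215,S,*,DOID:1612\n")
civic = io.StringIO("uniprot_ac,aa_pos,ref_aa,alt_aa,do_id,extra\nP00002-1,12,G,D,DOID:1324,x\n")
master = io.StringIO()
print(combine({"tcga": tcga, "civic": civic}, master))  # 2
```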