BioMuta pipeline README
The BioMuta pipeline has undergone significant changes since version 6.0. The old pipeline (version 5.0 and older) is located here.
Description
The BioMuta pipeline gathers mutation data from various sources and combines it into a single dataset under a common field structure.
The sources included in BioMuta are:
- The Cancer Genome Atlas (TCGA)
- Clinical Interpretation of Variants in Cancer (CIViC)
- Catalogue of Somatic Mutations in Cancer (COSMIC)
BioMuta gathers mutation data for the following cancers:
- Urinary Bladder Cancer (DOID:11054)
- Breast Cancer (DOID:1612)
- Colorectal Cancer (DOID:9256)
- Esophageal Cancer (DOID:5041)
- Head and Neck Cancer (DOID:11934)
- Kidney Cancer (DOID:263)
- Liver Cancer (DOID:3571)
- Lung Cancer (DOID:1324)
- Prostate Cancer (DOID:10283)
- Stomach Cancer (DOID:10534)
- Thyroid Gland Cancer (DOID:1781)
- Uterine Cancer (DOID:363)
- Cervical Cancer (DOID:4362)
- Brain Cancer (DOID:1319)
- Hematologic Cancer (DOID:2531)
- Adrenal Gland Cancer (DOID:3953)
- Pancreatic Cancer (DOID:1793)
- Ovarian Cancer (DOID:2394)
- Skin Cancer (DOID:4159)
Running the Pipeline
To run the BioMuta pipeline, download the scripts from the HIVE Lab GitHub repo: GW HIVE BioMuta Repository.
Pipeline Overview
Step 1: Download
In the download step, mutation lists are downloaded from each source. Refer to each individual source below for downloading instructions. Downloader scripts are located at pipeline/download_step1/$RESOURCE.
Download: cBioPortal
1. Download mutation data: fetch_mutations.sh
This script uses the cBioPortal API Studies endpoint to fetch the complete list of available study IDs, then fetches mutation data in JSON format for every possible Molecular Profile and Sample List of each study.
2. Download cancer types: cancer_types.sh
This script fetches the cancer type associated with each study ID in order to map study IDs to Disease Ontology IDs (DOIDs) in Step 2: Convert.
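Both downloader scripts drive the public cBioPortal REST API. Below is a minimal Python sketch of the requests involved, not the scripts themselves: the endpoint paths are those of the public API at www.cbioportal.org, output file names are illustrative, and the real scripts may filter profiles or batch requests differently.

```python
import json
import requests

API = "https://www.cbioportal.org/api"

# Complete list of available studies (Studies endpoint); each study
# record also carries a cancerTypeId, used later to reach a DOID.
studies = requests.get(f"{API}/studies").json()
study_to_cancer_type = {s["studyId"]: s.get("cancerTypeId") for s in studies}

for study in studies:
    study_id = study["studyId"]
    profiles = requests.get(f"{API}/studies/{study_id}/molecular-profiles").json()
    sample_lists = requests.get(f"{API}/studies/{study_id}/sample-lists").json()
    # Mutation data for every profile/sample-list pair, in JSON format.
    for profile in profiles:
        for sample_list in sample_lists:
            resp = requests.get(
                f"{API}/molecular-profiles/{profile['molecularProfileId']}/mutations",
                params={
                    "sampleListId": sample_list["sampleListId"],
                    "projection": "DETAILED",
                },
            )
            if resp.ok and resp.json():
                out_name = f"{study_id}.{profile['molecularProfileId']}.json"
                with open(out_name, "w") as fh:
                    json.dump(resp.json(), fh)
```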
Download: CIViC
A VCF of the monthly release of accepted variants was downloaded from: https://civicdb.org/releases/main
Step 2: Convert
In the convert step, all resources are formatted to the BioMuta standard for both data and field structure. Conversion scripts are located at pipeline/convert_step2/$RESOURCE.
For each resource, a conversion is done from the raw format provided by the resource to a format aligned with past versions of the BioMuta pipeline.
With a common format, all resources can then be combined into a master dataset.
See the individual resource pages for details on the conversion:
Convert: cBioPortal
Scripts
Liftover
Located at pipeline/convert_step2/liftover.
In this step, GRCh37 genomic positions in the cBioPortal raw JSON files are converted to GRCh38 using the command-line LiftOver tool from UCSC.
- 1_chr_pos_to_bed.py outputs GRCh37 genomic positions in BED format
- 2_liftover.sh takes the BED file and performs the conversion to GRCh38
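The UCSC tool takes the input BED file, a chain file describing the assembly mapping, and names for the lifted and unmapped outputs. An illustrative invocation (the BED file names here are hypothetical; hg19ToHg38.over.chain.gz is the UCSC chain file for this conversion):
liftOver chr_pos_grch37.bed hg19ToHg38.over.chain.gz chr_pos_grch38.bed unmapped.bed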
Conversion
- 1_generate_cancer_do_json.py
- 2_parse_gff.py
- 3_1_ensp_to_uniprot_from_glygen.py
- 3_2_ensp_to_uniprot_api.sh
- 4_1_canonical_yes_no.py
- 4_2_merge_canonical.py
- 5_compare_fasta.py
- 6_1_create_dict_faster.py
Procedure
1_generate_cancer_do_json.py is a standalone script. The rest of the scripts must be run sequentially.
Summary
The Python script 1_generate_cancer_do_json.py takes the JSON dictionary downloaded in step 1, which maps study IDs to cancer types, and uses it as an intermediary to map study IDs to the DO cancer slim terms listed in the description at the beginning of this page.
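A sketch of that composition, with illustrative file names and JSON layouts (the real script's inputs may be shaped differently):

```python
import json

# study ID -> cancer type, as downloaded in step 1 (illustrative name)
with open("study_to_cancer_type.json") as fh:
    study_to_cancer = json.load(fh)

# cancer type -> DO slim term (curated mapping; illustrative name)
with open("cancer_type_to_doid.json") as fh:
    cancer_to_doid = json.load(fh)

# Compose the two maps, using the cancer type as the intermediary.
study_to_doid = {
    study: cancer_to_doid[cancer]
    for study, cancer in study_to_cancer.items()
    if cancer in cancer_to_doid  # studies outside the slim are dropped
}

with open("study_to_doid.json", "w") as fh:
    json.dump(study_to_doid, fh, indent=2)
```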
The rest of the conversion goes in this order:
2_parse_gff.py
Parses the annotations file located at downloads/ensembl/Homo_sapiens.GRCh38.113.db using the gffutils Python package (see the gffutils GitHub page). Each genomic position in the raw JSON files is assigned its corresponding ENSEMBL protein ID (prefix ENSP). Outputs chr_pos_to_ensp.tsv. A minimal lookup sketch appears after this list.
3_1_ensp_to_uniprot_from_glygen.py
Takes chr_pos_to_ensp.tsv and maps each ENSP ID to its corresponding UniProt accession number using downloads/glygen/human_protein_transcriptlocus.csv from GlyGen.
3_2_ensp_to_uniprot_api.sh
Maps the remaining ENSP IDs, those not found in human_protein_transcriptlocus.csv, to UniProt accession numbers using the UniProt API.
4_1_canonical_yes_no.py
Filters out non-canonical UniProt accession numbers.
4_2_merge_canonical.py
Merges the filtered outputs of 3_1_ensp_to_uniprot_from_glygen.py and 3_2_ensp_to_uniprot_api.sh.
5_compare_fasta.py
Compares canonical FASTA sequences to their ENSEMBL counterparts.
6_1_create_dict_faster.py
Creates the dictionary used in the combining step.
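To make the 2_parse_gff.py step concrete, here is a minimal gffutils lookup, assuming the Ensembl GFF3 convention that CDS features carry a protein_id attribute (ENSP...); the actual script's traversal may differ:

```python
import gffutils

# Open the prebuilt gffutils database from the download step.
db = gffutils.FeatureDB("downloads/ensembl/Homo_sapiens.GRCh38.113.db")

def ensp_for_position(chrom, pos):
    """Return the ENSP IDs of CDS features overlapping a genomic position."""
    ensps = set()
    for cds in db.region(region=(chrom, pos, pos), featuretype="CDS"):
        # Ensembl GFF3 CDS records carry a protein_id attribute.
        ensps.update(cds.attributes.get("protein_id", []))
    return ensps

# e.g. the GRCh38 position of the BRAF V600E hotspot
print(ensp_for_position("7", 140753336))
```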
Convert: CIViC
Scripts
- genomic liftover > convert_civic_vcf.py > map_civic_csv.py
Procedure
Perform liftover of mutations from GRCh37 to GRCh38
Summary
The most recent data release for CIViC is aligned to the GRCh37 human reference genome. For this update, we are using the human reference genome GRCh38.
To convert coordinates between the two reference genomes, we use a ‘liftover’ tool to remap the genomic coordinates. The CIViC file is very small, so we can use the Ensembl online Assembly Converter: https://useast.ensembl.org/Homo_sapiens/Tools/AssemblyConverter?db=core
Run the downloaded VCF through the tool with the default parameters (change the file type to VCF).
Download the converted VCF and use it in the next step.
Run convert_civic_vcf.py
Summary
The Python script `convert_civic_vcf.py` will convert the VCF-formatted file to a CSV file.
With the VCF format, each mutation line in the file can contain multiple annotations and annotation-specific information.
The output CSV format will contain only one annotation per line with associated annotation-specific information.
A schema describing the fields is provided to the script so that it knows how the mutation and annotation information is structured.
Example Line Transformation
Input VCF lines
mutation A info | mutation A annotation 1 info | mutation A annotation 2 info
mutation B info | mutation B annotation 1 info | mutation B annotation 2 info | mutation B annotation 3 info
Output CSV lines
mutation A info,annotation 1 info
mutation A info,annotation 2 info
mutation B info,annotation 1 info
mutation B info,annotation 2 info
mutation B info,annotation 3 info
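A sketch of that flattening, assuming annotations arrive comma-separated inside a single INFO key (shown here as a hypothetical ANN key; the real script reads the actual key and column layout from the provided schema):

```python
import csv

def explode_vcf(vcf_path, csv_path, ann_key="ANN"):
    """Write one CSV row per (mutation, annotation) pair."""
    with open(vcf_path) as vcf, open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        for line in vcf:
            if line.startswith("#"):  # skip header lines
                continue
            fields = line.rstrip("\n").split("\t")
            chrom, pos, _vcf_id, ref, alt = fields[:5]
            info = dict(
                kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv
            )
            # Multiple annotations live comma-separated in one INFO value;
            # emit the shared mutation columns once per annotation.
            for annotation in info.get(ann_key, "").split(","):
                if annotation:
                    writer.writerow([chrom, pos, ref, alt, annotation])
```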
Script Specifications
The script must be called from the command line and takes specific command line arguments:
Input
- -i: A path to the CIViC VCF file
- -s: A path to the JSON schema describing the mutation and annotation fields
- -p: A prefix used for naming the output files
- -o: A path to the output folder, where the mutation data CSV will go
Output
- A CSV file with mutation data
Usage
python convert_civic_vcf.py -h
Gives a description of the necessary commands
python convert_civic_vcf.py -i <path/input_file.vcf> -s <path/schema.json> -o <path/>
Runs the script with the given input VCF and outputs a CSV file.
Run map_civic_csv.py
Summary
The Python script map_civic_csv.py will take the CSV output of the previous step and:
- Map the data to:
* UniProt accessions
* DOID parent terms
- Rename fields
- Reformat fields:
* amino acid change and position
* chromosome id
* genomic location
* nucleotide change
- Remove indels
- Transform NA values
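A sketch of those operations with pandas; every column and file name below is illustrative, not the script's actual interface:

```python
import pandas as pd

mutations = pd.read_csv("civic_converted.csv")
ensp_to_uniprot = pd.read_csv("mapping/ensp_to_uniprot.csv")
cancer_to_doid = pd.read_csv("mapping/cancer_to_doid.csv")

# Map to UniProt accessions and DOID parent terms via joins.
mutations = mutations.merge(ensp_to_uniprot, on="ensp_id", how="left")
mutations = mutations.merge(cancer_to_doid, on="cancer_type", how="left")

# Rename fields to the BioMuta standard (illustrative rename).
mutations = mutations.rename(columns={"chromosome": "chr_id"})

# Remove indels: keep single-nucleotide reference and alternate alleles.
snv_only = mutations["ref"].str.len().eq(1) & mutations["alt"].str.len().eq(1)
mutations = mutations[snv_only]

# Transform NA values into a uniform placeholder.
mutations = mutations.fillna("")
mutations.to_csv("civic_mapped.csv", index=False)
```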
Script Specifications
The script must be called from the command line and takes specific command line arguments:
Input
- -i: A path to the CIViC CSV file
- -m: A path to the folder containing mapping files
- -d: The name of the DOID mapping file
- -e: The name of the ENSP to UniProt accession mapping file
- -o: A path to the output folder
Output
- A CSV file with mutation data mapped to DOID terms and UniProt accessions
Usage
python map_civic_csv.py -h
Gives a description of the necessary commands
python map_civic_csv.py -i <path/input_file.csv> -m <path/mapping_folder> -d <doid_mapping_file_name> -e <ensp_mapping_file_name> -o <path/>
Runs the script with the given input CSV and outputs a CSV with mutations mapped to DOID terms and UniProt accessions.
Convert: COSMIC
Scripts
- map_cosmic_tsv.py
Procedure
Run map_cosmic_tsv.py
Summary
The Python script `map_cosmic_tsv.py` will take the COSMIC TSV mutation file and:
- Map the data to:
* UniProt accessions
* DOID parent terms
- Rename fields
- Reformat fields:
* amino acid change and position
* chromosome id
* genomic location
* nucleotide change
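As one example of the field reformatting, "amino acid change and position" typically means splitting an HGVS-style protein change such as p.V600E into its parts; a sketch (the exact notation the script handles may differ):

```python
import re

# Matches simple missense changes like "p.V600E":
# reference residue, protein position, alternate residue (or stop).
AA_CHANGE = re.compile(r"^p\.([A-Z])(\d+)([A-Z*])$")

def split_aa_change(value):
    match = AA_CHANGE.match(value)
    if match is None:
        return None
    ref_aa, pos, alt_aa = match.groups()
    return ref_aa, int(pos), alt_aa

print(split_aa_change("p.V600E"))  # -> ('V', 600, 'E')
```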
Script Specifications
The script must be called from the command line and takes specific command line arguments.
Input
- -i : A path to the COSMIC TSV mutation file
- -m : A path to the folder containing mapping files
- -d : The name of the DOID to COSMIC cancer type mapping file
- -e : The name of the ENST to UniProt accession mapping file
- -o : A path to the folder to export the final mapped mutations
Output
- A mutation file with COSMIC mutations mapped to doid terms and uniprot accessions
Usage
python map_cosmic_tsv.py -h
Gives a description of the necessary commands
python map_cosmic_tsv.py -i <path/cosmic_file_name.tsv> -m <path/mapping_folder> -d <doid_mapping_file_name> -e <enst_mapping_file_name> -o <path/output_folder>
Runs the script with the given input file and exports the mapped mutation file.
Step 3: Combine
In the combine step, all resources are combined into a master dataset. The scripts are located at pipeline/combine_step3.
Scripts
- combine_cbio.py
- combine_csv.py
Procedure
Run combine_cbio.py, then combine_csv.py
Summary
All of the mutation data for each source was converted to a standardized data structure in the convert step.
Now, all of these separate CSV files (one for each source) will be combined into a master CSV file.
All CSV files to be combined should be in a folder together with no additional CSV files.
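The core of the combine step amounts to concatenating same-schema CSVs; a sketch assuming every file shares the standardized header from the convert step (folder and output names are illustrative):

```python
import glob
import pandas as pd

# Read every converted CSV in the input folder; all share one header.
frames = [pd.read_csv(path) for path in sorted(glob.glob("converted/*.csv"))]

# Stack them into a single master dataset and write it out.
master = pd.concat(frames, ignore_index=True)
master.to_csv("biomuta_master.csv", index=False)
```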
Script Specifications
The script must be called from the command line and takes specific command line arguments.
Input
- -i : The folder containing CSV mutation files to combine
- -o : The folder to output the combined mutation file
Output
- A CSV file combining all CSV files in a given folder
Usage
python combine_csv.py -h
Gives a description of the necessary commands
python combine_csv.py -i <path/> -o <path/>
Runs the script with the given folder and combines all CSV files in that folder.