BioMuta pipeline README
{{Under construction|custom message = Originally updated by Ned Cauley (August 2022); currently maintained by Maria Kim (September 2024).}}
The BioMuta pipeline has undergone significant changes since version 6.0. The old pipeline (version 5.0 and older) is documented [https://biomuta.readthedocs.io/en/latest/# here].

= Description =
The BioMuta pipeline gathers mutation data from various sources and combines them into a single dataset under a common field structure.

The sources included in BioMuta are:
*[https://civicdb.org/welcome Clinical Interpretation of Variants in Cancer (CIVIC)]
*[https://cancer.sanger.ac.uk/cosmic Catalogue of Somatic Mutations in Cancer (COSMIC)]
*[https://dcc.icgc.org/ International Cancer Genome Consortium (ICGC)] (retired)


BioMuta gathers mutation data for the following cancers:
* Urinary Bladder Cancer (DOID:11054)
* Breast Cancer (DOID:1612)
* Colorectal Cancer (DOID:9256)
* Esophageal Cancer (DOID:5041)
* Head and Neck Cancer (DOID:11934)
* Kidney Cancer (DOID:263)
* Liver Cancer (DOID:3571)
* Lung Cancer (DOID:1324)
* Prostate Cancer (DOID:10283)
* Stomach Cancer (DOID:10534)
* Thyroid Gland Cancer (DOID:1781)
* Uterine Cancer (DOID:363)
* Cervical Cancer (DOID:4362)
* Brain Cancer (DOID:1319)
* Hematologic Cancer (DOID:2531)
* Adrenal Gland Cancer (DOID:3953)
* Pancreatic Cancer (DOID:1793)
* Ovarian Cancer (DOID:2394)
* Skin Cancer (DOID:4159)


= Running the Pipeline =
To run the BioMuta pipeline, download the scripts from the HIVE Lab GitHub repo: [https://github.com/GW-HIVE/biomuta-old GW HIVE BioMuta Repository].


= Pipeline Overview =
== Step 1: Download ==
In the download step, mutation lists are downloaded from each source; refer to each individual source below for downloading instructions. Downloader scripts are located at <code>pipeline/download_step1/$RESOURCE</code>.


=== Download: cBioPortal ===
1. Download mutation data: <code>fetch_mutations.sh</code>


This script uses the [https://www.cbioportal.org/api/swagger-ui/index.html cBioPortal API] Studies endpoint to fetch the complete list of available study IDs, then fetches mutation data in JSON format for every available Molecular Profile and Sample List of each study.


2. Download cancer types: <code>cancer_types.sh</code>


This script fetches the cancer types associated with each study ID so that study IDs can be mapped to Disease Ontology IDs (DOID) in Step 2: Convert.
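Below is a minimal Python sketch of the API flow the two scripts implement, using the public cBioPortal REST endpoints. The output file naming is illustrative only, and the actual shell scripts may batch, filter, or paginate differently.

<syntaxhighlight lang="python">
"""Sketch of fetch_mutations.sh / cancer_types.sh via the cBioPortal REST API."""
import json
import requests

BASE = "https://www.cbioportal.org/api"

# Studies endpoint: complete list of available study IDs.
studies = requests.get(f"{BASE}/studies").json()

for study in studies[:1]:  # one study shown; the real scripts loop over all
    study_id = study["studyId"]

    # Cancer type for the study, used for the DOID mapping in Step 2.
    cancer_type = requests.get(f"{BASE}/cancer-types/{study['cancerTypeId']}").json()

    # Every molecular profile and sample list attached to the study.
    profiles = requests.get(f"{BASE}/studies/{study_id}/molecular-profiles").json()
    sample_lists = requests.get(f"{BASE}/studies/{study_id}/sample-lists").json()

    # Mutations in JSON for each profile x sample-list combination.
    for p in profiles:
        for sl in sample_lists:
            r = requests.get(
                f"{BASE}/molecular-profiles/{p['molecularProfileId']}/mutations",
                params={"sampleListId": sl["sampleListId"], "projection": "DETAILED"},
            )
            if r.ok:  # many combinations are invalid and return an error
                with open(f"{study_id}_{p['molecularProfileId']}.json", "w") as f:
                    json.dump(r.json(), f)
</syntaxhighlight>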


=== Download: CIViC ===
A VCF of the monthly release of accepted variants was downloaded from https://civicdb.org/releases/main.


== Step 2: Convert ==
In the convert step, all resources are formatted to the BioMuta standard for both data and field structure. Conversion scripts are located at <code>pipeline/convert_step2/$RESOURCE</code>.


For each resource, the raw format provided by the source is converted to a format aligned with past versions of the BioMuta pipeline.


With a common format, all resources can then be combined into a master dataset.

See the individual resource pages for details on the conversion:


== Convert: cBioPortal ==


=== Scripts ===
==== Liftover ====
Located at <code>pipeline/convert_step2/liftover</code>.
 
In this step, GRCh37 genomic positions in the raw cBioPortal JSON files are converted to GRCh38 using UCSC's command-line [https://genome.ucsc.edu/cgi-bin/hgLiftOver LiftOver] tool.
 
* <code>1_chr_pos_to_bed.py</code> outputs GRCh37 genomic positions in the BED format
* <code>2_liftover.sh</code> takes the BED file and performs the conversion to GRCh38
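The following is a minimal Python sketch of these two steps, assuming hypothetical input/output file names, the <code>hg19ToHg38.over.chain.gz</code> chain file, and a <code>liftOver</code> binary on the PATH; the actual scripts may differ.

<syntaxhighlight lang="python">
"""Sketch of the BED conversion and UCSC liftOver call."""
import csv
import subprocess

# 1_chr_pos_to_bed.py equivalent: write GRCh37 positions as BED (0-based, half-open).
with open("grch37_positions.tsv") as src, open("grch37.bed", "w") as bed:
    for chrom, pos in csv.reader(src, delimiter="\t"):
        start = int(pos) - 1          # BED start is 0-based
        bed.write(f"chr{chrom}\t{start}\t{start + 1}\n")

# 2_liftover.sh equivalent: liftOver <oldFile> <chainFile> <newFile> <unmapped>
subprocess.run(
    ["liftOver", "grch37.bed", "hg19ToHg38.over.chain.gz",
     "grch38.bed", "unmapped.bed"],
    check=True,
)
</syntaxhighlight>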
 
==== Conversion ====
* <code>1_generate_cancer_do_json.py</code>
* <code>2_parse_gff.py</code>
* <code>3_1_ensp_to_uniprot_from_glygen.py</code>
* <code>3_2_ensp_to_uniprot_api.sh</code>
* <code>4_1_canonical_yes_no.py</code>
* <code>4_2_merge_canonical.py</code>
* <code>5_compare_fasta.py</code>
* <code>6_1_create_dict_faster.py</code>


=== Procedure ===
<code>1_generate_cancer_do_json.py</code> is a standalone script; the rest of the scripts must be run sequentially.


==== Summary ====
The Python script <code>1_generate_cancer_do_json.py</code> takes the JSON dictionary downloaded in Step 1, which maps study IDs to cancer types, and uses it as an intermediary to map study IDs to the DO cancer slim terms listed in the Description at the top of this page.


The rest of the conversion proceeds in this order:
* <code>2_parse_gff.py</code>


Parses the annotation database at <code>downloads/ensembl/Homo_sapiens.GRCh38.113.db</code> using the <code>gffutils</code> Python package (see the [https://github.com/daler/gffutils gffutils GitHub page]). Each genomic position in the raw JSON files is assigned its corresponding Ensembl protein ID (prefix <code>ENSP</code>). Outputs <code>chr_pos_to_ensp.tsv</code>.
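As an illustration, here is a short sketch of how a position can be assigned a protein ID with <code>gffutils</code>. The database path matches the one above; the example position and the reliance on the <code>protein_id</code> attribute of Ensembl CDS features are assumptions.

<syntaxhighlight lang="python">
"""Sketch of position -> ENSP lookup against the pre-built gffutils database."""
import gffutils

db = gffutils.FeatureDB("downloads/ensembl/Homo_sapiens.GRCh38.113.db")

def ensp_for(chrom, pos):
    """Return the protein_id of the first CDS overlapping chrom:pos, else None."""
    for cds in db.region(seqid=chrom, start=pos, end=pos, featuretype="CDS"):
        return cds.attributes.get("protein_id", [None])[0]  # e.g. "ENSP00000..."
    return None

print(ensp_for("1", 155235252))  # hypothetical GRCh38 position
</syntaxhighlight>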


* <code>3_1_ensp_to_uniprot_from_glygen.py</code>


Takes <code>chr_pos_to_ensp.tsv</code> and maps each ENSP ID to its corresponding UniProt accession number using <code>downloads/glygen/human_protein_transcriptlocus.csv</code> from [https://data.glygen.org/GLY_000135 GlyGen].
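A sketch of this join using <code>pandas</code> is shown below; the column names for <code>human_protein_transcriptlocus.csv</code> are assumptions for illustration, not the file's confirmed headers.

<syntaxhighlight lang="python">
"""Sketch of the GlyGen-based ENSP -> UniProt join (column names assumed)."""
import pandas as pd

positions = pd.read_csv("chr_pos_to_ensp.tsv", sep="\t")  # assumed 'ensp' column
glygen = pd.read_csv("downloads/glygen/human_protein_transcriptlocus.csv")

# Join ENSP IDs to UniProt accessions; rows with no match fall out of the
# inner merge and are handed to the UniProt API step (3_2) instead.
mapped = positions.merge(
    glygen[["peptide_id", "uniprotkb_ac"]],   # assumed column names
    left_on="ensp", right_on="peptide_id", how="inner",
)
mapped.to_csv("chr_pos_to_uniprot.tsv", sep="\t", index=False)
</syntaxhighlight>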
* <code>3_2_ensp_to_uniprot_api.sh</code>


Maps the remaining ENSP IDs, those not found in <code>human_protein_transcriptlocus.csv</code>, to UniProt accession numbers using the UniProt ID mapping API.
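For illustration, the same submit/poll/fetch flow against the UniProt ID mapping REST API in Python; the real step is a shell script, and the result shape for the UniProtKB target is assumed from the public API documentation.

<syntaxhighlight lang="python">
"""Sketch of ENSP -> UniProt mapping via https://rest.uniprot.org/idmapping."""
import time
import requests

ids = ["ENSP00000269305", "ENSP00000288602"]  # example ENSP IDs

# 1. Submit the mapping job.
job = requests.post(
    "https://rest.uniprot.org/idmapping/run",
    data={"from": "Ensembl_Protein", "to": "UniProtKB", "ids": ",".join(ids)},
).json()

# 2. Poll until the job finishes (the status key is absent once results exist).
while True:
    status = requests.get(
        f"https://rest.uniprot.org/idmapping/status/{job['jobId']}"
    ).json()
    if status.get("jobStatus") in (None, "FINISHED"):
        break
    time.sleep(2)

# 3. Fetch results: each entry pairs an ENSP ID with a UniProt entry.
results = requests.get(
    f"https://rest.uniprot.org/idmapping/results/{job['jobId']}"
).json()
for row in results["results"]:
    print(row["from"], row["to"]["primaryAccession"])  # result shape assumed
</syntaxhighlight>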


* <code>4_1_canonical_yes_no.py</code>


Filters out non-canonical UniProt accession numbers.


* <code>4_2_merge_canonical.py</code>


Merges the filtered outputs of <code>3_1_ensp_to_uniprot_from_glygen.py</code> and <code>3_2_ensp_to_uniprot_api.sh</code>.


* <code>5_compare_fasta.py</code>


Compares canonical FASTA sequences to their Ensembl counterparts.
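A minimal sketch of such a comparison, assuming two FASTA files keyed by matching accessions (the file names are hypothetical):

<syntaxhighlight lang="python">
"""Sketch of the canonical-vs-Ensembl sequence check with Biopython."""
from Bio import SeqIO

uniprot = {r.id: str(r.seq) for r in SeqIO.parse("uniprot_canonical.fasta", "fasta")}
ensembl = {r.id: str(r.seq) for r in SeqIO.parse("ensembl_peptides.fasta", "fasta")}

for acc, seq in uniprot.items():
    if acc not in ensembl:
        print(f"{acc}: missing from Ensembl set")
    elif ensembl[acc] != seq:
        # A mismatch means amino acid positions cannot be trusted for this
        # accession, so it would be flagged or dropped downstream.
        print(f"{acc}: sequence mismatch")
</syntaxhighlight>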


* <code>6_1_create_dict_faster.py</code>


Creates the dictionary used in the combining step.


== Convert: CIVIC ==

=== Scripts ===
* genomic liftover > <code>convert_civic_vcf.py</code> > <code>map_civic_csv.py</code>

=== Procedure ===

'''Perform liftover of mutations from GRCh37 to GRCh38'''

==== Summary ====

The most recent data release for CIVIC is aligned to the GRCh37 human reference genome. For this update, we are using the human reference genome GRCh38.

To convert coordinates between the two reference genomes, we use a liftover tool to remap the genomic coordinates. The CIVIC file is very small, so the [https://useast.ensembl.org/Homo_sapiens/Tools/AssemblyConverter?db=core Ensembl online Assembly Converter] can be used.

Run the downloaded VCF through the tool with the default parameters (change the file type to VCF), then redownload the transformed VCF and use it for the next step.

'''Run convert_civic_vcf.py'''

==== Summary ====

The python script <code>convert_civic_vcf.py</code> will convert the VCF-formatted file to a CSV file.

In the VCF format, each mutation line can contain multiple annotations and annotation-specific information. The output CSV will contain only one annotation per line, with the associated annotation-specific information. So that the script knows how the mutation and annotation fields are structured, a schema describing the fields is provided to it.

==== Example Line Transformation ====

'''Input VCF lines'''
 mutation A info | mutation A annotation 1 info | mutation A annotation 2 info
 mutation B info | mutation B annotation 1 info | mutation B annotation 2 info | mutation B annotation 3 info

'''Output CSV lines'''
 mutation A info,annotation 1 info
 mutation A info,annotation 2 info
 mutation B info,annotation 1 info
 mutation B info,annotation 2 info
 mutation B info,annotation 3 info

==== Script Specifications ====

The script must be called from the command line and takes specific command line arguments:

'''Input'''
* -i: A path to the CIVIC VCF file
* -p: A prefix used for naming the output files
* -o: A path to the output folder, where the mutation data CSV will go

'''Output'''
* A CSV file with mutation data

==== Usage ====
<code>python convert_civic_vcf.py -h</code>

Gives a description of the necessary commands.

<code>python convert_civic_vcf.py -i <path/input_file.vcf> -s <path/schema.json> -o <path/></code>

Runs the script with the given input VCF and outputs a CSV file.
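A minimal sketch of this flattening, assuming the annotations arrive as a comma-separated <code>ANN</code>-style key in the VCF INFO column and a hypothetical schema layout; the real script's schema format may differ.

<syntaxhighlight lang="python">
"""Sketch of one-annotation-per-row VCF -> CSV flattening."""
import csv
import json

with open("schema.json") as f:
    ann_fields = json.load(f)["annotation_fields"]  # assumed schema layout

with open("civic.vcf") as vcf, open("civic.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["chrom", "pos", "ref", "alt", *ann_fields])
    for line in vcf:
        if line.startswith("#"):          # skip metadata and header lines
            continue
        chrom, pos, _id, ref, alt, _qual, _filt, info = line.split("\t")[:8]
        info_d = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
        # One output row per annotation attached to the mutation.
        for ann in info_d.get("ANN", "").split(","):
            if ann:
                writer.writerow([chrom, pos, ref, alt, *ann.split("|")])
</syntaxhighlight>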
=== Run map_civic_csv.py ===


==== Summary ====

The python script <code>map_civic_csv.py</code> will take the CSV output of <code>convert_civic_vcf.py</code> and:


* Map the data to:
** uniprot accessions
** doid parent terms
* Rename fields
* Reformat fields:
** amino acid change and position
** chromosome id
** genomic location
** nucleotide change
* Remove indels
* Transform NA values
==== Script Specifications ====

The script must be called from the command line and takes specific command line arguments:


'''Input'''
* -i: A path to the CIVIC CSV file
* -m: A path to the folder containing mapping files
* -d: The name of the doid mapping file
* -e: The name of the ensp to uniprot accession mapping file
* -o: A path to the output folder

'''Output'''
* A CSV file with mutation data mapped to doid terms and uniprot accessions

==== Usage ====
<code>python map_civic_csv.py -h</code>

Gives a description of the necessary commands.

<code>python map_civic_csv.py -i <path/input_file.csv> -m <path/mapping_folder> -d <doid_mapping_file_name> -e <ensp_mapping_file_name> -o <path/></code>
Runs the script with the given input CSV and outputs a CSV with mutations mapped to doid terms and uniprot accessions.


[[Additional notes- CIVIC]]


== Convert: COSMIC ==

=== Scripts ===
* <code>map_cosmic_tsv.py</code>

=== Procedure ===

'''Run map_cosmic_tsv.py'''

==== Summary ====

The python script <code>map_cosmic_tsv.py</code> will take the COSMIC TSV from the download step and:
* Map the data to:
** uniprot accessions
** doid parent terms
* Rename fields
* Reformat fields:
** amino acid change and position
** chromosome id
** genomic location
** nucleotide change

==== Script Specifications ====

The script must be called from the command line and takes specific command line arguments.

'''Input'''
* -i: A path to the cosmic tsv mutation file
* -m: A path to the folder containing mapping files
* -d: The name of the doid to cosmic cancer type mapping file
* -e: The name of the enst to uniprot accession mapping file
* -o: A path to the folder to export the final mapped mutations

'''Output'''
* A mutation file with COSMIC mutations mapped to doid terms and uniprot accessions

==== Usage ====
<code>python map_cosmic_tsv.py -h</code>

Gives a description of the necessary commands.

<code>python map_cosmic_tsv.py -i <path/cosmic_file_name.tsv> -m <path/mapping_folder> -d <doid_mapping_file_name> -e <enst_mapping_file_name> -o <path/output_folder></code>

Runs the script with the given input file and exports the mapped mutation file.


[[Additional Notes for COSMIC]]


== Step 3: Combine ==
In the combine step, all resources are combined into a master dataset. The scripts are located at <code>pipeline/combine_step3</code>.


=== Scripts ===
* <code>combine_cbio.py</code>
* <code>combine_csv.py</code>


=== Procedure ===


'''Run <code>combine_cbio.py</code>, then <code>combine_csv.py</code>'''


==== Summary ====

All of the mutation data for each source was converted to a standardized data structure in the convert step. Now all of these separate CSV files (one per source) are combined into a master CSV file.

All CSV files to be combined should be in one folder together, with no additional CSV files.

==== Script Specifications ====

The script must be called from the command line and takes specific command line arguments.

'''Input'''
* -i: The folder containing CSV mutation files to combine
* -o: The folder to output the combined mutation file

'''Output'''
* A CSV file combining all CSV files in a given folder

==== Usage ====
<code>python combine_csv.py -h</code>

Gives a description of the necessary commands.

<code>python combine_csv.py -i <path/> -o <path/></code>

Runs the script with the given folder and combines all CSV files in that folder.
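A minimal sketch of what the combining amounts to, assuming every converted CSV shares the same header (the paths are hypothetical):

<syntaxhighlight lang="python">
"""Sketch of the combine step: concatenate per-source CSVs into one master file."""
import csv
import glob

paths = sorted(glob.glob("converted/*.csv"))  # one CSV per source, nothing else

with open("biomuta_master.csv", "w", newline="") as out:
    writer = None
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            if writer is None:            # write the shared header once
                writer = csv.writer(out)
                writer.writerow(header)
            writer.writerows(reader)      # append the source's rows
</syntaxhighlight>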
= Final Fields =