Originally updated by Ned Cauley (August 2022); currently maintained by Maria Kim (September 2024).

This page will contain an updated version of this BioMuta documentation page.

Description

The BioMuta pipeline gathers mutation data from various sources and combines them into a single dataset under a common field structure.

The sources covered on this page are TCGA, CIViC, COSMIC, and ICGC.

BioMuta gathers mutation data for the following cancers:

Urinary Bladder Cancer (DOID:11054)
Breast Cancer (DOID:1612)
Colorectal Cancer (DOID:9256)
Esophageal Cancer (DOID:5041)
Head and Neck Cancer (DOID:11934)
Kidney Cancer (DOID:263)
Liver Cancer (DOID:3571)
Lung Cancer (DOID:1324)
Prostate Cancer (DOID:10283)
Stomach Cancer (DOID:10534)
Thyroid Gland Cancer (DOID:1781)
Uterine Cancer (DOID:363)
Cervical Cancer (DOID:4362)
Brain Cancer (DOID:1319)
Hematologic Cancer (DOID:2531)
Adrenal Gland Cancer (DOID:3953)
Pancreatic Cancer (DOID:1793)
Ovarian Cancer (DOID:2394)
Skin Cancer (DOID:4159)

Running the Pipeline

To run the BioMuta pipeline, download the scripts from the HIVE Lab GitHub repo: GW HIVE BioMuta Repository.

Pipeline Overview

Step 1: Download

In the download step, mutation lists are downloaded from each source. Refer to each individual source below for download instructions.

Download: TCGA

Annotated variant files are downloaded from the ISB-CGC BigQuery repository.

Field descriptions for the BigQuery output are available in field_names_descriptions.csv. Additional field descriptions are available in the GDC docs.

The list of studies used in TCGA can be found here: List of TCGA studies.

Gain access to data

There are two parts to obtaining data from TCGA:

1. Primary TCGA data

Available at NCI Genomic Data Commons.
Accessible through the ISB-CGC BigQuery repository (more info TBA). For complete documentation, see the ISB-CGC Read the Docs pages.

2. TCGA controlled-access data

Hosted at dbGaP. For information on how to get access, see Sharepoint.

Run downloader R script using R Studio

Required script: TCGA_mutation_download.R

Run each line one after the other, instead of the whole script at once.

After loading the library with library(bigrquery), the call to bq_project_query() (later in the script) will open a browser to log in with Google credentials.

Use the Google account that is registered for an ISB-CGC project and has dbGaP authorization.
After logging in, a token will be saved so that you can log in through R Studio instead.

This script will download all mutation data for TCGA.

Since the downloaded file is very large, there might be issues running this script. If this is the case, run the following scripts in the tcga folder:

Each of these scripts downloads a subset of the TCGA studies, so that each downloaded file is smaller.
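
The idea of downloading the studies in smaller sets can be sketched as follows. This is a minimal illustration in Python, not the R scripts from the tcga folder: the helper names `batch_projects` and `build_query`, the batch size, the table placeholder, and the `project_short_name` column are all assumptions made for the example.

```python
from typing import Iterator, List

# Hypothetical placeholder: substitute the BigQuery table used by the download script.
TABLE = "<project>.<dataset>.<table>"


def batch_projects(projects: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` project names."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(projects), size):
        yield projects[start:start + size]


def build_query(projects: List[str], table: str = TABLE) -> str:
    """Build one SQL query restricted to the given projects.

    The column name `project_short_name` is an assumption for illustration.
    """
    quoted = ", ".join(f"'{name}'" for name in projects)
    return f"SELECT * FROM `{table}` WHERE project_short_name IN ({quoted})"


if __name__ == "__main__":
    studies = ["TCGA-BRCA", "TCGA-LUAD", "TCGA-LUSC", "TCGA-PRAD", "TCGA-STAD"]
    # One query per batch keeps each result set smaller than a full download.
    for batch in batch_projects(studies, size=2):
        print(build_query(batch))
```

Each query could then be run and saved separately, and the resulting files processed one at a time in the convert step.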

Additional information can be found here: TCGA Additional Information.

Download: CIViC

A VCF for the monthly release of accepted variants was downloaded from: https://civicdb.org/releases/main

Download: COSMIC

There are three COSMIC mutation datasets for coding mutations:

COSMIC Complete Mutation Data (Targeted Screens)

A tab separated table of the complete curated COSMIC dataset (targeted screens) from the current release. It includes all coding point mutations, and the negative data set.

COSMIC Mutation Data (Genome Screens)

A tab separated table of coding point mutations from genome wide screens (including whole exome sequencing).

COSMIC Mutations Data

A tab separated table of all COSMIC coding point mutations from targeted and genome wide screens from the current release.

The COSMIC Mutations Data set was chosen because it combines both the Targeted and Genome Screens.

Downloaded File: COSMIC_SNPs_June_2022.tsv

NOTE: Downloading the mutation datasets requires a COSMIC login. An account can be created for free with an academic email address.

Fields

The COSMIC dataset contains a large number of fields, many of which were filtered out in order to speed up processing in subsequent steps.

A ‘simplified’ version of the file was created by selecting specific columns from the original downloaded file with the command-line tool **awk**.
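
The exact awk command is not recorded on this page. As a rough equivalent, the column selection could be sketched in Python as below. The function name `select_columns` is hypothetical, and the column names are taken from the simplified field list on this page; their exact spelling in the downloaded header is an assumption.

```python
import csv
from typing import Iterable, Iterator, List

# Column names from the simplified field list; header spelling is assumed.
KEEP = [
    "Accession Number",
    "Sample name",
    "Primary site",
    "Mutation CDS",
    "Mutation AA",
    "Mutation genome position",
]


def select_columns(lines: Iterable[str], keep: List[str] = KEEP) -> Iterator[List[str]]:
    """Yield the header and rows of a tab-separated table, restricted to `keep`."""
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader, None)
    if header is None:  # empty input
        return
    missing = [name for name in keep if name not in header]
    if missing:
        raise KeyError(f"columns not found in header: {missing}")
    indexes = [header.index(name) for name in keep]
    yield list(keep)
    for row in reader:
        yield [row[i] for i in indexes]


if __name__ == "__main__":
    # GENE1 is a made-up gene name used only for this demonstration.
    sample = [
        "Gene name\tAccession Number\tMutation AA",
        "GENE1\tENST00000404621.5\tp.S215*",
    ]
    for row in select_columns(sample, keep=["Accession Number", "Mutation AA"]):
        print("\t".join(row))
```

Because rows are streamed one at a time, the same function can be applied to an open file handle for the full download without loading it into memory.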

Fields in Simplified Version

Field Name	Example
Accession Number	ENST00000404621.5
Sample name	H_LV-3334-1316090
Primary site	breast
Mutation CDS	c.644C>G
Mutation AA	p.S215*
Mutation genome position	12:124466234-124466234
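
To make the structure of the `Mutation AA` and `Mutation genome position` values concrete, the sketch below splits them into their parts. It is an illustration only: the names `parse_aa_change` and `parse_genome_position` are hypothetical, and the patterns handle only simple single-residue substitutions and `chromosome:start-end` coordinates, which is an assumption about the values that need to be handled.

```python
import re
from typing import NamedTuple, Optional


class AAChange(NamedTuple):
    ref: str
    position: int
    alt: str


class GenomePosition(NamedTuple):
    chromosome: str
    start: int
    end: int


# Simple substitutions only, e.g. p.S215* (one-letter codes, '*' for stop).
_AA_PATTERN = re.compile(r"^p\.([A-Z*])(\d+)([A-Z*])$")
# chromosome:start-end, e.g. 12:124466234-124466234
_POS_PATTERN = re.compile(r"^([0-9A-Za-z]+):(\d+)-(\d+)$")


def parse_aa_change(value: str) -> Optional[AAChange]:
    """Return the parts of a simple amino acid substitution, or None."""
    match = _AA_PATTERN.match(value)
    if match is None:
        return None
    return AAChange(match.group(1), int(match.group(2)), match.group(3))


def parse_genome_position(value: str) -> Optional[GenomePosition]:
    """Return chromosome, start, and end, or None if the format differs."""
    match = _POS_PATTERN.match(value)
    if match is None:
        return None
    return GenomePosition(match.group(1), int(match.group(2)), int(match.group(3)))


if __name__ == "__main__":
    print(parse_aa_change("p.S215*"))
    print(parse_genome_position("12:124466234-124466234"))
```

Returning `None` for values that do not match leaves it to the caller to decide whether such rows are skipped or reported.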

All Fields from COSMIC and Field Descriptions

From 'File Description' drop down menu below 'Cosmic Mutation Data' (on downloads page):

Field Name	Description
Gene name	The gene name for which the data has been curated.
Accession Number	The transcript identifier of the gene.
Gene CDS length	Length of the gene (base pair) counts.
HGNC id	If gene is in HGNC, this id helps linking it to HGNC.
Sample name	Sample id, Id tumor A assigned.
Primary Site	The primary tissue/cancer from which the sample originates.
Site Subtype 1	Further sub classification (level 1) of the sample’s tissue.

Download: ICGC

A VCF for release 28 was downloaded from https://dcc.icgc.org/releases/release_28/Summary

Downloaded File: simple_somatic_mutation.aggregated.vcf.gz
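
Both the CIViC and ICGC downloads are VCF files. As a minimal sketch of how such a file could be inspected before conversion, the Python snippet below reads the eight fixed VCF columns from a plain or gzip-compressed file. The function name `read_vcf` is hypothetical and this is not part of the BioMuta scripts.

```python
import gzip
from typing import Dict, Iterator

# The eight fixed columns defined by the VCF format.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def read_vcf(path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per data line with the eight fixed VCF columns."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("#"):  # meta-information and header lines
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < len(VCF_COLUMNS):
                continue  # skip empty or malformed lines
            yield dict(zip(VCF_COLUMNS, fields[:len(VCF_COLUMNS)]))
```

For example, `next(read_vcf("simple_somatic_mutation.aggregated.vcf.gz"))` would return the first record without reading the whole file.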

Step 2: Convert

In the convert step, all resources are formatted to the BioMuta standard for both data and field structure.

For each resource, a dedicated script converts the raw format provided by the resource to a format aligned with past versions of the BioMuta pipeline.

With a common format, all resources can then be combined into a master dataset.

See the individual resource pages for details on the conversion:

Convert: TCGA

Scripts

process_tcga_download.py

Procedure

Run process_tcga_download.py

Summary

The Python script `process_tcga_download.py` will take the output of the TCGA download step and:

Map the data to:

 * UniProt accession
 * DOID parent terms

Rename fields
Reformat fields:

 * amino acid change and position
 * chromosome id

Filter out unnecessary fields
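
The steps listed above can be sketched for a single row as follows. This is not the code of `process_tcga_download.py`: the input and output field names, the made-up ENSP and UniProt identifiers, and the choice to drop rows without a mapping are all assumptions for illustration. Only the two project-to-DOID pairs are taken from the mapping table on this page.

```python
from typing import Dict, Optional

# Excerpt of the project-to-DOID mapping; the real mapping is read from a CSV file.
PROJECT_TO_DOID = {"TCGA-BRCA": "DOID:1612", "TCGA-LUAD": "DOID:1324"}
# Made-up identifiers standing in for the ENSP-to-UniProt mapping file.
ENSP_TO_UNIPROT = {"ENSP00000000001": "P00001-1"}
# Hypothetical input-to-output field names.
RENAME = {"Chromosome": "chr_id", "Start_Position": "start_pos"}


def normalize_chromosome(value: str) -> str:
    """Strip a leading 'chr' so that 'chr12' and '12' compare equal."""
    return value[3:] if value.lower().startswith("chr") else value


def convert_row(row: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Return a converted row, or None when a required mapping is missing."""
    doid = PROJECT_TO_DOID.get(row.get("project", ""))
    uniprot = ENSP_TO_UNIPROT.get(row.get("ensp", ""))
    if doid is None or uniprot is None:
        return None  # assumption: unmapped rows are dropped
    # Rename fields; anything not listed in RENAME is filtered out.
    out = {new: row[old] for old, new in RENAME.items() if old in row}
    if "chr_id" in out:
        out["chr_id"] = normalize_chromosome(out["chr_id"])
    out["do_id"] = doid
    out["uniprot_ac"] = uniprot
    return out


if __name__ == "__main__":
    example = {
        "project": "TCGA-BRCA",
        "ensp": "ENSP00000000001",
        "Chromosome": "chr12",
        "Start_Position": "100",
        "unused_field": "x",
    }
    print(convert_row(example))
```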

Script Specifications

The script must be called from the command line and takes specific command line arguments.

Input

-i: A path to the input CSV to reformat
-m: A path to the folder containing all mappings
-d: A path to the TCGA study to DOID mapping file
-e: A path to the ENSP to UniProt mapping file
-o: A path to the output folder
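
For readers unfamiliar with command-line scripts, the sketch below shows how the five flags listed above could be declared with Python's standard `argparse` module. The flag letters come from this page; the destination names (`input_csv`, `mapping_folder`, and so on) and the decision to make every flag required are assumptions, and the actual script may define its arguments differently.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Declare the five flags described in the Input section."""
    parser = argparse.ArgumentParser(
        description="Reformat a TCGA download for BioMuta (illustrative sketch)."
    )
    parser.add_argument("-i", dest="input_csv", required=True,
                        help="path to the input CSV to reformat")
    parser.add_argument("-m", dest="mapping_folder", required=True,
                        help="path to the folder containing all mappings")
    parser.add_argument("-d", dest="doid_mapping", required=True,
                        help="path to the TCGA study to DOID mapping file")
    parser.add_argument("-e", dest="ensp_mapping", required=True,
                        help="path to the ENSP to UniProt mapping file")
    parser.add_argument("-o", dest="output_folder", required=True,
                        help="path to the output folder")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse `argv`, or the process arguments when `argv` is None."""
    return build_parser().parse_args(argv)


# Demonstration with explicit, made-up paths instead of real command-line input.
demo = parse_args(["-i", "in.csv", "-m", "mapping/", "-d", "doid.csv",
                   "-e", "ensp.csv", "-o", "out/"])
print(demo.input_csv, demo.output_folder)
```

With a parser declared this way, the `-h` flag is generated automatically and prints the help text for each argument.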

Output

A data report comparing new AA sites to old AA sites for BioMuta

Usage

`python process_tcga_download.py -h`

 *Gives a description of the necessary commands*

`python process_tcga_download.py -i <path/input_file.csv> -m <path/> -d <doid_mapping.csv> -e <ensp_mapping.csv> -o <path/>`

 *Runs the script with the given input TCGA CSV and outputs a formatted CSV*

Additional Notes

All the mapping files are available in the repository folder: `pipeline/convert_step2/mapping`

The mapping files used for converting TCGA are:

DOID:

`tcga_doid_mapping.csv`

TCGA projects were mapped to DOID parent terms using the following table (generated from a previous BioMuta mapping):

DO_slim_id	DO_slim_name	TCGA_project
DOID:5041	esophageal cancer	TCGA-ESCA
DOID:2531	hematologic cancer	TCGA-DLBC
DOID:9256	colorectal cancer	TCGA-READ
DOID:1319	brain cancer	TCGA-GBM
DOID:1319	brain cancer	TCGA-LGG
DOID:1781	thyroid cancer	TCGA-THCA
DOID:11054	urinary bladder cancer	TCGA-BLCA
DOID:363	uterine cancer	TCGA-UCEC
DOID:169	neuroendocrine tumor	TCGA-PCPG
DOID:4362	cervical cancer	TCGA-CESC
DOID:363	uterine cancer	TCGA-UCS
DOID:3277	thymus cancer	TCGA-THYM
DOID:3571	liver cancer	TCGA-LIHC
DOID:11934	head and neck cancer	TCGA-HNSC
DOID:2174	ocular cancer	TCGA-UVM
DOID:4159	skin cancer	TCGA-SKCM
DOID:9256	colorectal cancer	TCGA-COAD
DOID:3953	adrenal gland cancer	TCGA-ACC
DOID:1793	pancreatic cancer	TCGA-PAAD
DOID:2994	germ cell cancer	TCGA-TGCT
DOID:1324	lung cancer	TCGA-LUSC
DOID:1790	malignant mesothelioma	TCGA-MESO
DOID:2394	ovarian cancer	TCGA-OV
DOID:1115	sarcoma	TCGA-SARC
DOID:263	kidney cancer	TCGA-KIRP
DOID:263	kidney cancer	TCGA-KICH
DOID:10534	stomach cancer	TCGA-STAD
DOID:2531	hematologic cancer	TCGA-LAML
DOID:10283	prostate cancer	TCGA-PRAD
DOID:1324	lung cancer	TCGA-LUAD
DOID:1612	breast cancer	TCGA-BRCA
DOID:263	kidney cancer	TCGA-KIRC

UniProt Accession:

`human_protein_transcriptlocus.csv`

The peptide ID (starts with ENSP) was mapped to the UniProt isoform accession.

Mapping was NOT performed to the UniProt canonical accession, because this caused an issue in the final dataset in which a mutation for the same canonical accession would be listed with different amino acid changes.
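
The issue with canonical accessions can be illustrated with a small, made-up example. The accessions and amino acid changes below are invented, and treating the canonical accession as the isoform accession without its `-N` suffix is a simplification used only for this sketch.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple

# Made-up records: (UniProt isoform accession, amino acid change) for one mutation
# that falls at different positions in two isoforms of the same protein.
RECORDS: List[Tuple[str, str]] = [
    ("P00001-1", "p.S215*"),
    ("P00001-2", "p.S180*"),
]


def canonical(accession: str) -> str:
    """Drop the isoform suffix, e.g. 'P00001-2' -> 'P00001' (simplification)."""
    return accession.split("-", 1)[0]


def changes_by_key(records: List[Tuple[str, str]],
                   key_func: Callable[[str], str]) -> Dict[str, Set[str]]:
    """Group amino acid changes by the accession returned from `key_func`."""
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for accession, change in records:
        grouped[key_func(accession)].add(change)
    return dict(grouped)


by_isoform = changes_by_key(RECORDS, lambda accession: accession)
by_canonical = changes_by_key(RECORDS, canonical)

# Each isoform accession carries exactly one change ...
print(by_isoform)
# ... while the single canonical accession carries two conflicting changes.
print(by_canonical)
```

Keeping the isoform accession as the key keeps each amino acid change attached to the sequence it was computed against.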

Step 3: Combine

In the combine step, all resources are combined into a master dataset.
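
No script for this step is described on this page. As a minimal sketch of the idea, assuming every converted resource already shares the same field names, the rows could be concatenated as shown below. The function name `combine`, the added `source` field, and the example field names are hypothetical.

```python
from typing import Dict, Iterable, List


def combine(sources: Dict[str, Iterable[Dict[str, str]]],
            fields: List[str]) -> List[Dict[str, str]]:
    """Concatenate rows from all sources, checking the common field structure."""
    expected = set(fields)
    combined: List[Dict[str, str]] = []
    for name, rows in sources.items():
        for row in rows:
            if set(row) != expected:
                differing = sorted(set(row) ^ expected)
                raise ValueError(f"{name}: fields differ from the common structure: {differing}")
            # Record which resource each row came from (assumed extra field).
            combined.append({**row, "source": name})
    return combined


if __name__ == "__main__":
    converted = {
        "tcga": [{"uniprot_ac": "P00001-1", "aa_change": "p.S215*"}],
        "cosmic": [{"uniprot_ac": "P00001-1", "aa_change": "p.S215*"}],
    }
    for row in combine(converted, fields=["uniprot_ac", "aa_change"]):
        print(row)
```

Failing early on a mismatched field set makes a conversion problem in one resource visible before it reaches the master dataset.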
