== BioXpress Downloader Step ==
'''Step 1 of the BioXpress pipeline'''
The downloader step will use sample sheets obtained from the GDC Data Portal to download raw counts from RNA-Seq for Primary Tumor and Normal Tissue in all available TCGA studies.
=== General Flow of Scripts ===
get_data_all_samples.sh -> get_hits_into_dir.py ->
merge_files_tumor_and_normal.sh
=== Procedure ===
'''Downloader Step 1: Get sample list files from the GDC Data Portal'''
==== Summary ====
Sample sheets are downloaded from the GDC data portal and used by the downstream scripts to obtain read count files.
==== Method ====
1. Go to the [https://portal.gdc.cancer.gov/ GDC Repository].
2. Click on the button labeled '''Advanced Search''' on the upper right of the repository home page.
- All filters can also be selected manually using the search tree on the left side of the page at the link above.
- To select a files filter or a cases filter, that tab must be selected on the search bar.
3. To get the Primary Tumor samples, enter the following query in the query box:
- files.analysis.workflow_type in ["HTSeq - Counts"] and files.data_type in ["Gene Expression Quantification"]
and cases.samples.sample_type in ["Primary Tumor"] and cases.project.program.name in ["TCGA"]
4. Click '''Submit Query'''.
5. On the search results screen, click '''Add All Files To Cart'''. Then select the '''Cart''' on the upper right of the page.
6. Click '''Sample Sheet''' from the Cart page to download the Sample Sheet for the Primary Tumor samples.
- Rename the sample sheet to avoid overwriting it when downloading the Normal Tissue samples. Add '''tumor''' or '''normal''' to the filenames as needed.
7. Repeat the process for Normal Tissue samples with the following query:
- files.analysis.workflow_type in ["HTSeq - Counts"] and files.data_type in ["Gene Expression Quantification"]
and cases.samples.sample_type in ["Solid Tissue Normal"] and cases.project.program.name in ["TCGA"]
8. Move both downloaded sample sheets to the server directory:
/data/projects/bioxpress/$version/downloads/
- Use a version increment for the new run (e.g., v-5.0) if the latest version is v-4.0.
'''Downloader Step 2: Run the script get_data_all_samples.sh'''
==== Summary ====
The shell script <code>get_data_all_samples.sh</code> provides arguments to the Python script <code>get_data_all_samples.py</code>. It generates a log file used for creating directories and for filtering out TCGA studies with low sample numbers.
==== Method ====
1. Edit the hard-coded paths in the script. The shell script calls the Python script once for the tumor samples and once for the normal samples, so for both tumor and normal you will need to specify the path to the appropriate sample sheet and the path to the log file.
2. Run the shell script:
sh get_data_all_samples.sh
==== Output ====
After the script completes, you will have a folder for each TCGA study with read count files compressed as results.tar.gz.
You will also have three log files:
* One for Tumor samples
* One for Normal samples
* A combined log file named get_data_all_samples.log
'''Downloader Step 3: Run the script get_hits_into_dir.py'''
==== Summary ====
The Python script <code>get_hits_into_dir.py</code> decompresses read count files and uses the log file to filter out TCGA studies with fewer than 10 Normal Tissue samples. Count files are generated and labeled as intermediate because they will be further manipulated in later steps.
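The cutoff described above can be sketched as a small filter. This is an illustrative sketch only: the real log format written by <code>get_data_all_samples.sh</code> may differ, so the tab-separated "study, sample type, count" layout assumed here is hypothetical.

```python
MIN_NORMAL_SAMPLES = 10

def studies_to_keep(log_lines):
    """Return TCGA studies whose Solid Tissue Normal count meets the cutoff."""
    normal_counts = {}
    for line in log_lines:
        # assumed layout: study <TAB> sample_type <TAB> sample_count
        study, sample_type, count = line.rstrip("\n").split("\t")
        if sample_type == "Solid Tissue Normal":
            normal_counts[study] = int(count)
    return sorted(s for s, n in normal_counts.items() if n >= MIN_NORMAL_SAMPLES)

log = [
    "TCGA-BRCA\tSolid Tissue Normal\t113",
    "TCGA-OV\tSolid Tissue Normal\t0",
    "TCGA-LIHC\tSolid Tissue Normal\t50",
]
```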
==== Method ====
Edit the hard-coded paths in the script:
* Line ~12:
 with open("/data/projects/bioxpress/$version/downloads/get_data_all_samples.log", 'r') as f:
* Line ~44:
 topDir = "/data/projects/bioxpress/$version/downloads/"
Run the Python script:
<code>python get_hits_into_dir.py</code>
==== Output ====
For each TCGA study, a folder named
 $study_$sampletype_intermediate
will be created, containing the read count files.
'''Downloader Step 4: Run the script merge_files_tumor_and_normal.sh'''
==== Summary ====
The shell script <code>merge_files_tumor_and_normal.sh</code> provides arguments to the Python script <code>merge_files_tumor_and_normal.py</code>. It merges all read count files for Tumor and Normal samples into a single read count file per study.
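The merge described above amounts to turning one count file per sample into one gene-by-sample table per study. A minimal sketch, with the per-sample counts inlined as dicts rather than read from <code>in_dir</code> (the real script's file handling and column order may differ):

```python
def merge_counts(samples):
    """samples: {sample_id: {gene: count}} -> (header, rows) for one study."""
    sample_ids = sorted(samples)
    genes = sorted({g for counts in samples.values() for g in counts})
    header = ["gene"] + sample_ids
    # genes missing from a sample default to a count of 0
    rows = [[g] + [samples[s].get(g, 0) for s in sample_ids] for g in genes]
    return header, rows
```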
==== Method ====
Edit the paths for the variables <code>in_dir</code> and <code>out_dir</code> in the script.
Run the shell script:
<code>sh merge_files_tumor_and_normal.sh</code>
==== Output ====
The <code>out_dir</code> will contain:
* One read count file for each study.
* One category file indicating whether a sample ID corresponds to Primary Tumor or Solid Tissue Normal.
For checking sample names against previous versions, all lists and logs are moved to:
 downloads/v-5.0/sample_lists
== BioXpress Annotation Step ==
'''Step 2 of the BioXpress pipeline'''
=== General Flow of Scripts ===
merge_per_study.sh -> merge_per_tissue.py -> split_per_case.py
=== Procedure ===
'''Annotation Step 1: Run the script merge_per_study.sh'''
==== Summary ====
The shell script <code>merge_per_study.sh</code> provides arguments to the Python script <code>merge_per_study.py</code>. This step maps all ENSG IDs to gene symbols based on a set of mapping files. It also filters out microRNA genes. The steps for creating the mapping files are described in the annotation README.
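The mapping-and-filtering rule above can be sketched with a plain dict standing in for the mapping files. This is a sketch for intuition only: the ENSG IDs are made up, and the real script's handling of the three mapping files is more involved.

```python
def annotate(counts, ensg_to_symbol, mirna_symbols):
    """Collapse ENSG-keyed counts to symbol-keyed counts, dropping microRNAs."""
    out = {}
    for ensg, n in counts.items():
        symbol = ensg_to_symbol.get(ensg)
        if symbol is None or symbol in mirna_symbols:
            continue  # unmapped IDs and microRNA genes are filtered out
        # transcripts mapping to the same symbol are merged by summing
        out[symbol] = out.get(symbol, 0) + n
    return out
```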
==== Method ====
The mapping files are available in the folder:
 /annotation/mapping_files/
These files should be moved to a similar path in the version of your run of BioXpress.
The required mapping files include:
* <code>mart_export.txt</code>
* <code>mart_export_remap_retired.txt</code>
* <code>new_mappings.txt</code>
Edit the hard-coded paths in the script <code>merge_per_study.sh</code>:
* Specify the <code>in_dir</code> as the folder containing the final output of the Downloader step, including count and category files per study.
* Specify the <code>out_dir</code> so that it is now under the top folder generated/annotation instead of downloads.
* Specify the location of the mapping files moved in the previous sub-step.
==== Output ====
All ENSG IDs in the counts files have been replaced by gene symbols in new count files located in the <code>out_dir</code>. Transcripts have also been merged per gene and microRNA genes filtered out. The categories files remain the same but are copied over to the annotation folder.
'''Annotation Step 2: Run the script merge_per_tissue.py'''
==== Summary ====
The Python script <code>merge_per_tissue.py</code> takes all files created by the script <code>merge_per_study.sh</code> and merges them based on the file <code>tissues.csv</code>, which assigns TCGA studies to specific tissue terms.
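The role of <code>tissues.csv</code> can be sketched as a simple grouping step. The two-column "study_id,tissue_id" layout assumed here is hypothetical; the real file's columns may differ.

```python
import csv
import io

def studies_per_tissue(tissues_csv_text):
    """Group study IDs by tissue ID from an assumed two-column CSV."""
    groups = {}
    for study_id, tissue_id in csv.reader(io.StringIO(tissues_csv_text)):
        groups.setdefault(tissue_id, []).append(study_id)
    return groups
```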
==== Method ====
Download the file <code>tissues.csv</code> from the previous version of BioXpress at:
<code>/data/projects/bioxpress/$version/generated/misc/tissues.csv</code>
Place it in a similar folder in the version of your run of BioXpress.
Edit the hard-coded paths in <code>merge_per_tissue.py</code>, replacing the version with the one for your current run of BioXpress:
* Line ~23: <code>in_file = "/data/projects/bioxpress/v$version/generated/misc/tissues.csv"</code>
* Line ~36: <code>out_file_one = "/data/projects/bioxpress/v-5.0/generated/annotation/per_tissue/%s.htseq.counts" % (tissue_id)</code>
* Line ~37: <code>out_file_two = "/data/projects/bioxpress/v-5.0/generated/annotation/per_tissue/%s.categories" % (tissue_id)</code>
* Line ~45: <code>in_file = "/data/projects/bioxpress/v-5.0/generated/annotation/per_study/%s.categories" % (study_id)</code>
* Line ~52: <code>in_file = "/data/projects/bioxpress/v-5.0/generated/annotation/per_study/%s.htseq.counts" % (study_id)</code>
Run the python script:
<code>python merge_per_tissue.py</code>
==== Output ====
Read count and category files are generated for each tissue specified in the <code>tissues.csv</code> file.
'''Annotation Step 3: Run the script split_per_case.py'''
==== Summary ====
The Python script <code>split_per_case.py</code> takes case and sample IDs from the sample sheets downloaded from the GDC data portal and splits annotation data so that there is one folder per case with only that case's annotation data.
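As the Output note below this step explains, only cases with both a tumor and a normal sample get a per-case folder. That pairing rule can be sketched as follows, with the sample-sheet rows reduced to hypothetical (case_id, sample_type) pairs:

```python
def cases_with_pairs(rows):
    """rows: (case_id, sample_type) pairs taken from the GDC sample sheets."""
    seen = {}
    for case_id, sample_type in rows:
        seen.setdefault(case_id, set()).add(sample_type)
    # keep only cases that have both a tumor and a normal sample
    return sorted(case for case, types in seen.items()
                  if {"Primary Tumor", "Solid Tissue Normal"} <= types)
```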
==== Method ====
Edit the hard-coded paths in <code>split_per_case.py</code>, replacing the version with the one for your current run of BioXpress:
* Line ~29: <code>in_file = "/data/projects/bioxpress/v-5.0/generated/misc/studies.csv"</code>
* Line ~38: <code>in_file = "/data/projects/bioxpress/v-5.0/downloads/sample_list_from_gdc/gdc_sample_sheet.primary_tumor.tsv"</code> (also update the name of the sample sheet for tumor samples downloaded from the GDC data portal)
* Line ~57: <code>in_file = "/data/projects/bioxpress/v-5.0/downloads/sample_list_from_gdc/gdc_sample_sheet.solid_tissue_normal.tsv"</code> (also update the name of the sample sheet for normal samples downloaded from the GDC data portal)
* Line ~81: <code>out_file_one = "/data/projects/bioxpress/v-5.0/generated/annotation/per_case/%s.%s.htseq.counts" % (study_id,case_id)</code>
* Line ~82: <code>out_file_two = "/data/projects/bioxpress/v-5.0/generated/annotation/per_case/%s.%s.categories" % (study_id,case_id)</code>
* Line ~85: <code>in_file = "/data/projects/bioxpress/v-5.0/generated/annotation/per_study/%s.htseq.counts" % (study_id)</code>
Run the python script:
<code>python split_per_case.py</code>
==== Output ====
A folder is generated for each case ID that has both a tumor sample and a normal tissue sample. Two files are generated per case: read counts and categories. These files are needed to run DESeq per case.
== BioXpress DESeq Step ==
'''Step 3 of the BioXpress pipeline'''
=== General Flow of Scripts ===
<code>run_per_study.py</code> -> <code>run_per_tissue.py</code> -> <code>run_per_case.py</code>
=== Procedure ===
==== DESeq Step 1: Run the script <code>run_per_study.sh</code> ====
'''Summary'''
The python script <code>run_per_study.py</code> provides arguments to the R script <code>deseq.R</code>. The count and category files generated from the Annotation step are used to calculate differential expression and statistical significance. The result is a series of files per study, including the normalized reads (DESeq normalization method), the DE results and significance, and QC files such as the PCA plot.
* '''Note:''' This step is time consuming (~2-3 hours of run time)
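The normalization itself happens inside <code>deseq.R</code>, but the idea behind DESeq's size factors (median-of-ratios normalization) can be sketched in plain Python for intuition. This is a simplified sketch, not the pipeline's actual implementation:

```python
import math

def size_factors(counts):
    """counts: one {gene: count} dict per sample; returns one factor per sample."""
    genes = sorted({g for sample in counts for g in sample})
    # reference: per-gene geometric mean across samples (genes with any
    # zero count are skipped, as in DESeq)
    ref = {}
    for g in genes:
        vals = [sample.get(g, 0) for sample in counts]
        if all(v > 0 for v in vals):
            ref[g] = math.exp(sum(math.log(v) for v in vals) / len(vals))
    factors = []
    for sample in counts:
        # each sample's size factor is the median ratio to the reference
        ratios = sorted(sample[g] / ref[g] for g in ref)
        mid = len(ratios) // 2
        median = ratios[mid] if len(ratios) % 2 else (ratios[mid - 1] + ratios[mid]) / 2
        factors.append(median)
    return factors
```

Normalized counts are then the raw counts divided by their sample's size factor, which is what <code>deSeq_reads_normalized.csv</code> reports after DESeq's own implementation of this idea.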
'''Method'''
Edit the hard-coded paths in the script <code>run_per_study.py</code>:
* Specify the <code>in_dir</code> to be the folder containing the final per-study output files of the Annotation step
* Specify the <code>out_dir</code>
* Ensure that the file <code>list_files/studies.csv</code> contains all of the studies you wish to process - '''Note:''' the studies can be run separately (in the event that 2-3 hours cannot be dedicated to running all the studies at once) by creating separate dat files with specific studies to run
Run the shell script:
<code>sh run_per_study.sh</code>
* '''Note:''' the R libraries specified in <code>deseq.R</code> will need to be installed if running on a new server or system, as these installations are not included in the scripts
'''Output'''
A set of files:
* log file
* <code>deSeq_reads_normalized.csv</code> - Normalized read counts (DESeq normalization method applied)
* <code>results_significance.csv</code> - log2fc differential expression results and statistical significance (t-test)
* <code>dispersion.png</code>
* <code>distance_heatmap.png</code>
* <code>pca.png</code> - Principal component analysis plot, important for observing how well the Primary Tumor and Solid Tissue Normal samples group together
==== DESeq Step 2: Run the script <code>run_per_tissue.sh</code> ====
'''Summary'''
The python script <code>run_per_tissue.py</code> provides arguments to the R script <code>deseq.R</code>. The count and category files generated from the Annotation step are used to calculate differential expression and statistical significance. The result is a series of files per tissue, including the normalized reads (DESeq normalization method), the DE results and significance, and QC files such as the PCA plot.
* '''Note:''' This step is time consuming (~2-3 hours of run time)
'''Method'''
Edit the hard-coded paths in the script <code>run_per_tissue.py</code>:
* Specify the <code>in_dir</code> to be the folder containing the final per-tissue output files of the Annotation step
* Specify the <code>out_dir</code>
* Ensure that the file <code>list_files/tissue.dat</code> contains all of the tissues you wish to process - '''Note:''' the tissues can be run separately by creating specific dat files
Run the shell script:
<code>sh run_per_tissue.sh</code>
'''Output'''
A set of files:
* log file
* <code>deSeq_reads_normalized.csv</code> - Normalized read counts (DESeq normalization method applied)
* <code>results_significance.csv</code> - log2fc differential expression results and statistical significance (t-test)
* <code>dispersion.png</code>
* <code>distance_heatmap.png</code>
* <code>pca.png</code> - Principal component analysis plot
==== DESeq Step 3: Run the script <code>run_per_case.sh</code> ====
'''Summary'''
The python script <code>run_per_case.py</code> provides arguments to the R script <code>deseq.R</code>. The count and category files generated from the Annotation step are used to calculate differential expression and statistical significance. The result is a series of files per case, including the normalized reads (DESeq normalization method), the DE results and significance, and QC files such as the PCA plot.
* '''Note:''' This step is time consuming (~2-3 hours of run time)
'''Method'''
Edit the hard-coded paths in the script <code>run_per_case.py</code>:
* Specify the <code>in_dir</code> to be the folder containing the final per-case output files of the Annotation step
* Specify the <code>out_dir</code>
* Ensure that the file <code>list_files/cases.csv</code> contains all of the cases you wish to process
Run the shell script:
<code>sh run_per_case.sh</code>
'''Output'''
A set of files:
* log file
* <code>deSeq_reads_normalized.csv</code> - Normalized read counts (DESeq normalization method applied)
* <code>results_significance.csv</code> - log2fc differential expression results and statistical significance (t-test)
* <code>dispersion.png</code>
* <code>distance_heatmap.png</code>
* <code>pca.png</code> - Principal component analysis plot
== BioXpress Publisher Step ==
'''Step 4 of the BioXpress pipeline'''
=== General Flow of Scripts ===
<code>de-publish-per-study.py</code> -> <code>de-publish-per-tissue.py</code>
=== Procedure ===
==== Publisher Step 1: Run the script <code>de-publish-per-study.py</code> ====
'''Summary'''
The python script <code>de-publish-per-study.py</code> takes the output from running DESeq in the previous step for each TCGA study and combines it into one master file.
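The combine step boils down to tagging each study's DESeq rows with the study ID and appending them into one table. A sketch under assumed column names (the script's actual header and the DESeq output columns may differ):

```python
import csv
import io

def combine_studies(per_study):
    """per_study: {study_id: CSV text}; returns one master CSV string."""
    out = io.StringIO()
    writer = csv.writer(out)
    # illustrative header; not the script's actual column set
    writer.writerow(["study", "gene", "log2fc", "padj"])
    for study_id in sorted(per_study):
        for row in csv.DictReader(io.StringIO(per_study[study_id])):
            writer.writerow([study_id, row["gene"], row["log2fc"], row["padj"]])
    return out.getvalue()
```

The real script additionally joins in the disease ontology, UniProt accession, and RefSeq mappings listed in the Method below.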
'''Method'''
Edit the hard-coded paths in the script <code>de-publish-per-study.py</code>:
* Specify the <code>in_file</code> for the disease ontology mapping file (line ~26)
* Specify the <code>in_file</code> for the UniProt accession ID (protein ID) mapping file (line ~40)
* Specify the <code>in_file</code> for the RefSeq mapping file (line ~51)
* Specify the <code>in_file</code> for the list of TCGA studies to include in the final output (line ~72)
* Specify the <code>deseq_dir</code> for the folder containing all DESeq output (line ~80)
* Specify the path to write the output (line ~135)
Run the python script:
<code>python de-publish-per-study.py</code>
'''Output'''
A CSV file with the DESeq output for all TCGA studies, mapped to DO IDs, UniProt accession IDs, and RefSeq IDs. The path is specified in the script as one of the hard-coded lines edited during the method.
==== Publisher Step 2: Run the script <code>de-publish-per-tissue.py</code> ====
'''Summary'''
The python script <code>de-publish-per-tissue.py</code> takes the output from running DESeq in the previous step for each tissue and combines it into one master file.
'''Method'''
Edit the hard-coded paths in the script <code>de-publish-per-tissue.py</code>:
* Specify the <code>in_file</code> for the disease ontology mapping file (line ~26)
* Specify the <code>in_file</code> for the UniProt accession ID (protein ID) mapping file (line ~40)
* Specify the <code>in_file</code> for the RefSeq mapping file (line ~51)
* Specify the <code>in_file</code> for the list of tissues to include in the final output (line ~72)
* Specify the <code>deseq_dir</code> for the folder containing all DESeq output (line ~80)
* Specify the path to write the output (line ~135)
Run the python script:
<code>python de-publish-per-tissue.py</code>
'''Output'''
A CSV file with the DESeq output for all tissues, mapped to DO IDs, UniProt accession IDs, and RefSeq IDs. The path is specified in the script as one of the hard-coded lines edited during the method.
== Major Changes from v-4.0 ==
[[Major updates to the BioXpress from the previous version (v-4.0)]]
== Post-processing for OncoMX and Glygen ==
''Processing done for integration of BioXpress data into OncoMX and Glygen.''
=== Processing for OncoMX ===
The final output from BioXpress v-5.0 is available on the OncoMX-tst server at the path:
<code>/software/pipeline/integrator/downloads/bioxpress/v-5.0/</code>
For OncoMX, <code>de_per_tissue.csv</code> is used to report gene expression per tissue; however, [https://data.oncomx.org data.oncomx.org] hosts both per-tissue and per-study datasets. The files are processed with the recipe pipeline. The recipes filter for all genes that are successfully mapped to UniProtKB accession IDs.
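The recipes' filter reduces to keeping rows with a non-empty UniProtKB accession. A sketch; the column name <code>uniprotkb_ac</code> is an assumption, not the recipes' actual field name:

```python
def keep_mapped(rows):
    """Keep only rows that carry a non-empty UniProtKB accession."""
    return [row for row in rows if row.get("uniprotkb_ac", "").strip()]
```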
'''Recipes'''
<code>human_cancer_mRNA_expression_per_study.json</code>
<code>human_cancer_mRNA_expression_per_tissue.json</code>
The output is available on the OncoMX-tst server at the path:
<code>/software/pipeline/integrator/unreviewed</code>
'''Final output files'''
<code>human_cancer_mRNA_expression_per_study.csv</code>
<code>human_cancer_mRNA_expression_per_tissue.csv</code>
=== Processing for Glygen ===
The final output from BioXpress v-5.0 was modified to align with the previous input for cancer gene expression and now includes the following columns:
* '''pmid'''
* '''sample_name'''
** Same as DOID and name
* '''parent_doid'''
** Same as DOID
** All DOIDs in v-5.0 are parent terms
* '''parent_doname'''
** Same as DOID and name
** All DOIDs in v-5.0 are parent terms
* '''sample_id'''
** Taken from the previous version; the origin of these numbers is unclear
The following mapping for the column <code>sample_id</code> was recovered from the previous version and mapped to DOIDs present in v-5.0:
{| class="wikitable"
! sample_name !! sample_id
|-
| DOID:10283 / Prostate cancer [PCa] || 42
|-
| DOID:10534 / Stomach cancer [Stoca] || 19
|-
| DOID:11054 / Urinary bladder cancer [UBC] || 34
|-
| DOID:11934 / Head and neck cancer [H&NC] || 46
|-
| DOID:1612 / Breast cancer [BRCA] || 70
|-
| DOID:1781 / Thyroid cancer [Thyca] || 16
|-
| DOID:234 / Colon adenocarcinoma || 3
|-
| DOID:263 / Kidney cancer [Kidca] & Kidney renal clear cell carcinoma || 61
|-
| DOID:3571 / Liver cancer [Livca] || 60
|-
| DOID:3907 / Lung squamous cell carcinoma || 33
|-
| DOID:3910 / Lung adenocarcinoma || 53
|-
| DOID:4465 / Papillary renal cell carcinoma || 57
|-
| DOID:4471 / Chromophobe adenocarcinoma || 23
|-
| DOID:5041 / Esophageal cancer [EC] || 32
|}
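In code, the table above is just a DOID-keyed lookup used to attach the legacy <code>sample_id</code> to v-5.0 rows. A sketch with two entries shown for brevity (the <code>parent_doid</code> field name is an assumption based on the column list above):

```python
# legacy sample_id keyed by DOID, two entries from the table shown above
SAMPLE_ID_BY_DOID = {
    "DOID:10283": 42,  # Prostate cancer [PCa]
    "DOID:1612": 70,   # Breast cancer [BRCA]
}

def attach_sample_id(row):
    """Return a copy of the row with the legacy sample_id looked up by DOID."""
    row = dict(row)
    row["sample_id"] = SAMPLE_ID_BY_DOID.get(row["parent_doid"])
    return row
```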
The processed file for Glygen is available on the glygen-vm-dev server at:
<code>/software/pipeline/integrator/downloads/bioxpress/August_2021/human_cancer_mRNA_expression_per_tissue_glygen.csv</code>
Latest revision as of 14:49, 17 October 2024
BioXpress Downloader Step
Step 1 of the BioXpress pipeline
The downloader step will use sample sheets obtained from GDC Data Portal to download raw counts from RNA-Seq for Primary Tumor and Normal Tissue in all available TCGA Studies.
General Flow of Scripts
get_data_all_samples.sh -> get_hits_into_dir.py ->
merge_files_tumor_and_normal.sh
Procedure
Downloader Step 1: Get sample list files from the GDC Data Portal
Summary
Sample sheets are downloaded from the GDC data portal and used for the downstream scripts to obtain read count files.
Method
1. Go to the GDC Repository.
2. Click on the button labeled Advanced Search on the upper right of the repository home page.
- All filters can also be selected manually using the search tree on the left side of the page at the link above. - To select a files filter or a cases filter, that tab must be selected on the search bar.
3. To get the Primary Tumor samples, enter the following query in the query box:
- files.analysis.workflow_type in ["HTSeq - Counts"] and files.data_type in ["Gene Expression Quantification"]
and cases.samples.sample_type in ["Primary Tumor"] and cases.project.program.name in ["TCGA"]
4. Click Submit Query.
5. On the search results screen, click Add All Files To Cart. Then select the Cart on the upper right of the page.
6. Click Sample Sheet from the Cart page to download the Sample Sheet for the Primary Tumor samples.
- Rename the sample sheet to avoid overwriting it when downloading the Normal Tissue samples. Add tumor or normal to the filenames as needed.
7. Repeat the process for Normal Tissue samples with the following query:
- files.analysis.workflow_type in ["HTSeq - Counts"] and files.data_type in ["Gene Expression Quantification"]
and cases.samples.sample_type in ["Solid Tissue Normal"] and cases.project.program.name in ["TCGA"]
8. Move both downloaded sample sheets to the server directory:
/data/projects/bioxpress/$version/downloads/
- Use a version increment for the new run (e.g., v-5.0) if the latest version is v-4.0.
Downloader Step 2: Run the script get_data_all_samples.sh
Summary
The shell script `get_data_all_samples.sh` provides arguments to the Python script `get_data_all_samples.py`. It generates a log file for creating directories and filtering out TCGA studies with low sample numbers.
Method
1. The shell script will call the python script once for the tumor samples and once for the normal sample, so for both tumor and normal you will need to specify the path to the appropriate sample sheet and the path to the log file. Edit the hard-coded paths in the script:
The shell script will call the python script once for the tumor samples and once for the normal sample, so for both tumor and normal you will need to specify the path to the appropriate sample sheet and the path to the log file
2. Run the shell script:
sh get_data_all_samples.sh
Output
After the script completes, you will have a folder for each TCGA study with read count files compressed as results.tar.gz. You will also have three log files: One for Tumor samples One for Normal samples A combined log file named get_data_all_samples.log.
Downloader Step 3: Run the script get_hits_into_dir.py
Summary
The Python script get_hits_into_dir.py decompresses read count files and uses the log file to filter out TCGA studies with fewer than 10 Normal Tissue samples. Count files are generated and labeled as intermediate because they will be further manipulated in later Steps
Method
Edit the hard-coded paths in the script:
Line ~12:
with open("/data/projects/bioxpress/$version/downloads/get_data_all_samples.log", 'r') as f:
Line ~44:
topDir = "/data/projects/bioxpress/$version/downloads/"
Run the Python script: python get_hits_into_dir.py
Output
For each TCGA study, a folder named $study_$sampletype_intermediate will be created, containing the read count files.
Downloader Step 4: Run the script merge_files_tumor_and_normal.sh
Summary
The shell script merge_files_tumor_and_normal.sh provides arguments to the Python script merge_files_tumor_and_normal.py. It merges all read count files for Tumor and Normal samples into a single read count file per study.
Method
Edit the paths for variables in_dir and out_dir in the script. Run the shell script: sh merge_files_tumor_and_normal.sh
Output
The out_dir will contain: One read count file for each study. One category file indicating whether a sample ID corresponds to Primary Tumor or Solid Tissue Normal. For checking sample names from previous versions, all lists and logs are moved to:
downloads/v-5.0/sample_lists
BioXpress Annotation Step
Step 2 of the BioXpress pipeline
General Flow of Scripts
merge_per_study.sh -> merge_per_tissue.py -> split_per_case.py
Procedure
Annotation Step 1: Run the script merge_per_study.sh
Summary
The shell script `merge_per_study.sh` provides arguments to the Python script `merge_per_study.py`. This step maps all ENSG IDs to gene symbols based on a set of mapping files. It will also filter out microRNA genes. The steps for creating the mapping files are described in the annotation README.
Method
The mapping files are available in the folder:
/annotation/mapping_files/
These files should be moved to a similar path in the version of your run of BioXpress.
The required mapping files include:
- `mart_export.txt`
- `mart_export_remap_retired.txt`
- `new_mappings.txt`
Edit the hard-coded paths in the script `merge_per_study.sh`:
- Specify the `in_dir` as the folder containing the final output of the Downloader step, including count and category files per study.
- Specify the `out_dir` so that it is now in the top folder: generated/annotation instead of:
downloads
- Specify the location of the mapping files downloaded in the previous sub-step.
Output
All ENSG IDs in the counts files have been replaced by gene symbols in new count files located in the `out_dir`. Transcripts have also been merged per gene and microRNA genes filtered out. The categories files remain the same but are copied over to the annotation folder.
Annotation Step 2: Run the script merge_per_tissue.py
Summary
The Python script `merge_per_tissue.py` takes all files created by the script `merge_per_study.sh` and merges these files based on the file `tissues.csv`, which assigns TCGA studies to specific tissues terms.
Method
Download the file `tissues.csv` from the previous version of BioXpress at:
/data/projects/bioxpress/$version/generated/misc/tissues.csv
Place it in a similar folder in the version of your run of BioXpress.
Edit the hard-coded paths in `merge_per_tissue.py`:
- **Edit the line (~23):**
in_file = "/data/projects/bioxpress/v$version/generated/misc/tissues.csv"
with the version for your current run of BioXpress.
- **Edit the line (~36):**
out_file_one = "/data/projects/bioxpress/v-5.0/generated/annotation/per_tissue/%s.htseq.counts" % (tissue_id)
with the version for your current run of BioXpress.
- **Edit the line (~37):**
out_file_two = "/data/projects/bioxpress/v-5.0/generated/annotation/per_tissue/%s.categories" % (tissue_id)
with the version for your current run of BioXpress.
- **Edit the line (~45):**
in_file = "/data/projects/bioxpress/v-5.0/generated/annotation/per_study/%s.categories" % (study_id)
with the version for your current run of BioXpress.
- **Edit the line (~52):**
in_file = "/data/projects/bioxpress/v-5.0/generated/annotation/per_study/%s.htseq.counts" % (study_id)
with the version for your current run of BioXpress.
Run the python script python merge_per_tissue.py
Output
Read count and category files are generated for each tissue specified in the tissues.csv file.
Annotation Step 3: Run the script split_per_case.py
Summary
The python script `split_per_case.py` takes case and sample IDs from the sample sheets downloaded from the GDC data portal and splits annotation data so that there is one folder per case with only that case’s annotation data.
Method
Edit the hard-coded paths in `split_per_case.py`
- Edit the line (line ~29)
in_file = "/data/projects/bioxpress/v-5.0/generated/misc/studies.csv"
with the version for your current run of BioXpress - Edit the line (line ~38)
in_file = "/data/projects/bioxpress/v-5.0/downloads/sample_list_from_gdc/gdc_sample_sheet.primary_tumor.tsv"
with the version for your current run of BioXpress as well as the same of the sample sheet for tumor samples downloaded from the GDC data portal - Edit the line (line ~57)
in_file = "/data/projects/bioxpress/v-5.0/downloads/sample_list_from_gdc/gdc_sample_sheet.solid_tissue_normal.tsv"
with the version for your current run of BioXpress as well as the same of the sample sheet for normal samples downloaded from the GDC data portal - Edit the line (line ~81)
out_file_one = "/data/projects/bioxpress/v-5.0/generated/annotation/per_case/%s.%s.htseq.counts" % (study_id,case_id)
with the version for your current run of BioXpress - Edit the line (line ~82)
out_file_two = "/data/projects/bioxpress/v-5.0/generated/annotation/per_case/%s.%s.categories" % (study_id,case_id)
with the version for your current run of BioXpress - Edit the line (line ~85)
in_file = "/data/projects/bioxpress/v-5.0/generated/annotation/per_study/%s.htseq.counts" % (study_id)
with the version for your current run of BioXpress
Run the python script:
python split_per_case.py
Output
A folder is generated for each case ID that has a tumor sample and a normal tissue sample. Two files are generated per case: read counts and categories. These files are needed to run DESeq per case.
BioXpress DESeq Step
Step 3 of the BioXpress pipeline
General Flow of Scripts
run_per_study.py
-> run_per_tissue.py
-> run_per_case.py
Procedure
DESeq Step 1: Run the script run_per_study.sh
Summary
The python script run_per_study.py
provides arguments to the R script deseq.R
. The count and category files generated from the Annotation step are used to calculate differential expression and statistical significance. The result is a series of files per tissue including the normalized reads (DESeq normalization method), the DE results and significance, and QC files such as the PCA plot.
- Note: This step is time consuming (~2-3 hours of run time)
==== Method ====
Edit the hard-coded paths in the script <code>run_per_study.py</code>:
- Specify the <code>in_dir</code> to be the folder containing the final per-study output files of the Annotation step
- Specify the <code>out_dir</code>
- Ensure that the file <code>list_files/studies.csv</code> contains all of the studies you wish to process
- Note: the studies can be run separately (in the event that 2-3 hours cannot be dedicated to running all the studies at once) by creating separate list files with specific studies to run
Run the shell script:
 sh run_per_study.sh
- Note: the R libraries specified in <code>deseq.R</code> will need to be installed if running on a new server or system, as these installations are not included in the scripts
==== Output ====
A set of files:
- log file
- <code>deSeq_reads_normalized.csv</code> - Normalized read counts (DESeq normalization method applied)
- <code>results_significance.csv</code> - log2fc differential expression results and statistical significance (t-test)
- <code>dispersion.png</code>
- <code>distance_heatmap.png</code>
- <code>pca.png</code> - Principal component analysis plot, important for observing how well the Primary Tumor and Solid Tissue Normal samples group together
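A common follow-up is pulling the significantly differentially expressed genes out of <code>results_significance.csv</code>. The sketch below assumes columns named <code>gene</code>, <code>log2fc</code>, and <code>padj</code>; check the header of your actual DESeq output, since the real column names may differ.

```python
# Sketch: filter a results_significance.csv-style table for significant,
# strongly changed genes. Column names are assumptions for illustration.
import csv
import io

def significant_genes(csv_text, padj_cutoff=0.05, lfc_cutoff=1.0):
    rows = csv.DictReader(io.StringIO(csv_text))
    return [row["gene"] for row in rows
            if float(row["padj"]) < padj_cutoff
            and abs(float(row["log2fc"])) >= lfc_cutoff]

demo = "gene,log2fc,padj\nBRCA1,2.3,0.001\nTP53,0.2,0.3\nEGFR,-1.8,0.01\n"
print(significant_genes(demo))  # ['BRCA1', 'EGFR']
```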
=== DESeq Step 2: Run the script run_per_tissue.sh ===

==== Summary ====
The python script <code>run_per_tissue.py</code> provides arguments to the R script <code>deseq.R</code>. The count and category files generated in the Annotation step are used to calculate differential expression and statistical significance. The result is a series of files per tissue, including the normalized reads (DESeq normalization method), the DE results and significance, and QC files such as the PCA plot.
- Note: This step is time consuming (~2-3 hours of run time)
==== Method ====
Edit the hard-coded paths in the script <code>run_per_tissue.py</code>:
- Specify the <code>in_dir</code> to be the folder containing the final per-tissue output files of the Annotation step
- Specify the <code>out_dir</code>
- Ensure that the file <code>list_files/tissue.dat</code> contains all of the tissues you wish to process
- Note: the tissues can be run separately by creating dat files with specific tissues to run
Run the shell script:
 sh run_per_tissue.sh
==== Output ====
A set of files:
- log file
- <code>deSeq_reads_normalized.csv</code> - Normalized read counts (DESeq normalization method applied)
- <code>results_significance.csv</code> - log2fc differential expression results and statistical significance (t-test)
- <code>dispersion.png</code>
- <code>distance_heatmap.png</code>
- <code>pca.png</code> - Principal component analysis plot
=== DESeq Step 3: Run the script run_per_case.sh ===

==== Summary ====
The python script <code>run_per_case.py</code> provides arguments to the R script <code>deseq.R</code>. The count and category files generated in the Annotation step are used to calculate differential expression and statistical significance. The result is a series of files per case, including the normalized reads (DESeq normalization method), the DE results and significance, and QC files such as the PCA plot.
- Note: This step is time consuming (~2-3 hours of run time)
==== Method ====
Edit the hard-coded paths in the script <code>run_per_case.py</code>:
- Specify the <code>in_dir</code> to be the folder containing the final per-case output files of the Annotation step
- Specify the <code>out_dir</code>
- Ensure that the file <code>list_files/cases.csv</code> contains all of the cases you wish to process
Run the shell script:
 sh run_per_case.sh
==== Output ====
A set of files:
- log file
- <code>deSeq_reads_normalized.csv</code> - Normalized read counts (DESeq normalization method applied)
- <code>results_significance.csv</code> - log2fc differential expression results and statistical significance (t-test)
- <code>dispersion.png</code>
- <code>distance_heatmap.png</code>
- <code>pca.png</code> - Principal component analysis plot
== BioXpress Publisher Step ==
Step 4 of the BioXpress pipeline.

==== General Flow of Scripts ====
 de-publish-per-study.py -> de-publish-per-tissue.py

==== Procedure ====
=== Publisher Step 1: Run the script de-publish-per-study.py ===

==== Summary ====
The python script <code>de-publish-per-study.py</code> takes the output from running DESeq in the previous step for each TCGA study and combines it into one master file.
==== Method ====
Edit the hard-coded paths in the script <code>de-publish-per-study.py</code>:
- Specify the <code>in_file</code> for the disease ontology mapping file (line ~26)
- Specify the <code>in_file</code> for the UniProt accession ID (protein ID) mapping file (line ~40)
- Specify the <code>in_file</code> for the RefSeq mapping file (line ~51)
- Specify the <code>in_file</code> for the list of TCGA studies to include in the final output (line ~72)
- Specify the <code>deseq_dir</code> for the folder containing all DESeq output (line ~80)
- Specify the path to write the output (line ~135)
Run the python script:
 python de-publish-per-study.py

==== Output ====
A csv file with the DESeq output for all TCGA studies, mapped to DO IDs, UniProt accession IDs, and RefSeq IDs. The output path is one of the hard-coded lines edited during the method.
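The shape of that combination step can be illustrated with a small, hypothetical sketch; the real mapping-file formats and column layout in <code>de-publish-per-study.py</code> may differ.

```python
# Illustrative sketch of the publisher join: attach a DOID (per study) and
# a UniProt accession (per gene) to each DESeq result row. The mapping
# structures are simplified assumptions, not the real file formats.
def publish_rows(deseq_rows, study_to_doid, gene_to_uniprot):
    combined = []
    for study, gene, log2fc in deseq_rows:
        acc = gene_to_uniprot.get(gene)
        if acc is None:
            continue  # downstream recipes keep only genes mapped to UniProtKB
        combined.append((study_to_doid.get(study, ""), acc, gene, log2fc))
    return combined

rows = [("TCGA-BRCA", "BRCA1", 2.3), ("TCGA-BRCA", "NOVELX", 0.5)]
print(publish_rows(rows, {"TCGA-BRCA": "DOID:1612"}, {"BRCA1": "P38398"}))
# [('DOID:1612', 'P38398', 'BRCA1', 2.3)]
```

Rows whose gene has no UniProt accession are dropped, which matches the filtering the OncoMX recipes apply later.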
=== Publisher Step 2: Run the script de-publish-per-tissue.py ===

==== Summary ====
The python script <code>de-publish-per-tissue.py</code> takes the output from running DESeq in the previous step for each tissue and combines it into one master file.

==== Method ====
Edit the hard-coded paths in the script <code>de-publish-per-tissue.py</code>:
- Specify the <code>in_file</code> for the disease ontology mapping file (line ~26)
- Specify the <code>in_file</code> for the UniProt accession ID (protein ID) mapping file (line ~40)
- Specify the <code>in_file</code> for the RefSeq mapping file (line ~51)
- Specify the <code>in_file</code> for the list of tissues to include in the final output (line ~72)
- Specify the <code>deseq_dir</code> for the folder containing all DESeq output (line ~80)
- Specify the path to write the output (line ~135)
Run the python script:
 python de-publish-per-tissue.py

==== Output ====
A csv file with the DESeq output for all tissues, mapped to DO IDs, UniProt accession IDs, and RefSeq IDs. The output path is one of the hard-coded lines edited during the method.
== Major Changes from v-4.0 ==
Major updates to BioXpress from the previous version (v-4.0).

== Post-processing for OncoMX and Glygen ==
Processing done for integration of BioXpress data into OncoMX and Glygen.
=== Processing for OncoMX ===
The final output from BioXpress v-5.0 is available on the OncoMX-tst server at the path:
 /software/pipeline/integrator/downloads/bioxpress/v-5.0/
For OncoMX, the <code>de_per_tissue.csv</code> file is used to report gene expression per tissue; however, data.oncomx.org hosts both per-tissue and per-study datasets. The files are processed with the recipe pipeline. The recipes filter for all genes that are successfully mapped to UniProtKB accession IDs.
==== Recipes ====
- <code>human_cancer_mRNA_expression_per_study.json</code>
- <code>human_cancer_mRNA_expression_per_tissue.json</code>

The output is available on the OncoMX-tst server at the path:
 /software/pipeline/integrator/unreviewed

==== Final output files ====
- <code>human_cancer_mRNA_expression_per_study.csv</code>
- <code>human_cancer_mRNA_expression_per_tissue.csv</code>
=== Processing for Glygen ===
The final output from BioXpress v-5.0 was modified to align with the previous input for cancer gene expression and now includes the following columns:
- <code>pmid</code>
- <code>sample_name</code> - Same as DOID and name
- <code>parent_doid</code> - Same as DOID (all DOIDs in v-5.0 are parent terms)
- <code>parent_doname</code> - Same as DOID and name (all DOIDs in v-5.0 are parent terms)
- <code>sample_id</code> - Taken from the previous version; the origin of these numbers is unclear
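Deriving these columns for a single record can be sketched as follows. This is a hypothetical illustration, not the actual processing script; <code>SAMPLE_IDS</code> is a two-entry excerpt of the recovered mapping table, and the pmid value in the example is a placeholder.

```python
# Hypothetical sketch of building the Glygen columns for one v-5.0 record.
# Because every DOID in v-5.0 is a parent term, parent_doid and
# parent_doname simply repeat the record's own DOID and name.
SAMPLE_IDS = {"DOID:1612": 70, "DOID:10283": 42}  # excerpt of the recovered mapping

def glygen_columns(doid, do_name, pmid):
    sample_name = "%s / %s" % (doid, do_name)
    return {
        "pmid": pmid,
        "sample_name": sample_name,
        "parent_doid": doid,           # same as DOID
        "parent_doname": sample_name,  # same as DOID and name
        "sample_id": SAMPLE_IDS.get(doid),
    }

# "12345678" is a placeholder pmid, not a real citation
print(glygen_columns("DOID:1612", "Breast cancer [BRCA]", "12345678")["sample_id"])  # 70
```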
The following mapping for the column <code>sample_id</code> was recovered from the previous version and mapped to DOIDs present in v-5.0:

{| class="wikitable"
! sample_name !! sample_id
|-
| DOID:10283 / Prostate cancer [PCa] || 42
|-
| DOID:10534 / Stomach cancer [Stoca] || 19
|-
| DOID:11054 / Urinary bladder cancer [UBC] || 34
|-
| DOID:11934 / Head and neck cancer [H&NC] || 46
|-
| DOID:1612 / Breast cancer [BRCA] || 70
|-
| DOID:1781 / Thyroid cancer [Thyca] || 16
|-
| DOID:234 / Colon adenocarcinoma || 3
|-
| DOID:263 / Kidney cancer [Kidca] & Kidney renal clear cell carcinoma || 61
|-
| DOID:3571 / Liver cancer [Livca] || 60
|-
| DOID:3907 / Lung squamous cell carcinoma || 33
|-
| DOID:3910 / Lung adenocarcinoma || 53
|-
| DOID:4465 / Papillary renal cell carcinoma || 57
|-
| DOID:4471 / Chromophobe adenocarcinoma || 23
|-
| DOID:5041 / Esophageal cancer [EC] || 32
|}
The processed file for Glygen is available on the glygen-vm-dev server at:
 /software/pipeline/integrator/downloads/bioxpress/August_2021/human_cancer_mRNA_expression_per_tissue_glygen.csv