BioXpress pipeline README

From HIVE Lab

Revision as of 14:23, 17 October 2024

BioXpress Downloader Step

Step 1 of the BioXpress pipeline

The downloader step uses sample sheets obtained from the GDC Data Portal to download RNA-Seq raw counts for Primary Tumor and Normal Tissue samples from all available TCGA studies.

General Flow of Scripts

get_data_all_samples.sh -> get_hits_into_dir.py -> merge_files_tumor_and_normal.sh

Procedure

Downloader Step 1: Get sample list files from the GDC Data Portal

Summary

Sample sheets are downloaded from the GDC data portal and used for the downstream scripts to obtain read count files.

Method

1. Go to the GDC Repository.

2. Click on the button labeled Advanced Search on the upper right of the repository home page.

  - All filters can also be selected manually using the search tree on the left side of the page at the link above.
  - To select a files filter or a cases filter, that tab must be selected on the search bar.

3. To get the Primary Tumor samples, enter the following query in the query box:

  - files.analysis.workflow_type in ["HTSeq - Counts"] and files.data_type in ["Gene Expression Quantification"] and cases.samples.sample_type in ["Primary Tumor"] and cases.project.program.name in ["TCGA"]

4. Click Submit Query.

5. On the search results screen, click Add All Files To Cart. Then select the Cart on the upper right of the page.

6. Click Sample Sheet from the Cart page to download the Sample Sheet for the Primary Tumor samples.

  - Rename the sample sheet to avoid overwriting it when downloading the Normal Tissue samples. Add tumor or normal to the filenames as needed.

7. Repeat the process for Normal Tissue samples with the following query:

  - files.analysis.workflow_type in ["HTSeq - Counts"] and files.data_type in ["Gene Expression Quantification"] and cases.samples.sample_type in ["Solid Tissue Normal"] and cases.project.program.name in ["TCGA"]


8. Move both downloaded sample sheets to the server directory: /data/projects/bioxpress/$version/downloads/

- Use a version increment for the new run (e.g., v-5.0) if the latest version is v-4.0.
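
The two queries above differ only in the sample type. As a minimal sketch (the helper name build_gdc_query is hypothetical and not part of the BioXpress scripts), the query text can be generated for either sample type and pasted into the Advanced Search box:

```python
# Sketch only: build_gdc_query is a hypothetical helper, not a BioXpress script.
# It reproduces the query text shown in the steps above for a given sample type.

def build_gdc_query(sample_type: str) -> str:
    """Return the GDC Advanced Search query for one sample type."""
    clauses = [
        'files.analysis.workflow_type in ["HTSeq - Counts"]',
        'files.data_type in ["Gene Expression Quantification"]',
        f'cases.samples.sample_type in ["{sample_type}"]',
        'cases.project.program.name in ["TCGA"]',
    ]
    return " and ".join(clauses)


tumor_query = build_gdc_query("Primary Tumor")
normal_query = build_gdc_query("Solid Tissue Normal")
print(tumor_query)
print(normal_query)
```

Generating both strings from one template avoids copy-and-paste differences between the tumor and normal queries.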

Downloader Step 2: Run the script get_data_all_samples.sh

Summary

The shell script `get_data_all_samples.sh` provides arguments to the Python script `get_data_all_samples.py`. It also generates a log file that records the directories created and is used later to filter out TCGA studies with low sample numbers.

Method

1. The shell script calls the Python script once for the tumor samples and once for the normal samples. For each of the two calls, edit the hard-coded paths in the shell script so that they point to the appropriate sample sheet and to the log file.

2. Run the shell script:

sh get_data_all_samples.sh


Output

After the script completes, you will have a folder for each TCGA study with read count files compressed as results.tar.gz. You will also have three log files:

  - One for Tumor samples
  - One for Normal samples
  - A combined log file named get_data_all_samples.log


Downloader Step 3: Run the script get_hits_into_dir.py

Summary

The Python script get_hits_into_dir.py decompresses read count files and uses the log file to filter out TCGA studies with fewer than 10 Normal Tissue samples. Count files are generated and labeled as intermediate because they will be further manipulated in later steps.
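
The 10-sample threshold can be illustrated with a short sketch. This is not how get_hits_into_dir.py works (that script reads the log file, whose format is not described here). Instead, the sketch counts rows per study directly from a Normal Tissue sample sheet, assuming a tab-separated file with a "Project ID" column. The function name and demo rows are made up for illustration.

```python
import csv
import io
from collections import Counter

MIN_NORMAL_SAMPLES = 10  # threshold stated in the summary above


def studies_to_keep(sample_sheet_file, min_samples=MIN_NORMAL_SAMPLES):
    """Return sorted study IDs that have at least min_samples rows.

    Assumes a tab-separated sheet with a "Project ID" column (an assumption
    about the sample sheet layout, not taken from the pipeline scripts).
    """
    reader = csv.DictReader(sample_sheet_file, delimiter="\t")
    counts = Counter(row["Project ID"] for row in reader)
    return sorted(study for study, n in counts.items() if n >= min_samples)


# Made-up in-memory sheet: 12 normal samples for one study, 3 for another.
rows = ["Project ID\tCase ID\tSample Type"]
rows += [f"TCGA-BRCA\tcase{i}\tSolid Tissue Normal" for i in range(12)]
rows += [f"TCGA-ACC\tcase{i}\tSolid Tissue Normal" for i in range(3)]
demo_sheet = io.StringIO("\n".join(rows) + "\n")

kept = studies_to_keep(demo_sheet)
print(kept)
```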

Method

Edit the hard-coded paths in the script:

Line ~12:

with open("/data/projects/bioxpress/$version/downloads/get_data_all_samples.log", 'r') as f:

Line ~44:

topDir = "/data/projects/bioxpress/$version/downloads/"

Run the Python script: python get_hits_into_dir.py
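
Both hard-coded paths above share the same version folder. As a sketch of one way to avoid editing them separately (the helper downloader_paths and the constant BASE_DIR are hypothetical and do not exist in the script), both can be derived from a single version string:

```python
import posixpath

# Hypothetical names for illustration; the real script hard-codes these paths.
BASE_DIR = "/data/projects/bioxpress"


def downloader_paths(version):
    """Return (top_dir, log_file) for one BioXpress version, e.g. "v-5.0"."""
    top_dir = posixpath.join(BASE_DIR, version, "downloads") + "/"
    log_file = posixpath.join(top_dir, "get_data_all_samples.log")
    return top_dir, log_file


top_dir, log_file = downloader_paths("v-5.0")
print(top_dir)
print(log_file)
```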

Output

For each TCGA study, a folder named $study_$sampletype_intermediate will be created, containing the read count files.

Downloader Step 4: Run the script merge_files_tumor_and_normal.sh

Summary

The shell script merge_files_tumor_and_normal.sh provides arguments to the Python script merge_files_tumor_and_normal.py. It merges all read count files for Tumor and Normal samples into a single read count file per study.

Method

1. Edit the paths for the variables in_dir and out_dir in the script.

2. Run the shell script: sh merge_files_tumor_and_normal.sh

Output

The out_dir will contain:

  - One read count file for each study.
  - One category file indicating whether a sample ID corresponds to Primary Tumor or Solid Tissue Normal.

For checking sample names from previous versions, all lists and logs are moved to: downloads/v-5.0/sample_lists
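
The merge can be pictured as building one gene-by-sample table plus a sample-to-category list. The sketch below is an illustration under assumptions, not the code of merge_files_tumor_and_normal.py. The function name, the in-memory input layout, and the placeholder gene and sample IDs are all hypothetical.

```python
def merge_counts(per_sample_counts, sample_types):
    """Merge per-sample counts into one table and one category list.

    per_sample_counts: {sample_id: {gene_id: count}}
    sample_types:      {sample_id: "Primary Tumor" or "Solid Tissue Normal"}
    Genes missing from a sample are filled with 0 (a choice of this sketch).
    """
    samples = sorted(per_sample_counts)
    genes = sorted(set().union(*(c.keys() for c in per_sample_counts.values())))
    count_rows = [["gene_id"] + samples]
    for gene in genes:
        count_rows.append([gene] + [per_sample_counts[s].get(gene, 0) for s in samples])
    category_rows = [["sample_id", "category"]]
    category_rows += [[s, sample_types[s]] for s in samples]
    return count_rows, category_rows


# Placeholder IDs, for illustration only.
per_sample = {
    "S1": {"ENSG_A": 5, "ENSG_B": 0},
    "S2": {"ENSG_A": 7, "ENSG_B": 3},
}
types = {"S1": "Primary Tumor", "S2": "Solid Tissue Normal"}

count_rows, category_rows = merge_counts(per_sample, types)
for row in count_rows:
    print("\t".join(str(value) for value in row))
```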

BioXpress Annotation Step

Step 2 of the BioXpress pipeline

General Flow of Scripts

merge_per_study.sh -> merge_per_tissue.py -> split_per_case.py

Procedure

Annotation Step 1: Run the script merge_per_study.sh

Summary

The shell script `merge_per_study.sh` provides arguments to the Python script `merge_per_study.py`. This step maps all ENSG IDs to gene symbols based on a set of mapping files. It will also filter out microRNA genes. The steps for creating the mapping files are described in the annotation README.

Method

The mapping files are available in the folder:

/annotation/mapping_files/

These files should be moved to a similar path in the version of your run of BioXpress.

The required mapping files include:

  • `mart_export.txt`
  • `mart_export_remap_retired.txt`
  • `new_mappings.txt`

Edit the hard-coded paths in the script `merge_per_study.sh`:

  • Specify the `in_dir` as the folder containing the final output of the Downloader step, including count and category files per study.
  • Specify the `out_dir` so that it is now under the top folder generated/annotation instead of downloads.

  • Specify the location of the mapping files downloaded in the previous sub-step.


Output

All ENSG IDs in the counts files have been replaced by gene symbols in new count files located in the `out_dir`. Transcripts have also been merged per gene and microRNA genes filtered out. The categories files remain the same but are copied over to the annotation folder.
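
The effect of this step on one sample's counts can be sketched as follows. This is a simplified illustration, not the logic of merge_per_study.py. The function name is hypothetical and the ENSG IDs are placeholders. Identifying microRNA genes by a "MIR" symbol prefix and dropping unmapped IDs are both assumptions of the sketch.

```python
from collections import defaultdict


def annotate_counts(counts_by_ensg, ensg_to_symbol, mirna_prefix="MIR"):
    """Replace ENSG IDs with gene symbols and sum counts per symbol.

    Assumptions of this sketch: unmapped IDs are dropped, and microRNA
    genes are recognized by a symbol prefix.
    """
    merged = defaultdict(int)
    for ensg_id, count in counts_by_ensg.items():
        symbol = ensg_to_symbol.get(ensg_id)
        if symbol is None:
            continue  # no mapping available for this ID
        if symbol.startswith(mirna_prefix):
            continue  # filter out microRNA genes
        merged[symbol] += count
    return dict(merged)


# Placeholder IDs; two IDs map to the same symbol and are merged.
counts = {"ENSG_1": 10, "ENSG_2": 5, "ENSG_3": 8, "ENSG_4": 2}
mapping = {"ENSG_1": "TP53", "ENSG_2": "TP53", "ENSG_3": "MIR21"}

annotated = annotate_counts(counts, mapping)
print(annotated)
```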

Annotation Step 2: Run the script merge_per_tissue.py

Summary

The Python script `merge_per_tissue.py` takes all files created by the script `merge_per_study.sh` and merges them based on the file `tissues.csv`, which assigns TCGA studies to specific tissue terms.

Method

Download the file `tissues.csv` from the previous version of BioXpress at: /data/projects/bioxpress/$version/generated/misc/tissues.csv

Place it in a similar folder in the version of your run of BioXpress.

Edit the hard-coded paths in `merge_per_tissue.py`:

  • **Edit the line (~23):**
 in_file = "/data/projects/bioxpress/$version/generated/misc/tissues.csv"  
 with the version for your current run of BioXpress.
  • **Edit the line (~36):**
 out_file_one = "/data/projects/bioxpress/v-5.0/generated/annotation/per_tissue/%s.htseq.counts" % (tissue_id)  
 with the version for your current run of BioXpress.
  • **Edit the line (~37):**
 out_file_two = "/data/projects/bioxpress/v-5.0/generated/annotation/per_tissue/%s.categories" % (tissue_id)  
 with the version for your current run of BioXpress.
  • **Edit the line (~45):**
 in_file = "/data/projects/bioxpress/v-5.0/generated/annotation/per_study/%s.categories" % (study_id)  
 with the version for your current run of BioXpress.
  • **Edit the line (~52):**
 in_file = "/data/projects/bioxpress/v-5.0/generated/annotation/per_study/%s.htseq.counts" % (study_id)  
 with the version for your current run of BioXpress.

Run the Python script: python merge_per_tissue.py

Output

Read count and category files are generated for each tissue specified in the tissues.csv file.
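
The grouping that drives this step can be sketched as below. The layout of tissues.csv is not described in this README, so the sketch assumes two comma-separated columns (study ID, then tissue term) with no header row. The function name and the demo rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def studies_by_tissue(tissues_file):
    """Group study IDs by tissue term.

    Assumes rows of the form "study_id,tissue_id" with no header; this
    layout is an assumption, not taken from the real tissues.csv.
    """
    groups = defaultdict(list)
    for row in csv.reader(tissues_file):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        study_id, tissue_id = row[0].strip(), row[1].strip()
        groups[tissue_id].append(study_id)
    return dict(groups)


# Made-up mapping for illustration.
demo_tissues = io.StringIO("TCGA-COAD,colorectal\nTCGA-READ,colorectal\nTCGA-BRCA,breast\n")
groups = studies_by_tissue(demo_tissues)
for tissue_id, study_ids in groups.items():
    # Output file name pattern taken from the Method section above.
    print("%s.htseq.counts" % (tissue_id), "<-", ", ".join(study_ids))
```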


Annotation Step 3: Run the script split_per_case.py

Summary

The Python script `split_per_case.py` takes case and sample IDs from the sample sheets downloaded from the GDC data portal and splits the annotation data so that there is one folder per case containing only that case's annotation data.

Method

Edit the hard-coded paths in `split_per_case.py`

  • Edit the line (line ~29) in_file = "/data/projects/bioxpress/v-5.0/generated/misc/studies.csv" with the version for your current run of BioXpress
  • Edit the line (line ~38) in_file = "/data/projects/bioxpress/v-5.0/downloads/sample_list_from_gdc/gdc_sample_sheet.primary_tumor.tsv" with the version for your current run of BioXpress as well as the name of the sample sheet for tumor samples downloaded from the GDC data portal
  • Edit the line (line ~57) in_file = "/data/projects/bioxpress/v-5.0/downloads/sample_list_from_gdc/gdc_sample_sheet.solid_tissue_normal.tsv" with the version for your current run of BioXpress as well as the name of the sample sheet for normal samples downloaded from the GDC data portal
  • Edit the line (line ~81) out_file_one = "/data/projects/bioxpress/v-5.0/generated/annotation/per_case/%s.%s.htseq.counts" % (study_id,case_id) with the version for your current run of BioXpress
  • Edit the line (line ~82) out_file_two = "/data/projects/bioxpress/v-5.0/generated/annotation/per_case/%s.%s.categories" % (study_id,case_id) with the version for your current run of BioXpress
  • Edit the line (line ~85) in_file = "/data/projects/bioxpress/v-5.0/generated/annotation/per_study/%s.htseq.counts" % (study_id) with the version for your current run of BioXpress

Run the python script: python split_per_case.py

Output

A folder is generated for each case ID that has a tumor sample and a normal tissue sample. Two files are generated per case: read counts and categories. These files are needed to run DESeq per case.
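
The pairing rule described above (a case is kept only if it has both a tumor sample and a normal tissue sample) can be sketched as follows. This is an illustration, not the code of split_per_case.py. The function name, the (case ID, sample ID) input layout, and the demo IDs are hypothetical.

```python
from collections import defaultdict


def paired_cases(tumor_rows, normal_rows):
    """Return {case_id: {"tumor": [...], "normal": [...]}} for paired cases.

    Each input row is a (case_id, sample_id) tuple, as would be read from
    the tumor and normal sample sheets (input layout assumed here).
    """
    tumor = defaultdict(list)
    normal = defaultdict(list)
    for case_id, sample_id in tumor_rows:
        tumor[case_id].append(sample_id)
    for case_id, sample_id in normal_rows:
        normal[case_id].append(sample_id)
    shared = sorted(set(tumor) & set(normal))  # cases present in both sheets
    return {c: {"tumor": tumor[c], "normal": normal[c]} for c in shared}


# Made-up IDs: only case-A has both a tumor and a normal sample.
tumor_rows = [("case-A", "A-01"), ("case-B", "B-01")]
normal_rows = [("case-A", "A-11"), ("case-C", "C-11")]

pairs = paired_cases(tumor_rows, normal_rows)
print(pairs)
```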