BioXpress pipeline README
BioXpress Downloader Step
Step 1 of the BioXpress pipeline
The downloader step uses sample sheets obtained from the GDC Data Portal to download raw RNA-Seq counts for Primary Tumor and Normal Tissue samples in all available TCGA studies.
General Flow of Scripts
get_data_all_samples.sh -> get_hits_into_dir.py -> merge_files_tumor_and_normal.sh
Procedure
Downloader Step 1: Get sample list files from the GDC Data Portal
Summary
Sample sheets are downloaded from the GDC Data Portal and used by the downstream scripts to obtain read count files.
Method
1. Go to the GDC Repository.
2. Click on the button labeled Advanced Search on the upper right of the repository home page.
- All filters can also be selected manually using the search tree on the left side of the page at the link above.
- To select a files filter or a cases filter, that tab must be selected on the search bar.
3. To get the Primary Tumor samples, enter the following query in the query box:
files.analysis.workflow_type in ["HTSeq - Counts"] and files.data_type in ["Gene Expression Quantification"] and cases.samples.sample_type in ["Primary Tumor"] and cases.project.program.name in ["TCGA"]
4. Click Submit Query.
5. On the search results screen, click Add All Files To Cart. Then select the Cart on the upper right of the page.
6. Click Sample Sheet from the Cart page to download the Sample Sheet for the Primary Tumor samples.
- Rename the sample sheet to avoid overwriting it when downloading the Normal Tissue samples. Add tumor or normal to the filenames as needed.
7. Repeat the process for Normal Tissue samples with the following query:
files.analysis.workflow_type in ["HTSeq - Counts"] and files.data_type in ["Gene Expression Quantification"] and cases.samples.sample_type in ["Solid Tissue Normal"] and cases.project.program.name in ["TCGA"]
8. Move both downloaded sample sheets to the server directory:
/data/projects/bioxpress/$version/downloads/
- Increment the version for each new run (e.g., use v-5.0 if the latest version is v-4.0).
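The two queries differ only in the sample type, and the destination directory differs only in the version, so both can be generated rather than typed by hand. The following is a minimal sketch, not part of the pipeline: the helper names, the `tumor`/`normal` labels, the renamed file name `gdc_sample_sheet_<label>.tsv`, and the assumption that a new run always bumps the major version are illustrative.

```python
from pathlib import PurePosixPath

# Query template copied from the Method steps; only the sample type varies.
QUERY_TEMPLATE = (
    'files.analysis.workflow_type in ["HTSeq - Counts"] '
    'and files.data_type in ["Gene Expression Quantification"] '
    'and cases.samples.sample_type in ["{sample_type}"] '
    'and cases.project.program.name in ["TCGA"]'
)

# Label used in the renamed file name -> sample type used in the GDC query.
SAMPLE_TYPES = {"tumor": "Primary Tumor", "normal": "Solid Tissue Normal"}


def build_query(label):
    """Return the Advanced Search query text for 'tumor' or 'normal'."""
    return QUERY_TEMPLATE.format(sample_type=SAMPLE_TYPES[label])


def next_version(latest):
    """Increment the major version, e.g. 'v-4.0' -> 'v-5.0' (assumed scheme)."""
    prefix, number = latest.split("-", 1)
    major = int(number.split(".")[0])
    return f"{prefix}-{major + 1}.0"


def destination(version, label):
    """Server path for a renamed sample sheet (the file name is hypothetical)."""
    downloads = PurePosixPath("/data/projects/bioxpress") / version / "downloads"
    return downloads / f"gdc_sample_sheet_{label}.tsv"


version = next_version("v-4.0")
for label in SAMPLE_TYPES:
    print(build_query(label))
    print(destination(version, label))
```

The printed query text can be pasted into the Advanced Search box, and the printed path is where the renamed sheet would be moved.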
Downloader Step 2: Run the script get_data_all_samples.sh
Summary
The shell script `get_data_all_samples.sh` passes arguments to the Python script `get_data_all_samples.py`. The run generates a log file that records the creation of directories and the filtering out of TCGA studies with low sample numbers.
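To illustrate what filtering out studies with low sample numbers could look like, here is a minimal sketch. It is not the code in `get_data_all_samples.py`: the column name `Project ID`, the cutoff `MIN_SAMPLES`, and the choice to count rows rather than unique samples are all assumptions made for the example.

```python
import csv
import io
from collections import Counter

MIN_SAMPLES = 10  # hypothetical cutoff; the real threshold lives in the script


def count_samples_per_study(sheet, study_column="Project ID"):
    """Count rows per TCGA study in a tab-separated sample sheet."""
    reader = csv.DictReader(sheet, delimiter="\t")
    return Counter(row[study_column] for row in reader)


def split_studies(counts, min_samples=MIN_SAMPLES):
    """Return (kept, dropped) study IDs, each sorted alphabetically."""
    kept = sorted(s for s, n in counts.items() if n >= min_samples)
    dropped = sorted(s for s, n in counts.items() if n < min_samples)
    return kept, dropped


# Tiny in-memory sheet standing in for a downloaded sample sheet.
example = io.StringIO(
    "File ID\tProject ID\tSample Type\n"
    "a1\tTCGA-BRCA\tPrimary Tumor\n"
    "a2\tTCGA-BRCA\tPrimary Tumor\n"
    "a3\tTCGA-CHOL\tPrimary Tumor\n"
)
counts = count_samples_per_study(example)
kept, dropped = split_studies(counts, min_samples=2)
print("kept:", kept)
print("dropped:", dropped)
```

With a real sample sheet, replace the `io.StringIO` object with an open file handle and compare the dropped studies against what the log file reports.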
Method
1. Edit the hard-coded paths in the script. The shell script calls the Python script once for the tumor samples and once for the normal samples, so for each of the two you need to specify the path to the appropriate sample sheet and the path to the log file.
2. Run the shell script:
sh get_data_all_samples.sh
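Because the paths are hard-coded, a typo only surfaces once the run has started. A small pre-flight check such as the sketch below can catch that earlier. It is an illustration only: the function name, the `runs` dictionary layout, and the example file names are hypothetical and do not come from the pipeline scripts.

```python
import os
import tempfile


def missing_inputs(runs):
    """Return a list of problems found in the configured paths."""
    problems = []
    for label, paths in runs.items():
        # The sample sheet must already exist.
        if not os.path.isfile(paths["sample_sheet"]):
            problems.append(f"{label}: sample sheet not found: {paths['sample_sheet']}")
        # The log file may not exist yet, but its directory must.
        log_dir = os.path.dirname(paths["log"]) or "."
        if not os.path.isdir(log_dir):
            problems.append(f"{label}: log directory not found: {log_dir}")
    return problems


# Demonstration in a temporary directory: only the tumor sheet is present.
with tempfile.TemporaryDirectory() as tmp:
    tumor_sheet = os.path.join(tmp, "sample_sheet_tumor.tsv")
    open(tumor_sheet, "w").close()
    runs = {
        "tumor": {"sample_sheet": tumor_sheet, "log": os.path.join(tmp, "tumor.log")},
        "normal": {
            "sample_sheet": os.path.join(tmp, "sample_sheet_normal.tsv"),
            "log": os.path.join(tmp, "normal.log"),
        },
    }
    problems = missing_inputs(runs)

for problem in problems:
    print(problem)
```

An empty list means both sample sheets and both log directories were found, and the shell script can be started.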
- The hard-coded paths to edit in the script include the downloads directory: path0 = "/data/projects/bioxpress/$version/downloads/"