BioXpress pipeline README

BioXpress Downloader Step

Step 1 of the BioXpress pipeline

The downloader step uses sample sheets obtained from the GDC Data Portal to download raw RNA-Seq counts for Primary Tumor and Normal Tissue samples in all available TCGA studies.

General Flow of Scripts

get_data_all_samples.sh -> get_hits_into_dir.py -> merge_files_tumor_and_normal.sh

Procedure

Downloader Step 1: Get sample list files from the GDC Data Portal

Summary

Sample sheets are downloaded from the GDC Data Portal and used by the downstream scripts to obtain read count files.

Method

1. Go to the GDC Repository.

2. Click on the button labeled Advanced Search on the upper right of the repository home page.

  - All filters can also be selected manually using the search tree on the left side of the page at the link above.
  - To select a files filter or a cases filter, the corresponding tab (Files or Cases) must first be selected on the search bar.

3. To get the Primary Tumor samples, enter the following query in the query box:

files.analysis.workflow_type in ["HTSeq - Counts"] and files.data_type in ["Gene Expression Quantification"] and cases.samples.sample_type in ["Primary Tumor"] and cases.project.program.name in ["TCGA"]
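For scripted use, the same query can be written as a nested filter object. The sketch below is a minimal illustration, assuming the `op`/`content` JSON filter format used by the GDC API; the helper names `gdc_in_filter` and `build_sample_filter` are hypothetical and not part of the BioXpress scripts. It only builds and prints the filter and does not contact the GDC.

```python
import json


def gdc_in_filter(field, values):
    """Build one `field in [values]` clause (hypothetical helper)."""
    return {"op": "in", "content": {"field": field, "value": list(values)}}


def build_sample_filter(sample_type):
    """Combine the four clauses of the query above with a logical AND.

    Field names are copied verbatim from the Advanced Search query;
    only the sample type varies between calls.
    """
    return {
        "op": "and",
        "content": [
            gdc_in_filter("files.analysis.workflow_type", ["HTSeq - Counts"]),
            gdc_in_filter("files.data_type", ["Gene Expression Quantification"]),
            gdc_in_filter("cases.samples.sample_type", [sample_type]),
            gdc_in_filter("cases.project.program.name", ["TCGA"]),
        ],
    }


# Filter equivalent to the Primary Tumor query shown above.
primary_tumor_filter = build_sample_filter("Primary Tumor")
print(json.dumps(primary_tumor_filter, indent=2))
```

Passing a different sample type to `build_sample_filter` changes only the `cases.samples.sample_type` clause, so the other three clauses stay identical across queries.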