ARGOSQC Usage Tutorial: Difference between revisions

From HIVE Lab


= Table of Contents =
Objective


Required User Information
Table of Contents
Overview
Input values:
Where to locate NCBI Information for the inputs:
Single QC Computation
Batch Mode Computation
Batch Mode Parameter breakdown:
QC Computation Results



Revision as of 19:43, 25 April 2025


This is a tutorial for the HIVE3 one-click pipeline for the FDA HIVE instance. This protocol guides the user through running single and batch-mode QC computations. HIVE3 is an instance of HIVE that is not owned by the FDA and can be modified directly by Vahan or by other members of our team who have permissions.

Required User Information

Protocol Version: 1.0
HIVE Instance: 3
HIVE Link: https://hive3.biochemistry.gwu.edu/dna.cgi?cmd=login


Overview

We constructed a one-click QC pipeline that takes user-specified organism information and combines the 3 core ARGOS workflows to produce 5 different result datasets in JSON format (Figure 1). 3 of the 5 result JSONs (assemblyQC, ngsQC, and biosampleMeta) have been updated and added to data.argosdb.

To register your account, follow the link under “Required User Information”. At the top right there is a tab labeled “register”. Fill out the appropriate fields and submit. Once submitted, please email mazumder_lab@gwu.edu so we can verify your account.

The ARGOS pipeline can be accessed from the ‘Projects’ dropdown menu at the upper right of the screen, under ‘Argos’. The link to HIVE3 is given in the “Required User Information” section at the beginning of this protocol.

After a successful login, you will be taken to the home page. Use the Projects menu at the top right corner to access the ARGOS pipeline, or use this URL to go directly to the ARGOS QC pipeline on HIVE3:

https://hive3.biochemistry.gwu.edu/dna.cgi?cmd=argos-alqc

Figure 1. Input settings page for the ARGOS QC pipeline.

Input values:

The data inputs for both single and batch computations are entered under the General tab of the ARGOS pipeline input settings page.

Name: Give the computation a name.

Folder: Name the folder where your computations, data, and steps will be stored.

  • Can use: letters, numbers, _ , - and &
  • Cannot use: / : ; , \ “ ” ‘ ’
  • A / will create a subfolder, but this is not recommended. Manually moving the subfolder is best.
  • Example folder: Influenza A (h5n1)
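
The folder naming rules above can be checked before a name is typed into the pipeline. The following is a minimal sketch, not part of HIVE: the function name is hypothetical, and treating straight quotes the same as the curly quotes listed above is an assumption.

```python
# Characters this protocol lists as not allowed in folder names.
# Assumption: straight quotes (" and ') are treated like the curly quotes above.
FORBIDDEN_FOLDER_CHARS = {"/", ":", ";", ",", "\\", "“", "”", "‘", "’", '"', "'"}


def forbidden_chars_in(folder_name):
    """Return the forbidden characters found in folder_name, in order of first appearance."""
    found = []
    for ch in folder_name:
        if ch in FORBIDDEN_FOLDER_CHARS and ch not in found:
            found.append(ch)
    return found


if __name__ == "__main__":
    for name in ["Influenza A (h5n1)", "Salmonella/run:1"]:
        bad = forbidden_chars_in(name)
        print(name, "->", "ok" if not bad else "contains " + " ".join(bad))
```

An empty result means the name contains none of the listed characters. The check does not confirm that HIVE will accept the name.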

Reads: information needed for ngsQC

  • SRR: The SRR accession number. Multiple accessions per organism can be entered by separating them with a “,” or by clicking the gray + sign to add extra fields. This tool uses the NCBI SRA Fasterq function to fetch the fastq files directly from NCBI, so the user does not need to import them into HIVE.
  • HIVE reads: Use the dropdown menu to select reads already uploaded into HIVE, either from previous computations or manual uploads.
  • See Figure 2a.

NOTE: The algorithm automatically uses information already in HIVE in preference to pulling it from outside sources. If the reads are already uploaded, you do not need the SRR input box and can use the HIVE IDs menu instead. You may still enter the SRR accession if you want to; the pipeline will simply search within HIVE. See the ngsQC Protocol for how to upload SRR information using the external downloader. The external downloader process is the same as in HIVE1 and HIVE2.

Reference: Information used for the assemblyQC portion of the algorithm.

  • Reference accession: The RefSeq or GenBank accession number from NCBI.
  • Assembly ID: The assembly accession number from NCBI.
  • HIVE genome: Use the dropdown menu to select a reference genome that has already been uploaded into HIVE.
  • See Figure 2b.

NOTE: The algorithm automatically uses information already in HIVE in preference to pulling it from outside sources. If the reference genome is already uploaded, you do not need to use the Assembly ID or Reference accession input box, but you can if you want to. See the AssemblyQC Protocol for how to upload assembly information using an external downloader or local upload. This process is the same as in HIVE2.

Metadata: Used to grab the information necessary to fill out the BiosampleMeta_HIVE document.

  • Biosample Accession: The accession number for the Biosample that was reported to be used when creating the assembly. It will be linked to the SRR fastq files used for the ngsQC portion of the algorithm. This input is optional.

Coding Table: Dropdown of genetic codon tables to be used for your computation, depending on the organism to be computed. The default is human, viral (Standard).

  • Tip: NCBI Taxonomy will list the codon table for each organism on their taxonomy page.

Figure 2. The ARGOS_QC algorithm input page with all NCBI information for a test organism, Salmonella enterica. a) The SRR accession field contains both SRR fastqs for the organism that correspond to the biosample. b) The Reference Accession is the RefSeq Nucleotide accession number from NCBI, the Assembly ID is the assembly accession number from NCBI, and the Biosample Accession is the Biosample accession number from NCBI. a & b) HIVE IDs are accessions that are already uploaded into HIVE, and the algorithm will automatically select this information as opposed to pulling information from outside sources.

Almost all of this information can be found on the legacy assembly page for this organism, as shown in Figures 3 to 7.


Where to locate NCBI Information for the inputs:

Navigate to the NCBI legacy assembly page for your organism. Here you can find all of the information to be used for your computations.

Figure 3. The information shown on the NCBI legacy assembly page. The RefSeq assembly accession corresponds to our organism of interest and is used to fill out the “Assembly ID” input field on the HIVE3 ARGOS_QC input page. Note that the bioproject matches the accession for the FDA_ARGOS bioproject, and that 2 sequencing technologies are listed, meaning that there will most likely be two SRR submissions on the SRA page (see Figures 5 and 6).
Figure 4. The bottom section of the legacy assembly page for our test organism. The column labeled RefSeq sequence lists the DNA RefSeq for our test organism which we will use for the “Reference Accession” field on the HIVE3 ARGOS_QC input page.
Figure 5. Clicking on the Biosample accession number seen in Figure 3 will redirect you to the NCBI Biosample page for this organism. Under “Related Information”, click “SRA” to navigate to the NCBI SRA page for this biosample.
Figure 6. The SRA page lists several sequencing links. Each link corresponds to a run sequenced on a different platform, either Illumina or PacBio. This is common. Different platforms are used to gain insight into the assembly from different perspectives and levels. Illumina sequences DNA as multiple short reads that can be used to create an accurate reconstruction of the genomic sections analyzed by estimating the average/best fit nucleotide sequence. PacBio is a long read sequencer that takes “movies” of the DNA sequence as it moves through the technology and captures the sequence in one go from start to finish. The long read sequence then acts as a map for the accurate short read sequences that need to be assembled. Therefore, it is important to use all of the reported links in our QC pipeline.
Figure 7. Clicking on the bottom link under the Runs section from the SRA page shown in Figure 6 will redirect to the page containing the SRR file. Copy and paste the SRR accession number into the input field in the HIVE3 pipeline labeled “SRR”. You will need to do this for both (or more) SRR accessions. Check that the bioproject, biosample, and organism name all coincide with our test organism.

Single QC Computation

A single QC computation performs assemblyQC, biosampleQC, and ngsQC on one organism with one assembly, but can include multiple SRR ids.

Step 1: Input the name of the computation and the name of the folder you wish to store the computation in. If you would like to add to a pre-existing folder, input the exact folder name in the folder input field. It is case sensitive. See input criteria at the beginning of this protocol.

Step 2: Under the dropdown menu for Reads, select which input you will use.

  • For SRRs, type or paste in the SRR id. If there is more than one SRR id, click on the gray + sign to add a new input field, or separate the ids with a comma.
    • Troubleshooting: if the computation fails, try removing any spaces between the commas and the SRR ids.
  • For HIVE IDs, select the HIVE ID option from the dropdown menu. Click on the gray dropdown arrow next to HIVE reads. A pop-up window will open, as seen in Figure 9.
    • Click on the ids that you wish to use in the computation. Use Ctrl + Shift to highlight multiple ids.

Figure 8. Input settings page for the ARGOS QC pipeline filled out for a single QC computation of our test organism, Salmonella enterica.
Figure 9. Pop-up window for SRR HIVE ID selection.

Step 3: Next to Reference, select the input object you would like to use for the computation.

  • For Reference Accession, type or paste in the id you wish to use.
  • For Assembly IDs, type or paste in the id you wish to use.
  • For HIVE Genomes, refer to Step 2 above on how to select a HIVE id. It is the same process.
  • Refer to the beginning of this protocol for what ids can be inputted.

Step 4: Next, under BioSample Accessions, paste in the biosample ID you would like to use for the computation. This is optional.

Step 5: Lastly, select from the Coding Table dropdown menu the genetic code you would like to use for your computation, if applicable. The default is “human, viral (Standard)”.

Step 6: Once all of the information has been inputted correctly, click on the big blue Submit button in the middle, as seen in Figure 8.

Step 7: Once your computation has been submitted, you can click on the Home tab found in the top left to go back to the homepage and see the workflow.
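
Before submitting, it can help to catch obvious typos in the accession fields. The sketch below is not part of the pipeline. The patterns are assumptions based on the example IDs in this protocol (SRR, GCA_/GCF_, and SAMN prefixes followed by digits), and the function name is hypothetical. It does not confirm that an accession actually exists at NCBI.

```python
import re

# Assumed accession shapes (NOT taken from the pipeline itself):
#   srr:       "SRR" followed by digits
#   assembly:  "GCA_" or "GCF_" followed by digits, a dot, and a version number
#   biosample: "SAMN" followed by digits
PATTERNS = {
    "srr": re.compile(r"SRR\d+"),
    "assembly": re.compile(r"GC[AF]_\d+\.\d+"),
    "biosample": re.compile(r"SAMN\d+"),
}


def check_ids(kind, field_value):
    """Split a comma-separated field and return the ids that do not match the assumed pattern."""
    ids = [part.strip() for part in field_value.split(",") if part.strip()]
    return [i for i in ids if not PATTERNS[kind].fullmatch(i)]


if __name__ == "__main__":
    print(check_ids("srr", "SRR0123456, SRR0123457"))
    print(check_ids("assembly", "GCA_0011223344.1"))
    print(check_ids("biosample", "SAMN110654321, samn1"))
```

An empty list means every id in the field has the expected shape. Any id returned should be checked against the NCBI pages described earlier.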

Batch Mode Computation

Batch mode operates on a user-specified ratio of groups. In the one-click pipeline, semicolons separate computations and commas separate the ids within a computation, so the ratio for a batch mode computation is 1:1:1. The pipeline treats the ids between two semicolons as one computation: one cluster of SRRs to one assembly to one biosample. The example below shows the syntax for the inputs.

The ids would be grouped into computations as in this example:

Assembly 1:

SRR0123456, SRR0123457, SRR0123458, SRR0123459

GCA_0011223344.1

SAMN110654321, SAMN110654322

Assembly 2:

SRR0123451, SRR0123452, SRR0123453, SRR0123454

GCA_0011223345.1

SAMN110654323, SAMN110654324

To separate batches, place a semicolon “;” between the IDs. A comma separates individual IDs, while a semicolon separates batches. These are entered in the General tab of the pipeline, the same as for a single computation. Within each field, this is how the above example would look in batch mode:

SRR IDS: SRR0123456, SRR0123457, SRR0123458, SRR0123459; SRR0123451, SRR0123452, SRR0123453, SRR0123454

Assembly IDs: GCA_0011223344.1; GCA_0011223345.1

BioSample Accessions: SAMN110654321, SAMN110654322; SAMN110654323, SAMN110654324

Note that the semicolons follow the grouping in the example above: commas separate the ids, and semicolons separate the batches.

Troubleshooting Note: If your computation fails or there is an error, remove the spaces after the “,” and “;”. Spaces previously caused an error that has since been fixed, but removing them is worth trying if your computation fails. The input would look like:

SRR IDS: SRR0123456,SRR0123457,SRR0123458,SRR0123459;SRR0123451,SRR0123452,SRR0123453,SRR0123454
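
When many assemblies are batched, the field strings can be generated rather than typed by hand. The following is a minimal sketch under stated assumptions: the function name, dictionary keys, and `compact` option are hypothetical and only mirror the comma and semicolon syntax described above. The output is meant to be pasted into the corresponding input fields.

```python
def build_batch_fields(groups, compact=False):
    """Join per-computation id lists into one string per input field.

    Ids within a computation are joined by commas, computations by semicolons.
    With compact=True no spaces are written after the separators.
    """
    id_sep, batch_sep = (",", ";") if compact else (", ", "; ")
    fields = {}
    for key in ("srr", "assembly", "biosample"):
        fields[key] = batch_sep.join(id_sep.join(group[key]) for group in groups)
    return fields


# The two example assemblies from this protocol.
groups = [
    {
        "srr": ["SRR0123456", "SRR0123457", "SRR0123458", "SRR0123459"],
        "assembly": ["GCA_0011223344.1"],
        "biosample": ["SAMN110654321", "SAMN110654322"],
    },
    {
        "srr": ["SRR0123451", "SRR0123452", "SRR0123453", "SRR0123454"],
        "assembly": ["GCA_0011223345.1"],
        "biosample": ["SAMN110654323", "SAMN110654324"],
    },
]

if __name__ == "__main__":
    for name, value in build_batch_fields(groups, compact=True).items():
        print(name, "=", value)
```

Calling the function with `compact=True` produces the no-space form suggested in the troubleshooting note.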

Step 1: Navigate to the tab titled Batch (Figure 10). It is found on the ARGOS input settings page (Figure 1), next to the General tab.

Step 2: For the parameter “batch service” at the bottom, select batch mode from the dropdown menu. This sets the pipeline to Batch Mode rather than single computations.

Figure 10. Batch mode input settings window.

Step 3: Select the parameters. Click on the dropdown menu next to the text “Parameter list”.

  • Use the black plus button next to ‘Parameter List’ to add an entry field.
  • From the dropdown field, select the parameter that matches the input field you used on the General input page. This can be seen in Figure 11.
  • For example, if you pasted in SRR ids, choose the parameter SRR IDs. If you chose HIVE IDs, select HIVE IDs from the dropdown.

Figure 11. Dropdown menu from the Batch tab displaying the parameter options. Select the batch parameter option that matches the input information in the General tab. For example, if you entered SRR ids, select SRR IDs from the dropdown. If you entered a reference ID, select Reference IDs.

Step 4: Input the ratio for the batch service.

  • In batch mode in the one-click pipeline, computations are separated by a semicolon “;” and the IDs within each computation by a comma “,”. Because the workflow parses the input and treats the IDs between semicolons as one computation, the ratio will be 1:1:1.
  • If the ratio is 1:1:1 then enter the value 1 for each box.
  • One set of SRRs to one assembly to one biosample.
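
A 1:1:1 ratio only works if every field holds the same number of semicolon-separated groups. The sketch below checks this locally before submission. It is an illustration with hypothetical function names, not the parser the pipeline uses, and it assumes that an empty field (for example, an omitted optional biosample) should be skipped.

```python
def split_batches(field_value):
    """Split a field into computations (on ';') and each computation into ids (on ',')."""
    batches = []
    for chunk in field_value.split(";"):
        ids = [part.strip() for part in chunk.split(",") if part.strip()]
        if ids:  # ignore empty chunks, e.g. from a trailing semicolon
            batches.append(ids)
    return batches


def batch_counts(fields):
    """Count the computations in each non-empty field."""
    return {name: len(split_batches(value)) for name, value in fields.items() if value.strip()}


def is_one_to_one(fields):
    """True when every non-empty field describes the same number of computations."""
    counts = set(batch_counts(fields).values())
    return len(counts) == 1


if __name__ == "__main__":
    fields = {
        "srr": "SRR0123456, SRR0123457; SRR0123451, SRR0123452",
        "assembly": "GCA_0011223344.1; GCA_0011223345.1",
        "biosample": "SAMN110654321; SAMN110654323",
    }
    print(batch_counts(fields), is_one_to_one(fields))
```

If the counts differ between fields, a semicolon is probably missing or misplaced in one of them.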

Step 5: Enter the information in each field. Navigate back to the input settings page (Figure 1). The same page used for single computations is used here; the only difference is the use of semicolons and commas. The example below shows how the information is entered for batch mode.

Batch Mode Parameter breakdown:

Assembly 1:

SRR0123456, SRR0123457, SRR0123458, SRR0123459

GCA_0011223344.1

SAMN110654321, SAMN110654322

Assembly 2:

SRR0123451, SRR0123452, SRR0123453, SRR0123454

GCA_0011223345.1

SAMN110654323, SAMN110654324

Again, this is very similar to single computations, except that the batch mode will use semicolons and commas to separate the ids.

Step 6: Once all of the input information is complete, click the blue Submit button. You may exit the Argos pipeline window by clicking “Home” in the top left corner.

QC Computation Results

Once you have submitted your computations, either single or batch, you should see the pipeline workflow in your Inbox or All Objects, as shown in Figure 12. You can also view the pipeline by clicking on the “workflows” tab also seen in Figure 12.

Figure 12.  The pipeline workflow displayed in the user’s inbox.

As the workflow progresses, your computations will be stored in the folder that you named at the beginning of this protocol. To view the contents of the folder, click on the plus sign next to the folder, or on the folder name, to open it.

Once your computations are complete, the QC outputs are stored in JSON format under the computation “Post-Alignment Quality Controls” (P-A QC) or under the “CFlow” workflow. P-A QC can be found in the folder you specified for the computation, and CFlow in All Objects. To view the JSONs, click on the name so that it is highlighted blue, then click on the “Available Downloads” tab in the bottom menu.

Figure 13. The available downloads tab and the 5 JSON files that are the QC outputs.

There will be 5 files reported in JSON format. Click the blue/green download icon next to each file to see the results. The file labeled qcAll.json will have our assemblyQC results. qcNGS.json will have our ngsQC results and biosample.json the biosample information. We currently do not submit qcPos.json or refAnnot.json to the ARGOS DB, but the information is there to better help you understand your computation.
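
After downloading, the five files can be given a quick local check. The following is a minimal sketch that assumes the files were saved into one local folder under the names listed above. The function name and folder path are hypothetical, and no assumption is made about the internal structure of each JSON.

```python
import json
from pathlib import Path

# File names as listed in this protocol, with the role described for each.
QC_FILES = {
    "qcAll.json": "assemblyQC results",
    "qcNGS.json": "ngsQC results",
    "biosample.json": "biosample information",
    "qcPos.json": "not submitted to the ARGOS DB",
    "refAnnot.json": "not submitted to the ARGOS DB",
}


def summarize_downloads(folder):
    """For each expected file, report whether it is present and how large its top-level JSON value is."""
    summary = {}
    for name in QC_FILES:
        path = Path(folder) / name
        if not path.is_file():
            summary[name] = "missing"
            continue
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        size = len(data) if isinstance(data, (dict, list)) else 1
        summary[name] = f"{type(data).__name__} with {size} top-level item(s)"
    return summary


if __name__ == "__main__":
    # "downloads" is a placeholder for wherever the files were saved.
    for name, status in summarize_downloads("downloads").items():
        print(f"{name} ({QC_FILES[name]}): {status}")
```

A file reported as missing was either not downloaded or saved under a different name.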