Submitting Molecular Data

Molecular data consists of raw molecular data files (e.g. sequencing reads), as well as any associated file metadata (data that describes your data).

Raw molecular data is submitted to a Regional Data Processing Centre (RDPC). RDPCs are responsible for processing your program's molecular data according to the Analysis Pipeline. If you are unsure which RDPC you should submit to, please contact the DCC.

IMPORTANT: Please make sure to upgrade the icgc-argo/argo-data-submission workflow to version 1.1.0 or later.

note

Sample Registration is the first step in the data submission process. You must register samples before submitting molecular data. Please ensure that your samples are registered on the ARGO Data Platform before continuing with this step.

Molecular Data Submission Workflow

icgc-argo/argo-data-submission is a workflow designed to ease metadata payload generation and streamline the molecular data submission process. It applies sanity checks and data validation rules early on to ensure the high quality of the molecular data and accompanying metadata that are to be released through the ARGO Data Platform. Initially designed to support submission of local data, it can also submit remote data stored in EGA archives. Sequencing data in FASTQ, BAM, and CRAM formats are all supported, with the exception of interleaved FASTQ files.

The workflow is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers, making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this workflow uses one container per process, which makes it much easier to maintain and update software dependencies. Here you can download the latest version.
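As a quick check before the steps below, Nextflow can typically be installed and verified like so (a minimal sketch; assumes Java 11+ is already available on your PATH):

```bash
# Install Nextflow into the current directory and confirm the version
# meets the workflow's requirement (>= 22.10.0).
curl -s https://get.nextflow.io | bash
./nextflow -version
```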

Workflow Diagram

[workflow diagram]

Workflow Summary

By default, Sequencing Reads are assumed to be submitted from local storage. However, if the Sequencing Reads are stored remotely in the EGA Archives, the EGA download sub-workflow can be enabled to fetch data from EGA and submit it to the RDPC. The workflow also supports Sequencing Reads in CRAM format; for this, the reference genome is required to enable the cram2bam step. Currently the workflow performs the following steps:

  • Sanity Checks (sanityCheck)
  • [Optional] Download data from EGA Archives (EgaDownloadWf)
  • [Optional] Cram to Bam conversion (cram2bam)
  • Generate Metadata Payload (pGenExp)
  • Validate Molecular Data and Metadata (valSeq)
  • Upload Files to RDPC (uploadWf)
  • Generate submission receipt (submissionReceipt)

Quick Start

  1. Install Nextflow (>=22.10.0)
  2. Install Docker or Singularity for full workflow reproducibility
  3. Download the workflow and do a dry run submission with the following commands
```bash
git clone https://github.com/icgc-argo/argo-data-submission.git
cd argo-data-submission
nextflow run -preview main.nf -profile rdpc,docker -params-file example-params.local.json --api_token YOUR_API_TOKEN
```

This will give you a list of the processes that will be invoked with the given input params for submitting data from local storage, e.g.:

```
N E X T F L O W  ~  version 23.10.0
* PREVIEW * null [confident_mahavira] DSL2 - revision: d1e7ce6cd4
[- ] process > ArgoDataSubmissionWf:sanityCheck -
[- ] process > ArgoDataSubmissionWf:checkCramReference -
[- ] process > ArgoDataSubmissionWf:pGenExp -
[- ] process > ArgoDataSubmissionWf:valSeq -
[- ] process > ArgoDataSubmissionWf:uploadWf:songSub -
[- ] process > ArgoDataSubmissionWf:uploadWf:songMan -
[- ] process > ArgoDataSubmissionWf:uploadWf:scoreUp -
[- ] process > ArgoDataSubmissionWf:uploadWf:songPub -
[- ] process > ArgoDataSubmissionWf:submissionReceipt -
[- ] process > ArgoDataSubmissionWf:printOut -
```

Please verify that you’ve provided the required params to invoke the expected processes.

  4. Start submitting your own data following the steps in section How to Upload Molecular Data

How to Upload Molecular Data

Molecular data files are submitted in conjunction with necessary descriptive data about the molecular files and samples, called metadata. The DCC requires that metadata be submitted at the same time as the molecular data to facilitate automated downstream analysis. This metadata is one of the mandatory inputs for the data submission workflow. Before proceeding, please read the instructions on how to prepare and format the metadata according to the molecular metadata model and validation rules.

Step 1. Prepare Molecular Metadata

The first step of submitting molecular data is to prepare the metadata TSV files conforming to the most recent metadata dictionary that has been defined by the DCC.

The ARGO Data Platform currently accepts the following types of genomic sequencing analyses:

  • DNA-Seq data from tumour and normal paired Whole Genome Sequencing (WGS) and Whole Exome Sequencing (WXS)
  • DNA-Seq data from clinical genome Targeted Sequencing (Targeted-Seq)
  • RNA-Seq data from Transcriptome Sequencing (RNA-Seq)

Submissions may contain either Tumour or Normal samples, from paired-end or single-end sequencing, in BAM, FASTQ, or CRAM format. The data submission workflow can be configured to submit either local data or remote data stored in EGA archives. Each of these scenarios has a specific set of metadata rules to follow.

Experiment

Table experiment contains data elements related to a specific experiment performed for a sample. It should contain only one experiment (one row). Download the template and format it according to the latest dictionary.

The following are examples of correctly formatted metadata files for each case:

  • Normal Sample
| type | submitter_sequencing_experiment_id | program_id | submitter_donor_id | submitter_specimen_id | submitter_sample_id | submitter_matched_normal_sample_id | sequencing_center | platform | platform_model | experimental_strategy | sequencing_date | read_group_count | target_capture_kit | primary_target_regions | capture_target_regions | number_of_genes | gene_padding | coverage | library_selection | library_preparation_kit | library_strandedness | rin | dv200 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sequencing_experiment | exp_wgs | TEST-CA | test_wgs | test_wgs_SP_N | test_wgs_SA_N |  | OICR | ILLUMINA | HiSeq 2000 | WGS | 2014-12-12 | 3 |  |  |  |  |  |  |  | Illumina - TruSeq DNA Nano |  |  |  |
  • Tumour Sample
| type | submitter_sequencing_experiment_id | program_id | submitter_donor_id | submitter_specimen_id | submitter_sample_id | submitter_matched_normal_sample_id | sequencing_center | platform | platform_model | experimental_strategy | sequencing_date | read_group_count | target_capture_kit | primary_target_regions | capture_target_regions | number_of_genes | gene_padding | coverage | library_selection | library_preparation_kit | library_strandedness | rin | dv200 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| sequencing_experiment | exp_wgs | TEST-CA | test_wgs | test_wgs_SP_T | test_wgs_SA_T | test_wgs_SA_N | OICR | ILLUMINA | HiSeq 2000 | WGS | 2014-12-12 | 3 |  |  |  |  |  |  |  | Illumina - TruSeq DNA Nano |  |  |  |
note
  • Fields program_id, submitter_[donor|specimen|sample]_id are Required, and must be the same values as in the sample_registration table.
  • Field submitter_matched_normal_sample_id is Required for WGS/WXS tumour samples.
  • Fields submitter_sequencing_experiment_id, experimental_strategy, platform and read_group_count are Required for all samples.
  • Fields target_capture_kit, primary_target_regions and capture_target_regions are Required for Targeted-Seq/WXS samples.
  • Field library_strandedness is Required for RNA-Seq samples.
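Because the TSV is tab-delimited with a single header row, you can eyeball a conditional field before submission with a quick shell one-liner. A minimal sketch, assuming experiment.tsv sits in the current directory (the column name is taken from the template above):

```bash
# Look up a column by its header name and print its value from the data row,
# e.g. to confirm read_group_count is set before running the workflow.
awk -F'\t' 'NR==1 { for (i = 1; i <= NF; i++) col[$i] = i }
            NR==2 { print "read_group_count =", $col["read_group_count"] }' experiment.tsv
```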

Read_groups

Table read_groups contains data elements related to sequencing runs from an NGS experiment. Each row corresponds to a read group. The total number of rows should match the value of the field read_group_count in experiment.tsv. Download the template and format it according to the latest dictionary.

The following is an example of a correctly formatted metadata file:

| type | submitter_sequencing_experiment_id | submitter_read_group_id | read_group_id_in_bam | platform_unit | is_paired_end | file_r1 | file_r2 | library_name | read_length_r1 | read_length_r2 | insert_size | sample_barcode |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| read_group | exp_01 | C0HVY.2 | C0HVY.2 | 74_8a | true | test_rg_3.bam | test_rg_3.bam | Pond-147580 | 150 | 150 | 298 | AGCTTACC |
| read_group | exp_01 | D0RE2.1 | D0RE2.1 | 74_8b | true | test_rg_3.bam | test_rg_3.bam | Pond-147580 | 150 | 150 | 298 | AGCTTACC |
| read_group | exp_01 | D0RH0.2 | D0RH0.2 | 74_8c | true | test_rg_3.bam | test_rg_3.bam | Pond-147580 | 150 | 150 | 298 | AGCTTACC |
note
  • Fields submitter_sequencing_experiment_id, submitter_read_group_id, platform_unit, is_paired_end, and library_name are Required for all read_groups.
  • Field read_group_id_in_bam is Required for BAM/CRAM files, however this can NOT be submitted for FASTQ files.
  • Field file_r1 is Required, and both file_r1 and file_r2 must match a fileName identified in the files section.
  • Fields file_r2, read_length_r2 and insert_size are Required if and only if paired-end sequencing was done.
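To cross-check these rules before submitting BAM data, you can compare the @RG lines in the BAM header against your TSV. A minimal sketch, assuming samtools is installed and using the example file names from the tables above:

```bash
# Read group IDs in the BAM header must match the read_group_id_in_bam
# values in read_groups.tsv.
samtools view -H input/test_rg_3.bam | grep '^@RG'

# The number of data rows (excluding the header) should equal
# read_group_count in experiment.tsv.
tail -n +2 read_groups.tsv | wc -l
```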

Files

Table files contains data elements related to the submitted files. Each row corresponds to a file object. Download the template and format it according to the latest dictionary.

The following is an example of a correctly formatted metadata file:

| type | name | format | size | md5sum | path | ega_file_id | ega_dataset_id | ega_run_id | ega_sample_id | ega_experiment_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| file | test_rg_3.bam | BAM | 14245 | 178f97f7b1ca8bfc28fd5586bdd56799 | input/test_rg_3.bam |  |  |  |  |  |
note
  • Fields name, format, size, md5sum are Required for all files.
  • Field path is Required for both local data (use the file path relative to the directory from which you run the workflow) and data downloaded from EGA through Aspera (use the file path relative to the Aspera root directory).
  • Field ega_file_id is Required for data downloaded from EGA.
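The size and md5sum columns can be generated directly from the files on disk. A minimal sketch, using the example file from the table above:

```bash
# Byte size for the 'size' column and checksum for the 'md5sum' column.
f=input/test_rg_3.bam
wc -c < "$f"        # file size in bytes
md5sum "$f"         # md5 checksum (on macOS, use: md5 -r "$f")
```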

Step 2. Run the Data Submission Workflow

Once you have formatted the metadata TSV files correctly, use the icgc-argo/argo-data-submission workflow to submit your data. You can get YOUR_API_TOKEN by following the instructions.

Submit Local Data

```bash
nextflow run main.nf \
  -profile rdpc,<singularity/docker> \
  --study_id YOUR_Program_ID \
  --download_mode local \
  --experiment_info_tsv path/to/experiment.tsv \
  --read_group_info_tsv path/to/read_groups.tsv \
  --file_info_tsv path/to/files.tsv \
  --api_token YOUR_API_TOKEN
```

note
  • Field path in table files.tsv is Required for local data; it must be formatted as the file path relative to the directory from which you run the data submission workflow.
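As a sanity check, confirm that every value in the path column resolves from your launch directory. A minimal sketch, reusing the header-lookup trick from Step 1:

```bash
# Run from the launch directory: every 'path' value in files.tsv must
# resolve relative to here. This prints any missing files.
awk -F'\t' 'NR==1 { for (i = 1; i <= NF; i++) col[$i] = i }
            NR>1  { print $col["path"] }' files.tsv |
  while read -r p; do [ -f "$p" ] || echo "MISSING: $p"; done
```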

Submit Remote Data Stored in EGA Archives

If your data is hosted in the EGA Archives, the workflow provides two methods for downloading the data from EGA and resubmitting it to ARGO: Aspera and pyega3. The example below uses the Aspera method.

```bash
nextflow run main.nf \
  -profile rdpc,<singularity/docker> \
  --study_id YOUR_Program_ID \
  --download_mode aspera \
  --experiment_info_tsv path/to/experiment.tsv \
  --read_group_info_tsv path/to/read_groups.tsv \
  --file_info_tsv path/to/files.tsv \
  --ascp_scp_host ascp_scp_host \
  --ascp_scp_user ascp_scp_user \
  --aspera_scp_pass aspera_scp_pass \
  --c4gh_pass_phrase c4gh_pass_phrase \
  --c4gh_secret_key c4gh_secret_key \
  --api_token YOUR_API_TOKEN
```

note
  • You need to contact the EGA helpdesk to set up the Aspera dbox containing your datasets and to get the credentials ascp_scp_host, ascp_scp_user, and aspera_scp_pass.
  • Since the data in your dbox is encrypted, you will be asked to provide EGA with an alternative email address to receive the decryption key c4gh_pass_phrase.
  • The c4gh_secret_key will be a file found within your Aspera dbox repository.
  • Field path in table files.tsv is Required to run in this mode and must be formatted as the file path relative to the Aspera root directory.
  • Field ega_file_id in table files.tsv is Required to run in this mode. Please retrieve the EGAF IDs from https://ega-archive.org/ or the EGA API.
  • You do NOT need to install the Aspera client or crypt4gh locally to run the data submission workflow; however, you will need to log in to your Aspera dbox to retrieve the files’ relative paths and get the c4gh_secret_key.

Step 3. Verify Submitted Data

Once your data has been successfully submitted through the above data submission workflow, you will receive a message providing the following information about your submitted data:

  • Payload JSON File
  • Analysis ID
  • Submission TSV Receipt

This indicates that your data has been submitted successfully and will be queued for downstream molecular data processing.

You can also verify your submitted data on your Program Dashboard by viewing the Donor Data Summary card. It should display an increase in the number of Raw Reads for the submitted sample in the appropriate Tumour/Normal column.

note
  • There are separate tabs for DNA-Seq and RNA-Seq data.
  • There may be a delay between the completion of your submission and the data becoming viewable on the Program Dashboard. Please contact the DCC admin if the delay persists beyond 24 hours.

Understanding the Metadata Model

Graph Representation

The molecular metadata model is represented as a graph with nodes and edges. It maintains the critical relationships between entities, including program, donor, specimen, sample, experiment, read groups, and files. The node/edge structure is depicted below:

[metadata model diagram]

Each blue node represents an entity; edges between nodes represent relationships between entities. The metadata attributes associated with each entity are represented by yellow nodes.

Metadata Dictionary

The following describes the attributes and permissible values for all of the fields within the metadata TSV files for the ARGO Data Platform and Molecular Data Processing.

| Field | Attribute | Description | Permissible Values | Note |
| --- | --- | --- | --- | --- |
| type | Required | Table type. | sequencing_experiment |  |
| submitter_sequencing_experiment_id | Required | Unique identifier of the sequencing experiment, assigned by the data provider. | String values that meet the regular expression `^[A-Za-z0-9\-\._]{1,98}$` |  |
| program_id | Required | ARGO Program ID, the unique identifier of your program. If you have logged into the ARGO Data Platform, this is the Program ID that you see in the Program Services area. |  |  |
| submitter_donor_id | Required | Unique identifier of the donor, assigned by the data provider. | Values must meet the regular expression `^[A-Za-z0-9\-\._]{1,64}$` | Must be the same value as in the sample_registration table |
| submitter_specimen_id | Required | Unique identifier of the specimen, assigned by the data provider. | Values must meet the regular expression `^[A-Za-z0-9\-\._]{1,64}$` | Must be the same value as in the sample_registration table |
| submitter_sample_id | Required | Unique identifier of the sample, assigned by the data provider. If submitted along with BAM molecular data, must also be present in header SM. | Values must meet the regular expression `^[A-Za-z0-9\-\._]{1,64}$` | Must be the same value as in the sample_registration table |
| submitter_matched_normal_sample_id | Conditional | Provide the identifier of the matched normal sample used for data analysis. | Values must meet the regular expression `^[A-Za-z0-9\-\._]{1,64}$`, or empty (null) | Required for WGS/WXS tumour samples |
| read_group_count | Required | The number of read groups in the molecular files being submitted. | A minimum of 1 is required. |  |
| platform | Required | The sequencing platform type used in data generation. | CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, ONT, PACBIO, Nanopore, BGI |  |
| experimental_strategy | Required | The primary experimental method. For sequencing data it refers to how the sequencing library was made. | WGS, WXS, RNA-Seq, Bisulfite-Seq, ChIP-Seq, Targeted-Seq |  |
| sequencing_date | Optional | The date of sequencing. | Datetime format, for example: 2019-06-16 or 2019-06-16T20:20:39+00:00, or empty (null) |  |
| platform_model | Optional | The model number of the sequencing machine used in data generation. | Any string value or empty (null) |  |
| sequencing_center | Optional | Data centre where sequencing was performed. Can also be specified with BAM header CN. | Any string value or empty (null) |  |
| target_capture_kit | Conditional | Description that can uniquely identify a target capture kit. Suggested value is a combination of vendor, kit name, and kit version. | Any string value or empty (null) | Required for Targeted-Seq/WXS |
| primary_target_regions | Conditional | A BED file which holds the biologically relevant target regions (based on a genome, e.g. GRCh38) to capture by the assay. | ⚡ Customized Enum values which can be mapped to fileName and fileURL maintained by the DCC | Required for Targeted-Seq/WXS |
| capture_target_regions | Conditional | A BED file which holds the technically relevant probe regions to capture by the assay. | ⚡ Customized Enum values which can be mapped to fileName and fileURL maintained by the DCC | Required for Targeted-Seq/WXS |
| number_of_genes | Optional | Number of genes the assay is targeting. | Integer with a minimum value of 1, or empty (null) | Optional for Targeted-Seq |
| gene_padding | Optional | Number of basepairs to add to exon endpoints for the inBED filter. | Integer with a minimum value of 0, or empty (null) | Optional for Targeted-Seq |
| coverage | Optional | List of coverage. | Hotspot Regions, Coding Exons, Introns, Promoters, or empty (null) | Optional for Targeted-Seq |
| library_selection | Optional | The method used to select and/or enrich the material being sequenced. | Affinity Enrichment, Hybrid Selection, miRNA Size Fractionation, Other, PCR, Poly-T Enrichment, Random, rRNA Depletion, or empty (null) | Optional for WXS, Targeted-Seq and RNA-Seq |
| library_preparation_kit | Optional | Provide the kit information used for library construction. Suggested value is a combination of vendor, kit name, and kit version. | String or empty (null) |  |
| library_strandedness | Conditional | Indicate the library strandedness. | UNSTRANDED, FIRST_READ_SENSE_STRAND, FIRST_READ_ANTISENSE_STRAND, or empty (null) | Required for RNA-Seq |
| rin | Optional | A numerical assessment of the integrity of RNA based on the entire electrophoretic trace of the RNA sample, including the presence or absence of degradation products. | A number between 1 and 10, or empty (null) | Optional for RNA-Seq |
| dv200 | Optional | The percentage of RNA fragments that are >200 nucleotides in size. | A percentage, for example: 95%, or empty (null) | Optional for RNA-Seq |
note

⚡ Please contact DCC to create the Customized Enum values. ⚡

Molecular Data and Metadata Validation Rules

It is very important that molecular data is submitted with valid metadata. As a helpful tool for metadata validation, the DCC has developed seq-tools as a stand-alone client and has incorporated it as one of the mandatory steps in the data submission workflow, run on your data before it is officially submitted to the RDPC.

Validation will help you ensure your data is formatted correctly (with accurate identifier assignments between metadata and molecular data files) and that your submission goes smoothly. It also helps the DCC ensure that the downstream Analysis Pipelines will function seamlessly.

The seq-tools wiki lists all of the validation rules that the ARGO RDPC enforces, separated into the following four categories.

  • Metadata General Sanity Checks
  • Sequencing Data Integrity Checks
  • Sequencing Data and Metadata Cross Checks
  • RNA-Seq Specific Checks
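If you want to run seq-tools by hand before launching the workflow, something like the following may work. This is a hedged sketch: the package is assumed to be installable from PyPI as seq-tools, and the exact subcommand and arguments should be confirmed against the seq-tools README.

```bash
# Assumed installation and invocation; verify the flags in the seq-tools docs.
pip install seq-tools
seq-tools validate path/to/metadata_payload.json   # hypothetical arguments
```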

Understanding the Validation Report

Validation reports are separated into five categories:

  • PASS: This status indicates that the payload(s) are ready for submission.
  • PASS-with-WARNING: This status indicates that the payload(s) are ready for submission; however, there may be a parameter you want to double-check before submission.
  • PASS-with-SKIPPED-check: This status indicates that the payload(s) are ready for submission; however, you've skipped the md5sum check, which you may want to perform before submission.
  • PASS-with-WARNING-and-SKIPPED-check: This status indicates that the payload(s) are ready for submission; however, you've skipped the md5sum check and there are some parameters you may want to double-check before submission.
  • INVALID: This status indicates that the payload(s) are not ready for submission, and each reason for failure is listed. Errors must be fixed before you attempt to submit this payload.

Troubleshooting & FAQs

  • Error in sanityCheck
  1. Unregistered Project. E.g.,

```
Project LUNCHTIME does not exist or no samples have been registered
```

This error message indicates an incorrect project code; check for typos or verify that the program exists on the dashboard.

  2. Unregistered donor/specimen/sample. E.g.,

```
submitter_donor_id:'BATMAN' was not found in project:'TEST-PR'. Verify sample has been registered.
```

This error message indicates that the required Donor-Specimen-Sample data has not been registered to the ARGO Data Platform.

  3. Unregistered matched normal sample. E.g.,

```
'submitter_matched_normal_sample_id':SubWf_tes was not found in study. Please verify 'SubWf_tes' has been registered.
```

This error message indicates that the required matched normal sample data has not been registered to the ARGO Data Platform.

  4. Project registered but not in SONG. E.g.,

```
Program TEST-JP does not exist in SONG. Please verify program code is correct. Otherwise contact DCC-admin for help to troubleshoot.
```

Since clinical and molecular metadata are managed by separate databases, upon program registration the DCC-admin will create entries in both. This error message indicates that the program was not correctly created in the SONG database.

  5. Sample with existing analysis. E.g.,

```
Sample 'TEST_SUBMITTER_SAMPLE_ID_ujolwwdsmgN1'/'SA623974' has an existing published analysis '15e3ffd2-16a2-465d-a3ff-d216a2765d4f' for experiment_strategy 'WGS.'
```

Resubmitting an existing sample will result in the above error. Special circumstances, such as replacing existing data, are allowed; please contact the DCC-admin for help and instructions.

  6. Mismatched valid IDs. E.g.,

```
ID Mismatch detected. Specimen_id:'TEST_SUBMITTER_SPECIMEN_ID_ujolwwdsmgN1'/'SP223585' was not found within Donor:'DN108'/'DO263239' 's specimens
```

SP223585 is valid but associated with another donor. Please review the Donor-Specimen-Sample data on the ARGO Program Dashboard to make sure you are submitting the correct samples.

  • Error in pGenExp
  1. Missing library_strandedness. E.g.,

```
'experimental_strategy' 'RNA-Seq' specified but 'library_strandedness' is missing. Resubmit with both values 'experimental_strategy' and 'library_strandedness'
```

Check table experiment and ensure library_strandedness is provided when experimental_strategy==RNA-Seq.

  2. Missing target_capture_kit, primary_target_regions or capture_target_regions. E.g.,

```
'experimental_strategy' 'WXS' specified but 'target_capture_kit' is missing. Resubmit with both values 'experimental_strategy' and 'target_capture_kit'
```

Check table experiment and ensure target_capture_kit, primary_target_regions and capture_target_regions are all provided when experimental_strategy is Targeted-Seq or WXS.

  3. Missing files. E.g.,

```
No such file: /Users/ubuntu/Desktop/GitHub/icgc-argo/argo-data-submission/argo-data-submission-wf/tests/input/D0RH0.2_r2.fq.gz
-- Check script '/Users/ubuntu/Desktop/GitHub/icgc-argo/argo-data-submission/argo-data-submission-wf/main.nf' at line: 154 or see '.nextflow.log' file for more details
```

A file path specified in table files does NOT exist.

  4. Schema validation failure. E.g.,

```
Command output:
None is not of type 'boolean'

Failed validating 'type' in schema['properties']['read_groups']['items']['allOf'][0]['properties']['is_paired_end']:
{'type': 'boolean'}

On instance['read_groups'][1]['is_paired_end']:
None

Command error:
Payload failed to validate against the schema
```

This indicates that the field is_paired_end was not a boolean as required in the schema.

  • Error in valSeq
  1. Failure to validate
```
Payload is INVALID. Please check out details in validation report under:
/Users/esu/Desktop/GitHub/icgc-argo/argo-data-submission/argo-data-submission-wf/tests/work/31/e8eb54eb381d35b459ad943d8fbb7e
```

Submitted files and associated metadata did not pass validation checks. Please refer to the file validation_report.INVALID.jsonl within the provided working directory to help troubleshoot.
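The report is JSON Lines (one JSON object per line), so it can be inspected with standard tooling. A minimal sketch, assuming jq is installed:

```bash
# Pretty-print every entry in the validation report to read the failure
# reasons; jq processes JSONL one object at a time.
jq . validation_report.INVALID.jsonl
```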

Song/Score upload issues

  • Incorrect or expired Access Token

Please follow the instructions to check your Access Token on the ARGO Data Platform and verify that it is valid and has the appropriate scope, regenerating the token if needed.

  • Incorrect Song/Score URLs

Check the URLs for the Song-Client and Score-Client to make sure they are both correct. For example, the Song/Score URLs for the RDPC hosted at OICR Toronto are:

- Song URL: `https://submission-song.rdpc.argo.genomeinformatics.org`
- Score URL: `https://submission-score.rdpc.argo.genomeinformatics.org`
  • The Object Storage providers might be undergoing maintenance; check site banners for updates
  • Error in download-aspera

Any errors or inconsistencies should be directed to EGA/the maintainers of the Aspera dbox, e.g. a differing MD5 between the calculated value and EGA’s manifest.

  • Error in decrypt-aspera

If the file does not decrypt, restart the pipeline including the download step. If the file still cannot be decrypted, the file on the server may be corrupted; please contact the host/source of the data.
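To isolate whether the problem is the downloaded file or the pipeline step, you can attempt a manual decryption with the crypt4gh command-line client. A minimal sketch; the file names are illustrative, and you will be prompted for the c4gh_pass_phrase:

```bash
# Manually decrypt one Crypt4GH-encrypted file using the secret key
# from your Aspera dbox (pip install crypt4gh).
crypt4gh decrypt --sk path/to/c4gh_secret_key < encrypted_file.c4gh > decrypted_file
```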

  • Error in download-pyega3

If you experience problems with the server, please contact EGA-archive for assistance. As an alternative, an Aspera dbox can be set up on their end.

How to resume the job

Appending -resume to any of the aforementioned commands will resume a job, e.g.

```bash
nextflow run main.nf -params-file example.json --api_token YOUR_API_TOKEN -resume
```

How to submit multiple samples in parallel

To efficiently handle submitting multiple samples in parallel, we suggest running the submission workflow for each sample from a different launch directory.

Assuming you have downloaded the workflow to a local directory (e.g., projectDir), you can run the submission workflow in parallel from a different launch directory (e.g., launchDir) for each sample. E.g.,

```bash
cd launchDir
nextflow run path/to/projectDir/main.nf \
  -profile rdpc,<singularity/docker> \
  --study_id YOUR_Program_ID \
  --download_mode local \
  --experiment_info_tsv path/to/experiment.tsv \
  --read_group_info_tsv path/to/read_groups.tsv \
  --file_info_tsv path/to/files.tsv \
  --api_token YOUR_API_TOKEN
```
note
  • Field path in table files.tsv is Required for local data; it must be formatted as the file path relative to the directory from which you run the data submission workflow.
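A scripted version of this pattern might look like the following. This is a minimal sketch: the sample names are illustrative, and it assumes each runs/<sample> directory has been pre-staged with that sample's experiment.tsv, read_groups.tsv, files.tsv, and data files, with the workflow cloned at ../../argo-data-submission relative to each launch directory.

```bash
# Launch one submission per sample from its own launch directory so the
# per-run .nextflow state and work directories do not collide.
for sample in sampleA sampleB; do
  (
    cd runs/"$sample"
    nextflow run ../../argo-data-submission/main.nf \
      -profile rdpc,docker \
      --study_id YOUR_Program_ID \
      --download_mode local \
      --experiment_info_tsv experiment.tsv \
      --read_group_info_tsv read_groups.tsv \
      --file_info_tsv files.tsv \
      --api_token YOUR_API_TOKEN
  ) &
done
wait   # block until all submissions finish
```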

How to submit molecular data with metadata in JSON payload instead of TSVs

If you have already formatted the sample metadata as a JSON payload, you can still run the workflow to submit your data:

```bash
nextflow run main.nf \
  -profile rdpc,<singularity/docker> \
  --study_id YOUR_Program_ID \
  --download_mode local \
  --metadata_payload_json path/to/payload.json \
  --data_directory path/to/molecular_data \
  --api_token YOUR_API_TOKEN
```
note
  • When specifying data_directory, it should be a path relative to the directory from which you run the data submission workflow. Please do NOT add a trailing / to the path.
  • Do NOT use an absolute path when specifying the data directory.
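Before launching in this mode, it can help to confirm that the payload parses as valid JSON. A minimal sketch, assuming jq is installed:

```bash
# 'jq empty' exits non-zero on malformed JSON without printing the document.
jq empty path/to/payload.json && echo "payload is well-formed JSON"
```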