Accessing and Downloading Data

The ICGC ARGO Data Platform contains a dataset harmonized against the latest reference genome, GRCh38. For information on the data analysis and data types, please see the Analysis Pipeline documentation.

IMPORTANT: Please make sure to upgrade the score-client or Docker distribution to version 5.10.0 or later.

Data Download

The ARGO Data Platform uses the score-client as a file download manager. The score-client transfers data with resumable downloads and has built-in BAM/CRAM slicing to make data download fast and smooth.

Please note:

  • downloads are done in parts and can be resumed as needed
  • the score-client will automatically resume downloads if interrupted or paused briefly
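
For interruptions that outlast the automatic resume, one option is to wrap the download in a small retry loop. The sketch below is a generic shell helper, not part of the score-client: the retry function and the RETRY_DELAY variable are hypothetical names, and it assumes that re-running the same download command continues from the parts already transferred.

```shell
# retry ATTEMPTS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or ATTEMPTS runs have failed.
retry() {
  attempts=$1
  shift
  n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    # Pause between attempts; RETRY_DELAY is an assumed, user-set value in seconds.
    sleep "${RETRY_DELAY:-5}"
  done
}

# Example (not run here):
#   retry 3 bin/score-client download --manifest ./manifest.tsv --output-dir ./data
```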

Searching for Files

Platform users can search for a file set of interest using the File Repository. Narrow down the file set by selecting specific values in the filter panel on the left side of the page. Once you have identified a file set, click Download > File Manifest at the top right of the table to download a TSV file manifest.

The file manifest contains a list of the files that match your search query, along with some additional metadata to assist in file identification. The file manifest will be used by the score-client to identify the list of files to download.
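
Before starting a large download, it can help to check what a manifest contains. The following is a minimal sketch, assuming the first line of the TSV is a header row; the function names are hypothetical and no particular column names are assumed.

```shell
# Print each column name of a TSV manifest on its own line.
# Assumes the first line of the file is a header row.
manifest_columns() {
  head -n 1 "$1" | tr '\t' '\n'
}

# Count the non-empty data rows (one per file), skipping the header line.
manifest_file_count() {
  awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Example (not run here):
#   manifest_columns ./score-manifest.20200520.tsv
#   manifest_file_count ./score-manifest.20200520.tsv
```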

NOTE: Clinical data can be downloaded by any user and does not require the score-client. In order to download controlled molecular data, you must have ICGC DACO approval for access to controlled data. Learn more about the DACO application process here, or apply for DACO approval here.

Installing the score-client

The score-client can be run in different ways depending on your operating system or setup:

  • If you are on Windows, use the score-client Docker distribution.
  • If you are on a Unix-like system (macOS/Linux), you can use either the Docker distribution or the score-client directly.

Prerequisites

Using the score-client directly requires Java 11 to be installed. The procedure for installing OpenJDK 11 varies depending on the operating system and package manager used. For example, on Debian or Ubuntu:

apt-get install openjdk-11-jdk

If using the Docker distribution, Java is bundled and does not need to be installed.
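
When running the score-client directly, the installed Java version can be checked before the first run. This is a minimal sketch: java_major_version is a hypothetical helper that parses the banner printed by java -version, and treating versions newer than 11 as acceptable is an assumption, not something this guide states.

```shell
# java_major_version BANNER
# Hypothetical helper: prints the major Java version found in `java -version` output.
java_major_version() {
  ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  case "$ver" in
    1.*)
      # Legacy scheme, e.g. "1.8.0_292" means Java 8.
      ver=${ver#1.}
      printf '%s\n' "${ver%%.*}"
      ;;
    *)
      # Modern scheme, e.g. "11.0.20" or "17-ea".
      printf '%s\n' "${ver%%[-.+]*}"
      ;;
  esac
}

# Example (not run here; `java -version` writes its banner to stderr):
#   major=$(java_major_version "$(java -version 2>&1)")
#   [ "${major:-0}" -ge 11 ] || echo "Java 11 is required, found: ${major:-none}" >&2
```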

By default, the score-client is configured to use a maximum of 8 GB of RAM. Most of the time, this is more than sufficient for fast downloads.

Distributions

Download the latest version of the score-client. Once you have unzipped the tarball, change directories into the unzipped folder:

wget -O score-client.tar.gz https://artifacts.oicr.on.ca/artifactory/dcc-release/bio/overture/score-client/[RELEASE]/score-client-[RELEASE]-dist.tar.gz

tar xvzf score-client.tar.gz

## Note: Once unzipped, the final directory will be suffixed with the latest release number.
cd score-client-<latest-release-number>
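
The [RELEASE] placeholder appears twice in the download URL, and both occurrences must be the same release number. As a small sketch, a hypothetical helper can build the URL from one value; 5.10.0 is used in the example only because it is the minimum version named earlier, and the release you pass must exist on the artifact server.

```shell
# score_client_dist_url RELEASE
# Hypothetical helper: fills both [RELEASE] placeholders in the download URL.
score_client_dist_url() {
  release=$1
  printf 'https://artifacts.oicr.on.ca/artifactory/dcc-release/bio/overture/score-client/%s/score-client-%s-dist.tar.gz\n' "$release" "$release"
}

# Example (not run here):
#   wget -O score-client.tar.gz "$(score_client_dist_url 5.10.0)"
```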

Update the conf/application.properties file with your user values, including:

  • accessToken: your personal API Token
  • metadata.url: the file metadata Song server URL
  • storage.url: the object storage Score server URL

This is an example of how your application.properties configuration file should look:

# The access token for authorized access to data
accessToken=92038829-338c-4aa2-92fc2-a3c241f63ff0

# The location of the metadata service (SONG)
metadata.url=https://api.platform.icgc-argo.org/storage-api

# The location of the object storage service (SCORE)
storage.url=https://api.platform.icgc-argo.org/storage-api

Once you have configured your application.properties, you will be ready to use the score-client.
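
A quick check that all three properties are set can catch configuration mistakes before the first download. The sketch below is a hypothetical helper, not a score-client feature; it only verifies that each key has a non-empty value and does not validate the token or the URLs.

```shell
# check_score_config FILE
# Hypothetical helper: succeeds only if accessToken, metadata.url and
# storage.url each appear as "key=value" with a non-empty value.
check_score_config() {
  conf=$1
  status=0
  for key in accessToken metadata.url storage.url; do
    if ! awk -F= -v k="$key" '$1 == k && length($2) > 0 { found = 1 } END { exit !found }' "$conf"; then
      echo "missing or empty: $key" >&2
      status=1
    fi
  done
  return "$status"
}

# Example (not run here):
#   check_score_config conf/application.properties && echo "configuration looks complete"
```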

Score-client Usage

This section provides information on how to use the score-client once it has been properly downloaded and configured according to the distribution type.

The score-client has the general syntax:

score-client [options] [command] [command options]

It offers a set of commands, where each command has its own set of options to influence its operation. You can find all options with --help:

  Options:
    --silent
      Do not produce any informational messages
      Default: false
    --help
      Show help information
      Default: false
    --profile
      Define environment profile used to resolve configuration properties
      Default: default
    --quiet
      Reduce output for non-interactive usage
      Default: false
    --version
      Show version information
      Default: false
  Commands:
    view      Locally store/display some or all of a remote SAM/BAM file object
    version   Display application version information
    mount     Mount a read-only FUSE file system view of the remote storage repository
    url       Resolve the URL of a specified remote file object
    help      Display help information for a specified command name
    info      Display application configuration information
    manifest  Resolve a file object manifest and display it
    download  Retrieve file object(s) from the remote storage repository
    upload    Upload file object(s) to the remote storage repository

Download

Download a list of files by manifest

NOTE: You may see some warnings when downloading files by manifest, but the download should still proceed. This is a known issue that will be fixed in an upcoming release.

Using a manifest is ideal for downloading multiple files identified through the ARGO Platform.

Run the score-client using the download command. Define your options:

  • --manifest: location of the manifest file listing files to be downloaded
  • --output-dir: location you want the downloaded files to be written to

For example:

bin/score-client download --manifest ./directory-path/score-manifest.20200520.tsv --output-dir ./output-directory-path

The optional --output-layout option can be used to organize the downloads into a couple of predefined directory layouts. See the --help option for additional information.
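
For scripted use, the manifest download can be wrapped so that a missing manifest is caught before the client starts. This is a minimal sketch under stated assumptions: download_by_manifest and the SCORE_CLIENT variable are hypothetical names, and only the download command with the --manifest and --output-dir options described above is used.

```shell
# download_by_manifest MANIFEST OUTPUT_DIR
# Hypothetical wrapper: checks the inputs, then runs the download command.
# SCORE_CLIENT is an assumed variable for the client path (default: bin/score-client).
download_by_manifest() {
  manifest_path=$1
  output_dir=$2
  if [ ! -r "$manifest_path" ]; then
    echo "manifest not readable: $manifest_path" >&2
    return 1
  fi
  # Create the output directory if it does not exist yet.
  mkdir -p "$output_dir" || return 1
  "${SCORE_CLIENT:-bin/score-client}" download --manifest "$manifest_path" --output-dir "$output_dir"
}

# Example (not run here):
#   download_by_manifest ./directory-path/score-manifest.20200520.tsv ./output-directory-path
```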

Download a single file by object ID

Run the score-client using the download command. Define your options:

  • --object-id: object ID of the file to be downloaded
  • --output-dir: directory location you want the downloaded files to be written to

For example:

bin/score-client download --object-id ce86a332-407a-11eb-b378-0242ac130002 --output-dir ./output-directory-path

You can also specify multiple object IDs separated by spaces:

bin/score-client download --object-id ddcdd044-adda-5f09-8849-27d6038f8ccd 5cc35183-9291-5711-967d-30afcf20e71f --output-dir data
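
When object IDs are pasted by hand, a format check can catch truncated or mistyped values before a download is attempted. The sketch below assumes that object IDs are lowercase, UUID-formatted strings, which is inferred only from the examples on this page; is_object_id is a hypothetical helper.

```shell
# is_object_id VALUE
# Hypothetical helper: succeeds if VALUE looks like a lowercase UUID
# (8-4-4-4-12 hexadecimal digits), the shape of the example IDs in this guide.
is_object_id() {
  printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
}

# Example (not run here):
#   for id in ddcdd044-adda-5f09-8849-27d6038f8ccd 5cc35183-9291-5711-967d-30afcf20e71f; do
#     is_object_id "$id" || echo "unexpected object ID format: $id" >&2
#   done
```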

BAM/CRAM Slicing

The view command is a minimal version of samtools view. It lets you view a “genomic slice” of a remote BAM file without downloading the entire file, saving bandwidth and time. For CRAM files, refer to the file page on the platform (e.g. https://platform.icgc-argo.org/file/FL38379) for your specific file, and use the genome build indicated there to obtain the matching reference file.

The following example will download reads overlapping positions 1 to 10,000 on chromosome 1:

bin/score-client view --object-id ddcdd044-adda-5f09-8849-27d6038f8ccd --query 1:1-10000 --reference-file <local-path-to-your-reference-file>
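
The --query value in the example follows the pattern chromosome:start-end. As a small sketch, a hypothetical helper can assemble and sanity-check that string; it assumes 1-based positions with start no greater than end, as in samtools region strings, and it does not check chromosome names against the reference.

```shell
# region_query CHROM START END
# Hypothetical helper: prints a "CHROM:START-END" string for --query.
region_query() {
  chrom=$1
  start=$2
  end=$3
  for value in "$start" "$end"; do
    case "$value" in
      ''|*[!0-9]*)
        echo "start and end must be whole numbers" >&2
        return 1
        ;;
    esac
  done
  if [ "$start" -lt 1 ] || [ "$start" -gt "$end" ]; then
    echo "expected 1 <= start <= end" >&2
    return 1
  fi
  printf '%s:%s-%s\n' "$chrom" "$start" "$end"
}

# Example (not run here):
#   bin/score-client view --object-id ddcdd044-adda-5f09-8849-27d6038f8ccd --query "$(region_query 1 1 10000)"
```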

The BAI is automatically discovered and streamed as part of the operation. To quickly access only the BAM header, run:

bin/score-client view --header-only --object-id ddcdd044-adda-5f09-8849-27d6038f8ccd

The output of the view command can also be piped to samtools or other tools as part of a workflow:

bin/score-client view --object-id ddcdd044-adda-5f09-8849-27d6038f8ccd --query 1:1-100000 | samtools mpileup -

FUSE Mounting

The mount command can be used to mount the remote S3 bucket as a read-only FUSE file system. This is very useful to browse and explore the available files, as well as quickly see their size and date of modification using common commands such as ls, find, du and tree. It also works very well with standard analysis tools such as samtools.

To use the mount feature, FUSE is required. On most Linux-based systems, this means installing libfuse-dev, fuse and other packages. Below is the command to install them on Ubuntu:

sudo apt-get install -y libfuse-dev fuse curl wget software-properties-common

Files are organized into a virtual directory structure. The following shows the default bundle layout:

/bundleId1/fileName1
/bundleId1/fileName2
...
/bundleId1/fileNamei
...
/bundleIdn/fileName1
/bundleIdn/fileName2
...
/bundleIdn/fileNamej

where bundleId and fileName are the original Bundle ID and file name of the file respectively. It is possible to control the layout using the --layout option. Using --layout object-id will instead produce a flat list of files named by their associated object id.

The file system implementation is optimized for serial reads, and frequent random access will lead to very poor performance. Under the covers, each random seek requires a new HTTP connection to S3 with the appropriate Range header set, which is an expensive operation. For this reason, the mount is recommended only for streaming analysis (e.g. samtools view-like functionality).

Mount a manifest of files

# Create the mount point
sudo mkdir /mnt/icgc-argo
sudo chmod 777 /mnt/icgc-argo

# Mount
bin/score-client mount --mount-point /mnt/icgc-argo --manifest manifest_file_name.txt --cache-metadata
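
Before mounting, it can be worth confirming that the mount point exists and is empty. This is a precautionary sketch rather than a documented score-client requirement; mount_point_ready is a hypothetical helper, and the emptiness check reflects the common FUSE default of refusing to mount over a non-empty directory.

```shell
# mount_point_ready DIR
# Hypothetical helper: succeeds if DIR is an existing, empty directory.
mount_point_ready() {
  dir=$1
  if [ ! -d "$dir" ]; then
    echo "not a directory: $dir" >&2
    return 1
  fi
  # `ls -A` lists hidden entries too, so any content makes the check fail.
  if [ -n "$(ls -A "$dir")" ]; then
    echo "mount point is not empty: $dir" >&2
    return 1
  fi
  return 0
}

# Example (not run here):
#   mount_point_ready /mnt/icgc-argo && echo "ready to mount"
```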

Once mounted, you can use standard analysis tools against files found under the mount point:

samtools view /mnt/icgc-argo/fff75930-0f8c-4c99-9b48-732e7ed4c625/443a7a6ab964e41c011cc9a303bc086c.bam 1:10000-20000
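
Because the mount behaves like an ordinary read-only directory tree, standard tools such as find can list what is available. The following sketch uses a hypothetical list_bam_files helper and assumes only that BAM files carry a .bam extension, as in the example path above.

```shell
# list_bam_files DIR
# Hypothetical helper: prints every *.bam file under DIR, sorted by path.
list_bam_files() {
  find "$1" -type f -name '*.bam' | sort
}

# Example (not run here):
#   list_bam_files /mnt/icgc-argo
```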

Additional Data Sources

In addition to the latest harmonized data on the ICGC ARGO Platform, you can also access legacy data from the ICGC 25K project.

  • ICGC 25K Data Portal: Contains a dataset compiled against the GRCh37 reference genome.
  • EGA Data Portal: Contains the raw data submitted to ICGC 25K.
    • Data can only be downloaded through the EGA download client, but metadata may be viewed on the EGA website. Files are grouped into datasets based on the study they were collected in, and access is granted on a dataset-by-dataset basis. This repository carries both ICGC and non-ICGC data.
    • For more information, consult the Guide to Data Access.