Accessing and Downloading Data
The ICGC ARGO Data Platform contains a dataset harmonized against the latest reference genome, GRCh38. For information on the data analysis and data types, please see the Analysis Pipeline documentation.
IMPORTANT: Please make sure to upgrade the score-client or Docker distribution to version 5.10.0 or later.
Data Download
The ARGO Data Platform uses the score-client as a file download manager. The score-client supports resumable data transfers and has built-in BAM/CRAM slicing to make data download fast and smooth.
Please note:
- downloads are done in parts and can be resumed as needed
- the score-client will automatically resume downloads if interrupted or paused briefly
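Because interrupted downloads can be resumed, one way to cope with an unreliable connection is to re-run the same command until it succeeds. The sketch below is a generic retry helper and not part of the score-client: the function name `retry` and the `RETRY_DELAY` variable are hypothetical, and it assumes that re-running an identical download command continues from the parts already downloaded.

```shell
# Re-run a command until it succeeds or the attempt limit is reached.
# Usage: retry <max-attempts> <command> [args...]
retry() {
  max_attempts=$1
  shift
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    # RETRY_DELAY (seconds) is a hypothetical setting; it defaults to 30.
    sleep "${RETRY_DELAY:-30}"
  done
}

# Example (not run here):
#   retry 3 bin/score-client download --manifest manifest.tsv --output-dir data
```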
Searching for Files
Platform users can search for a file set of interest using the File Repository. File sets can be narrowed down by selecting specific values in the filter panel on the left side of the page. Once you have identified a file set, click Download > File Manifest on the top right side of the table to download a TSV file manifest.
The file manifest contains a list of the files that match your search query, along with some additional metadata to assist in file identification. The file manifest will be used by the score-client to identify the list of files to download.
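Before handing a manifest to the score-client, it can be useful to check what it contains. The helpers below are a minimal sketch that assumes the manifest is a tab-separated file whose first line is a header row; the function names are hypothetical and the actual column names are not assumed.

```shell
# Count the files listed in a TSV manifest (all non-empty lines after the header).
manifest_file_count() {
  awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Print the column names from the header row, one per line.
manifest_columns() {
  head -n 1 "$1" | tr '\t' '\n'
}

# Example (not run here):
#   manifest_columns score-manifest.20200520.tsv
#   manifest_file_count score-manifest.20200520.tsv
```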
NOTE: Clinical data can be downloaded by any user and does not require the score-client. In order to download controlled molecular data, you must have ICGC DACO approval for access to controlled data. Learn more about the DACO application process here, or apply for DACO approval here.
Installing the score-client
The score-client can be run in different ways depending on your operating system or setup:
- If you are on Windows, use the score-client Docker distribution.
- If you are on a Unix system (macOS/Linux), you can use either the Docker distribution or the score-client directly.
Prerequisites
Using the score-client directly requires Java 11 to be installed. The procedure for installing OpenJDK 11 will vary depending on the operating system and package manager used. For example, with apt-get on Debian or Ubuntu:
apt-get install openjdk-11-jdk
If using the Docker distribution, Java is bundled and does not need to be installed.
By default, the score-client is configured to use a maximum of 8 GB of RAM. Most of the time this is more than sufficient for fast downloads.
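To confirm the Java prerequisite before running the client directly, a check along these lines may help. It is a sketch with hypothetical function names: it parses the first line of `java -version` and accepts major version 11 or newer, which is an assumption, since the requirement above names Java 11 only.

```shell
# Extract the major version from a line such as:
#   openjdk version "11.0.20" 2023-07-18
java_major_version() {
  ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
  case "$ver" in
    1.*) printf '%s\n' "$ver" | cut -d. -f2 ;;          # legacy scheme: 1.8.0_292 -> 8
    *)   printf '%s\n' "$ver" | sed 's/[^0-9].*//' ;;   # modern scheme: 11.0.20 -> 11
  esac
}

# Return 0 if java is on PATH and reports major version 11 or newer.
check_java() {
  if ! command -v java >/dev/null 2>&1; then
    echo "java not found on PATH" >&2
    return 1
  fi
  # java prints its version banner to stderr, hence the 2>&1.
  major=$(java_major_version "$(java -version 2>&1 | head -n 1)")
  [ "${major:-0}" -ge 11 ]
}

# Example (not run here):
#   check_java && echo "Java looks suitable for the score-client"
```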
Distributions
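The sections below cover each distribution in this order:
- Docker
- Client with ENV variables
- Client (configured through conf/application.properties)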
Pull the latest version of the score-client Docker distribution:
docker pull overture/score
Once pulled, you can open a shell in the container by executing:
docker run -it overture/score
score-client
Update the docker configuration with your user values, including:
- METADATA_URL: the file metadata Song server URL
- STORAGE_URL: the object storage Score server URL
- ACCESSTOKEN: your personal API Token
docker run -it -e "METADATA_URL=https://api.platform.icgc-argo.org/storage-api" -e "STORAGE_URL=https://api.platform.icgc-argo.org/storage-api" -e "ACCESSTOKEN=92038829-338c-4aa2-92fc2-a3c241f63ff0" overture/score
There is no entry point or command defined for the image. The software is located at score-client which is also the working directory of the container.
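As an alternative to passing each value with `-e`, the same three variables can be kept in a file and supplied through Docker's `--env-file` option, which keeps the token out of your shell history. This is a sketch: the function name `write_score_env` and the file name `score.env` are hypothetical, and the URLs are the ones shown above.

```shell
# Write the three settings to an env file readable only by the current user.
# Usage: write_score_env <env-file> <api-token>
write_score_env() {
  (
    umask 077
    printf 'METADATA_URL=%s\nSTORAGE_URL=%s\nACCESSTOKEN=%s\n' \
      "https://api.platform.icgc-argo.org/storage-api" \
      "https://api.platform.icgc-argo.org/storage-api" \
      "$2" > "$1"
  )
}

# Example (not run here):
#   write_score_env score.env "<your-api-token>"
#   docker run -it --env-file score.env overture/score
```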
To use the client with ENV variables, download the latest score-client. Once you have extracted the tarball, change into the extracted directory:
wget -O score-client.tar.gz https://github.com/overture-stack/score/releases/download/5.11.0/score-client-5.11.0-dist.tar.gz
tar xvzf score-client.tar.gz
## Note: Once unzipped, the final directory will be suffixed with the latest release number.
cd score-client-<latest-release-number>
You can define the required inputs as ENV variables, either stored on your system or defined with each client operation:
- METADATA_URL: the file metadata Song server URL
- STORAGE_URL: the object storage Score server URL
- ACCESSTOKEN: your personal API Token
For example, to download files with a manifest:
METADATA_URL=https://api.platform.icgc-argo.org/storage-api STORAGE_URL=https://api.platform.icgc-argo.org/storage-api bin/score-client download --manifest manifest1.txt
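To store the values for the whole shell session instead of repeating them on every command, they can be exported once. The snippet below is a sketch; `<your-api-token>` is a placeholder to replace with your own token.

```shell
# Export the settings once; later score-client calls in this shell inherit them.
export METADATA_URL="https://api.platform.icgc-argo.org/storage-api"
export STORAGE_URL="https://api.platform.icgc-argo.org/storage-api"
export ACCESSTOKEN="<your-api-token>"

# Example (not run here):
#   bin/score-client download --manifest manifest1.txt
```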
To use the client with a configuration file, download the latest score-client. Once you have extracted the tarball, change into the extracted directory:
wget -O score-client.tar.gz https://github.com/overture-stack/score/releases/download/5.11.0/score-client-5.11.0-dist.tar.gz
tar xvzf score-client.tar.gz
## Note: Once unzipped, the final directory will be suffixed with the latest release number.
cd score-client-<latest-release-number>
Update the conf/application.properties file with your user values, including:
- accessToken: your personal API Token
- metadata.url: the file metadata Song server URL
- storage.url: the object storage Score server URL
This is an example of how your application.properties configuration file should look:
# The access token for authorized access to data
accessToken=92038829-338c-4aa2-92fc2-a3c241f63ff0
# The location of the metadata service (SONG)
metadata.url=https://api.platform.icgc-argo.org/storage-api
# The location of the object storage service (SCORE)
storage.url=https://api.platform.icgc-argo.org/storage-api
Once you have configured your application.properties, you will be ready to use the score-client.
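A quick way to confirm that the three properties are present before the first run is sketched below. The function name `check_score_conf` is hypothetical; it only verifies that each key has a non-empty value and does not validate the token or the URLs.

```shell
# Verify that accessToken, metadata.url and storage.url each have a value.
# Usage: check_score_conf [path-to-application.properties]
check_score_conf() {
  conf=${1:-conf/application.properties}
  missing=0
  for key in accessToken metadata.url storage.url; do
    # A key counts as set when a line "key=value" exists with a non-empty value.
    if ! awk -F= -v k="$key" '
        { name = $1; gsub(/^[ \t]+|[ \t]+$/, "", name) }
        name == k && $2 != "" { found = 1 }
        END { exit !found }
      ' "$conf"; then
      echo "missing or empty: $key" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example (not run here):
#   check_score_conf conf/application.properties && echo "configuration looks complete"
```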
Score-client Usage
This section provides information on how to use the score-client once it has been properly downloaded and configured according to the distribution type.
The score-client has the general syntax:
score-client [options] [command] [command options]
It offers a set of commands, where each command has its own set of options to influence its operation. You can find all options with --help:
Options:
  --silent
    Do not produce any informational messages
    Default: false
  --help
    Show help information
    Default: false
  --profile
    Define environment profile used to resolve configuration properties
    Default: default
  --quiet
    Reduce output for non-interactive usage
    Default: false
  --version
    Show version information
    Default: false
Commands:
  view      Locally store/display some or all of a remote SAM/BAM file object
  version   Display application version information
  mount     Mount a read-only FUSE file system view of the remote storage repository
  url       Resolve the URL of a specified remote file object
  help      Display help information for a specified command name
  info      Display application configuration information
  manifest  Resolve a file object manifest and display it
  download  Retrieve file object(s) from the remote storage repository
  upload    Upload file object(s) to the remote storage repository
Download
Download a list of files by manifest
NOTE: You may see some warnings when downloading files by manifest, but you should still be able to proceed with the download. This is a known issue that will be fixed in an upcoming release.
Using a manifest is ideal for downloading multiple files identified through the ARGO Platform.
Run the score-client using the download command. Define your options:
- --manifest: location of the manifest file listing files to be downloaded
- --output-dir: location you want the downloaded files to be written to
For example:
bin/score-client download --manifest ./directory-path/score-manifest.20200520.tsv --output-dir ./output-directory-path
The optional --output-layout option can be used to organize the downloads into a couple of predefined directory layouts. See the --help option for additional information.
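The manifest download can be wrapped in a small function that checks its inputs first. This is a sketch rather than an official helper: the function name `download_manifest` and the `SCORE_CLIENT` variable (an override for the client path, defaulting to `bin/score-client`) are hypothetical.

```shell
# Check the manifest, create the output directory, then start the download.
# Usage: download_manifest <manifest.tsv> <output-dir>
download_manifest() {
  manifest=$1
  out_dir=$2
  if [ ! -r "$manifest" ]; then
    echo "cannot read manifest: $manifest" >&2
    return 1
  fi
  mkdir -p "$out_dir" || return 1
  # SCORE_CLIENT is a hypothetical override; by default the client is run from bin/.
  "${SCORE_CLIENT:-bin/score-client}" download --manifest "$manifest" --output-dir "$out_dir"
}

# Example (not run here):
#   download_manifest ./directory-path/score-manifest.20200520.tsv ./output-directory-path
```

Setting `SCORE_CLIENT=echo` prints the command that would be run instead of executing it, which is a convenient dry run.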
Download a single file by object ID
Run the score-client using the download command. Define your options:
- --object-id: object ID of the file to be downloaded
- --output-dir: directory location you want the downloaded files to be written to
For example:
bin/score-client download --object-id ce86a332-407a-11eb-b378-0242ac130002 --output-dir ./output-directory-path
You can also specify multiple object IDs separated by spaces:
bin/score-client download --object-id ddcdd044-adda-5f09-8849-27d6038f8ccd 5cc35183-9291-5711-967d-30afcf20e71f --output-dir data
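When the list of object IDs is long, it may be easier to keep them in a text file. The sketch below reads one ID per line (blank lines and lines starting with `#` are skipped) and passes them in the space-separated form shown above. The function name `download_ids_from_file`, the file format, and the `SCORE_CLIENT` override are all hypothetical.

```shell
# Download every object ID listed in a text file, one ID per line.
# Usage: download_ids_from_file <ids-file> <output-dir>
download_ids_from_file() {
  ids_file=$1
  out_dir=$2
  if [ ! -r "$ids_file" ]; then
    echo "cannot read: $ids_file" >&2
    return 1
  fi
  # Rebuild the positional parameters from the file contents.
  set --
  while IFS= read -r id || [ -n "$id" ]; do
    case "$id" in
      '' | '#'*) continue ;;
    esac
    set -- "$@" "$id"
  done < "$ids_file"
  if [ "$#" -eq 0 ]; then
    echo "no object IDs found in $ids_file" >&2
    return 1
  fi
  # SCORE_CLIENT is a hypothetical override for the client path.
  "${SCORE_CLIENT:-bin/score-client}" download --object-id "$@" --output-dir "$out_dir"
}

# Example (not run here):
#   download_ids_from_file object-ids.txt data
```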
BAM/CRAM Slicing
The view command is a minimal version of samtools view. It lets you view a “genomic slice” of a remote BAM file without downloading the entire file, saving bandwidth and time. For CRAM files, refer to the file page on the platform (e.g. https://platform.icgc-argo.org/file/FL38379) for your specific file and use the indicated genome build to obtain the reference file.
The following example retrieves reads overlapping positions 1-10,000 on chromosome 1:
bin/score-client view --object-id ddcdd044-adda-5f09-8849-27d6038f8ccd --query 1:1-10000 --reference-file <local-path-to-your-reference-file>
The BAI is automatically discovered and streamed as part of the operation. To quickly access only the BAM header, run:
bin/score-client view --header-only --object-id ddcdd044-adda-5f09-8849-27d6038f8ccd
You can also pipe the output of the view command to samtools or other tools as part of a workflow:
bin/score-client view --object-id ddcdd044-adda-5f09-8849-27d6038f8ccd --query 1:1-100000 | samtools mpileup -
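To collect several slices of the same file, the view command can be run once per region with its output redirected to a file. This is a sketch: the function name `slice_regions`, the output file naming, and the `SCORE_CLIENT` override are hypothetical, and the `.out` extension is used because the output format is not assumed here.

```shell
# Run one view query per region and save each result to its own file.
# Usage: slice_regions <object-id> <region> [region...]
slice_regions() {
  object_id=$1
  shift
  for region in "$@"; do
    # Derive a file name from the region: 1:1-10000 -> slice_1_1-10000.out
    out_file="slice_$(printf '%s' "$region" | tr ':' '_').out"
    # SCORE_CLIENT is a hypothetical override for the client path.
    "${SCORE_CLIENT:-bin/score-client}" view --object-id "$object_id" --query "$region" > "$out_file" || return 1
  done
}

# Example (not run here):
#   slice_regions ddcdd044-adda-5f09-8849-27d6038f8ccd 1:1-10000 1:20000-30000
```

For CRAM files, the `--reference-file` option from the earlier example would also need to be passed.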
Additional Data Sources
In addition to the latest harmonized data on the ICGC ARGO Platform, you can also access legacy data from the ICGC 25K project.
- ICGC 25K Data Portal: Contains a compiled dataset against the GRCh37 reference genome.
  - For more information, consult the ICGC 25K Data Download documentation.
- EGA Data Portal: Contains raw datasets of data submitted to ICGC 25K.
  - Data can only be downloaded through the EGA download client, but metadata may be viewed on the EGA website. Files are grouped into datasets based on the study they were collected in, and access is granted on a dataset-by-dataset basis. This repository carries both ICGC and non-ICGC data.
  - For more information, consult the Guide to Data Access.