Legacy ICGC 25K Data

ICGC 25K Portal and Project

The ICGC Data Portal was retired in June 2024.

The Data Portal served as a hub and repository, the culmination of a collaborative effort between the International Cancer Genome Consortium and multiple partners from various cancer projects (including TCGA and the Sanger Cancer Genome Project).

The aim was to analyze multiple cancer types and share open access data such as simple somatic mutations, copy number alterations, structural rearrangements, gene expression, microRNAs, DNA methylation and exon junctions.

The dataset spanned 86 cancer projects, 22 cancer primary sites, 24,289 donors of which 22,330 had molecular data, and 81,782,588 simple somatic mutations.

Relocated ICGC 25K Data

Although the interactive data portal was shut down, the final release data and the PCAWG project data remain available. See the details below for how to access these data.

Files from other contributing projects are all hosted by ICGC's partner repositories. To access these data, you will need to identify which repository hosts the files you are looking for and then request them through that repository's service. A mapping file, available for download below, maps ICGC 25K file IDs to their current hosted location.

Accessing ICGC 25K Release Data

ICGC Release Data contains both open and controlled access data.

All open access release data is stored on a publicly available Object Storage Bucket and is available to everyone.

The controlled access release data is hosted on an SFTP server, and is only available to authorized DACO-approved users. If you previously had DACO access for ICGC 25K data you will continue to have permission to access the SFTP server. If you require DACO approval please see the documentation on applying for DACO access.

Both locations contain directories with the following data:

/release_28 - This is the Data Portal data Release 28 (2019-11-26) of the International Cancer Genome Consortium (ICGC).
/PCAWG - Analysis results from the PCAWG study.
/Supplemental - Corrected clinical metadata and RNA-Seq raw read counts (2019-10-16) for projects LICA-FR and PRAD-UK.

Controlled Release Data - SFTP Connection Details

The SFTP server is located at:

Host: icgc-legacy-sftp.platform.icgc-argo.org
Port: 2222

Authentication to the server is done using username and password:

Username: The email address that was approved for DACO access. This is the account you would use to log into the ARGO platform.
Password: ICGC API Key. This is available on your ARGO Platform profile page.

You can connect to this server using any SFTP client of your choice.
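
As a minimal sketch of the connection details above, the host and port can be wrapped in a small shell function for the OpenSSH `sftp` client. The function name `icgc_sftp` and the variable names are hypothetical and chosen only for illustration. The client prompts for the password interactively, which is where the ICGC API Key would be entered.

```shell
# Connection details taken from this page.
ICGC_SFTP_HOST="icgc-legacy-sftp.platform.icgc-argo.org"
ICGC_SFTP_PORT=2222

# icgc_sftp EMAIL
# Hypothetical helper: opens an interactive SFTP session as the
# DACO-approved email address given in $1.
icgc_sftp() {
    sftp -P "$ICGC_SFTP_PORT" "$1@$ICGC_SFTP_HOST"
}

# Example call (contacts the server, so it is left commented out):
# icgc_sftp 'you@your-institution.org'
```

Because the username is itself an email address, the resulting target contains two `@` characters. Quoting the address, as in the example call, keeps the shell from altering it.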

Open Release Data - Object Bucket Details

Open access release data is hosted on a publicly available Object Storage Bucket. While not hosted on Amazon AWS, it uses the AWS S3 interface and is therefore accessible using any S3 compatible object storage client.

The bucket is reachable at:

Host: https://object.genomeinformatics.org
Bucket Name: icgc25k-open

No additional authentication is required.

Using the AWS CLI to access the open data bucket

Instructions for installing the AWS CLI are found here.

This data is not hosted by AWS, so you will need to specify an --endpoint-url argument when using this tool so that it knows where to find this bucket. Below are example commands to accomplish some common use cases.

To navigate and explore the data:

aws s3 ls s3://icgc25k-open --endpoint-url https://object.genomeinformatics.org --no-sign-request

To download a file or recursively download a directory:

aws s3 cp s3://icgc25k-open/PCAWG/consensus_snv_indel/README.md <local-download-directory> --endpoint-url https://object.genomeinformatics.org --no-sign-request
aws s3 cp s3://icgc25k-open/PCAWG/consensus_snv_indel <local-download-directory> --recursive --endpoint-url https://object.genomeinformatics.org --no-sign-request
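
Since every command needs the same `--endpoint-url` and `--no-sign-request` arguments, one option is a small wrapper function that appends them. This is only a sketch: the name `icgc_s3` is hypothetical, and the `sync --dryrun` call is shown as one possible way to preview a download before transferring anything.

```shell
# Endpoint taken from this page.
ICGC_ENDPOINT="https://object.genomeinformatics.org"

# icgc_s3 SUBCOMMAND [ARGS...]
# Hypothetical helper: runs `aws s3` with the endpoint and the
# unsigned-request flag appended after the caller's arguments.
icgc_s3() {
    aws s3 "$@" --endpoint-url "$ICGC_ENDPOINT" --no-sign-request
}

# Example calls (these contact the server, so they are left commented out):
# icgc_s3 ls s3://icgc25k-open/PCAWG/
# icgc_s3 sync s3://icgc25k-open/PCAWG/consensus_snv_indel ./consensus_snv_indel --dryrun
```

With `--dryrun`, the AWS CLI lists the operations it would perform without downloading anything. Removing the flag runs the actual transfer.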

Partner Repositories with ICGC 25K File Data

ICGC 25K file data is hosted across the following repositories.

If you know the specific file IDs to access, the provided Mapping File links those to their alternate locations.

EGA

ICGC 25K DACO approval includes access to raw sequences submitted to ICGC and reprocessed PCAWG files hosted on EGA. Available datasets are listed under the data access committee EGAC00001000010.

Upon DACO approval, the institutional email (the same email used in your DACO application, not the affiliated Gmail account) can be used to log in/sign up at EGA-Archive.

If the email has never been used to access EGA, you will need to create a local EGA account; please follow the EGA password reset procedure to do so.

Access to EGA may take up to 48 hours post DACO approval.

TCGA

Due to data regulation policies, raw sequencing data submitted to ICGC and affiliated data from the PCAWG study are controlled under dbGaP study phs000178.

ICGC 25K DACO approval does not include dbGaP study phs000178 and will not grant access to those files.

Access can be requested by following instructions at phs000178.

To access data, navigate to PDC/ICGC Bionimbus for PCAWG affiliated data or GDC/GDC Portal for raw sequencing data. Both portals will provide a login prompt using your eRA Commons ID and password.

Mapping ICGC Legacy Data to External Repositories

Provided below is a download link for our legacy data mapping file. This will download a TSV file which contains a mapping between ICGC File IDs and their current hosted location(s), relevant ICGC mapping IDs (object, donor, sample, specimen), and IDs for affiliated repositories.

Download: icgc25k-file-mapping.tsv (54MB)

To find files listed in this TSV, check the location column:

SFTP - files are saved at listed SFTP_location or within the listed file
EGA - file can be found by following the ega_dataset_id, ega_analysis_id, ega_file_id, or ega_run_id
PDC - file can be found using the PDC_ID
GDC - file can be found using GDC_ID
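
As a rough sketch of working with the mapping file, the `awk` function below keeps the header row plus every row whose location cell contains a given value. It assumes the header of that column is spelled exactly `location`, which should be checked against the downloaded file. The function name `filter_by_location` is hypothetical. A substring match is used because a file may list more than one location.

```shell
# filter_by_location FILE VALUE
# Hypothetical helper: prints the header line and all rows whose
# "location" column contains VALUE (for example SFTP, EGA, PDC or GDC).
# Assumes a tab-separated file with a header named exactly "location".
filter_by_location() {
    awk -F '\t' -v want="$2" '
        NR == 1 {
            # Find the column index of the "location" header.
            for (i = 1; i <= NF; i++) if ($i == "location") col = i
            if (!col) { print "no location column found" > "/dev/stderr"; exit 1 }
            print
            next
        }
        index($col, want) > 0
    ' "$1"
}

# Example call, writing the GDC-hosted rows to a new file:
# filter_by_location icgc25k-file-mapping.tsv GDC > gdc-files.tsv
```

The resulting rows can then be looked up in the matching repository using the ID columns described above.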

Data Use and Publication Policy

Please see ICGC ARGO's publication policy and data use policies.

Questions and Concerns

If you have any further questions or require additional information please contact the helpdesk.

Frequently Asked Questions

1. Using CLI SFTP, I cannot find anything inside the Example folder

sftp> ls PCAWG/*
PCAWG/APOBEC_mutagenesis                     PCAWG/Hartwig                                PCAWG/README.md                              PCAWG/benchmarking_data                      PCAWG/broad_calls                            PCAWG/cell_lines                             PCAWG/clinical_and_histology                 PCAWG/consensus_cnv
PCAWG/consensus_snv_indel                    PCAWG/consensus_sv                           PCAWG/data_releases                          PCAWG/dkfz_embl_calls                        PCAWG/donors_and_biospecimens                PCAWG/driver_mutations                       PCAWG/drivers                                PCAWG/evolution_and_heterogeneity
PCAWG/germline_variations                    PCAWG/hla_and_neoantigen                     PCAWG/minibams                               PCAWG/msi                                    PCAWG/muse_calls                             PCAWG/mutational_signatures                  PCAWG/networks                               PCAWG/pathogen_analysis
PCAWG/pcawg_dkfz_caller                      PCAWG/pilot50-mosaic                         PCAWG/pilot50_calls                          PCAWG/quality_control_info                   PCAWG/reference_data                         PCAWG/retrotransposition                     PCAWG/rnaseq_aligned_bams                    PCAWG/sanger_calls
PCAWG/sequencing_metadata                    PCAWG/smufin_indel_calls                     PCAWG/subclonal_reconstruction               PCAWG/terminology_and_standard_colours       PCAWG/thesaurus_snv                          PCAWG/transcriptome                          PCAWG/unaligned_bams                         PCAWG/validation_bams
PCAWG/wgs_aligned_bams

2. Connecting through CLI SFTP, I receive the following error: `no matching host key type found. Their offer: ssh-rsa`

The exact fix depends on your system, but ensure your ~/.ssh/config contains the following:

HostKeyAlgorithms ssh-rsa
PubkeyAcceptedKeyTypes ssh-rsa

Alternatively, add -o HostKeyAlgorithms=+ssh-rsa to your SFTP command, for example:

sftp -P 2222 -o HostKeyAlgorithms=+ssh-rsa 'example@gmail.com'@icgc-legacy-sftp.platform.icgc-argo.org
