
Legacy ICGC 25K Data

ICGC 25K Portal and Project

The ICGC Data Portal was retired in June 2024.

The Data Portal served as a hub and repository, the culmination of a collaborative effort between the International Cancer Genome Consortium and multiple partners from various cancer projects (including TCGA and the Sanger Cancer Genome Project).

The aim was to analyze multiple cancer types and share open access data such as simple somatic mutations, copy number alterations, structural rearrangements, gene expression, microRNAs, DNA methylation and exon junctions.

The dataset spanned 86 cancer projects, 22 cancer primary sites, 24,289 donors of which 22,330 had molecular data, and 81,782,588 simple somatic mutations.

Relocated ICGC 25K Data

Although the interactive Data Portal was shut down, the final release data and the PCAWG project data remain available. See the details below for how to access these data.

Files from other contributing projects are all hosted by ICGC's partner repositories. To access this data, you will need to identify which repository hosts the data you are looking for and then request the data through their service. A mapping file is available to download below which maps ICGC 25K file IDs to their current hosted location.

Accessing ICGC 25K Release Data

ICGC Release Data contains both open and controlled access data.

All open access release data is stored on a publicly available Object Storage Bucket and is available to everyone.

The controlled access release data is hosted on an SFTP server, and is only available to authorized DACO-approved users. If you previously had DACO access for ICGC 25K data you will continue to have permission to access the SFTP server. If you require DACO approval please see the documentation on applying for DACO access.

Both locations contain directories with the following data:

  • /release_28 - This is the Data Portal data Release 28 (2019-11-26) of the International Cancer Genome Consortium (ICGC).
  • /PCAWG - Analysis results from the PCAWG study.
  • /Supplemental - Corrected clinical metadata and RNA-Seq raw read counts (2019-10-16) for projects LICA-FR and PRAD-UK.

Controlled Release Data - SFTP Connection Details

The SFTP server is located at:

  • Host: icgc-legacy-sftp.platform.icgc-argo.org
  • Port: 2222

Authentication to the server is done using username and password:

  • Username: The email address that was approved for DACO access. This is the account you would use to log into the ARGO platform.
  • Password: ICGC API Key. This is available on your ARGO Platform profile page.

You can connect to this server using any SFTP client of your choice.
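
As a minimal sketch of a command-line connection, the helper below wraps the OpenSSH sftp client with the host and port listed above. The function name icgc_sftp and the example address are illustrative assumptions, not part of any ICGC tooling; at the password prompt, enter your ICGC API Key.

```shell
# Host and port as listed in the connection details above.
SFTP_HOST='icgc-legacy-sftp.platform.icgc-argo.org'
SFTP_PORT=2222

# icgc_sftp is a hypothetical helper name.
# Usage: icgc_sftp <DACO-approved email address>
icgc_sftp() {
    # The username itself contains an "@"; OpenSSH treats everything
    # before the last "@" as the username.
    sftp -P "$SFTP_PORT" "$1@$SFTP_HOST"
}
```

For example, icgc_sftp 'user@example.org' opens an interactive session for that (placeholder) account.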

Open Release Data - Object Bucket Details

Open access release data is hosted on a publicly available Object Storage Bucket. While not hosted on Amazon AWS, it uses the AWS S3 interface and is therefore accessible using any S3 compatible object storage client.

The bucket is reachable at:

  • Host: https://object.genomeinformatics.org
  • Bucket Name: icgc25k-open

No additional authentication is required.

Using the AWS CLI to access the open data bucket

Instructions for installing the AWS CLI are available in the official AWS documentation.

This data is not hosted by AWS, so you will need to specify an --endpoint-url argument when using this tool so that it knows where to find this bucket. Below are example commands to accomplish some common use cases.

To navigate and explore the data:

aws s3 ls s3://icgc25k-open --endpoint-url https://object.genomeinformatics.org --no-sign-request

To download a file or recursively download a directory:

aws s3 cp s3://icgc25k-open/PCAWG/consensus_snv_indel/README.md <local-download-directory> --endpoint-url https://object.genomeinformatics.org --no-sign-request
aws s3 cp s3://icgc25k-open/PCAWG/consensus_snv_indel <local-download-directory> --recursive --endpoint-url https://object.genomeinformatics.org --no-sign-request
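
Because every command needs the same two trailing flags, a small wrapper can reduce repetition. The following is a sketch only: icgc_open is a hypothetical helper name, and it assumes the AWS CLI is already installed and on your PATH.

```shell
# Endpoint and bucket name as given in this document.
ICGC_ENDPOINT='https://object.genomeinformatics.org'
ICGC_BUCKET='icgc25k-open'

# icgc_open is a hypothetical helper name, not part of the AWS CLI.
# It forwards a subcommand and its arguments to `aws s3`, appending the
# flags needed for this non-AWS bucket with anonymous access.
icgc_open() {
    aws s3 "$@" --endpoint-url "$ICGC_ENDPOINT" --no-sign-request
}
```

With this defined, icgc_open ls "s3://$ICGC_BUCKET/PCAWG/" lists the PCAWG directory, and icgc_open cp "s3://$ICGC_BUCKET/PCAWG/consensus_snv_indel/README.md" . downloads a single file to the current directory.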

Partner Repositories with ICGC 25K File Data

ICGC 25K file data is hosted across the following repositories.

If you know the specific file IDs to access, the provided Mapping File links those to their alternate locations.

EGA

ICGC 25k DACO approval includes access to raw sequences submitted to ICGC and reprocessed PCAWG files hosted on EGA. Available datasets are listed under the data access committee EGAC00001000010.

Upon DACO approval, the institutional email (the same email used in your DACO application, not the affiliated Gmail account) can be used to log in/sign up at EGA-Archive.

If the email has never been used to access EGA, you will need to create a local EGA account. To do so, please follow the EGA password reset procedure.

Access to EGA may take up to 48 hours post DACO approval.

TCGA

Due to data regulation policies, raw sequencing data submitted to ICGC and affiliated data from the PCAWG study are controlled under dbGaP study phs000178.

ICGC 25k DACO approval does not include dbGaP study phs000178 and will not grant access to these files.

Access can be requested by following instructions at phs000178.

To access data, navigate to PDC/ICGC Bionimbus for PCAWG affiliated data or GDC/GDC Portal for raw sequencing data. Both portals provide a login prompt that accepts an eRA Commons ID and password.

Mapping ICGC Legacy Data to External Repositories

Provided below is a download link for our legacy data mapping file. This will download a TSV file which contains a mapping between ICGC File IDs and their current hosted location(s), relevant ICGC mapping IDs (object, donor, sample, specimen), and IDs for affiliated repositories.

Download: icgc25k-file-mapping.tsv (54MB)

To find files listed in this TSV, check the location column:

  • SFTP - files are saved at listed SFTP_location or within the listed file
  • EGA - file can be found by following the ega_dataset_id, ega_analysis_id, ega_file_id, or ega_run_id
  • PDC - file can be found using the PDC_ID
  • GDC - file can be found using GDC_ID
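
To narrow the mapping file down to one repository, the location column can be filtered on the command line. The sketch below assumes the TSV has a header row containing a column literally named location, as the text above suggests; rows_by_location is a hypothetical helper name. It uses a substring match so that rows listing more than one location are still returned.

```shell
# rows_by_location is a hypothetical helper name.
# Usage: rows_by_location <mapping.tsv> <location>
# Prints the header row plus every row whose `location` column
# contains <location> (e.g. SFTP, EGA, PDC, GDC).
rows_by_location() {
    awk -F '\t' -v want="$2" '
        NR == 1 {
            # Find the index of the column named "location" in the header.
            for (i = 1; i <= NF; i++) if ($i == "location") col = i
            if (!col) exit 1   # assumption violated: no such column
            print
            next
        }
        index($col, want) > 0
    ' "$1"
}
```

For example, rows_by_location icgc25k-file-mapping.tsv EGA > ega-files.tsv writes all EGA-hosted entries to a new file.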

Data Use and Publication Policy

Please see ICGC ARGO's publication policy and data use policies.

Questions and Concerns

If you have any further questions or require additional information please contact the helpdesk.

Frequently Asked Questions

1. Using CLI SFTP, I cannot find anything inside the Example folder

sftp> ls PCAWG/*
PCAWG/APOBEC_mutagenesis PCAWG/Hartwig PCAWG/README.md PCAWG/benchmarking_data PCAWG/broad_calls PCAWG/cell_lines PCAWG/clinical_and_histology PCAWG/consensus_cnv
PCAWG/consensus_snv_indel PCAWG/consensus_sv PCAWG/data_releases PCAWG/dkfz_embl_calls PCAWG/donors_and_biospecimens PCAWG/driver_mutations PCAWG/drivers PCAWG/evolution_and_heterogeneity
PCAWG/germline_variations PCAWG/hla_and_neoantigen PCAWG/minibams PCAWG/msi PCAWG/muse_calls PCAWG/mutational_signatures PCAWG/networks PCAWG/pathogen_analysis
PCAWG/pcawg_dkfz_caller PCAWG/pilot50-mosaic PCAWG/pilot50_calls PCAWG/quality_control_info PCAWG/reference_data PCAWG/retrotransposition PCAWG/rnaseq_aligned_bams PCAWG/sanger_calls
PCAWG/sequencing_metadata PCAWG/smufin_indel_calls PCAWG/subclonal_reconstruction PCAWG/terminology_and_standard_colours PCAWG/thesaurus_snv PCAWG/transcriptome PCAWG/unaligned_bams PCAWG/validation_bams
PCAWG/wgs_aligned_bams

2. Connecting through CLI SFTP, I receive the following error: no matching host key type found. Their offer: ssh-rsa

This depends on your system, but make sure your ~/.ssh/config contains the following:

HostKeyAlgorithms ssh-rsa
PubkeyAcceptedKeyTypes ssh-rsa

Alternatively, add -o HostKeyAlgorithms=+ssh-rsa to your SFTP command, e.g.:

sftp -P 2222 -o HostKeyAlgorithms=+ssh-rsa 'example@gmail.com'@icgc-legacy-sftp.platform.icgc-argo.org
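
If you would rather not enable ssh-rsa for every host, one option is to scope the setting to this server with a Host block. The sketch below appends such a block to an SSH client configuration file; add_icgc_host is a hypothetical helper name, and the "+" prefix (which adds ssh-rsa to the defaults instead of replacing them) assumes a reasonably recent OpenSSH client.

```shell
# add_icgc_host is a hypothetical helper name.
# Usage: add_icgc_host <ssh-config-file>   (typically ~/.ssh/config)
add_icgc_host() {
    # Append a Host block that applies only to the ICGC legacy SFTP server.
    cat >> "$1" <<'EOF'

Host icgc-legacy-sftp.platform.icgc-argo.org
    Port 2222
    HostKeyAlgorithms +ssh-rsa
    PubkeyAcceptedKeyTypes +ssh-rsa
EOF
}
```

After running add_icgc_host ~/.ssh/config, the port and key settings are read from the configuration file, so the -P and -o options can be dropped from the sftp command.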