Legacy ICGC 25K Data

ICGC 25K Portal and Project

The ICGC Data Portal was retired in June 2024.

The Data Portal served as a hub and repository, the culmination of a collaborative effort between the International Cancer Genome Consortium and multiple partners from various cancer projects (including TCGA and the Sanger Cancer Genome Project).

The aim was to analyze multiple cancer types and share open access data such as simple somatic mutations, copy number alterations, structural rearrangements, gene expression, microRNAs, DNA methylation and exon junctions.

The dataset spanned 86 cancer projects, 22 cancer primary sites, 24,289 donors of which 22,330 had molecular data, and 81,782,588 simple somatic mutations.

Relocated ICGC 25K Data

Although the interactive data portal was shut down, data from the project remains available. The final release data, as well as the PCAWG project data, are now available to authorized users through an SFTP server hosted by ICGC ARGO.

Files from other contributing projects are all hosted by ICGC's partner repositories. To access this data, you will need to identify which repository hosts the data you are looking for and then request the data through their service. A mapping file is available to download below which maps ICGC 25K file IDs to their current hosted location.

Accessing ICGC 25K Release Data

An SFTP server is available to access ICGC Release Data and PCAWG data.

The server hosts three data directories with the following data:

  • /release_28 - This is the Data Portal data Release 28 (2019-11-26) of the International Cancer Genome Consortium (ICGC).
  • /PCAWG - Analysis results from the PCAWG study.
  • /Supplemental - Corrected clinical metadata and RNA-Seq raw read counts (2019-10-16) for projects LICA-FR and PRAD-UK.

The SFTP server is available to authorized, DACO-approved users only. If you previously had DACO access for ICGC 25K data, you will continue to have permission to access the SFTP server. If you require DACO approval, please see the documentation on applying for DACO access.

SFTP Connection Details

The SFTP server is located at:

  • Host: icgc-legacy-sftp.platform.icgc-argo.org
  • Port: 2222

Authentication to the server is done using username and password:

  • Username: The email address that was approved for DACO access. This is the account you would use to log into the ARGO platform.
  • Password: ICGC API Key. This is available on your ARGO Platform profile page.

You can connect to this server using any SFTP client of your choice.
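As a minimal sketch of the connection details above, the following Python snippet assembles a command line for the OpenSSH sftp client. It assumes that client is installed; the helper name build_sftp_command and the example email address are hypothetical and not part of any ICGC tooling. The username is passed with -o User=... because it is an email address that itself contains an @ character. The client then prompts for the password, which is your API key.

```python
import shlex

SFTP_HOST = "icgc-legacy-sftp.platform.icgc-argo.org"
SFTP_PORT = 2222


def build_sftp_command(daco_email: str) -> list[str]:
    """Return an argument list for the OpenSSH sftp client (hypothetical helper)."""
    if "@" not in daco_email:
        raise ValueError("expected the DACO-approved email address as the username")
    return [
        "sftp",
        "-P", str(SFTP_PORT),        # sftp takes the port with an uppercase -P
        "-o", f"User={daco_email}",  # avoids ambiguity from '@' in the username
        SFTP_HOST,
    ]


if __name__ == "__main__":
    # Placeholder address; substitute the email approved for DACO access.
    cmd = build_sftp_command("researcher@example.org")
    print(shlex.join(cmd))
```

The printed command can be pasted into a terminal, or the argument list can be passed to subprocess.run for an interactive session.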

Partner Repositories with ICGC 25K File Data

ICGC 25K file data is hosted across the following repositories.

If you know the specific file IDs to access, the provided Mapping File links those to their alternate locations.

EGA

ICGC 25K DACO approval includes access to raw sequences submitted to ICGC and reprocessed PCAWG files hosted on EGA. Available datasets are listed under the data access committee EGAC00001000010.

Upon DACO approval, the institutional email (the same email used in your DACO application, not the affiliated Gmail account) can be used to log in/sign up at EGA-Archive.

If the email has never been used to access EGA, you will need to create a local EGA account; please follow the EGA password reset procedure to do so.

Access to EGA may take up to 48 hours post DACO approval.

TCGA

Due to data regulation policies, raw sequencing data submitted to ICGC and affiliated data from the PCAWG study are controlled under dbGaP study phs000178.

ICGC 25K DACO approval does not include dbGaP study phs000178 and will not grant access to these files.

Access can be requested by following instructions at phs000178.

To access data, navigate to PDC/ICGC Bionimbus for PCAWG affiliated data or GDC/GDC Portal for raw sequencing data. Both portals will provide a login prompt using eRA Commons ID and password.

Mapping ICGC Legacy Data to External Repositories

Provided below is a download link for our legacy data mapping file. This will download a TSV file which contains a mapping between ICGC File IDs and their current hosted location(s), relevant ICGC mapping IDs (object, donor, sample, specimen), and IDs for affiliated repositories.

Download: icgc25k-file-mapping.tsv (54MB)

To find files listed in this TSV, check the location column:

  • SFTP - file is saved at the listed SFTP_location or within the listed file
  • EGA - file can be found by following the ega_dataset_id, ega_analysis_id, ega_file_id, or ega_run_id
  • PDC - file can be found using the PDC_ID
  • GDC - file can be found using the GDC_ID
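The lookup described above can be sketched in Python with the standard csv module. This is an illustrative sketch, not official tooling: the column names location, SFTP_location, ega_file_id, PDC_ID and GDC_ID are taken from the list above, while file_id is a hypothetical name for the ICGC file ID column and every value in the sample is a placeholder. It also assumes each location cell holds a single value; if the real file lists several locations in one cell, the grouping key would need to be split first.

```python
import csv
import io
from collections import defaultdict


def group_rows_by_location(tsv_file) -> dict[str, list[dict[str, str]]]:
    """Group mapping-file rows by the value of their `location` column."""
    groups = defaultdict(list)
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        groups[row["location"].strip()].append(row)
    return dict(groups)


# Tiny made-up sample; "file_id" is a hypothetical column name
# and all values are placeholders, not real identifiers.
SAMPLE = (
    "file_id\tlocation\tSFTP_location\tega_file_id\tPDC_ID\tGDC_ID\n"
    "FI_A\tSFTP\t/PCAWG/example/path\t\t\t\n"
    "FI_B\tEGA\t\tEGAF_PLACEHOLDER\t\t\n"
    "FI_C\tGDC\t\t\t\tgdc-placeholder\n"
)

if __name__ == "__main__":
    groups = group_rows_by_location(io.StringIO(SAMPLE))
    for location, rows in groups.items():
        print(location, len(rows))
```

For the downloaded file, the same function can be called on a file handle, for example: with open("icgc25k-file-mapping.tsv", newline="") as f: groups = group_rows_by_location(f). Check the header row of the real file first and adjust the column names if they differ.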

Data Use and Publication Policy

Please see ICGC ARGO's publication policy and data use policies.

Questions and Concerns

If you have any further questions or require additional information please contact the helpdesk.

Frequently Asked Questions

1. Using CLI SFTP, I cannot find anything inside the Example folder

sftp> ls PCAWG/*
PCAWG/APOBEC_mutagenesis PCAWG/Hartwig PCAWG/README.md PCAWG/benchmarking_data PCAWG/broad_calls PCAWG/cell_lines PCAWG/clinical_and_histology PCAWG/consensus_cnv
PCAWG/consensus_snv_indel PCAWG/consensus_sv PCAWG/data_releases PCAWG/dkfz_embl_calls PCAWG/donors_and_biospecimens PCAWG/driver_mutations PCAWG/drivers PCAWG/evolution_and_heterogeneity
PCAWG/germline_variations PCAWG/hla_and_neoantigen PCAWG/minibams PCAWG/msi PCAWG/muse_calls PCAWG/mutational_signatures PCAWG/networks PCAWG/pathogen_analysis
PCAWG/pcawg_dkfz_caller PCAWG/pilot50-mosaic PCAWG/pilot50_calls PCAWG/quality_control_info PCAWG/reference_data PCAWG/retrotransposition PCAWG/rnaseq_aligned_bams PCAWG/sanger_calls
PCAWG/sequencing_metadata PCAWG/smufin_indel_calls PCAWG/subclonal_reconstruction PCAWG/terminology_and_standard_colours PCAWG/thesaurus_snv PCAWG/transcriptome PCAWG/unaligned_bams PCAWG/validation_bams
PCAWG/wgs_aligned_bams

2. Connecting through CLI SFTP, I receive the following error: no matching host key type found. Their offer: ssh-rsa

This depends on your system, but ensure your ~/.ssh/config has the following content:

HostKeyAlgorithms ssh-rsa
PubkeyAcceptedKeyTypes ssh-rsa