DNA-Seq Analysis Pipeline

The DNA-Sequencing (DNA-Seq) analysis pipeline identifies multiple types of somatic variants from both Whole Exome Sequencing (WXS) and Whole Genome Sequencing (WGS) sample data. DNA-Seq analysis is implemented across two main procedures:

  • Sequence Alignment
  • Variant Calling

In the future, these procedures will be extended to include:

  • Variant Masking
  • Variant Annotation
  • Consensus Calling

Alignment

The ARGO Data Platform accepts raw sequencing data in either FASTQ or BAM (aligned or unaligned) format. The first processing step in the DNA-Seq pipeline is to align all samples uniformly to the GRCh38 reference genome. For details, please see the latest version of the ARGO DNA Alignment workflow.

Inputs

  • All alignments are performed using GRCh38 as the human reference genome
  • Submitted FASTQ or BAM file(s)

Preprocessing

  • Submitted sequencing reads (FASTQ or BAM) are converted into lane-level (i.e., read-group-level) BAMs.
  • Picard:CollectQualityYieldMetrics is used for read group level BAM QC.
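
As an illustration of the read-group-level QC step, the sketch below builds one Picard CollectQualityYieldMetrics command per lane-level BAM. The helper name, the BAM file names, the metrics file naming, and the picard.jar location are all hypothetical; the ARGO workflow's actual invocation may differ.

```python
from pathlib import Path


def quality_yield_commands(lane_bams, picard_jar="picard.jar"):
    """Return one CollectQualityYieldMetrics argv list per lane-level BAM.

    Nothing is executed here; the lists can be passed to subprocess.run.
    The output naming scheme is an assumption made for this example.
    """
    commands = []
    for bam in lane_bams:
        bam = Path(bam)
        # Write the metrics file next to the BAM it describes.
        metrics = bam.with_name(bam.stem + ".quality_yield_metrics.txt")
        commands.append([
            "java", "-jar", picard_jar, "CollectQualityYieldMetrics",
            f"I={bam}", f"O={metrics}",
        ])
    return commands


if __name__ == "__main__":
    for argv in quality_yield_commands(["rg1.bam", "rg2.bam"]):
        print(" ".join(argv))
```

Building the commands separately from running them keeps the per-read-group fan-out easy to inspect before any job is launched.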

Processing

Outputs

Alignment Workflow

Sanger WGS Variant Calling

Whole genome sequencing (WGS) aligned CRAM files are processed through the Sanger WGS Variant Calling Workflow as tumour/normal pairs. The ARGO DNA-Seq pipeline has adopted the Sanger Whole Genome Sequencing Analysis Docker Image as the base workflow. For details, please see the latest version of the ARGO Sanger WGS Variant Calling workflow.

Inputs

  • Normal WGS aligned CRAM and index files
  • Tumour WGS aligned CRAM and index files
  • Reference files
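
Because each CRAM must be accompanied by its index, a pre-flight check can catch incomplete inputs early. The sketch below assumes the common convention that the index of sample.cram is named sample.cram.crai; the function name and file names are hypothetical and not part of the ARGO workflow.

```python
def missing_indexes(paths):
    """Return the CRAM files in `paths` that have no matching .crai index.

    Assumes the index sits alongside the CRAM as '<name>.cram.crai'.
    """
    present = set(paths)
    return sorted(
        p for p in present
        if p.endswith(".cram") and p + ".crai" not in present
    )


if __name__ == "__main__":
    submitted = ["normal.cram", "normal.cram.crai", "tumour.cram"]
    print(missing_indexes(submitted))
```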

Processing

  • Pindel InDel caller is used for somatic insertion/deletion variant detection.
  • ASCAT CNV caller is used for somatic copy number variant analysis.
  • CaVEMan SNV caller is used for somatic single nucleotide variant analysis.
  • BRASS SV caller is used for somatic structural variation detection.

Collect QC Metrics

  • WGS aligned read statistics are generated by the Sanger:bam_stats script. The files containing normal/tumour aligned read statistics are then used by the Pindel and BRASS callers.
  • Cross-sample contamination is estimated by the Sanger:verifyBamHomChk script for both normal and tumour samples.
  • Purity and ploidy are estimated by the ASCAT CNV caller.
  • The Sanger:compareBamGenotypes script compares genotypes between the CRAM files of the matched normal/tumour pair and reports the fraction of matching genotypes. It also checks whether the inferred genders match.
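
To make the "fraction of matched genotypes" metric concrete, here is a minimal sketch of the arithmetic. It is not the Sanger:compareBamGenotypes implementation: the function name, the dictionary representation, the locus identifiers, and the genotype values are all hypothetical.

```python
def genotype_match_fraction(normal_genotypes, tumour_genotypes):
    """Fraction of loci called in both samples whose genotypes agree.

    Each argument maps a locus identifier to a genotype string, or to None
    when no call was made. Returns None if no locus is called in both.
    """
    shared = [
        locus for locus, genotype in normal_genotypes.items()
        if genotype is not None and tumour_genotypes.get(locus) is not None
    ]
    if not shared:
        return None
    matched = sum(
        normal_genotypes[locus] == tumour_genotypes[locus] for locus in shared
    )
    return matched / len(shared)


if __name__ == "__main__":
    # Illustrative values only, not real data.
    normal = {"locus1": "AA", "locus2": "AG", "locus3": "GG", "locus4": None}
    tumour = {"locus1": "AA", "locus2": "AG", "locus3": "AG", "locus4": "CC"}
    print(genotype_match_fraction(normal, tumour))
```

In this toy input three loci are called in both samples and two of them agree, so the fraction is 2/3; the uncalled locus is excluded from the denominator.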

Outputs

Sanger WGS Variant Calling Workflow

Sanger WXS Variant Calling

Whole exome sequencing (WXS) aligned CRAM files are processed through the Sanger WXS Variant Calling Workflow as tumour/normal pairs. The ARGO DNA-Seq pipeline has adopted the Sanger Whole Exome Sequencing Analysis Docker Image as the base workflow. For details, please see the latest version of the ARGO Sanger WXS Variant Calling workflow.

Inputs

  • Normal WXS aligned CRAM and index files
  • Tumour WXS aligned CRAM and index files
  • Reference files

Processing

  • Pindel InDel caller is used for somatic insertion/deletion variant detection.
  • CaVEMan SNV caller is used for somatic single nucleotide variant analysis.
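
The Sanger WGS and WXS workflows differ only in which callers run. The lookup table below restates the caller lists from this document as data; the names SANGER_CALLERS, callers_for, and wgs_only_callers are hypothetical and chosen purely for illustration.

```python
# Caller -> variant type, per experimental strategy, as listed in this document.
SANGER_CALLERS = {
    "WGS": {"Pindel": "InDel", "ASCAT": "CNV", "CaVEMan": "SNV", "BRASS": "SV"},
    "WXS": {"Pindel": "InDel", "CaVEMan": "SNV"},
}


def callers_for(strategy):
    """Return the caller -> variant type mapping for 'WGS' or 'WXS'."""
    try:
        return SANGER_CALLERS[strategy.upper()]
    except KeyError:
        raise ValueError(f"unknown experimental strategy: {strategy!r}") from None


def wgs_only_callers():
    """Callers that run on WGS data but not on WXS data."""
    return sorted(set(SANGER_CALLERS["WGS"]) - set(SANGER_CALLERS["WXS"]))


if __name__ == "__main__":
    print(callers_for("wxs"))
    print(wgs_only_callers())
```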

Collect QC Metrics

  • WXS aligned read statistics are generated by the Sanger:bam_stats script. The files containing normal/tumour aligned read statistics are then used by the Pindel caller.

Outputs

Sanger WXS Variant Calling Workflow

GATK Mutect2 Variant Calling

Whole genome/exome sequencing (WGS/WXS) aligned CRAM files are processed through the GATK Mutect2 Variant Calling Workflow as tumour/normal pairs. The ARGO DNA-Seq pipeline has adopted the Genome Analysis Toolkit Docker Image developed at the Broad Institute as the base workflow. For details, please see the latest version of the ARGO GATK Mutect2 Variant Calling workflow.

Inputs

  • Normal WGS/WXS aligned CRAM and index files
  • Tumour WGS/WXS aligned CRAM and index files
  • Reference files

Processing

  • BQSR Subworkflow is an optional data pre-processing step that detects systematic errors made by the sequencing machine when it estimates the accuracy of each base call. While available as part of the workflow, it is not run as part of the ARGO pipeline.
  • Mutect2 calls SNVs and InDels simultaneously via local de novo assembly of haplotypes in an active region.
  • Learn Read Orientation implements the read orientation model, which produces the --orientation-bias-artifact-priors input to the Filter Variants step.
  • Calculate Contamination Subworkflow emits an estimate of the fraction of reads due to cross-sample contamination for both normal and tumour samples. It also generates an estimate of the allelic copy number segmentation of each tumour sample.
  • Filter Variants applies filters to the raw output of Mutect2.
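
The way these steps feed into one another can be sketched as an ordered list of GATK command lines. This is a simplified illustration under several assumptions, not the ARGO workflow itself: all file names are hypothetical, the use of GetPileupSummaries with a common germline variant resource inside the Calculate Contamination Subworkflow is assumed from standard GATK practice, and details such as interval scattering are omitted.

```python
def mutect2_step_commands(ref, tumour_cram, normal_cram, normal_sample, germline_vcf):
    """Build GATK argv lists in execution order; nothing is run here.

    Intermediate file names are placeholders chosen for this example.
    """
    def pileups(cram, out):
        # Assumed helper step: summarise read support at common variant sites.
        return ["gatk", "GetPileupSummaries", "-R", ref, "-I", cram,
                "-V", germline_vcf, "-L", germline_vcf, "-O", out]

    return [
        # Mutect2: raw SNV/InDel calls plus F1R2 counts for the orientation model.
        ["gatk", "Mutect2", "-R", ref, "-I", tumour_cram, "-I", normal_cram,
         "-normal", normal_sample, "--f1r2-tar-gz", "f1r2.tar.gz",
         "-O", "unfiltered.vcf.gz"],
        # Learn Read Orientation: produces the artifact priors.
        ["gatk", "LearnReadOrientationModel", "-I", "f1r2.tar.gz",
         "-O", "artifact-priors.tar.gz"],
        # Calculate Contamination: estimates for tumour and normal samples.
        pileups(tumour_cram, "tumour.pileups.table"),
        pileups(normal_cram, "normal.pileups.table"),
        ["gatk", "CalculateContamination", "-I", "tumour.pileups.table",
         "-matched", "normal.pileups.table",
         "--tumor-segmentation", "tumour.segments.table",
         "-O", "tumour.contamination.table"],
        ["gatk", "CalculateContamination", "-I", "normal.pileups.table",
         "-O", "normal.contamination.table"],
        # Filter Variants: consumes the outputs of the steps above.
        ["gatk", "FilterMutectCalls", "-R", ref, "-V", "unfiltered.vcf.gz",
         "--contamination-table", "tumour.contamination.table",
         "--tumor-segmentation", "tumour.segments.table",
         "--orientation-bias-artifact-priors", "artifact-priors.tar.gz",
         "-O", "filtered.vcf.gz"],
    ]


if __name__ == "__main__":
    for argv in mutect2_step_commands(
        "GRCh38.fa", "tumour.cram", "normal.cram", "NORMAL_SAMPLE", "common.vcf.gz"
    ):
        print(" ".join(argv))
```

The point of the sketch is the data flow: Filter Variants is the only step that needs the outputs of all the others, so it must run last.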

Collect QC Metrics

  • Cross-sample contamination is estimated by GATK:CalculateContamination for both normal and tumour samples.
  • A variant callable stats file is generated by GATK:Mutect2.
  • A variant filtering stats file is produced by GATK:FilterMutectCalls.

Outputs

GATK Mutect2 Variant Calling workflow