Utah Genome Project

May. 2014 
Variant Calling Pipeline  Version 1.0.3

Software Versions

- GenomeAnalysisTK-3.3-1
- Picard : Version: 1.112
- FastQC v0.10.1
- Samtools Version: 0.1.19
- BWA Version: 0.7.5
- cApTUrE 1.0.3

Data Source

Data sets used for the variant calling pipeline come from the Broad GSA (GATK) group as the 'GATK resource bundle 2.5' version 2.8

wget -r ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37

Reference Genome (GRCh37):
- human_g1k_v37_decoy.fasta

VCF files for RealignerTargetCreator knowns and dbsnp for BaseRecalibrator.
- known_indel: /data/GATK_Bundle/Mills_and_1000G_gold_standard.indels.b37.vcf
- known_indel: /data/GATK_Bundle/1000G_phase1.indels.b37.vcf
- known_dbsnp: /data/GATK_Bundle/dbsnp_137.b37.vcf

1000Genomes Background BAMS ran concurrently with HaplotypeCaller
- CEU
- GBR
- FIN

Resource files for VariantRecalibrator_SNP
- hapmap_3.3.b37.vcf
- 1000G_omni2.5.b37.vcf
- 1000G_phase1.snps.high_confidence.b37.vcf

Resource files for VariantRecalibrator_INDEL
- Mills_and_1000G_gold_standard.indels.b37.vcf
- 1000G_phase1.indels.b37.vcf

Sequencing

This pipeline is designed for 100 bp (or greater) Illumina HiSeq PE exome or WGS sequence data with Sanger/Illumina 1.9 quality encoding, and uses Illumina naming convention found here [1]

Validate File Integrity with md5sum

An md5sum signature should be provided for each FastQ file by the sequencing center. After the file has been downloaded, locally check the md5sum to be sure that no data corruption occurred during the file transfer.

md5sum file.fastq > file_local.md5
diff file_local.md5 file_provided.md5

If the md5sum signature differs from that provided for the file:

Check to be sure you have the correct file.
Check if the md5sum was calculated on that compressed or uncompressed file by the provider and be sure to do the same with the local copy.
Try the download again.
Contact the sequence provider.

FastQ File Analyses

fastqc Sample1_L1_R1.txt

From the fastqc_data.txt file we check the following values:

Encoding (must be Sanger / Illumina 1.9)
Total Sequences (Need to develop a acceptable range)
Filtered Sequences (Should be 0 or at least very low)
Sequence length (must be >= 100 bp)
%GC (should be 45 < x < 55)
Total Duplicate Percentage (This value may not be valuable - an acceptable range has not been determined).

Indexing

The following indexing is required using BWA, Picard and SamTools. GATK requires all three. However this step only needs to be done once "per-machine".

BWA

bwa index -a bwtsw human_g1k_v37_decoy.fasta

Picard

java -jar CreateSequenceDictionary.jar R=human_g1k_v37_decoy.fasta O=human_g1k_v37_decoy.dic

SamTools

samtools faidx human_g1k_v37_decoy.fasta

Alignment

Align reads to the genome with bwa.

The 'BWA-mem' program will find the reference coordinates of the input reads (independent of their mate-pair). The following parameters are those used by the 1KG project and GATK for aligning Illumina data.

bwa mem -M -R "read group" human_g1k_v37_decoy.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq | samtools view -bSho BAM_FILE -

BAM File Analyses and Processing

Alignment BAM files are improved in various ways to help increase the quality and speed of subsequent variant calling steps.

Merge lane level BAMs to individual

This step only needs to run if you have multiple lanes per sample.

java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp MergeSamFiles.jar
    INPUT=[Lane 1 bam file]
    INPUT=[Lane 2 bam file]
    INPUT=[ ... ]
    OUTPUT=Sample1.bam                          
    VALIDATION_STRINGENCY: LENIENT
    MAX_RECORDS_IN_RAM: 5000000
    CREATE_INDEX: True
    SORT_ORDER: coordinate
    ASSUME_SORTED: True 
    USE_THREADING: True
    2> MergeSamFiles.report

Mark Duplicates

Remove PCR/Optical duplicate reads

java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp MarkDuplicates.jar
   INPUT=[bam files]
   OUTPUT=[dedup bam files]
   METRICS_FILE=lane1_dup_metrics.txt  
   VALIDATION_STRINGENCY: LENIENT
   MAX_RECORDS_IN_RAM: 5000000
   CREATE_INDEX: True
   ASSUME_SORTED: True       
   2> MarkDuplicates.log

Develop range for duplicate levels here

Local Realignment of Indels

RealignerTargetCreator

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar                            
   -T RealignerTargetCreator
   -I [bam files]
   -R human_g1k_v37_decoy.fasta                              
   -known /data/GATK_Bundle/Mills_and_1000G_gold_standard.indels.b37.vcf
   -known /data/GATK_Bundle/1000G_phase1.indels.b37.vcf
   -o realign.intervals
   -nt 24                                              
   &> RealignerTargetCreator.log

IndelRealigner

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
   -T IndelRealigner 
   -R human_g1k_v37_decoy.fasta
   -I [bam files]
   -targetIntervals  realign.intervals
   -known Mills_and_1000G_gold_standard.indels.b37.vcf
   -known 1000G_phase1.indels.b37.vcf
   -o [dedup_realign bam files]

Base Quality Score Recalibration & PrintReads

BaseRecalibrator

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T BaseRecalibrator
  -I [bam files]
  -R human_g1k_v37_decoy.fasta
  -knownSites /data/GATK_Bundle/dbsnp_137.b37.vcf
  -o [sorted_Dedup_realign_recal_data.table files]
  &> GATK_BaseRecalibrator.log

PrintReads

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T PrintReads                       
  -I Sample1.realign.bam              
  -o Sample1.recal.bam                
  -R human_g1k_v37_decoy.fasta              
  -BQSR recalibration_report.grp

BAM Quality Control

CollectAlignmentSummaryMetrics

java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp CollectAlignmentSummaryMetrics.jar
   I=[sorted_Dedup_realign_recal.bam files]
   O=[sorted_Dedup_realign_recal.metrics files]
   PROGRAM=QualityScoreDistribution
   REFERENCE_SEQUENCE=human_g1k_v37_decoy.fasta
   &> Picard_CollectMultipleMetrics.log

Variant Calling

Background Files

We have created 1000Genomes (BWA mem/GATK 3.0+) background files to be ran concurrently with the GenotypeGVCFs step.

Groups Currently completed:

CEU
GBR

Groups to be added later:

FIN

If you would like a copy of these file (via sftp) please contact us.

HaplotypeCaller

Now HaplotypeCaller handels SNP and INDEL calls.

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T HaplotypeCaller                  
  -I [bam files]
  -o [raw.snps.indels.gvcf files]             
  -R human_g1k_v37_decoy.fasta
  -variant_index_type LINEAR
  -stand_call_conf 30.0
  -stand_emit_conf 30.0
  -emitRefConfidence GVCF
  -variant_index_parameter 128000
  -out_mode EMIT_VARIANTS_ONLY
  -L NCBI Ref_GRCh37 exon.region.list
  &> GATK_HaplotypeCaller.log

CombineGVCFs

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T CombineGVCFs
  -R human_g1k_v37_decoy.fasta
  -variant [all gvcf files created by HaplotypeCaller]
  -variant [1000G pre-combined files]
  -o [sample_mergeGvcf.vcf]
  &> GATK_CombineGVCF.log

GenotypeGVCFs

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T GenotypeGVCFs
  -R human_g1k_v37_decoy.fasta
  -variant [sampe_mergeGvcf.vcf]
  -o [sample_genotyped.vcf]
  &> GATK_GenotypeGVCF.log
  -o [sampe_mergeGvcf.vcf]

VariantRecalibrator

SNP Recalibration

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T VariantRecalibrator
  -R human_g1k_v37_decoy.fasta
  -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /data/GATK_Bundle/hapmap_3.3.b37.vcf 
  -resource:omni,known=false,training=true,truth=true,prior=12.0 /data/GATK_Bundle/1000G_omni2.5.b37.vcf 
  -resource:1000G,known=false,training=true,truth=false,prior=10.0 /data/GATK_Bundle/1000G_phase1.snps.high_confidence.b37.vcf 
  -an QD -an MQRankSum -an ReadPosRankSum -an FS
  -input [sampe_mergeGvcf.vcf]
  -recalFile [sample_snp_recal]
  -tranchesFile [sample_snp_tranches]
  -rscriptFile [sample_snp_plots.R]
  -mode SNP 
  &> GATK_VariantRecalibrator_SNP.log

INDEL Recalibration

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T VariantRecalibrator
  -R human_g1k_v37_decoy.fasta
  -resource:mills,known=false,training=true,truth=true,prior=12.0 /data/GATK_Bundle/Mills_and_1000G_gold_standard.indels.b37.vcf 
  -resource:1000G,known=false,training=true,truth=true,prior=10.0 /data/GATK_Bundle/1000G_phase1.indels.b37.vcf 
  -an MQRankSum -an ReadPosRankSum -an FS
  -numBadVariants: 5000
  -mode INDEL
  -input [sampe_mergeGvcf.vcf]
  -recalFile [sample_indel_recal]
  -tranchesFile [sample_indel_tranches]
  -rscriptFile [sample_indel_plots.R]
  -mode INDEL

ApplyRecalibration

SNP Apply

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T ApplyRecalibration
  -input [genotyped.vcf]
  -o [recal_SNP.vcf]
  -R human_g1k_v37_decoy.fasta
  -ts_filter_level 99.0
  -excludeFiltered
  -tranchesFile [sample.snp.tranches]
  -recalFile [sample.snp.recal]
  -mode SNP 
  &> ApplyRecalibration_SNP.report

INDEL Apply

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T ApplyRecalibration
  -input [genotyped.vcf]
  -o [recal_INDEL.vcf]
  -R human_g1k_v37_decoy.fasta
  -ts_filter_level 99.0
  -excludeFiltered
  -tranchesFile [sample.indel.tranches]
  -recalFile [sample.indel.recal]
  -mode INDEL
  &> ApplyRecalibration_INDEL.report

SelectVariants & CombineVarients

These commands will remove the background files and output SNP and INDEL files, then combine them into a single VCF file.

Select SNP

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T SelectVariants
  -R human_g1k_v37_decoy.fasta
  -sn [target_1]
  -sn [target_2]
  -sn [...]
  --variant recal_SNP.vcf
  -o [cleaned_recal_SNP.file]
  &> SelectVariants_SNP.report

Select INDEL

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T SelectVariants
  -R human_g1k_v37_decoy.fasta
  -sn [target_1]
  -sn [target_2]
  -sn [...]
  --variant recal_INDEL.vcf
  -o [cleaned_recal_INDEL.file]
  &> SelectVariants_INDEL.report

CombineVarients

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T CombineVariants
  -R human_g1k_v37_decoy.fasta
  --variant cleaned_recal_SNP.file
  --variant cleaned_recal_INDEL.file
  -o [Final.vcf]
  &> CombineVarients.report

Variant File QC

Quality Metrics on variants

Ti/Tv Ratio (2.1 for WGS ~2.8 for exome)
HapMap concordance
SNV/Indel Counts
Rare variant enrichment
DP
Q
GQ

UGP Variant Pipeline 1.0.3

Contents