Difference between revisions of "UGP Variant Pipeline 1.0.3"
Line 245: | Line 245: | ||
-resource:1000G,known=false,training=true,truth=false,prior=10.0 /data/GATK_Bundle/1000G_phase1.snps.high_confidence.b37.vcf | -resource:1000G,known=false,training=true,truth=false,prior=10.0 /data/GATK_Bundle/1000G_phase1.snps.high_confidence.b37.vcf | ||
-an QD -an MQRankSum -an ReadPosRankSum -an FS | -an QD -an MQRankSum -an ReadPosRankSum -an FS | ||
− | -input [ | + | -input [sample+background_genotyped.vcf] |
-recalFile [sample_snp_recal] | -recalFile [sample_snp_recal] | ||
-tranchesFile [sample_snp_tranches] | -tranchesFile [sample_snp_tranches] | ||
Line 261: | Line 261: | ||
-numBadVariants: 5000 | -numBadVariants: 5000 | ||
-mode INDEL | -mode INDEL | ||
− | -input [ | + | -input [sample+background_genotyped.vcf] |
-recalFile [sample_indel_recal] | -recalFile [sample_indel_recal] | ||
-tranchesFile [sample_indel_tranches] | -tranchesFile [sample_indel_tranches] |
Revision as of 20:54, 13 August 2014
Contents
Utah Genome Project
May. 2014 Variant Calling Pipeline Version 1.0.3
Software Versions
- GenomeAnalysisTK-3.3-1
- Picard : Version: 1.112
- FastQC v0.10.1
- Samtools Version: 0.1.19
- BWA Version: 0.7.5
- cApTUrE 1.0.3
Data Source
Data sets used for the variant calling pipeline come from the Broad GSA (GATK) group as the 'GATK resource bundle 2.5' version 2.8
wget -r ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37
- Reference Genome (GRCh37):
- human_g1k_v37_decoy.fasta
- VCF files for RealignerTargetCreator knowns and dbsnp for BaseRecalibrator.
- known_indel: /data/GATK_Bundle/Mills_and_1000G_gold_standard.indels.b37.vcf
- known_indel: /data/GATK_Bundle/1000G_phase1.indels.b37.vcf
- known_dbsnp: /data/GATK_Bundle/dbsnp_137.b37.vcf
- Background Files
- We have created 1000Genomes (BWA mem/GATK 3.0+) background files to be ran concurrently with the GenotypeGVCFs step.
- Groups Currently completed:
- CEU
- GBR
If you would like a copy of these file they are available via our sftp site;
sftp yandell-lab@io.genetics.utah.edu Password: contact me (shawn.rynearson@gmail.com) location: files/1000Genome-background_GATK-3.0+
- Resource files for VariantRecalibrator_SNP
- hapmap_3.3.b37.vcf
- 1000G_omni2.5.b37.vcf
- 1000G_phase1.snps.high_confidence.b37.vcf
- Resource files for VariantRecalibrator_INDEL
- Mills_and_1000G_gold_standard.indels.b37.vcf
- 1000G_phase1.indels.b37.vcf
Sequencing
This pipeline is designed for 100 bp (or greater) Illumina HiSeq PE exome or WGS sequence data with Sanger/Illumina 1.9 quality encoding, and uses Illumina naming convention found here [1]
Validate File Integrity with md5sum
An md5sum signature should be provided for each FastQ file by the sequencing center. After the file has been downloaded, locally check the md5sum to be sure that no data corruption occurred during the file transfer.
md5sum file.fastq > file_local.md5 diff file_local.md5 file_provided.md5
If the md5sum signature differs from that provided for the file:
- Check to be sure you have the correct file.
- Check if the md5sum was calculated on that compressed or uncompressed file by the provider and be sure to do the same with the local copy.
- Try the download again.
- Contact the sequence provider.
FastQ File Analyses
fastqc Sample1_L1_R1.txt
From the fastqc_data.txt file we check the following values:
- Encoding (must be Sanger / Illumina 1.9)
- Total Sequences (Need to develop a acceptable range)
- Filtered Sequences (Should be 0 or at least very low)
- Sequence length (must be >= 100 bp)
- %GC (should be 45 < x < 55)
- Total Duplicate Percentage (This value may not be valuable - an acceptable range has not been determined).
Indexing
The following indexing is required using BWA, Picard and SamTools. GATK requires all three. However this step only needs to be done once "per-machine".
- BWA
bwa index -a bwtsw human_g1k_v37_decoy.fasta
- Picard
java -jar CreateSequenceDictionary.jar R=human_g1k_v37_decoy.fasta O=human_g1k_v37_decoy.dic
- SamTools
samtools faidx human_g1k_v37_decoy.fasta
Alignment
Align reads to the genome with bwa.
The 'BWA-mem' program will find the reference coordinates of the input reads (independent of their mate-pair). The following parameters are those used by the 1KG project and GATK for aligning Illumina data.
bwa mem -M -R "read group" human_g1k_v37_decoy.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq | samtools view -bSho BAM_FILE -
BAM File Analyses and Processing
Alignment BAM files are improved in various ways to help increase the quality and speed of subsequent variant calling steps.
Merge lane level BAMs to individual
- This step only needs to run if you have multiple lanes per sample.
java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp MergeSamFiles.jar INPUT=[Lane 1 bam file] INPUT=[Lane 2 bam file] INPUT=[ ... ] OUTPUT=Sample1.bam VALIDATION_STRINGENCY: LENIENT MAX_RECORDS_IN_RAM: 5000000 CREATE_INDEX: True SORT_ORDER: coordinate ASSUME_SORTED: True USE_THREADING: True 2> MergeSamFiles.report
Mark Duplicates
Remove PCR/Optical duplicate reads
java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp MarkDuplicates.jar INPUT=[bam files] OUTPUT=[dedup bam files] METRICS_FILE=lane1_dup_metrics.txt VALIDATION_STRINGENCY: LENIENT MAX_RECORDS_IN_RAM: 5000000 CREATE_INDEX: True ASSUME_SORTED: True 2> MarkDuplicates.log
Develop range for duplicate levels here
Local Realignment of Indels
- RealignerTargetCreator
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T RealignerTargetCreator -I [bam files] -R human_g1k_v37_decoy.fasta -known /data/GATK_Bundle/Mills_and_1000G_gold_standard.indels.b37.vcf -known /data/GATK_Bundle/1000G_phase1.indels.b37.vcf -o bam_file_realign.intervals -nt 24 &> RealignerTargetCreator.log
- IndelRealigner
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T IndelRealigner -R human_g1k_v37_decoy.fasta -I [bam files] -targetIntervals realign.intervals -known Mills_and_1000G_gold_standard.indels.b37.vcf -known 1000G_phase1.indels.b37.vcf -o [dedup_realign bam files]
Base Quality Score Recalibration & PrintReads
- BaseRecalibrator
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T BaseRecalibrator -I [bam files] -R human_g1k_v37_decoy.fasta -knownSites /data/GATK_Bundle/dbsnp_137.b37.vcf -o [sorted_Dedup_realign_recal_data.table files] &> GATK_BaseRecalibrator.log
- PrintReads
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T PrintReads -I Sample1.realign.bam -o Sample1.recal.bam -R human_g1k_v37_decoy.fasta -BQSR recalibration_report.grp
BAM Quality Control
- CollectAlignmentSummaryMetrics
java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp CollectAlignmentSummaryMetrics.jar I=[sorted_Dedup_realign_recal.bam files] O=[sorted_Dedup_realign_recal.metrics files] PROGRAM=QualityScoreDistribution REFERENCE_SEQUENCE=human_g1k_v37_decoy.fasta &> Picard_CollectMultipleMetrics.log
Variant Calling
HaplotypeCaller
- Now HaplotypeCaller handels SNP and INDEL calls.
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T HaplotypeCaller -I [bam files] -o [raw.snps.indels.gvcf files] -R human_g1k_v37_decoy.fasta -variant_index_type LINEAR -stand_call_conf 30.0 -stand_emit_conf 30.0 -emitRefConfidence GVCF -variant_index_parameter 128000 -L NCBI Ref_GRCh37 exon.region.list &> GATK_HaplotypeCaller.log
CombineGVCFs
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T CombineGVCFs -R human_g1k_v37_decoy.fasta -variant [all gvcf files created by HaplotypeCaller] &> GATK_CombineGVCF.log
GenotypeGVCFs
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T GenotypeGVCFs -R human_g1k_v37_decoy.fasta -variant [sampe_mergeGvcf.vcf] -variant [1000G pre-combined .gvcf files] -o [sample+background_genotyped.vcf] &> GATK_GenotypeGVCF.log -o [sampe_mergeGvcf.vcf]
VariantRecalibrator
- SNP Recalibration
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T VariantRecalibrator -R human_g1k_v37_decoy.fasta -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /data/GATK_Bundle/hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 /data/GATK_Bundle/1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 /data/GATK_Bundle/1000G_phase1.snps.high_confidence.b37.vcf -an QD -an MQRankSum -an ReadPosRankSum -an FS -input [sample+background_genotyped.vcf] -recalFile [sample_snp_recal] -tranchesFile [sample_snp_tranches] -rscriptFile [sample_snp_plots.R] -mode SNP &> GATK_VariantRecalibrator_SNP.log
- INDEL Recalibration
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T VariantRecalibrator -R human_g1k_v37_decoy.fasta -resource:mills,known=false,training=true,truth=true,prior=12.0 /data/GATK_Bundle/Mills_and_1000G_gold_standard.indels.b37.vcf -resource:1000G,known=false,training=true,truth=true,prior=10.0 /data/GATK_Bundle/1000G_phase1.indels.b37.vcf -an MQRankSum -an ReadPosRankSum -an FS -numBadVariants: 5000 -mode INDEL -input [sample+background_genotyped.vcf] -recalFile [sample_indel_recal] -tranchesFile [sample_indel_tranches] -rscriptFile [sample_indel_plots.R] -mode INDEL
ApplyRecalibration
- SNP Apply
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T ApplyRecalibration -input [genotyped.vcf] -o [recal_SNP.vcf] -R human_g1k_v37_decoy.fasta -ts_filter_level 99.0 -excludeFiltered -tranchesFile [sample.snp.tranches] -recalFile [sample.snp.recal] -mode SNP &> ApplyRecalibration_SNP.report
- INDEL Apply
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T ApplyRecalibration -input [genotyped.vcf] -o [recal_INDEL.vcf] -R human_g1k_v37_decoy.fasta -ts_filter_level 99.0 -excludeFiltered -tranchesFile [sample.indel.tranches] -recalFile [sample.indel.recal] -mode INDEL &> ApplyRecalibration_INDEL.report
SelectVariants & CombineVarients
These commands will remove the background files and output SNP and INDEL files, then combine them into a single VCF file.
- Select SNP
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T SelectVariants -R human_g1k_v37_decoy.fasta -sn [target_1] -sn [target_2] -sn [...] --variant recal_SNP.vcf -o [cleaned_recal_SNP.file] &> SelectVariants_SNP.report
- Select INDEL
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T SelectVariants -R human_g1k_v37_decoy.fasta -sn [target_1] -sn [target_2] -sn [...] --variant recal_INDEL.vcf -o [cleaned_recal_INDEL.file] &> SelectVariants_INDEL.report
- CombineVarients
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T CombineVariants -R human_g1k_v37_decoy.fasta --variant cleaned_recal_SNP.file --variant cleaned_recal_INDEL.file -o [Final.vcf] &> CombineVarients.report
Variant File QC
Quality Metrics on variants
- Ti/Tv Ratio (2.1 for WGS ~2.8 for exome)
- HapMap concordance
- SNV/Indel Counts
- Rare variant enrichment
- DP
- Q
- GQ