Difference between revisions of "UGP Variant Pipeline 1.0.3"

From Utah Genome Project Wiki
Jump to navigation Jump to search
 
(70 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
= Utah Genome Project =
 
= Utah Genome Project =
  
  May. 2014  
+
  Aug. 2014  
 
  Variant Calling Pipeline  Version 1.0.3
 
  Variant Calling Pipeline  Version 1.0.3
  
 
===Software Versions===
 
===Software Versions===
 
+
**GenomeAnalysisTK-3.3-2
**GenomeAnalysisTK-3.3-1
 
 
**Picard : Version: 1.112
 
**Picard : Version: 1.112
 
**FastQC v0.10.1
 
**FastQC v0.10.1
Line 27: Line 26:
 
**known_dbsnp: /data/GATK_Bundle/dbsnp_137.b37.vcf
 
**known_dbsnp: /data/GATK_Bundle/dbsnp_137.b37.vcf
  
*1000Genomes Background BAMS ran concurrently with HaplotypeCaller
+
*Background Files
**CEU  
+
**We have created 1000Genomes (BWA mem/GATK 3.0+) background files to be ran concurrently with the GenotypeGVCFs step.
 +
 
 +
*Groups Currently completed:
 +
**CEU
 
**GBR
 
**GBR
**FIN
+
 
 +
If you would like a copy of these file please contact me.
 +
shawn.rynearson@gmail.com
  
 
*Resource files for VariantRecalibrator_SNP
 
*Resource files for VariantRecalibrator_SNP
Line 61: Line 65:
  
 
  fastqc Sample1_L1_R1.txt
 
  fastqc Sample1_L1_R1.txt
 +
 +
From the sumamry.txt report we check
 +
* FAIL sections
  
 
From the fastqc_data.txt file we check the following values:
 
From the fastqc_data.txt file we check the following values:
 
* Encoding (must be Sanger / Illumina 1.9)
 
* Encoding (must be Sanger / Illumina 1.9)
* Total Sequences (<span style="background:#FFFF00">Need to develop a acceptable range</span>)
+
* Total Sequences (Currently set to 30000000)
* Filtered Sequences     (Should be 0 or at least very low)
+
* Filtered Sequences (Currently set to less then 5)
 
* Sequence length (must be >= 100 bp)
 
* Sequence length (must be >= 100 bp)
* %GC     (<span style="background:#FFFF00">should be 45 < x < 55</span>)
+
* %GC (45 < x < 55)
* Total Duplicate Percentage (<span style="background:#FFFF00">This value may not be valuable - an acceptable range has not been determined</span>).
+
* Total Duplicate Percentage (Currently set to 60.0)
  
== Alignment ==
+
Result are reported to QC-report.txt
  
Align reads to the genome with bwa.
+
== Indexing ==
  
Index the reference sequence if this has not already been done
+
The following indexing is required using BWA, Picard and SamTools.  GATK requires all three.  However this step only needs to be done once "per-machine".
  
 +
*BWA
 
  bwa index -a bwtsw human_g1k_v37_decoy.fasta
 
  bwa index -a bwtsw human_g1k_v37_decoy.fasta
  
*BWA-mem
+
*Picard
 +
java -jar CreateSequenceDictionary.jar R=human_g1k_v37_decoy.fasta O=human_g1k_v37_decoy.dic
 +
 
 +
*SamTools
 +
samtools faidx human_g1k_v37_decoy.fasta
 +
 
 +
== Alignment ==
 +
 
 +
Align reads to the genome with bwa.
  
The 'bwa mem' program will find the reference coordinates of the input reads (independent of their mate-pair). The following parameters are those used by the 1KG project and GATK for aligning Illumina data.
+
The 'BWA-mem' program will find the reference coordinates of the input reads (independent of their mate-pair). The following parameters are those used by the 1KG project and GATK for aligning Illumina data.
  
 
  bwa mem -M -R "read group" human_g1k_v37_decoy.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq | samtools view -bSho BAM_FILE -
 
  bwa mem -M -R "read group" human_g1k_v37_decoy.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq | samtools view -bSho BAM_FILE -
  
=== BAM File Analyses ===
+
== BAM File Analyses and Processing ==
  
 
Alignment BAM files are improved in various ways to help increase the quality and speed of subsequent variant calling steps.
 
Alignment BAM files are improved in various ways to help increase the quality and speed of subsequent variant calling steps.
  
 
=== Merge lane level BAMs to individual ===  
 
=== Merge lane level BAMs to individual ===  
 +
 +
* This step only needs to run if you have multiple lanes per sample.
  
 
  java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp MergeSamFiles.jar
 
  java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp MergeSamFiles.jar
     INPUT=Sample1_L1.rg.bam                    
+
     INPUT=[Lane 1 bam file]
     INPUT=Sample1_L2.rg.bam                   
+
    INPUT=[Lane 2 bam file]
 +
     INPUT=[ ... ]
 
     OUTPUT=Sample1.bam                           
 
     OUTPUT=Sample1.bam                           
 
     VALIDATION_STRINGENCY: LENIENT
 
     VALIDATION_STRINGENCY: LENIENT
Line 107: Line 126:
  
 
  java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp MarkDuplicates.jar
 
  java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp MarkDuplicates.jar
     INPUT=Sample1.bam                  
+
     INPUT=[bam files]
     OUTPUT=Sample1.dedup.bam          
+
     OUTPUT=[dedup bam files]
 
     METRICS_FILE=lane1_dup_metrics.txt   
 
     METRICS_FILE=lane1_dup_metrics.txt   
 
     VALIDATION_STRINGENCY: LENIENT
 
     VALIDATION_STRINGENCY: LENIENT
Line 116: Line 135:
 
     2> MarkDuplicates.log
 
     2> MarkDuplicates.log
  
<span style="background:#FFFF00">Develop range for duplicate levels here</span>
+
<span style="background:#FFFF00">Currently all duplicates are flagged to allow GATK to handle them.</span>
 +
 
 +
=== BAM Quality Control ===
 +
 
 +
*CollectMultipleMetrics
 +
java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp
 +
    CollectMultipleMetrics.jar
 +
    I=[dedup bam files]
 +
  O=[dedup bam files Metrics]
 +
    R= human_g1k_v37_decoy.fasta                                                   
 +
    VALIDATION_STRINGENCY=LENIENT                                           
 +
    PROGRAM: QualityScoreDistribution
 +
  &> Picard_CollectMultipleMetrics.log
  
 
=== Local Realignment of Indels ===
 
=== Local Realignment of Indels ===
  
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
+
*RealignerTargetCreator
     -I dedup_bam.list                                   
+
 
     -o Realign_Intervals.txt                           
+
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar                          
    -T RealignerTargetCreator                         
+
     -T RealignerTargetCreator
 +
     -I [bam files]
 
     -R human_g1k_v37_decoy.fasta                               
 
     -R human_g1k_v37_decoy.fasta                               
     -known [list from above]
+
     -known /data/GATK_Bundle/Mills_and_1000G_gold_standard.indels.b37.vcf
 +
    -known /data/GATK_Bundle/1000G_phase1.indels.b37.vcf
 +
    -o bam_file_realign.intervals
 
     -nt 24                                               
 
     -nt 24                                               
     2> RealignerTargetCreator.log
+
     &> RealignerTargetCreator.log
  
=== Base Quality Score Recalibration ===
+
*IndelRealigner
 +
 
 +
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
 +
    -T IndelRealigner
 +
    -R human_g1k_v37_decoy.fasta
 +
    -I [bam files]
 +
    -targetIntervals  realign.intervals
 +
    -known Mills_and_1000G_gold_standard.indels.b37.vcf
 +
    -known 1000G_phase1.indels.b37.vcf
 +
    -o [dedup_realign bam files]
 +
 
 +
=== Base Quality Score Recalibration & PrintReads ===
 +
 
 +
*BaseRecalibrator
 +
 
 +
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
 +
  -T BaseRecalibrator
 +
  -I [bam files]
 +
  -R human_g1k_v37_decoy.fasta
 +
  -knownSites /data/GATK_Bundle/dbsnp_137.b37.vcf
 +
  -o [sorted_Dedup_realign_recal_data.table files]
 +
  &> GATK_BaseRecalibrator.log
 +
 
 +
*PrintReads
  
 
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
 
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
Line 140: Line 197:
 
<span style="background:#FFFF00">
 
<span style="background:#FFFF00">
  
=== BAM Quality Control ===
+
== Variant Calling ==
  
<span style="background:#FFFF00">
+
=== HaplotypeCaller ===
CollectAlignmentSummaryMetrics
 
java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp CollectAlignmentSummaryMetrics.jar
 
    I=/output/filename.b37_1kg.sorted.bam                                     
 
    O=/output/filename.b37_1kg.AlignmentSummaryMetrics                       
 
    R= human_g1k_v37_decoy.fasta                                                   
 
    VALIDATION_STRINGENCY=LENIENT                                           
 
    PROGRAM: QualityScoreDistribution
 
  
</span>
+
*Now HaplotypeCaller handels SNP and INDEL calls.
  
=== ReduceReads ===
+
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
 
+
   -T HaplotypeCaller                 
  java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
+
   -I [bam files]
   -T ReduceReads
+
   -o [raw.snps.indels.gvcf files]           
   -I Sample1.recal.bam
 
   -o Sample1.reduced.bam
 
 
   -R human_g1k_v37_decoy.fasta
 
   -R human_g1k_v37_decoy.fasta
   2> ReduceReads.report
+
   -variant_index_type LINEAR
 
+
  -stand_call_conf 30.0
== Variant Calling ==
+
  -stand_emit_conf 30.0
 +
  -emitRefConfidence GVCF
 +
  -variant_index_parameter 128000
 +
  -L NCBI Ref_GRCh37 exon.region.list
 +
  &> GATK_HaplotypeCaller.log
  
=== UnifiedGenotyper ===
+
=== CombineGVCFs ===
  
SNP calling
 
 
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
 
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
   -T UnifiedGenotyper                 
+
   -T CombineGVCFs
  -I reduced_bam.list
 
  -I 1000G background_bam.list
 
  -o Sample1.raw.vcf                 
 
 
   -R human_g1k_v37_decoy.fasta
 
   -R human_g1k_v37_decoy.fasta
   -nt 4
+
   -variant [all gvcf files created by HaplotypeCaller]
   -nct 6
+
   &> GATK_CombineGVCF.log
  -stand_call_conf 30.0
+
 
  -stand_emit_conf 30.0
+
=== GenotypeGVCFs ===
  -glm SNP 
 
  -out_mode EMIT_VARIANTS_ONLY
 
  -L NCBI Ref_GRCh37 region.bed
 
  2> UnifiedGenotyper.report
 
  
INDEL calling
 
 
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
 
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
   -T UnifiedGenotyper                 
+
   -T GenotypeGVCFs
  -I reduced_bam.list
 
  -I 1000G background_bam.list
 
  -o Sample1.raw.vcf                 
 
 
   -R human_g1k_v37_decoy.fasta
 
   -R human_g1k_v37_decoy.fasta
   -nt 4
+
   -variant [sampe_mergeGvcf.vcf]
   -nct 6
+
   -variant [1000G pre-combined .gvcf files]
  -stand_call_conf 30.0
+
   -o [sample+background_genotyped.vcf]
   -stand_emit_conf 30.0
+
   &> GATK_GenotypeGVCF.log
   -glm INDEL 
+
   -o [sampe_mergeGvcf.vcf]
   -out_mode EMIT_VARIANTS_ONLY
 
  -L NCBI Ref_GRCh37 region.bed
 
  2> UnifiedGenotyper.report
 
  
 
=== VariantRecalibrator ===
 
=== VariantRecalibrator ===
  
SNP Recalibration
+
*SNP Recalibration
 
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
 
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
 
   -T VariantRecalibrator
 
   -T VariantRecalibrator
 
   -R human_g1k_v37_decoy.fasta
 
   -R human_g1k_v37_decoy.fasta
  -input Sample1.raw.vcf
+
   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /data/GATK_Bundle/hapmap_3.3.b37.vcf
  -nt 25
+
   -resource:omni,known=false,training=true,truth=true,prior=12.0 /data/GATK_Bundle/1000G_omni2.5.b37.vcf
   -resource: hapmap,known=false,training=true,truth=true,prior=15.0
+
   -resource:1000G,known=false,training=true,truth=false,prior=10.0 /data/GATK_Bundle/1000G_phase1.snps.high_confidence.b37.vcf
   -resource: omni,known=false,training=true,truth=true,prior=12.0
+
   -an QD -an MQRankSum -an ReadPosRankSum -an FS
   -resource: 1000G,known=false,training=true,truth=false,prior=10.0
+
   -input [sample+background_genotyped.vcf]
   -use_annotation: QD
+
   -recalFile [sample_snp_recal]
  -use_annotation: HaplotypeScore
+
   -tranchesFile [sample_snp_tranches]
  -use_annotation: MQRankSum
+
   -rscriptFile [sample_snp_plots.R]
  -use_annotation: ReadPosRankSum
+
  -mode SNP
  -use_annotation: FS
+
   &> GATK_VariantRecalibrator_SNP.log
   -numBadVariants: 5000
 
  -mode SNP                         
 
   -recalFile Sample1_SNP.recal           
 
   -tranchesFile Sample1_SNP.tranches     
 
   -rscriptFile Sample1_SNP.plots.R      
 
   2> VariantRecalibrator_SNP.report
 
  
INDEL Recalibration
+
*INDEL Recalibration
 
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
 
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
 
   -T VariantRecalibrator
 
   -T VariantRecalibrator
 
   -R human_g1k_v37_decoy.fasta
 
   -R human_g1k_v37_decoy.fasta
  -input Sample1.raw.vcf
+
   -resource:mills,known=false,training=true,truth=true,prior=12.0 /data/GATK_Bundle/Mills_and_1000G_gold_standard.indels.b37.vcf
  -nt 25
+
   -resource:1000G,known=false,training=true,truth=true,prior=10.0 /data/GATK_Bundle/1000G_phase1.indels.b37.vcf
   -resource: mills,known=false,training=true,truth=true,prior=12.0
+
   -an MQRankSum -an ReadPosRankSum -an FS
   -resource: 1000G,known=false,training=true,truth=true,prior=10.0
 
   -use_annotation: MQRankSum
 
  -use_annotation: ReadPosRankSum
 
  -use_annotation: FS
 
 
   -numBadVariants: 5000
 
   -numBadVariants: 5000
 
   -mode INDEL
 
   -mode INDEL
   -recalFile Sample1_INDEL.recal           
+
  -input [sample+background_genotyped.vcf]
   -tranchesFile Sample1_INDEL.tranches     
+
   -recalFile [sample_indel_recal]
   -rscriptFile Sample1_INDEL.plots.R      
+
   -tranchesFile [sample_indel_tranches]
   2> VariantRecalibrator_INDEL.report
+
   -rscriptFile [sample_indel_plots.R]
 +
   -mode INDEL
  
 
=== ApplyRecalibration ===
 
=== ApplyRecalibration ===
  
SNP Apply
+
*SNP Apply
 +
 
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
 
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
 
   -T ApplyRecalibration
 
   -T ApplyRecalibration
   -input Sample1_raw_SNP.vcf
+
   -input [genotyped.vcf]
   -o Sample1_vqsr_SNP.vcf
+
   -o [recal_SNP.vcf]
 
   -R human_g1k_v37_decoy.fasta
 
   -R human_g1k_v37_decoy.fasta
  -nt 25
+
   -ts_filter_level 99.0
   -ts_filter_level: 99.0
+
   -excludeFiltered
   -excludeFiltered : TRUE
+
   -tranchesFile [sample.snp.tranches]
   -tranchesFile Sample1.tranches
+
   -recalFile [sample.snp.recal]
   -recalFile Sample1.recal
 
 
   -mode SNP  
 
   -mode SNP  
   2> ApplyRecalibration_SNP.report
+
   &> ApplyRecalibration_SNP.report
 +
 
 +
*INDEL Apply
  
SNP INDEL
 
 
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
 
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
 
   -T ApplyRecalibration
 
   -T ApplyRecalibration
   -input Sample1_raw_INDEL.vcf
+
   -input [genotyped.vcf]
   -o Sample1_vqsr_INDEL.vcf
+
   -o [recal_INDEL.vcf]
 
   -R human_g1k_v37_decoy.fasta
 
   -R human_g1k_v37_decoy.fasta
  -nt 25
+
   -ts_filter_level 99.0
   -ts_filter_level: 99.0
+
   -excludeFiltered
   -excludeFiltered : TRUE
+
   -tranchesFile [sample.indel.tranches]
   -tranchesFile Sample1.tranches
+
   -recalFile [sample.indel.recal]
   -recalFile Sample1.recal
+
   -mode INDEL
   -mode INDEL  
+
   &> ApplyRecalibration_INDEL.report
   2> ApplyRecalibration_INDEL.report
 
  
== SelectVariants==
+
== CombineVarients ==
  
These commands will remove the background files and output SNP and INDEL files, then combine them into a single VCF file.
+
These command will combine INDEL and SNP files into a single VCF file.
 
+
*CombineVarients
Select SNP
 
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
 
  -T SelectVariants
 
  -R human_g1k_v37_decoy.fasta
 
  --variant SNP_variant
 
  -o output_SNP.file
 
  2> SelectVariants_SNP.report
 
 
 
Select INDEL
 
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
 
  -T SelectVariants
 
  -R human_g1k_v37_decoy.fasta
 
  --variant INDEL_variant
 
  -o output_INDEL.file
 
  2> SelectVariants_INDEL.report
 
  
CombineVarients
 
 
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
 
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
 
   -T CombineVariants
 
   -T CombineVariants
 
   -R human_g1k_v37_decoy.fasta
 
   -R human_g1k_v37_decoy.fasta
   --variant INDEL_variant
+
   --variant cleaned_recal_SNP.file
  --variant  output_INDEL.file
+
   --variant cleaned_recal_INDEL.file
   --variant output_SNP.file
+
   -o [sample+background_Final.vcf]
   -o Final.vcf  
+
   &> CombineVarients.report
   2> CombineVarients.report
 
  
 
== Variant File QC ==
 
== Variant File QC ==

Latest revision as of 20:55, 21 August 2014

Utah Genome Project

Aug. 2014 
Variant Calling Pipeline  Version 1.0.3

Software Versions

    • GenomeAnalysisTK-3.3-2
    • Picard : Version: 1.112
    • FastQC v0.10.1
    • Samtools Version: 0.1.19
    • BWA Version: 0.7.5
    • cApTUrE 1.0.3

Data Source

Data sets used for the variant calling pipeline come from the Broad GSA (GATK) group as the 'GATK resource bundle 2.5' version 2.8

wget -r ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37
  • Reference Genome (GRCh37):
    • human_g1k_v37_decoy.fasta
  • VCF files for RealignerTargetCreator knowns and dbsnp for BaseRecalibrator.
    • known_indel: /data/GATK_Bundle/Mills_and_1000G_gold_standard.indels.b37.vcf
    • known_indel: /data/GATK_Bundle/1000G_phase1.indels.b37.vcf
    • known_dbsnp: /data/GATK_Bundle/dbsnp_137.b37.vcf
  • Background Files
    • We have created 1000Genomes (BWA mem/GATK 3.0+) background files to be ran concurrently with the GenotypeGVCFs step.
  • Groups Currently completed:
    • CEU
    • GBR

If you would like a copy of these file please contact me.

shawn.rynearson@gmail.com
  • Resource files for VariantRecalibrator_SNP
    • hapmap_3.3.b37.vcf
    • 1000G_omni2.5.b37.vcf
    • 1000G_phase1.snps.high_confidence.b37.vcf
  • Resource files for VariantRecalibrator_INDEL
    • Mills_and_1000G_gold_standard.indels.b37.vcf
    • 1000G_phase1.indels.b37.vcf

Sequencing

This pipeline is designed for 100 bp (or greater) Illumina HiSeq PE exome or WGS sequence data with Sanger/Illumina 1.9 quality encoding, and uses Illumina naming convention found here [1]

Validate File Integrity with md5sum

An md5sum signature should be provided for each FastQ file by the sequencing center. After the file has been downloaded, locally check the md5sum to be sure that no data corruption occurred during the file transfer.

md5sum file.fastq > file_local.md5
diff file_local.md5 file_provided.md5

If the md5sum signature differs from that provided for the file:

  • Check to be sure you have the correct file.
  • Check if the md5sum was calculated on that compressed or uncompressed file by the provider and be sure to do the same with the local copy.
  • Try the download again.
  • Contact the sequence provider.

FastQ File Analyses

fastqc Sample1_L1_R1.txt

From the sumamry.txt report we check

  • FAIL sections

From the fastqc_data.txt file we check the following values:

  • Encoding (must be Sanger / Illumina 1.9)
  • Total Sequences (Currently set to 30000000)
  • Filtered Sequences (Currently set to less then 5)
  • Sequence length (must be >= 100 bp)
  • %GC (45 < x < 55)
  • Total Duplicate Percentage (Currently set to 60.0)
Result are reported to QC-report.txt

Indexing

The following indexing is required using BWA, Picard and SamTools. GATK requires all three. However this step only needs to be done once "per-machine".

  • BWA
bwa index -a bwtsw human_g1k_v37_decoy.fasta
  • Picard
java -jar CreateSequenceDictionary.jar R=human_g1k_v37_decoy.fasta O=human_g1k_v37_decoy.dic
  • SamTools
samtools faidx human_g1k_v37_decoy.fasta

Alignment

Align reads to the genome with bwa.

The 'BWA-mem' program will find the reference coordinates of the input reads (independent of their mate-pair). The following parameters are those used by the 1KG project and GATK for aligning Illumina data.

bwa mem -M -R "read group" human_g1k_v37_decoy.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq | samtools view -bSho BAM_FILE -

BAM File Analyses and Processing

Alignment BAM files are improved in various ways to help increase the quality and speed of subsequent variant calling steps.

Merge lane level BAMs to individual

  • This step only needs to run if you have multiple lanes per sample.
java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp MergeSamFiles.jar
    INPUT=[Lane 1 bam file]
    INPUT=[Lane 2 bam file]
    INPUT=[ ... ]
    OUTPUT=Sample1.bam                          
    VALIDATION_STRINGENCY: LENIENT
    MAX_RECORDS_IN_RAM: 5000000
    CREATE_INDEX: True
    SORT_ORDER: coordinate
    ASSUME_SORTED: True 
    USE_THREADING: True
    2> MergeSamFiles.report

Mark Duplicates

Remove PCR/Optical duplicate reads

java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp MarkDuplicates.jar
   INPUT=[bam files]
   OUTPUT=[dedup bam files]
   METRICS_FILE=lane1_dup_metrics.txt  
   VALIDATION_STRINGENCY: LENIENT
   MAX_RECORDS_IN_RAM: 5000000
   CREATE_INDEX: True
   ASSUME_SORTED: True       
   2> MarkDuplicates.log

Currently all duplicates are flagged to allow GATK to handle them.

BAM Quality Control

  • CollectMultipleMetrics
java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp 
   CollectMultipleMetrics.jar 
   I=[dedup bam files]
  O=[dedup bam files Metrics]
   R= human_g1k_v37_decoy.fasta                                                    
   VALIDATION_STRINGENCY=LENIENT                                             
   PROGRAM: QualityScoreDistribution
  &> Picard_CollectMultipleMetrics.log

Local Realignment of Indels

  • RealignerTargetCreator
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar                            
   -T RealignerTargetCreator
   -I [bam files]
   -R human_g1k_v37_decoy.fasta                              
   -known /data/GATK_Bundle/Mills_and_1000G_gold_standard.indels.b37.vcf
   -known /data/GATK_Bundle/1000G_phase1.indels.b37.vcf
   -o bam_file_realign.intervals
   -nt 24                                              
   &> RealignerTargetCreator.log
  • IndelRealigner
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
   -T IndelRealigner 
   -R human_g1k_v37_decoy.fasta
   -I [bam files]
   -targetIntervals  realign.intervals
   -known Mills_and_1000G_gold_standard.indels.b37.vcf
   -known 1000G_phase1.indels.b37.vcf
   -o [dedup_realign bam files]

Base Quality Score Recalibration & PrintReads

  • BaseRecalibrator
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T BaseRecalibrator
  -I [bam files]
  -R human_g1k_v37_decoy.fasta
  -knownSites /data/GATK_Bundle/dbsnp_137.b37.vcf
  -o [sorted_Dedup_realign_recal_data.table files]
  &> GATK_BaseRecalibrator.log
  
  • PrintReads
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T PrintReads                       
  -I Sample1.realign.bam              
  -o Sample1.recal.bam                
  -R human_g1k_v37_decoy.fasta              
  -BQSR recalibration_report.grp      

Variant Calling

HaplotypeCaller

  • Now HaplotypeCaller handels SNP and INDEL calls.
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T HaplotypeCaller                  
  -I [bam files]
  -o [raw.snps.indels.gvcf files]             
  -R human_g1k_v37_decoy.fasta
  -variant_index_type LINEAR
  -stand_call_conf 30.0
  -stand_emit_conf 30.0
  -emitRefConfidence GVCF
  -variant_index_parameter 128000
  -L NCBI Ref_GRCh37 exon.region.list
  &> GATK_HaplotypeCaller.log

CombineGVCFs

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T CombineGVCFs
  -R human_g1k_v37_decoy.fasta
  -variant [all gvcf files created by HaplotypeCaller]
  &> GATK_CombineGVCF.log

GenotypeGVCFs

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T GenotypeGVCFs
  -R human_g1k_v37_decoy.fasta
  -variant [sampe_mergeGvcf.vcf]
  -variant [1000G pre-combined .gvcf files]
  -o [sample+background_genotyped.vcf]
  &> GATK_GenotypeGVCF.log
  -o [sampe_mergeGvcf.vcf]

VariantRecalibrator

  • SNP Recalibration
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T VariantRecalibrator
  -R human_g1k_v37_decoy.fasta
  -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /data/GATK_Bundle/hapmap_3.3.b37.vcf 
  -resource:omni,known=false,training=true,truth=true,prior=12.0 /data/GATK_Bundle/1000G_omni2.5.b37.vcf 
  -resource:1000G,known=false,training=true,truth=false,prior=10.0 /data/GATK_Bundle/1000G_phase1.snps.high_confidence.b37.vcf 
  -an QD -an MQRankSum -an ReadPosRankSum -an FS
  -input [sample+background_genotyped.vcf]
  -recalFile [sample_snp_recal]
  -tranchesFile [sample_snp_tranches]
  -rscriptFile [sample_snp_plots.R]
  -mode SNP 
  &> GATK_VariantRecalibrator_SNP.log
  • INDEL Recalibration
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T VariantRecalibrator
  -R human_g1k_v37_decoy.fasta
  -resource:mills,known=false,training=true,truth=true,prior=12.0 /data/GATK_Bundle/Mills_and_1000G_gold_standard.indels.b37.vcf 
  -resource:1000G,known=false,training=true,truth=true,prior=10.0 /data/GATK_Bundle/1000G_phase1.indels.b37.vcf 
  -an MQRankSum -an ReadPosRankSum -an FS
  -numBadVariants: 5000
  -mode INDEL
  -input [sample+background_genotyped.vcf]
  -recalFile [sample_indel_recal]
  -tranchesFile [sample_indel_tranches]
  -rscriptFile [sample_indel_plots.R]
  -mode INDEL

ApplyRecalibration

  • SNP Apply
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T ApplyRecalibration
  -input [genotyped.vcf]
  -o [recal_SNP.vcf]
  -R human_g1k_v37_decoy.fasta
  -ts_filter_level 99.0
  -excludeFiltered
  -tranchesFile [sample.snp.tranches]
  -recalFile [sample.snp.recal]
  -mode SNP 
  &> ApplyRecalibration_SNP.report
  • INDEL Apply
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T ApplyRecalibration
  -input [genotyped.vcf]
  -o [recal_INDEL.vcf]
  -R human_g1k_v37_decoy.fasta
  -ts_filter_level 99.0
  -excludeFiltered
  -tranchesFile [sample.indel.tranches]
  -recalFile [sample.indel.recal]
  -mode INDEL
  &> ApplyRecalibration_INDEL.report

CombineVarients

These command will combine INDEL and SNP files into a single VCF file.

  • CombineVarients
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T CombineVariants
  -R human_g1k_v37_decoy.fasta
  --variant cleaned_recal_SNP.file
  --variant cleaned_recal_INDEL.file
  -o [sample+background_Final.vcf]
  &> CombineVarients.report

Variant File QC

Quality Metrics on variants

  • Ti/Tv Ratio (2.1 for WGS ~2.8 for exome)
  • HapMap concordance
  • SNV/Indel Counts
  • Rare variant enrichment
  • DP
  • Q
  • GQ