Difference between revisions of "UGP Variant Pipeline 0.0.3"

From Utah Genome Project Wiki
Jump to navigation Jump to search
 
(38 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
= Utah Genome Project =
 
= Utah Genome Project =
 +
 
== Variant Calling Pipeline  Version 0.0.3 ==
 
== Variant Calling Pipeline  Version 0.0.3 ==
== Sept. 2013 ==
+
 
 +
== Jan. 2014 ==
 +
 
 +
===Software Versions===
 +
 
 +
**GenomeAnalysisTK-2.8-1
 +
**Picard : Version: 1.99
 +
**FastQC v0.10.1
 +
**Samtools Version: 0.1.19
 +
**BWA Version: 0.7.5
 +
**cApTUrE 0.03
  
 
=== Data Source ===
 
=== Data Source ===
Line 17: Line 28:
 
**known_dbsnp: /data/GATK_Bundle/dbsnp_137.b37.vcf
 
**known_dbsnp: /data/GATK_Bundle/dbsnp_137.b37.vcf
  
*Background BAMS used running UnifiedGenotyper
+
*1000Genomes Background BAMS ran concurrently with UnifiedGenotyper
**1000Genomes sets:
 
 
**CEU  
 
**CEU  
 
**GBR
 
**GBR
Line 33: Line 43:
 
== Sequencing ==
 
== Sequencing ==
  
This pipeline is designed for 100 bp Illumina HiSeq PE exome or WGS sequence data with Sanger/Illumina 1.9 quality encoding, and uses Illumina naming convention found here [http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm]
+
This pipeline is designed for 100 bp (or greater) Illumina HiSeq PE exome or WGS sequence data with Sanger/Illumina 1.9 quality encoding, and uses Illumina naming convention found here [http://support.illumina.com/help/SequencingAnalysisWorkflow/Content/Vault/Informatics/Sequencing_Analysis/CASAVA/swSEQ_mCA_FASTQFiles.htm]
  
 
=== Validate File Integrity with md5sum ===
 
=== Validate File Integrity with md5sum ===
Line 56: Line 66:
 
* Total Sequences (<span style="background:#FFFF00">Need to develop a acceptable range</span>)
 
* Total Sequences (<span style="background:#FFFF00">Need to develop a acceptable range</span>)
 
* Filtered Sequences      (Should be 0 or at least very low)
 
* Filtered Sequences      (Should be 0 or at least very low)
* Sequence length (must be ~ 100 bp)
+
* Sequence length (must be >= 100 bp)
 
* %GC    (<span style="background:#FFFF00">should be 45 < x < 55</span>)
 
* %GC    (<span style="background:#FFFF00">should be 45 < x < 55</span>)
 
* Total Duplicate Percentage (<span style="background:#FFFF00">This value may not be valuable - an acceptable range has not been determined</span>).
 
* Total Duplicate Percentage (<span style="background:#FFFF00">This value may not be valuable - an acceptable range has not been determined</span>).
Line 73: Line 83:
 
  bwa aln -q 15 human_g1k_v37_decoy.fasta file.fastq > Sample1_L1_R2.sai # One lane of reads second in pair
 
  bwa aln -q 15 human_g1k_v37_decoy.fasta file.fastq > Sample1_L1_R2.sai # One lane of reads second in pair
  
The 'bwa sampe'.  For paired-end reads, the maximum insert size is taken to be 3 times the expected insert size and read group added.
+
The 'bwa sampe'.  For paired-end reads, the maximum insert size is taken to be 3 times the expected insert size.
  
 
  bwa sampe -r "read group" -P Sample1_L1_R1.sai Sample1_L1_R1.sai Sample1_L1_R1.fastq Sample1_L1_R2.fastq | samtools view -bSho BAM_FILE -
 
  bwa sampe -r "read group" -P Sample1_L1_R1.sai Sample1_L1_R1.sai Sample1_L1_R1.fastq Sample1_L1_R2.fastq | samtools view -bSho BAM_FILE -
Line 89: Line 99:
 
=== Merge lane level BAMs to individual ===  
 
=== Merge lane level BAMs to individual ===  
  
java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp MergeSamFiles.jar
+
java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp MergeSamFiles.jar
 
     INPUT=Sample1_L1.rg.bam                     
 
     INPUT=Sample1_L1.rg.bam                     
 
     INPUT=Sample1_L2.rg.bam                     
 
     INPUT=Sample1_L2.rg.bam                     
Line 98: Line 108:
 
     SORT_ORDER: coordinate
 
     SORT_ORDER: coordinate
 
     ASSUME_SORTED: True  
 
     ASSUME_SORTED: True  
    MERGE_SEQUENCE_DICTIONARIES:
 
 
     USE_THREADING: True
 
     USE_THREADING: True
 
     2> MergeSamFiles.report
 
     2> MergeSamFiles.report
 
  
 
=== Mark Duplicates ===
 
=== Mark Duplicates ===
Line 107: Line 115:
 
Remove PCR/Optical duplicate reads
 
Remove PCR/Optical duplicate reads
  
  java -Xmx4g -jar MarkDuplicates.jar   \
+
  java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp MarkDuplicates.jar
     INPUT=Sample1.bam                  \
+
     INPUT=Sample1.bam                   
     OUTPUT=Sample1.dedup.bam            \
+
     OUTPUT=Sample1.dedup.bam             
     METRICS_FILE=lane1_dup_metrics.txt  \
+
     METRICS_FILE=lane1_dup_metrics.txt   
     ASSUME_SORTED=true                  \
+
     VALIDATION_STRINGENCY: LENIENT
     USE_THREADING=true                  \
+
     MAX_RECORDS_IN_RAM: 5000000
     VALIDATION_STRINGENCY=SILENT        \
+
     CREATE_INDEX: True
     TMP_DIR=/big/drive                  \
+
     ASSUME_SORTED: True     
     2> MarkDuplicates.error
+
     2> MarkDuplicates.log
  
 
<span style="background:#FFFF00">Develop range for duplicate levels here</span>
 
<span style="background:#FFFF00">Develop range for duplicate levels here</span>
Line 121: Line 129:
 
=== Local Realignment of Indels ===
 
=== Local Realignment of Indels ===
  
  java -Xmx4g -jar GenomeAnalysisTK.jar                 \
+
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
     -I All_BAMs.list                                    \
+
     -I dedup_bam.list                                     
     -o Realign_Intervals.txt                            \
+
     -o Realign_Intervals.txt                             
     -T RealignerTargetCreator                          \
+
     -T RealignerTargetCreator                           
     -R human_g1K_v37.fasta                              \
+
     -R human_g1k_v37_decoy.fasta                               
     -known Mills_and_1000G_gold_standard.indels.b37.vcf \
+
     -known [list from above]
    -known 1000G_phase1.indels.b37.vcf                  \
+
     -nt 24                                               
     -nt 24                                              \
+
     2> RealignerTargetCreator.log
     2> RealignerTargetCreator.error
 
 
 
java -Xmx4g -jar GenomeAnalysisTK.jar                  \
 
    -T IndelRealigner                                  \
 
    -I Sample1.dedup.bam                                \
 
    -o Sample1.realign.bam                              \
 
    -R human_g1K_v37.fasta                              \
 
    -targetIntervals Realign_Intervals.txt              \
 
    -known Mills_and_1000G_gold_standard.indels.b37.vcf \
 
    2> IndelRealigner.error
 
  
 
=== Base Quality Score Recalibration ===
 
=== Base Quality Score Recalibration ===
  
  java -Xmx4g -jar GenomeAnalysisTK.jar \
+
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
   -T PrintReads                      \
+
   -T PrintReads                       
   -I Sample1.realign.bam              \
+
   -I Sample1.realign.bam               
   -o Sample1.recal.bam                \
+
   -o Sample1.recal.bam                 
   -R human_g1K_v37.fasta              \
+
   -R human_g1k_v37_decoy.fasta               
   -BQSR recalibration_report.grp      \
+
   -BQSR recalibration_report.grp       
  
 +
<span style="background:#FFFF00">
  
<span style="background:#FFFF00">
 
 
=== BAM Quality Control ===
 
=== BAM Quality Control ===
  
<span style="background:#FFFF00">
+
Via Picards tools CollectMultipleMetrics program
CollectAlignmentSummaryMetrics
+
  java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp
  java -Xmx4g -jar /tools/picard-tools-1.32/CollectAlignmentSummaryMetrics.jar \
+
    CollectMultipleMetrics.jar  
     I=/output/filename.b37_1kg.sorted.bam                                      \
+
     I=/output/filename.b37_1kg.sorted.bam                                       
     O=/output/filename.b37_1kg.AlignmentSummaryMetrics                        \
+
     O=/output/filename.b37_1kg.AlignmentSummaryMetrics                         
     R=human_g1K_v37.fasta                                                     \
+
     R= human_g1k_v37_decoy.fasta                                                  
     VALIDATION_STRINGENCY=LENIENT                                            \
+
     VALIDATION_STRINGENCY=LENIENT                                             
     TMP_DIR=/big/drive
+
     PROGRAM: QualityScoreDistribution
 +
 
 +
</span>
 +
 
 +
=== ReduceReads ===
 +
 
 +
java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
 +
  -T ReduceReads
 +
  -I Sample1.recal.bam
 +
  -o Sample1.reduced.bam
 +
  -R human_g1k_v37_decoy.fasta
 +
  2> ReduceReads.report
 +
 
 +
== Variant Calling ==
  
<span style="background:#FFFF00">
+
=== UnifiedGenotyper ===
CollectGcBiasMetrics
 
java -Xmx4g -jar /tools/picard-tools-1.32/CollectGcBiasMetrics.jar \
 
    R=human_g1K_v37.fasta                                          \
 
    I=/output/filename.b37_1kg.sorted.bam                            \
 
    O=/output/filename.b37_1kg.GcBiasMetrics                        \
 
    CHART=/output/filename.b37_1kg.GcBiasMetrics.pdf                \
 
    VALIDATION_STRINGENCY=LENIENT                                  \
 
    TMP_DIR=/big/drive
 
  
<span style="background:#FFFF00">
+
SNP calling
CollectInsertSizeMetrics
+
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  java -Xmx4g -jar /tools/picard-tools-1.32/CollectInsertSizeMetrics.jar \
+
  -T UnifiedGenotyper                 
    I=/output/filename.b37_1kg.sorted.bam                                \
+
  -I reduced_bam.list
    O=/output/filename.b37_1kg.CollectInsertSizeMetrics                 \
+
  -I 1000G background_bam.list
    H=/output/filename.b37_1kg.CollectInsertSizeMetrics.pdf              \
+
  -o Sample1.raw.vcf                  
    VALIDATION_STRINGENCY=LENIENT                                      \
+
  -R human_g1k_v37_decoy.fasta
    TMP_DIR=/big/drive
+
  -nt 4
 +
  -nct 6
 +
  -stand_call_conf 30.0
 +
  -stand_emit_conf 30.0
 +
  -glm SNP 
 +
  -out_mode EMIT_VARIANTS_ONLY
 +
  -L NCBI Ref_GRCh37 region.bed
 +
  2> UnifiedGenotyper.report
  
<span style="background:#FFFF00">
+
INDEL calling
MeanQualityByCycle
+
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  java -Xmx4g -jar /tools/picard-tools-1.32/MeanQualityByCycle.jar \
+
  -T UnifiedGenotyper                 
    I=/output/filename.b37_1kg.sorted.bam                          \
+
  -I reduced_bam.list
    O=/output/filename.b37_1kg.MeanQualityByCycle                 \
+
  -I 1000G background_bam.list
    CHART=/output/filename.b37_1kg.MeanQualityByCycle.pdf          \
+
  -o Sample1.raw.vcf                  
    VALIDATION_STRINGENCY=LENIENT                                \
+
  -R human_g1k_v37_decoy.fasta
    TMP_DIR=/big/drive
+
  -nt 4
 +
  -nct 6
 +
  -stand_call_conf 30.0
 +
  -stand_emit_conf 30.0
 +
  -glm INDEL 
 +
  -out_mode EMIT_VARIANTS_ONLY
 +
  -L NCBI Ref_GRCh37 region.bed
 +
  2> UnifiedGenotyper.report
  
<span style="background:#FFFF00">
+
=== VariantRecalibrator ===
QualityScoreDistribution
 
java -Xmx4g -jar /tools/picard-tools-1.32/QualityScoreDistribution.jar  \
 
    I=/output/filename.b37_1kg.sorted.bam                                \
 
    O=/output/filename.b37_1kg.[wiki:QualityScoreDistribution]            \
 
    CHART=/output/filename.b37_1kg.[wiki:QualityScoreDistribution].pdf    \
 
    VALIDATION_STRINGENCY=LENIENT                                        \
 
    TMP_DIR=/big/drive
 
  
<span style="background:#FFFF00">
+
SNP Recalibration
BamIndexStats
+
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  java -Xmx4g -jar /tools/picard-tools-1.32/BamIndexStats.jar \
+
  -T VariantRecalibrator
    INPUT=/output/filename.b37_1kg.sorted.bam                \
+
  -R human_g1k_v37_decoy.fasta
    VALIDATION_STRINGENCY=LENIENT                            \
+
  -input Sample1.raw.vcf
    TMP_DIR=/big/drive
+
  -nt 25
 +
  -resource: hapmap,known=false,training=true,truth=true,prior=15.0
 +
  -resource: omni,known=false,training=true,truth=true,prior=12.0
 +
  -resource: 1000G,known=false,training=true,truth=false,prior=10.0
 +
  -use_annotation: QD
 +
  -use_annotation: HaplotypeScore
 +
  -use_annotation: MQRankSum
 +
  -use_annotation: ReadPosRankSum
 +
  -use_annotation: FS
 +
  -numBadVariants: 5000
 +
  -mode SNP                         
 +
  -recalFile Sample1_SNP.recal           
 +
  -tranchesFile Sample1_SNP.tranches     
 +
  -rscriptFile Sample1_SNP.plots.R     
 +
  2> VariantRecalibrator_SNP.report
  
<span style="background:#FFFF00">
+
INDEL Recalibration
CalculateHsMetricsWholeGenome
+
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  java -Xmx4g -jar /tools/picard-tools-1.32/CalculateHsMetricsWholeGenome.jar \
+
  -T VariantRecalibrator
    INPUT=/output/filename.b37_1kg.sorted.bam                                \
+
  -R human_g1k_v37_decoy.fasta
    OUTPUT=/output/filename.b37_1kg.HsMetrics                                \
+
  -input Sample1.raw.vcf
    BAIT_INTERVALS=/resources//hg19/intervals/GoNL.interval_list            \
+
  -nt 25
    TARGET_INTERVALS=/resources//hg19/intervals/GoNL.interval_list          \
+
  -resource: mills,known=false,training=true,truth=true,prior=12.0
    VALIDATION_STRINGENCY=LENIENT                                            \
+
  -resource: 1000G,known=false,training=true,truth=true,prior=10.0
    TMP_DIR=/big/drive
+
  -use_annotation: MQRankSum
</span>
+
  -use_annotation: ReadPosRankSum
 +
  -use_annotation: FS
 +
  -numBadVariants: 5000
 +
  -mode INDEL
 +
  -recalFile Sample1_INDEL.recal           
 +
  -tranchesFile Sample1_INDEL.tranches     
 +
  -rscriptFile Sample1_INDEL.plots.R     
 +
  2> VariantRecalibrator_INDEL.report
  
=== Reduce Reads ===
+
=== ApplyRecalibration ===
  
  java -Xmx4g -jar GenomeAnalysisTK.jar \
+
SNP Apply
   -T ReduceReads                      \
+
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
   -I Sample1.recal.bam                \
+
   -T ApplyRecalibration
   -o Sample1.reduced.bam              \
+
   -input Sample1_raw_SNP.vcf
   -R human_g1K_v37.fasta             \
+
   -o Sample1_vqsr_SNP.vcf
   2> ReduceReads.error
+
   -R human_g1k_v37_decoy.fasta
 +
  -nt 25
 +
  -ts_filter_level: 99.0
 +
  -excludeFiltered : TRUE
 +
  -tranchesFile Sample1.tranches
 +
  -recalFile Sample1.recal
 +
  -mode SNP
 +
   2> ApplyRecalibration_SNP.report
  
== Variant Calling ==
+
SNP INDEL
 +
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
 +
  -T ApplyRecalibration
 +
  -input Sample1_raw_INDEL.vcf
 +
  -o Sample1_vqsr_INDEL.vcf
 +
  -R human_g1k_v37_decoy.fasta
 +
  -nt 25
 +
  -ts_filter_level: 99.0
 +
  -excludeFiltered : TRUE
 +
  -tranchesFile Sample1.tranches
 +
  -recalFile Sample1.recal
 +
  -mode INDEL
 +
  2> ApplyRecalibration_INDEL.report
  
=== UnifiedGenotyper ===
+
== SelectVariants==
java -Xmx4g -jar GenomeAnalysisTK.jar \
 
  -T UnifiedGenotyper                  \
 
  -I Sample1.reduced.bam              \
 
  -I Sample2.reduced.bam              \
 
  -I Background1.reduced.bam          \
 
  -I Background2.reduced.bam          \
 
  -o Sample1.raw.vcf                  \
 
  -R human_g1K_v37.fasta              \
 
  --dbsnp dbSNP.vcf                    \
 
  -dcov 250                          \
 
  -L TruSeq_exome_regions.bed        \
 
  2> UnifiedGenotyper.error
 
  
=== VariantRecalibrator ===
+
These commands will remove the background files and output SNP and INDEL files, then combine them into a single VCF file.
java -Xmx4g -jar GenomeAnalysisTK.jar \
 
  -T VariantRecalibrator              \
 
  -R human_g1K_v37.fasta              \
 
  -input Sample1.raw.vcf              \
 
  -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf          \
 
  -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf        \
 
  -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_137.b37.vcf            \
 
  
ADD 1KG
+
Select SNP
 +
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
 +
  -T SelectVariants
 +
  -R human_g1k_v37_decoy.fasta
 +
  --variant SNP_variant
 +
  -o output_SNP.file
 +
  2> SelectVariants_SNP.report
  
  -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff \
+
Select INDEL
   -mode SNP                          \
+
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
   -recalFile Sample1.recal            \
+
   -T SelectVariants
   -tranchesFile Sample1.tranches      \
+
   -R human_g1k_v37_decoy.fasta
   -rscriptFile Sample1.plots.R        \
+
   --variant INDEL_variant
   2> VariantRecalibrator.error
+
   -o output_INDEL.file
 +
   2> SelectVariants_INDEL.report
  
=== ApplyRecalibration ===
+
CombineVarients
  java -Xmx4g -jar GenomeAnalysisTK.jar \
+
  java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
   -T ApplyRecalibration              \
+
   -T CombineVariants
   -input Sample1.raw.vcf              \
+
   -R human_g1k_v37_decoy.fasta
   -o Sample1.vqsr.vcf                \
+
   --variant INDEL_variant
  -R human_g1K_v37.fasta              \
+
   --variant  output_INDEL.file
   --ts_filter_level 99.0                \
+
   --variant output_SNP.file
   -tranchesFile Sample1.tranches      \
+
   -o Final.vcf
   -recalFile Sample1.recal            \
+
   2> CombineVarients.report
  -mode SNP                          \
 
   2> ApplyRecalibration.error
 
  
 
== Variant File QC ==
 
== Variant File QC ==

Latest revision as of 16:43, 21 August 2014

Utah Genome Project

Variant Calling Pipeline Version 0.0.3

Jan. 2014

Software Versions

    • GenomeAnalysisTK-2.8-1
    • Picard : Version: 1.99
    • FastQC v0.10.1
    • Samtools Version: 0.1.19
    • BWA Version: 0.7.5
    • cApTUrE 0.03

Data Source

Data sets used for the variant calling pipeline come from the Broad GSA (GATK) group as the 'GATK resource bundle 2.5' version 2.8

wget -r ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37
  • Reference Genome (GRCh37):
    • human_g1k_v37_decoy.fasta
  • VCF files for RealignerTargetCreator knowns and dbsnp for BaseRecalibrator.
    • known_indel: /data/GATK_Bundle/Mills_and_1000G_gold_standard.indels.b37.vcf
    • known_indel: /data/GATK_Bundle/1000G_phase1.indels.b37.vcf
    • known_dbsnp: /data/GATK_Bundle/dbsnp_137.b37.vcf
  • 1000Genomes Background BAMS ran concurrently with UnifiedGenotyper
    • CEU
    • GBR
  • Resource files for VariantRecalibrator_SNP
    • hapmap_3.3.b37.vcf
    • 1000G_omni2.5.b37.vcf
    • 1000G_phase1.snps.high_confidence.b37.vcf
  • Resource files for VariantRecalibrator_INDEL
    • Mills_and_1000G_gold_standard.indels.b37.vcf
    • 1000G_phase1.indels.b37.vcf

Sequencing

This pipeline is designed for 100 bp (or greater) Illumina HiSeq PE exome or WGS sequence data with Sanger/Illumina 1.9 quality encoding, and uses Illumina naming convention found here [1]

Validate File Integrity with md5sum

An md5sum signature should be provided for each FastQ file by the sequencing center. After the file has been downloaded, locally check the md5sum to be sure that no data corruption occurred during the file transfer.

md5sum file.fastq > file_local.md5
diff file_local.md5 file_provided.md5

If the md5sum signature differs from that provided for the file:

  • Check to be sure you have the correct file.
  • Check if the md5sum was calculated on that compressed or uncompressed file by the provider and be sure to do the same with the local copy.
  • Try the download again.
  • Contact the sequence provider.

FastQ File Analyses

fastqc Sample1_L1_R1.txt

From the fastqc_data.txt file we check the following values:

  • Encoding (must be Sanger / Illumina 1.9)
  • Total Sequences (Need to develop a acceptable range)
  • Filtered Sequences (Should be 0 or at least very low)
  • Sequence length (must be >= 100 bp)
  • %GC (should be 45 < x < 55)
  • Total Duplicate Percentage (This value may not be valuable - an acceptable range has not been determined).

Alignment

Align reads to the genome with bwa.

Index the reference sequence if this has not already been done

bwa index -a bwtsw human_g1k_v37_decoy.fasta

The 'bwa aln' program will find the reference coordinates of the input reads (independent of their mate-pair). The following parameters are those used by the 1KG project for aligning Illumina data.

bwa aln -q 15 human_g1k_v37_decoy.fasta file.fastq > Sample1_L1_R1.sai # One lane of reads first in pair
bwa aln -q 15 human_g1k_v37_decoy.fasta file.fastq > Sample1_L1_R2.sai # One lane of reads second in pair

The 'bwa sampe'. For paired-end reads, the maximum insert size is taken to be 3 times the expected insert size.

bwa sampe -r "read group" -P Sample1_L1_R1.sai Sample1_L1_R1.sai Sample1_L1_R1.fastq Sample1_L1_R2.fastq | samtools view -bSho BAM_FILE -

This will switch to bwa mem soon.

bwa mem -M -R "read group" human_g1k_v37_decoy.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq | samtools view -bSho BAM_FILE - 

BAM File Analyses

Alignment BAM files are improved in various ways to help increase the quality and speed of subsequent variant calling steps.

Merge lane level BAMs to individual

java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp MergeSamFiles.jar
    INPUT=Sample1_L1.rg.bam                     
    INPUT=Sample1_L2.rg.bam                    
    OUTPUT=Sample1.bam                          
    VALIDATION_STRINGENCY: LENIENT
    MAX_RECORDS_IN_RAM: 5000000
    CREATE_INDEX: True
    SORT_ORDER: coordinate
    ASSUME_SORTED: True 
    USE_THREADING: True
    2> MergeSamFiles.report

Mark Duplicates

Remove PCR/Optical duplicate reads

java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp MarkDuplicates.jar
   INPUT=Sample1.bam                   
   OUTPUT=Sample1.dedup.bam            
   METRICS_FILE=lane1_dup_metrics.txt  
   VALIDATION_STRINGENCY: LENIENT
   MAX_RECORDS_IN_RAM: 5000000
   CREATE_INDEX: True
   ASSUME_SORTED: True       
   2> MarkDuplicates.log

Develop range for duplicate levels here

Local Realignment of Indels

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
   -I dedup_bam.list                                    
   -o Realign_Intervals.txt                            
   -T RealignerTargetCreator                           
   -R human_g1k_v37_decoy.fasta                              
   -known [list from above]
   -nt 24                                              
   2> RealignerTargetCreator.log

Base Quality Score Recalibration

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T PrintReads                       
  -I Sample1.realign.bam              
  -o Sample1.recal.bam                
  -R human_g1k_v37_decoy.fasta              
  -BQSR recalibration_report.grp      

BAM Quality Control

Via Picards tools CollectMultipleMetrics program

java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp 
   CollectMultipleMetrics.jar 
   I=/output/filename.b37_1kg.sorted.bam                                      
   O=/output/filename.b37_1kg.AlignmentSummaryMetrics                         
   R= human_g1k_v37_decoy.fasta                                                    
   VALIDATION_STRINGENCY=LENIENT                                             
   PROGRAM: QualityScoreDistribution

ReduceReads

java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T ReduceReads
  -I Sample1.recal.bam
  -o Sample1.reduced.bam
  -R human_g1k_v37_decoy.fasta
  2> ReduceReads.report

Variant Calling

UnifiedGenotyper

SNP calling

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T UnifiedGenotyper                  
  -I reduced_bam.list
  -I 1000G background_bam.list
  -o Sample1.raw.vcf                  
  -R human_g1k_v37_decoy.fasta
  -nt 4
  -nct 6
  -stand_call_conf 30.0
  -stand_emit_conf 30.0
  -glm SNP   
  -out_mode EMIT_VARIANTS_ONLY
  -L NCBI Ref_GRCh37 region.bed 
  2> UnifiedGenotyper.report

INDEL calling

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T UnifiedGenotyper                  
  -I reduced_bam.list
  -I 1000G background_bam.list
  -o Sample1.raw.vcf                  
  -R human_g1k_v37_decoy.fasta
  -nt 4
  -nct 6
  -stand_call_conf 30.0
  -stand_emit_conf 30.0
  -glm INDEL   
  -out_mode EMIT_VARIANTS_ONLY
  -L NCBI Ref_GRCh37 region.bed 
  2> UnifiedGenotyper.report

VariantRecalibrator

SNP Recalibration

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T VariantRecalibrator
  -R human_g1k_v37_decoy.fasta
  -input Sample1.raw.vcf
  -nt 25
  -resource: hapmap,known=false,training=true,truth=true,prior=15.0
  -resource: omni,known=false,training=true,truth=true,prior=12.0
  -resource: 1000G,known=false,training=true,truth=false,prior=10.0
  -use_annotation: QD
  -use_annotation: HaplotypeScore
  -use_annotation: MQRankSum
  -use_annotation: ReadPosRankSum
  -use_annotation: FS
  -numBadVariants: 5000
  -mode SNP                           
  -recalFile Sample1_SNP.recal            
  -tranchesFile Sample1_SNP.tranches      
  -rscriptFile Sample1_SNP.plots.R       
  2> VariantRecalibrator_SNP.report

INDEL Recalibration

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T VariantRecalibrator
  -R human_g1k_v37_decoy.fasta
  -input Sample1.raw.vcf
  -nt 25
  -resource: mills,known=false,training=true,truth=true,prior=12.0
  -resource: 1000G,known=false,training=true,truth=true,prior=10.0
  -use_annotation: MQRankSum
  -use_annotation: ReadPosRankSum
  -use_annotation: FS
  -numBadVariants: 5000
  -mode INDEL
  -recalFile Sample1_INDEL.recal            
  -tranchesFile Sample1_INDEL.tranches      
  -rscriptFile Sample1_INDEL.plots.R       
  2> VariantRecalibrator_INDEL.report

ApplyRecalibration

SNP Apply

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T ApplyRecalibration
  -input Sample1_raw_SNP.vcf
  -o Sample1_vqsr_SNP.vcf
  -R human_g1k_v37_decoy.fasta
  -nt 25
  -ts_filter_level: 99.0
  -excludeFiltered : TRUE
  -tranchesFile Sample1.tranches
  -recalFile Sample1.recal
  -mode SNP 
  2> ApplyRecalibration_SNP.report

SNP INDEL

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T ApplyRecalibration
  -input Sample1_raw_INDEL.vcf
  -o Sample1_vqsr_INDEL.vcf
  -R human_g1k_v37_decoy.fasta
  -nt 25
  -ts_filter_level: 99.0
  -excludeFiltered : TRUE
  -tranchesFile Sample1.tranches
  -recalFile Sample1.recal
  -mode INDEL 
  2> ApplyRecalibration_INDEL.report

SelectVariants

These commands will remove the background files and output SNP and INDEL files, then combine them into a single VCF file.

Select SNP

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T SelectVariants
  -R human_g1k_v37_decoy.fasta
  --variant SNP_variant 
  -o output_SNP.file 
  2> SelectVariants_SNP.report

Select INDEL

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T SelectVariants
  -R human_g1k_v37_decoy.fasta
  --variant INDEL_variant 
  -o output_INDEL.file 
  2> SelectVariants_INDEL.report

CombineVarients

java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
  -T CombineVariants
  -R human_g1k_v37_decoy.fasta
  --variant INDEL_variant 
  --variant  output_INDEL.file
  --variant output_SNP.file
  -o Final.vcf 
  2> CombineVarients.report

Variant File QC

Quality Metrics on variants

  • Ti/Tv Ratio (2.1 for WGS ~2.8 for exome)
  • HapMap concordance
  • SNV/Indel Counts
  • Rare variant enrichment
  • DP
  • Q
  • GQ