Difference between revisions of "UGP Variant Pipeline 0.0.2"

From Utah Genome Project Wiki
Jump to navigation Jump to search
(Created page with "= Utah Genome Project = == Variant Calling Pipeline == === Data Source === Data sets used for the variant calling pipeline come from the Broad GSA (GATK) group as the 'GATK...")
 
 
(30 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
= Utah Genome Project =
 
= Utah Genome Project =
== Variant Calling Pipeline ==
+
== Variant Calling Pipeline Version 0.0.2 ==
 
+
== Sept. 2013 ==
  
 
=== Data Source ===
 
=== Data Source ===
  
 
Data sets used for the variant calling pipeline come from the Broad GSA (GATK) group as the 'GATK resource bundle 2.5' version 2.5
 
Data sets used for the variant calling pipeline come from the Broad GSA (GATK) group as the 'GATK resource bundle 2.5' version 2.5
wget -r ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5
+
 
 +
wget -r ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37
  
 
* Reference Genome (GRCh37):
 
* Reference Genome (GRCh37):
human_g1K_v37.fasta
+
**human_g1K_v37.fasta
  
*
+
* High confidence SNVs
 +
**1000G_omni2.5.b37.vcf
 +
**hapmap_3.3.b37.vcf
  
== Samples ==
+
* dbSNP
 +
**dbsnp_137.b37.vcf.gz
  
This pipeline is currently designed for 101 bp Illumina HiSeq PE exome sequence data
+
* Known indels
 
+
**Mills_and_1000G_gold_standard.indels.b37.vcf.gz
The starting point for sample prep guidelines is the [http://bioserver.hci.utah.edu/BioInfo/index.php/Sequencing HCI Bioinfo/Genomics Cores Wiki sample prep page]. Refer to that page and we can lock down details here as that becomes appropriate.
+
**1000G_phase1.indels.b37.vcf
  
 
== Sequencing ==
 
== Sequencing ==
  
This protocol is designed to be applied to Illumina HiSeq paired-end 101 bp sequencing with an insert size of 200-500.
+
This pipeline is designed for 100 bp Illumina HiSeq PE exome or WGS sequence data with Sanger/Illumina 1.9 quality encoding.
 
 
Data that we will need from sequencing and library prep for ReadGroup designation later on.  See the SAM format page for details:
 
 
 
*ID: Read group identifier.
 
*CN: Name of sequencing center producing the read.
 
*DS: Description.
 
*DT: Date the run was produced (ISO8601 date or date/time).
 
*FO: Flow order. The array of nucleotide bases that correspond to the nucleotides used for each flow of each read.
 
*KS: The array of nucleotide bases that correspond to the key sequence of each read.
 
*LB: Library.
 
*PG: Programs used for processing the read group.
 
*PI: Predicted median insert size.
 
*PL: Platform/technology used to produce the reads. One of (CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, PACBIO).
 
*PU: Platform unit (e.g. owcell-barcode.lane for Illumina or slide for SOLiD). Unique identifier.
 
*SM: Sample. Use pool name where a pool is being sequenced.
 
 
 
== FastQ File Analyses ==
 
  
 
=== Validate File Integrity with md5sum ===
 
=== Validate File Integrity with md5sum ===
  
An md5sum signature should be provided for each FastQ file by the group providing the sequencing services.  After the file has been downloaded locally check the md5sum to be sure no data corruption occurred during the file transfer.
+
An md5sum signature should be provided for each FastQ file by the sequencing center.  After the file has been downloaded, locally check the md5sum to be sure that no data corruption occurred during the file transfer.
  
 
  md5sum file.fastq > file_local.md5
 
  md5sum file.fastq > file_local.md5
Line 53: Line 40:
 
* Contact the sequence provider.
 
* Contact the sequence provider.
  
=== Check Sequence Quality with FastQC ===
+
== FastQ File Analyses ==
  
[http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ FastQC] is a Java utility that calculates a number of metrics that are useful in assessing the quality of sequence data.
+
fastqc Sample1_L1_R1.txt
  
fastqc 9685X1_130201_SN1117_0141_BC1JBVACXX_1_1.txt
+
From the fastqc_data.txt file we check the following values:
 
+
* Encoding (must be Sanger / Illumina 1.9)
FastQC provides the following metrics:
+
* Total Sequences (<span style="background:#FFFF00">Need to develop a acceptable range</span>)
 
+
* Filtered Sequences      (Should be 0 or at least very low)
*Basic Statistics
+
* Sequence length (must be ~ 100 bp)
*Per Base Sequence Quality
+
* %GC     (<span style="background:#FFFF00">should be 45 < x < 55</span>)
*Per Sequence Quality Score
+
* Total Duplicate Percentage (<span style="background:#FFFF00">This value may not be valuable - an acceptable range has not been determined</span>).
*Per Base Sequence Content
 
*Per Base GC Content
 
*Per Base N Content
 
*Sequence Length Distribution
 
*Sequence Duplication Levels
 
*Overrepresented Sequences
 
*Kmer Content
 
 
 
=== Convert Quality Scores to Sanger Standard ===
 
 
 
Base calling qualities are represented in FastQ files as an ASCII representation of [http://en.wikipedia.org/wiki/Phred_quality_score Phred-scaled] quality scores.  Different applications have used different transformations to generate ASCII codes for quality scores.  Most if not all aligners interpret ASCII representations of the quality score based on the Sanger Standard.  The Basic Statistics section of the FastQC output will provide an 'Encoding' value for your file.  If this value is anything other than 'Sanger / Illumina 1.9', then you will either need to map the quality values in your FastQ files to the Sanger standard, or use a command line option to your aligner to let it know the correct encoding.
 
 
 
=== Other Possible Analyses ===
 
 
 
Several other tools exist that can provide custom analyses of FastQ files - you may want to consider the following:
 
 
 
* fastq_tool:  This is a perl script written and maintained by Barry.  Most of the functionality of this tool has now been included in faster packages like FastQC above.  However, the script is fast enough to burn through a large Fastq file in a couple of hours, so if you need to write a custom analysis quickly and don't have a lot of files to process this may still be the path of least resistance.  The script is included in the GAL library.
 
* [http://hannonlab.cshl.edu/fastx_toolkit/commandline.html FASTX] is an application written by the Hannon Lab at CSHL.  It provides a number of tools for calculating quality metrics and manipulating Fastq files.
 
  
 
== Alignment ==
 
== Alignment ==
  
=== Which Aligner to use? ===
+
Align reads to the genome with bwa.
 
 
*[http://bio-bwa.sourceforge.net/ bwa]: Written and maintained by Heng Li while with Richard Durbin's group at the Sanger Institute.  It has become the defacto standard aligner for NGS-based variant discovery pipelines due in large part to it's leading role in analyses by the 1000 Genomes project.  It will be the aligner described below, however the merits of the other aligners should be carefully considered.
 
*[http://bowtie-bio.sourceforge.net/bowtie2/index.shtml Bowtie2]: Bowtie2 is developed and maintained by Steve Salsberg's group at Johns Hopkins University.  It is the major competitor to bwa and has developed a strong following.
 
*[http://bioinformatics.bc.edu/marthlab/Mosaik Mosaic]: Written and maintained by Gabor Marth's group at Boston College, Mosaic produces gapped alignments using the Smith-Waterman algorithm which means that it should do a better job at indel discovery and prehaps do better at avoiding false-positive SNV calls in the vicinity of indels.
 
  
=== What reference assembly to use? ===
+
Index the reference sequence if this has not already been done
  
The first consideration when aligning NGS data with any aligner is, "What reference sequence will the reads be aligned to?".  The answer is of course, "Use the current human reference assembly",  which at the time of this writing was [http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/ GRCh37.p12] (also referred to as hg19 by the UCSC Genome Browser).  However, it is important to include decoy sequences with the reference assembly because they can prevent reads that represent sequence not found in the reference assembly (retro-viral sequence, unanchored contigs, alternate haplotypes etc) from mis-aligning to the reference and creating false positive variant calls.  The 1000 Genomes project has put together such a baited reference assembly Fasta file. This is the primary file to use for alignments - particularly if the 1000 Genomes data will be used for comparison later on in the pipeline.  The file, as well as a document describing it's construction can be found with the 1KG technical reference data at:
+
  bwa index -a bwtsw human_g1K_v37.fasta
* [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz  hs37d5.fa.gz]
 
* [ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/README_human_reference_20110707 README_human_reference_20110707]
 
  
=== Alignment of Illumina HiSeq PE 101 bp sequences with bwa ===
+
The 'bwa aln' program will find the reference coordinates of the input reads (independent of their mate-pair). The following parameters are those used by the 1KG project for aligning Illumina data.
  
The first step is to index the reference sequence.
+
bwa aln -q 15 human_g1K_v37.fasta file.fastq > Sample1_L1_R1.sai # One lane of reads first in pair
 +
bwa aln -q 15 human_g1K_v37.fasta file.fastq > Sample1_L1_R2.sai # One lane of reads second in pair
  
  bwa index -a bwtsw $reference_fasta
+
The 'bwa sampe'. For paired-end reads, the maximum insert size is taken to be 3 times the expected insert size.
  
The first alignment step will find the reference coordinates of the input reads (independent of their mate-pair). The bwa aligner has many options which can control various aspects of the alignment. For the most-part we use default parameters for the alignment and this will likely suffice for most circumstances, however the command line parameters should be carefully considered at the start of each project. See the [http://bio-bwa.sourceforge.net/bwa.shtml bwa manual page] and the [[Bwa | Yandbeck Lab Wiki page on bwa] for guidance in choosing appropriate command-line parameters. The following parameters are those used by the 1KG project for aligning Illumina data.
+
  bwa sampe -P human_g1K_v37.fasta Sample1_L1_R1.sai Sample1_L1_R1.sai Sample1_L1_R1.fastq Sample1_L1_R2.fastq | samtools view -bt human_g1K_v37.fasta.fai -o Sample1_L1 -
  
bwa aln -q 15 reference.fasta file.fastq > output.sai
+
<span style="background:#FFFF00">This will switch to bwa mem soon.
  
Next create SAM files using bwa sampe. For paired-end reads, the maximum insert size is taken to be 3 times the expected insert size.
+
bwa mem human_g1K_v37.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq | samtools view -bt human_g1K_v37.fasta.fai -o Sample1_L1 -
  
bwa sampe -P reference.fasta output1.sai output2.sai file1.fastq file2.fastq > out.sam
+
</span>
  
Command Line Options:
+
Bwa Command Line Options:
  
 
* -n NUM    max #diff (int) or missing prob under 0.02 err rate (float) [0.04]
 
* -n NUM    max #diff (int) or missing prob under 0.02 err rate (float) [0.04]
Line 132: Line 96:
 
* -f FILE  file to write output to instead of stdout
 
* -f FILE  file to write output to instead of stdout
  
The 1KG pipeline has the following:
+
The AddOrReplaceReadGroups tool will add the correct readgroup ID to the BAM files
 
 
samtools view -bSu $sam_file | samtools sort -n -o - samtools_nsort_tmp |        \
 
samtools fixmate /dev/stdin /dev/stdout | samtools sort -o - samtools_csort_tmp | \
 
samtools fillmd -u - reference.fasta > fixed_sorted.bam
 
 
 
However GATK takes care of both the BAQ calculation and the fixmate step, so the only thing needed now is adding read groups and sorting.  It's also best to switch to Broad tools (Picard/GATK) now if you plan to use them downstream, because subtle incompatibilities of BAM bewteen samtools and Picard/GATK can bite you downstream.
 
  
  nohup java -d64 -Xmx8g -jar /usr/local/picard-tools-1.88/AddOrReplaceReadGroups.jar \
+
  java -Xmx4g -jar picard-tools/AddOrReplaceReadGroups.jar \
     INPUT=9958X1_7_1.bam                                                           \
+
     INPUT=Sample1_L1.bam                                             \
     OUTPUT=9958X1_RG.bam                                                           \
+
     OUTPUT=Sample1_L1.rg.bam                                         \
     CREATE_INDEX=TRUE                                                               \
+
     CREATE_INDEX=TRUE                                               \
    MAX_RECORDS_IN_RAM=2000000                                                      \
+
     VALIDATION_STRINGENCY=LENIENT                                    \
     VALIDATION_STRINGENCY=SILENT                                                    \
+
     SORT_ORDER=coordinate                                           \
    TMP_DIR=/tmp                                                                    \
+
     RGID=9958X1                                                     \
     SORT_ORDER=coordinate                                                           \
+
     RGLB=9958X1                                                     \
     RGID=9958X1                                                                     \
+
     RGPL=illumina                                                   \
     RGLB=9958X1                                                                     \
+
     RGPU=1                                                           \
     RGPL=illumina                                                                   \
+
    TMP_DIR=/big/drive                                              \
     RGPU=1                                                                         \
+
     2> 9958X1_AddOrReplaceReadGroups.error
     2> 9958X1_AddOrReplaceReadGroups.error &
 
  
 
== BAM File Analyses ==
 
== BAM File Analyses ==
  
Alignment BAM files are improved in various ways to help increase the quality and speed of subsequent variant calling.
+
Alignment BAM files are improved in various ways to help increase the quality and speed of subsequent variant calling steps.
 
 
TODO:  Should probably merge all lanes per individual here so that MarkDuplicates and IndelRealigner does their best work.
 
  
 
=== Merge lane level BAMs to individual ===  
 
=== Merge lane level BAMs to individual ===  
  
  java -jar -Xmx4g /tools/picard-tools-1.90/MergeSamFiles.jar \                                                                                            
+
  java -Xmx4g -jar picard-tools/MergeSamFiles.jar \
     INPUT=alignment.bam         \                                                                                                                        
+
     INPUT=Sample1_L1.rg.bam                     \
     INPUT=alignment.bam         \                                                                                                                        
+
     INPUT=Sample1_L2.rg.bam                     \                                                                                                
     OUTPUT=merged.bam           \                                                                                                                        
+
     OUTPUT=Sample1.bam                         \
     CREATE_INDEX=TRUE           \                                                                                                                        
+
     CREATE_INDEX=TRUE                           \
     ASSUME_SORTED=true           \                                                                                                                        
+
     ASSUME_SORTED=true                         \
     USE_THREADING=true           \                                                                                                                        
+
     USE_THREADING=true                         \
     TMP_DIR=/tmp                \                                                                                                                        
+
     VALIDATION_STRINGENCY=SILENT                \
     MAX_RECORDS_IN_RAM=30000000  \                                                                                                                        
+
     TMP_DIR=/big/drive                          \
     SORT_ORDER=coordinate        \                                                                                                                       
+
     2> MergeSameFiles.error
    VALIDATION_STRINGENCY=SILENT
 
  
=== GATK Local Realignment of Indels ===
+
=== Mark Duplicates ===
  
java -Xmx10g -jar /usr/local/GenomeAnalysisTK-2.4-7-g5e89f01//GenomeAnalysisTK.jar                        \
+
Remove PCR/Optical duplicate reads
    -I all_bams.list                              \
 
    -T RealignerTargetCreator                      \
 
    -R fasta/hs_ref_GRCh37.p10_ALL_chr_hs37d5.fa   \
 
    -o realign.intervals                              \
 
    -known known_indels/ALL.wgs.indels_mills_devine_hg19_leftAligned_collapsed_double_hit.indels.sites.vcf \
 
    -known known_indels/ALL.wgs.low_coverage_vqsr.20101123.indels.sites.vcf                                \
 
    -nt 24                                                                                                \
 
    >  RealignerTargetCreator                                                                              \
 
    2> RealignerTargetCreator.error &
 
  
  java -Xmx10g -jar /usr/local/GenomeAnalysisTK-2.4-7-g5e89f01/GenomeAnalysisTK.jar                         \
+
  java -Xmx4g -jar MarkDuplicates.jar   \
    -T IndelRealigner                                                                                      \
+
     INPUT=Sample1.bam                   \
    -R fasta/hs_ref_GRCh37.p10_ALL_chr_hs37d5.fa                                                          \
+
     OUTPUT=Sample1.dedup.bam           \
     -I 9958X1_RG.bam                                                                                       \
+
     METRICS_FILE=lane1_dup_metrics.txt  \
     -I 9958X2_RG.bam                                                                                       \
+
     ASSUME_SORTED=true                  \
     -I 9958X3_RG.bam                                                                                      \
+
     USE_THREADING=true                  \
     -I 9958X4_RG.bam                                                                                      \
+
     VALIDATION_STRINGENCY=SILENT        \
     -targetIntervals realign.intervals                                                                    \
+
     TMP_DIR=/big/drive                  \
     -o 9958X_Realigned.bam                                                                                \
+
     2> MarkDuplicates.error
     -known known_indels/ALL.wgs.indels_mills_devine_hg19_leftAligned_collapsed_double_hit.indels.sites.vcf \
 
    -known known_indels/ALL.wgs.low_coverage_vqsr.20101123.indels.sites.vcf                                \
 
    >  IndelRealigner.out                                                                                  \
 
     2> IndelRealigner.error
 
  
Or this ??:
+
<span style="background:#FFFF00">Develop range for duplicate levels here</span>
java -Djava.io.tmpdir=/local -Xmx8g -jar GenomeAnalysisTK.jar \
 
    -l INFO -T IndelRealigner \
 
    -U ALLOW_UNINDEXED_BAM    \
 
    -I /output/filename.b37_1kg.dedup.bam \
 
    -targetIntervals /resources//hg19/intervals/realign_intervals_hg19_b37_1kg.intervals \
 
    -R /resources//hg19/indices/b37_1kg.fa \
 
    -D /resources//hg19/dbsnp/dbsnp_129_b37_b37_1kg.rod \
 
    -[B:indels,VCF B:indels,VCF] /resources//hg19/indels/1kg.pilot_release.merged.indels.sites./hg19.b37_1kg.vcf \
 
    -o /output/filename.b37_1kg.realigned.bam
 
    -knownsOnly
 
    -LOD 0.4
 
    -compress 0
 
  
Known indels are from the following files:
+
=== Local Realignment of Indels ===
  
*ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_mapping_resources/ALL.wgs.indels_mills_devine_hg19_leftAligned_collapsed_double_hit.indels.sites.vcf.gz
+
java -Xmx4g -jar GenomeAnalysisTK.jar                  \
*ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_mapping_resources/ALL.wgs.low_coverage_vqsr.20101123.indels.sites.vcf.gz
+
    -I All_BAMs.list                                    \
 +
    -o Realign_Intervals.txt                            \
 +
    -T RealignerTargetCreator                          \
 +
    -R human_g1K_v37.fasta                              \
 +
    -known Mills_and_1000G_gold_standard.indels.b37.vcf \
 +
    -known 1000G_phase1.indels.b37.vcf                 \
 +
    -nt 24                                              \
 +
    2> RealignerTargetCreator.error
  
=== Fix Mate Information ===
+
java -Xmx4g -jar GenomeAnalysisTK.jar                  \
 +
    -T IndelRealigner                                  \
 +
    -I Sample1.dedup.bam                                \
 +
    -o Sample1.realign.bam                              \
 +
    -R human_g1K_v37.fasta                              \
 +
    -targetIntervals Realign_Intervals.txt              \
 +
    -known Mills_and_1000G_gold_standard.indels.b37.vcf \
 +
    2> IndelRealigner.error
  
java -jar -Xmx6g /tools/picard-tools-1.32/FixMateInformation.jar \
+
=== Base Quality Score Recalibration ===
    INPUT=/output/filename.b37_1kg.realigned.bam \
 
    OUTPUT=/output/filename.b37_1kg.matefixed.bam \
 
    SORT_ORDER=coordinate                      \
 
    VALIDATION_STRINGENCY=SILENT                \
 
    TMP_DIR=/local
 
  
=== GATK Quality Score Recalibration ===
+
java -Xmx4g -jar GenomeAnalysisTK.jar \
 +
  -T PrintReads                      \
 +
  -I Sample1.realign.bam              \
 +
  -o Sample1.recal.bam                \
 +
  -R human_g1K_v37.fasta              \
 +
  -BQSR recalibration_report.grp      \
  
*CountCovariates and TableRecalibration are no longer options in GATK (http://gatkforums.broadinstitute.org/discussion/1248/countcovariates).
 
*Use BaseRecalibrator for both of the below step now (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_BaseRecalibrator.html)
 
 
java [jvm_args] -jar GenomeAnalysisTK.jar                                \
 
      -T CountCovariates                                                  \
 
      -R $reference_fasta                                                  \
 
      -I $realigned_bam_file                                                \
 
      -recalFile recal_data.csv                                            \
 
      -knownSites $known_sites_file(s)                                      \
 
      -l INFO                                                              \
 
      -L '1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;X;Y;MT' \
 
      -cov ReadGroupCovariate                                              \
 
      -cov QualityScoreCovariate                                          \
 
      -cov CycleCovariate                                                  \
 
      -cov DinucCovariate                                                  \
 
 
java [jvm_args] -jar GenomeAnalysisTK.jar \
 
      -T TableRecalibration                \
 
      -R $reference_fasta                  \
 
      -recalFile recal_data.csv            \
 
      -I $realigned_bam_file                \
 
      -o $recalibrated_bam_file            \
 
      -l INFO                              \
 
      -compress 0                          \
 
      -disable_bam_indexing                \
 
 
 
Known sites for recalibration are from dbSNP135:
 
 
* ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_mapping_resources/ALL.wgs.dbsnp.build135.snps.sites.vcf.gz
 
 
Sort and index recalibrated alignment (~5h)
 
 
java -jar -Xmx3g /tools/picard-tools-1.32/SortSam.jar \
 
    INPUT=/output/filename.b37_1kg.recal.bam            \
 
    OUTPUT=/output/filename.b37_1kg.recal.sorted.bam    \
 
    SORT_ORDER=coordinate                              \
 
    VALIDATION_STRINGENCY=LENIENT                      \
 
    TMP_DIR=/local
 
 
java -jar -Xmx3g /tools/picard-tools-1.32/BuildBamIndex.jar \
 
    INPUT=/output/filename.b37_1kg.recal.sorted.bam          \
 
    OUTPUT=/output/filename.b37_1kg.recal.sorted.bam.bai      \
 
    VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=/local
 
 
Calculate covariates after realignment and recalibration
 
 
java -jar -Xmx2g GenomeAnalysisTK.jar                      \
 
    -l INFO                                                \
 
    -T CountCovariates                                      \
 
    -U ALLOW_UNINDEXED_BAM                                  \
 
    -R /resources//hg19/indices/b37_1kg.fa                  \
 
    --DBSNP /resources//hg19/dbsnp/dbsnp_129_b37_b37_1kg.rod \
 
    -I /output/filename.b37_1kg.recal.sorted.bam            \
 
    -cov ReadGroupcovariate                                \
 
    -cov QualityScoreCovariate                              \
 
    -cov CycleCovariate                                    \
 
    -cov DinucCovariate                                    \
 
    -recalFile /output/filename.b37_1kg.recal.covariate_table.csv
 
 
Analyze covariates before and after
 
 
java -jar -Xmx4g AnalyzeCovariates.jar                            \
 
    -l INFO                                                        \
 
    -resources /resources//hg19/indices/b37_1kg.fa                  \
 
    --recal_file /output/filename.b37_1kg.matefixed.covariate_table.csv \
 
    -outputDir /output/filename.b37_1kg.recal.stats_before/          \
 
    -Rscript ${rscript}
 
    -ignoreQ 5
 
 
java -jar -Xmx4g AnalyzeCovariates.jar                          \
 
    -l INFO                                                      \
 
    -resources /resources//hg19/indices/b37_1kg.fa              \
 
    --recal_file /output/filename.b37_1kg.recal.covariate_table.csv \
 
    -outputDir /output/filename.b37_1kg.recal.stats_after/        \
 
    -Rscript ${rscript}                                          \
 
    -ignoreQ 5
 
 
=== Generate the SAM MD Tag ===
 
 
The MD tag in SAM files provides a string representation of mismatching positions.  The CIGAR string used in SAM files does not provide full detail of mismatch positions.  The samtools calmd command is used to generate MD tags as well as which fix the NM tags (Edit distance to the reference) and introduce the BQ tags (Base Alignment Quality) which can be used during variant calling.
 
 
samtools calmd -Erb recalibrated_file.bam $reference.fasta > $bq_file.bam
 
 
=== Strip Extraneous BAM Tags ===
 
 
Run-level BAMs have extraneous tags (OQ, XM, XG, XO) stripped from them, to reduce total file size by around 30%.
 
 
GATK ReduceReads ???
 
 
=== Merge BAMS ===
 
 
Merge all BAMS for a given individual
 
 
java $jvm_args -jar MergeSamFiles.jar INPUT=$tag_stripped_bam_file(s) OUTPUT=$library_level_bam VALIDATION_STRINGENCY=SILENT
 
 
=== Mark PCR Duplicates ===
 
 
[http://picard.sourceforge.net/ Picard] MarkDuplicates is used to mark PCR Duplicates
 
 
java $jvm_args -jar MarkDuplicates.jar INPUT=$library_level_bam OUTPUT=$markdup_bam_file ASSUME_SORTED=TRUE METRICS_FILE=/dev/null VALIDATION_STRINGENCY=SILENT
 
 
java -Xmx4g -jar /tools/picard-tools-1.32/MarkDuplicates.jar \ INPUT=/output/filename.b37_1kg.sorted.bam \ OUTPUT=/output/filename.b37_1kg.dedup.bam \ METRICS_FILE=/output/filename.b37_1kg.dedup.metrics \ REMOVE_DUPLICATES=false ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=/local
 
 
=== Merge All BAMS for the Project ===
 
 
This happens above... Picard MergeSamFiles is used to Merge all the SAM files for a given project.
 
 
java $jvm_args -jar MergeSamFiles.jar INPUT=$markdup_bam_file(s) OUTPUT=$platform_level_bam VALIDATION_STRINGENCY=SILENT
 
  
 +
<span style="background:#FFFF00">
 
=== BAM Quality Control ===
 
=== BAM Quality Control ===
  
* Picard Tools
+
<span style="background:#FFFF00">
** BamIndexStats
 
** CalculateHsMetrics ??
 
** CleanSam
 
** CollectAlignmentSummaryMetrics
 
** CollectGcBiasMetrics
 
** CollectInsertSizeMetrics
 
** CollectMultipleMetrics
 
** CollectTargetedPcrMetrics
 
** EstimateLibraryComplexity
 
** FixMateInformation
 
** MarkDuplicates
 
** MeanQualityByCycle
 
** QualityScoreDistribution
 
** ValidateSamFile
 
* GATK QC
 
*[http://genome.sph.umich.edu/wiki/Software]
 
*[http://wangzhengyuan.blogspot.com/2012/12/a-tool-for-bam-file-qc-qualimap.html]
 
 
 
 
CollectAlignmentSummaryMetrics
 
CollectAlignmentSummaryMetrics
  java -jar -Xmx4g /tools/picard-tools-1.32/CollectAlignmentSummaryMetrics.jar \
+
  java -Xmx4g -jar /tools/picard-tools-1.32/CollectAlignmentSummaryMetrics.jar \
 
     I=/output/filename.b37_1kg.sorted.bam                                      \
 
     I=/output/filename.b37_1kg.sorted.bam                                      \
 
     O=/output/filename.b37_1kg.AlignmentSummaryMetrics                        \
 
     O=/output/filename.b37_1kg.AlignmentSummaryMetrics                        \
     R=/resources//hg19/indices/b37_1kg.fa                                    \
+
     R=human_g1K_v37.fasta                                                    \
 
     VALIDATION_STRINGENCY=LENIENT                                            \
 
     VALIDATION_STRINGENCY=LENIENT                                            \
     TMP_DIR=/local
+
     TMP_DIR=/big/drive
  
 +
<span style="background:#FFFF00">
 
CollectGcBiasMetrics
 
CollectGcBiasMetrics
  java -jar /tools/picard-tools-1.32/CollectGcBiasMetrics.jar \
+
  java -Xmx4g -jar /tools/picard-tools-1.32/CollectGcBiasMetrics.jar \
     R=/resources//hg19/indices/b37_1kg.fa                    \
+
     R=human_g1K_v37.fasta                                          \
     I=/output/filename.b37_1kg.sorted.bam                     \
+
     I=/output/filename.b37_1kg.sorted.bam                           \
     O=/output/filename.b37_1kg.GcBiasMetrics                 \
+
     O=/output/filename.b37_1kg.GcBiasMetrics                         \
     CHART=/output/filename.b37_1kg.GcBiasMetrics.pdf         \
+
     CHART=/output/filename.b37_1kg.GcBiasMetrics.pdf                 \
     VALIDATION_STRINGENCY=LENIENT                           \
+
     VALIDATION_STRINGENCY=LENIENT                                   \
     TMP_DIR=/local
+
     TMP_DIR=/big/drive
  
 +
<span style="background:#FFFF00">
 
CollectInsertSizeMetrics
 
CollectInsertSizeMetrics
  java -jar /tools/picard-tools-1.32/CollectInsertSizeMetrics.jar \
+
  java -Xmx4g -jar /tools/picard-tools-1.32/CollectInsertSizeMetrics.jar \
     I=/output/filename.b37_1kg.sorted.bam                         \
+
     I=/output/filename.b37_1kg.sorted.bam                               \
     O=/output/filename.b37_1kg.CollectInsertSizeMetrics           \
+
     O=/output/filename.b37_1kg.CollectInsertSizeMetrics                 \
     H=/output/filename.b37_1kg.CollectInsertSizeMetrics.pdf       \
+
     H=/output/filename.b37_1kg.CollectInsertSizeMetrics.pdf             \
     VALIDATION_STRINGENCY=LENIENT                               \
+
     VALIDATION_STRINGENCY=LENIENT                                       \
     TMP_DIR=/local
+
     TMP_DIR=/big/drive
  
 +
<span style="background:#FFFF00">
 
MeanQualityByCycle
 
MeanQualityByCycle
  java -jar /tools/picard-tools-1.32/MeanQualityByCycle.jar \
+
  java -Xmx4g -jar /tools/picard-tools-1.32/MeanQualityByCycle.jar \
     I=/output/filename.b37_1kg.sorted.bam                   \
+
     I=/output/filename.b37_1kg.sorted.bam                         \
     O=/output/filename.b37_1kg.MeanQualityByCycle           \
+
     O=/output/filename.b37_1kg.MeanQualityByCycle                 \
     CHART=/output/filename.b37_1kg.MeanQualityByCycle.pdf   \
+
     CHART=/output/filename.b37_1kg.MeanQualityByCycle.pdf         \
     VALIDATION_STRINGENCY=LENIENT                         \
+
     VALIDATION_STRINGENCY=LENIENT                                 \
     TMP_DIR=/local
+
     TMP_DIR=/big/drive
  
 +
<span style="background:#FFFF00">
 
QualityScoreDistribution
 
QualityScoreDistribution
  java -jar /tools/picard-tools-1.32/QualityScoreDistribution.jar       \
+
  java -Xmx4g -jar /tools/picard-tools-1.32/QualityScoreDistribution.jar \
 
     I=/output/filename.b37_1kg.sorted.bam                                \
 
     I=/output/filename.b37_1kg.sorted.bam                                \
 
     O=/output/filename.b37_1kg.[wiki:QualityScoreDistribution]            \
 
     O=/output/filename.b37_1kg.[wiki:QualityScoreDistribution]            \
 
     CHART=/output/filename.b37_1kg.[wiki:QualityScoreDistribution].pdf    \
 
     CHART=/output/filename.b37_1kg.[wiki:QualityScoreDistribution].pdf    \
 
     VALIDATION_STRINGENCY=LENIENT                                        \
 
     VALIDATION_STRINGENCY=LENIENT                                        \
     TMP_DIR=/local
+
     TMP_DIR=/big/drive
  
 +
<span style="background:#FFFF00">
 
BamIndexStats
 
BamIndexStats
  java -jar /tools/picard-tools-1.32/BamIndexStats.jar \
+
  java -Xmx4g -jar /tools/picard-tools-1.32/BamIndexStats.jar \
     INPUT=/output/filename.b37_1kg.sorted.bam         \
+
     INPUT=/output/filename.b37_1kg.sorted.bam                 \
     VALIDATION_STRINGENCY=LENIENT                     \
+
     VALIDATION_STRINGENCY=LENIENT                           \
     TMP_DIR=/local
+
     TMP_DIR=/big/drive
  
 +
<span style="background:#FFFF00">
 
CalculateHsMetricsWholeGenome
 
CalculateHsMetricsWholeGenome
  java -jar -Xmx3g /tools/picard-tools-1.32/CalculateHsMetricsWholeGenome.jar \
+
  java -Xmx4g -jar /tools/picard-tools-1.32/CalculateHsMetricsWholeGenome.jar \
 
     INPUT=/output/filename.b37_1kg.sorted.bam                                \
 
     INPUT=/output/filename.b37_1kg.sorted.bam                                \
 
     OUTPUT=/output/filename.b37_1kg.HsMetrics                                \
 
     OUTPUT=/output/filename.b37_1kg.HsMetrics                                \
Line 416: Line 239:
 
     TARGET_INTERVALS=/resources//hg19/intervals/GoNL.interval_list          \
 
     TARGET_INTERVALS=/resources//hg19/intervals/GoNL.interval_list          \
 
     VALIDATION_STRINGENCY=LENIENT                                            \
 
     VALIDATION_STRINGENCY=LENIENT                                            \
     TMP_DIR=/local
+
     TMP_DIR=/big/drive
 +
</span>
  
=== Split BAMS by Chromosome ===
+
=== Reduce Reads ===
  
BAMs are split into chromosomes BAMs. These files move on to variant calling.
+
java -Xmx4g -jar GenomeAnalysisTK.jar \
 
+
  -T ReduceReads                      \
=== Compress BAMS with ReduceRead ===
+
  -I Sample1.recal.bam                \
 +
  -o Sample1.reduced.bam              \
 +
  -R human_g1K_v37.fasta              \
 +
  2> ReduceReads.error
  
 
== Variant Calling ==
 
== Variant Calling ==
  
 
=== UnifiedGenotyper ===
 
=== UnifiedGenotyper ===
  java -jar GenomeAnalysisTK.jar \
+
  java -Xmx4g -jar GenomeAnalysisTK.jar \
   -R resources/Homo_sapiens_assembly18.fasta \
+
   -T UnifiedGenotyper                  \
   -T UnifiedGenotyper \
+
   -I Sample1.reduced.bam              \
   -I sample1.bam [-I sample2.bam ...] \
+
   -I Sample2.reduced.bam             \
   --dbsnp dbSNP.vcf \
+
  -I Background1.reduced.bam         \
   -o snps.raw.vcf \
+
   -I Background2.reduced.bam          \
   -stand_call_conf [50.0] \
+
   -o Sample1.raw.vcf                 \
   -stand_emit_conf 10.0 \
+
   -R human_g1K_v37.fasta              \
   -dcov [50 for 4x, 200 for >30x WGS or Whole exome] \
+
   --dbsnp dbSNP.vcf                    \
   [-L targets.interval_list]
+
   -dcov 250                          \
+
   -L TruSeq_exome_regions.bed        \
 
+
  2> UnifiedGenotyper.error
  
 
=== VariantRecalibrator ===
 
=== VariantRecalibrator ===
 
  java -Xmx4g -jar GenomeAnalysisTK.jar \
 
  java -Xmx4g -jar GenomeAnalysisTK.jar \
   -T VariantRecalibrator \
+
   -T VariantRecalibrator             \
   -R reference/human_g1k_v37.fasta \
+
   -R human_g1K_v37.fasta             \
   -input NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.vcf \
+
   -input Sample1.raw.vcf             \
   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \
+
   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf         \
   -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \
+
   -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf       \
   -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \
+
   -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_137.b37.vcf           \
  -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff \
 
  -mode SNP \
 
  -recalFile path/to/output.recal \
 
  -tranchesFile path/to/output.tranches \
 
  -rscriptFile path/to/output.plots.R
 
 
=== ApplyRecalibration ===
 
java -Xmx3g -jar GenomeAnalysisTK.jar \
 
  -T ApplyRecalibration \
 
  -R reference/human_g1k_v37.fasta \
 
  -input NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.vcf \
 
  --ts_filter_level 99.0 \
 
  -tranchesFile path/to/output.tranches \
 
  -recalFile path/to/output.recal \
 
  -mode SNP \
 
  -o path/to/output.recalibrated.filtered.vcf
 
  
== Variant File Analyses ==
+
ADD 1KG
  
== Family Based Analyses ==
+
  -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff \
 +
  -mode SNP                          \
 +
  -recalFile Sample1.recal            \
 +
  -tranchesFile Sample1.tranches      \
 +
  -rscriptFile Sample1.plots.R        \
 +
  2> VariantRecalibrator.error
  
=== IBD/SGS ===
+
=== ApplyRecalibration ===
 
+
java -Xmx4g -jar GenomeAnalysisTK.jar \
=== Linkage ===
+
  -T ApplyRecalibration              \
 
+
  -input Sample1.raw.vcf              \
== Population Based Analyses ==
+
  -o Sample1.vqsr.vcf                \
 
+
  -R human_g1K_v37.fasta              \
=== Phasing ===
+
  --ts_filter_level 99.0                \
 
+
  -tranchesFile Sample1.tranches      \
=== FST ===
+
  -recalFile Sample1.recal            \
 
+
  -mode SNP                          \
== Phenotype Based Analyses ==
+
  2> ApplyRecalibration.error
  
=== GWAS ===
+
== Variant File QC ==
  
==== VAAST ====
+
<span style="background:#FFFF00">
 +
Quality Metrics on variants
 +
*Ti/Tv Ratio (2.1 for WGS ~2.8 for exome)
 +
*HapMap concordance
 +
*SNV/Indel Counts
 +
*Rare variant enrichment
 +
*DP
 +
*Q
 +
*GQ

Latest revision as of 19:04, 6 September 2013

Utah Genome Project

Variant Calling Pipeline Version 0.0.2

Sept. 2013

Data Source

Data sets used for the variant calling pipeline come from the Broad GSA (GATK) group as the 'GATK resource bundle 2.5' version 2.5

wget -r ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37
  • Reference Genome (GRCh37):
    • human_g1K_v37.fasta
  • High confidence SNVs
    • 1000G_omni2.5.b37.vcf
    • hapmap_3.3.b37.vcf
  • dbSNP
    • dbsnp_137.b37.vcf.gz
  • Known indels
    • Mills_and_1000G_gold_standard.indels.b37.vcf.gz
    • 1000G_phase1.indels.b37.vcf

Sequencing

This pipeline is designed for 100 bp Illumina HiSeq PE exome or WGS sequence data with Sanger/Illumina 1.9 quality encoding.

Validate File Integrity with md5sum

An md5sum signature should be provided for each FastQ file by the sequencing center. After the file has been downloaded, locally check the md5sum to be sure that no data corruption occurred during the file transfer.

md5sum file.fastq > file_local.md5
diff file_local.md5 file_provided.md5

If the md5sum signature differs from that provided for the file:

  • Check to be sure you have the correct file.
  • Check if the md5sum was calculated on that compressed or uncompressed file by the provider and be sure to do the same with the local copy.
  • Try the download again.
  • Contact the sequence provider.

FastQ File Analyses

fastqc Sample1_L1_R1.txt

From the fastqc_data.txt file we check the following values:

  • Encoding (must be Sanger / Illumina 1.9)
  • Total Sequences (Need to develop a acceptable range)
  • Filtered Sequences (Should be 0 or at least very low)
  • Sequence length (must be ~ 100 bp)
  • %GC (should be 45 < x < 55)
  • Total Duplicate Percentage (This value may not be valuable - an acceptable range has not been determined).

Alignment

Align reads to the genome with bwa.

Index the reference sequence if this has not already been done

bwa index -a bwtsw human_g1K_v37.fasta

The 'bwa aln' program will find the reference coordinates of the input reads (independent of their mate-pair). The following parameters are those used by the 1KG project for aligning Illumina data.

bwa aln -q 15 human_g1K_v37.fasta file.fastq > Sample1_L1_R1.sai # One lane of reads first in pair
bwa aln -q 15 human_g1K_v37.fasta file.fastq > Sample1_L1_R2.sai # One lane of reads second in pair

The 'bwa sampe'. For paired-end reads, the maximum insert size is taken to be 3 times the expected insert size.

bwa sampe -P human_g1K_v37.fasta Sample1_L1_R1.sai Sample1_L1_R1.sai Sample1_L1_R1.fastq Sample1_L1_R2.fastq | samtools view -bt human_g1K_v37.fasta.fai -o Sample1_L1 -

This will switch to bwa mem soon.

bwa mem human_g1K_v37.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq | samtools view -bt human_g1K_v37.fasta.fai -o Sample1_L1 -

Bwa Command Line Options:

  • -n NUM max #diff (int) or missing prob under 0.02 err rate (float) [0.04]
  • -o INT maximum number or fraction of gap opens [1]
  • -e INT maximum number of gap extensions, -1 for disabling long gaps [-1]
  • -i INT do not put an indel within INT bp towards the ends [5]
  • -d INT maximum occurrences for extending a long deletion [10]
  • -l INT seed length [32]
  • -k INT maximum differences in the seed [2]
  • -m INT maximum entries in the queue [2000000]
  • -t INT number of threads [1]
  • -M INT mismatch penalty [3]
  • -O INT gap open penalty [11]
  • -E INT gap extension penalty [4]
  • -R INT stop searching when there are >INT equally best hits [30]
  • -q INT quality threshold for read trimming down to 35bp [0]
  • -c input sequences are in the color space
  • -L log-scaled gap penalty for long deletions
  • -N non-iterative mode: search for all n-difference hits (slooow)
  • -f FILE file to write output to instead of stdout

The AddOrReplaceReadGroups tool will add the correct readgroup ID to the BAM files

java -Xmx4g -jar picard-tools/AddOrReplaceReadGroups.jar \
   INPUT=Sample1_L1.bam                                             \
   OUTPUT=Sample1_L1.rg.bam                                         \
   CREATE_INDEX=TRUE                                                \
   VALIDATION_STRINGENCY=LENIENT                                    \
   SORT_ORDER=coordinate                                            \
   RGID=9958X1                                                      \
   RGLB=9958X1                                                      \
   RGPL=illumina                                                    \
   RGPU=1                                                           \
   TMP_DIR=/big/drive                                               \
   2> 9958X1_AddOrReplaceReadGroups.error

BAM File Analyses

Alignment BAM files are improved in various ways to help increase the quality and speed of subsequent variant calling steps.

Merge lane level BAMs to individual

java -Xmx4g -jar picard-tools/MergeSamFiles.jar \
    INPUT=Sample1_L1.rg.bam                     \
    INPUT=Sample1_L2.rg.bam                     \                                                                                                 
    OUTPUT=Sample1.bam                          \
    CREATE_INDEX=TRUE                           \
    ASSUME_SORTED=true                          \
    USE_THREADING=true                          \
    VALIDATION_STRINGENCY=SILENT                \
    TMP_DIR=/big/drive                          \
    2> MergeSameFiles.error

Mark Duplicates

Remove PCR/Optical duplicate reads

java -Xmx4g -jar MarkDuplicates.jar    \
   INPUT=Sample1.bam                   \
   OUTPUT=Sample1.dedup.bam            \
   METRICS_FILE=lane1_dup_metrics.txt  \
   ASSUME_SORTED=true                  \
   USE_THREADING=true                  \
   VALIDATION_STRINGENCY=SILENT        \
   TMP_DIR=/big/drive                  \
   2> MarkDuplicates.error

Develop range for duplicate levels here

Local Realignment of Indels

java -Xmx4g -jar GenomeAnalysisTK.jar                  \
   -I All_BAMs.list                                    \
   -o Realign_Intervals.txt                            \
   -T RealignerTargetCreator                           \
   -R human_g1K_v37.fasta                              \
   -known Mills_and_1000G_gold_standard.indels.b37.vcf \
   -known 1000G_phase1.indels.b37.vcf                  \
   -nt 24                                              \
   2> RealignerTargetCreator.error
java -Xmx4g -jar GenomeAnalysisTK.jar                  \
   -T IndelRealigner                                   \
   -I Sample1.dedup.bam                                \
   -o Sample1.realign.bam                              \
   -R human_g1K_v37.fasta                              \
   -targetIntervals Realign_Intervals.txt              \
   -known Mills_and_1000G_gold_standard.indels.b37.vcf \
   2> IndelRealigner.error

Base Quality Score Recalibration

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -T PrintReads                       \
  -I Sample1.realign.bam              \
  -o Sample1.recal.bam                \
  -R human_g1K_v37.fasta              \
  -BQSR recalibration_report.grp      \


BAM Quality Control

CollectAlignmentSummaryMetrics

java -Xmx4g -jar /tools/picard-tools-1.32/CollectAlignmentSummaryMetrics.jar \
   I=/output/filename.b37_1kg.sorted.bam                                      \
   O=/output/filename.b37_1kg.AlignmentSummaryMetrics                         \
   R=human_g1K_v37.fasta                                                     \
   VALIDATION_STRINGENCY=LENIENT                                             \
   TMP_DIR=/big/drive

CollectGcBiasMetrics

java -Xmx4g -jar /tools/picard-tools-1.32/CollectGcBiasMetrics.jar \
   R=human_g1K_v37.fasta                                           \
   I=/output/filename.b37_1kg.sorted.bam                            \
   O=/output/filename.b37_1kg.GcBiasMetrics                         \
   CHART=/output/filename.b37_1kg.GcBiasMetrics.pdf                 \
   VALIDATION_STRINGENCY=LENIENT                                   \
   TMP_DIR=/big/drive

CollectInsertSizeMetrics

java -Xmx4g -jar /tools/picard-tools-1.32/CollectInsertSizeMetrics.jar \
   I=/output/filename.b37_1kg.sorted.bam                                \
   O=/output/filename.b37_1kg.CollectInsertSizeMetrics                  \
   H=/output/filename.b37_1kg.CollectInsertSizeMetrics.pdf              \
   VALIDATION_STRINGENCY=LENIENT                                       \
   TMP_DIR=/big/drive

MeanQualityByCycle

java -Xmx4g -jar /tools/picard-tools-1.32/MeanQualityByCycle.jar \
   I=/output/filename.b37_1kg.sorted.bam                          \
   O=/output/filename.b37_1kg.MeanQualityByCycle                  \
   CHART=/output/filename.b37_1kg.MeanQualityByCycle.pdf          \
   VALIDATION_STRINGENCY=LENIENT                                 \
   TMP_DIR=/big/drive

QualityScoreDistribution

java -Xmx4g -jar /tools/picard-tools-1.32/QualityScoreDistribution.jar  \
   I=/output/filename.b37_1kg.sorted.bam                                 \
   O=/output/filename.b37_1kg.[wiki:QualityScoreDistribution]            \
   CHART=/output/filename.b37_1kg.[wiki:QualityScoreDistribution].pdf    \
   VALIDATION_STRINGENCY=LENIENT                                        \
   TMP_DIR=/big/drive

BamIndexStats

java -Xmx4g -jar /tools/picard-tools-1.32/BamIndexStats.jar \
   INPUT=/output/filename.b37_1kg.sorted.bam                 \
   VALIDATION_STRINGENCY=LENIENT                            \
   TMP_DIR=/big/drive

CalculateHsMetricsWholeGenome

java -Xmx4g -jar /tools/picard-tools-1.32/CalculateHsMetricsWholeGenome.jar \
   INPUT=/output/filename.b37_1kg.sorted.bam                                 \
   OUTPUT=/output/filename.b37_1kg.HsMetrics                                 \
   BAIT_INTERVALS=/resources//hg19/intervals/GoNL.interval_list             \
   TARGET_INTERVALS=/resources//hg19/intervals/GoNL.interval_list           \
   VALIDATION_STRINGENCY=LENIENT                                            \
   TMP_DIR=/big/drive

Reduce Reads

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -T ReduceReads                      \
  -I Sample1.recal.bam                \
  -o Sample1.reduced.bam              \
  -R human_g1K_v37.fasta              \
  2> ReduceReads.error

Variant Calling

UnifiedGenotyper

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -T UnifiedGenotyper                  \
  -I Sample1.reduced.bam              \
  -I Sample2.reduced.bam              \
  -I Background1.reduced.bam          \
  -I Background2.reduced.bam          \
  -o Sample1.raw.vcf                  \
  -R human_g1K_v37.fasta              \
  --dbsnp dbSNP.vcf                    \
  -dcov 250                           \
  -L TruSeq_exome_regions.bed         \
  2> UnifiedGenotyper.error

VariantRecalibrator

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -T VariantRecalibrator              \
  -R human_g1K_v37.fasta              \
  -input Sample1.raw.vcf              \
  -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf          \
  -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf        \
  -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_137.b37.vcf            \

ADD 1KG

  -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff \
  -mode SNP                           \
  -recalFile Sample1.recal            \
  -tranchesFile Sample1.tranches      \
  -rscriptFile Sample1.plots.R        \
  2> VariantRecalibrator.error

ApplyRecalibration

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -T ApplyRecalibration               \
  -input Sample1.raw.vcf              \
  -o Sample1.vqsr.vcf                 \
  -R human_g1K_v37.fasta              \
  --ts_filter_level 99.0                \
  -tranchesFile Sample1.tranches      \
  -recalFile Sample1.recal            \
  -mode SNP                           \
  2> ApplyRecalibration.error

Variant File QC

Quality Metrics on variants

  • Ti/Tv Ratio (2.1 for WGS ~2.8 for exome)
  • HapMap concordance
  • SNV/Indel Counts
  • Rare variant enrichment
  • DP
  • Q
  • GQ