Difference between revisions of "UGP Variant Pipeline 0.0.2"
(21 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
= Utah Genome Project = | = Utah Genome Project = | ||
− | == Variant Calling Pipeline 0.0.2 == | + | == Variant Calling Pipeline Version 0.0.2 == |
== Sept. 2013 == | == Sept. 2013 == | ||
Line 10: | Line 10: | ||
* Reference Genome (GRCh37): | * Reference Genome (GRCh37): | ||
− | + | **human_g1K_v37.fasta | |
* High confidence SNVs | * High confidence SNVs | ||
− | + | **1000G_omni2.5.b37.vcf | |
− | + | **hapmap_3.3.b37.vcf | |
* dbSNP | * dbSNP | ||
− | + | **dbsnp_137.b37.vcf.gz | |
− | * | + | * Known indels |
− | + | **Mills_and_1000G_gold_standard.indels.b37.vcf.gz | |
+ | **1000G_phase1.indels.b37.vcf | ||
== Sequencing == | == Sequencing == | ||
Line 41: | Line 42: | ||
== FastQ File Analyses == | == FastQ File Analyses == | ||
− | fastqc | + | fastqc Sample1_L1_R1.txt |
From the fastqc_data.txt file we check the following values: | From the fastqc_data.txt file we check the following values: | ||
Line 48: | Line 49: | ||
* Filtered Sequences (Should be 0 or at least very low) | * Filtered Sequences (Should be 0 or at least very low) | ||
* Sequence length (must be ~ 100 bp) | * Sequence length (must be ~ 100 bp) | ||
− | * %GC (should be 45 < x < 55) | + | * %GC (<span style="background:#FFFF00">should be 45 < x < 55</span>) |
* Total Duplicate Percentage (<span style="background:#FFFF00">This value may not be valuable - an acceptable range has not been determined</span>). | * Total Duplicate Percentage (<span style="background:#FFFF00">This value may not be valuable - an acceptable range has not been determined</span>). | ||
== Alignment == | == Alignment == | ||
+ | Align reads to the genome with bwa. | ||
Index the reference sequence if this has not already been done | Index the reference sequence if this has not already been done | ||
− | bwa index -a bwtsw | + | bwa index -a bwtsw human_g1K_v37.fasta |
− | The 'bwa aln' program will find the reference coordinates of the input reads (independent of their mate-pair). The following parameters are those used by the 1KG project for aligning Illumina data | + | The 'bwa aln' program will find the reference coordinates of the input reads (independent of their mate-pair). The following parameters are those used by the 1KG project for aligning Illumina data. |
− | bwa aln -q 15 | + | bwa aln -q 15 human_g1K_v37.fasta file.fastq > Sample1_L1_R1.sai # One lane of reads first in pair |
+ | bwa aln -q 15 human_g1K_v37.fasta file.fastq > Sample1_L1_R2.sai # One lane of reads second in pair | ||
The 'bwa sampe'. For paired-end reads, the maximum insert size is taken to be 3 times the expected insert size. | The 'bwa sampe'. For paired-end reads, the maximum insert size is taken to be 3 times the expected insert size. | ||
− | bwa sampe -P | + | bwa sampe -P human_g1K_v37.fasta Sample1_L1_R1.sai Sample1_L1_R1.sai Sample1_L1_R1.fastq Sample1_L1_R2.fastq | samtools view -bt human_g1K_v37.fasta.fai -o Sample1_L1 - |
+ | |||
+ | <span style="background:#FFFF00">This will switch to bwa mem soon. | ||
+ | |||
+ | bwa mem human_g1K_v37.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq | samtools view -bt human_g1K_v37.fasta.fai -o Sample1_L1 - | ||
+ | |||
+ | </span> | ||
Bwa Command Line Options: | Bwa Command Line Options: | ||
Line 89: | Line 98: | ||
The AddOrReplaceReadGroups tool will add the correct readgroup ID to the BAM files | The AddOrReplaceReadGroups tool will add the correct readgroup ID to the BAM files | ||
− | + | java -Xmx4g -jar picard-tools/AddOrReplaceReadGroups.jar \ | |
− | INPUT= | + | INPUT=Sample1_L1.bam \ |
− | OUTPUT= | + | OUTPUT=Sample1_L1.rg.bam \ |
− | CREATE_INDEX=TRUE | + | CREATE_INDEX=TRUE \ |
− | + | VALIDATION_STRINGENCY=LENIENT \ | |
− | VALIDATION_STRINGENCY= | + | SORT_ORDER=coordinate \ |
− | + | RGID=9958X1 \ | |
− | SORT_ORDER=coordinate | + | RGLB=9958X1 \ |
− | RGID=9958X1 | + | RGPL=illumina \ |
− | RGLB=9958X1 | + | RGPU=1 \ |
− | RGPL=illumina | + | TMP_DIR=/big/drive \ |
− | RGPU=1 | + | 2> 9958X1_AddOrReplaceReadGroups.error |
− | 2> 9958X1_AddOrReplaceReadGroups.error | ||
== BAM File Analyses == | == BAM File Analyses == | ||
Line 109: | Line 117: | ||
=== Merge lane level BAMs to individual === | === Merge lane level BAMs to individual === | ||
− | java -jar | + | java -Xmx4g -jar picard-tools/MergeSamFiles.jar \ |
− | INPUT= | + | INPUT=Sample1_L1.rg.bam \ |
− | INPUT= | + | INPUT=Sample1_L2.rg.bam \ |
− | OUTPUT= | + | OUTPUT=Sample1.bam \ |
− | CREATE_INDEX=TRUE | + | CREATE_INDEX=TRUE \ |
− | ASSUME_SORTED=true | + | ASSUME_SORTED=true \ |
− | USE_THREADING=true | + | USE_THREADING=true \ |
− | + | VALIDATION_STRINGENCY=SILENT \ | |
− | + | TMP_DIR=/big/drive \ | |
− | + | 2> MergeSameFiles.error | |
− | |||
− | === | + | === Mark Duplicates === |
− | + | Remove PCR/Optical duplicate reads | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | java - | + | java -Xmx4g -jar MarkDuplicates.jar \ |
− | + | INPUT=Sample1.bam \ | |
− | + | OUTPUT=Sample1.dedup.bam \ | |
− | + | METRICS_FILE=lane1_dup_metrics.txt \ | |
− | + | ASSUME_SORTED=true \ | |
− | + | USE_THREADING=true \ | |
− | + | VALIDATION_STRINGENCY=SILENT \ | |
− | + | TMP_DIR=/big/drive \ | |
− | + | 2> MarkDuplicates.error | |
− | |||
− | |||
− | |||
− | 2> | ||
− | + | <span style="background:#FFFF00">Develop range for duplicate levels here</span> | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | === Local Realignment of Indels === | |
− | + | java -Xmx4g -jar GenomeAnalysisTK.jar \ | |
− | + | -I All_BAMs.list \ | |
+ | -o Realign_Intervals.txt \ | ||
+ | -T RealignerTargetCreator \ | ||
+ | -R human_g1K_v37.fasta \ | ||
+ | -known Mills_and_1000G_gold_standard.indels.b37.vcf \ | ||
+ | -known 1000G_phase1.indels.b37.vcf \ | ||
+ | -nt 24 \ | ||
+ | 2> RealignerTargetCreator.error | ||
− | + | java -Xmx4g -jar GenomeAnalysisTK.jar \ | |
+ | -T IndelRealigner \ | ||
+ | -I Sample1.dedup.bam \ | ||
+ | -o Sample1.realign.bam \ | ||
+ | -R human_g1K_v37.fasta \ | ||
+ | -targetIntervals Realign_Intervals.txt \ | ||
+ | -known Mills_and_1000G_gold_standard.indels.b37.vcf \ | ||
+ | 2> IndelRealigner.error | ||
− | + | === Base Quality Score Recalibration === | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | java -Xmx4g -jar GenomeAnalysisTK.jar \ | |
− | + | -T PrintReads \ | |
− | + | -I Sample1.realign.bam \ | |
− | + | -o Sample1.recal.bam \ | |
− | + | -R human_g1K_v37.fasta \ | |
− | java | + | -BQSR recalibration_report.grp \ |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
+ | <span style="background:#FFFF00"> | ||
=== BAM Quality Control === | === BAM Quality Control === | ||
− | + | <span style="background:#FFFF00"> | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
CollectAlignmentSummaryMetrics | CollectAlignmentSummaryMetrics | ||
− | java -jar | + | java -Xmx4g -jar /tools/picard-tools-1.32/CollectAlignmentSummaryMetrics.jar \ |
I=/output/filename.b37_1kg.sorted.bam \ | I=/output/filename.b37_1kg.sorted.bam \ | ||
O=/output/filename.b37_1kg.AlignmentSummaryMetrics \ | O=/output/filename.b37_1kg.AlignmentSummaryMetrics \ | ||
− | R= | + | R=human_g1K_v37.fasta \ |
VALIDATION_STRINGENCY=LENIENT \ | VALIDATION_STRINGENCY=LENIENT \ | ||
− | TMP_DIR=/ | + | TMP_DIR=/big/drive |
+ | <span style="background:#FFFF00"> | ||
CollectGcBiasMetrics | CollectGcBiasMetrics | ||
− | java -jar /tools/picard-tools-1.32/CollectGcBiasMetrics.jar \ | + | java -Xmx4g -jar /tools/picard-tools-1.32/CollectGcBiasMetrics.jar \ |
− | R= | + | R=human_g1K_v37.fasta \ |
− | I=/output/filename.b37_1kg.sorted.bam | + | I=/output/filename.b37_1kg.sorted.bam \ |
− | O=/output/filename.b37_1kg.GcBiasMetrics | + | O=/output/filename.b37_1kg.GcBiasMetrics \ |
− | CHART=/output/filename.b37_1kg.GcBiasMetrics.pdf | + | CHART=/output/filename.b37_1kg.GcBiasMetrics.pdf \ |
− | VALIDATION_STRINGENCY=LENIENT | + | VALIDATION_STRINGENCY=LENIENT \ |
− | TMP_DIR=/ | + | TMP_DIR=/big/drive |
+ | <span style="background:#FFFF00"> | ||
CollectInsertSizeMetrics | CollectInsertSizeMetrics | ||
− | java -jar /tools/picard-tools-1.32/CollectInsertSizeMetrics.jar \ | + | java -Xmx4g -jar /tools/picard-tools-1.32/CollectInsertSizeMetrics.jar \ |
− | I=/output/filename.b37_1kg.sorted.bam | + | I=/output/filename.b37_1kg.sorted.bam \ |
− | O=/output/filename.b37_1kg.CollectInsertSizeMetrics | + | O=/output/filename.b37_1kg.CollectInsertSizeMetrics \ |
− | H=/output/filename.b37_1kg.CollectInsertSizeMetrics.pdf | + | H=/output/filename.b37_1kg.CollectInsertSizeMetrics.pdf \ |
− | VALIDATION_STRINGENCY=LENIENT | + | VALIDATION_STRINGENCY=LENIENT \ |
− | TMP_DIR=/ | + | TMP_DIR=/big/drive |
+ | <span style="background:#FFFF00"> | ||
MeanQualityByCycle | MeanQualityByCycle | ||
− | java -jar /tools/picard-tools-1.32/MeanQualityByCycle.jar \ | + | java -Xmx4g -jar /tools/picard-tools-1.32/MeanQualityByCycle.jar \ |
− | I=/output/filename.b37_1kg.sorted.bam | + | I=/output/filename.b37_1kg.sorted.bam \ |
− | O=/output/filename.b37_1kg.MeanQualityByCycle | + | O=/output/filename.b37_1kg.MeanQualityByCycle \ |
− | CHART=/output/filename.b37_1kg.MeanQualityByCycle.pdf | + | CHART=/output/filename.b37_1kg.MeanQualityByCycle.pdf \ |
− | VALIDATION_STRINGENCY=LENIENT | + | VALIDATION_STRINGENCY=LENIENT \ |
− | TMP_DIR=/ | + | TMP_DIR=/big/drive |
+ | <span style="background:#FFFF00"> | ||
QualityScoreDistribution | QualityScoreDistribution | ||
− | java | + | java -Xmx4g -jar /tools/picard-tools-1.32/QualityScoreDistribution.jar \ |
I=/output/filename.b37_1kg.sorted.bam \ | I=/output/filename.b37_1kg.sorted.bam \ | ||
O=/output/filename.b37_1kg.[wiki:QualityScoreDistribution] \ | O=/output/filename.b37_1kg.[wiki:QualityScoreDistribution] \ | ||
CHART=/output/filename.b37_1kg.[wiki:QualityScoreDistribution].pdf \ | CHART=/output/filename.b37_1kg.[wiki:QualityScoreDistribution].pdf \ | ||
VALIDATION_STRINGENCY=LENIENT \ | VALIDATION_STRINGENCY=LENIENT \ | ||
− | TMP_DIR=/ | + | TMP_DIR=/big/drive |
+ | <span style="background:#FFFF00"> | ||
BamIndexStats | BamIndexStats | ||
− | java -jar /tools/picard-tools-1.32/BamIndexStats.jar \ | + | java -Xmx4g -jar /tools/picard-tools-1.32/BamIndexStats.jar \ |
− | INPUT=/output/filename.b37_1kg.sorted.bam | + | INPUT=/output/filename.b37_1kg.sorted.bam \ |
− | VALIDATION_STRINGENCY=LENIENT | + | VALIDATION_STRINGENCY=LENIENT \ |
− | TMP_DIR=/ | + | TMP_DIR=/big/drive |
+ | <span style="background:#FFFF00"> | ||
CalculateHsMetricsWholeGenome | CalculateHsMetricsWholeGenome | ||
− | java -jar | + | java -Xmx4g -jar /tools/picard-tools-1.32/CalculateHsMetricsWholeGenome.jar \ |
INPUT=/output/filename.b37_1kg.sorted.bam \ | INPUT=/output/filename.b37_1kg.sorted.bam \ | ||
OUTPUT=/output/filename.b37_1kg.HsMetrics \ | OUTPUT=/output/filename.b37_1kg.HsMetrics \ | ||
Line 363: | Line 239: | ||
TARGET_INTERVALS=/resources//hg19/intervals/GoNL.interval_list \ | TARGET_INTERVALS=/resources//hg19/intervals/GoNL.interval_list \ | ||
VALIDATION_STRINGENCY=LENIENT \ | VALIDATION_STRINGENCY=LENIENT \ | ||
− | TMP_DIR=/ | + | TMP_DIR=/big/drive |
+ | </span> | ||
− | === | + | === Reduce Reads === |
− | + | java -Xmx4g -jar GenomeAnalysisTK.jar \ | |
− | + | -T ReduceReads \ | |
− | + | -I Sample1.recal.bam \ | |
+ | -o Sample1.reduced.bam \ | ||
+ | -R human_g1K_v37.fasta \ | ||
+ | 2> ReduceReads.error | ||
== Variant Calling == | == Variant Calling == | ||
=== UnifiedGenotyper === | === UnifiedGenotyper === | ||
− | java -jar GenomeAnalysisTK.jar \ | + | java -Xmx4g -jar GenomeAnalysisTK.jar \ |
− | - | + | -T UnifiedGenotyper \ |
− | - | + | -I Sample1.reduced.bam \ |
− | -I | + | -I Sample2.reduced.bam \ |
− | - | + | -I Background1.reduced.bam \ |
− | -o | + | -I Background2.reduced.bam \ |
− | - | + | -o Sample1.raw.vcf \ |
− | - | + | -R human_g1K_v37.fasta \ |
− | -dcov | + | --dbsnp dbSNP.vcf \ |
− | + | -dcov 250 \ | |
− | + | -L TruSeq_exome_regions.bed \ | |
− | + | 2> UnifiedGenotyper.error | |
=== VariantRecalibrator === | === VariantRecalibrator === | ||
java -Xmx4g -jar GenomeAnalysisTK.jar \ | java -Xmx4g -jar GenomeAnalysisTK.jar \ | ||
− | -T VariantRecalibrator \ | + | -T VariantRecalibrator \ |
− | -R | + | -R human_g1K_v37.fasta \ |
− | -input | + | -input Sample1.raw.vcf \ |
− | -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37 | + | -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf \ |
− | -resource:omni,known=false,training=true,truth= | + | -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf \ |
− | -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 | + | -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_137.b37.vcf \ |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | ADD 1KG | |
− | + | -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff \ | |
+ | -mode SNP \ | ||
+ | -recalFile Sample1.recal \ | ||
+ | -tranchesFile Sample1.tranches \ | ||
+ | -rscriptFile Sample1.plots.R \ | ||
+ | 2> VariantRecalibrator.error | ||
− | === | + | === ApplyRecalibration === |
− | + | java -Xmx4g -jar GenomeAnalysisTK.jar \ | |
− | + | -T ApplyRecalibration \ | |
− | + | -input Sample1.raw.vcf \ | |
− | + | -o Sample1.vqsr.vcf \ | |
− | + | -R human_g1K_v37.fasta \ | |
− | + | --ts_filter_level 99.0 \ | |
− | + | -tranchesFile Sample1.tranches \ | |
− | + | -recalFile Sample1.recal \ | |
− | + | -mode SNP \ | |
− | + | 2> ApplyRecalibration.error | |
− | == | + | == Variant File QC == |
− | = | + | <span style="background:#FFFF00"> |
+ | Quality Metrics on variants | ||
+ | *Ti/Tv Ratio (2.1 for WGS ~2.8 for exome) | ||
+ | *HapMap concordance | ||
+ | *SNV/Indel Counts | ||
+ | *Rare variant enrichment | ||
+ | *DP | ||
+ | *Q | ||
+ | *GQ |
Latest revision as of 19:04, 6 September 2013
Contents
Utah Genome Project
Variant Calling Pipeline Version 0.0.2
Sept. 2013
Data Source
Data sets used for the variant calling pipeline come from the Broad GSA (GATK) group as the 'GATK resource bundle 2.5' version 2.5
wget -r ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37
- Reference Genome (GRCh37):
- human_g1K_v37.fasta
- High confidence SNVs
- 1000G_omni2.5.b37.vcf
- hapmap_3.3.b37.vcf
- dbSNP
- dbsnp_137.b37.vcf.gz
- Known indels
- Mills_and_1000G_gold_standard.indels.b37.vcf.gz
- 1000G_phase1.indels.b37.vcf
Sequencing
This pipeline is designed for 100 bp Illumina HiSeq PE exome or WGS sequence data with Sanger/Illumina 1.9 quality encoding.
Validate File Integrity with md5sum
An md5sum signature should be provided for each FastQ file by the sequencing center. After the file has been downloaded, locally check the md5sum to be sure that no data corruption occurred during the file transfer.
md5sum file.fastq > file_local.md5 diff file_local.md5 file_provided.md5
If the md5sum signature differs from that provided for the file:
- Check to be sure you have the correct file.
- Check if the md5sum was calculated on that compressed or uncompressed file by the provider and be sure to do the same with the local copy.
- Try the download again.
- Contact the sequence provider.
FastQ File Analyses
fastqc Sample1_L1_R1.txt
From the fastqc_data.txt file we check the following values:
- Encoding (must be Sanger / Illumina 1.9)
- Total Sequences (Need to develop a acceptable range)
- Filtered Sequences (Should be 0 or at least very low)
- Sequence length (must be ~ 100 bp)
- %GC (should be 45 < x < 55)
- Total Duplicate Percentage (This value may not be valuable - an acceptable range has not been determined).
Alignment
Align reads to the genome with bwa.
Index the reference sequence if this has not already been done
bwa index -a bwtsw human_g1K_v37.fasta
The 'bwa aln' program will find the reference coordinates of the input reads (independent of their mate-pair). The following parameters are those used by the 1KG project for aligning Illumina data.
bwa aln -q 15 human_g1K_v37.fasta file.fastq > Sample1_L1_R1.sai # One lane of reads first in pair bwa aln -q 15 human_g1K_v37.fasta file.fastq > Sample1_L1_R2.sai # One lane of reads second in pair
The 'bwa sampe'. For paired-end reads, the maximum insert size is taken to be 3 times the expected insert size.
bwa sampe -P human_g1K_v37.fasta Sample1_L1_R1.sai Sample1_L1_R1.sai Sample1_L1_R1.fastq Sample1_L1_R2.fastq | samtools view -bt human_g1K_v37.fasta.fai -o Sample1_L1 -
This will switch to bwa mem soon.
bwa mem human_g1K_v37.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq | samtools view -bt human_g1K_v37.fasta.fai -o Sample1_L1 -
Bwa Command Line Options:
- -n NUM max #diff (int) or missing prob under 0.02 err rate (float) [0.04]
- -o INT maximum number or fraction of gap opens [1]
- -e INT maximum number of gap extensions, -1 for disabling long gaps [-1]
- -i INT do not put an indel within INT bp towards the ends [5]
- -d INT maximum occurrences for extending a long deletion [10]
- -l INT seed length [32]
- -k INT maximum differences in the seed [2]
- -m INT maximum entries in the queue [2000000]
- -t INT number of threads [1]
- -M INT mismatch penalty [3]
- -O INT gap open penalty [11]
- -E INT gap extension penalty [4]
- -R INT stop searching when there are >INT equally best hits [30]
- -q INT quality threshold for read trimming down to 35bp [0]
- -c input sequences are in the color space
- -L log-scaled gap penalty for long deletions
- -N non-iterative mode: search for all n-difference hits (slooow)
- -f FILE file to write output to instead of stdout
The AddOrReplaceReadGroups tool will add the correct readgroup ID to the BAM files
java -Xmx4g -jar picard-tools/AddOrReplaceReadGroups.jar \ INPUT=Sample1_L1.bam \ OUTPUT=Sample1_L1.rg.bam \ CREATE_INDEX=TRUE \ VALIDATION_STRINGENCY=LENIENT \ SORT_ORDER=coordinate \ RGID=9958X1 \ RGLB=9958X1 \ RGPL=illumina \ RGPU=1 \ TMP_DIR=/big/drive \ 2> 9958X1_AddOrReplaceReadGroups.error
BAM File Analyses
Alignment BAM files are improved in various ways to help increase the quality and speed of subsequent variant calling steps.
Merge lane level BAMs to individual
java -Xmx4g -jar picard-tools/MergeSamFiles.jar \ INPUT=Sample1_L1.rg.bam \ INPUT=Sample1_L2.rg.bam \ OUTPUT=Sample1.bam \ CREATE_INDEX=TRUE \ ASSUME_SORTED=true \ USE_THREADING=true \ VALIDATION_STRINGENCY=SILENT \ TMP_DIR=/big/drive \ 2> MergeSameFiles.error
Mark Duplicates
Remove PCR/Optical duplicate reads
java -Xmx4g -jar MarkDuplicates.jar \ INPUT=Sample1.bam \ OUTPUT=Sample1.dedup.bam \ METRICS_FILE=lane1_dup_metrics.txt \ ASSUME_SORTED=true \ USE_THREADING=true \ VALIDATION_STRINGENCY=SILENT \ TMP_DIR=/big/drive \ 2> MarkDuplicates.error
Develop range for duplicate levels here
Local Realignment of Indels
java -Xmx4g -jar GenomeAnalysisTK.jar \ -I All_BAMs.list \ -o Realign_Intervals.txt \ -T RealignerTargetCreator \ -R human_g1K_v37.fasta \ -known Mills_and_1000G_gold_standard.indels.b37.vcf \ -known 1000G_phase1.indels.b37.vcf \ -nt 24 \ 2> RealignerTargetCreator.error
java -Xmx4g -jar GenomeAnalysisTK.jar \ -T IndelRealigner \ -I Sample1.dedup.bam \ -o Sample1.realign.bam \ -R human_g1K_v37.fasta \ -targetIntervals Realign_Intervals.txt \ -known Mills_and_1000G_gold_standard.indels.b37.vcf \ 2> IndelRealigner.error
Base Quality Score Recalibration
java -Xmx4g -jar GenomeAnalysisTK.jar \ -T PrintReads \ -I Sample1.realign.bam \ -o Sample1.recal.bam \ -R human_g1K_v37.fasta \ -BQSR recalibration_report.grp \
BAM Quality Control
CollectAlignmentSummaryMetrics
java -Xmx4g -jar /tools/picard-tools-1.32/CollectAlignmentSummaryMetrics.jar \ I=/output/filename.b37_1kg.sorted.bam \ O=/output/filename.b37_1kg.AlignmentSummaryMetrics \ R=human_g1K_v37.fasta \ VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=/big/drive
CollectGcBiasMetrics
java -Xmx4g -jar /tools/picard-tools-1.32/CollectGcBiasMetrics.jar \ R=human_g1K_v37.fasta \ I=/output/filename.b37_1kg.sorted.bam \ O=/output/filename.b37_1kg.GcBiasMetrics \ CHART=/output/filename.b37_1kg.GcBiasMetrics.pdf \ VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=/big/drive
CollectInsertSizeMetrics
java -Xmx4g -jar /tools/picard-tools-1.32/CollectInsertSizeMetrics.jar \ I=/output/filename.b37_1kg.sorted.bam \ O=/output/filename.b37_1kg.CollectInsertSizeMetrics \ H=/output/filename.b37_1kg.CollectInsertSizeMetrics.pdf \ VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=/big/drive
MeanQualityByCycle
java -Xmx4g -jar /tools/picard-tools-1.32/MeanQualityByCycle.jar \ I=/output/filename.b37_1kg.sorted.bam \ O=/output/filename.b37_1kg.MeanQualityByCycle \ CHART=/output/filename.b37_1kg.MeanQualityByCycle.pdf \ VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=/big/drive
QualityScoreDistribution
java -Xmx4g -jar /tools/picard-tools-1.32/QualityScoreDistribution.jar \ I=/output/filename.b37_1kg.sorted.bam \ O=/output/filename.b37_1kg.[wiki:QualityScoreDistribution] \ CHART=/output/filename.b37_1kg.[wiki:QualityScoreDistribution].pdf \ VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=/big/drive
BamIndexStats
java -Xmx4g -jar /tools/picard-tools-1.32/BamIndexStats.jar \ INPUT=/output/filename.b37_1kg.sorted.bam \ VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=/big/drive
CalculateHsMetricsWholeGenome
java -Xmx4g -jar /tools/picard-tools-1.32/CalculateHsMetricsWholeGenome.jar \ INPUT=/output/filename.b37_1kg.sorted.bam \ OUTPUT=/output/filename.b37_1kg.HsMetrics \ BAIT_INTERVALS=/resources//hg19/intervals/GoNL.interval_list \ TARGET_INTERVALS=/resources//hg19/intervals/GoNL.interval_list \ VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=/big/drive
Reduce Reads
java -Xmx4g -jar GenomeAnalysisTK.jar \ -T ReduceReads \ -I Sample1.recal.bam \ -o Sample1.reduced.bam \ -R human_g1K_v37.fasta \ 2> ReduceReads.error
Variant Calling
UnifiedGenotyper
java -Xmx4g -jar GenomeAnalysisTK.jar \ -T UnifiedGenotyper \ -I Sample1.reduced.bam \ -I Sample2.reduced.bam \ -I Background1.reduced.bam \ -I Background2.reduced.bam \ -o Sample1.raw.vcf \ -R human_g1K_v37.fasta \ --dbsnp dbSNP.vcf \ -dcov 250 \ -L TruSeq_exome_regions.bed \ 2> UnifiedGenotyper.error
VariantRecalibrator
java -Xmx4g -jar GenomeAnalysisTK.jar \ -T VariantRecalibrator \ -R human_g1K_v37.fasta \ -input Sample1.raw.vcf \ -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf \ -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf \ -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_137.b37.vcf \
ADD 1KG
-an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff \ -mode SNP \ -recalFile Sample1.recal \ -tranchesFile Sample1.tranches \ -rscriptFile Sample1.plots.R \ 2> VariantRecalibrator.error
ApplyRecalibration
java -Xmx4g -jar GenomeAnalysisTK.jar \ -T ApplyRecalibration \ -input Sample1.raw.vcf \ -o Sample1.vqsr.vcf \ -R human_g1K_v37.fasta \ --ts_filter_level 99.0 \ -tranchesFile Sample1.tranches \ -recalFile Sample1.recal \ -mode SNP \ 2> ApplyRecalibration.error
Variant File QC
Quality Metrics on variants
- Ti/Tv Ratio (2.1 for WGS ~2.8 for exome)
- HapMap concordance
- SNV/Indel Counts
- Rare variant enrichment
- DP
- Q
- GQ