Revision as of 17:33, 6 January 2014

Utah Genome Project

Variant Calling Pipeline Version 0.0.3

Sept. 2013

Data Source

Data sets used for the variant calling pipeline come from the Broad GSA (GATK) group as the 'GATK resource bundle 2.5' version 2.8

wget -r ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37

Reference Genome (GRCh37):
- human_g1k_v37_decoy.fasta

VCF files for RealignerTargetCreator knowns and dbsnp for BaseRecalibrator.
- known_indel: /data/GATK_Bundle/Mills_and_1000G_gold_standard.indels.b37.vcf
- known_indel: /data/GATK_Bundle/1000G_phase1.indels.b37.vcf
- known_dbsnp: /data/GATK_Bundle/dbsnp_137.b37.vcf

Background BAMS used running UnifiedGenotyper
- 1000Genomes sets:
- CEU
- GBR

Resource files for VariantRecalibrator_SNP
- hapmap_3.3.b37.vcf
- 1000G_omni2.5.b37.vcf
- 1000G_phase1.snps.high_confidence.b37.vcf

Resource files for VariantRecalibrator_INDEL
- Mills_and_1000G_gold_standard.indels.b37.vcf
- 1000G_phase1.indels.b37.vcf

Sequencing

This pipeline is designed for 100 bp Illumina HiSeq PE exome or WGS sequence data with Sanger/Illumina 1.9 quality encoding, and uses Illumina naming convention found here [1]

Validate File Integrity with md5sum

An md5sum signature should be provided for each FastQ file by the sequencing center. After the file has been downloaded, locally check the md5sum to be sure that no data corruption occurred during the file transfer.

md5sum file.fastq > file_local.md5
diff file_local.md5 file_provided.md5

If the md5sum signature differs from that provided for the file:

Check to be sure you have the correct file.
Check if the md5sum was calculated on that compressed or uncompressed file by the provider and be sure to do the same with the local copy.
Try the download again.
Contact the sequence provider.

FastQ File Analyses

fastqc Sample1_L1_R1.txt

From the fastqc_data.txt file we check the following values:

Encoding (must be Sanger / Illumina 1.9)
Total Sequences (Need to develop a acceptable range)
Filtered Sequences (Should be 0 or at least very low)
Sequence length (must be ~ 100 bp)
%GC (should be 45 < x < 55)
Total Duplicate Percentage (This value may not be valuable - an acceptable range has not been determined).

Alignment

Align reads to the genome with bwa.

Index the reference sequence if this has not already been done

bwa index -a bwtsw human_g1k_v37_decoy.fasta

The 'bwa aln' program will find the reference coordinates of the input reads (independent of their mate-pair). The following parameters are those used by the 1KG project for aligning Illumina data.

bwa aln -q 15 human_g1k_v37_decoy.fasta file.fastq > Sample1_L1_R1.sai # One lane of reads first in pair
bwa aln -q 15 human_g1k_v37_decoy.fasta file.fastq > Sample1_L1_R2.sai # One lane of reads second in pair

The 'bwa sampe'. For paired-end reads, the maximum insert size is taken to be 3 times the expected insert size and read group added.

bwa sampe -r "read group" -P Sample1_L1_R1.sai Sample1_L1_R1.sai Sample1_L1_R1.fastq Sample1_L1_R2.fastq | samtools view -bSho BAM_FILE -

This will switch to bwa mem soon.

bwa mem -M -R "read group" human_g1k_v37_decoy.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq | samtools view -bSho BAM_FILE -

BAM File Analyses

Alignment BAM files are improved in various ways to help increase the quality and speed of subsequent variant calling steps.

Merge lane level BAMs to individual

java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp MergeSamFiles.jar

    INPUT=Sample1_L1.rg.bam                     
    INPUT=Sample1_L2.rg.bam                    
    OUTPUT=Sample1.bam                          
    VALIDATION_STRINGENCY: LENIENT
    MAX_RECORDS_IN_RAM: 5000000
    CREATE_INDEX: True
    SORT_ORDER: coordinate
    ASSUME_SORTED: True 
    MERGE_SEQUENCE_DICTIONARIES:
    USE_THREADING: True
    2> MergeSamFiles.report

Mark Duplicates

Remove PCR/Optical duplicate reads

java -Xmx4g -jar MarkDuplicates.jar    \
   INPUT=Sample1.bam                   \
   OUTPUT=Sample1.dedup.bam            \
   METRICS_FILE=lane1_dup_metrics.txt  \
   ASSUME_SORTED=true                  \
   USE_THREADING=true                  \
   VALIDATION_STRINGENCY=SILENT        \
   TMP_DIR=/big/drive                  \
   2> MarkDuplicates.error

Develop range for duplicate levels here

Local Realignment of Indels

java -Xmx4g -jar GenomeAnalysisTK.jar                  \
   -I All_BAMs.list                                    \
   -o Realign_Intervals.txt                            \
   -T RealignerTargetCreator                           \
   -R human_g1K_v37.fasta                              \
   -known Mills_and_1000G_gold_standard.indels.b37.vcf \
   -known 1000G_phase1.indels.b37.vcf                  \
   -nt 24                                              \
   2> RealignerTargetCreator.error

java -Xmx4g -jar GenomeAnalysisTK.jar                  \
   -T IndelRealigner                                   \
   -I Sample1.dedup.bam                                \
   -o Sample1.realign.bam                              \
   -R human_g1K_v37.fasta                              \
   -targetIntervals Realign_Intervals.txt              \
   -known Mills_and_1000G_gold_standard.indels.b37.vcf \
   2> IndelRealigner.error

Base Quality Score Recalibration

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -T PrintReads                       \
  -I Sample1.realign.bam              \
  -o Sample1.recal.bam                \
  -R human_g1K_v37.fasta              \
  -BQSR recalibration_report.grp      \

BAM Quality Control

CollectAlignmentSummaryMetrics

java -Xmx4g -jar /tools/picard-tools-1.32/CollectAlignmentSummaryMetrics.jar \
   I=/output/filename.b37_1kg.sorted.bam                                      \
   O=/output/filename.b37_1kg.AlignmentSummaryMetrics                         \
   R=human_g1K_v37.fasta                                                     \
   VALIDATION_STRINGENCY=LENIENT                                             \
   TMP_DIR=/big/drive

CollectGcBiasMetrics

java -Xmx4g -jar /tools/picard-tools-1.32/CollectGcBiasMetrics.jar \
   R=human_g1K_v37.fasta                                           \
   I=/output/filename.b37_1kg.sorted.bam                            \
   O=/output/filename.b37_1kg.GcBiasMetrics                         \
   CHART=/output/filename.b37_1kg.GcBiasMetrics.pdf                 \
   VALIDATION_STRINGENCY=LENIENT                                   \
   TMP_DIR=/big/drive

CollectInsertSizeMetrics

java -Xmx4g -jar /tools/picard-tools-1.32/CollectInsertSizeMetrics.jar \
   I=/output/filename.b37_1kg.sorted.bam                                \
   O=/output/filename.b37_1kg.CollectInsertSizeMetrics                  \
   H=/output/filename.b37_1kg.CollectInsertSizeMetrics.pdf              \
   VALIDATION_STRINGENCY=LENIENT                                       \
   TMP_DIR=/big/drive

MeanQualityByCycle

java -Xmx4g -jar /tools/picard-tools-1.32/MeanQualityByCycle.jar \
   I=/output/filename.b37_1kg.sorted.bam                          \
   O=/output/filename.b37_1kg.MeanQualityByCycle                  \
   CHART=/output/filename.b37_1kg.MeanQualityByCycle.pdf          \
   VALIDATION_STRINGENCY=LENIENT                                 \
   TMP_DIR=/big/drive

QualityScoreDistribution

java -Xmx4g -jar /tools/picard-tools-1.32/QualityScoreDistribution.jar  \
   I=/output/filename.b37_1kg.sorted.bam                                 \
   O=/output/filename.b37_1kg.[wiki:QualityScoreDistribution]            \
   CHART=/output/filename.b37_1kg.[wiki:QualityScoreDistribution].pdf    \
   VALIDATION_STRINGENCY=LENIENT                                        \
   TMP_DIR=/big/drive

BamIndexStats

java -Xmx4g -jar /tools/picard-tools-1.32/BamIndexStats.jar \
   INPUT=/output/filename.b37_1kg.sorted.bam                 \
   VALIDATION_STRINGENCY=LENIENT                            \
   TMP_DIR=/big/drive

CalculateHsMetricsWholeGenome

java -Xmx4g -jar /tools/picard-tools-1.32/CalculateHsMetricsWholeGenome.jar \
   INPUT=/output/filename.b37_1kg.sorted.bam                                 \
   OUTPUT=/output/filename.b37_1kg.HsMetrics                                 \
   BAIT_INTERVALS=/resources//hg19/intervals/GoNL.interval_list             \
   TARGET_INTERVALS=/resources//hg19/intervals/GoNL.interval_list           \
   VALIDATION_STRINGENCY=LENIENT                                            \
   TMP_DIR=/big/drive

Reduce Reads

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -T ReduceReads                      \
  -I Sample1.recal.bam                \
  -o Sample1.reduced.bam              \
  -R human_g1K_v37.fasta              \
  2> ReduceReads.error

Variant Calling

UnifiedGenotyper

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -T UnifiedGenotyper                  \
  -I Sample1.reduced.bam              \
  -I Sample2.reduced.bam              \
  -I Background1.reduced.bam          \
  -I Background2.reduced.bam          \
  -o Sample1.raw.vcf                  \
  -R human_g1K_v37.fasta              \
  --dbsnp dbSNP.vcf                    \
  -dcov 250                           \
  -L TruSeq_exome_regions.bed         \
  2> UnifiedGenotyper.error

VariantRecalibrator

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -T VariantRecalibrator              \
  -R human_g1K_v37.fasta              \
  -input Sample1.raw.vcf              \
  -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf          \
  -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf        \
  -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_137.b37.vcf            \

ADD 1KG

  -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff \
  -mode SNP                           \
  -recalFile Sample1.recal            \
  -tranchesFile Sample1.tranches      \
  -rscriptFile Sample1.plots.R        \
  2> VariantRecalibrator.error

ApplyRecalibration

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -T ApplyRecalibration               \
  -input Sample1.raw.vcf              \
  -o Sample1.vqsr.vcf                 \
  -R human_g1K_v37.fasta              \
  --ts_filter_level 99.0                \
  -tranchesFile Sample1.tranches      \
  -recalFile Sample1.recal            \
  -mode SNP                           \
  2> ApplyRecalibration.error

Variant File QC

Quality Metrics on variants

Ti/Tv Ratio (2.1 for WGS ~2.8 for exome)
HapMap concordance
SNV/Indel Counts
Rare variant enrichment
DP
Q
GQ

@@ Line 73: / Line 73: @@
   bwa aln -q 15 human_g1k_v37_decoy.fasta file.fastq > Sample1_L1_R2.sai # One lane of reads second in pair
-The 'bwa sampe'.  For paired-end reads, the maximum insert size is taken to be 3 times the expected insert size.
+The 'bwa sampe'.  For paired-end reads, the maximum insert size is taken to be 3 times the expected insert size and read group added.
-  bwa sampe -P human_g1k_v37_decoy.fasta Sample1_L1_R1.sai Sample1_L1_R1.sai Sample1_L1_R1.fastq Sample1_L1_R2.fastq | samtools view -bt human_g1k_v37_decoy.fasta.fai -o Sample1_L1 -
+  bwa sampe -r "read group" -P Sample1_L1_R1.sai Sample1_L1_R1.sai Sample1_L1_R1.fastq Sample1_L1_R2.fastq | samtools view -bSho BAM_FILE -
 <span style="background:#FFFF00">This will switch to bwa mem soon.
-  bwa mem human_g1k_v37_decoy.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq | samtools view -bt human_g1k_v37_decoy.fasta.fai -o Sample1_L1 -
+  bwa mem -M -R "read group" human_g1k_v37_decoy.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq | samtools view -bSho BAM_FILE -
 </span>
-Bwa Command Line Options:
-* -n NUM    max #diff (int) or missing prob under 0.02 err rate (float) [0.04]
-* -o INT    maximum number or fraction of gap opens [1]
-* -e INT    maximum number of gap extensions, -1 for disabling long gaps [-1]
-* -i INT    do not put an indel within INT bp towards the ends [5]
-* -d INT    maximum occurrences for extending a long deletion [10]
-* -l INT    seed length [32]
-* -k INT    maximum differences in the seed [2]
-* -m INT    maximum entries in the queue [2000000]
-* -t INT    number of threads [1]
-* -M INT    mismatch penalty [3]
-* -O INT    gap open penalty [11]
-* -E INT    gap extension penalty [4]
-* -R INT    stop searching when there are >INT equally best hits [30]
-* -q INT    quality threshold for read trimming down to 35bp [0]
-* -c        input sequences are in the color space
-* -L        log-scaled gap penalty for long deletions
-* -N        non-iterative mode: search for all n-difference hits (slooow)
-* -f FILE   file to write output to instead of stdout
-The AddOrReplaceReadGroups tool will add the correct readgroup ID to the BAM files
- java -Xmx4g -jar picard-tools/AddOrReplaceReadGroups.jar \
-    INPUT=Sample1_L1.bam                                             \
-    OUTPUT=Sample1_L1.rg.bam                                         \
-    CREATE_INDEX=TRUE                                                \
-    VALIDATION_STRINGENCY=LENIENT                                    \
-    SORT_ORDER=coordinate                                            \
-    RGID=9958X1                                                      \
-    RGLB=9958X1                                                      \
-    RGPL=illumina                                                    \
-    RGPU=1                                                           \
-    TMP_DIR=/big/drive                                               \
-> 9958X1_AddOrReplaceReadGroups.error
 == BAM File Analyses ==
@@ Line 125: / Line 89: @@
 === Merge lane level BAMs to individual ===
- java -Xmx4g -jar picard-tools/MergeSamFiles.jar \
+java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp MergeSamFiles.jar
-      INPUT=Sample1_L1.rg.bam                     \
+      INPUT=Sample1_L1.rg.bam
-      INPUT=Sample1_L2.rg.bam                     \
+      INPUT=Sample1_L2.rg.bam
-      OUTPUT=Sample1.bam                          \
+      OUTPUT=Sample1.bam
-      CREATE_INDEX=TRUE                           \
+     VALIDATION_STRINGENCY: LENIENT
-      ASSUME_SORTED=true                          \
+     MAX_RECORDS_IN_RAM: 5000000
-      USE_THREADING=true                          \
+      CREATE_INDEX: True
-      VALIDATION_STRINGENCY=SILENT                \
+      SORT_ORDER: coordinate
-      TMP_DIR=/big/drive                          \
+      ASSUME_SORTED: True
-> MergeSameFiles.error
+      MERGE_SEQUENCE_DICTIONARIES:
+      USE_THREADING: True
+> MergeSamFiles.report
 === Mark Duplicates ===

Difference between revisions of "UGP Variant Pipeline 0.0.3"

Revision as of 17:33, 6 January 2014

Contents

Utah Genome Project

Variant Calling Pipeline Version 0.0.3

Sept. 2013

Data Source

Sequencing

Validate File Integrity with md5sum

FastQ File Analyses

Alignment

BAM File Analyses

Merge lane level BAMs to individual

Mark Duplicates

Local Realignment of Indels

Base Quality Score Recalibration

BAM Quality Control

Reduce Reads

Variant Calling

UnifiedGenotyper

VariantRecalibrator

ApplyRecalibration

Variant File QC

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools