Difference between revisions of "UGP Variant Pipeline 0.0.3"
Line 28: | Line 28: | ||
**known_dbsnp: /data/GATK_Bundle/dbsnp_137.b37.vcf | **known_dbsnp: /data/GATK_Bundle/dbsnp_137.b37.vcf | ||
− | *Background BAMS used running UnifiedGenotyper | + | *1000Genomes Background BAMS used running UnifiedGenotyper |
− | |||
**CEU | **CEU | ||
**GBR | **GBR |
Revision as of 18:27, 6 January 2014
Contents
Utah Genome Project
Variant Calling Pipeline Version 0.0.3
Jan. 2014
Software Versions
- GenomeAnalysisTK-2.8-1
- Picard : Version: 1.99
- FastQC v0.10.1
- Samtools Version: 0.1.19
- BWA Version: 0.7.5
- cApTUrE 0.02
Data Source
Data sets used for the variant calling pipeline come from the Broad GSA (GATK) group as the 'GATK resource bundle 2.5' version 2.8
wget -r ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37
- Reference Genome (GRCh37):
- human_g1k_v37_decoy.fasta
- VCF files for RealignerTargetCreator knowns and dbsnp for BaseRecalibrator.
- known_indel: /data/GATK_Bundle/Mills_and_1000G_gold_standard.indels.b37.vcf
- known_indel: /data/GATK_Bundle/1000G_phase1.indels.b37.vcf
- known_dbsnp: /data/GATK_Bundle/dbsnp_137.b37.vcf
- 1000Genomes Background BAMS used running UnifiedGenotyper
- CEU
- GBR
- Resource files for VariantRecalibrator_SNP
- hapmap_3.3.b37.vcf
- 1000G_omni2.5.b37.vcf
- 1000G_phase1.snps.high_confidence.b37.vcf
- Resource files for VariantRecalibrator_INDEL
- Mills_and_1000G_gold_standard.indels.b37.vcf
- 1000G_phase1.indels.b37.vcf
Sequencing
This pipeline is designed for 100 bp Illumina HiSeq PE exome or WGS sequence data with Sanger/Illumina 1.9 quality encoding, and uses Illumina naming convention found here [1]
Validate File Integrity with md5sum
An md5sum signature should be provided for each FastQ file by the sequencing center. After the file has been downloaded, locally check the md5sum to be sure that no data corruption occurred during the file transfer.
md5sum file.fastq > file_local.md5 diff file_local.md5 file_provided.md5
If the md5sum signature differs from that provided for the file:
- Check to be sure you have the correct file.
- Check if the md5sum was calculated on that compressed or uncompressed file by the provider and be sure to do the same with the local copy.
- Try the download again.
- Contact the sequence provider.
FastQ File Analyses
fastqc Sample1_L1_R1.txt
From the fastqc_data.txt file we check the following values:
- Encoding (must be Sanger / Illumina 1.9)
- Total Sequences (Need to develop a acceptable range)
- Filtered Sequences (Should be 0 or at least very low)
- Sequence length (must be ~ 100 bp)
- %GC (should be 45 < x < 55)
- Total Duplicate Percentage (This value may not be valuable - an acceptable range has not been determined).
Alignment
Align reads to the genome with bwa.
Index the reference sequence if this has not already been done
bwa index -a bwtsw human_g1k_v37_decoy.fasta
The 'bwa aln' program will find the reference coordinates of the input reads (independent of their mate-pair). The following parameters are those used by the 1KG project for aligning Illumina data.
bwa aln -q 15 human_g1k_v37_decoy.fasta file.fastq > Sample1_L1_R1.sai # One lane of reads first in pair bwa aln -q 15 human_g1k_v37_decoy.fasta file.fastq > Sample1_L1_R2.sai # One lane of reads second in pair
The 'bwa sampe'. For paired-end reads, the maximum insert size is taken to be 3 times the expected insert size and read group added.
bwa sampe -r "read group" -P Sample1_L1_R1.sai Sample1_L1_R1.sai Sample1_L1_R1.fastq Sample1_L1_R2.fastq | samtools view -bSho BAM_FILE -
This will switch to bwa mem soon.
bwa mem -M -R "read group" human_g1k_v37_decoy.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq | samtools view -bSho BAM_FILE -
BAM File Analyses
Alignment BAM files are improved in various ways to help increase the quality and speed of subsequent variant calling steps.
Merge lane level BAMs to individual
java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp MergeSamFiles.jar INPUT=Sample1_L1.rg.bam INPUT=Sample1_L2.rg.bam OUTPUT=Sample1.bam VALIDATION_STRINGENCY: LENIENT MAX_RECORDS_IN_RAM: 5000000 CREATE_INDEX: True SORT_ORDER: coordinate ASSUME_SORTED: True MERGE_SEQUENCE_DICTIONARIES: USE_THREADING: True 2> MergeSamFiles.report
Mark Duplicates
Remove PCR/Optical duplicate reads
java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp MarkDuplicates.jar INPUT=Sample1.bam OUTPUT=Sample1.dedup.bam METRICS_FILE=lane1_dup_metrics.txt VALIDATION_STRINGENCY: LENIENT COMPRESSION_LEVEL: MAX_RECORDS_IN_RAM: 5000000 CREATE_INDEX: True ASSUME_SORTED: True 2> MarkDuplicates.log
Develop range for duplicate levels here
Local Realignment of Indels
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -I dedup_bam.list -o Realign_Intervals.txt -T RealignerTargetCreator -R human_g1k_v37_decoy.fasta -known [list from above] -nt 24 2> RealignerTargetCreator.log
Base Quality Score Recalibration
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T PrintReads -I Sample1.realign.bam -o Sample1.recal.bam -R human_g1k_v37_decoy.fasta -BQSR recalibration_report.grp
BAM Quality Control
CollectAlignmentSummaryMetrics
java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp CollectAlignmentSummaryMetrics.jar I=/output/filename.b37_1kg.sorted.bam O=/output/filename.b37_1kg.AlignmentSummaryMetrics R= human_g1k_v37_decoy.fasta VALIDATION_STRINGENCY=LENIENT PROGRAM: QualityScoreDistribution
ReduceReads
java -Xmx10g -XX:ParallelGCThreads=10 -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T ReduceReads -I Sample1.recal.bam -o Sample1.reduced.bam -R human_g1k_v37_decoy.fasta 2> ReduceReads.report
Variant Calling
UnifiedGenotyper
SNP calling
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T UnifiedGenotyper -I reduced_bam.list -I 1000G background_bam.list -o Sample1.raw.vcf -R human_g1k_v37_decoy.fasta -nt 4 -nct 6 -stand_call_conf 30.0 -stand_emit_conf 30.0 -glm SNP -out_mode EMIT_VARIANTS_ONLY -L NCBI Ref_GRCh37 region.bed 2> UnifiedGenotyper.report
INDEL calling
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T UnifiedGenotyper -I reduced_bam.list -I 1000G background_bam.list -o Sample1.raw.vcf -R human_g1k_v37_decoy.fasta -nt 4 -nct 6 -stand_call_conf 30.0 -stand_emit_conf 30.0 -glm INDEL -out_mode EMIT_VARIANTS_ONLY -L NCBI Ref_GRCh37 region.bed 2> UnifiedGenotyper.report
VariantRecalibrator
SNP Recalibration
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T VariantRecalibrator -R human_g1k_v37_decoy.fasta -input Sample1.raw.vcf -nt 25 -resource: hapmap,known=false,training=true,truth=true,prior=15.0 -resource: omni,known=false,training=true,truth=true,prior=12.0 -resource: 1000G,known=false,training=true,truth=false,prior=10.0 -use_annotation: QD -use_annotation: HaplotypeScore -use_annotation: MQRankSum -use_annotation: ReadPosRankSum -use_annotation: FS -numBadVariants: 5000 -mode SNP -recalFile Sample1_SNP.recal -tranchesFile Sample1_SNP.tranches -rscriptFile Sample1_SNP.plots.R 2> VariantRecalibrator_SNP.report
INDEL Recalibration
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T VariantRecalibrator -R human_g1k_v37_decoy.fasta -input Sample1.raw.vcf -nt 25 -resource: mills,known=false,training=true,truth=true,prior=12.0 -resource: 1000G,known=false,training=true,truth=true,prior=10.0 -use_annotation: MQRankSum -use_annotation: ReadPosRankSum -use_annotation: FS -numBadVariants: 5000 -mode INDEL -recalFile Sample1_INDEL.recal -tranchesFile Sample1_INDEL.tranches -rscriptFile Sample1_INDEL.plots.R 2> VariantRecalibrator_INDEL.report
ApplyRecalibration
SNP Apply
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T ApplyRecalibration -input Sample1_raw_SNP.vcf -o Sample1_vqsr_SNP.vcf -R human_g1k_v37_decoy.fasta -nt 25 -ts_filter_level: 99.0 -excludeFiltered : TRUE -tranchesFile Sample1.tranches -recalFile Sample1.recal -mode SNP 2> ApplyRecalibration_SNP.report
SNP INDEL
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T ApplyRecalibration -input Sample1_raw_INDEL.vcf -o Sample1_vqsr_INDEL.vcf -R human_g1k_v37_decoy.fasta -nt 25 -ts_filter_level: 99.0 -excludeFiltered : TRUE -tranchesFile Sample1.tranches -recalFile Sample1.recal -mode INDEL 2> ApplyRecalibration_INDEL.report
SelectVariants
These commands will remove the background files and output SNP and INDEL files, then combine them into a single VCF file.
Select SNP
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T SelectVariants -R human_g1k_v37_decoy.fasta --variant SNP_variant -o output_SNP.file 2> SelectVariants_SNP.report
Select INDEL
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T SelectVariants -R human_g1k_v37_decoy.fasta --variant INDEL_variant -o output_INDEL.file 2> SelectVariants_INDEL.report
CombineVarients
java -Xmx10g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T CombineVariants -R human_g1k_v37_decoy.fasta --variant INDEL_variant --variant output_INDEL.file --variant output_SNP.file -o Final.vcf 2> CombineVarients.report
Variant File QC
Quality Metrics on variants
- Ti/Tv Ratio (2.1 for WGS ~2.8 for exome)
- HapMap concordance
- SNV/Indel Counts
- Rare variant enrichment
- DP
- Q
- GQ