UGP Variant Pipeline 0.0.2
Contents
- 1 Utah Genome Project
- 1.1 Variant Calling Pipeline 0.0.2
- 1.2 Sept. 2013
- 1.3 Sequencing
- 1.4 FastQ File Analyses
- 1.5 Alignment
- 1.6 BAM File Analyses
- 1.6.1 Merge lane level BAMs to individual
- 1.6.2 GATK Local Realignment of Indels
- 1.6.3 Fix Mate Information
- 1.6.4 GATK Quality Score Recalibration
- 1.6.5 Generate the SAM MD Tag
- 1.6.6 Strip Extraneous BAM Tags
- 1.6.7 Merge BAMS
- 1.6.8 Mark PCR Duplicates
- 1.6.9 Merge All BAMS for the Project
- 1.6.10 BAM Quality Control
- 1.6.11 Split BAMS by Chromosome
- 1.6.12 Compress BAMS with ReduceRead
- 1.7 Variant Calling
- 1.8 Variant File Analyses
- 1.9 Family Based Analyses
- 1.10 Population Based Analyses
- 1.11 Phenotype Based Analyses
Utah Genome Project
Variant Calling Pipeline 0.0.2
Sept. 2013
Data Source
Data sets used for the variant calling pipeline come from the Broad GSA (GATK) group as the 'GATK resource bundle 2.5' version 2.5
wget -r ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37
- Reference Genome (GRCh37):
human_g1K_v37.fasta
- High confidence SNVs
1000G_omni2.5.b37.vcf hapmap_3.3.b37.vcf
- dbSNP
dbsnp_137.b37.vcf.gz
- High confidence indels
Mills_and_1000G_gold_standard.indels.b37.vcf.gz
Sequencing
This pipeline is designed for 100 bp Illumina HiSeq PE exome or WGS sequence data with Sanger/Illumina 1.9 quality encoding.
Validate File Integrity with md5sum
An md5sum signature should be provided for each FastQ file by the sequencing center. After the file has been downloaded, locally check the md5sum to be sure that no data corruption occurred during the file transfer.
md5sum file.fastq > file_local.md5 diff file_local.md5 file_provided.md5
If the md5sum signature differs from that provided for the file:
- Check to be sure you have the correct file.
- Check if the md5sum was calculated on that compressed or uncompressed file by the provider and be sure to do the same with the local copy.
- Try the download again.
- Contact the sequence provider.
FastQ File Analyses
fastqc 9958X1_130327_700179R_0404_BD1YNUACXX_7_1.txt
From the fastqc_data.txt file we check the following values:
- Encoding (must be Sanger / Illumina 1.9)
- Total Sequences (Need to develop a acceptable range)
- Filtered Sequences (Should be 0 or at least very low)
- Sequence length (must be ~ 100 bp)
- %GC (should be 45 < x < 55)
- Total Duplicate Percentage (This value may not be valuable - an acceptable range has not been determined).
Alignment
Index the reference sequence if this has not already been done
bwa index -a bwtsw reference_genome.fasta
The 'bwa aln' program will find the reference coordinates of the input reads (independent of their mate-pair). The following parameters are those used by the 1KG project for aligning Illumina data. Will switch to bwa mem soon.
bwa aln -q 15 reference.fasta file.fastq > output.sai
The 'bwa sampe'. For paired-end reads, the maximum insert size is taken to be 3 times the expected insert size.
bwa sampe -P reference.fasta output1.sai output2.sai file1.fastq file2.fastq > out.sam
Bwa Command Line Options:
- -n NUM max #diff (int) or missing prob under 0.02 err rate (float) [0.04]
- -o INT maximum number or fraction of gap opens [1]
- -e INT maximum number of gap extensions, -1 for disabling long gaps [-1]
- -i INT do not put an indel within INT bp towards the ends [5]
- -d INT maximum occurrences for extending a long deletion [10]
- -l INT seed length [32]
- -k INT maximum differences in the seed [2]
- -m INT maximum entries in the queue [2000000]
- -t INT number of threads [1]
- -M INT mismatch penalty [3]
- -O INT gap open penalty [11]
- -E INT gap extension penalty [4]
- -R INT stop searching when there are >INT equally best hits [30]
- -q INT quality threshold for read trimming down to 35bp [0]
- -c input sequences are in the color space
- -L log-scaled gap penalty for long deletions
- -N non-iterative mode: search for all n-difference hits (slooow)
- -f FILE file to write output to instead of stdout
The AddOrReplaceReadGroups tool will add the correct readgroup ID to the BAM files
nohup java -d64 -Xmx8g -jar /usr/local/picard-tools-1.88/AddOrReplaceReadGroups.jar \ INPUT=9958X1_7_1.bam \ OUTPUT=9958X1_RG.bam \ CREATE_INDEX=TRUE \ MAX_RECORDS_IN_RAM=2000000 \ VALIDATION_STRINGENCY=SILENT \ TMP_DIR=/tmp \ SORT_ORDER=coordinate \ RGID=9958X1 \ RGLB=9958X1 \ RGPL=illumina \ RGPU=1 \ 2> 9958X1_AddOrReplaceReadGroups.error &
BAM File Analyses
Alignment BAM files are improved in various ways to help increase the quality and speed of subsequent variant calling steps.
TODO: Should probably merge all lanes per individual here so that MarkDuplicates and IndelRealigner does their best work.
Merge lane level BAMs to individual
java -jar -Xmx4g /tools/picard-tools-1.90/MergeSamFiles.jar \ INPUT=alignment.bam \ INPUT=alignment.bam \ OUTPUT=merged.bam \ CREATE_INDEX=TRUE \ ASSUME_SORTED=true \ USE_THREADING=true \ TMP_DIR=/tmp \ MAX_RECORDS_IN_RAM=30000000 \ SORT_ORDER=coordinate \ VALIDATION_STRINGENCY=SILENT
GATK Local Realignment of Indels
java -Xmx10g -jar /usr/local/GenomeAnalysisTK-2.4-7-g5e89f01//GenomeAnalysisTK.jar \ -I all_bams.list \ -T RealignerTargetCreator \ -R fasta/hs_ref_GRCh37.p10_ALL_chr_hs37d5.fa \ -o realign.intervals \ -known known_indels/ALL.wgs.indels_mills_devine_hg19_leftAligned_collapsed_double_hit.indels.sites.vcf \ -known known_indels/ALL.wgs.low_coverage_vqsr.20101123.indels.sites.vcf \ -nt 24 \ > RealignerTargetCreator \ 2> RealignerTargetCreator.error &
java -Xmx10g -jar /usr/local/GenomeAnalysisTK-2.4-7-g5e89f01/GenomeAnalysisTK.jar \ -T IndelRealigner \ -R fasta/hs_ref_GRCh37.p10_ALL_chr_hs37d5.fa \ -I 9958X1_RG.bam \ -I 9958X2_RG.bam \ -I 9958X3_RG.bam \ -I 9958X4_RG.bam \ -targetIntervals realign.intervals \ -o 9958X_Realigned.bam \ -known known_indels/ALL.wgs.indels_mills_devine_hg19_leftAligned_collapsed_double_hit.indels.sites.vcf \ -known known_indels/ALL.wgs.low_coverage_vqsr.20101123.indels.sites.vcf \ > IndelRealigner.out \ 2> IndelRealigner.error
Or this ??:
java -Djava.io.tmpdir=/local -Xmx8g -jar GenomeAnalysisTK.jar \ -l INFO -T IndelRealigner \ -U ALLOW_UNINDEXED_BAM \ -I /output/filename.b37_1kg.dedup.bam \ -targetIntervals /resources//hg19/intervals/realign_intervals_hg19_b37_1kg.intervals \ -R /resources//hg19/indices/b37_1kg.fa \ -D /resources//hg19/dbsnp/dbsnp_129_b37_b37_1kg.rod \ -[B:indels,VCF B:indels,VCF] /resources//hg19/indels/1kg.pilot_release.merged.indels.sites./hg19.b37_1kg.vcf \ -o /output/filename.b37_1kg.realigned.bam -knownsOnly -LOD 0.4 -compress 0
Known indels are from the following files:
- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_mapping_resources/ALL.wgs.indels_mills_devine_hg19_leftAligned_collapsed_double_hit.indels.sites.vcf.gz
- ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_mapping_resources/ALL.wgs.low_coverage_vqsr.20101123.indels.sites.vcf.gz
Fix Mate Information
java -jar -Xmx6g /tools/picard-tools-1.32/FixMateInformation.jar \ INPUT=/output/filename.b37_1kg.realigned.bam \ OUTPUT=/output/filename.b37_1kg.matefixed.bam \ SORT_ORDER=coordinate \ VALIDATION_STRINGENCY=SILENT \ TMP_DIR=/local
GATK Quality Score Recalibration
- CountCovariates and TableRecalibration are no longer options in GATK (http://gatkforums.broadinstitute.org/discussion/1248/countcovariates).
- Use BaseRecalibrator for both of the below step now (http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_bqsr_BaseRecalibrator.html)
java [jvm_args] -jar GenomeAnalysisTK.jar \ -T CountCovariates \ -R $reference_fasta \ -I $realigned_bam_file \ -recalFile recal_data.csv \ -knownSites $known_sites_file(s) \ -l INFO \ -L '1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;X;Y;MT' \ -cov ReadGroupCovariate \ -cov QualityScoreCovariate \ -cov CycleCovariate \ -cov DinucCovariate \
java [jvm_args] -jar GenomeAnalysisTK.jar \ -T TableRecalibration \ -R $reference_fasta \ -recalFile recal_data.csv \ -I $realigned_bam_file \ -o $recalibrated_bam_file \ -l INFO \ -compress 0 \ -disable_bam_indexing \
Known sites for recalibration are from dbSNP135:
Sort and index recalibrated alignment (~5h)
java -jar -Xmx3g /tools/picard-tools-1.32/SortSam.jar \ INPUT=/output/filename.b37_1kg.recal.bam \ OUTPUT=/output/filename.b37_1kg.recal.sorted.bam \ SORT_ORDER=coordinate \ VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=/local
java -jar -Xmx3g /tools/picard-tools-1.32/BuildBamIndex.jar \ INPUT=/output/filename.b37_1kg.recal.sorted.bam \ OUTPUT=/output/filename.b37_1kg.recal.sorted.bam.bai \ VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=/local
Calculate covariates after realignment and recalibration
java -jar -Xmx2g GenomeAnalysisTK.jar \ -l INFO \ -T CountCovariates \ -U ALLOW_UNINDEXED_BAM \ -R /resources//hg19/indices/b37_1kg.fa \ --DBSNP /resources//hg19/dbsnp/dbsnp_129_b37_b37_1kg.rod \ -I /output/filename.b37_1kg.recal.sorted.bam \ -cov ReadGroupcovariate \ -cov QualityScoreCovariate \ -cov CycleCovariate \ -cov DinucCovariate \ -recalFile /output/filename.b37_1kg.recal.covariate_table.csv
Analyze covariates before and after
java -jar -Xmx4g AnalyzeCovariates.jar \ -l INFO \ -resources /resources//hg19/indices/b37_1kg.fa \ --recal_file /output/filename.b37_1kg.matefixed.covariate_table.csv \ -outputDir /output/filename.b37_1kg.recal.stats_before/ \ -Rscript ${rscript} -ignoreQ 5
java -jar -Xmx4g AnalyzeCovariates.jar \ -l INFO \ -resources /resources//hg19/indices/b37_1kg.fa \ --recal_file /output/filename.b37_1kg.recal.covariate_table.csv \ -outputDir /output/filename.b37_1kg.recal.stats_after/ \ -Rscript ${rscript} \ -ignoreQ 5
Generate the SAM MD Tag
The MD tag in SAM files provides a string representation of mismatching positions. The CIGAR string used in SAM files does not provide full detail of mismatch positions. The samtools calmd command is used to generate MD tags as well as which fix the NM tags (Edit distance to the reference) and introduce the BQ tags (Base Alignment Quality) which can be used during variant calling.
samtools calmd -Erb recalibrated_file.bam $reference.fasta > $bq_file.bam
Strip Extraneous BAM Tags
Run-level BAMs have extraneous tags (OQ, XM, XG, XO) stripped from them, to reduce total file size by around 30%.
GATK ReduceReads ???
Merge BAMS
Merge all BAMS for a given individual
java $jvm_args -jar MergeSamFiles.jar INPUT=$tag_stripped_bam_file(s) OUTPUT=$library_level_bam VALIDATION_STRINGENCY=SILENT
Mark PCR Duplicates
Picard MarkDuplicates is used to mark PCR Duplicates
java $jvm_args -jar MarkDuplicates.jar INPUT=$library_level_bam OUTPUT=$markdup_bam_file ASSUME_SORTED=TRUE METRICS_FILE=/dev/null VALIDATION_STRINGENCY=SILENT
java -Xmx4g -jar /tools/picard-tools-1.32/MarkDuplicates.jar \ INPUT=/output/filename.b37_1kg.sorted.bam \ OUTPUT=/output/filename.b37_1kg.dedup.bam \ METRICS_FILE=/output/filename.b37_1kg.dedup.metrics \ REMOVE_DUPLICATES=false ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=/local
Merge All BAMS for the Project
This happens above... Picard MergeSamFiles is used to Merge all the SAM files for a given project.
java $jvm_args -jar MergeSamFiles.jar INPUT=$markdup_bam_file(s) OUTPUT=$platform_level_bam VALIDATION_STRINGENCY=SILENT
BAM Quality Control
- Picard Tools
- BamIndexStats
- CalculateHsMetrics ??
- CleanSam
- CollectAlignmentSummaryMetrics
- CollectGcBiasMetrics
- CollectInsertSizeMetrics
- CollectMultipleMetrics
- CollectTargetedPcrMetrics
- EstimateLibraryComplexity
- FixMateInformation
- MarkDuplicates
- MeanQualityByCycle
- QualityScoreDistribution
- ValidateSamFile
- GATK QC
- [1]
- [2]
CollectAlignmentSummaryMetrics
java -jar -Xmx4g /tools/picard-tools-1.32/CollectAlignmentSummaryMetrics.jar \ I=/output/filename.b37_1kg.sorted.bam \ O=/output/filename.b37_1kg.AlignmentSummaryMetrics \ R=/resources//hg19/indices/b37_1kg.fa \ VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=/local
CollectGcBiasMetrics
java -jar /tools/picard-tools-1.32/CollectGcBiasMetrics.jar \ R=/resources//hg19/indices/b37_1kg.fa \ I=/output/filename.b37_1kg.sorted.bam \ O=/output/filename.b37_1kg.GcBiasMetrics \ CHART=/output/filename.b37_1kg.GcBiasMetrics.pdf \ VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=/local
CollectInsertSizeMetrics
java -jar /tools/picard-tools-1.32/CollectInsertSizeMetrics.jar \ I=/output/filename.b37_1kg.sorted.bam \ O=/output/filename.b37_1kg.CollectInsertSizeMetrics \ H=/output/filename.b37_1kg.CollectInsertSizeMetrics.pdf \ VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=/local
MeanQualityByCycle
java -jar /tools/picard-tools-1.32/MeanQualityByCycle.jar \ I=/output/filename.b37_1kg.sorted.bam \ O=/output/filename.b37_1kg.MeanQualityByCycle \ CHART=/output/filename.b37_1kg.MeanQualityByCycle.pdf \ VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=/local
QualityScoreDistribution
java -jar /tools/picard-tools-1.32/QualityScoreDistribution.jar \ I=/output/filename.b37_1kg.sorted.bam \ O=/output/filename.b37_1kg.[wiki:QualityScoreDistribution] \ CHART=/output/filename.b37_1kg.[wiki:QualityScoreDistribution].pdf \ VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=/local
BamIndexStats
java -jar /tools/picard-tools-1.32/BamIndexStats.jar \ INPUT=/output/filename.b37_1kg.sorted.bam \ VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=/local
CalculateHsMetricsWholeGenome
java -jar -Xmx3g /tools/picard-tools-1.32/CalculateHsMetricsWholeGenome.jar \ INPUT=/output/filename.b37_1kg.sorted.bam \ OUTPUT=/output/filename.b37_1kg.HsMetrics \ BAIT_INTERVALS=/resources//hg19/intervals/GoNL.interval_list \ TARGET_INTERVALS=/resources//hg19/intervals/GoNL.interval_list \ VALIDATION_STRINGENCY=LENIENT \ TMP_DIR=/local
Split BAMS by Chromosome
BAMs are split into chromosomes BAMs. These files move on to variant calling.
Compress BAMS with ReduceRead
Variant Calling
UnifiedGenotyper
java -jar GenomeAnalysisTK.jar \ -R resources/Homo_sapiens_assembly18.fasta \ -T UnifiedGenotyper \ -I sample1.bam [-I sample2.bam ...] \ --dbsnp dbSNP.vcf \ -o snps.raw.vcf \ -stand_call_conf [50.0] \ -stand_emit_conf 10.0 \ -dcov [50 for 4x, 200 for >30x WGS or Whole exome] \ [-L targets.interval_list]
VariantRecalibrator
java -Xmx4g -jar GenomeAnalysisTK.jar \ -T VariantRecalibrator \ -R reference/human_g1k_v37.fasta \ -input NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.vcf \ -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.sites.vcf \ -resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.b37.sites.vcf \ -resource:dbsnp,known=true,training=false,truth=false,prior=6.0 dbsnp_135.b37.vcf \ -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ -an InbreedingCoeff \ -mode SNP \ -recalFile path/to/output.recal \ -tranchesFile path/to/output.tranches \ -rscriptFile path/to/output.plots.R
ApplyRecalibration
java -Xmx3g -jar GenomeAnalysisTK.jar \ -T ApplyRecalibration \ -R reference/human_g1k_v37.fasta \ -input NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.b37.vcf \ --ts_filter_level 99.0 \ -tranchesFile path/to/output.tranches \ -recalFile path/to/output.recal \ -mode SNP \ -o path/to/output.recalibrated.filtered.vcf