UGP Variant Pipeline 1.4.0
Contents
Utah Genome Project
Oct. 2015 Variant Calling Pipeline Version 1.4.0
Software Versions
- UGPp is a lightweight NGS pipeline, created for the Utah Genome Project (UGP)
- BWA: 0.7.10-r789
- GATK: 3.3-0-g37228af
- SamTools: 1.2-192-gfcaafe0 (using htslib 1.2.1-194-gf859e8d)
- Samblaster: Version 0.1.21
- Sambamba: v0.5.1
- FastQC: v0.11.2
- Tabix: 0.2.5 (r1005)
- WHAM: v1.7.0-160-g2a51
Data Source
Data sets used for the variant calling pipeline come from the Broad GSA (GATK) group as the 'GATK resource bundle version 2.8
wget -r ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37
Reference Genome (GRCh37) Now includes phix sequence.
- human_g1k_v37_decoy.fasta
Call region file generated from NCBI
- GRCh37 GFF3
VCF files for RealignerTargetCreator knowns and dbsnp for BaseRecalibrator.
- known_indel: Mills_and_1000G_gold_standard.indels.b37.vcf
- known_indel: 1000G_phase1.indels.b37.vcf
- known_dbsnp: dbsnp_137.b37.vcf
Background Files
- We have created 1000Genomes (BWA mem/GATK 3.0+) background files to be ran concurrently with the GenotypeGVCFs step.
Groups Currently completed:
- CEU
- GBR
- FIN
Version 1.0.5 background files have been updated to show only the indviduals of each group, not the file names.
This is a complete list of the background individuals for run completed > 1.0.5 [1]
If you would like a copy of the current files, we have made a public AWS s3 bucket Using s3cmd execute the following command: s3cmd get s3://ugp-1k-backgrounds --recursive
Alternatively to access the files without s3cmd the following use the following URLs: http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/CEU_mergeGvcf.vcf http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/CEU_mergeGvcf.vcf.idx http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/FIN_mergeGvcf.vcf http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/FIN_mergeGvcf.vcf.idx http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/GBR_mergeGvcf.vcf http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/GBR_mergeGvcf.vcf.idx
Resource files for VariantRecalibrator_SNP
- hapmap_3.3.b37.vcf
- 1000G_omni2.5.b37.vcf
- 1000G_phase1.snps.high_confidence.b37.vcf
Resource files for VariantRecalibrator_INDEL
- Mills_and_1000G_gold_standard.indels.b37.vcf
- 1000G_phase1.indels.b37.vcf
Indexing
The following indexing is required using BWA, Picard and SamTools. GATK requires all three. However this step only needs to be done once "per-machine".
- BWA
bwa index -a bwtsw human_g1k_v37_decoy.fasta
- Picard
java -jar CreateSequenceDictionary.jar R=human_g1k_v37_decoy.fasta O=human_g1k_v37_decoy.dic
- SamTools
samtools faidx human_g1k_v37_decoy.fasta
Fasta Sequences
The fasta file currently used is (including decoys):
- human_g1k_v37_decoy.fasta
A list of fasta headers can be found here [2]
FastQ File Analyses
fastqc --threads # -o <output dir> -f fastq fastq-file.fq.gz 2> fastqc_run.log-#
Alignment
Align reads to the genome with bwa.
The 'BWA-mem' program will find the reference coordinates of the input reads (independent of their mate-pair). The following parameters are those used by the 1KG project and GATK for aligning Illumina data.
bwa mem -R "read group" human_g1k_v37_decoy.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq | samblaster --addMateTags |sambamba view -f bam -l 0 -S /dev/stdin | sambamba sort -m 50G -o *_#_sorted_Dedup.bam /dev/stdin
The --addMateTags has been added to samblaster for lumpy [3] compatibility.
BAM File Analyses
- samtools stats has now replaced idxstats.
samtools stats [sorted dedup bam files] > dedup-bamfile.stats
- flagstat
/scratch/ucgd/lustre/u0413537/software/samtools-0.1.19/
samtools flagstat _#_sorted_Dedup.bam > _#_sorted_Dedup.flagstat 2> flagstat.log-#
BAM Merging
- sambamba
Merging lane BAM files into single individual BAM file.
sambamba merge --nthreads # individual_#_sorted_Dedup_merged.bam [individual_#_sorted_Dedup.bam files]
- RealignerTargetCreator
This is where the tasks are broken into chromosomal regions.
java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T RealignerTargetCreator -R human_g1k_v37_decoy.fasta -I *sorted_Dedup_merged.bam --num_threads # --known Mills_and_1000G_gold_standard.indels.b37.vcf --known 1000G_phase1.indels.b37.vcf -L chr#_region_file.list -o *_chr#_realign.intervals 2> RealignerTargetCreator.log-#
Notes: Should we use more known sites (dbSNP, Exac, H1K sites)?
- IndelRealigner
java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T IndelRealigner -R human_g1k_v37_decoy.fasta -I *_sorted_Dedup_merged.bam -L chr#_region_file.list -targetIntervals *_chr#_realign.intervals -known Mills_and_1000G_gold_standard.indels.b37.vcf -known 1000G_phase1.indels.b37.vcf -o *_sorted_Dedup_merged_realign_chr#.bam 2> IndelRealigner.log-1
Notes: Should we use more known sites (dbSNP, Exac, H1K sites)?
- BaseRecalibrator
java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T BaseRecalibrator -R /human_g1k_v37_decoy.fasta -I *_sorted_Dedup_merged_realign_chr#.bam --num_cpu_threads_per_data_thread # --knownSites dbsnp_137.b37.vcf --knownSites Mills_and_1000G_gold_standard.indels.b37.vcf --knownSites 1000G_phase1.indels.b37.vcf -o *_sorted_Dedup_merged_realign_chr#_recal_data.table &> BaseRecalibrator.log-#
Notes: Should we use more known sites (dbSNP, Exac, H1K sites)?
- PrintReads
java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T PrintReads -R human_g1k_v37_decoy.fasta -I *_sorted_Dedup_merged_realign_chr#.bam --num_cpu_threads_per_data_thread # -BQSR *_sorted_Dedup_merged_realign_chr#_recal_data.table -o *_sorted_Dedup_merged_realign_chr#_recal.bam &> PrintReads.log-#
GATK Variant Calling
- HaplotypeCaller
Now HaplotypeCaller handels SNP and INDEL calls.
java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T HaplotypeCaller -R human_g1k_v37_decoy.fasta --min_base_quality_score 20 --variant_index_parameter 128000 --emitRefConfidence GVCF --standard_min_confidence_threshold_for_calling 30.0 --num_cpu_threads_per_data_thread # --variant_index_type LINEAR --standard_min_confidence_threshold_for_emitting 30.0 -I *_sorted_Dedup_merged_realign_chr*_recal.bam -L chr#_region_file.list -o chr#_region_*.raw.snps.indels.gvcf &> HaplotypeCaller.log-#
Notes: Do we need --variant_index_parameter 128000, --variant_index_type LINEAR, are min_confidence_thresholds set too high?
- CatVariants
Collect all individual gvcf files. These will be the files kept for GenotypeGVCF and future GenotypeGVCF re-runs.
Special note here: the final "gvcf" are called gCat.vcf.
java -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants -R human_g1k_v37_decoy.fasta --assumeSorted -V chr#_region_*.raw.snps.indels.vcf -V [ ... ] -out <individual>.raw.snps.indels.gCat.vcf &> CatVariants.log-*
- CombineGVCFs
This step will only be ran on > 200 individual gCat call sets.
java -jar -Xmx#g -XX:ParallelGCThreads=# GenomeAnalysisTK.jar -T CombineGVCFs -R human_g1k_v37_decoy.fasta --variant *.raw.snps.indels.gCat.vcf --variant [ ... ] -o UGPp_*_mergeGvcf.vcf &> CombineGVCF.log-*
- GenotypeGVCFs
java -jar -Xmx#g -XX:ParallelGCThreads=# GenomeAnalysisTK.jar -T GenotypeGVCFs -R human_g1k_v37_decoy.fasta --num_threads # --variant [UGPp_*_mergeGvcf.vcf or <individual>.raw.snps.indels.gCat.vcf] --variant CEU_mergeGvcf.vcf --variant GBR_mergeGvcf.vcf --variant FIN_mergeGvcf.vcf -L chr#_region_file.list -o chr#_region_genotyped.vcf &> GenotypeGVCF.log-#
- Combine_Genotyped
Collect all the individual genotype steps.
java -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants -R human_g1k_v37_decoy.fasta --assumeSorted -V chr#_region_genotyped.vcf -V [ ... ] -out UGPp_*_cat_genotyped.vcf &> Combine_Genotyped.log-#
VariantRecalibrator
- SNP Recalibration
java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=tmp GenomeAnalysisTK.jar -T VariantRecalibrator -R human_g1k_v37_decoy.fasta --minNumBadVariants 5000 --num_threads # -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf -an QD -an MQRankSum -an ReadPosRankSum -an FS -an InbreedingCoeff -input UGPp_*_cat_genotyped.vcf -recalFile UGPp_*_snp_recal -tranchesFile UGPp_*_snp_tranches -rscriptFile UGPp_*_snp_plots.R -mode SNP &> VariantRecalibrator_SNP.log-#
- INDEL Recalibration
java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T VariantRecalibrator -R human_g1k_v37_decoy.fasta --minNumBadVariants 5000 --num_threads # -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf -resource:1000G,known=false,training=true,truth=true,prior=10.0 1000G_phase1.indels.b37.vcf -an MQRankSum -an ReadPosRankSum -an FS -input UGPp_*_cat_genotyped.vcf -recalFile UGPp_*_indel_recal -tranchesFile UGPp_*_indel_tranches -rscriptFile UGPp_*_indel_plots.R -mode INDEL &> VariantRecalibrator_INDEL.log-#
ApplyRecalibration
- SNP Apply
java -jar -Xmx#g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T ApplyRecalibration -R human_g1k_v37_decoy.fasta --ts_filter_level 99.5 --excludeFiltered --num_threads # -input UGPp_*_cat_genotyped.vcf -recalFile UGPp_*_snp_recal -tranchesFile UGPp_*_snp_tranches -mode SNP -o UGPp_*_recal_SNP.vcf &> ApplyRecalibration_SNP.log-*
- INDEL Apply
java -jar -Xmx#g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T ApplyRecalibration -R human_g1k_v37_decoy.fasta --ts_filter_level 99.0 --excludeFiltered --num_threads # -input UGPp_*_cat_genotyped.vcf -recalFile UGPp_*_indel_recal -tranchesFile UGPp_*_indel_tranches -mode INDEL -o UGPp_*_recal_INDEL.vcf &> ApplyRecalibration_INDEL.log-*
CombineVarients
These command will combine INDEL and SNP files into a single VCF file.
- CombineVarients
java -jar -Xmx#g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar -T CombineVariants -R human_g1k_v37_decoy.fasta --num_threads # --genotypemergeoption UNSORTED --variant UGPp_*_recal_SNP.vcf --variant UGPp_*_recal_INDEL.vcf -o UGPp_*_Final+Backgrouds.vcf &> CombineVariants.log-*
Tabix
A tabix index file is now be generated.
bgzip -c UGPp_*_Final+Backgrouds.vcf > UGPp_*_Final+Backgrouds.vcf.gz
tabix -p vcf UGPp_*_Final+Backgrouds.vcf.gz
BAM Merge
This step uses sambamba to merge split chromosome BAMs into a single file.
sambamba merge -t # individual_sorted_Dedup_realign_recal.bam [individual_sorted_Dedup_realign_recal_chr#.bam] [...] 2> sambamba_bam_merge.log-#
WHAM
The first SV caller we have added is WHAM [4].
- WHAM-GRAPHENING
First we generate individual level non-genotyped calls.
WHAM-GRAPHENING -a human_g1k_v37_decoy.fasta -k -x 4 -f individual_sorted_Dedup_realign_recal_chr#.bam > individual_WHAM.vcf 2> wham_graphing.log-#
- Concatenate WHAM VCF
Next all individuals are concatenated together.
cat [individual_WHAM.vcf] [...] > UGPp_*_merged.WHAM.vcf
- Sort WHAM VCF
Then merged vcf file are sorted.
grep -v '^#' UGPp_*_merged.WHAM.vcf | sort -k1,1 -k2,2n > UGPp_*_merged_sorted.WHAM.vcf
- WHAM mergeIndvs
Finally all individuals are merge together.
mergeIndvs -f UGPp_*_merged_sorted.WHAM.vcf > UGPp_*_mergeIndvs.WHAM.vcf
At this point WHAM calls would be genotyped, but for version 1.4.0 it would be considered an analysis step.