UGP Variant Pipeline 1.4.0

From Utah Genome Project Wiki
Jump to navigation Jump to search

Utah Genome Project

Oct. 2015
Variant Calling Pipeline Version 1.4.0

Software Versions

  • UGPp is a lightweight NGS pipeline, created for the
 Utah Genome Project (UGP)
  • BWA: 0.7.10-r789
  • GATK: 3.3-0-g37228af
  • SamTools: 1.2-192-gfcaafe0 (using htslib 1.2.1-194-gf859e8d)
  • Samblaster: Version 0.1.21
  • Sambamba: v0.5.1
  • FastQC: v0.11.2
  • Tabix: 0.2.5 (r1005)
  • WHAM: v1.7.0-160-g2a51

Data Source

Data sets used for the variant calling pipeline come from the Broad GSA (GATK) group as the 'GATK resource bundle version 2.8

wget -r ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37

Reference Genome (GRCh37) Now includes phix sequence.

  • human_g1k_v37_decoy.fasta

Call region file generated from NCBI

  • GRCh37 GFF3

VCF files for RealignerTargetCreator knowns and dbsnp for BaseRecalibrator.

  • known_indel: Mills_and_1000G_gold_standard.indels.b37.vcf
  • known_indel: 1000G_phase1.indels.b37.vcf
  • known_dbsnp: dbsnp_137.b37.vcf

Background Files

  • We have created 1000Genomes (BWA mem/GATK 3.0+) background files to be ran concurrently with the GenotypeGVCFs step.

Groups Currently completed:

  • CEU
  • GBR
  • FIN

Version 1.0.5 background files have been updated to show only the indviduals of each group, not the file names.

This is a complete list of the background individuals for run completed > 1.0.5 [1]

If you would like a copy of the current files, we have made a public AWS s3 bucket
Using s3cmd execute the following command: 
s3cmd get s3://ugp-1k-backgrounds --recursive
Alternatively to access the files without s3cmd the following use the following URLs:
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/CEU_mergeGvcf.vcf
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/CEU_mergeGvcf.vcf.idx
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/FIN_mergeGvcf.vcf
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/FIN_mergeGvcf.vcf.idx
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/GBR_mergeGvcf.vcf
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/GBR_mergeGvcf.vcf.idx

Resource files for VariantRecalibrator_SNP

  • hapmap_3.3.b37.vcf
  • 1000G_omni2.5.b37.vcf
  • 1000G_phase1.snps.high_confidence.b37.vcf

Resource files for VariantRecalibrator_INDEL

  • Mills_and_1000G_gold_standard.indels.b37.vcf
  • 1000G_phase1.indels.b37.vcf

Indexing

The following indexing is required using BWA, Picard and SamTools. GATK requires all three. However this step only needs to be done once "per-machine".

  • BWA
bwa index -a bwtsw human_g1k_v37_decoy.fasta
  • Picard
java -jar CreateSequenceDictionary.jar R=human_g1k_v37_decoy.fasta O=human_g1k_v37_decoy.dic
  • SamTools
samtools faidx human_g1k_v37_decoy.fasta

Fasta Sequences

The fasta file currently used is (including decoys):

  • human_g1k_v37_decoy.fasta

A list of fasta headers can be found here [2]

FastQ File Analyses

fastqc --threads # -o <output dir> -f fastq fastq-file.fq.gz 2> fastqc_run.log-#

Alignment

Align reads to the genome with bwa.

The 'BWA-mem' program will find the reference coordinates of the input reads (independent of their mate-pair). The following parameters are those used by the 1KG project and GATK for aligning Illumina data.

bwa mem -R "read group" human_g1k_v37_decoy.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq | samblaster --addMateTags |sambamba view -f bam -l 0 -S /dev/stdin | 
sambamba sort -m 50G -o *_#_sorted_Dedup.bam /dev/stdin

The --addMateTags has been added to samblaster for lumpy [3] compatibility.

BAM File Analyses

  • samtools stats has now replaced idxstats.
samtools stats [sorted dedup bam files] > dedup-bamfile.stats
  • flagstat

/scratch/ucgd/lustre/u0413537/software/samtools-0.1.19/

samtools flagstat _#_sorted_Dedup.bam > _#_sorted_Dedup.flagstat 2> flagstat.log-#

BAM Merging

  • sambamba

Merging lane BAM files into single individual BAM file.

sambamba merge  
--nthreads # 
individual_#_sorted_Dedup_merged.bam 
[individual_#_sorted_Dedup.bam files]
  • RealignerTargetCreator

This is where the tasks are broken into chromosomal regions.

java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar 
   -T RealignerTargetCreator 
   -R human_g1k_v37_decoy.fasta 
   -I *sorted_Dedup_merged.bam 
   --num_threads #  
   --known Mills_and_1000G_gold_standard.indels.b37.vcf
   --known 1000G_phase1.indels.b37.vcf  
   -L chr#_region_file.list 
   -o *_chr#_realign.intervals 
   2> RealignerTargetCreator.log-#

Notes: Should we use more known sites (dbSNP, Exac, H1K sites)?

  • IndelRealigner
java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar 
   -T IndelRealigner 
   -R human_g1k_v37_decoy.fasta 
   -I *_sorted_Dedup_merged.bam
   -L chr#_region_file.list
   -targetIntervals *_chr#_realign.intervals 
   -known Mills_and_1000G_gold_standard.indels.b37.vcf 
   -known 1000G_phase1.indels.b37.vcf 
   -o *_sorted_Dedup_merged_realign_chr#.bam 
   2> IndelRealigner.log-1

Notes: Should we use more known sites (dbSNP, Exac, H1K sites)?

  • BaseRecalibrator
java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar 
   -T BaseRecalibrator 
   -R /human_g1k_v37_decoy.fasta 
   -I *_sorted_Dedup_merged_realign_chr#.bam 
   --num_cpu_threads_per_data_thread #  
   --knownSites dbsnp_137.b37.vcf
   --knownSites Mills_and_1000G_gold_standard.indels.b37.vcf 
   --knownSites 1000G_phase1.indels.b37.vcf  
   -o *_sorted_Dedup_merged_realign_chr#_recal_data.table 
   &> BaseRecalibrator.log-#

Notes: Should we use more known sites (dbSNP, Exac, H1K sites)?

  • PrintReads
java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar 
   -T PrintReads 
   -R human_g1k_v37_decoy.fasta 
   -I *_sorted_Dedup_merged_realign_chr#.bam
   --num_cpu_threads_per_data_thread #  
   -BQSR *_sorted_Dedup_merged_realign_chr#_recal_data.table 
   -o *_sorted_Dedup_merged_realign_chr#_recal.bam 
   &> PrintReads.log-#

GATK Variant Calling

  • HaplotypeCaller

Now HaplotypeCaller handels SNP and INDEL calls.

java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar 
   -T HaplotypeCaller 
   -R human_g1k_v37_decoy.fasta 
   --min_base_quality_score 20 
   --variant_index_parameter 128000 
   --emitRefConfidence GVCF 
   --standard_min_confidence_threshold_for_calling 30.0 
   --num_cpu_threads_per_data_thread # 
   --variant_index_type LINEAR 
   --standard_min_confidence_threshold_for_emitting 30.0  
   -I *_sorted_Dedup_merged_realign_chr*_recal.bam 
   -L chr#_region_file.list
   -o chr#_region_*.raw.snps.indels.gvcf 
   &> HaplotypeCaller.log-#

Notes: Do we need --variant_index_parameter 128000, --variant_index_type LINEAR, are min_confidence_thresholds set too high?

  • CatVariants

Collect all individual gvcf files. These will be the files kept for GenotypeGVCF and future GenotypeGVCF re-runs.

Special note here: the final "gvcf" are called gCat.vcf.

java -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants 
   -R human_g1k_v37_decoy.fasta 
   --assumeSorted  
   -V chr#_region_*.raw.snps.indels.vcf 
   -V [ ... ]
   -out <individual>.raw.snps.indels.gCat.vcf 
   &> CatVariants.log-*
  • CombineGVCFs

This step will only be ran on > 200 individual gCat call sets.

java -jar -Xmx#g -XX:ParallelGCThreads=# GenomeAnalysisTK.jar  
   -T CombineGVCFs 
   -R human_g1k_v37_decoy.fasta 
   --variant *.raw.snps.indels.gCat.vcf
   --variant [ ... ]
   -o UGPp_*_mergeGvcf.vcf 
   &> CombineGVCF.log-*
  • GenotypeGVCFs
java -jar -Xmx#g -XX:ParallelGCThreads=# GenomeAnalysisTK.jar 
   -T GenotypeGVCFs 
   -R human_g1k_v37_decoy.fasta 
   --num_threads #  
   --variant [UGPp_*_mergeGvcf.vcf or <individual>.raw.snps.indels.gCat.vcf]
   --variant CEU_mergeGvcf.vcf 
   --variant GBR_mergeGvcf.vcf 
   --variant FIN_mergeGvcf.vcf 
   -L chr#_region_file.list
   -o chr#_region_genotyped.vcf
   &> GenotypeGVCF.log-#
  • Combine_Genotyped

Collect all the individual genotype steps.

java -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants 
   -R human_g1k_v37_decoy.fasta 
   --assumeSorted  
   -V chr#_region_genotyped.vcf
   -V [ ... ] 
   -out UGPp_*_cat_genotyped.vcf 
   &> Combine_Genotyped.log-#

VariantRecalibrator

  • SNP Recalibration
java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=tmp GenomeAnalysisTK.jar  
   -T VariantRecalibrator 
   -R human_g1k_v37_decoy.fasta 
   --minNumBadVariants 5000 
   --num_threads #  
   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf 
   -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf 
   -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf 
   -an QD 
   -an MQRankSum 
   -an ReadPosRankSum 
   -an FS 
   -an InbreedingCoeff 
   -input UGPp_*_cat_genotyped.vcf 
   -recalFile UGPp_*_snp_recal 
   -tranchesFile UGPp_*_snp_tranches 
   -rscriptFile UGPp_*_snp_plots.R 
   -mode SNP 
   &> VariantRecalibrator_SNP.log-#
  • INDEL Recalibration
java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar 
   -T VariantRecalibrator 
   -R human_g1k_v37_decoy.fasta 
   --minNumBadVariants 5000 
   --num_threads #  
   -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf 
   -resource:1000G,known=false,training=true,truth=true,prior=10.0 1000G_phase1.indels.b37.vcf 
   -an MQRankSum 
   -an ReadPosRankSum 
   -an FS 
   -input UGPp_*_cat_genotyped.vcf 
   -recalFile UGPp_*_indel_recal 
   -tranchesFile UGPp_*_indel_tranches 
   -rscriptFile UGPp_*_indel_plots.R 
   -mode INDEL 
   &> VariantRecalibrator_INDEL.log-#

ApplyRecalibration

  • SNP Apply
java -jar -Xmx#g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar 
   -T ApplyRecalibration 
   -R human_g1k_v37_decoy.fasta 
   --ts_filter_level 99.5 
   --excludeFiltered 
   --num_threads #  
   -input UGPp_*_cat_genotyped.vcf 
   -recalFile UGPp_*_snp_recal 
   -tranchesFile UGPp_*_snp_tranches 
   -mode SNP 
   -o UGPp_*_recal_SNP.vcf 
   &> ApplyRecalibration_SNP.log-*
  • INDEL Apply
java -jar -Xmx#g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
   -T ApplyRecalibration 
   -R human_g1k_v37_decoy.fasta 
   --ts_filter_level 99.0 
   --excludeFiltered 
   --num_threads #  
   -input UGPp_*_cat_genotyped.vcf 
   -recalFile UGPp_*_indel_recal 
   -tranchesFile UGPp_*_indel_tranches 
   -mode INDEL 
   -o UGPp_*_recal_INDEL.vcf 
   &> ApplyRecalibration_INDEL.log-*

CombineVarients

These command will combine INDEL and SNP files into a single VCF file.

  • CombineVarients
java -jar -Xmx#g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar 
   -T CombineVariants 
   -R human_g1k_v37_decoy.fasta 
   --num_threads # 
   --genotypemergeoption UNSORTED 
   --variant UGPp_*_recal_SNP.vcf  
   --variant UGPp_*_recal_INDEL.vcf  
   -o UGPp_*_Final+Backgrouds.vcf 
   &> CombineVariants.log-*

Tabix

A tabix index file is now be generated.

bgzip -c UGPp_*_Final+Backgrouds.vcf > UGPp_*_Final+Backgrouds.vcf.gz
tabix -p vcf UGPp_*_Final+Backgrouds.vcf.gz

BAM Merge

This step uses sambamba to merge split chromosome BAMs into a single file.

sambamba merge -t #
individual_sorted_Dedup_realign_recal.bam
[individual_sorted_Dedup_realign_recal_chr#.bam]
[...]
2> sambamba_bam_merge.log-#

WHAM

The first SV caller we have added is WHAM [4].

  • WHAM-GRAPHENING

First we generate individual level non-genotyped calls.

WHAM-GRAPHENING -a 
human_g1k_v37_decoy.fasta 
-k 
-x 4 
-f individual_sorted_Dedup_realign_recal_chr#.bam 
> individual_WHAM.vcf 
2> wham_graphing.log-#
  • Concatenate WHAM VCF

Next all individuals are concatenated together.

cat
[individual_WHAM.vcf]
[...]
> UGPp_*_merged.WHAM.vcf
  • Sort WHAM VCF

Then merged vcf file are sorted.

grep -v '^#' 
UGPp_*_merged.WHAM.vcf 
| sort -k1,1 -k2,2n 
> UGPp_*_merged_sorted.WHAM.vcf
  • WHAM mergeIndvs

Finally all individuals are merge together.

mergeIndvs 
-f UGPp_*_merged_sorted.WHAM.vcf
> UGPp_*_mergeIndvs.WHAM.vcf

At this point WHAM calls would be genotyped, but for version 1.4.0 it would be considered an analysis step.