Difference between revisions of "UGP Variant Pipeline 1.4.0"

From Utah Genome Project Wiki
Jump to navigation Jump to search
 
(8 intermediate revisions by the same user not shown)
Line 82: Line 82:
  
 
*human_g1k_v37_decoy.fasta  
 
*human_g1k_v37_decoy.fasta  
A list of headers can be found here [https://github.com/srynobio/UGPp/blob/gh-pages/fasta_sequences_regions%20and%20decoys]
+
 
 +
A list of fasta headers can be found here [http://weatherby.genetics.utah.edu/UGP/wiki/index.php/Fasta_sequences_regions_and_decoys]
  
 
== FastQ File Analyses ==
 
== FastQ File Analyses ==
Line 95: Line 96:
  
 
  bwa mem -R "read group" human_g1k_v37_decoy.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq | samblaster --addMateTags |sambamba view -f bam -l 0 -S /dev/stdin |  
 
  bwa mem -R "read group" human_g1k_v37_decoy.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq | samblaster --addMateTags |sambamba view -f bam -l 0 -S /dev/stdin |  
  sambamba sort -m 50G -o *_#_sorted_Dedup.bam
+
  sambamba sort -m 50G -o *_#_sorted_Dedup.bam /dev/stdin
  
 
''The --addMateTags has been added to samblaster for lumpy [https://github.com/arq5x/lumpy-sv] compatibility.''
 
''The --addMateTags has been added to samblaster for lumpy [https://github.com/arq5x/lumpy-sv] compatibility.''
Line 131: Line 132:
 
     -o *_chr#_realign.intervals  
 
     -o *_chr#_realign.intervals  
 
     2> RealignerTargetCreator.log-#
 
     2> RealignerTargetCreator.log-#
 +
 +
Notes: Should we use more known sites (dbSNP, Exac, H1K sites)?
  
 
*IndelRealigner
 
*IndelRealigner
Line 144: Line 147:
 
     -o *_sorted_Dedup_merged_realign_chr#.bam  
 
     -o *_sorted_Dedup_merged_realign_chr#.bam  
 
     2> IndelRealigner.log-1
 
     2> IndelRealigner.log-1
 +
 +
Notes: Should we use more known sites (dbSNP, Exac, H1K sites)?
  
 
*BaseRecalibrator
 
*BaseRecalibrator
Line 157: Line 162:
 
     -o *_sorted_Dedup_merged_realign_chr#_recal_data.table  
 
     -o *_sorted_Dedup_merged_realign_chr#_recal_data.table  
 
     &> BaseRecalibrator.log-#
 
     &> BaseRecalibrator.log-#
 +
 +
Notes: Should we use more known sites (dbSNP, Exac, H1K sites)?
  
 
*PrintReads
 
*PrintReads
Line 188: Line 195:
 
     -o chr#_region_*.raw.snps.indels.gvcf  
 
     -o chr#_region_*.raw.snps.indels.gvcf  
 
     &> HaplotypeCaller.log-#
 
     &> HaplotypeCaller.log-#
 +
 +
Notes: Do we need --variant_index_parameter 128000, --variant_index_type LINEAR, are min_confidence_thresholds set too high?
  
 
*CatVariants
 
*CatVariants
Line 370: Line 379:
 
*Sort WHAM VCF
 
*Sort WHAM VCF
 
''Then merged vcf file are sorted.''
 
''Then merged vcf file are sorted.''
 
At this point WHAM call would be genotyped, but for version 1.4.0 it would be considered an analysis step.
 
  
 
  grep -v '^#'  
 
  grep -v '^#'  
Line 384: Line 391:
 
  -f UGPp_*_merged_sorted.WHAM.vcf
 
  -f UGPp_*_merged_sorted.WHAM.vcf
 
  > UGPp_*_mergeIndvs.WHAM.vcf
 
  > UGPp_*_mergeIndvs.WHAM.vcf
 +
 +
At this point WHAM calls would be genotyped, but for version 1.4.0 it would be considered an analysis step.

Latest revision as of 20:26, 14 July 2016

Utah Genome Project

Oct. 2015
Variant Calling Pipeline Version 1.4.0

Software Versions

  • UGPp is a lightweight NGS pipeline, created for the
 Utah Genome Project (UGP)
  • BWA: 0.7.10-r789
  • GATK: 3.3-0-g37228af
  • SamTools: 1.2-192-gfcaafe0 (using htslib 1.2.1-194-gf859e8d)
  • Samblaster: Version 0.1.21
  • Sambamba: v0.5.1
  • FastQC: v0.11.2
  • Tabix: 0.2.5 (r1005)
  • WHAM: v1.7.0-160-g2a51

Data Source

Data sets used for the variant calling pipeline come from the Broad GSA (GATK) group as the 'GATK resource bundle version 2.8

wget -r ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37

Reference Genome (GRCh37) Now includes phix sequence.

  • human_g1k_v37_decoy.fasta

Call region file generated from NCBI

  • GRCh37 GFF3

VCF files for RealignerTargetCreator knowns and dbsnp for BaseRecalibrator.

  • known_indel: Mills_and_1000G_gold_standard.indels.b37.vcf
  • known_indel: 1000G_phase1.indels.b37.vcf
  • known_dbsnp: dbsnp_137.b37.vcf

Background Files

  • We have created 1000Genomes (BWA mem/GATK 3.0+) background files to be ran concurrently with the GenotypeGVCFs step.

Groups Currently completed:

  • CEU
  • GBR
  • FIN

Version 1.0.5 background files have been updated to show only the indviduals of each group, not the file names.

This is a complete list of the background individuals for run completed > 1.0.5 [1]

If you would like a copy of the current files, we have made a public AWS s3 bucket
Using s3cmd execute the following command: 
s3cmd get s3://ugp-1k-backgrounds --recursive
Alternatively to access the files without s3cmd the following use the following URLs:
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/CEU_mergeGvcf.vcf
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/CEU_mergeGvcf.vcf.idx
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/FIN_mergeGvcf.vcf
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/FIN_mergeGvcf.vcf.idx
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/GBR_mergeGvcf.vcf
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/GBR_mergeGvcf.vcf.idx

Resource files for VariantRecalibrator_SNP

  • hapmap_3.3.b37.vcf
  • 1000G_omni2.5.b37.vcf
  • 1000G_phase1.snps.high_confidence.b37.vcf

Resource files for VariantRecalibrator_INDEL

  • Mills_and_1000G_gold_standard.indels.b37.vcf
  • 1000G_phase1.indels.b37.vcf

Indexing

The following indexing is required using BWA, Picard and SamTools. GATK requires all three. However this step only needs to be done once "per-machine".

  • BWA
bwa index -a bwtsw human_g1k_v37_decoy.fasta
  • Picard
java -jar CreateSequenceDictionary.jar R=human_g1k_v37_decoy.fasta O=human_g1k_v37_decoy.dic
  • SamTools
samtools faidx human_g1k_v37_decoy.fasta

Fasta Sequences

The fasta file currently used is (including decoys):

  • human_g1k_v37_decoy.fasta

A list of fasta headers can be found here [2]

FastQ File Analyses

fastqc --threads # -o <output dir> -f fastq fastq-file.fq.gz 2> fastqc_run.log-#

Alignment

Align reads to the genome with bwa.

The 'BWA-mem' program will find the reference coordinates of the input reads (independent of their mate-pair). The following parameters are those used by the 1KG project and GATK for aligning Illumina data.

bwa mem -R "read group" human_g1k_v37_decoy.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq | samblaster --addMateTags |sambamba view -f bam -l 0 -S /dev/stdin | 
sambamba sort -m 50G -o *_#_sorted_Dedup.bam /dev/stdin

The --addMateTags has been added to samblaster for lumpy [3] compatibility.

BAM File Analyses

  • samtools stats has now replaced idxstats.
samtools stats [sorted dedup bam files] > dedup-bamfile.stats
  • flagstat

/scratch/ucgd/lustre/u0413537/software/samtools-0.1.19/

samtools flagstat _#_sorted_Dedup.bam > _#_sorted_Dedup.flagstat 2> flagstat.log-#

BAM Merging

  • sambamba

Merging lane BAM files into single individual BAM file.

sambamba merge  
--nthreads # 
individual_#_sorted_Dedup_merged.bam 
[individual_#_sorted_Dedup.bam files]
  • RealignerTargetCreator

This is where the tasks are broken into chromosomal regions.

java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar 
   -T RealignerTargetCreator 
   -R human_g1k_v37_decoy.fasta 
   -I *sorted_Dedup_merged.bam 
   --num_threads #  
   --known Mills_and_1000G_gold_standard.indels.b37.vcf
   --known 1000G_phase1.indels.b37.vcf  
   -L chr#_region_file.list 
   -o *_chr#_realign.intervals 
   2> RealignerTargetCreator.log-#

Notes: Should we use more known sites (dbSNP, Exac, H1K sites)?

  • IndelRealigner
java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar 
   -T IndelRealigner 
   -R human_g1k_v37_decoy.fasta 
   -I *_sorted_Dedup_merged.bam
   -L chr#_region_file.list
   -targetIntervals *_chr#_realign.intervals 
   -known Mills_and_1000G_gold_standard.indels.b37.vcf 
   -known 1000G_phase1.indels.b37.vcf 
   -o *_sorted_Dedup_merged_realign_chr#.bam 
   2> IndelRealigner.log-1

Notes: Should we use more known sites (dbSNP, Exac, H1K sites)?

  • BaseRecalibrator
java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar 
   -T BaseRecalibrator 
   -R /human_g1k_v37_decoy.fasta 
   -I *_sorted_Dedup_merged_realign_chr#.bam 
   --num_cpu_threads_per_data_thread #  
   --knownSites dbsnp_137.b37.vcf
   --knownSites Mills_and_1000G_gold_standard.indels.b37.vcf 
   --knownSites 1000G_phase1.indels.b37.vcf  
   -o *_sorted_Dedup_merged_realign_chr#_recal_data.table 
   &> BaseRecalibrator.log-#

Notes: Should we use more known sites (dbSNP, Exac, H1K sites)?

  • PrintReads
java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar 
   -T PrintReads 
   -R human_g1k_v37_decoy.fasta 
   -I *_sorted_Dedup_merged_realign_chr#.bam
   --num_cpu_threads_per_data_thread #  
   -BQSR *_sorted_Dedup_merged_realign_chr#_recal_data.table 
   -o *_sorted_Dedup_merged_realign_chr#_recal.bam 
   &> PrintReads.log-#

GATK Variant Calling

  • HaplotypeCaller

Now HaplotypeCaller handels SNP and INDEL calls.

java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar 
   -T HaplotypeCaller 
   -R human_g1k_v37_decoy.fasta 
   --min_base_quality_score 20 
   --variant_index_parameter 128000 
   --emitRefConfidence GVCF 
   --standard_min_confidence_threshold_for_calling 30.0 
   --num_cpu_threads_per_data_thread # 
   --variant_index_type LINEAR 
   --standard_min_confidence_threshold_for_emitting 30.0  
   -I *_sorted_Dedup_merged_realign_chr*_recal.bam 
   -L chr#_region_file.list
   -o chr#_region_*.raw.snps.indels.gvcf 
   &> HaplotypeCaller.log-#

Notes: Do we need --variant_index_parameter 128000, --variant_index_type LINEAR, are min_confidence_thresholds set too high?

  • CatVariants

Collect all individual gvcf files. These will be the files kept for GenotypeGVCF and future GenotypeGVCF re-runs.

Special note here: the final "gvcf" are called gCat.vcf.

java -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants 
   -R human_g1k_v37_decoy.fasta 
   --assumeSorted  
   -V chr#_region_*.raw.snps.indels.vcf 
   -V [ ... ]
   -out <individual>.raw.snps.indels.gCat.vcf 
   &> CatVariants.log-*
  • CombineGVCFs

This step will only be ran on > 200 individual gCat call sets.

java -jar -Xmx#g -XX:ParallelGCThreads=# GenomeAnalysisTK.jar  
   -T CombineGVCFs 
   -R human_g1k_v37_decoy.fasta 
   --variant *.raw.snps.indels.gCat.vcf
   --variant [ ... ]
   -o UGPp_*_mergeGvcf.vcf 
   &> CombineGVCF.log-*
  • GenotypeGVCFs
java -jar -Xmx#g -XX:ParallelGCThreads=# GenomeAnalysisTK.jar 
   -T GenotypeGVCFs 
   -R human_g1k_v37_decoy.fasta 
   --num_threads #  
   --variant [UGPp_*_mergeGvcf.vcf or <individual>.raw.snps.indels.gCat.vcf]
   --variant CEU_mergeGvcf.vcf 
   --variant GBR_mergeGvcf.vcf 
   --variant FIN_mergeGvcf.vcf 
   -L chr#_region_file.list
   -o chr#_region_genotyped.vcf
   &> GenotypeGVCF.log-#
  • Combine_Genotyped

Collect all the individual genotype steps.

java -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants 
   -R human_g1k_v37_decoy.fasta 
   --assumeSorted  
   -V chr#_region_genotyped.vcf
   -V [ ... ] 
   -out UGPp_*_cat_genotyped.vcf 
   &> Combine_Genotyped.log-#

VariantRecalibrator

  • SNP Recalibration
java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=tmp GenomeAnalysisTK.jar  
   -T VariantRecalibrator 
   -R human_g1k_v37_decoy.fasta 
   --minNumBadVariants 5000 
   --num_threads #  
   -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf 
   -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf 
   -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf 
   -an QD 
   -an MQRankSum 
   -an ReadPosRankSum 
   -an FS 
   -an InbreedingCoeff 
   -input UGPp_*_cat_genotyped.vcf 
   -recalFile UGPp_*_snp_recal 
   -tranchesFile UGPp_*_snp_tranches 
   -rscriptFile UGPp_*_snp_plots.R 
   -mode SNP 
   &> VariantRecalibrator_SNP.log-#
  • INDEL Recalibration
java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar 
   -T VariantRecalibrator 
   -R human_g1k_v37_decoy.fasta 
   --minNumBadVariants 5000 
   --num_threads #  
   -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf 
   -resource:1000G,known=false,training=true,truth=true,prior=10.0 1000G_phase1.indels.b37.vcf 
   -an MQRankSum 
   -an ReadPosRankSum 
   -an FS 
   -input UGPp_*_cat_genotyped.vcf 
   -recalFile UGPp_*_indel_recal 
   -tranchesFile UGPp_*_indel_tranches 
   -rscriptFile UGPp_*_indel_plots.R 
   -mode INDEL 
   &> VariantRecalibrator_INDEL.log-#

ApplyRecalibration

  • SNP Apply
java -jar -Xmx#g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar 
   -T ApplyRecalibration 
   -R human_g1k_v37_decoy.fasta 
   --ts_filter_level 99.5 
   --excludeFiltered 
   --num_threads #  
   -input UGPp_*_cat_genotyped.vcf 
   -recalFile UGPp_*_snp_recal 
   -tranchesFile UGPp_*_snp_tranches 
   -mode SNP 
   -o UGPp_*_recal_SNP.vcf 
   &> ApplyRecalibration_SNP.log-*
  • INDEL Apply
java -jar -Xmx#g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
   -T ApplyRecalibration 
   -R human_g1k_v37_decoy.fasta 
   --ts_filter_level 99.0 
   --excludeFiltered 
   --num_threads #  
   -input UGPp_*_cat_genotyped.vcf 
   -recalFile UGPp_*_indel_recal 
   -tranchesFile UGPp_*_indel_tranches 
   -mode INDEL 
   -o UGPp_*_recal_INDEL.vcf 
   &> ApplyRecalibration_INDEL.log-*

CombineVarients

These command will combine INDEL and SNP files into a single VCF file.

  • CombineVarients
java -jar -Xmx#g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar 
   -T CombineVariants 
   -R human_g1k_v37_decoy.fasta 
   --num_threads # 
   --genotypemergeoption UNSORTED 
   --variant UGPp_*_recal_SNP.vcf  
   --variant UGPp_*_recal_INDEL.vcf  
   -o UGPp_*_Final+Backgrouds.vcf 
   &> CombineVariants.log-*

Tabix

A tabix index file is now be generated.

bgzip -c UGPp_*_Final+Backgrouds.vcf > UGPp_*_Final+Backgrouds.vcf.gz
tabix -p vcf UGPp_*_Final+Backgrouds.vcf.gz

BAM Merge

This step uses sambamba to merge split chromosome BAMs into a single file.

sambamba merge -t #
individual_sorted_Dedup_realign_recal.bam
[individual_sorted_Dedup_realign_recal_chr#.bam]
[...]
2> sambamba_bam_merge.log-#

WHAM

The first SV caller we have added is WHAM [4].

  • WHAM-GRAPHENING

First we generate individual level non-genotyped calls.

WHAM-GRAPHENING -a 
human_g1k_v37_decoy.fasta 
-k 
-x 4 
-f individual_sorted_Dedup_realign_recal_chr#.bam 
> individual_WHAM.vcf 
2> wham_graphing.log-#
  • Concatenate WHAM VCF

Next all individuals are concatenated together.

cat
[individual_WHAM.vcf]
[...]
> UGPp_*_merged.WHAM.vcf
  • Sort WHAM VCF

Then merged vcf file are sorted.

grep -v '^#' 
UGPp_*_merged.WHAM.vcf 
| sort -k1,1 -k2,2n 
> UGPp_*_merged_sorted.WHAM.vcf
  • WHAM mergeIndvs

Finally all individuals are merge together.

mergeIndvs 
-f UGPp_*_merged_sorted.WHAM.vcf
> UGPp_*_mergeIndvs.WHAM.vcf

At this point WHAM calls would be genotyped, but for version 1.4.0 it would be considered an analysis step.