UGP Variant Pipeline 1.4.0
Utah Genome Project
Oct. 2015 Variant Calling Pipeline Version 1.4.0
Software Versions
- UGPp is a lightweight NGS pipeline, created for the Utah Genome Project (UGP)
- BWA: 0.7.10-r789
- GATK: 3.3-0-g37228af
- SamTools: 1.2-192-gfcaafe0 (using htslib 1.2.1-194-gf859e8d)
- Samblaster: Version 0.1.21
- Sambamba: v0.5.1
- FastQC: v0.11.2
- Tabix: 0.2.5 (r1005)
- WHAM: v1.7.0-160-g2a51
Data Source
Data sets used for the variant calling pipeline come from the Broad GSA (GATK) group as the 'GATK resource bundle version 2.8'.
wget -r ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37
Reference genome (GRCh37). Now includes the PhiX sequence.
- human_g1k_v37_decoy.fasta
Call region file generated from NCBI
- GRCh37 GFF3
VCF files of known indels for RealignerTargetCreator and of known dbSNP sites for BaseRecalibrator.
- known_indel: Mills_and_1000G_gold_standard.indels.b37.vcf
- known_indel: 1000G_phase1.indels.b37.vcf
- known_dbsnp: dbsnp_137.b37.vcf
Background Files
- We have created 1000Genomes (BWA mem/GATK 3.0+) background files to be run concurrently with the GenotypeGVCFs step.
Groups currently completed:
- CEU
- GBR
- FIN
As of version 1.0.5, the background files have been updated to show only the individuals of each group, not the file names.
This is a complete list of the background individuals for runs completed after version 1.0.5 [1]
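As a rough illustration of how the background files might be combined with study samples at the GenotypeGVCFs step, the sketch below only prints a GATK 3.x command line and does not run it. The jar name `GenomeAnalysisTK.jar`, the sample gVCF names and the output name are assumptions for illustration, not the pipeline's actual settings; the background file names are those of the three completed groups.

```shell
# Print (not run) a GATK 3.x GenotypeGVCFs command that joint-genotypes
# the given sample gVCFs together with the three background files.
# Usage: genotype_cmd <reference.fasta> <output.vcf> <sample gVCFs...>
genotype_cmd() {
    ref="$1"; out="$2"; shift 2
    # Jar name is an assumption; adjust to the local GATK install.
    cmd="java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R $ref"
    # Sample gVCFs first, then the CEU, FIN and GBR backgrounds.
    for gvcf in "$@" CEU_mergeGvcf.vcf FIN_mergeGvcf.vcf GBR_mergeGvcf.vcf; do
        cmd="$cmd --variant $gvcf"
    done
    printf '%s -o %s\n' "$cmd" "$out"
}

# Hypothetical sample and output names.
genotype_cmd human_g1k_v37_decoy.fasta cohort.vcf sample1.g.vcf sample2.g.vcf
```

Printing the command first makes it easy to review the list of `--variant` inputs before committing to a long run.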
If you would like a copy of the current files, we have made a public AWS S3 bucket available. Using s3cmd, execute the following command:
s3cmd get s3://ugp-1k-backgrounds --recursive
Alternatively, to access the files without s3cmd, use the following URLs:
- http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/CEU_mergeGvcf.vcf
- http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/CEU_mergeGvcf.vcf.idx
- http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/FIN_mergeGvcf.vcf
- http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/FIN_mergeGvcf.vcf.idx
- http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/GBR_mergeGvcf.vcf
- http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/GBR_mergeGvcf.vcf.idx
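The URLs above follow a regular pattern (group name plus `_mergeGvcf.vcf` and `.idx`), so fetching them can be scripted. The following is a minimal sketch, assuming `curl` is installed and the bucket is still publicly reachable; the helper names `background_urls` and `fetch_backgrounds` are hypothetical.

```shell
# Base URL as given in the list of background files.
BASE_URL="http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds"

# Print the gVCF URL and its index URL for each group name given.
background_urls() {
    for group in "$@"; do
        printf '%s/%s_mergeGvcf.vcf\n' "$BASE_URL" "$group"
        printf '%s/%s_mergeGvcf.vcf.idx\n' "$BASE_URL" "$group"
    done
}

# Download all six files into the directory given as the first argument.
fetch_backgrounds() {
    dest="$1"
    mkdir -p "$dest"
    background_urls CEU FIN GBR | while read -r url; do
        # -f: fail on HTTP errors, -L: follow redirects
        curl -fL -o "$dest/$(basename "$url")" "$url"
    done
}
```

Calling `fetch_backgrounds backgrounds` would place the files in a `backgrounds` directory; `background_urls CEU` alone lists the two URLs for a single group without downloading anything.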
Resource files for VariantRecalibrator_SNP
- hapmap_3.3.b37.vcf
- 1000G_omni2.5.b37.vcf
- 1000G_phase1.snps.high_confidence.b37.vcf
Resource files for VariantRecalibrator_INDEL
- Mills_and_1000G_gold_standard.indels.b37.vcf
- 1000G_phase1.indels.b37.vcf
Indexing
The reference must be indexed with BWA, Picard and SamTools. BWA needs its own index for alignment, and GATK requires both the Picard sequence dictionary and the SamTools FASTA index. This step only needs to be done once per machine.
- BWA
bwa index -a bwtsw human_g1k_v37_decoy.fasta
- Picard
java -jar CreateSequenceDictionary.jar R=human_g1k_v37_decoy.fasta O=human_g1k_v37_decoy.dict
- SamTools
samtools faidx human_g1k_v37_decoy.fasta
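Because the three indexing commands only need to run once, it can help to check which index files already exist before rerunning them. The sketch below assumes the default output names: `bwa index` writes `.amb`, `.ann`, `.bwt`, `.pac` and `.sa` files next to the FASTA, `samtools faidx` writes `.fai`, and the Picard dictionary is named after the FASTA with `.dict` in place of the extension. The helper name `missing_indexes` is hypothetical.

```shell
# Print every expected index file that does not yet exist for a reference.
# Usage: missing_indexes <reference.fasta>
missing_indexes() {
    ref="$1"
    # Five BWA index files, the Picard dictionary, and the samtools index.
    for f in "$ref.amb" "$ref.ann" "$ref.bwt" "$ref.pac" "$ref.sa" \
             "${ref%.*}.dict" "$ref.fai"; do
        [ -e "$f" ] || printf '%s\n' "$f"
    done
}

# Lists whatever is still missing for the pipeline's reference, if anything.
missing_indexes human_g1k_v37_decoy.fasta
```

An empty result means all three indexing steps have already been completed for that reference.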
Fasta Sequences
The FASTA file currently used (including decoys) is:
- human_g1k_v37_decoy.fasta
- A list of headers can be found here [2]