UGP Variant Pipeline 1.4.0
Utah Genome Project
Oct. 2015 Variant Calling Pipeline Version 1.4.0
Software Versions
- UGPp is a lightweight NGS pipeline, created for the Utah Genome Project (UGP)
- BWA: 0.7.10-r789
- GATK: 3.3-0-g37228af
- SamTools: 1.2-192-gfcaafe0 (using htslib 1.2.1-194-gf859e8d)
- Samblaster: Version 0.1.21
- Sambamba: v0.5.1
- FastQC: v0.11.2
- Tabix: 0.2.5 (r1005)
- WHAM: v1.7.0-160-g2a51
Data Source
Data sets used by the variant calling pipeline come from the Broad GSA (GATK) group, distributed as the 'GATK resource bundle version 2.8':
wget -r ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37
Reference genome (GRCh37). Now includes the phiX sequence.
- human_g1k_v37_decoy.fasta
Call region file generated from NCBI
- GRCh37 GFF3
VCF files of known indels for RealignerTargetCreator, and dbSNP for BaseRecalibrator.
- known_indel: Mills_and_1000G_gold_standard.indels.b37.vcf
- known_indel: 1000G_phase1.indels.b37.vcf
- known_dbsnp: dbsnp_137.b37.vcf
Background Files
- We have created 1000 Genomes (BWA mem/GATK 3.0+) background files to be run concurrently with the GenotypeGVCFs step.
Groups Currently completed:
- CEU
- GBR
- FIN
Version 1.0.5 background files have been updated to show only the individuals of each group, not the file names.
This is a complete list of the background individuals for runs completed after version 1.0.5 [1].
If you would like a copy of the current files, we have made a public AWS S3 bucket available. Using s3cmd, execute the following command:
s3cmd get s3://ugp-1k-backgrounds --recursive
Alternatively, to access the files without s3cmd, use the following URLs:
- http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/CEU_mergeGvcf.vcf
- http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/CEU_mergeGvcf.vcf.idx
- http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/FIN_mergeGvcf.vcf
- http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/FIN_mergeGvcf.vcf.idx
- http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/GBR_mergeGvcf.vcf
- http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/GBR_mergeGvcf.vcf.idx
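The URLs above follow a single pattern (group name plus _mergeGvcf.vcf and its .idx), so the list can be generated rather than typed. The following is a minimal sketch; the function name background_urls is hypothetical, and the script only prints the URLs rather than downloading anything.

```shell
# Print the download URLs for the background gVCFs and their indexes.
# The base URL and group names are taken from the list above;
# background_urls is a hypothetical helper name.
base_url="http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds"

background_urls() {
    for group in "$@"; do
        printf '%s/%s_mergeGvcf.vcf\n' "$base_url" "$group"
        printf '%s/%s_mergeGvcf.vcf.idx\n' "$base_url" "$group"
    done
}

# Dry run: list the URLs only.
background_urls CEU FIN GBR
```

To fetch the files, the output could be piped into wget, which reads URLs from standard input with -i -, for example: background_urls CEU FIN GBR | wget -i -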
Resource files for VariantRecalibrator_SNP
- hapmap_3.3.b37.vcf
- 1000G_omni2.5.b37.vcf
- 1000G_phase1.snps.high_confidence.b37.vcf
Resource files for VariantRecalibrator_INDEL
- Mills_and_1000G_gold_standard.indels.b37.vcf
- 1000G_phase1.indels.b37.vcf
Indexing
The reference must be indexed with BWA, Picard and SamTools. GATK requires all three. However, this step only needs to be done once per machine.
- BWA
bwa index -a bwtsw human_g1k_v37_decoy.fasta
- Picard
java -jar CreateSequenceDictionary.jar R=human_g1k_v37_decoy.fasta O=human_g1k_v37_decoy.dict
- SamTools
samtools faidx human_g1k_v37_decoy.fasta
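Because indexing only has to happen once, it can help to check which index files are already present before re-running the three commands. The sketch below assumes the usual output names: bwa index writes .amb, .ann, .bwt, .pac and .sa next to the FASTA, samtools faidx writes .fai, and GATK looks for the sequence dictionary under the reference name with its extension replaced by .dict. The function name missing_indexes is hypothetical.

```shell
# List the index files that are expected next to the reference FASTA
# but are not present. missing_indexes is a hypothetical helper name.
missing_indexes() {
    ref=$1
    # bwa index writes the first five files; samtools faidx writes .fai
    for ext in amb ann bwt pac sa fai; do
        [ -e "$ref.$ext" ] || printf '%s\n' "$ref.$ext"
    done
    # GATK expects the dictionary as <reference without extension>.dict
    dict="${ref%.*}.dict"
    [ -e "$dict" ] || printf '%s\n' "$dict"
}

# Prints nothing when every index file is in place.
missing_indexes human_g1k_v37_decoy.fasta
```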
Fasta Sequences
The fasta file currently used is (including decoys):
- human_g1k_v37_decoy.fasta
A list of headers can be found here [2]
FastQ File Analyses
fastqc --threads # -o <output dir> -f fastq fastq-file.fq.gz 2> fastqc_run.log-#
Alignment
Align reads to the genome with bwa.
The 'BWA-mem' program will find the reference coordinates of the input reads (independent of their mate-pair). The following parameters are those used by the 1KG project and GATK for aligning Illumina data.
bwa mem -R "read group" human_g1k_v37_decoy.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq | samblaster --addMateTags | sambamba view -f bam -l 0 -S /dev/stdin | sambamba sort -m 50G -o *_#_sorted_Dedup.bam /dev/stdin
The --addMateTags option has been added to the samblaster call for lumpy [3] compatibility.
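The "read group" argument to bwa mem -R is a full @RG header line with fields separated by a literal \t, which bwa converts to tabs. As a hedged illustration, the sketch below builds such a string. The choice of fields (ID, SM, LB, PL) and the example values are assumptions, not the pipeline's actual settings, and read_group is a hypothetical helper name.

```shell
# Build the @RG header line passed to `bwa mem -R`.
# Field choice and values are illustrative assumptions.
read_group() {
    # $1 = read group ID, $2 = sample name, $3 = library name
    # '\\t' in the printf format emits a literal backslash followed by t.
    printf '@RG\\tID:%s\\tSM:%s\\tLB:%s\\tPL:ILLUMINA' "$1" "$2" "$3"
}

rg=$(read_group Sample1_L1 Sample1 lib1)

# Dry run: show the start of the alignment command with the read group filled in.
printf '%s\n' "bwa mem -R '$rg' human_g1k_v37_decoy.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq"
```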
BAM File Analyses
- samtools stats has now replaced idxstats.
samtools stats [sorted dedup bam files] > dedup-bamfile.stats
- flagstat
/scratch/ucgd/lustre/u0413537/software/samtools-0.1.19/
samtools flagstat _#_sorted_Dedup.bam > _#_sorted_Dedup.flagstat 2> flagstat.log-#
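The samtools stats report is long. Its summary section consists of the lines that start with SN, and the report header itself suggests extracting them with grep and cut. A minimal sketch of that extraction follows; the function name summary_numbers is hypothetical.

```shell
# Print the "Summary Numbers" section of a samtools stats report.
# Those lines start with "SN" followed by a tab; cut drops the SN column.
# summary_numbers is a hypothetical helper name.
summary_numbers() {
    grep '^SN' "$1" | cut -f 2-
}

# Example (requires an existing report):
#   summary_numbers dedup-bamfile.stats
```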
BAM Merging
- sambamba
Merge the lane BAM files into a single BAM file per individual.
/scratch/ucgd/lustre/u0413537/software/sambamba/
sambamba merge --nthreads # individual_#_sorted_Dedup_merged.bam [individual_#_sorted_Dedup.bam files]
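As a rough sketch of how the merge command could be assembled for one individual, the function below takes a thread count, an individual ID and the lane BAM files, and prints the resulting command without running it. The function name merge_command and the example file names are hypothetical; the output naming is only modelled on the pattern above.

```shell
# Print the sambamba merge command for one individual (dry run).
# $1 = thread count, $2 = individual ID, remaining args = lane BAM files.
# merge_command and the file names below are hypothetical.
merge_command() {
    threads=$1
    individual=$2
    shift 2
    # sambamba merge takes the output file first, then the inputs.
    printf 'sambamba merge --nthreads %s %s_sorted_Dedup_merged.bam' "$threads" "$individual"
    for bam in "$@"; do
        printf ' %s' "$bam"
    done
    printf '\n'
}

merge_command 4 Sample1 Sample1_L1_sorted_Dedup.bam Sample1_L2_sorted_Dedup.bam
```

Printing the command first makes it easy to confirm that the output file comes before the inputs.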