Difference between revisions of "UGP Variant Pipeline 1.4.0"

From Utah Genome Project Wiki

Revision as of 22:04, 12 October 2015

Utah Genome Project

Oct. 2015
Variant Calling Pipeline Version 1.4.0

Software Versions

  • UGPp is a lightweight NGS pipeline, created for the Utah Genome Project (UGP)
  • BWA: 0.7.10-r789
  • GATK: 3.3-0-g37228af
  • SamTools: 1.2-192-gfcaafe0 (using htslib 1.2.1-194-gf859e8d)
  • Samblaster: Version 0.1.21
  • Sambamba: v0.5.1
  • FastQC: v0.11.2
  • Tabix: 0.2.5 (r1005)
  • WHAM: v1.7.0-160-g2a51
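Before running the pipeline it can help to confirm that the command-line tools listed above are installed. The following is only a sketch: the function name is hypothetical, the executable names are assumptions based on each tool's usual name, and GATK and Picard are Java jars, so they are not covered here.

```shell
#!/bin/sh
# Print the name of every tool in the argument list that is not on PATH.
# (missing_tools is a hypothetical helper, not part of UGPp.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Executable names below are assumptions; adjust to the local installation.
missing_tools bwa samtools samblaster sambamba fastqc tabix
```

An empty result means every listed executable was found. The check does not compare version numbers.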

Data Source

The data sets used by the variant calling pipeline come from the Broad GSA (GATK) group and are distributed as the 'GATK resource bundle version 2.8'.

wget -r ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37

Reference genome (GRCh37); now includes the phiX sequence.

  • human_g1k_v37_decoy.fasta

Call region file generated from NCBI

  • GRCh37 GFF3

VCF files for RealignerTargetCreator knowns and dbsnp for BaseRecalibrator.

  • known_indel: Mills_and_1000G_gold_standard.indels.b37.vcf
  • known_indel: 1000G_phase1.indels.b37.vcf
  • known_dbsnp: dbsnp_137.b37.vcf

Background Files

  • We have created 1000 Genomes (BWA mem/GATK 3.0+) background files to be run concurrently with the GenotypeGVCFs step.

Groups currently completed:

  • CEU
  • GBR
  • FIN

Version 1.0.5 background files have been updated to show only the individuals of each group, not the file names.

This is a complete list of the background individuals for runs completed with versions later than 1.0.5 [1]

If you would like a copy of the current files, we have made them available in a public AWS S3 bucket.
Using s3cmd, execute the following command:
s3cmd get s3://ugp-1k-backgrounds --recursive
Alternatively, to access the files without s3cmd, use the following URLs:
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/CEU_mergeGvcf.vcf
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/CEU_mergeGvcf.vcf.idx
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/FIN_mergeGvcf.vcf
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/FIN_mergeGvcf.vcf.idx
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/GBR_mergeGvcf.vcf
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/GBR_mergeGvcf.vcf.idx
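The URLs above follow one pattern per group, so they can be generated rather than typed. This is a minimal sketch: the function and variable names are hypothetical, and the download step assumes wget is installed.

```shell
#!/bin/sh
# Base URL taken from the list above.
BASE_URL="http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds"

# Print the VCF and index URL for each completed background group.
# (background_urls is a hypothetical helper name.)
background_urls() {
    for group in CEU FIN GBR; do
        for suffix in vcf vcf.idx; do
            printf '%s/%s_mergeGvcf.%s\n' "$BASE_URL" "$group" "$suffix"
        done
    done
}

# Dry run: list the URLs. To download them, pipe the list into wget:
#   background_urls | wget -i -
background_urls
```

Keeping the listing separate from the download makes it easy to inspect the URLs first.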

Resource files for VariantRecalibrator_SNP

  • hapmap_3.3.b37.vcf
  • 1000G_omni2.5.b37.vcf
  • 1000G_phase1.snps.high_confidence.b37.vcf

Resource files for VariantRecalibrator_INDEL

  • Mills_and_1000G_gold_standard.indels.b37.vcf
  • 1000G_phase1.indels.b37.vcf

Indexing

The reference must be indexed with BWA, Picard and SamTools. BWA needs its own index for alignment, and GATK needs the Picard sequence dictionary and the SamTools .fai index. This step only needs to be done once per machine.

  • BWA
bwa index -a bwtsw human_g1k_v37_decoy.fasta
  • Picard
java -jar CreateSequenceDictionary.jar R=human_g1k_v37_decoy.fasta O=human_g1k_v37_decoy.dict
  • SamTools
samtools faidx human_g1k_v37_decoy.fasta
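Because indexing is done once per machine, a check for the expected index files can be useful before starting a run. This is a sketch under stated assumptions: the helper names are hypothetical, the five BWA extensions are those written by bwa index, and the dictionary is assumed to use the .dict name that GATK looks for next to the FASTA.

```shell
#!/bin/sh
# List the index files the three indexing commands are expected to produce.
# (Both function names are hypothetical helpers.)
expected_index_files() {
    ref=$1
    # bwa index writes five files next to the FASTA.
    for ext in amb ann bwt pac sa; do
        printf '%s.%s\n' "$ref" "$ext"
    done
    # samtools faidx writes <ref>.fai.
    printf '%s.fai\n' "$ref"
    # Assumption: the dictionary replaces the FASTA extension with .dict.
    printf '%s.dict\n' "${ref%.*}"
}

# Print every expected index file that is missing on disk.
missing_index_files() {
    expected_index_files "$1" | while IFS= read -r f; do
        [ -e "$f" ] || printf '%s\n' "$f"
    done
}

missing_index_files human_g1k_v37_decoy.fasta
```

No output means all seven files are present.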

Fasta Sequences

The fasta file currently used is (including decoys):

  • human_g1k_v37_decoy.fasta

A list of headers can be found here [2]

FastQ File Analyses

fastqc --threads # -o <output dir> -f fastq fastq-file.fq.gz 2> fastqc_run.log-#

Alignment

Align reads to the genome with bwa.

The 'BWA-mem' program will find the reference coordinates of the input reads (independent of their mate-pair). The following parameters are those used by the 1KG project and GATK for aligning Illumina data.

bwa mem -R "read group" human_g1k_v37_decoy.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq | samblaster --addMateTags | sambamba view -f bam -l 0 -S /dev/stdin |
sambamba sort -m 50G -o *_#_sorted_Dedup.bam /dev/stdin
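The "read group" placeholder in the command above stands for an @RG header line. A minimal sketch of building one follows. The helper name and the ID, sample and library values are hypothetical, and the choice of fields (ID, SM, LB, PL) is an assumption about what downstream GATK steps need, not the pipeline's documented setting.

```shell
#!/bin/sh
# Build a read-group header line for `bwa mem -R`.
# The fields are separated by the two-character escape \t, which bwa mem
# accepts in the -R string, rather than by a real tab character.
# (make_read_group is a hypothetical helper name.)
make_read_group() {
    id=$1
    sample=$2
    library=$3
    printf '@RG\\tID:%s\\tSM:%s\\tLB:%s\\tPL:ILLUMINA\n' "$id" "$sample" "$library"
}

# Hypothetical values for one lane of one sample.
make_read_group Sample1_L1 Sample1 Lib1
```

The result can be passed as bwa mem -R "$(make_read_group Sample1_L1 Sample1 Lib1)". Giving each lane its own ID while sharing SM keeps lanes distinguishable after merging.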

"The --addMateTags has been added to samblaster for lumpy [3] compatibility."

BAM File Analyses

  • samtools stats has now replaced idxstats.
samtools stats [sorted dedup bam files] > dedup-bamfile.stats
  • flagstat

/scratch/ucgd/lustre/u0413537/software/samtools-0.1.19/

samtools flagstat _#_sorted_Dedup.bam > _#_sorted_Dedup.flagstat 2> flagstat.log-#
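The flagstat file written above is plain text, so the total and mapped read counts can be pulled out with awk. This is a sketch only: the helper name is hypothetical, and it assumes the usual flagstat layout in which the fourth field of the relevant lines is "in" (followed by "total") or "mapped".

```shell
#!/bin/sh
# Read samtools flagstat output on stdin and print "<mapped> <total>",
# using the QC-passed counts in the first column.
# (flagstat_mapped_total is a hypothetical helper name.)
flagstat_mapped_total() {
    awk '$4 == "in" && $5 == "total" { total = $1 }
         $4 == "mapped"              { mapped = $1 }
         END { print mapped, total }'
}

# Usage, with the file name pattern from the command above:
#   flagstat_mapped_total < "_#_sorted_Dedup.flagstat"
```

Matching on the fourth field avoids confusing the "mapped" line with other lines that also contain the word, such as the mate-mapping lines.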

BAM Merging

  • sambamba

Merge the lane BAM files into a single BAM file per individual.

/scratch/ucgd/lustre/u0413537/software/sambamba/ 
sambamba merge --nthreads # individual_#_sorted_Dedup_merged.bam [individual_#_sorted_Dedup.bam files]
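The merge step can be sketched as a helper that assembles the command for one individual from its lane BAM files. The function name, the sample names and the exact output naming are assumptions for illustration. The helper prints the command rather than running it, so it can be reviewed first.

```shell
#!/bin/sh
# Print (not run) the sambamba merge command for one individual.
# Arguments: thread count, individual name, then the lane BAM files.
# (merge_command is a hypothetical helper name.)
merge_command() {
    threads=$1
    individual=$2
    shift 2
    printf 'sambamba merge --nthreads %s %s_sorted_Dedup_merged.bam' "$threads" "$individual"
    for bam in "$@"; do
        printf ' %s' "$bam"
    done
    printf '\n'
}

# Hypothetical example: two lanes of Sample1, four threads.
merge_command 4 Sample1 Sample1_L1_sorted_Dedup.bam Sample1_L2_sorted_Dedup.bam
```

sambamba merge takes the output file first and the input files after it, which is the order the helper preserves.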