Revision as of 22:03, 12 October 2015

Utah Genome Project

Oct. 2015
Variant Calling Pipeline Version 1.4.0

Software Versions

UGPp is a lightweight NGS pipeline created for the Utah Genome Project (UGP).
BWA: 0.7.10-r789
GATK: 3.3-0-g37228af
SamTools: 1.2-192-gfcaafe0 (using htslib 1.2.1-194-gf859e8d)
Samblaster: Version 0.1.21
Sambamba: v0.5.1
FastQC: v0.11.2
Tabix: 0.2.5 (r1005)
WHAM: v1.7.0-160-g2a51

Data Source

Data sets used for the variant calling pipeline come from the Broad GSA (GATK) group as the 'GATK resource bundle version 2.8'.

wget -r ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37

Reference genome (GRCh37). Now includes the phiX sequence.

human_g1k_v37_decoy.fasta

Call region file generated from NCBI

GRCh37 GFF3

VCF files of known indels for RealignerTargetCreator, and dbSNP for BaseRecalibrator.

known_indel: Mills_and_1000G_gold_standard.indels.b37.vcf
known_indel: 1000G_phase1.indels.b37.vcf
known_dbsnp: dbsnp_137.b37.vcf

Background Files

We have created 1000 Genomes (BWA mem/GATK 3.0+) background files to be run concurrently with the GenotypeGVCFs step.

Groups Currently completed:

CEU
GBR
FIN

Version 1.0.5 background files have been updated to show only the individuals of each group, not the file names.

This is a complete list of the background individuals for runs completed with versions later than 1.0.5 [1]

If you would like a copy of the current files, we have made them available in a public AWS S3 bucket.
Using s3cmd, execute the following command:
s3cmd get s3://ugp-1k-backgrounds --recursive

Alternatively, to access the files without s3cmd, use the following URLs:
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/CEU_mergeGvcf.vcf
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/CEU_mergeGvcf.vcf.idx
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/FIN_mergeGvcf.vcf
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/FIN_mergeGvcf.vcf.idx
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/GBR_mergeGvcf.vcf
http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds/GBR_mergeGvcf.vcf.idx
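The six URLs above follow one pattern (group name plus a fixed suffix), so they can be generated rather than copied by hand. The following is a minimal sketch; the function name background_urls is hypothetical, and the base URL and group names are taken from the list above.

```shell
# Base URL and group names are taken from the list of background files.
BASE_URL="http://s3-us-west-2.amazonaws.com/ugp-1k-backgrounds"

# Print the VCF and index URL for every completed group.
background_urls() {
    for group in CEU FIN GBR; do
        echo "$BASE_URL/${group}_mergeGvcf.vcf"
        echo "$BASE_URL/${group}_mergeGvcf.vcf.idx"
    done
}

# Usage (needs network access and wget; "-i -" reads URLs from stdin):
#   background_urls | wget -i -
```

Adding a group once it is completed then only means extending the loop list.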

Resource files for VariantRecalibrator_SNP

hapmap_3.3.b37.vcf
1000G_omni2.5.b37.vcf
1000G_phase1.snps.high_confidence.b37.vcf

Resource files for VariantRecalibrator_INDEL

Mills_and_1000G_gold_standard.indels.b37.vcf
1000G_phase1.indels.b37.vcf
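Before starting a run it can help to confirm that every file named in this section is present in one directory. The sketch below assumes the files have been downloaded and uncompressed under exactly the names listed here; the variable UGP_RESOURCES and the function missing_resources are hypothetical helpers, not part of UGPp.

```shell
# Resource files named in the Data Source section.
UGP_RESOURCES="human_g1k_v37_decoy.fasta
Mills_and_1000G_gold_standard.indels.b37.vcf
1000G_phase1.indels.b37.vcf
dbsnp_137.b37.vcf
hapmap_3.3.b37.vcf
1000G_omni2.5.b37.vcf
1000G_phase1.snps.high_confidence.b37.vcf"

# Print each resource file that is missing from directory $1.
# Returns 0 when everything is present, 1 otherwise.
missing_resources() {
    _status=0
    for _f in $UGP_RESOURCES; do
        if [ ! -f "$1/$_f" ]; then
            echo "$_f"
            _status=1
        fi
    done
    return $_status
}

# Usage:
#   missing_resources /path/to/bundle || echo "download the files listed above"
```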

Indexing

The reference must be indexed with BWA, Picard and SamTools: BWA needs its own index for alignment, and GATK needs the Picard sequence dictionary and the SamTools .fai index. This step only needs to be done once per machine.

BWA

bwa index -a bwtsw human_g1k_v37_decoy.fasta

Picard

java -jar CreateSequenceDictionary.jar R=human_g1k_v37_decoy.fasta O=human_g1k_v37_decoy.dict

SamTools

samtools faidx human_g1k_v37_decoy.fasta
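Because indexing is a one-time step, a quick check of which index files already exist avoids repeating it. The sketch below relies on the usual output names: bwa index writes .amb, .ann, .bwt, .pac and .sa next to the FASTA, samtools faidx writes .fai, and GATK looks for a .dict file named after the FASTA without its extension. The function name missing_indexes is hypothetical.

```shell
# Print the index files that are still missing for reference FASTA $1.
missing_indexes() {
    _ref=$1
    _base=${_ref%.*}    # strip the final extension (.fasta)
    for _f in "$_ref.amb" "$_ref.ann" "$_ref.bwt" "$_ref.pac" "$_ref.sa" \
              "$_ref.fai" "$_base.dict"; do
        [ -f "$_f" ] || echo "$_f"
    done
}

# Usage:
#   missing_indexes human_g1k_v37_decoy.fasta
```

An empty result means all three indexing commands have already been run for that reference.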

Fasta Sequences

The fasta file currently used is (including decoys):

human_g1k_v37_decoy.fasta

A list of headers can be found here: https://github.com/srynobio/UGPp/blob/gh-pages/fasta_sequences_regions%20and%20decoys

FastQ File Analyses

fastqc --threads # -o <output dir> -f fastq fastq-file.fq.gz 2> fastqc_run.log-#
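In the command above, # stands for a number. A minimal sketch of how the command might be expanded for several FASTQ files follows; it assumes the first # is the thread count and the second is a per-file counter, and the function name fastqc_commands is hypothetical. It only prints the commands (a dry run) so they can be reviewed or handed to a scheduler.

```shell
# Print one fastqc command per FASTQ file, numbering the logs from 1.
# $1 = threads, $2 = output directory, remaining arguments = FASTQ files.
# File names containing spaces are not quoted in the printed command.
fastqc_commands() {
    _threads=$1
    _outdir=$2
    shift 2
    _n=0
    for _fq in "$@"; do
        _n=$((_n + 1))
        echo "fastqc --threads $_threads -o $_outdir -f fastq $_fq 2> fastqc_run.log-$_n"
    done
}

# Usage:
#   fastqc_commands 4 qc_out Sample1_L1_R1.fq.gz Sample1_L1_R2.fq.gz
```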

Alignment

Align reads to the genome with bwa.

The 'BWA-mem' program will find the reference coordinates of the input reads (independent of their mate-pair). The following parameters are those used by the 1KG project and GATK for aligning Illumina data.

bwa mem -R "read group" human_g1k_v37_decoy.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq | samblaster --addMateTags | sambamba view -f bam -l 0 -S /dev/stdin |
sambamba sort -m 50G -o *_#_sorted_Dedup.bam /dev/stdin

The --addMateTags option is passed to samblaster for compatibility with lumpy (https://github.com/arq5x/lumpy-sv).
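The "read group" argument in the alignment command is a placeholder for a full @RG header line. bwa mem -R expects the fields to be separated by the two-character sequence \t, which it converts to tabs in the SAM output. The sketch below shows one way to build such a string; the choice of fields (ID, SM, LB, PL) and the function name make_read_group are assumptions for illustration, not the values UGPp necessarily sets.

```shell
# Build an @RG header line for bwa mem -R.
# $1 = read group ID, $2 = sample name, $3 = library (hypothetical values).
# The output contains a literal backslash followed by "t", not a tab.
make_read_group() {
    printf '@RG\\tID:%s\\tSM:%s\\tLB:%s\\tPL:ILLUMINA\n' "$1" "$2" "$3"
}

# Usage:
#   bwa mem -R "$(make_read_group Sample1_L1 Sample1 lib1)" \
#       human_g1k_v37_decoy.fasta Sample1_L1_R1.fq Sample1_L1_R2.fq
```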

BAM File Analyses

samtools stats has now replaced idxstats.

samtools stats [sorted dedup bam files] > dedup-bamfile.stats
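The report written by samtools stats has several sections; the summary lines start with SN and are tab-separated. A minimal sketch for pulling individual values out of dedup-bamfile.stats follows. The function names are hypothetical, and the field name in the usage line is one of the standard summary entries.

```shell
# Print the "Summary Numbers" section of a samtools stats report ($1)
# as tab-separated "name:" and value columns.
summary_numbers() {
    grep '^SN' "$1" | cut -f 2-
}

# Print the value of one summary field ($2) from report $1.
summary_value() {
    summary_numbers "$1" | awk -F '\t' -v key="$2:" '$1 == key { print $2 }'
}

# Usage:
#   summary_value dedup-bamfile.stats 'raw total sequences'
```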

flagstat

/scratch/ucgd/lustre/u0413537/software/samtools-0.1.19/

samtools flagstat _#_sorted_Dedup.bam > _#_sorted_Dedup.flagstat 2> flagstat.log-#
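The .flagstat file is plain text with one count per line, which makes a quick mapping check easy to script. The sketch below assumes the usual flagstat layout ("N + M in total ...", "N + M mapped (...)") and reports only the QC-passed counts; the function name flagstat_mapped_total is hypothetical.

```shell
# Print "<mapped> <total>" (QC-passed read counts) from flagstat report $1.
# Only the line whose fourth field is exactly "mapped" is used, so lines such
# as "with itself and mate mapped" are ignored.
flagstat_mapped_total() {
    awk '$4 == "in" && $5 == "total" { total = $1 }
         $4 == "mapped"              { mapped = $1 }
         END { print mapped, total }' "$1"
}

# Usage:
#   flagstat_mapped_total Sample1_1_sorted_Dedup.flagstat
```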

BAM Merging

sambamba

Merge the lane-level BAM files into a single BAM file per individual.

/scratch/ucgd/lustre/u0413537/software/sambamba/ 
sambamba merge --nthreads # individual_#_sorted_Dedup_merged.bam [individual_#_sorted_Dedup.bam files]
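The merge step takes the output file first, followed by every lane BAM for that individual. The following sketch collects the lane files and prints the resulting command as a dry run. It assumes lane files are named <individual>_<lane>_sorted_Dedup.bam and that the merged file drops the lane number; both naming rules and the function name merge_command are assumptions for illustration.

```shell
# Print a sambamba merge command for one individual.
# $1 = directory holding lane BAMs, $2 = individual ID, $3 = threads.
merge_command() {
    _lanes=$(ls "$1"/"$2"_*_sorted_Dedup.bam 2>/dev/null | tr '\n' ' ')
    [ -n "$_lanes" ] || { echo "no lane BAMs for $2" >&2; return 1; }
    # ${_lanes% } drops the trailing space left by tr.
    echo "sambamba merge --nthreads $3 $1/${2}_sorted_Dedup_merged.bam ${_lanes% }"
}

# Usage:
#   merge_command /path/to/bams Sample1 8
```

Printing the command instead of running it makes it easy to confirm that only the intended lane files were picked up.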
