Difference between revisions of "FreeBayes Variant Protocol"

From Utah Genome Project Wiki
Jump to navigation Jump to search
 
(8 intermediate revisions by the same user not shown)
Line 7: Line 7:
 
*FastQforward is an ultra parallelized NGS pipeline, created for the
 Utah Genome Project (UGP)
 
*FastQforward is an ultra parallelized NGS pipeline, created for the
 Utah Genome Project (UGP)
 
*BWA: 0.7.10
 
*BWA: 0.7.10
*Samblaster: 0.1.20
+
*SamBlaster: 0.1.20
 
*FreeBayes: 0.9.18
 
*FreeBayes: 0.9.18
 
*SamBamba: 0.5.0
 
*SamBamba: 0.5.0
Line 14: Line 14:
 
=== Data Source ===
 
=== Data Source ===
  
For compatibility with the UGP GATK baseed pipline, all reference data sets used for the variant calling pipeline come from the Broad GSA (GATK) group as the 'GATK resource bundle 2.5' version 2.8
+
For compatibility with UGP's GATK baseed pipline, all reference data sets used for the variant calling pipeline come from the Broad GSA (GATK) group as the 'GATK resource bundle 2.5' version 2.8
  
 
  wget -r ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37
 
  wget -r ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37
Line 66: Line 66:
  
 
== FastQ File Analyses ==
 
== FastQ File Analyses ==
 +
Run separately on each paired FASTQ file.
 +
fastqc pair_1.fq
  
fastqc Sample1_L1_R1.txt
 
  
From the sumamry.txt report we check  
+
From the summary.txt report we check  
 
* FAIL sections
 
* FAIL sections
  
Line 86: Line 87:
  
 
*BWA
 
*BWA
  bwa index -a bwtsw human_g1k_v37_decoy.fasta
+
  bwa index human_g1k_v37_decoy.fasta
  
 
== Alignment ==
 
== Alignment ==
Line 205: Line 206:
 
== Variant Calling ==
 
== Variant Calling ==
  
=== HaplotypeCaller ===
+
=== FreeBayes ===
  
*Now HaplotypeCaller handels SNP and INDEL calls.
+
All sample BAM files as well as background BAM files must be listed (one per line) in a simple text file for FreeBayes to locate them.
 
+
  freebayes -L bam_file_list.txt -v outfile.vcf -f reference.fasta
  java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
+
   
    -T HaplotypeCaller
+
=== VCFfilter ===
    -R human_g1k_v37_decoy.fasta  
+
Freebayes outputs most variants for reference purposes even if they are low quality.  The results must therefore be filtered before using them in a downstream analysis.
    --min_base_quality_score 20
+
  vcffilter -f "QUAL > 1 & QUAL / AO > 10 & SAF > 0 & SAR > 0 & RPR > 1 & RPL > 1" input.vcf > output.vcf
    --variant_index_parameter 128000
 
    --emitRefConfidence GVCF
 
    --standard_min_confidence_threshold_for_calling 30.0
 
    --num_cpu_threads_per_data_thread #
 
    --variant_index_type LINEAR
 
    --standard_min_confidence_threshold_for_emitting 30.0  
 
    -I *_sorted_Dedup_realign_chr*_recal.bam
 
    -L chr#_region_file.list
 
    -o chr#_region_*.raw.snps.indels.gvcf
 
    &> HaplotypeCaller.log-#
 
 
 
== Filtering Varients ==
 
 
 
These command will combine INDEL and SNP files into a single VCF file.
 
*CombineVarients
 
 
 
  java -jar -Xmx#g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
 
    -T CombineVariants
 
    -R human_g1k_v37_decoy.fasta
 
    --num_threads #
 
    --genotypemergeoption UNSORTED
 
    --variant cApTUrE_*_recal_SNP.vcf 
 
    --variant cApTUrE_*_recal_INDEL.vcf 
 
    -o cApTUrE_*_Combined.vcf
 
    &> CombineVariants.log-*
 
 
 
*SelectVariants
 
 
 
java -jar -Xmx#g -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar
 
    -T SelectVariants
 
    -R human_g1k_v37_decoy.fasta
 
    --variant cApTUrE_*_Combined.vcf 
 
    -select "DP > 100"  
 
    -o cApTUrE_*_Final+Backgrounds.vcf  
 
    &> SelectVariants.log-*
 
  
 
== Variant File QC ==
 
== Variant File QC ==

Latest revision as of 18:16, 11 February 2015

Alternate UGP FreeBayes Variant Calling Protocol

Feb. 2015 Variant Calling Pipeline Version 1.0.0

Software Versions

  • FastQforward is an ultra parallelized NGS pipeline, created for the
 Utah Genome Project (UGP)
  • BWA: 0.7.10
  • SamBlaster: 0.1.20
  • FreeBayes: 0.9.18
  • SamBamba: 0.5.0
  • VCFlib: Dec 12, 2014

Data Source

For compatibility with UGP's GATK baseed pipline, all reference data sets used for the variant calling pipeline come from the Broad GSA (GATK) group as the 'GATK resource bundle 2.5' version 2.8

wget -r ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/2.5/b37

Reference Genome (GRCh37):

  • human_g1k_v37_decoy.fasta

Call region file generated from NCBI

  • GRCh37 GFF3

Background Files

  • We have created 1000Genomes background BAM files to be ran concurrently with the FreeBayes step. Created using BWA mem/GATK 3.0+ to allow redundancy of backgrounds across all UGP pipelines

Groups Currently completed:

  • CEU (exome)
  • GBR (exome)
  • FIN (exome)
  • Platinum genomes (whole genome)

This is a complete list of the background individuals for run completed > 1.0.5 [1]

BAM files for backgrounds have not been made public yet, but gVCF files are available via AWS s3 bucket Using s3cmd execute the following command:

s3cmd get s3://ugp-1k-backgrounds --recursive

Alternatively to access the files without s3cmd the following use the following URLs:

Sequencing

This pipeline is designed for 100 bp (or greater) Illumina HiSeq PE exome or WGS sequence data with Sanger/Illumina 1.9 quality encoding, and uses Illumina naming convention found here [2]

Validate File Integrity with md5sum

An md5sum signature should be provided for each FastQ file by the sequencing center. After the file has been downloaded, locally check the md5sum to be sure that no data corruption occurred during the file transfer.

md5sum file.fastq > file_local.md5
diff file_local.md5 file_provided.md5
Now the pipeline runs md5_check to validation the results and will quit if errors are found. 

If the md5sum signature differs from that provided for the file:

  • Check to be sure you have the correct file.
  • Check if the md5sum was calculated on that compressed or uncompressed file by the provider and be sure to do the same with the local copy.
  • Try the download again.
  • Contact the sequence provider.

FastQ File Analyses

Run separately on each paired FASTQ file.

fastqc pair_1.fq


From the summary.txt report we check

  • FAIL sections

From the fastqc_data.txt file we check the following values:

  • Encoding (must be Sanger / Illumina 1.9)
  • Total Sequences (Currently set to 30000000)
  • Filtered Sequences (Currently set to less then 5)
  • Sequence length (must be >= 100 bp)
  • %GC (45 < x < 55)
  • Total Duplicate Percentage (Currently set to 60.0)

The pipeline now runs fastqc_check and output these result into QC-report.txt.

Indexing

The following indexing is required using BWA. This step only needs to be done once per reference fasta.

  • BWA
bwa index human_g1k_v37_decoy.fasta

Alignment

Align reads to the genome with bwa.

The 'BWA-mem' program will find the reference coordinates of the input reads (independent of their mate-pair). The following parameters are those used by the 1KG project and GATK for aligning Illumina data. The '-M' option for BWA mem is not required for FreeBayes, but is performed to allow cross compatibility with the UGP GATK based pipeline. Also SamBlaster is used for deduplication of reads rather than Picard Tools.

bwa mem -M -R 'read_group_line' reference.fasta pair_1.fq pair_2.fq | samblaster | sambamba view -f bam -l 0 -S /dev/stdin | sambamba sort -o outfile.bam /dev/stdin

For the BWA read group option (-R flag), the following values must be specified: (see SAM format specification).

  • ID
  • SM
  • LB
  • PL
  • PU

Example:

bwa mem -R '@RG\tID:ERR194147\tSM:NA12878\tLB:NA12878_1\tPL:ILLUMINA\tPU:ILLUMINA-1'

BAM File Analyses and Processing

The version of the UGP pipeline that uses GATK requires further processing of the alignment BAM files to improve GATK performance for variant calling steps. FreeBayes does not require these steps, but we perform them anyways in order to allow downstream compatibility of files with GATK.

Merge lane level BAMs to individual

  • This step only needs to run if you have multiple lanes per sample.
java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp /usr/local/picard/dist/picard.jar MergeSamFiles.jar
    INPUT=#*_sorted.bam
    INPUT=#*_sorted.bam
    INPUT=[ ... ]
    OUTPUT=*.bam                          
    VALIDATION_STRINGENCY: LENIENT
    MAX_RECORDS_IN_RAM: 5000000
    CREATE_INDEX: True
    SORT_ORDER: coordinate
    ASSUME_SORTED: True 
    USE_THREADING: True
    &> MergeSamFiles.log-#

BAM Quality Control

At this point the pipeline has broken the tasks into chromosomal regions.

  • CollectMultipleMetrics
java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp /usr/local/picard/dist/picard.jar CollectMultipleMetrics
   INPUT=*_sorted_Dedup_realign_chr#_recal.bam
   OUTPUT=*_sorted_Dedup_realign_chr#_recal.metrics 
   VALIDATION_STRINGENCY=LENIENT 
   PROGRAM=QualityScoreDistribution  
   REFERENCE_SEQUENCE=human_g1k_v37_decoy.fasta 
   &> CollectMultipleMetrics.log-#
  • idxstats
samtools idxstats [dedup bam files] > dedup-bamfile.stats

Now the pipeline will take idxstats ouput and check for unmapped reads.

Local Realignment of Indels

  • RealignerTargetCreator

This is where the tasks are broken into chromosomal regions.

java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar 
   -T RealignerTargetCreator 
   -R human_g1k_v37_decoy.fasta 
   -I *_sorted_Dedup.bam 
   --num_threads #  
   --known Mills_and_1000G_gold_standard.indels.b37.vcf
   --known 1000G_phase1.indels.b37.vcf  
   -L chr#_region_file.list 
   -o *_chr#_realign.intervals 
   &> RealignerTargetCreator.log-#
  • IndelRealigner
java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar 
   -T IndelRealigner 
   -R human_g1k_v37_decoy.fasta 
   -I *_sorted_Dedup.bam 
   -L chr#_region_file.list
   -targetIntervals *_chr#_realign.intervals 
   -known Mills_and_1000G_gold_standard.indels.b37.vcf 
   -known 1000G_phase1.indels.b37.vcf 
   -o *_sorted_Dedup_realign_chr#.bam 
   &> IndelRealigner.log-1

BaseRecalibration & PrintReads

  • BaseRecalibrator
java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar 
   -T BaseRecalibrator 
   -R /human_g1k_v37_decoy.fasta 
   -I *_sorted_Dedup_realign_chr#.bam 
   --num_cpu_threads_per_data_thread #  
   --knownSites dbsnp_137.b37.vcf
   --knownSites Mills_and_1000G_gold_standard.indels.b37.vcf 
   --knownSites 1000G_phase1.indels.b37.vcf  
   -o *_sorted_Dedup_realign_chr#_recal_data.table 
   &> BaseRecalibrator.log-#
  • PrintReads
java -jar -Xmx#g -XX:ParallelGCThreads=# -Djava.io.tmpdir=/tmp GenomeAnalysisTK.jar 
   -T PrintReads 
   -R human_g1k_v37_decoy.fasta 
   -I *_sorted_Dedup_realign_chr#.bam
   --num_cpu_threads_per_data_thread #  
   -BQSR *_sorted_Dedup_realign_chr#_recal_data.table 
   -o *_sorted_Dedup_realign_chr#_recal.bam 
   &> PrintReads.log-#

Variant Calling

FreeBayes

All sample BAM files as well as background BAM files must be listed (one per line) in a simple text file for FreeBayes to locate them.

freebayes -L bam_file_list.txt -v outfile.vcf -f reference.fasta

VCFfilter

Freebayes outputs most variants for reference purposes even if they are low quality. The results must therefore be filtered before using them in a downstream analysis.

vcffilter -f "QUAL > 1 & QUAL / AO > 10 & SAF > 0 & SAR > 0 & RPR > 1 & RPL > 1" input.vcf > output.vcf

Variant File QC

Quality Metrics on variants

  • Ti/Tv Ratio (2.1 for WGS ~2.8 for exome)
  • HapMap concordance
  • SNV/Indel Counts
  • Rare variant enrichment
  • DP
  • Q
  • GQ