The MAKER control files explained

From MAKER Wiki
Revision as of 19:09, 15 August 2013 by Mcampbell (talk | contribs)

MAKER gets all of the information it needs to run from three control files: maker_opts.ctl, maker_bopts.ctl, and maker_exe.ctl. Each file contains many options; some are essential for MAKER to run, and others are there to help MAKER run better on your species. Making informed, thoughtful decisions when setting control file options will result in a more accurate annotation for most species than simply accepting the defaults. Now let's go through the files.


maker_opts.ctl

The maker_opts.ctl file is the workhorse of the control files and is where the vast majority of the parameters are set, so here we go line by line.

#-----Genome (Required for De-Novo Annotation)

Headings for sections in the control files are marked by a pound sign and five dashes. These headings are not actually used by MAKER but are helpful when you are trying to find a specific option or parameter. This heading points out that the following options must be set for de novo annotation, which is true, but I can't think of any uses of MAKER that do not require a genome.
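
To make the layout concrete, here is a minimal Python sketch of how a control file could be read: full-line comments (including the #----- headings) are skipped, and everything after the first # on an option line is treated as a comment. This is not MAKER's own parser; the function name parse_ctl is made up for illustration, and the sketch assumes that option values never contain a # character.

```python
def parse_ctl(text):
    """Return {option: value} from MAKER-style control file text (illustrative only)."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and full-line comments, including '#-----' headings.
        if not line or line.startswith("#"):
            continue
        # Drop the trailing '#comment' (assumes values contain no '#').
        setting = line.split("#", 1)[0]
        if "=" not in setting:
            continue
        # Split on the first '=' only, so the value may itself contain '='.
        key, value = setting.split("=", 1)
        options[key.strip()] = value.strip()
    return options


example = """\
#-----Genome (Required for De-Novo Annotation)
genome= #genome sequence (fasta format or fasta embeded in GFF3)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic
"""

opts = parse_ctl(example)
print(opts)
```

An option that has been left blank, such as genome= here, comes back as an empty string, which makes it easy to spot settings that still need a value.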


genome= #genome sequence (fasta format or fasta embeded in GFF3)

This is a multifasta file that contains the assembled genome. To get a good annotation, the scaffold N50 should be larger than the expected median gene length. You can estimate the expected gene length by looking at previously annotated, closely related species. It is also important to note that although fasta format accepts a large number of characters to represent nucleotides, many of them are not supported by some of the tools MAKER calls, such as exonerate, so it is a good idea to make sure that your fasta sequence contains only A, T, C, G, and N.
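
Both checks described above can be scripted before starting a run. The following Python sketch reports the scaffold N50 and lists any sequence that contains characters other than A, T, C, G, and N. The function names are hypothetical, and treating lowercase (soft-masked) bases as acceptable is my assumption rather than something stated above.

```python
import io

ALLOWED = set("ACGTN")


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n50(lengths):
    """Largest length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def check_genome(handle):
    """Return (N50, {sequence name: unexpected characters})."""
    lengths, bad = [], {}
    for name, seq in read_fasta(handle):
        lengths.append(len(seq))
        # Assumption: case is ignored, so soft-masked bases count as valid.
        unexpected = set(seq.upper()) - ALLOWED
        if unexpected:
            bad[name] = sorted(unexpected)
    return n50(lengths), bad


# Tiny in-memory example; for a real genome pass open("genome.fasta") instead.
demo = io.StringIO(">scaf1\nACGTNNACGT\n>scaf2\nACGRYT\n>scaf3\nacgt\n")
scaffold_n50, bad = check_genome(demo)
print(scaffold_n50, bad)
```

The N50 reported here can then be compared against the gene lengths seen in a closely related annotated species.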


organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

MAKER's default is eukaryotic. Setting this to prokaryotic changes some of the MAKER behavior options automatically, such as turning off repeat masking.


#-----Re-annotation Using MAKER Derived GFF3

This section was originally developed for rerunning MAKER with the same evidence from a previous MAKER run after the original data store had gone the way of all the earth (been deleted). This uses an SQLite database to keep track of the information in the gff3 file. The default configuration for SQLite doesn't handle the really big MAKER-produced gff3 files (large genome with lots of evidence) very well. If you have a big file and you want to reuse the data, there are places further down in this file that allow you to give evidence, gene predictions, and repeat data in gff3 format. You could also recompile SQLite with the compile-time option SQLITE_MAX_EXPR_DEPTH set to 0 (that is, passing -DSQLITE_MAX_EXPR_DEPTH=0 to the compiler), which removes the limit on expression depth. There are instructions for doing this on the MAKER dev list.
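
If you take the route of feeding the old data back in through the separate gff3 options, the big file first has to be split by feature source. Here is a rough Python sketch of that split, keyed on the second column of the gff3 file. The mapping from source names to control file options is an assumption on my part, so check which sources actually appear in your own file before relying on it.

```python
from collections import defaultdict

# Assumed mapping from GFF3 source (column 2) to the control file option
# that would receive those features; adjust to the sources in your own file.
SOURCE_TO_OPTION = {
    "est2genome": "est_gff",
    "protein2genome": "protein_gff",
    "repeatmasker": "rm_gff",
    "repeatrunner": "rm_gff",
}


def split_by_source(gff_lines, mapping=SOURCE_TO_OPTION):
    """Group GFF3 feature lines by the control file option they would feed."""
    groups = defaultdict(list)
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more features
        if not line.strip() or line.startswith("#"):
            continue  # blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        option = mapping.get(cols[1])
        if option:
            groups[option].append(line.rstrip("\n"))
    return dict(groups)


# Made-up feature lines for illustration only.
demo = [
    "##gff-version 3",
    "scaf1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1",
    "scaf1\test2genome\texpressed_sequence_match\t100\t900\t50\t+\t.\tID=e1",
    "scaf1\tprotein2genome\tprotein_match\t120\t880\t70\t+\t.\tID=p1",
    "scaf1\trepeatmasker\tmatch\t2000\t2300\t12\t+\t.\tID=r1",
    "##FASTA",
    ">scaf1",
    "ACGT",
]

groups = split_by_source(demo)
for option, lines in sorted(groups.items()):
    print(option, len(lines))
```

Each group can then be written to its own file and given to the matching option, which avoids loading the whole gff3 file through maker_gff=.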


maker_gff= #re-annotate genome based on this gff3 file

Path to the MAKER-generated gff3 file.


est_pass=0 #use ests in maker_gff: 1 = yes, 0 = no

Set to 1 if you want to use the EST/mRNA-Seq alignments from the MAKER file. See est= below for details.


altest_pass=0 #use alternate organism ests in maker_gff: 1 = yes, 0 = no

Set to 1 if you want to use the alternate organism EST/mRNA-Seq alignments from the MAKER file. See altest= below for details.


protein_pass=0 #use proteins in maker_gff: 1 = yes, 0 = no

Set to 1 if you want to use the protein alignments from the MAKER file. See protein= below for details.


rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no

Set to 1 if you want to use the repeat masking data from the MAKER file. See the #-----Repeat Masking section below for details.


model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no

Set to 1 if you want to use the gene models from the MAKER file. See model_gff= below for details.


pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no

Set to 1 if you want to use the gene predictions from the MAKER file. See pred_gff= below for details.


other_pass=0 #passthrough everything else in maker_gff: 1 = yes, 0 = no

See other_gff= below for details.
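
As an illustration of how the re-annotation options fit together, here is a small Python sketch that fills in maker_gff= and switches on selected pass-through flags while leaving the trailing comments intact. The helper name set_options and the file name previous_run.all.gff are made up for the example; editing the control file by hand works just as well.

```python
def set_options(ctl_text, updates):
    """Return ctl_text with the given option values replaced, comments kept."""
    remaining = dict(updates)
    out = []
    for line in ctl_text.splitlines():
        # Separate the 'key=value' part from the trailing '#comment'.
        setting, sep, comment = line.partition("#")
        key, eq, _old_value = setting.partition("=")
        key = key.strip()
        if eq and key in remaining:
            new_line = f"{key}={remaining.pop(key)}"
            if sep:
                new_line += f" #{comment}"
            out.append(new_line)
        else:
            out.append(line)
    if remaining:
        raise KeyError(f"options not found: {sorted(remaining)}")
    return "\n".join(out) + "\n"


template = """\
#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #re-annotate genome based on this gff3 file
est_pass=0 #use ests in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use proteins in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
"""

# 'previous_run.all.gff' is a hypothetical file name.
updated = set_options(
    template,
    {"maker_gff": "previous_run.all.gff", "est_pass": 1, "rm_pass": 1},
)
print(updated)
```

Here the EST alignments and repeat data are reused from the old file, while protein_pass stays at 0 so the protein evidence would be aligned again from whatever is given to protein=.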


#-----EST Evidence (for best results provide a file for at least one)
est= #non-redundant set of assembled ESTs in fasta format (classic EST analysis)
altest= #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #EST evidence from an external gff3 file
altest_gff= #Alternate organism EST evidence from a separate gff3 file
#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=  #protein sequence file in fasta format
protein_gff=  #protein homology evidence from an external gff3 file
#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=all #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein=/Users/mcampbell/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to run repeat masking on prokaryotes (don't change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
#-----Gene Prediction
snaphmm= #SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species= #Augustus gene prediction species model
fgenesh_par_file= #Fgenesh parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #gene prediction from protein homology, 1 = yes, 0 = no
unmask=0 #Also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no
#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #features to pass-through to final output from an extenal GFF3 file
#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non standard amino acids in BLAST databases
cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)
#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases  memory usage)
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)
pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=0 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=0 #force start and stop codon into every gene, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=0 #Add unsupported gene prediction to final annotation set, 1 = yes, 0 = no
split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes
tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP= #specify a directory other than the system default temporary directory for temporary files
#-----EVALUATOR Control Options
evaluate=0 #run EVALUATOR on all annotations (very experimental), 1 = yes, 0 = no
side_thre=5
eva_window_size=70
eva_split_hit=1
eva_hspmax=100
eva_gspmax=100
enable_fathom=0

maker_exe.ctl

#-----Location of Executables Used by MAKER/EVALUATOR
makeblastdb=/Users/mcampbell/maker/bin/../exe/blast/bin/makeblastdb #location of NCBI+ makeblastdb executable
blastn=/Users/mcampbell/maker/bin/../exe/blast/bin/blastn #location of NCBI+ blastn executable
blastx=/Users/mcampbell/maker/bin/../exe/blast/bin/blastx #location of NCBI+ blastx executable
tblastx=/Users/mcampbell/maker/bin/../exe/blast/bin/tblastx #location of NCBI+ tblastx executable
formatdb= #location of NCBI formatdb executable
blastall= #location of NCBI blastall executable
xdformat= #location of WUBLAST xdformat executable
blasta= #location of WUBLAST blasta executable
RepeatMasker=/Users/mcampbell/maker/bin/../exe/RepeatMasker/RepeatMasker #location of RepeatMasker executable
exonerate=/Users/mcampbell/maker/bin/../exe/exonerate/bin/exonerate
#-----Ab-initio Gene Prediction Algorithms
snap=/Users/mcampbell/maker/bin/../exe/snap/snap #location of snap executable
gmhmme3= #location of eukaryotic genemark executable
gmhmmp= #location of prokaryotic genemark executable
augustus=/Users/mcampbell/maker/bin/../exe/augustus/bin/augustus #location of augustus executable
fgenesh= #location of fgenesh executable
#-----Other Algorithms
fathom=/Users/mcampbell/maker/bin/../exe/snap/fathom #location of fathom executable (experimental)
probuild= #location of probuild executable (required for genemark)

maker_bopts.ctl

#-----BLAST and Exonerate Statistics Thresholds
blast_type=ncbi+ #set to 'ncbi+', 'ncbi' or 'wublast'
pcov_blastn=0.8 #Blastn Percent Coverage Threhold EST-Genome Alignments
pid_blastn=0.85 #Blastn Percent Identity Threshold EST-Genome Aligments
eval_blastn=1e-10 #Blastn eval cutoff
bit_blastn=40 #Blastn bit cutoff
depth_blastn=0 #Blastn depth cutoff (0 to disable cutoff)
pcov_blastx=0.5 #Blastx Percent Coverage Threhold Protein-Genome Alignments
pid_blastx=0.4 #Blastx Percent Identity Threshold Protein-Genome Aligments
eval_blastx=1e-06 #Blastx eval cutoff
bit_blastx=30 #Blastx bit cutoff
depth_blastx=0 #Blastx depth cutoff (0 to disable cutoff)
pcov_tblastx=0.8 #tBlastx Percent Coverage Threhold alt-EST-Genome Alignments
pid_tblastx=0.85 #tBlastx Percent Identity Threshold alt-EST-Genome Aligments
eval_tblastx=1e-10 #tBlastx eval cutoff
bit_tblastx=40 #tBlastx bit cutoff
depth_tblastx=0 #tBlastx depth cutoff (0 to disable cutoff)
pcov_rm_blastx=0.5 #Blastx Percent Coverage Threhold For Transposable Element Masking
pid_rm_blastx=0.4 #Blastx Percent Identity Threshold For Transposbale Element Masking
eval_rm_blastx=1e-06 #Blastx eval cutoff for transposable element masking
bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
eva_pcov_blastn=0.8 #EVALUATOR Blastn Percent Coverage Threshold EST-Genome Alignments
eva_pid_blastn=0.85 #EVALUATOR Blastn Percent Identity Threshold EST-Genome Alignments
eva_eval_blastn=1e-10 #EVALUATOR Blastn eval cutoff
eva_bit_blastn=40 #EVALUATOR Blastn bit cutoff
ep_score_limit=20 #Exonerate protein percent of maximal score threshold
en_score_limit=20 #Exonerate nucleotide percent of maximal score threshold