The MAKER control files explained

From MAKER Wiki

MAKER is given all of the information it needs to run through three control files: maker_opts.ctl, maker_bopts.ctl, and maker_exe.ctl. Each file contains many options; some are essential for MAKER to run, and others are there to help MAKER run better on your species. Making informed, thoughtful decisions when setting control file options will result in a more accurate annotation for most species than simply accepting the defaults. Now let's go through the files.


maker_opts.ctl

The maker_opts.ctl file is the workhorse of the control files and is where the vast majority of the parameters are set, so here we go line by line.

#-----Genome (Required for De-Novo Annotation)

Headings for sections in the control files are marked by a pound sign and five dashes. These headings are not actually used by MAKER but are helpful when you are trying to find a specific option or parameter. This heading points out that the following options must be set for de novo annotation, which is true, but I can't think of any uses of MAKER that do not require a genome.


genome=/home/user/projects/thomas_the_train/assembly/scaffolds.fasta #genome sequence (fasta format or fasta embedded in GFF3)

This is a single multi-FASTA file that contains the assembled genome. In the example above I have given an absolute path, but a relative path would have worked as well. As a rule of thumb, to get a good annotation the scaffold N50 should be larger than the expected median gene length. You can get an estimate of your expected gene length by looking at previously annotated, closely related species. It is also important to note that although FASTA format accepts a large number of characters to represent nucleotides, many of them are not supported by some of the tools MAKER calls, such as exonerate, so it is a good idea to make sure that your FASTA sequence contains only A, T, C, G, and N.
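
Both checks above (allowed characters and scaffold N50) are easy to run before starting MAKER. The following is a minimal sketch, not part of MAKER; the function names read_fasta_lengths and n50 are made up for illustration, and the demo sequences are invented.

```python
from collections import Counter


def read_fasta_lengths(lines):
    """Return (lengths, unexpected) for an iterable of FASTA lines.

    lengths maps each sequence ID to its length; unexpected counts every
    character that is not A, C, G, T, or N (case-insensitive).
    """
    lengths = {}
    unexpected = Counter()
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence ID.
            name = (line[1:].split() or [""])[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
            unexpected.update(c for c in line.upper() if c not in "ACGTN")
    return lengths, unexpected


def n50(lengths):
    """Length of the sequence at which the running total, taken from the
    longest sequence down, first reaches half of the assembly size."""
    lengths = sorted(lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            return length
    return 0


if __name__ == "__main__":
    demo = [">scaffold_1 demo", "ACGTNNACGT", "ACGTR", ">scaffold_2", "acgt"]
    lengths, unexpected = read_fasta_lengths(demo)
    print(lengths, dict(unexpected), n50(lengths.values()))
```

For a real assembly you would pass an open file handle instead of the demo list, for example read_fasta_lengths(open("scaffolds.fasta")), and compare the N50 against the gene lengths of a related annotated species.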


organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

MAKER's default is eukaryotic. Setting this to prokaryotic changes some of the MAKER behavior options automatically, such as turning off repeat masking.


#-----Re-annotation Using MAKER Derived GFF3

This section was originally developed for rerunning MAKER with the same evidence from a previous MAKER run after the original data store had gone the way of all the earth (been deleted). This uses an SQLite database to keep track of the information in the gff3 file. The default configuration for SQLite doesn't handle the really big MAKER produced gff3 files (large genome with lots of evidence) very well. If you have a big file and you want to reuse the data, there are places further down in this file that allow you to give evidence, gene predictions, and repeat data in gff3 format. You could also recompile SQLite with -DSQLITE_MAX_EXPR_DEPTH=0. There are instructions for doing this on the MAKER dev list.


maker_gff= #re-annotate genome based on this gff3 file

Path to the MAKER generated gff3 file.


est_pass=0 #use ests in maker_gff: 1 = yes, 0 = no

Set to 1 if you want to use the EST/mRNA-Seq alignments from the MAKER file. See est= below for details.


altest_pass=0 #use alternate organism ests in maker_gff: 1 = yes, 0 = no

Set to 1 if you want to use the alternative EST/mRNA-seq alignments from the MAKER file. See altest= below for details.


protein_pass=0 #use proteins in maker_gff: 1 = yes, 0 = no

Set to 1 if you want to use the protein alignments from the MAKER file. See protein= below for details.


rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no

Set to 1 if you want to use the repeat masking data from the MAKER file. See the #-----Repeat Masking section below for details.


model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no

Set to 1 if you want to use the gene models from the MAKER file. See model_gff= below for details.


pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no

Set to 1 if you want to use the gene predictions from the MAKER file. See pred_gff= below for details.


other_pass=0 #passthrough everything else in maker_gff: 1 = yes, 0 = no

See other_gff= below for details.


#-----EST Evidence (for best results provide a file for at least one)

This might be more appropriately titled "Transcript Evidence" because this is where you pass in not only ESTs but also assembled mRNA-seq and assembled full length cDNAs. These are assumed to be correctly assembled and aligned around splice sites (MAKER uses exonerate to align around splice sites for ESTs in FASTA files). MAKER can use them to infer gene models directly (est2genome option), can use them as support for maintaining predictions, and can use them to modify structure and add UTR to predictions. If you let MAKER try and find alternative splice forms, they will be used to identify support for splice variants. How these cluster with other evidence will help MAKER infer gene boundaries in some cases. MAKER will also use splice sites inferred from the ESTs to inform gene predictors during the prediction step.


est=/home/user/projects/thomas_the_train/trinity/funnel.fasta,../trinity/coaltender.fasta #non-redundant set of assembled ESTs in fasta format (classic EST analysis)

This is the first option that we have come to that can accept multiple files. In the example above I have passed in two files. MAKER has several options in addition to this one that can accept multiple files as a comma separated list. These files contain assembled mRNA-Seq transcripts, ESTs, or full length cDNAs. In the example above I have given an absolute path to one tissue specific assembled mRNA-Seq file and a relative path to another.
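
To make the line format concrete, here is a small sketch of how a control file line breaks down into an option name, a comma separated list of values, and a trailing comment. It is an illustration only, not MAKER's own parser; parse_ctl_line is a hypothetical name, and the sketch assumes a pound sign never appears inside a value.

```python
def parse_ctl_line(line):
    """Split one control-file line into (key, values, comment).

    Returns None for blank lines and for heading or comment-only lines.
    Assumes '#' never appears inside a value, which holds for the
    examples on this page.
    """
    body, _, comment = line.partition("#")
    body = body.strip()
    if not body or "=" not in body:
        return None
    key, _, value = body.partition("=")
    # Options that accept several files use a comma separated list.
    values = [v.strip() for v in value.split(",") if v.strip()]
    return key.strip(), values, comment.strip()


if __name__ == "__main__":
    line = ("est=/home/user/projects/thomas_the_train/trinity/funnel.fasta,"
            "../trinity/coaltender.fasta #non-redundant set of assembled ESTs")
    print(parse_ctl_line(line))
```

An option left empty, such as altest=, comes back with an empty list of values.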

altest= #EST/cDNA sequence file in fasta format from an alternate organism

This is useful if you don't have any RNA evidence from your model organism. However, these alignments are done with tblastx, so the genome and the RNA evidence are translated in all six reading frames and all possible alignments are explored, which takes a really long time. If you have good mRNA-Seq data or ESTs you won't gain much from using this kind of evidence. This option was used much more before mRNA-Seq became so cheap and routine for genome projects. Since these data are from another species and they are aligned in protein space, they are not used to infer UTR.


est_gff=../cufflinks/boiler.gff,../cufflinks/brake.gff #EST evidence from an external gff3 file

These are pre-aligned transcripts from the organism being annotated in gff3 format. The most common sources of this kind of data are alignment-based transcript assemblers such as cufflinks, or output from a previous MAKER annotation.


altest_gff= #Alternate organism EST evidence from a separate gff3 file

These are pre-aligned transcripts from an alternate organism in gff3 format. The most common source of this kind of data is output from a previous MAKER annotation. After altests have been aligned once, passing them back to MAKER in gff3 format in subsequent MAKER runs can decrease runtime substantially.


#-----Protein Homology Evidence (for best results provide a file for at least one)

MAKER uses exonerate to align around splice sites for proteins in FASTA files. MAKER can use them to infer gene models directly (protein2genome option), but only if they align correctly around splice sites. MAKER can use them as support for maintaining predictions (the CDS will be checked where possible to ensure the gene prediction and protein alignment are in the same reading frame). How these cluster with other evidence will help MAKER infer gene boundaries in some cases. MAKER will also use ORFs inferred from the proteins to inform gene predictors during the prediction step.

Depending on the organism I will use a recent download of UniProt/Swiss-Prot or "NP" sequences from RefSeq. I avoid UniProt/TrEMBL and GenBank since they contain unreviewed sequences that may not exist or are pseudogenes or repetitive elements. You can also use a selection of high confidence proteins from a closely related species. These could be the products of transcripts with an AED less than 0.5 in the other species if annotated by MAKER. In most annotations there are a number of weakly supported genes that are often dead transposons or pseudogenes, so I do not like to take all the annotated proteins from a related species. In the worst case scenario, imagine you have a dead transposon annotated as a gene in a closely related species, and when you made your repeat masking library you found that one of the entries matched this sequence. Assuming that it is a real gene, you delete it from your masking library. Now when you annotate the genome this one gene becomes a whole gene family in the annotation set, but it is really a combination of bad evidence and bad masking.
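
One way to pick out those high confidence transcripts from a related species is to filter its MAKER gff3 file on AED. The sketch below assumes the mRNA features carry an _AED attribute in column 9, as MAKER output normally does; check your own file before relying on it. The function name low_aed_transcripts is hypothetical.

```python
def low_aed_transcripts(gff_lines, max_aed=0.5):
    """Return IDs of mRNA features whose _AED attribute is below max_aed."""
    keep = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence section; no more features follow
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        # Column 9 is a semicolon separated list of key=value pairs.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if "_AED" in attrs and float(attrs["_AED"]) < max_aed:
            keep.append(attrs.get("ID", ""))
    return keep


if __name__ == "__main__":
    demo = [
        "##gff-version 3",
        "s1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=g1-mRNA-1;Parent=g1;_AED=0.12",
        "s1\tmaker\tmRNA\t1\t900\t.\t+\t.\tID=g2-mRNA-1;Parent=g2;_AED=0.80",
    ]
    print(low_aed_transcripts(demo))
```

The returned IDs can then be used to pull the matching records out of the related species' protein FASTA file.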


protein=../protein/swiss_prot.fasta  #protein sequence file in fasta format

This is a collection of protein sequences in fasta format.


protein_gff=  #protein homology evidence from an external gff3 file

These are pre-aligned proteins in gff3 format. The most common source of these data is a previous MAKER run.


#-----Repeat Masking (leave values blank to skip repeat masking)

Repeats will be masked to stop ESTs and proteins from aligning to repetitive regions and to keep gene prediction algorithms from calling exons in those regions. Many repeats encode real proteins (e.g., retrotransposases and others). Because of this, gene predictors and aligners are often confused by them (they can falsely be added as exons onto gene calls, for example).


model_org=all #select a model organism for RepBase masking in RepeatMasker

Common things to put in here are all, mammals, grasses, primates, monocotyledons. You can also put in genus and species as long as they are bound by double quotes, e.g., "drosophila melanogaster". You can also put simple here to use RepeatMasker to mask only low complexity repeats (simple is MAKER specific). The RepeatMasker documentation has more detailed information. If you have gone to the trouble of making a custom repeat library you may not need to use RepBase.


rmlib=../repeat_lib/thomas_TEs.fasta #provide an organism specific repeat library in fasta format for RepeatMasker

This is where you put in your custom repeat library. There are several ways to generate this file; examples of two of them are found in Repeat Library Construction--Basic and Repeat Library Construction--Advanced. Several of the weird and wonderful organisms that have been annotated by MAKER were in essence un-annotatable without a species specific repeat library.

repeat_protein=/Users/mcampbell/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner

This is a file of known transposable elements that comes packaged with MAKER. They are aligned in protein space to help mask known transposable elements that have diverged over time.


rm_gff= #repeat elements from an external GFF3 file

These are pre-aligned repeats in gff3 format. The most common source of these data is a previous MAKER run.


prok_rm=0 #forces MAKER to run repeat masking on prokaryotes (don't change this), 1 = yes, 0 = no

This was added in response to a collaborator's request. As a general rule masking a prokaryotic genome is unnecessary and can lead to truncated gene models. That being said, there are some strange critters out there and there are situations where this may be useful.


softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

Soft-masking in blast prevents alignments from seeding in regions of low complexity but allows alignments to extend through these regions.
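
If the difference between soft and hard masking is unfamiliar, the usual sequence-level convention is that soft-masked bases are kept but written in lower case, while hard-masked bases are replaced with N. The sketch below only illustrates that convention; it is not how MAKER or BLAST implement masking, and the function name and interval format are assumptions.

```python
def mask(sequence, intervals, soft=True):
    """Mask 0-based, half-open (start, end) intervals.

    Soft masking lower-cases the bases, so the sequence is still there
    for an alignment to extend through; hard masking replaces them with N.
    """
    chars = list(sequence.upper())
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)


if __name__ == "__main__":
    seq = "ACGTACGTAC"
    print(mask(seq, [(2, 6)]))              # soft-masked
    print(mask(seq, [(2, 6)], soft=False))  # hard-masked
```

With hard masking the original bases are gone, which is why an alignment cannot extend through a hard-masked region.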


#-----Gene Prediction

If you want to get gene models out of MAKER you have to put something in this section. MAKER needs some kind of gene prediction to work off of. Not everything in this section is used in the same way by MAKER. Gene predictions from tools such as SNAP, Augustus, and GeneMark are lower confidence than gene models. They do not affect evidence clustering. MAKER can keep them as they are or modify them by trimming and adding exons based on EST evidence. These will only be maintained in the final annotation set if there is some form of evidence supporting their structure. When multiple gene finders are used, MAKER will compete them against each other and will choose to work off of the one that matches the evidence the best.

Gene models on the other hand are assumed to be high confidence. These will affect clustering of evidence. Because they are high confidence, the clustering will slightly bias MAKER towards keeping rather than replacing previous models for borderline cases. They can also be used for maintaining names. MAKER is only allowed to keep or replace models and cannot modify them. If no evidence supports them, MAKER can still keep them because they are assumed to be high confidence (but MAKER will still tag them with an AED score of 1 in those cases).

snaphmm=../trained_snap/thomas_1.hmm #SNAP HMM file

This is where you put the HMM file required to run SNAP. This file can be species specific or from a closely related species. This option can also accept multiple files, and SNAP will be run separately using each HMM. In the example above I have passed MAKER a species specific HMM.


gmhmm=../train_genemark/es.mod #GeneMark HMM file

This is where you put the HMM file required for GeneMark. This option can also accept multiple files, and GeneMark will be run separately using each HMM.


augustus_species=steam_tram #Augustus gene prediction species model

This one is a challenge to train but it works very well. You just put the species name here. To get a list of options, look in the config/species/ directory of your Augustus installation. If you have used the autoAug.pl program to train Augustus yourself, you will see your species in the species directory as well. In this example I am passing in a closely related species.


fgenesh_par_file= #Fgenesh parameter file

If you have an fgenesh par file this is where you put it.


pred_gff= #ab-initio predictions from an external GFF3 file

Just because MAKER doesn't run a specific gene predictor internally doesn't mean that it can't use it. You can use predictions from any gene finder in MAKER as long as you convert the gene finder output to gff3 format.


model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)

These are assumed to be high confidence gene models. These will affect clustering of evidence. Because they are high confidence, the clustering will slightly bias MAKER towards keeping rather than replacing previous models for borderline cases. They can also be used for maintaining names. MAKER is only allowed to keep or replace models and cannot modify them. If no evidence supports them, MAKER can still keep them because they are assumed to be high confidence (but MAKER will still tag them with an AED score of 1 in those cases).


est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no

This option allows you to make gene models directly from the transcript evidence. This option is useful when you don't have a gene predictor trained on your organism and there is not a training file available for a closely related organism. The gene models from this option are going to be fragmented and incomplete because of the nature of transcript data, especially mRNA-Seq. These gene models are most useful for first round training of gene finders. Once you have a trained gene predictor, turn this option off.


protein2genome=0 #gene prediction from protein homology, 1 = yes, 0 = no

Similar to est2genome this option will make gene models out of protein data. Like est2genome this option is most useful for training gene predictors.
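
The train-then-turn-off cycle described for est2genome and protein2genome comes down to changing a couple of values in maker_opts.ctl between runs. If you would rather script that than edit the file by hand, something like the sketch below works on the text of the file; set_option is a hypothetical helper, not a MAKER utility.

```python
def set_option(ctl_text, key, value):
    """Return ctl_text with `key` set to `value`, keeping the trailing comment.

    Raises KeyError if the option is not present in the text.
    """
    out = []
    found = False
    for line in ctl_text.splitlines():
        body, sep, comment = line.partition("#")
        name, eq, _ = body.partition("=")
        if eq and name.strip() == key:
            line = f"{key}={value} {sep}{comment}".rstrip()
            found = True
        out.append(line)
    if not found:
        raise KeyError(key)
    return "\n".join(out)


if __name__ == "__main__":
    ctl = (
        "#-----Gene Prediction\n"
        "snaphmm= #SNAP HMM file\n"
        "est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no"
    )
    # Round 1: no trained predictor yet, so build models from transcripts.
    round1 = set_option(ctl, "est2genome", 1)
    # Round 2: a SNAP HMM has been trained, so switch est2genome back off.
    round2 = set_option(round1, "est2genome", 0)
    round2 = set_option(round2, "snaphmm", "../trained_snap/thomas_1.hmm")
    print(round2)
```

Headings and comments are left untouched, so the edited file still reads the same as the original.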


unmask=0 #Also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no

This option lets the gene finders run on the unmasked sequence as well as the masked. All of the resulting predictions are competed against each other and the one that matches the evidence best is chosen for MAKER to work off of.


#-----Other Annotation Feature Types (features MAKER doesn't recognize)

This section lets you give MAKER data to add to the final annotation set that MAKER doesn't annotate itself.


other_gff= #features to pass-through to final output from an external GFF3 file

These are GFF3 lines you just want MAKER to add to your files. They normally represent things MAKER doesn't predict (promoter/enhancer regions, CpG islands, restriction sites, non-coding RNAs, etc.). MAKER will not attempt to validate the features, but will just pass them through "as is" to the final GFF3 file.


#-----External Application Behavior Options

These options are passed to blast and can usually be left as default, especially if you are running MAKER with MPI.


alt_peptide=C #amino acid used to replace non standard amino acids in BLAST databases

This is a standard blast option that allows the user to specify the identity of non-standard amino acids in the blast database. C (cysteine) is here as the default because it has the lowest mismatch penalty of all of the amino acids in the BLOSUM matrix.


cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)

If for some reason you cannot use MPI but have access to a machine with multiple cpus, you can enter the number of cpus that you want blast to use. If you are using MPI, MAKER handles this option internally based on the number of cpus you specified with the mpiexec command.


#-----MAKER Behavior Options

These options are used to tune MAKER to run well on your genome.


max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases  memory usage)

It is important that this parameter is at least three times the expected maximum intron size. If memory is not limiting, 300000 is a good max_dna_len, especially on large vertebrate genomes.
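
The three-times rule is simple arithmetic, shown here as a tiny sketch. The function name is made up, and the default of 100000 is just the value shown in the example line above.

```python
def suggest_max_dna_len(max_intron, default=100000, factor=3):
    """Return the larger of the default and factor * expected max intron size."""
    return max(default, factor * max_intron)


if __name__ == "__main__":
    # Expected maximum intron sizes in bp; these values are made up.
    for max_intron in (5000, 20000, 100000):
        print(max_intron, suggest_max_dna_len(max_intron))
```

An expected maximum intron of 100000 bp gives the 300000 mentioned above; anything up to about 33000 bp is already covered by the default.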


min_contig=1 #skip genome contigs below this length (under 10kb are often useless)

Setting this option to 10000 can decrease run time without sacrificing annotation quality.


pred_flank=200 #flank for extending evidence clusters sent to gene predictors

If you are annotating a genome with a sparse/fragmented evidence set increasing this value can capture exons missing from your evidence. In very compact genomes decreasing this value can decrease gene merging.


pred_stats=0 #report AED and QI statistics for all predictions as well as models

Set this to 1 if you would like to more closely evaluate all of the gene predictions in the MAKER output gff3 file.


AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)

This option can be used to make super high confidence annotation sets. Setting this option to a value lower than 1 will result in a final annotation set with fewer gene models but they will be better supported by the evidence.
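
For readers who have not met AED before: it is commonly defined as 1 - (SN + SP) / 2, where SN is the fraction of the evidence overlapped by the annotation and SP is the fraction of the annotation overlapped by the evidence, both counted in nucleotides. The sketch below follows that definition on simple exon intervals. It is a simplification for illustration, not MAKER's actual calculation, and the function names are assumptions.

```python
def covered(intervals):
    """Set of 1-based positions covered by inclusive (start, end) intervals."""
    return {p for start, end in intervals for p in range(start, end + 1)}


def aed(annotation_exons, evidence_exons):
    """Annotation Edit Distance between an annotation and its evidence.

    0 means perfect agreement; 1 means no overlap (or no evidence at all).
    """
    ann = covered(annotation_exons)
    evi = covered(evidence_exons)
    if not ann or not evi:
        return 1.0
    shared = len(ann & evi)
    sensitivity = shared / len(evi)   # evidence recovered by the annotation
    specificity = shared / len(ann)   # annotation backed by the evidence
    return 1.0 - (sensitivity + specificity) / 2.0


if __name__ == "__main__":
    print(aed([(1, 100)], [(1, 100)]))    # identical structures
    print(aed([(1, 100)], [(51, 150)]))   # half of each overlaps
    print(aed([(1, 100)], []))            # no supporting evidence
```

With AED_threshold set below 1, only models whose AED is at or under that value would be kept.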


min_protein=0 #require at least this many amino acids in predicted proteins

Sometimes gene predictors will generate a large number of very short predictions. Because of the noisy nature of some evidence types, namely mRNA-Seq, many of these small predictions will look like they are supported by the evidence and make it into the final annotation set with an AED < 1. Setting this option can prevent some of these spurious predictions with spurious evidence support from getting into the final annotation set.


alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no

When this parameter is set to 0, MAKER will generate a single annotation for each gene that best matches the evidence. When set to 1, MAKER will separate the evidence in each cluster into groups in which all of the evidence could have come from the same transcript. Information from each group is then given to the gene finders as hints, and the gene finder is run again on that region of the genome. If the gene finder predicts an alternative transcript, it is kept in the final GFF3 output.


always_complete=0 #force start and stop codon into every gene, 1 = yes, 0 = no

This was added as a request from a collaborator who wanted all gene models to have a start and a stop codon.


map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no

This is useful when you are updating a set of legacy annotations in light of new data and don't want to lose the information in column 9 of the legacy gff3 file.


keep_preds=0 #Add unsupported gene prediction to final annotation set, 1 = yes, 0 = no

This is used when you want an annotation set with maximum sensitivity. As a general rule, gene finders trained on novel genomes have a tendency to over-predict, sometimes quadrupling the number of annotated gene models.


split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)

This option is currently used to keep blast from aligning transcripts with exons unreasonably far apart. In versions 2.27 and below of MAKER this was also used to determine if a chunk might have split a gene. In versions 2.28 and above MAKER uses overlapping chunks, mitigating the need for this value in finding genes split between chunks.
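
As an illustration of what a maximum intron size means for an alignment, the sketch below groups the aligned segments (HSPs) of one hit and starts a new group whenever the gap to the previous segment exceeds split_hit. This only shows the idea; MAKER's internal logic may differ, and split_by_gap is a hypothetical name.

```python
def split_by_gap(hsps, split_hit=10000):
    """Group (start, end) HSPs so that no gap inside a group exceeds split_hit."""
    groups = []
    last_end = None
    for start, end in sorted(hsps):
        if last_end is not None and start - last_end <= split_hit:
            groups[-1].append((start, end))
            last_end = max(last_end, end)
        else:
            # Gap is larger than any intron we expect: treat as a separate hit.
            groups.append([(start, end)])
            last_end = end
    return groups


if __name__ == "__main__":
    hsps = [(100, 200), (500, 600), (20000, 20100)]
    print(split_by_gap(hsps))
```

The first two segments are 300 bp apart and stay together; the third is more than 10000 bp away and is split off.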


single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no

By default MAKER doesn't use single exon ESTs as supporting evidence for gene models unless there is protein support as well. Single exon ESTs and assembled mRNA-Seq transcripts are often the result of genomic contamination in the RNA prep. Leaving this off decreases the sensitivity of MAKER somewhat, but the hit in specificity you take when you turn it on can be much worse for the overall accuracy.


single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'

If you do set single_exon to 1, this option helps keep the smallest sequences, those most likely to not make functional proteins, from being annotated.


correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes

This was added to help prevent the merging of gene models because of overlapping UTRs, common in fungal genomes. This looks for gene models where the five prime UTR is greater than half the gene length. MAKER then breaks the gene at the start codon and runs the gene predictors over the region that was five prime UTR. This results in the loss of three prime UTR on the five prime gene and loss of five prime UTR on the three prime gene, but it is better than a merged gene.


tries=2 #number of times to try a contig if there is a failure for some reason

This can be a real time saver if you are running on a temperamental server or cluster. However, setting this to more than five is generally a waste of human and computational time. If a contig/scaffold keeps failing there is probably a reason, and simply trying again won't fix it.


clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no

Sometimes it is best to start again from scratch. Setting this to 1 will remove all of the results files from theVoid as well as indexes that might be problematic.


clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no

This option will help save disk space by deleting individual results files (such as blast, exonerate, and gene predictor outputs) once they are no longer needed. If you have the disk space it is usually best to keep this set to 0. Having those files around will make rerunning MAKER much faster if necessary.

TMP= #specify a directory other than the system default temporary directory for temporary files

RepeatMasker writes a whole bunch of tmp files, so you want to make sure you have adequate space on the drive this is pointed to. You also want to make sure it is pointed to a fast drive so the files are there when MAKER tries to read them. As a general rule, NFS mounted drives are slower.


#-----EVALUATOR Control Options

This is the remains of an old project that hasn't quite been given up on. This produces additional quality information for each gene. It has not been extensively tested.

evaluate=0 #run EVALUATOR on all annotations (very experimental), 1 = yes, 0 = no
side_thre=5
eva_window_size=70
eva_split_hit=1
eva_hspmax=100
eva_gspmax=100
enable_fathom=0

maker_exe.ctl

#-----Location of Executables Used by MAKER/EVALUATOR
makeblastdb=/Users/mcampbell/maker/bin/../exe/blast/bin/makeblastdb #location of NCBI+ makeblastdb executable
blastn=/Users/mcampbell/maker/bin/../exe/blast/bin/blastn #location of NCBI+ blastn executable
blastx=/Users/mcampbell/maker/bin/../exe/blast/bin/blastx #location of NCBI+ blastx executable
tblastx=/Users/mcampbell/maker/bin/../exe/blast/bin/tblastx #location of NCBI+ tblastx executable
formatdb= #location of NCBI formatdb executable
blastall= #location of NCBI blastall executable
xdformat= #location of WUBLAST xdformat executable
blasta= #location of WUBLAST blasta executable
RepeatMasker=/Users/mcampbell/maker/bin/../exe/RepeatMasker/RepeatMasker #location of RepeatMasker executable
exonerate=/Users/mcampbell/maker/bin/../exe/exonerate/bin/exonerate
#-----Ab-initio Gene Prediction Algorithms
snap=/Users/mcampbell/maker/bin/../exe/snap/snap #location of snap executable
gmhmme3= #location of eukaryotic genemark executable
gmhmmp= #location of prokaryotic genemark executable
augustus=/Users/mcampbell/maker/bin/../exe/augustus/bin/augustus #location of augustus executable
fgenesh= #location of fgenesh executable
#-----Other Algorithms
fathom=/Users/mcampbell/maker/bin/../exe/snap/fathom #location of fathom executable (experimental)
probuild= #location of probuild executable (required for genemark)

maker_bopts.ctl

#-----BLAST and Exonerate Statistics Thresholds
blast_type=ncbi+ #set to 'ncbi+', 'ncbi' or 'wublast'
pcov_blastn=0.8 #Blastn Percent Coverage Threshold EST-Genome Alignments
pid_blastn=0.85 #Blastn Percent Identity Threshold EST-Genome Alignments
eval_blastn=1e-10 #Blastn eval cutoff
bit_blastn=40 #Blastn bit cutoff
depth_blastn=0 #Blastn depth cutoff (0 to disable cutoff)
pcov_blastx=0.5 #Blastx Percent Coverage Threshold Protein-Genome Alignments
pid_blastx=0.4 #Blastx Percent Identity Threshold Protein-Genome Alignments
eval_blastx=1e-06 #Blastx eval cutoff
bit_blastx=30 #Blastx bit cutoff
depth_blastx=0 #Blastx depth cutoff (0 to disable cutoff)
pcov_tblastx=0.8 #tBlastx Percent Coverage Threshold alt-EST-Genome Alignments
pid_tblastx=0.85 #tBlastx Percent Identity Threshold alt-EST-Genome Alignments
eval_tblastx=1e-10 #tBlastx eval cutoff
bit_tblastx=40 #tBlastx bit cutoff
depth_tblastx=0 #tBlastx depth cutoff (0 to disable cutoff)
pcov_rm_blastx=0.5 #Blastx Percent Coverage Threshold For Transposable Element Masking
pid_rm_blastx=0.4 #Blastx Percent Identity Threshold For Transposable Element Masking
eval_rm_blastx=1e-06 #Blastx eval cutoff for transposable element masking
bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
eva_pcov_blastn=0.8 #EVALUATOR Blastn Percent Coverage Threshold EST-Genome Alignments
eva_pid_blastn=0.85 #EVALUATOR Blastn Percent Identity Threshold EST-Genome Alignments
eva_eval_blastn=1e-10 #EVALUATOR Blastn eval cutoff
eva_bit_blastn=40 #EVALUATOR Blastn bit cutoff
ep_score_limit=20 #Exonerate protein percent of maximal score threshold
en_score_limit=20 #Exonerate nucleotide percent of maximal score threshold