The MAKER control files explained

From MAKER Wiki

#-----Gene Prediction

If you want to get gene models out of MAKER you have to put something in this section. MAKER needs some kind of gene prediction to work off of. Not everything in this section is used in the same way by MAKER. Gene predictions from tools such as SNAP, Augustus, and GeneMark are lower confidence than gene models. They do not affect evidence clustering. MAKER can keep them as they are or modify them by trimming and adding exons based on EST evidence. These will only be maintained in the final annotation set if there is some form of evidence supporting their structure. When multiple gene finders are used, MAKER will compete them against each other and will choose to work off of the one that matches the evidence best.

Gene models, on the other hand, are assumed to be high confidence. These will affect clustering of evidence. Because they are high confidence, the clustering will slightly bias MAKER towards keeping rather than replacing previous models for borderline cases. They can also be used for maintaining names. MAKER is only allowed to keep or replace models and cannot modify them. If no evidence supports them, MAKER can still keep them because they are assumed to be high confidence (but MAKER will still tag them with an AED score of 1 in those cases).

snaphmm=../trained_snap/thomas_1.hmm #SNAP HMM file

This is where you put the HMM file required to run SNAP. This file can be species-specific or from a closely related species. This option can also accept multiple files, and SNAP will be run separately using each HMM. In the example above I have passed MAKER a species-specific HMM.

gmhmm=../train_genemark/es.mod #GeneMark HMM file

This is where you put the HMM file required for GeneMark. This option can also accept multiple files, and GeneMark will be run separately using each HMM.

augustus_species=steam_tram #Augustus gene prediction species model

This one is a challenge to train but it works very well. You just put the species name here. To get a list of options, look in the config/species/ directory of your Augustus installation. If you have used the autoAug.pl program to train Augustus yourself, you will see your species in the species directory as well. In this example I am passing in a closely related species.

pred_gff= #ab-initio predictions from an external GFF3 file

Just because MAKER doesn't run a specific gene predictor internally doesn't mean that it can't use it. You can use predictions from any gene finder in MAKER as long as you convert the gene finder output to gff3 format.

est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no

This option allows you to make gene models directly from the transcript evidence. This option is useful when you don't have a gene predictor trained on your organism and there is not a training file available for a closely related organism. The gene models from this option are going to be fragmented and incomplete because of the nature of transcript data, especially mRNA-Seq. These gene models are most useful for first-round training of gene finders. Once you have a trained gene predictor, turn this option off.

protein2genome=0 #gene prediction from protein homology, 1 = yes, 0 = no

Similar to est2genome, this option will make gene models out of protein data. Like est2genome, this option is most useful for training gene predictors.

unmask=0 #Also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no

This option lets the gene finders run on the unmasked sequence as well as the masked sequence. All of the resulting predictions are competed against each other and the one that matches the evidence best is chosen for MAKER to work off of.

This section lets you give MAKER data to add to the final annotation set that MAKER doesn't annotate itself.

other_gff= #features to pass-through to final output from an external GFF3 file

These are GFF3 lines you just want MAKER to add to your files. They normally represent things MAKER doesn't predict (promoter/enhancer regions, CpG islands, restriction sites, non-coding RNAs, etc). MAKER will not attempt to validate the features, but will just pass them through "as is" to the final GFF3 file.


Revision as of 17:48, 16 August 2013

MAKER is given all of the information it needs to run through three control files: maker_opts.ctl, maker_bopts.ctl, and maker_exe.ctl. Each file contains many options; some are essential for MAKER to run and others are there to help MAKER run better on your species. Making informed, thoughtful decisions when setting control file options will result in a more accurate annotation for most species than simply accepting the defaults. Now let's go through the files.


maker_opts.ctl

The maker_opts.ctl file is the workhorse of the control files and is where the vast majority of the parameters are set, so here we go line by line.

#-----Genome (Required for De-Novo Annotation)

Headings for sections in the control files are marked by a pound sign and five dashes. These headings are not actually used by MAKER but are helpful when you are trying to find a specific option or parameter. This heading points out that the following options must be set for de novo annotation, which is true, but I can't think of any uses of MAKER that do not require a genome.


genome=/home/user/projects/thomas_the_train/assembly/scaffolds.fasta #genome sequence (fasta format or fasta embedded in GFF3)

This is a single multifasta file that contains the assembled genome. In the example above I have given an absolute path, but a relative path would have worked as well. As a rule of thumb, to get a good annotation the scaffold N50 should be larger than the expected median gene length. You can get an estimate of your expected gene length by looking at previously annotated closely related species. It is also important to note that although fasta format accepts a large number of characters to represent nucleotides, many of them are not supported by some of the tools MAKER calls, such as exonerate, so it is a good idea to make sure that your fasta sequence contains only A, T, C, G, and N.
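Both checks described above are easy to script before you start a run. The following is a minimal sketch, not part of MAKER; the function names and the tiny in-memory example are made up for illustration, and for a real assembly you would pass an open file handle to read_fasta instead of a list.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def n50(lengths):
    """Return the length L such that sequences of length >= L
    cover at least half of the total assembly (0 for no sequences)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def unsupported_characters(sequence, allowed="ACGTN"):
    """Count characters outside the allowed alphabet (case-insensitive,
    so soft-masked lowercase bases are not reported)."""
    counts = Counter(sequence.upper())
    return {base: n for base, n in counts.items() if base not in allowed}


# Hypothetical three-scaffold assembly, kept tiny for illustration.
example = [">scaffold_1", "ACGTACGTNN", ">scaffold_2", "ACGRYT", ">scaffold_3", "ACG"]
records = list(read_fasta(example))
print("scaffold N50:", n50([len(seq) for _, seq in records]))
for name, seq in records:
    bad = unsupported_characters(seq)
    if bad:
        print(name, "contains unsupported characters:", bad)
```

Compare the reported N50 with the gene length you expect from a related species; scaffolds flagged by the second check can be cleaned up (for example by replacing ambiguity codes with N) before you hand the file to MAKER.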


organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

MAKER's default is eukaryotic. Setting this to prokaryotic changes some of the MAKER behavior options automatically, such as turning off repeat masking.


#-----Re-annotation Using MAKER Derived GFF3

This section was originally developed for rerunning MAKER with the same evidence from a previous MAKER run after the original data store had gone the way of all the earth (been deleted). This uses an SQLite database to keep track of the information in the gff3 file. The default configuration for SQLite doesn't handle the really big MAKER-produced gff3 files (large genome with lots of evidence) very well. If you have a big file and you want to reuse the data, there are places further down in this file that allow you to give evidence, gene predictions, and repeat data in gff3 format. You could also recompile SQLite with -DSQLITE_MAX_EXPR_DEPTH=0. There are instructions for doing this on the MAKER dev list.


maker_gff= #re-annotate genome based on this gff3 file

Path to the MAKER-generated gff3 file.


est_pass=0 #use ests in maker_gff: 1 = yes, 0 = no

Set to 1 if you want to use the EST/mRNA-Seq alignments from the MAKER file. See est= below for details.


altest_pass=0 #use alternate organism ests in maker_gff: 1 = yes, 0 = no

Set to 1 if you want to use the alternative EST/mRNA-seq alignments from the MAKER file. See altest= below for details.


protein_pass=0 #use proteins in maker_gff: 1 = yes, 0 = no

Set to 1 if you want to use the protein alignments from the MAKER file. See protein= below for details.


rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no

Set to 1 if you want to use the repeat masking data from the MAKER file. See the #-----Repeat Masking section below for details.


model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no

Set to 1 if you want to use the gene models from the MAKER file. See model_gff= below for details.


pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no

Set to 1 if you want to use the gene predictions from the MAKER file. See pred_gff= below for details.


other_pass=0 #passthrough everything else in maker_gff: 1 = yes, 0 = no

Set to 1 if you want to pass through everything else from the MAKER file. See other_gff= below for details.
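If you switch these options on and off between runs, editing the file by hand gets tedious. The following is a minimal sketch of a helper that rewrites selected values while leaving the trailing comments alone. It is not part of MAKER: the function name is made up, the gff3 path is hypothetical, and it assumes that values never contain a # character.

```python
def set_ctl_options(ctl_text, updates):
    """Return ctl_text with the values of the given keys replaced.

    Trailing '#' comments, section headings, and all other lines are
    preserved unchanged. Raises KeyError if a key is not in the file.
    """
    remaining = dict(updates)
    out = []
    for line in ctl_text.splitlines():
        setting, sep, comment = line.partition("#")
        key, eq, _ = setting.partition("=")
        key = key.strip()
        if eq and key in remaining:
            line = f"{key}={remaining.pop(key)} {sep}{comment}".rstrip()
        out.append(line)
    if remaining:
        raise KeyError(f"options not found: {sorted(remaining)}")
    return "\n".join(out)


ctl = """#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #re-annotate genome based on this gff3 file
est_pass=0 #use ests in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use proteins in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no"""

# "../round1/thomas.all.gff" is a made-up path used only for this example.
updated = set_ctl_options(
    ctl,
    {"maker_gff": "../round1/thomas.all.gff", "est_pass": 1, "protein_pass": 1, "rm_pass": 1},
)
print(updated)
```

Raising an error for an unknown key is a deliberate choice here: a misspelled option name is easier to catch up front than after a long run that silently used the defaults.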


#-----EST Evidence (for best results provide a file for at least one)

This might be more appropriately titled "Transcript Evidence" because this is where you pass in not only ESTs but also assembled mRNA-seq and assembled full length cDNAs. These are assumed to be correctly assembled and aligned around splice sites (MAKER uses exonerate to align around splice sites for ESTs in FASTA files). MAKER can use them to infer gene models directly (est2genome option), can use them as support for maintaining predictions, and can use them to modify structure and add UTR to predictions. If you let MAKER try to find alternative splice forms, they will be used to identify support for splice variants. How these cluster with other evidence will help MAKER infer gene boundaries in some cases. MAKER will also use splice sites inferred from the ESTs to inform gene predictors during the prediction step.


est=/home/user/projects/thomas_the_train/trinity/funnel.fasta,../trinity/coaltender.fasta #non-redundant set of assembled ESTs in fasta format (classic EST analysis)

This is the first option that we have come to that can accept multiple files. In the example above I have passed in two files. MAKER has several options in addition to this one that can accept multiple files as a comma-separated list. These files contain assembled mRNA-Seq transcripts, ESTs, or full length cDNAs. In the example above I have given an absolute path to one tissue-specific assembled mRNA-Seq file and a relative path to another.
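To make the line format concrete, here is a minimal sketch of how a single control file line breaks down into an option name, a comma-separated list of values, and a trailing comment. This is my own illustration, not MAKER's parser; the function name is made up and it assumes paths never contain # or commas.

```python
def parse_ctl_line(line):
    """Split one control-file line into (key, [values], comment).

    Returns None for blank lines and for section headings or comments
    that start with '#'.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    setting, _, comment = line.partition("#")
    key, _, value = setting.partition("=")
    values = [v.strip() for v in value.split(",") if v.strip()]
    return key.strip(), values, comment.strip()


est_line = (
    "est=/home/user/projects/thomas_the_train/trinity/funnel.fasta,"
    "../trinity/coaltender.fasta "
    "#non-redundant set of assembled ESTs in fasta format (classic EST analysis)"
)
key, files, comment = parse_ctl_line(est_line)
print(key, "->", len(files), "file(s)")
for path in files:
    print("  ", path)

# An option left blank simply yields an empty list of values.
print(parse_ctl_line("altest= #EST/cDNA sequence file in fasta format from an alternate organism"))
```

A quick check like this is handy for confirming that every file in a multi-file option exists before you launch a long run.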

altest= #EST/cDNA sequence file in fasta format from an alternate organism

This is useful if you don’t have any RNA evidence from your model organism. However, these alignments are done with tblastx, so the genome and the RNA evidence are translated in all six reading frames and all possible alignments are explored, which takes a really long time. If you have good mRNA-seq data or ESTs you won’t gain much from using this kind of evidence. This option was used much more before mRNA-Seq became so cheap and routine for genome projects. Since these data are from another species and they are aligned in protein space, they are not used to infer UTR.


est_gff=../cufflinks/boiler.gff,../cufflinks/brake.gff #EST evidence from an external gff3 file

These are pre-aligned transcripts from the organism being annotated in gff3 format. The most common sources of this kind of data are alignment-based transcript assemblers such as Cufflinks, or output from a previous MAKER annotation.


altest_gff= #Alternate organism EST evidence from a separate gff3 file

These are pre-aligned transcripts from an alternate organism in gff3 format. The most common source of this kind of data is output from a previous MAKER annotation. After altests have been aligned once, passing them back to MAKER in gff3 format in subsequent MAKER runs can decrease runtime substantially.


#-----Protein Homology Evidence (for best results provide a file for at least one)

MAKER uses exonerate to align around splice sites for proteins in FASTA files. MAKER can use them to infer gene models directly (protein2genome option), but only if they align correctly around splice sites. MAKER can use them as support for maintaining predictions (the CDS will be checked where possible to ensure the gene prediction and protein alignment are in the same reading frame). How these cluster with other evidence will help MAKER infer gene boundaries in some cases. MAKER will also use ORFs inferred from the proteins to inform gene predictors during the prediction step.

Depending on the organism I will use a recent download of UniProt/Swiss-Prot or "NP" sequences from RefSeq. I avoid UniProt/TrEMBL and GenBank since they contain unreviewed sequences that may not exist or are pseudogenes or repetitive elements. You can also use a selection of high-confidence proteins from a closely related species. These could be the products of transcripts with an AED less than 0.5 in the other species if annotated by MAKER. In most annotations there are a number of weakly supported genes that are often dead transposons or pseudogenes, so I do not like to take all the annotated proteins from a related species. As a worst-case scenario, imagine you have a dead transposon annotated as a gene in a closely related species. When you made your repeat masking library you found that one of the entries matched this sequence. Assuming that it is a real gene, you delete it from your masking library. Now when you annotate the genome this one gene becomes a whole gene family in the annotation set, but it is really a combination of bad evidence and bad masking.
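Selecting the transcripts with an AED below 0.5 from a related species' MAKER annotation can be scripted. The following is a minimal sketch that assumes the gff3 file stores the score as an _AED attribute on mRNA features, which is how MAKER-produced gff3 files record it; the function name and the two example records are made up. Once you have the list of IDs you can pull the matching sequences out of the protein fasta file with your tool of choice.

```python
def mrna_ids_below_aed(gff_lines, max_aed=0.5):
    """Return the IDs of mRNA features whose _AED attribute is below max_aed."""
    keep = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        # Column 9 is a ';'-separated list of key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        aed = attrs.get("_AED")
        if aed is not None and float(aed) < max_aed:
            keep.append(attrs.get("ID"))
    return keep


# Two invented records: one well supported, one weakly supported.
example_gff = [
    "##gff-version 3",
    "\t".join(["scaffold_1", "maker", "mRNA", "1000", "5000", ".", "+", ".",
               "ID=gene1-mRNA-1;Parent=gene1;_AED=0.12;_eAED=0.12"]),
    "\t".join(["scaffold_1", "maker", "mRNA", "9000", "9900", ".", "-", ".",
               "ID=gene2-mRNA-1;Parent=gene2;_AED=0.80;_eAED=0.85"]),
]
print(mrna_ids_below_aed(example_gff))
```

The 0.5 cutoff is the one suggested above; tightening max_aed trades a smaller protein set for fewer dead transposons and pseudogenes sneaking into the evidence.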


protein=../protein/swiss_prot.fasta  #protein sequence file in fasta format

This is a collection of protein sequences in fasta format.


protein_gff=  #protein homology evidence from an external gff3 file

These are pre-aligned proteins in gff3 format. The most common source of these data is a previous MAKER run.


#-----Repeat Masking (leave values blank to skip repeat masking)

Repeats will be masked to stop ESTs and proteins from aligning to repetitive regions and to keep gene prediction algorithms from being allowed to call exons in those regions. Many repeats encode real proteins (e.g. retrotransposases and others). Because of this, gene predictors and aligners are often confused by them (they can falsely be added as exons onto gene calls, for example).


model_org=all #select a model organism for RepBase masking in RepeatMasker

Common things to put in here are all, mammals, grasses, primates, and monocotyledons. You can also put in a genus and species as long as they are bound by double quotes, e.g., "drosophila melanogaster". You can also put simple here to have RepeatMasker mask only low-complexity repeats (simple is MAKER specific). The RepeatMasker documentation has more detailed information. If you have gone to the trouble of making a custom repeat library you may not need to use RepBase.


rmlib=../repeat_lib/thomas_TEs.fasta #provide an organism specific repeat library in fasta format for RepeatMasker

This is where you put in your custom repeat library. There are several ways to generate this file; examples of two of them are found in Repeat Library Construction--Basic and Repeat Library Construction--Advanced. Several of the weird and wonderful organisms that have been annotated by MAKER were in essence un-annotatable without a species-specific repeat library.

repeat_protein=/Users/mcampbell/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner

This is a file of known transposable element proteins that comes packaged with MAKER. They are aligned in protein space to help mask known transposable elements that have diverged over time.


rm_gff= #repeat elements from an external GFF3 file

These are pre-aligned repeats in gff3 format. The most common source of these data is a previous MAKER run.


prok_rm=0 #forces MAKER to run repeat masking on prokaryotes (don't change this), 1 = yes, 0 = no

This was added in response to a collaborator's request. As a general rule, masking a prokaryotic genome is unnecessary and can lead to truncated gene models. That being said, there are some strange critters out there and there are situations where this may be useful.


softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

Soft-masking in blast prevents alignments from seeding in regions of low complexity but allows alignments to extend through these regions.
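
To make the distinction concrete, here is a minimal sketch of how the two kinds of masking represent a repeat in the sequence itself. It only illustrates the representation (lowercase versus N) and does not reproduce BLAST's seeding behavior; the function names and the interval convention are assumptions made for this example.

```python
def soft_mask(seq, intervals):
    """Lowercase the bases inside each 0-based, half-open (start, end) interval.

    The original bases are preserved, so an aligner can still extend
    an alignment through the masked region.
    """
    bases = list(seq.upper())
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = bases[i].lower()
    return "".join(bases)


def hard_mask(seq, intervals):
    """Replace the bases inside each interval with N, discarding the sequence."""
    bases = list(seq.upper())
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = "N"
    return "".join(bases)


if __name__ == "__main__":
    repeat = [(2, 6)]
    print(soft_mask("ACGTACGTAC", repeat))  # ACgtacGTAC
    print(hard_mask("ACGTACGTAC", repeat))  # ACNNNNGTAC
```

With hard-masking the repeat sequence is gone, so nothing can align across it; with soft-masking the information is kept and only the treatment of the region changes.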


#-----Gene Prediction

If you want to get gene models out of MAKER you have to put something in this section. MAKER needs some kind of gene prediction to work off of. Not everything in this section is used in the same way by MAKER. Gene predictions from tools such as SNAP, augustus, and genemark are lower confidence than gene models. They do not affect evidence clustering. MAKER can keep them as they are or modify them by trimming and adding exons based on EST evidence. These will only be maintained in the final annotation set if there is some form of evidence supporting their structure. When multiple gene finders are used, MAKER will compete them against each other and will choose to work off of the one that matches the evidence the best.

Gene models on the other hand are assumed to be high confidence. These will affect clustering of evidence. Because they are high confidence, the clustering will slightly bias MAKER towards keeping rather than replacing previous models for borderline cases. They can also be used for maintaining names. MAKER is only allowed to keep or replace models and cannot modify them. If no evidence supports them, MAKER can still keep them because they are assumed to be high confidence (but MAKER will still tag them with an AED score of 1 in those cases).

snaphmm=../trained_snap/thomas_1.hmm #SNAP HMM file

This is where you put the HMM file required to run SNAP. This file can be species specific or from a closely related species. This option can also accept multiple files, and SNAP will be run separately using each HMM. In the example above I have passed MAKER a species-specific HMM.


gmhmm=../train_genemark/es.mod #GeneMark HMM file

This is where you put the HMM file required for genemark. This option can also accept multiple files, and genemark will be run separately using each HMM.


augustus_species=steam_tram #Augustus gene prediction species model

This one is a challenge to train but it works very well. You just put the species name here. To get a list of options, look in the config/species/ directory of your augustus installation. If you have used the autoaug.pl program to train augustus yourself, you will see your species in the species directory as well. In this example I am passing in a closely related species.


fgenesh_par_file= #Fgenesh parameter file

If you have an fgenesh par file this is where you put it.


pred_gff= #ab-initio predictions from an external GFF3 file

Just because MAKER doesn't run a specific gene predictor internally doesn't mean that it can't use its output. You can use predictions from any gene finder in MAKER as long as you convert the gene finder output to gff3 format.


model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)

These are assumed to be high confidence gene models. These will affect clustering of evidence. Because they are high confidence, the clustering will slightly bias MAKER towards keeping rather than replacing previous models for borderline cases. They can also be used for maintaining names. MAKER is only allowed to keep or replace models and cannot modify them. If no evidence supports them, MAKER can still keep them because they are assumed to be high confidence (but MAKER will still tag them with an AED score of 1 in those cases).
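
For readers unfamiliar with the AED score mentioned above, the following minimal sketch shows the general idea at the nucleotide level: AED is 1 minus the average of sensitivity and specificity of the model against the evidence, so a model with no overlapping evidence scores 1 and a model that matches the evidence exactly scores 0. This is a simplified illustration, not MAKER's implementation; the function names are hypothetical and the way MAKER pools evidence may differ in detail.

```python
def covered_positions(intervals):
    """Return the set of bases covered by 1-based, inclusive (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def aed(model_exons, evidence_exons):
    """Annotation Edit Distance between a gene model and its evidence.

    sensitivity = overlapping bases / evidence bases
    specificity = overlapping bases / model bases
    AED = 1 - (sensitivity + specificity) / 2
    """
    model = covered_positions(model_exons)
    evidence = covered_positions(evidence_exons)
    overlap = len(model & evidence)
    if overlap == 0:
        # No supporting evidence at all: the worst possible score.
        return 1.0
    sensitivity = overlap / len(evidence)
    specificity = overlap / len(model)
    return 1.0 - (sensitivity + specificity) / 2.0


if __name__ == "__main__":
    print(aed([(1, 100)], [(1, 100)]))    # perfect agreement
    print(aed([(1, 100)], [(51, 150)]))   # partial overlap
    print(aed([(1, 100)], []))            # unsupported model
```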


est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no

This option allows you to make gene models directly from the transcript evidence. This option is useful when you don't have a gene predictor trained on your organism and there is not a training file available for a closely related organism. The gene models from this option are going to be fragmented and incomplete because of the nature of transcript data, especially mRNA-Seq. These gene models are most useful for first-round training of gene finders. Once you have a trained gene predictor, turn this option off.


protein2genome=0 #gene prediction from protein homology, 1 = yes, 0 = no

Similar to est2genome this option will make gene models out of protein data. Like est2genome this option is most useful for training gene predictors.


unmask=0 #Also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no

This option lets the gene finders run on the unmasked sequence as well as the masked sequence. All of the resulting predictions are competed against each other and the one that matches the evidence best is chosen for MAKER to work off of.


#-----Other Annotation Feature Types (features MAKER doesn't recognize)

This section lets you give MAKER data to add to the final annotation set that MAKER doesn't annotate itself.


other_gff= #features to pass-through to final output from an external GFF3 file

These are GFF3 lines you just want MAKER to add to your files, normally representing things MAKER doesn't predict (promoter/enhancer regions, CpG islands, restriction sites, non-coding RNAs, etc.). MAKER will not attempt to validate the features, but will just pass them through "as is" to the final GFF3 file.


#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non standard amino acids in BLAST databases
cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)
#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases  memory usage)
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)
pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=0 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=0 #force start and stop codon into every gene, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=0 #Add unsupported gene prediction to final annotation set, 1 = yes, 0 = no
split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes
tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP= #specify a directory other than the system default temporary directory for temporary files
#-----EVALUATOR Control Options
evaluate=0 #run EVALUATOR on all annotations (very experimental), 1 = yes, 0 = no
side_thre=5
eva_window_size=70
eva_split_hit=1
eva_hspmax=100
eva_gspmax=100
enable_fathom=0
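
All of the options above share one simple layout: a key, an equals sign, an optional value, and an optional comment introduced by #. If you want to inspect or compare control files in a script, a minimal parser can be sketched as follows. This is not part of MAKER, the function name is hypothetical, and it assumes that values never contain a # character.

```python
def parse_ctl(lines):
    """Parse MAKER-style control file lines into a dict of option -> value.

    Section headers (lines starting with #) and blank lines are skipped.
    Empty values are kept as empty strings.
    """
    options = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Drop the trailing comment (assumes '#' never appears in a value).
        setting = line.split("#", 1)[0]
        if "=" not in setting:
            continue
        key, value = setting.split("=", 1)
        options[key.strip()] = value.strip()
    return options


if __name__ == "__main__":
    sample = [
        "#-----MAKER Behavior Options",
        "max_dna_len=100000 #length for dividing up contigs into chunks",
        "TMP= #specify a directory other than the system default",
        "side_thre=5",
    ]
    for key, value in parse_ctl(sample).items():
        print(key, repr(value))
```

Values are returned as strings so that numeric options such as max_dna_len can be converted by the caller as needed.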

maker_exe.ctl

#-----Location of Executables Used by MAKER/EVALUATOR
makeblastdb=/Users/mcampbell/maker/bin/../exe/blast/bin/makeblastdb #location of NCBI+ makeblastdb executable
blastn=/Users/mcampbell/maker/bin/../exe/blast/bin/blastn #location of NCBI+ blastn executable
blastx=/Users/mcampbell/maker/bin/../exe/blast/bin/blastx #location of NCBI+ blastx executable
tblastx=/Users/mcampbell/maker/bin/../exe/blast/bin/tblastx #location of NCBI+ tblastx executable
formatdb= #location of NCBI formatdb executable
blastall= #location of NCBI blastall executable
xdformat= #location of WUBLAST xdformat executable
blasta= #location of WUBLAST blasta executable
RepeatMasker=/Users/mcampbell/maker/bin/../exe/RepeatMasker/RepeatMasker #location of RepeatMasker executable
exonerate=/Users/mcampbell/maker/bin/../exe/exonerate/bin/exonerate
#-----Ab-initio Gene Prediction Algorithms
snap=/Users/mcampbell/maker/bin/../exe/snap/snap #location of snap executable
gmhmme3= #location of eukaryotic genemark executable
gmhmmp= #location of prokaryotic genemark executable
augustus=/Users/mcampbell/maker/bin/../exe/augustus/bin/augustus #location of augustus executable
fgenesh= #location of fgenesh executable
#-----Other Algorithms
fathom=/Users/mcampbell/maker/bin/../exe/snap/fathom #location of fathom executable (experimental)
probuild= #location of probuild executable (required for genemark)

maker_bopts.ctl

#-----BLAST and Exonerate Statistics Thresholds
blast_type=ncbi+ #set to 'ncbi+', 'ncbi' or 'wublast'
pcov_blastn=0.8 #Blastn Percent Coverage Threshold EST-Genome Alignments
pid_blastn=0.85 #Blastn Percent Identity Threshold EST-Genome Alignments
eval_blastn=1e-10 #Blastn eval cutoff
bit_blastn=40 #Blastn bit cutoff
depth_blastn=0 #Blastn depth cutoff (0 to disable cutoff)
pcov_blastx=0.5 #Blastx Percent Coverage Threshold Protein-Genome Alignments
pid_blastx=0.4 #Blastx Percent Identity Threshold Protein-Genome Alignments
eval_blastx=1e-06 #Blastx eval cutoff
bit_blastx=30 #Blastx bit cutoff
depth_blastx=0 #Blastx depth cutoff (0 to disable cutoff)
pcov_tblastx=0.8 #tBlastx Percent Coverage Threshold alt-EST-Genome Alignments
pid_tblastx=0.85 #tBlastx Percent Identity Threshold alt-EST-Genome Alignments
eval_tblastx=1e-10 #tBlastx eval cutoff
bit_tblastx=40 #tBlastx bit cutoff
depth_tblastx=0 #tBlastx depth cutoff (0 to disable cutoff)
pcov_rm_blastx=0.5 #Blastx Percent Coverage Threshold For Transposable Element Masking
pid_rm_blastx=0.4 #Blastx Percent Identity Threshold For Transposable Element Masking
eval_rm_blastx=1e-06 #Blastx eval cutoff for transposable element masking
bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking
eva_pcov_blastn=0.8 #EVALUATOR Blastn Percent Coverage Threshold EST-Genome Alignments
eva_pid_blastn=0.85 #EVALUATOR Blastn Percent Identity Threshold EST-Genome Alignments
eva_eval_blastn=1e-10 #EVALUATOR Blastn eval cutoff
eva_bit_blastn=40 #EVALUATOR Blastn bit cutoff
ep_score_limit=20 #Exonerate protein percent of maximal score threshold
en_score_limit=20 #Exonerate nucleotide percent of maximal score threshold
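
The coverage, identity, e-value, and bit score cutoffs above act together as a filter on alignments. The sketch below shows one way to read them: a hit is kept only if it meets all four cutoffs for its program. How MAKER combines the cutoffs internally, and whether each comparison is inclusive, is an assumption here; the Hit class and the passes_thresholds function are hypothetical names used for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    percent_coverage: float  # fraction of the query covered, 0 to 1
    percent_identity: float  # fraction of identical positions, 0 to 1
    evalue: float
    bit_score: float


def passes_thresholds(hit, opts, program="blastn"):
    """Return True if the hit meets every cutoff for the given program.

    opts maps option names (e.g. 'pcov_blastn') to their values, as
    strings or numbers, in the style of maker_bopts.ctl.
    """
    return (
        hit.percent_coverage >= float(opts["pcov_" + program])
        and hit.percent_identity >= float(opts["pid_" + program])
        and hit.evalue <= float(opts["eval_" + program])
        and hit.bit_score >= float(opts["bit_" + program])
    )


if __name__ == "__main__":
    opts = {
        "pcov_blastn": "0.8",
        "pid_blastn": "0.85",
        "eval_blastn": "1e-10",
        "bit_blastn": "40",
    }
    strong = Hit(0.90, 0.95, 1e-30, 120.0)
    partial = Hit(0.60, 0.95, 1e-30, 120.0)
    print(passes_thresholds(strong, opts))
    print(passes_thresholds(partial, opts))
```

Lowering pcov_blastn or pid_blastn lets more divergent or partial alignments through as evidence, which can help with distant evidence sets at the cost of more noise.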