Repeat Library Construction--Basic

From MAKER Wiki

Content contributed by Dr. Ning Jiang

Building a custom repeat library for plant genomes – Basic protocol

1. Collecting repetitive sequences by RepeatModeler

The genomic sequence (called seqfile, in FASTA format) was processed by RepeatModeler.

First command:

DIR/BuildDatabase -name seqfiledb -engine ncbi seqfile
  • DIR = path to the directory where RepeatModeler is installed.
  • “-engine ncbi” specifies the NCBI BLAST program as the alignment tool.

Second command:

nohup DIR/RepeatModeler -database seqfiledb >& seqfile.out
  • After the commands finish, RepeatModeler generates a directory called “RM…”. Inside this directory is a file called “consensi.fa.classified” that contains all the repetitive sequences. The definition line of each sequence contains the sequence name and its identity (classification) in RepeatMasker format. If the sequence is unidentified, it is marked as “Unknown”.
  • In our study, sequences with identities were put in ModelerID.lib and those marked “Unknown” were put in Modelerunknown.lib.
  • Sequences in Modelerunknown.lib were searched against a transposase database (derived from the RepeatMasker package and Kennedy et al. (2011)). Sequences matching a transposase were considered transposons belonging to the relevant superfamily; they were incorporated into ModelerID.lib and excluded from Modelerunknown.lib.
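
The split of “consensi.fa.classified” into ModelerID.lib and Modelerunknown.lib could be done along the following lines. This is a minimal sketch, not the script used in the study. The function name split_classified is hypothetical, and the sketch assumes that unclassified records carry “#Unknown” in the definition line (for example “>rnd-1_family-12#Unknown”); check the headers in your own file and adjust the pattern if they differ.

```shell
#!/usr/bin/env bash
# Sketch: split a classified RepeatModeler library into identified and
# unknown sequences.
# Assumption: unclassified records have "#Unknown" in the header line,
# e.g. ">rnd-1_family-12#Unknown". Check your own headers first.
split_classified() {
    local classified="$1"    # e.g. consensi.fa.classified
    local id_lib="$2"        # e.g. ModelerID.lib
    local unknown_lib="$3"   # e.g. Modelerunknown.lib

    # Start from empty output files so reruns do not append duplicates.
    : > "$id_lib"
    : > "$unknown_lib"

    awk -v id="$id_lib" -v unk="$unknown_lib" '
        # A header line decides where this record (header + sequence) goes.
        /^>/ { out = ($0 ~ /#Unknown/) ? unk : id }
        # Skip any text that precedes the first header.
        out != "" { print >> out }
    ' "$classified"
}
```

With the function defined, a call such as `split_classified consensi.fa.classified ModelerID.lib Modelerunknown.lib` writes the two libraries into the current directory. Moving transposase-matching sequences from Modelerunknown.lib to ModelerID.lib is a separate step that is not covered here.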

2. Exclusion of gene fragments

  • All repeats collected by RepeatModeler were searched against a plant protein database from which transposon proteins had been excluded. Regions that matched plant proteins, together with 50 bp of flanking sequence, were removed. If the remainder of a sequence was shorter than 50 bp after this removal, the entire sequence was excluded. A package for conducting this task is available here (manual).
  • After exclusion of putative gene fragments, the sequences in ModelerID.lib were considered known TE sequences (AtBasicTE.lib).
  • AtBasicTE.lib was combined with Modelerunknown.lib (after exclusion of gene fragments) to form AtBasicAllRepeat.lib.
  • It is conceivable that AtBasicTE.lib does not contain all repeats (repeat numbers are underestimated), whereas AtBasicAllRepeat.lib may contain sequences from novel gene families (repeat numbers are overestimated).
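
To make the exclusion rule in this section concrete, the sketch below removes each protein-matching region plus 50 bp of flanking sequence and drops any sequence whose remainder is shorter than 50 bp. It is an illustration under stated assumptions, not the package referenced above, whose behavior may differ. The function name exclude_gene_fragments and the three-column hit table (sequence ID, match start, match end) are hypothetical, and the sketch assumes that the pieces on either side of a removed region are simply joined.

```shell
#!/usr/bin/env bash
# Sketch: remove protein-matching regions (plus flanks) from a repeat library.
# Usage: exclude_gene_fragments HITS FASTA [FLANK] [MINLEN] > filtered.lib
#   HITS  : hypothetical 3-column table (sequence ID, match start, match end),
#           1-based inclusive coordinates on the repeat sequence
#   FASTA : repeat library; the ID is the header text up to the first space
exclude_gene_fragments() {
    local hits="$1" fasta="$2" flank="${3:-50}" minlen="${4:-50}"

    awk -v flank="$flank" -v minlen="$minlen" '
        # Print the current record if enough sequence survives the removal.
        function emit_record(   i, j, kept, drop) {
            if (header == "") return
            kept = ""
            for (i = 1; i <= length(seq); i++) {
                drop = 0
                for (j = 1; j <= n[id] + 0; j++)
                    if (i >= s[id, j] && i <= e[id, j]) { drop = 1; break }
                if (!drop) kept = kept substr(seq, i, 1)
            }
            # Assumption: the pieces left and right of a match are joined.
            if (length(kept) >= minlen) print header "\n" kept
        }
        # First file: one matched interval per line, widened by the flank.
        # Start and end may be given in either order.
        FILENAME == ARGV[1] {
            lo = ($2 + 0 < $3 + 0) ? $2 + 0 : $3 + 0
            hi = ($2 + 0 < $3 + 0) ? $3 + 0 : $2 + 0
            n[$1]++
            s[$1, n[$1]] = lo - flank
            e[$1, n[$1]] = hi + flank
            next
        }
        # Second file: FASTA, possibly with wrapped sequence lines.
        /^>/ { emit_record(); header = $0; id = substr($1, 2); seq = ""; next }
        { seq = seq $0 }
        END { emit_record() }
    ' "$hits" "$fasta"
}
```

If the protein search is run with BLAST tabular output (`-outfmt 6`), columns 1, 7 and 8 hold the query ID, query start and query end, so a hit table of this shape can be produced with `cut -f1,7,8`. Once both libraries have been filtered, AtBasicAllRepeat.lib is a plain concatenation, for example `cat AtBasicTE.lib` together with the filtered unknown library redirected into AtBasicAllRepeat.lib.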