Repeat Library Construction-Basic

From MAKER Wiki

This page describes how to generate a species-specific repeat library suitable for repeat masking prior to protein-coding gene annotation with MAKER. The library is built with a repeat collection tool (RepeatModeler), which collects only sequences that reach a certain copy number. The repetitive sequences are then classified based on their similarity to known transposable elements. As a result, low-copy-number transposable elements are not included in the collection, and a substantial number of sequences cannot be classified. For a more comprehensive collection of repetitive elements as well as better classification, see Repeat Library Construction--Advanced.

Content contributed by Dr. Ning Jiang

Building a custom repeat library for plant genomes – Basic protocol

1. Collecting repetitive sequences by RepeatModeler

The genomic sequence (called seqfile, in FASTA format) was processed by RepeatModeler.

First command:

DIR/BuildDatabase -name seqfiledb -engine ncbi seqfile
  • DIR = path to the directory where RepeatModeler is installed.
  • “-engine ncbi” specifies that the NCBI BLAST program is used as the alignment tool.

Second command:

nohup DIR/RepeatModeler -database seqfiledb >& seqfile.out
  • After the commands have run, RepeatModeler generates a directory called “RM…”. Inside the directory there is a file called “consensi.fa.classified” that contains all the repetitive sequences. The definition line of each sequence contains the sequence name and its identity (classification) in RepeatMasker format. If the sequence is unidentified, it is marked as “Unknown”.
  • In our study, sequences with identities were put in ModelerID.lib and those marked “Unknown” were put in Modelerunknown.lib.
  • Sequences in Modelerunknown.lib were searched against a transposase database (derived from the RepeatMasker package and Kennedy et al. (2011)). Sequences matching a transposase were considered transposons belonging to the relevant superfamily; they were incorporated into ModelerID.lib and removed from Modelerunknown.lib.
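
The sorting of consensi.fa.classified into the two libraries can be sketched in Python as below. This is a minimal illustration, not part of the original protocol. It assumes definition lines of the form “name#Class/Family” (the RepeatMasker convention), treats a header without a “#” label as “Unknown”, and takes the transposase search result as a ready-made mapping from sequence name to superfamily. The function names, the example headers and the `hits` mapping are all hypothetical.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_id(header):
    """Split 'name#Class/Family extra text' into (name, label).

    Assumption: a header with no '#' label is treated as 'Unknown'.
    """
    tokens = header.split()
    name, _, label = (tokens[0] if tokens else "").partition("#")
    return name, label or "Unknown"


def split_library(records):
    """Separate classified records from those labelled 'Unknown'."""
    identified, unknown = [], []
    for header, seq in records:
        target = unknown if split_id(header)[1] == "Unknown" else identified
        target.append((header, seq))
    return identified, unknown


def promote_transposase_hits(identified, unknown, hits):
    """Move unknown records that matched a transposase into the identified set.

    `hits` maps a sequence name to a superfamily label, for example
    {"rnd-1_family-7": "DNA/hAT"}; how the search is run is left open here.
    """
    identified = list(identified)  # do not modify the caller's list
    still_unknown = []
    for header, seq in unknown:
        name = split_id(header)[0]
        if name in hits:
            identified.append((name + "#" + hits[name], seq))
        else:
            still_unknown.append((header, seq))
    return identified, still_unknown
```

Applied to the parsed contents of consensi.fa.classified, `split_library` would give the starting contents of ModelerID.lib and Modelerunknown.lib, and `promote_transposase_hits` would then move the transposase-matching sequences across.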

2. Exclusion of gene fragments

  • All repeats collected by RepeatModeler were searched against a plant protein database from which transposon proteins had been excluded. Regions matching plant proteins (considered gene fragments), together with their 50 bp flanking sequences, were excluded. If the remainder of a sequence after exclusion was shorter than 50 bp, the entire sequence was excluded. A package named ProtExcluder for conducting this task is available here (manual).
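
The exclusion rule above can be illustrated with a short Python sketch. It is not ProtExcluder's implementation. It assumes that protein matches are given as 0-based, half-open (start, end) coordinates, that the 50 bp flank is applied on each side of a match, and that the pieces left after removal are joined into one sequence whose total length is compared with the 50 bp threshold. All names are hypothetical.

```python
FLANK = 50       # bp removed next to each protein match (each side: an assumption)
MIN_LENGTH = 50  # a remainder shorter than this causes the sequence to be dropped


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def exclude_gene_fragments(seq, matches, flank=FLANK, min_length=MIN_LENGTH):
    """Remove matched regions plus flanks from `seq`.

    Returns the remaining sequence, or None if the whole sequence should be
    excluded. A sequence without any match is returned unchanged.
    """
    if not matches:
        return seq
    # Extend every match by the flank, clipped to the sequence ends.
    padded = [(max(0, s - flank), min(len(seq), e + flank)) for s, e in matches]
    kept, cursor = [], 0
    for start, end in merge_intervals(padded):
        kept.append(seq[cursor:start])  # keep what lies before the masked region
        cursor = end
    kept.append(seq[cursor:])           # keep the tail after the last region
    remainder = "".join(kept)           # assumption: pieces are joined together
    return remainder if len(remainder) >= min_length else None
```

For a 230 bp repeat with a single protein match at positions 100 to 130, this sketch removes positions 50 to 180 and keeps the 50 bp on either end; a repeat in which fewer than 50 bp survive is dropped entirely.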

After exclusion of putative gene fragments, the sequences in ModelerID.lib were considered known TE sequences (AtBasicTE.lib).

  • AtBasicTE.lib was combined with Modelerunknown.lib (after exclusion of gene fragments) to form AtBasicAllRepeat.lib.
  • The sequences in AtBasicTE.lib are likely to be relatively reliable transposons, but this library does not contain all repeats (repeat numbers are underestimated). If this library is used, certain repeats are left out and may be annotated as genes or portions of genes. On the other hand, AtBasicAllRepeat.lib is more comprehensive but may contain sequences from novel gene families that are not present in the existing plant protein database, so the repeat number may be overestimated with this library and novel gene families might be masked.
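
Combining AtBasicTE.lib with the filtered Modelerunknown.lib amounts to concatenating two FASTA libraries. The Python sketch below shows one way this could be done, assuming each library is held as a list of (header, sequence) tuples. The duplicate-name check and the 60-column line width are choices made for this illustration rather than requirements of the protocol, and the function names are hypothetical.

```python
def combine_libraries(*libraries):
    """Concatenate libraries of (header, sequence) tuples in the given order.

    Raises ValueError if the same sequence name (first word of the header)
    appears twice; this sanity check is an addition of this sketch.
    """
    seen = set()
    combined = []
    for library in libraries:
        for header, seq in library:
            tokens = header.split()
            name = tokens[0] if tokens else ""
            if name in seen:
                raise ValueError("duplicate sequence name: " + name)
            seen.add(name)
            combined.append((header, seq))
    return combined


def format_fasta(records, width=60):
    """Render records as FASTA text with sequence lines wrapped at `width`."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in lines)
```

Writing `format_fasta(combine_libraries(te_records, unknown_records))` to a file would then give the equivalent of AtBasicAllRepeat.lib, where `te_records` and `unknown_records` stand for the contents of the two input libraries.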