Repeat Library Construction--Basic

Content contributed by Dr. Ning Jiang

Building custom repeat library for plant genomes – Basic protocol

1. Collecting repetitive sequences by RepeatModeler

The genomic sequence (called seqfile) was processed with RepeatModeler in two steps.

First command:

DIR/BuildDatabase -name seqfiledb -engine ncbi seqfile

Second command:

nohup DIR/RepeatModeler -database seqfiledb >& seqfile.out

Among the sequences generated by RepeatModeler, some were associated with identities and others were not. Those with identities were placed in ModelerID.lib and the rest in Modelerunknown.lib.
Sequences in Modelerunknown.lib were searched against a transposase database (derived from the RepeatMasker package and Kennedy et al. (2011)). Sequences matching a transposase were considered transposons belonging to the relevant superfamily; they were incorporated into ModelerID.lib and removed from Modelerunknown.lib.
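The split into ModelerID.lib and Modelerunknown.lib, followed by the transposase-based reassignment, might be sketched roughly as below. This is an illustration, not the script used for the protocol. It assumes that RepeatModeler headers have the form name#Class/Superfamily and that unclassified consensi carry the class "Unknown". The function names and the transposase_hits mapping (sequence name to assigned superfamily) are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_by_identity(records, transposase_hits=None):
    """Return (identified, unknown) lists of (header, sequence) records.

    ASSUMPTION: headers look like 'name#Class/Superfamily ...' and
    unclassified consensi have the class 'Unknown'.
    transposase_hits is a hypothetical dict mapping a sequence name to
    the superfamily label assigned after the transposase search.
    """
    transposase_hits = transposase_hits or {}
    identified, unknown = [], []
    for header, seq in records:
        name, _, cls = header.split()[0].partition("#")
        if cls and cls != "Unknown":
            # Already classified by RepeatModeler.
            identified.append((header, seq))
        elif name in transposase_hits:
            # Unknown sequence rescued by a transposase match: relabel it.
            identified.append((f"{name}#{transposase_hits[name]}", seq))
        else:
            unknown.append((header, seq))
    return identified, unknown
```

In this sketch the first returned list corresponds to ModelerID.lib and the second to Modelerunknown.lib.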

All repeats collected by RepeatModeler were searched against a plant protein database from which transposon proteins had been excluded. Regions matching plant proteins, together with 50 bp of flanking sequence, were excluded. If the remainder of a sequence after this exclusion was shorter than 50 bp, the entire sequence was excluded. A package for conducting this task is available here (manual).
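The exclusion rule above can be illustrated with a small sketch. It is not the package mentioned in the text, only a minimal reading of the rule under stated assumptions: hits are given as 0-based half-open (start, end) intervals, the 50 bp flank is applied on both sides of each hit, and the pieces left after removal are joined before the 50 bp length check. The function name and parameters are hypothetical.

```python
def exclude_gene_fragments(seq, hits, flank=50, min_len=50):
    """Remove protein-matching regions plus flanks from a repeat sequence.

    seq  : repeat consensus sequence (str)
    hits : list of (start, end) intervals, 0-based, end exclusive
    Returns the remaining sequence, or None if less than min_len remains.
    ASSUMPTIONS: flank applies to both sides of each hit, and the
    surviving pieces are concatenated before the length check.
    """
    if not hits:
        return seq  # nothing matched a plant protein; keep as is

    # Extend every hit by the flank and clip to the sequence bounds.
    extended = sorted(
        (max(0, start - flank), min(len(seq), end + flank))
        for start, end in hits
    )

    kept, pos = [], 0
    for start, end in extended:
        if start > pos:
            kept.append(seq[pos:start])  # region before this hit survives
        pos = max(pos, end)              # also handles overlapping hits
    kept.append(seq[pos:])               # tail after the last hit

    remainder = "".join(kept)
    return remainder if len(remainder) >= min_len else None
```

For example, a 300 bp sequence with one hit at positions 100 to 150 loses positions 50 to 200 and keeps 150 bp, whereas a 200 bp sequence with a hit at 60 to 140 keeps only 20 bp and is dropped entirely.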
After exclusion of putative gene fragments, the sequences in ModelerID.lib were considered known TE sequences (AtBasicTE.lib).
AtBasicTE.lib was combined with Modelerunknown.lib (after exclusion of gene fragments) to form AtBasicAllRepeat.lib.
AtBasicTE.lib may not contain all repeats, so repeat numbers based on it may be underestimated; AtBasicAllRepeat.lib may contain sequences from novel gene families, so repeat numbers based on it may be overestimated.
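Combining the two filtered libraries into AtBasicAllRepeat.lib amounts to concatenating their FASTA records. A minimal sketch follows, assuming records are held as (header, sequence) tuples; the function names and the duplicate-name check are illustrative additions, not part of the original protocol.

```python
def combine_libraries(te_records, unknown_records):
    """Concatenate known-TE and unknown repeat records into one list.

    Each record is a (header, sequence) tuple. Raises ValueError if the
    same sequence name appears twice (an illustrative safety check).
    """
    seen, combined = set(), []
    for header, seq in list(te_records) + list(unknown_records):
        name = header.split()[0]
        if name in seen:
            raise ValueError(f"duplicate sequence name: {name}")
        seen.add(name)
        combined.append((header, seq))
    return combined


def format_fasta(records, width=60):
    """Render (header, sequence) records as FASTA text, wrapped at width."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

Writing format_fasta(combine_libraries(te, unknown)) to a file would then give the combined library, where te and unknown are the gene-fragment-filtered record lists.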