Difference between revisions of "Repeat Library Construction--Basic"

Revision as of 19:34, 28 September 2013

Content contributed by Dr. Ning Jiang

Building custom repeat library for plant genomes – Basic protocol

1. Collecting repetitive sequences by RepeatModeler

The genomic sequence (called seqfile, in FASTA format) was processed by RepeatModeler.

First command:

DIR/BuildDatabase -name seqfiledb -engine ncbi seqfile

DIR = path to the directory where RepeatModeler is installed.
“-engine ncbi” specifies that the NCBI BLAST program is used as the alignment tool.

Second command:

nohup DIR/RepeatModeler -database seqfiledb >& seqfile.out

After the commands have run, RepeatModeler generates a directory called “RM…”. Inside the directory there is a file called “consensi.fa.classified” that contains all the repetitive sequences. The definition line of each sequence contains the sequence name and its identity in RepeatMasker format. If the sequence is unidentified, it is marked as “Unknown”.
In our study, sequences with identities were put in ModelerID.lib and those marked “Unknown” were put in Modelerunknown.lib.
Sequences in Modelerunknown.lib were searched against a transposase database (derived from the RepeatMasker package and Kennedy et al. (2011)). Sequences matching a transposase were considered transposons belonging to the relevant superfamily; they were incorporated into ModelerID.lib and removed from Modelerunknown.lib.
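
The split into identified and “Unknown” sequences described above can be sketched in Python. This is a minimal illustration, not the script used in the study. It assumes that each definition line has the form ">name#Class/Family optional description" and that unidentified sequences carry the class "Unknown" after the "#". The function names and the example records are hypothetical.

```python
def read_fasta(text):
    """Yield (header, sequence_lines) for each record in FASTA-formatted text."""
    header, lines = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header, lines
            header, lines = line, []
        elif header is not None and line.strip():
            lines.append(line.strip())
    if header is not None:
        yield header, lines


def is_unknown(header):
    """Return True if the definition line is classified as "Unknown".

    Assumed layout (not confirmed by the protocol text):
    ">name#Class/Family optional description".
    """
    fields = header[1:].split()
    token = fields[0] if fields else ""
    classification = token.partition("#")[2]
    return classification.split("/")[0] == "Unknown"


def split_library(text):
    """Return (identified_fasta, unknown_fasta) as two FASTA-formatted strings."""
    identified, unknown = [], []
    for header, lines in read_fasta(text):
        record = "\n".join([header] + lines) + "\n"
        (unknown if is_unknown(header) else identified).append(record)
    return "".join(identified), "".join(unknown)


# Hypothetical records, for illustration only.
example = """>rnd-1_family-1#LTR/Gypsy ( hypothetical description )
ACGTACGT
>rnd-1_family-2#Unknown
TTGACA
"""
known_text, unknown_text = split_library(example)
```

In practice the input would be the contents of consensi.fa.classified, and the two returned strings would be written to ModelerID.lib and Modelerunknown.lib respectively.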

2. Exclusion of gene fragments

All repeats collected by RepeatModeler were searched against a plant protein database from which transposon proteins had been excluded. Regions matching plant proteins, together with 50 bp of flanking sequence, were removed. If the remainder of a sequence was shorter than 50 bp after removal, the entire sequence was excluded. A package for conducting this task is available here (manual).
After exclusion of putative gene fragments, the sequences in ModelerID.lib were considered known TE sequences (AtBasicTE.lib).
AtBasicTE.lib was combined with Modelerunknown.lib (after exclusion of gene fragments) to form AtBasicAllRepeat.lib.
AtBasicTE.lib may not contain all repeats (repeat numbers are underestimated), whereas AtBasicAllRepeat.lib may contain sequences from novel gene families (repeat numbers are overestimated).
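
The exclusion rule in step 2 (remove protein-matching regions plus 50 bp flanks, then drop the sequence if less than 50 bp remains) can be sketched as follows. This is an illustration under stated assumptions, not the package mentioned above. It assumes hits are given as 0-based, half-open (start, end) coordinates on the repeat sequence (BLAST reports 1-based inclusive coordinates, so a conversion would be needed), that the flank is applied on both sides of each hit, and that the pieces left after removal are joined before their length is checked. All function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def remove_gene_fragments(sequence, hits, flank=50, min_length=50):
    """Remove protein-matching regions plus flanks from a repeat sequence.

    Returns the remaining sequence, or None if the sequence should be
    excluded because fewer than min_length bases remain after removal.
    Sequences without hits are returned unchanged.
    """
    if not hits:
        return sequence
    n = len(sequence)
    # Extend each hit by the flank on both sides (assumption), clipped to the sequence.
    padded = [(max(0, start - flank), min(n, end + flank)) for start, end in hits]
    kept, position = [], 0
    for start, end in merge_intervals(padded):
        kept.append(sequence[position:start])
        position = end
    kept.append(sequence[position:])
    # Assumption: the surviving pieces are joined before the length check.
    remainder = "".join(kept)
    return remainder if len(remainder) >= min_length else None


# Hypothetical 700 bp repeat with one protein match at positions 300-400.
repeat = "A" * 300 + "C" * 100 + "G" * 300
trimmed = remove_gene_fragments(repeat, [(300, 400)])
```

Whether the surviving pieces should be joined or kept as separate entries is not stated in the protocol; joining is used here only to keep the sketch short.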

@@ Line 3: / Line 3: @@
 Building custom repeat library for plant genomes – Basic protocol
-== 1.	Collecting repetitive sequences by [http://www.repeatmasker.org/RepeatModeler.html RepeatModeler] ==
+== 1.   Collecting repetitive sequences by [http://www.repeatmasker.org/RepeatModeler.html RepeatModeler] ==
-The genomic sequence  (called seqfile) was processed by RepeatModeler
+The genomic sequence  (called seqfile,in fasta format) was processed by RepeatModeler
 First command:
-  DIR/BuildDatabase -name umseqfiledb -engine ncbi seqfile
+  DIR/BuildDatabase -name seqfiledb -engine ncbi seqfile
 *DIR = path where RepeatModeler is.
-*“-engine ncbi” refers to that NCBI blast program was used as alignment tool.
+*“-engine ncbi” refers to the NCBI blast program that was used as the alignment tool.
 Second command:
   nohup DIR/RepeatModeler -database seqfiledb >& seqfile.out
-*Among the sequences generated by RepeatModeler, some were associated with identities and others were not. These with identities were put in ModelerID.lib and the others were in Modelerunknown.lib.
+*After implementation of the commands, the RepeatModeler program generates a directory called “RM…”. Inside the directory there is a document called “consensi.fa.classified” that contains all the repetitive sequences. The definition line of each sequence contains the sequence name and the identity in RepeatMasker format. If the sequence is unidentified, it is marked as “Unknown”.
+*In our study, these with identities were put in ModelerID.lib and these with “Unkown” were in Modelerunknown.lib.
 *Sequences in Modelerunknown.lib were searched against a transposase database (derived from [http://www.repeatmasker.org/ RepeatMasker] package and [http://www.ncbi.nlm.nih.gov/pubmed/21535899 Kennedy et al (2011)]) and sequences matching transposase were considered as transposons belonging to the relevant superfamily and were incorporated into ModelerID.lib and excluded from Modelerunknown.lib.
