Repeat Library Construction--Advanced

From MAKER Wiki
Jump to: navigation, search

This page describes an advanced method of repetitive element identification and classification. A basic understanding of repetitive elements is assumed. With this method, MITEs (miniature inverted transposable elements) and LTR (Long terminal repeat) elements, are first searched with structural approaches (MITE-hunter and LTRharvest). Numerically, MITEs represent the most abundant transposable elements while LTR elements occupy most genomic mass in plants. The structural approaches are followed by a de novo approach where most repetitive sequences are collected. Basic instructions for generating a species specific repeat library suitable for repeat masking prior to protein coding gene annotation with MAKER can be found here Repeat Library Construction--Basic.

Content contributed by Dr. Ning Jiang

A combination of structural-based and homology-based approach is used to maximize the opportunity for repeat collection.

Contents

1. MITEs (miniature inverted repeat transposable elements)

  • MITEs are collected using MITE-Hunter with all default parameters.
  • Among the output files, “Step8_*.fa” and “Step8_singlet.fa” are combined as the putative MITE sequences. This will be named as MITE.lib.


2. LTR (long terminal repeat) retrotransposons

  • In the plant genomes characterized so far, LTR retrotransposons represent the largest genomic mass among all repeats. As a result, it is very important to collect this type of element with high confidence.
  • Here, the elements are collected using LTRharvest and filtered by LTRdigest and other custom programs.


2.1. Collection of relatively recent LTR retrotranposons

The rationale for initial collection of recent LTR elements is that these elements contain less nested insertions and mutations as well as recombination, thus the chance for false positives is relatively low.


2.1.1. Collection of candidate elements with LTRs that are 99% or more in similarity using LTRharvest

Commands:

DIR1/gt suffixerator -db seqfile -indexname seqfileindex -tis -suf -lcp -des -ssp –dna
DIR1/gt ltrharvest -index seqfileindex -out seqfile.out99 -outinner seqfile.outinner99 -gff3 seqfile.gff99 -minlenltr 100 -maxlenltr 6000 -mindistltr 1500 -maxdistltr 25000 -mintsd 5 -maxtsd 5 -motif tgca -similar 99 -vic 10  > seqfile.result99

DIR1= path where gt is located

seqfile = file containing genomic sequence in fasta format (please note that seqfile should be in pwd)

These two commands will enable the collection of sequences with following features:

  • Two terminal repeats which are >= 99% similar, ranging from 100 bp tp 6000 bp;
  • The two repeats end with “TG…CA”;
  • The size of entire element ranges from 1.5 kb to 25 kb;
  • The element must be flanked by a 5bp TSD (target site duplication).
  • The TSD is within 10 bp from the end of the element.


2.1.2. Using LTRdigest to find elements with PPT (poly purine tract) or PBS (primer binding site)

DIR1/gt gff3 -sort seqfile.gff99 > seqfile.gff99.sort
DIR1/gt ltrdigest -trnas ~/bin/LTR/eukaryotic-tRNAs.fa seqfile.gff99.sort seqfileindex > seqfile.gff99.dgt
  • The “eukaryotic-tRNAs.fa” can be downloaded from Genomic tRNA database.
  • The output “seqfile.gff99.dgt” provides the location of PPT or PBS in each element, if any.
  • The output is further processed so that only elements with PPT or PBS are selected. In addition, at least 50% of the PPT or PBS sequence are located in the internal regions and the distance between LTR and PPT or PBS should be no more than 20 bp.


2.1.3. Further filtering of the candidate elements

Element sequences containing significant gaps (more than 50 Ns) are excluded. Even after the above procedures, considerable amounts of false positive elements are still detected. The three most important sources of false positives are:

  • Tandem local repeats such as centromeric repeats.
  • Local gene clusters derived from recent gene duplications
  • Two other transposable elements located in adjacent regions.

A common feature of these sequences is that the alignment between the two “LTRs” often extends beyond the boundary of “LTR”. To filter out these sequences, 50 bp flanking sequences on both sides of the LTRs of each candidate elements were retrieved and aligned using MUSCLE, with default parameters.

DIRMUSCLE/muscle3.8.31_i86linux64 -in flankingseqfile -out flankingseqfile.afa

Where DIRMUSCLE is the folder (path) where the muscle program is. Please note “muscle3.8.31_i86linux64” is the current name of the program, it could be changed in the future by the authors. “flankingseqfile” represents the name of sequence file containing the upstream sequences or downstream sequences from each pair of LTRs.

Specifically, the upstream sequence of 5’ LTR was aligned with that of 3’ LTR of each candidate. So were the downstream sequences. The resulting output (flankingseqfile.afa) of MUSCLE contains the two sequences in fasta format where gaps are indicated by “-”. Comparison of each corresponding position of the two sequences would reveal, among all positions of the two sequences, the number of positions associated with identical nucleotides, different nucleotide, and indels. Based on the above comparison, sequence identity of the two sequences was calculated, which equals number of identical positions divided by alignable positions. Alignable positions were derived from total positions minus gaps or indels. If the flanking sequences on either side is alignable (25 identical bps, or half of the total nucleotides, with at least 60% identity excluding gaps), the element would be excluded.

Here is an sample output from flankingseqfile.afa:

For simplicity, I made the alignment short but the principle remains the same.

>sequpstream5LTR 
AGCACGAGAAAACAA----
>sequpstream3LTR 
----AGTGAAAACAAAAGG

With this example, there is a total 15 nt in each sequence. The total number of alignment position is 19 (11 alignable positions with 8 nt gaps). Among the 11 alignable positions, two nucleotides are different and 9 are identical. So the identity is 9/11 = 82% > 60% Since more than half of the nucleotides are identical and the identity is > 60%, the two sequences are considered alignable and this element is considered a false positive.


The output of MUSCLE is further processed so that if the flanking sequence on either side can be aligned (25 identical bps with at least 60% identity excluding gaps), the element will be excluded.


2.1.4. Identify elements with nested insertions

  • Retrotranposons are frequently nested with each other or inserted by other elements. If left unidentified, it will cause misclassification and other complications. To detect those elements, LTR sequences from candidate elements retained after steps in 2.1.3 are used to mask the putative internal regions. If LTR sequences are detected in the internal regions, it is considered as elements nested with other insertions. ******
  • If internal regions of elements match sequences in MITE.lib (see 1.), they are also considered as elements with nested insertions. ******
  • The internal regions of elements are also used to search against a transposase database of which LTR retrotransposon-related proteins are excluded. If the internal sequence has significant matches with any non-LTR transposase, it is considered as an element containing nested insertions. ******

2.1.5 Building examplars

  • LTR elements are often very repetitive in the genome, and it will be computationally challenging to retain all individual elements in the repeat library. To reduce the redundancy, representative sequences (examplars) are chosen. ****** Elements sequences are subject to an all vs. all BLASTN search. The element with the most matches (cutoff at 80% identity in 90% of the element length) is considered as the first exemplar. Thereafter, this element and its matches are excluded from the group and a second round BLASTN search is conducted with the remainder of the elements, leading to the generation of the second exemplar. This process is repeated until all elements were excluded.
  • Since the internal sequences of LTRs are usually more conserved than LTR sequences, selection of examplars are first conducted with the internal sequences of the elements without nested insertions. Thereafter the LTR sequences correspond to the examplars of internal regions are also chosen. The sequences of LTR examplars are then used to mask all LTR sequences of all candidate elements; those sequences that are masked (cutoff 80% identity in 90% of the sequence length) are excluded. The remainder of LTR sequences is used to generate a new pool of exemplar sequences.
  • All the exemplar sequences generated are combined together and called LTR99.lib


2.2. Collection of relatively old LTR retrotransposons

Collection of relatively old LTRs is enabled by reducing the similarity between LTRs to 85% (default value of LTRharvest) and not associated with terminal sequence motif.

Commands:

DIR1/gt suffixerator -db seqfile -indexname seqfileindex -tis -suf -lcp -des -ssp –dna
DIR1/gt ltrharvest -index seqfileindex -out seqfile.out85 -outinner seqfile.outinner85 -gff3 seqfile.gff85 -minlenltr 100 -maxlenltr 6000 -mindistltr 1500 -maxdistltr 25000 -mintsd 5 -maxtsd 5 -vic 10  > seqfile.result85
  • Since the terminal sequence motif is not specified, only elements with terminal sequences with patterns that are previously reported are retained.
  • Procedures described in 2.1.2 are conducted to identify elements with PPT or PBS.
  • Since this set of elements contain elements collected in 2.1. (LTR99.lib) or elements highly similar to them, the candidate element sequences are masked by LTR99.lib and all elements that are significantly masked (cutoff at 80% identity in 90% of the element length) are excluded.
  • The remainder of elements is processed through the procedures described in 2.1.3 to 2.1.5 and the resulting exemplar sequences are called LTR85.lib.


2.3 Collection of terminal repeat retrotransposon in miniature (TRIM) - relatively recent elements.

This is analogous to the collection of larger LTR retrotransposons described above but the focus is on miniature elements. Commands:

DIR1/gt suffixerator -db seqfile -indexname seqfileindex -tis -suf -lcp -des -ssp –dna
DIR1/gt ltrharvest -index seqfileindex -out seqfile.outT99 -outinner seqfile.outinnerT99 -gff3 seqfile.gffT99 -minlenltr 70 -maxlenltr 500 -mindistltr 280 -maxdistltr 1500 -mintsd 5 -maxtsd 5 -motif tgca -similar 99 -vic 10  > seqfile.resultT99

Then the candidate elements are processed through procedures described in 2.1.2 to 2.1.5 and the resulting exemplar sequences are called TRIM99.lib.


2.4 Collection of terminal repeat retrotransposon in miniature (TRIM) - relatively old elements.

Commands:

DIR1/gt suffixerator -db seqfile -indexname seqfileindex -tis -suf -lcp -des -ssp –dna
DIR1/gt ltrharvest -index seqfileindex -out seqfile.outT85 -outinner seqfile.outinnerT85 -gff3 seqfile.gffT85 -minlenltr 70 -maxlenltr 500 -mindistltr 280 -maxdistltr 1500 -mintsd 5 -maxtsd 5 -vic 10  > seqfile.resultT85
  • As described in 2.2, only elements with terminal sequences that are reported to date are retained.
  • Procedures described in 2.1.2 are conducted to identify elements with PPT or PBS.
  • The candidate element sequences are masked by TRIM99.lib and all elements that are significantly masked (cutoff at 80% identity in 90% of the element length) are excluded.
  • The remainder of elements is processed through the procedures described in 2.1.3 to 2.1.5 and the resulting exemplar sequences are called TRIM85.lib.
  • Combine LTR99.lib, LTR85.lib, TRIM99.lib, and TRIM85.lib to obtain allLTR.lib
  • A pipeline is in development to combine the procedures together.

3. Collecting repetitive sequences by RepeatModeler

The genomic sequence is masked by (allLTR.lib + MITE.lib), and the unmasked sequence is extracted and the relevant file is called umseqfile.

Commands:

DIR2/BuildDatabase -name umseqfiledb -engine ncbi umseqfile

DIR2 = path where RepeatModeler is.

"engine ncbi” refers to the NCBI blast program that is used as alignment tool.

nohup DIR2/RepeatModeler -database umseqfiledb >& umseqfile.out 
  • Among the sequences generated by RepeatModeler, some are associated with identities and others are not. These with identities are put in ModelerID.lib and the others are in Modelerunknown.lib.
  • Sequences in Modelerunknown.lib are searched against a tranposase database and sequences matching transposase are considered as transposons belonging to the relevant superfamily and are incorporated into ModelerID.lib and excluded from Modelerunknown.lib.

4. Exclusion of gene fragments

  • All repeats collected so far are used to search against a plant protein database where proteins from transposons are excluded. Sequences match the plants proteins as well as 50 bp flanking sequences are excluded. After the exclusion if the remainder sequences are shorter than 50 bp, the entire sequence is excluded.
  • After exclusion of putative gene fragments, MITE.lib, allLTR.lib, and ModelerID.lib are combined together to form KnownRepeats.lib.
  • KnownRepeats.lib was combined with Modelerunknown.lib to form allRepeats.lib.
  • It is conceivable that KnownRepeats.lib does not contain all repeats (repeats numbers are underestimated); allRepeats.lib may contain sequences from novel gene families (Repeat number are overestimated).