Repeat Library Construction-Advanced

From MAKER Wiki
Jump to navigation Jump to search

This page describes an advanced method of repetitive element identification and classification. A basic understanding of repetitive elements is assumed. With this method, MITEs (miniature inverted transposable elements) and LTR (Long terminal repeat) elements, are first searched with structural approaches (MITE-hunter and LTRharvest). Numerically, MITEs represent the most abundant transposable elements while LTR elements occupy most genomic mass in plants. The structural approaches are followed by a de novo approach where most repetitive sequences are collected. Basic instructions for generating a species specific repeat library suitable for repeat masking prior to protein coding gene annotation with MAKER can be found here Repeat Library Construction--Basic.

Content contributed by Ning Jiang, Megan Bowman, and Kevin Childs.

A combination of structural-based and homology-based approach is used to maximize the opportunity for repeat collection.

This protocol requires BioPerl, MUSCLE, RepeatMasker, Hmmer and NCBI-BLAST. Please download the programs and modify your PATH wherever relevant.

This page is in update.

1. MITEs (miniature inverted repeat transposable elements)

  • MITEs are collected using MITE-Hunter with all default parameters.
  • Among the output files, “Step8_*.fa” and “Step8_singlet.fa” are combined as the putative MITE sequences. This will be named as MITE.lib.


2. LTR (long terminal repeat) retrotransposons

  • In the plant genomes characterized so far, LTR retrotransposons represent the largest genomic mass among all repeats. As a result, it is very important to collect this type of element with high confidence.
  • Here, the elements are collected using LTRharvest and filtered by LTRdigest and other custom programs.

A fully automated package for building a library for LTR elements will be available with next update.

2.1. Collection of relatively recent LTR retrotranposons

The rationale for initial collection of recent LTR elements is that these elements contain less nested insertions and mutations as well as recombination, thus the chance for false positives is relatively low.


2.1.1. Collection of candidate elements with LTRs that are 99% or more in similarity using LTRharvest

Note: FASTA sequence IDs (sequence names) must NOT have “_” or “.” in them for this pipeline to function properly. Please format accordingly prior to running LTR harvest.

Commands:

DIR1/gt suffixerator -db seqfile -indexname seqfileindex -tis -suf -lcp -des -ssp –dna
DIR1/gt ltrharvest -index seqfileindex -out seqfile.out99 -outinner seqfile.outinner99 -gff3 seqfile.gff99 -minlenltr 100 -maxlenltr 6000 -mindistltr 1500 -maxdistltr 25000 -mintsd 5 -maxtsd 5 -motif tgca -similar 99 -vic 10  > seqfile.result99

DIR1= path where gt is located

The current protocol is developed using GenomeTools-1.5.5.

seqfile = file containing genomic sequence in fasta format. Designate the entire path of this file or ensure that it is in the current working directory.

These two commands will enable the collection of sequences with following features:

  • Two terminal repeats which are >= 99% similar, ranging from 100 bp tp 6000 bp;
  • The two repeats end with “TG…CA”;
  • The size of entire element ranges from 1.5 kb to 25 kb;
  • The element must be flanked by a 5bp TSD (target site duplication).
  • The TSD is within 10 bp from the end of the element.


2.1.2. Using LTRdigest to find elements with PPT (poly purine tract) or PBS (primer binding site)

DIR1/gt gff3 -sort seqfile.gff99 > seqfile.gff99.sort
DIR1/gt ltrdigest -trnas ~/bin/LTR/eukaryotic-tRNAs.fa seqfile.gff99.sort seqfileindex > seqfile.gff99.dgt

The “eukaryotic-tRNAs.fa” can be downloaded from Genomic tRNA database. The output “seqfile.gff99.dgt” provides the location of PPT or PBS in each element, if any. The CRL_Step1.pl script processes the output of LTRdigest so that only elements with PPT or PBS are selected. In addition, at least 50% of the PPT or PBS sequences are located in the internal regions and the distance between LTR and PPT or PBS should be no more than 20 bp. This will ensure the boundary of the LTR is correct.

DIR_CRL = path where CRL scripts and other custom scripts are located

If any script is not in executable format, use “chmod +x” to make it executable.

Make sure Bioperl is active before running CRL scripts. For any input file, if it is not in the working directory, please move it to the working directory or specify the path.

Command:

perl DIR_CRL/CRL_Step1.pl  --gff <path to seqfile.gff.dgt file>

Example:

perl  DIR_CRL/CRL_Step1.pl –-gff  seqfile.gff99.dgt

Output: CRL_Step1_Passed_Elements.txt

CRL_Step1_Passed_Elements.txt is a text file with sequence IDs:

Example:

214_200514

739_87362

96_405925

80_527347


2.1.3. Further filtering of the candidate elements


Even after the above procedures, considerable amounts of false positive elements are still detected. The three most important sources of false positives are:

  • Tandem local repeats such as centromeric repeats.
  • Local gene clusters derived from recent gene duplications
  • Two other transposable elements located in adjacent regions.

A common feature of these sequences is that the alignment between the two “LTRs” often extends beyond the boundary of “LTR”.

The CRL_Step2.pl script removes element sequences containing significant sequencing gaps (more than 50 Ns), then 50bp flanking sequences on both sides of the remaining LTRs are retrieved.

Command:

perl DIR_CRL/CRL_Step2.pl --step1 <path to CRL_Step1_Passed_Elements.txt > --repeatfile <path to seqfile.out file> --resultfile <path to seqfile.result file> --sequencefile <path to seqfile> --removed_repeats  <CRL_Step2_Passed_Elements.fasta>

The “seqfile.out” and “seqfile.result” are from 2.1.1. For example, seqfile.out99 and seqfile.result99.

Example:

perl DIR_CRL/CRL_Step2.pl --step1 CRL_Step1_Passed_Elements.txt  --repeatfile seqfile.out99 --resultfile seqfile.result99  --sequencefile seqfile --removed_repeats   CRL_Step2_Passed_Elements.fasta

Outputs: CRL_Step2_Passed_Elements.fasta Repeat_*.fasta files

The CRL_Step2_Passed_Elements.fasta file is a FASTA file containing the element sequences without significant sequencing gaps. The Repeat_*.fasta files are small sequence files containing the 50bp flanking upstream or downstream sequence.

Example:

>chr04002_1_718645_lLTR50bp_upstream_24

CCTACCACTAGCGTCAACCTCTCTTATAACTTGCACTGTGGTCGGCCATT

>chr04002_1_718645_rLTR50bp_upstream_24

CAGTCCTTCCGAATCTCGGGGTCGAGATTCCGTTTAAGGGAGGTAGGTTT

Move the Repeat_*.fasta files and CRL_Step2_Passed_Elements.fasta into a new directory called fasta_files. The CRL_Step3 script needs to be run in the fasta_files directory.

Commands:

mkdir  fasta_files
mv  Repeat_*.fasta   fasta_files
mv  CRL_Step2_Passed_Elements.fasta  fasta_files
cd   fasta_files


Next, the CRL_Step3.pl script aligns the Repeat_*.fasta files with default parameters. Specifically, the upstream sequence of 5’ LTR was aligned with that of 3’ LTR of each candidate. So were the downstream sequences. The resulting output (flankingseqfile.afa) of MUSCLE contains the two sequences in fasta format where gaps are indicated by “-”. Comparison of each corresponding position of the two sequences would reveal, among all positions of the two sequences, the number of positions associated with identical nucleotides, different nucleotide, and indels. Based on the above comparison, sequence identity of the two sequences was calculated, which equals number of identical positions divided by alignable positions. Alignable positions were derived from total positions minus gaps or indels. If the flanking sequences on either side is alignable (25 identical bps, or half of the total nucleotides, with at least 60% identity excluding gaps), the element is excluded.

Here is a sample output from flankingseqfile.afa: For simplicity, the alignment is short but the principle remains the same. (example)

>sequpstrean5LTR

AGCACGAGAAAACAA

>sequpstrean3LTR

AGTGAAAACAAAAGG

With this example, there is a total 15 nt in each sequence. The total number of alignment position is 19 (11 alignable positions with 8 nt gaps). Among the 11 alignable positions, two nucleotides are different and 9 are identical. So the identity is 9/11 = 82% > 60% Since more than half of the nucleotides are identical and the identity is > 60%, the two sequences are considered alignable and this element is considered a false positive.

Note: the pidentity and seq_c parameters must be integers.

Make sure MUSCLE is active before running CRL_Step3.pl.

Command:

perl DIR_CRL/CRL_Step3.pl --directory <DIR/fasta_files> --step2 <CRL_Step2_Passed Elements file> --pidentity <percent identity> --seq_c <number of identical nucleotides>

Example:

perl DIR_CRL/CRL_Step3.pl --directory  /home/jiang/LTR/fasta_files --step2 CRL_Step2_Passed_Elements.fasta --pidentity 60 --seq_c 25 

Output: CRL_Step3_Passed_Elements.fasta

The CRL_Step3_Passed_Elements.fasta file is a FASTA sequence file containing element sequences that have passed the percent identity and number of identical nucleotides thresholds. This file will be located in the fasta_files directory.

Move CRL_Step3_Passed_Elements.fasta to the parental directory before proceeding to next step.

Commands:

mv  CRL_Step3_Passed_Elements.fasta  ..
cd  ..

2.1.4. Identify elements with nested insertions

  • Retrotranposons are frequently nested with each other or inserted by other elements. If left unidentified, it will cause misclassification and other complications. To detect those elements, LTR sequences from candidate elements retained after steps in 2.1.3 are used to mask the putative internal regions. If LTR sequences are detected in the internal regions, it is considered as elements nested with other insertions. ******
  • If internal regions of elements match sequences in MITE.lib (see 1.), they are also considered as elements with nested insertions. ******
  • The internal regions of elements are also used to search against a transposase database of which LTR retrotransposon-related proteins are excluded. If the internal sequence has significant matches with any non-LTR transposase, it is considered as an element containing nested insertions. ******

2.1.5 Building examplars

  • LTR elements are often very repetitive in the genome, and it will be computationally challenging to retain all individual elements in the repeat library. To reduce the redundancy, representative sequences (examplars) are chosen. ****** Elements sequences are subject to an all vs. all BLASTN search. The element with the most matches (cutoff at 80% identity in 90% of the element length) is considered as the first exemplar. Thereafter, this element and its matches are excluded from the group and a second round BLASTN search is conducted with the remainder of the elements, leading to the generation of the second exemplar. This process is repeated until all elements were excluded.
  • Since the internal sequences of LTRs are usually more conserved than LTR sequences, selection of examplars are first conducted with the internal sequences of the elements without nested insertions. Thereafter the LTR sequences correspond to the examplars of internal regions are also chosen. The sequences of LTR examplars are then used to mask all LTR sequences of all candidate elements; those sequences that are masked (cutoff 80% identity in 90% of the sequence length) are excluded. The remainder of LTR sequences is used to generate a new pool of exemplar sequences.
  • All the exemplar sequences generated are combined together and called LTR99.lib


2.2. Collection of relatively old LTR retrotransposons

Collection of relatively old LTRs is enabled by reducing the similarity between LTRs to 85% (default value of LTRharvest) and not associated with terminal sequence motif.

Commands:

DIR1/gt suffixerator -db seqfile -indexname seqfileindex -tis -suf -lcp -des -ssp –dna
DIR1/gt ltrharvest -index seqfileindex -out seqfile.out85 -outinner seqfile.outinner85 -gff3 seqfile.gff85 -minlenltr 100 -maxlenltr 6000 -mindistltr 1500 -maxdistltr 25000 -mintsd 5 -maxtsd 5 -vic 10  > seqfile.result85
  • Since the terminal sequence motif is not specified, only elements with terminal sequences with patterns that are previously reported are retained.
  • Procedures described in 2.1.2 are conducted to identify elements with PPT or PBS.
  • Since this set of elements contain elements collected in 2.1. (LTR99.lib) or elements highly similar to them, the candidate element sequences are masked by LTR99.lib and all elements that are significantly masked (cutoff at 80% identity in 90% of the element length) are excluded.
  • The remainder of elements is processed through the procedures described in 2.1.3 to 2.1.5 and the resulting exemplar sequences are called LTR85.lib.


2.3 Collection of terminal repeat retrotransposon in miniature (TRIM) - relatively recent elements.

This is analogous to the collection of larger LTR retrotransposons described above but the focus is on miniature elements. Commands:

DIR1/gt suffixerator -db seqfile -indexname seqfileindex -tis -suf -lcp -des -ssp –dna
DIR1/gt ltrharvest -index seqfileindex -out seqfile.outT99 -outinner seqfile.outinnerT99 -gff3 seqfile.gffT99 -minlenltr 70 -maxlenltr 500 -mindistltr 280 -maxdistltr 1500 -mintsd 5 -maxtsd 5 -motif tgca -similar 99 -vic 10  > seqfile.resultT99

Then the candidate elements are processed through procedures described in 2.1.2 to 2.1.5 and the resulting exemplar sequences are called TRIM99.lib.


2.4 Collection of terminal repeat retrotransposon in miniature (TRIM) - relatively old elements.

Commands:

DIR1/gt suffixerator -db seqfile -indexname seqfileindex -tis -suf -lcp -des -ssp –dna
DIR1/gt ltrharvest -index seqfileindex -out seqfile.outT85 -outinner seqfile.outinnerT85 -gff3 seqfile.gffT85 -minlenltr 70 -maxlenltr 500 -mindistltr 280 -maxdistltr 1500 -mintsd 5 -maxtsd 5 -vic 10  > seqfile.resultT85
  • As described in 2.2, only elements with terminal sequences that are reported to date are retained.
  • Procedures described in 2.1.2 are conducted to identify elements with PPT or PBS.
  • The candidate element sequences are masked by TRIM99.lib and all elements that are significantly masked (cutoff at 80% identity in 90% of the element length) are excluded.
  • The remainder of elements is processed through the procedures described in 2.1.3 to 2.1.5 and the resulting exemplar sequences are called TRIM85.lib.
  • Combine LTR99.lib, LTR85.lib, TRIM99.lib, and TRIM85.lib to obtain allLTR.lib
  • A pipeline is in development to combine the procedures together.

3. Collecting repetitive sequences by RepeatModeler

The genomic sequence is masked by (allLTR.lib + MITE.lib), and the unmasked sequence is extracted and the relevant file is called umseqfile.

Commands:

DIR2/BuildDatabase -name umseqfiledb -engine ncbi umseqfile

DIR2 = path where RepeatModeler is.

"engine ncbi” refers to the NCBI blast program that is used as alignment tool.

nohup DIR2/RepeatModeler -database umseqfiledb >& umseqfile.out 
  • Among the sequences generated by RepeatModeler, some are associated with identities and others are not. These with identities are put in ModelerID.lib and the others are in Modelerunknown.lib.
  • Sequences in Modelerunknown.lib are searched against a tranposase database and sequences matching transposase are considered as transposons belonging to the relevant superfamily and are incorporated into ModelerID.lib and excluded from Modelerunknown.lib.

4. Exclusion of gene fragments

  • All repeats collected so far are used to search against a plant protein database where proteins from transposons are excluded. Sequences match the plants proteins as well as 50 bp flanking sequences are excluded. After the exclusion if the remainder sequences are shorter than 50 bp, the entire sequence is excluded.
  • After exclusion of putative gene fragments, MITE.lib, allLTR.lib, and ModelerID.lib are combined together to form KnownRepeats.lib.
  • KnownRepeats.lib was combined with Modelerunknown.lib to form allRepeats.lib.
  • It is conceivable that KnownRepeats.lib does not contain all repeats (repeats numbers are underestimated); allRepeats.lib may contain sequences from novel gene families (Repeat number are overestimated).