2013

National,  COM (talks)

XVIIIe CNET (Congrès National sur les Elements Transposables), Montpellier, 1-3 July 2013

Within the framework of GDR 3546, 'Les Eléments Génétiques Mobiles : du mécanisme aux populations, une approche intégrée' (Mobile Genetic Elements: from mechanism to populations, an integrated approach)

01 Jul 2013   New tools and strategy for TEs annotation in large genomes

V. Jamilloux, S. Arnoux, O. Inizan, T. Chaumier, M. Moissette, H. Quesneville

The recent successes of new sequencing technologies now make it possible to sequence increasingly large genomes at reduced cost. Transposable elements (TEs) are the most structurally dynamic components and make up the largest portion of the nuclear sequence of these large genomes, e.g. 85% of the maize genome (Schnable et al. 2009) and 88% of the wheat genome (Choulet et al. 2010). TE annotation should therefore be considered a major task in these genome projects.

However, this remains a major challenge, since a good TE annotation relies critically on an expertly assembled reference sequence set, data that currently cannot be obtained automatically. This crucial step is now a bottleneck for many genome analyses. To address it, we are scaling up the repeat detection and annotation pipelines of the REPET package (Flutre et al. 2011, now at its v2.1 release). These two pipelines, TEdenovo and TEannot, respectively build a TE consensus library and annotate TE copies in the genome. TEdenovo has recently been improved by integrating LTRHarvest (Ellinghaus et al. 2008) into its framework for fast and accurate LTR-TE detection. In addition, we now propose three new pipelines: TallymerPipe, based on Tallymer (Kurtz et al. 2009), a pre-processing tool for fast detection of repeated regions; PostAnalyseTElib, which reports information about a TE library; and SegDup, which detects segmental duplications.
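The idea behind k-mer based pre-processing can be illustrated with a toy sketch: count every k-mer in a sequence and flag the positions whose k-mer occurs several times. This is only an illustration of the principle, not how Tallymer or TallymerPipe are implemented; the function names `kmer_counts` and `repeated_positions`, the toy sequence and the thresholds are all assumptions made for the example.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every k-mer (length-k substring) of seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repeated_positions(seq, k, min_count):
    """Return the start positions whose k-mer occurs at least min_count times."""
    seq = seq.upper()
    counts = kmer_counts(seq, k)
    return [
        i
        for i in range(len(seq) - k + 1)
        if counts[seq[i:i + k]] >= min_count
    ]


# Toy sequence (hypothetical): the motif ACGTACG appears twice.
toy = "ACGTACGTTTGACGTACGA"
print(repeated_positions(toy, k=5, min_count=2))
```

A dictionary of counts like this one does not scale to gigabase genomes, which is why dedicated index structures are used in practice.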
Using these pipelines, we apply a new strategy to cope with very large genomes such as that of wheat, an allohexaploid with three homoeologous genomes. It is one of the largest plant genomes, with ~17 Gbp and 88% TEs (Choulet et al. 2010). We started with chromosome 3B, the first to be fully sequenced. The strategy is iterative and can be summarized as follows:
1) Detection of the easiest-to-find TEs, with stringent parameters, to build a first TE consensus library. These often correspond to the youngest and least degenerate TEs.
2) TE annotation and splicing of the corresponding sequences out of the initial contigs, which yields a reduced genome sequence.
3) Detection of the remaining TEs with sensitive parameters on the reduced genome, to build a second TE consensus library.
4) Annotation of the original contigs with the concatenation of the two TE libraries.
The rationale is that these large genomes are mostly made up of a few TE families that are easy to find because they are present in many copies. These are detected in the first step, which reduces the genome size by a large factor. Using this approach, we were able to reduce the genome from 986 Mbp to ~230 Mbp, a reasonable size for TE detection with sensitive parameters.
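The splicing in step 2 and the resulting size reduction can be sketched as follows. This is a minimal illustration, assuming TE copies are given as half-open (start, end) coordinates on a contig; the helper names `merge_intervals` and `splice_out` and the toy contig are hypothetical and are not part of REPET.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def splice_out(seq, intervals):
    """Return seq with every annotated interval removed."""
    kept = []
    pos = 0
    for start, end in merge_intervals(intervals):
        kept.append(seq[pos:start])  # keep the unannotated stretch
        pos = end                    # skip the annotated TE copy
    kept.append(seq[pos:])
    return "".join(kept)


# Toy contig with two annotated TE copies (one annotation is redundant).
contig = "AAAATTTTTTCCCCGGGGGGAAAA"
te_copies = [(4, 10), (14, 20), (8, 10)]
reduced = splice_out(contig, te_copies)
print(reduced, len(contig) / len(reduced))

# Sizes quoted in the abstract: reduction factor and percentage removed.
print(round(986 / 230, 1), round(100 * (1 - 230 / 986)))
```

With the figures quoted above, going from 986 Mbp to ~230 Mbp corresponds to a reduction by a factor of about 4.3, i.e. roughly 77% of the sequence is removed after the first pass.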
Ref:
Choulet, F, T Wicker, C Rustenholz, et al. 2010. Megabase level sequencing reveals contrasted organization and evolution patterns of the wheat gene and transposable element spaces. Plant Cell 22:1686-1701.
Ellinghaus, D, S Kurtz, U Willhoeft. 2008. LTRHarvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics. doi:10.1186/1471-2105-9-18
Flutre, T, E Duprat, C Feuillet, H Quesneville. 2011. Considering transposable element diversification in de novo annotation approaches. PLoS One 6:e16526.
Kurtz, S, A Narechania, J-C Stein, D Ware. 2009. A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics 9:517.
Schnable, PS, D Ware, RS Fulton, et al. 2009. The B73 maize genome: complexity, diversity, and dynamics. Science 326:1112-1115.

Update: 16 Sep 2020
Creation date: 09 Feb 2013