Annual meeting of the Society for Molecular Biology and Evolution (SMBE) in Lyon, France, 4-8 July 2010.
04 Jul 2010 Exploring the diversification of TE families via de novo approaches in whole-genome sequences
Timothée Flutre, Elodie Duprat, Catherine Feuillet, Hadi Quesneville
Transposable elements (TEs) are repeated genomic sequences, almost ubiquitous in prokaryotic and eukaryotic genomes, that have a large impact on genome evolution. They are recognized as key players in genome structure dynamics, and can also be viewed as “controlling” elements involved in epigenetic mechanisms and in the tinkering of gene regulatory networks. As the number of genome sequences is steadily increasing, automatic de novo approaches are needed to annotate TEs accurately and to overcome the challenge of detecting nested and fragmented TEs in large, repetitive genomes. To this end, we compared several programs and implemented a combined approach, the TEdenovo pipeline, which was integrated into the REPET package. The comparative analysis was performed on the Drosophila melanogaster release 4 and Arabidopsis thaliana release 9 genomes, for which high-quality TE reference sequences and annotations are available.
Since we are interested in TE dynamics within their genomic ecosystem, the goal of our approach is to recover, for each TE family, the complete ancestral sequence that transposed, rather than truncated or artifactual versions of it. The TEdenovo pipeline proceeds in three steps: (1) it searches for repeats via a self-alignment of the input genomic sequences, (2) it clusters the resulting high-scoring segment pairs, and (3) it builds a consensus sequence from the multiple sequence alignment of each cluster. At the crucial repeat-clustering step, we show that only a combination of algorithms dedicated to the clustering of interspersed repeats can detect most of the TE families with a good recovery of the reference sequences. On D. melanogaster, the de novo library reached 93% sensitivity and 81% specificity when compared to its reference databank, whereas it reached 74% sensitivity and 72% specificity on A. thaliana.
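Steps (2) and (3) can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the TEdenovo implementation: the clustering is plain single-linkage on interval overlap (not the dedicated algorithms the pipeline combines), the consensus is a column-wise majority vote over sequences assumed to be already aligned, and the names `cluster_hsps` and `majority_consensus` are hypothetical.

```python
from collections import Counter


def cluster_hsps(hsps):
    """Single-linkage clustering of the intervals joined by HSPs.

    hsps: list of ((q_start, q_end), (s_start, s_end)) pairs with inclusive
    coordinates on a single sequence (a simplifying assumption).
    Returns a sorted list of clusters, each a sorted list of intervals.
    """
    intervals = sorted({iv for pair in hsps for iv in pair})
    parent = {iv: iv for iv in intervals}

    def find(iv):
        # Union-find lookup with path halving.
        while parent[iv] != iv:
            parent[iv] = parent[parent[iv]]
            iv = parent[iv]
        return iv

    def union(a, b):
        parent[find(a)] = find(b)

    # The two intervals aligned by an HSP belong to the same cluster.
    for query, subject in hsps:
        union(query, subject)

    # Intervals that overlap on the genome are linked as well.
    for i, a in enumerate(intervals):
        for b in intervals[i + 1:]:
            if b[0] > a[1]:
                break  # sorted by start: no later interval overlaps a
            union(a, b)

    groups = {}
    for iv in intervals:
        groups.setdefault(find(iv), []).append(iv)
    return sorted(groups.values())


def majority_consensus(aligned):
    """Majority-vote consensus of equal-length, pre-aligned sequences.

    Columns where the gap character '-' is the most frequent symbol are
    dropped (a design choice of this sketch).
    """
    consensus = []
    for column in zip(*aligned):
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != "-":
            consensus.append(symbol)
    return "".join(consensus)


if __name__ == "__main__":
    toy_hsps = [
        ((1, 100), (501, 600)),
        ((90, 150), (900, 960)),
        ((2000, 2100), (3000, 3100)),
    ]
    for cluster in cluster_hsps(toy_hsps):
        print(cluster)
    print(majority_consensus(["ACGT-A", "ACGTTA", "ACCT-A"]))
```

Single-linkage was chosen here only because it is short to write; it tends to merge distinct families through chance overlaps, which is one reason a real pipeline would rely on algorithms dedicated to interspersed repeats.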
As the TEdenovo pipeline combines several clustering algorithms, we implemented a TE classifier to detect TE features in the de novo consensus sequences and to remove redundant consensus sequences based on their classification. This procedure filters out false positives and provides useful information to support manual curation. Once the de novo library was produced, we used it to mine the genome via the TEannot pipeline, also part of the REPET package. This tool combines several programs to detect TE fragments and to connect them when they belong to the same TE copy, hence resolving complex insertion patterns. The final annotation obtained with the de novo consensus sequences reached 91% sensitivity and 97% specificity on D. melanogaster, and 87% sensitivity and 92% specificity on A. thaliana.
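The idea of connecting fragments that belong to the same TE copy can be sketched as follows. This is a simplified greedy chaining written for illustration, not the procedure used by TEannot: the `Fragment` fields, the `join_fragments` name and the `max_gap` threshold of 5000 bp are all assumptions, and strand is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    family: str   # name of the matched consensus (hypothetical label)
    g_start: int  # start on the genome
    g_end: int    # end on the genome
    c_start: int  # start on the consensus
    c_end: int    # end on the consensus


def join_fragments(fragments, max_gap=5000):
    """Greedily chain fragments of one family into putative TE copies.

    A fragment extends the latest chain of its family when it lies within
    max_gap bp of that chain's last fragment on the genome and continues it
    on the consensus (collinearity). max_gap is an arbitrary assumption.
    """
    copies = []
    latest_chain = {}  # family -> most recent chain for that family
    for frag in sorted(fragments, key=lambda f: f.g_start):
        chain = latest_chain.get(frag.family)
        if chain is not None:
            last = chain[-1]
            close_enough = frag.g_start - last.g_end <= max_gap
            collinear = frag.c_start >= last.c_end
            if close_enough and collinear:
                chain.append(frag)
                continue
        # Otherwise this fragment starts a new putative copy.
        chain = [frag]
        copies.append(chain)
        latest_chain[frag.family] = chain
    return copies


if __name__ == "__main__":
    toy = [
        Fragment("A", 100, 400, 1, 300),      # 5' part of a copy of family A
        Fragment("B", 401, 900, 1, 500),      # element B inserted inside it
        Fragment("A", 901, 1300, 301, 700),   # 3' part of the same A copy
        Fragment("A", 50000, 50200, 1, 200),  # a distant, separate A copy
    ]
    for copy in join_fragments(toy):
        print([(f.family, f.g_start, f.g_end) for f in copy])
```

In the toy input, the two fragments of family A that flank the B insertion are joined into one copy, which is the nested pattern mentioned above, while the distant A fragment stays separate.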
This analysis enabled us to thoroughly assess the results of de novo approaches and to highlight the structural diversification within TE families, a process reflecting the tempo and mode of TE evolution.