Study : Selecting superior de novo transcriptome assemblies: lessons learned by leveraging the best plant genome

Identification

Name
Selecting superior de novo transcriptome assemblies: lessons learned by leveraging the best plant genome
Identifier
dXJuOkVWQS9zdHVkeS9QUkpOQTMwMTA1NQ==
Source
Description
Affordable, deep transcriptome sequencing comes with challenges, including assembling reads that are shorter (~100 bp) than the average mRNA transcript (~1000 bp) and transcript abundance that spans orders of magnitude. Furthermore, no broadly accepted methods for evaluating de novo transcriptome assemblies have been established and rigorously tested with a high-quality reference genome. Here we present a detailed comparison of 99 transcriptome assemblies generated with six de novo assemblers (CLC, Trinity, SOAP, Oases, ABySS and NextGENe), plus reference-guided assembly controls. We identify superior de novo assemblies using the Arabidopsis thaliana and Oryza sativa genomes, and recommend reference-independent quality metrics for de novo assembly validation. The leading assemblers are reassuringly good and robust to known assembly challenges. We identify novel assembly challenges manifested as high rates of false non-assembly in highly expressed genes. Surprisingly, the incidence of true chimeric assemblies is very low for all assemblers, though non-assembly of closely related genes occurs in all assemblies, including reference-based controls. Normalized libraries are reduced in highly abundant transcripts, yet also lack thousands of low-abundance transcripts. Superior transcriptome assemblies are identified only by the combination of 1) a high proportion of reads that map to an assembly, 2) efficient recovery of conserved genes, 3) expected N50 length statistics, and 4) more unigenes than expected transcripts. A key reference-based analysis allows visualization of assembly quality as a function of sequencing depth and shows a clear ranking of assemblers. De novo assembly of the Arabidopsis leaf transcriptome revealed ~20 putative Arabidopsis genes lacking in the current annotation.
We estimated gene expression using de novo RNA-Seq and microarrays; RNA-Seq was superior, and we report a novel strategy to improve the correlation of expression estimates between microarrays and RNA-Seq. We provide benchmark Illumina transcriptome data and SCERNA, a broadly applicable modular pipeline for de novo assembly improvement.
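
The description names N50 as one of the four quality criteria but does not define it. As a minimal sketch, assuming the conventional definition (the contig length at which contigs of that length or longer contain at least half of all assembled bases), it could be computed as below. The function name `n50` and the example lengths are illustrative only and are not taken from the study or from SCERNA.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Assumes the conventional definition: the largest length L such that
    contigs of length >= L together hold at least half of all bases.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point issues with total / 2.
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (bp), for illustration only.
example_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(example_lengths))
```

N50 alone can be inflated by a few very long contigs, which fits the description's point that it is informative only in combination with read mapping rate, conserved gene recovery and unigene count.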

Genotype

Accession number Name Taxon