GenOak consortium (restricted)
Access restricted to GenOak project consortium
Genome assemblies: fasta files
Assemblies | Qrob_V1 (Plomion et al. 2016) | Qrob_V2 | Qrob_H2.3 (Haplome) | Qrob_pseudomolecules v2 | Qrob_unassigned |
Assembly size (bp) | 1,354,311,717 | 1,455,104,916 | 814,282,569 | 716,731,785 | 97,636,684 |
# Scaffolds | 17,910 | 8,827 | 1,409 | 12 Chr (871 scaffolds) | 538 |
N50 | 256,640 | 821,707 | 1,342,530 | 57,352,617 | 621,902 |
L50 | 1,468 | 537 | 192 | 5 | 37 |
N90 | 35,065 | 198,501 | 333,129 | 44,977,106 | 96,968 |
L90 | 6,626 | 1,880 | 649 | 10 | 195 |
Max length (bp) | 1,922,255 | 5,542,037 | 5,871,596 | 115,639,695 | 2,943,817 |
Min length (bp) | 2,003 | 2,000 | 2,095 | 39,860,516 | 2,095 |
#N (% assembly) | 156,586,910 (11.6%) | 67,010,588 (4.6%) | 23,989,938 (2.9%) | 20,278,712 (2.8%) | 3,790,126 (3.9%) |
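N50: length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least 50% of the assembly
L50: number of scaffolds in that set, i.e. the smallest number of scaffolds whose summed length reaches 50% of the assembly
N90 and L90 are defined the same way with a 90% threshold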
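The N50/L50 and N90/L90 rows of the table can be recomputed from a list of scaffold lengths. The following is a minimal sketch using the usual convention (sort scaffolds from longest to shortest and accumulate until the chosen percentage of the total length is reached). The function name `nx_lx` and the toy lengths are illustrative assumptions, not part of the GenOak pipeline.

```python
def nx_lx(lengths, percent=50):
    """Return (Nx, Lx) for a list of sequence lengths.

    Nx is the length of the shortest sequence in the smallest set of
    longest sequences whose summed length reaches `percent` of the total.
    Lx is the number of sequences in that set.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length, count
    raise ValueError("percent must be between 0 and 100")


# Toy scaffold lengths, made up for illustration (not taken from the tables).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = nx_lx(toy_lengths, 50)
n90, l90 = nx_lx(toy_lengths, 90)
print(n50, l50, n90, l90)
```

On real data, the lengths would come from the sequence records of one of the assembly fasta files.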
Gene prediction statistics
Assembly / gene release | v1 / v1 | v1 / v1 | v2 / v2.2 | v2 / v2.2 | Haplome_2.3 | Haplome_2.3 | Haplome_2.3 |
Gene quality | regular | unreliable | regular & manual | Low confidence | regular & manual | Low confidence | Final list of genes |
#Genes (total per release) | 110,058 | | 79,052 | | 43,755 | | 25,808 |
#Genes (per quality class) | 58,778 | 51,280 | 53,484 | 23,547 | 29,665 | 13,575 | 25,808 |
# Uncomplete genes | 4,500 | | 2,021 | | 515 | | 0 |
Gene space (Mb) | 120.6 | 22.5 | 150 | 20 | 83.5 | 11.7 | 75 |
Gene mean / median size (bp) | 2,051 / 1,530 | 439 / 291 | 2,809 / 2,050 | 851 / 417 | 2,813 / 2,055 | 858 / 422 | 2,907 / 2,137 |
CDS mean / median size (bp) | 1,025 / 810 | 266 / 243 | 1,062 / 831 | 286 / 273 | 1,068 / 831 | 286 / 273 | 1,174 / 942 |
#Polypeptide < 500 bp | 10,881 | 51,280 | 12,847 | 23,547 | 7,000 | 13,575 | 4,367 |
#Polypeptide > 3Kb | 1,695 | NA | 2,019 | NA | 1,162 | NA | 1,162 |
#genes with introns | 44,595 (76%) | 18,999 (37%) | 42,627 (80%) | 13,313 (57%) | 23,723 (80%) | 7,736 (57%) | 20,297 (79%) |
# introns per gene | 2.7 | 0.5 | 3.2 | 0.9 | 3.2 | 0.9 | 3.3 |
Transcript sequences
- Oak contigs: OCV4_assembly_final_without_slash.fasta (196M). Oak (Quercus petraea and Quercus robur) cDNA libraries sequenced with Sanger, 454 Roche and Illumina technologies and assembled into contigs.
Pseudomolecule + unassigned scaffolds: JBrowse , OakMine_PM1N
Haplome release v2.3 gene release October 2016/11/25
We used the results of a first round of OrthoMCL + CAFE analysis to curate the gene set.
We deleted many genes associated with TEs, unreliable genes, and small regular genes that were split or very badly predicted and belonged to clusters containing only Qrob genes (none from the other 15 species in the analysis). In the end, we kept 25808 predicted proteins.
Quercus robur gene information file
- 20161125_Quercus_robur_gene_information.xlsx (11.41 MB)
List of genes deleted
- Qrob_H2.3_Genes_v2.2_2delete.tsv (499.34 kB)
GFF file
- Qrob_H2.3_Genes_v2.2_20161004.gff.zip (4.82 MB)
Coding sequences (nucleotide + amino acid sequences)
- Qrob_H2.3_Genes_v2.2_20161004.CDS_nuc.fsa.zip (8.53 MB)
- Qrob_H2.3_Genes_v2.2_20161004.CDS_prot.fsa.zip (5.39 MB)
- GFF file of functional annotation of the 25808 predicted proteins (available in OakMine_PM1N): InterMine_Addesc_AddQual_AddQTL_v3.csv.bz2
Haplome release v2.3 unfiltered gene release
43755 genes (out of 79052 from v2) were mapped on the Haplome v2.3 assembly. Gene IDs and structures are unchanged.
Some tags may differ between a gene in V2 and the same gene in the Haplome: 28 genes are newly tagged uncomplete, either because N makes up more than 20% of their coding sequence or because of an in-frame stop codon.
- regular: 28488 genes with [N<20%] AND [[length > 500 nt] OR [gene with length < 500 nt with oak transcript evidence > 90% of coverage]]
- manual_v1: 1130 genes predicted using mapping of gene from manual curation in v1
- manual_v2: 47 genes manually curated in v2.2
- unreliable: 13575 "low confidence" genes with length < 500 nt without transcript evidence > 90% of coverage
- uncomplete: 515 genes [without start or/and stop] OR [N>20%]
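The automatic tags in the list above can be read as a small decision rule. The sketch below is one possible reading, not the consortium's actual code: the `Gene` fields and the function name `gene_qual` are hypothetical. The list does not say how the boundary cases (length exactly 500 nt, N exactly 20%) are handled, so the comparisons here are assumptions. The manual_v1 and manual_v2 tags come from curation and are not reproduced.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    # Hypothetical record; field names are assumptions for illustration.
    length_nt: int              # gene length in nucleotides
    n_fraction: float           # fraction of N in the coding sequence (0 to 1)
    has_start: bool
    has_stop: bool
    transcript_coverage: float  # oak transcript coverage (0 to 1)


def gene_qual(gene):
    """Assign one of the automatic quality tags described in the list above."""
    # uncomplete: [without start or/and stop] OR [N > 20%]
    if not (gene.has_start and gene.has_stop) or gene.n_fraction > 0.20:
        return "uncomplete"
    # regular: [length > 500 nt] OR [short gene with transcript coverage > 90%]
    if gene.length_nt > 500 or gene.transcript_coverage > 0.90:
        return "regular"
    # unreliable: short gene without sufficient transcript evidence
    return "unreliable"


# Per-tag counts as listed above; they add up to the 43755 mapped genes.
counts = {"regular": 28488, "manual_v1": 1130, "manual_v2": 47,
          "unreliable": 13575, "uncomplete": 515}
print(sum(counts.values()))  # 43755
```

Removing the 515 uncomplete genes from the total gives the 43240 coding sequences distributed for this release.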
GFF file of gene prediction H2.3 (Eugene + manual annotation) on assembly release Haplome v2.3
- Qrob_H2.3_Genes_v2.2_20151214.gff.gz (7.00 MB)
Coding (uncomplete not included) sequences predicted on Haplome v2.3 (43240 proteins)
- Nucleotide sequences: Qrob_H2.3_Genes_v2.2_CDS_nuc.fsa.gz (10.16 MB)
- Amino acid sequences: Qrob_H2.3_Genes_v2.2_CDS_prot.fsa.gz (6.48 MB)
- GFF file of functional annotation of the 43240 predicted proteins (available in OakMine_H ) : Download
- List of proteins with description (Based on phytozome method) when found (based on EC number, KEGG orthology, PANTHER, PFAM) : Qrob_H2.3_Genes_v2.2_CDS_prot_definition.csv.gz (956.65 kB)
List of 2088 manually predicted genes in V2.2 (2067 using 1567 from V1 + 21 only in V2). 1181 of these genes were recovered in the Haplome. Four of them are tagged uncomplete (in-frame stop) and are therefore not translated.
- Manually annotated genes: Location in assembly V2 and haplome. Gene symbol and description are given when available: Qrob_V1_v2_haplome_AnnotV1.csv (237.53 kB)
List of 35297 genes in V2 not recovered in Haplome
- Qrob_Genes_v2.2_notIn_H2.3_20151214.lst (551.52 kB)
GFF file of TE annotation recovered in Haplome v2.3
Counterpart between V2 (diploid) and Haplome V2.3
- 1:1 relationships between both versions (1 gene in V2 corresponds to 1 gene in Haplome V2.3, i.e. alleles already merged in V2):
- 2:1 relationships between both versions (2 genes in V2 correspond to 1 gene in Haplome V2.3, i.e. allelic pairs):
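The two relationship types can be derived from any table that pairs a V2 gene with its Haplome counterpart. The sketch below assumes such a table is available as (V2 gene ID, Haplome gene ID) pairs. The input layout, the function name `group_counterparts` and the gene IDs are hypothetical, since the format of the counterpart files is not described here.

```python
from collections import defaultdict


def group_counterparts(pairs):
    """Group Haplome genes by the number of V2 genes that map to them.

    `pairs` is an iterable of (v2_gene_id, haplome_gene_id) tuples.
    Returns a dict such as {"1:1": {...}, "2:1": {...}} where each inner
    dict maps a Haplome gene ID to the sorted list of its V2 gene IDs.
    """
    v2_by_haplome = defaultdict(set)
    for v2_id, haplome_id in pairs:
        v2_by_haplome[haplome_id].add(v2_id)

    relationships = defaultdict(dict)
    for haplome_id, v2_ids in v2_by_haplome.items():
        relationships[f"{len(v2_ids)}:1"][haplome_id] = sorted(v2_ids)
    return dict(relationships)


# Made-up identifiers, for illustration only.
pairs = [
    ("v2_gene_A", "hap_gene_1"),  # alleles already merged in V2
    ("v2_gene_B", "hap_gene_2"),  # allelic pair, first copy
    ("v2_gene_C", "hap_gene_2"),  # allelic pair, second copy
]
groups = group_counterparts(pairs)
print(groups)
```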
Assembly release v2.2 gene prediction : JBrowse , OakMine
79052 genes were predicted and tagged with quality flags, reported in the gene_qual tag of the GFF file:
- regular: gene with length > 500 nt OR (gene with length < 500 nt with oak transcript evidence > 90% of coverage)
- manual_v1: 1990 genes predicted using mapping of gene from manual curation in v1
- manual_v2: 98 genes manually curated in v2.2
- unreliable: gene with length < 500 nt without transcript evidence > 90% of coverage
- uncomplete: gene without start or/and stop
Genes updated between gene prediction V2 and V2.2: 150 genes were manually curated or restored from the Eugene prediction to recover a good ORF. 4 genes were deleted. Qrob_v2_Genes_v2.2_20151202_genes_modified.lst (2.34 kB)
GFF file of Gene prediction V2.2 (Eugene + manual annotation) on assembly release V2.
- Qrob_v2_Genes_v2.2_20151202.gff.gz (11.68 MB)
Coding (uncomplete not included) sequences predicted (V2.2) on assembly release V2.
- Nucleotide sequences: Qrob_v2_Genes_v2.2_CDS_nuc_20151202.fasta.gz (17.61 MB)
- Amino acid sequences: Qrob_v2_Genes_v2.2_CDS_prot_20151202.fasta.gz (11.60 MB)
Other annotation files (Annotation on assembly release V2)
- Transposable elements: Qrob_v2_TE_annot_FLF2_libv1_filtered_F3.gff.gz (45.75 MB)
- List of 56 genes overlapping a gap in the assembly and containing N>20% in their coding sequence: Qrob_v2_Genes_v2.2_20151202_wN_content.csv.gz (1.08 kB)
Assembly release v1 JBrowse
Genes (Coding sequences) predicted by Eugene :
These fasta files contain reliable and unreliable Eugene-predicted genes (without UTRs):
- Qrob_Pxxxxxxx.1 : gene with length > 500 nt OR (gene with length < 500 nt with oak transcript contig evidence)
- Qrob_uPxxxxxxx.1 : gene with length < 500 nt without transcript evidence (at the time the prediction pipeline was run)
- Nucleotide sequences: Qrob_v1_scaffold_EGN_20140910.CDS_nuc.fsa (68M)
- Amino acid sequences: Qrob_v1_scaffold_EGN_20140910.CDS_prot.fsa (24M)
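Because the reliability class is encoded in the sequence name (Qrob_P versus Qrob_uP), the v1 fasta files can be split or counted by class from their header lines alone. The following is a minimal sketch. It assumes that the "x" placeholders stand for digits and that each fasta header starts with the gene ID, neither of which is stated above. The function names and the example IDs are made up.

```python
import re

# Assumption: the "x" placeholders in Qrob_Pxxxxxxx.1 are digits.
ID_PATTERN = re.compile(r"^>?(Qrob_(u?)P\d+\.\d+)")


def classify_id(header):
    """Return 'reliable', 'unreliable', or None for an unrecognized header."""
    match = ID_PATTERN.match(header.strip())
    if match is None:
        return None
    # The optional 'u' before 'P' marks an unreliable prediction.
    return "unreliable" if match.group(2) else "reliable"


def count_classes(lines):
    """Count fasta records per class, looking only at '>' header lines."""
    counts = {"reliable": 0, "unreliable": 0, "unrecognized": 0}
    for line in lines:
        if line.startswith(">"):
            counts[classify_id(line) or "unrecognized"] += 1
    return counts


# Made-up records, for illustration only.
toy_fasta = [
    ">Qrob_P0000001.1",
    "ATGGCTTAA",
    ">Qrob_uP0000002.1 hypothetical description",
    "ATGTAA",
]
print(count_classes(toy_fasta))
```

To run this on a real file, pass an open file handle to `count_classes` in place of the toy list.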
List of plant species used to detect expansion/contraction of gene families in oak
scientific name | species acronym | number of predicted proteins | version | reference (doi) | |
Quercus robur | Qr | 43240 | haplome (v2.3) | this study | |
Malus domestica | Md | 63514 | v1.0 | 10.1038/ng.654 | Phytozome11 |
Prunus persica | Pp | 27864 | v2.1 | 10.1038/ng.2586 | |
Populus trichocarpa | Pt | 41335 | v3.0 | 10.1126/science.1128691 | |
Citrus clementina | Cc | 24533 | v1.0 | 10.1038/nbt.2906 | |
Fragaria vesca | Fv | 32831 | v1.1 | 10.1038/ng.740 | |
Arabidopsis lyrata | Al | 32657 | v1.0 | 10.1038/ng.807 | |
Solanum tuberosum | St | 35119 | v3.4 | 10.1038/nature10158 | |
Arabidopsis thaliana | At | 27416 | TAIR10 | 10.1038/35048692 | |
Ricinus communis | Rc | 31221 | v0.1 | 10.1038/nbt.1674 | |
Glycine max | Gm | 56044 | Wm82.a2.v1 | 10.1038/nature08670 | |
Vitis vinifera | Vv | 26346 | Genoscope.12X | 10.1038/nature06148 | |
Carica papaya | Cp | 27584 | ASGPBv0.4 | 10.1038/nature06856 | |
Theobroma cacao | Tc | 29452 | v1.1 | 10.1186/gb-2013-14-6-r53 | |
Eucalyptus grandis | Eg | 36376 | v2.0 | 10.1038/nature13308 | |
Citrullus lanatus | Wa | 23440 | v1 | 10.1038/ng.2470 | Download |