ChangeLog of REPET

REPET release 3.0 (release date 28/11/2019)

New features:
- pipelines are adapted for Slurm and PBS schedulers.
- /!\ Warning: some parameters in TEdenovo.cfg have changed /!\
- PASTEClassifier: new format (12 fields) of classif file, new stats file, and post-treatments classifying LTR elements at the super-family level if they are Gypsy or Copia.
- TEdenovo Step 4: at the end of this step, the consensus name is shortened to become the definitive identifier, used in the following steps. The fasta consensus file name is *_consensus_shortH.fa.
- TEdenovo Step 6: uses the new classif format; post-treatments are adapted accordingly.
- TEannot Step 1: adaptation to obtain the same shuffled genome.
- TEannot Step 7: after the long join procedure, homogeneous identity percentage calculation and stats calculation of annotation (coverage %).
- TEannot Step 8: in the 9th GFF3 field, new alignIdentity % and alignLength added on each match line.

New tools:
- PreProcess.py: to analyze and format the input genome.
- new RepBase parser for REPET, with an option to exclude a list of sequences.
- ConvertClassif9ColTo12Col.py: to convert a classif file with 9 columns, produced by REPET versions up to 2.5, to a classif file with 12 columns as used from REPET version 3.0.
- statPASTECNew.py: to calculate new stats on a classif file (12 columns) from REPET version 3.0.
- ClassifExtractFromFastaFile.py: extract a classif subset from a fasta file.
- GetSequenceFromAnnotation can use gff3 files.

Tool adaptations:
- CreateGFF3sForClassifFeatures.py: manages several classif formats (9 or 12 columns), reverses features for reversed consensus; a resources parameter is added in its configuration file.

Bug fixes:
- tests added to detect the presence/absence of dependencies
- adaptation to the new output format of RM v4.0.*
- more tests about /scratch usage as tmp directory
- ....
**** At release time, remember to change the version number and date ****

REPET release 2.5 (release date 31/03/2016)

New features:
- TEdenovo Step 3: Grouper, one of the three clustering tools, is modified to use multithreading, so you need a workstation or PC with at least two dual-core CPUs.
- TEdenovo Step 6: several stats files (in text format) from the consensus library.
- TEdenovo Step 7: stats files (in text format) from the filtered consensus library.
- TEannot Step 1: the seed option of the shuffle tool is set to 1, so it always generates the same random genome.
- MaskSeqFromCoord.py: for masking a genome from annotations using REPET output.

REPET release 2.4 (release date 22/12/2015)

Bug fixes:
- inserted DmelChr4.fa data set in db directory.

New features:
- All READMEs and tutorials of pipelines and tools are in the doc directory, with a single CHANGELOG file containing all pipeline changes
- Tools - CreateGFF3sForClassifFeatures.py: creates gff3 files to load in a genome browser for the visualization of features on consensus
- TEdenovo Step 3: Added optional parallelization of Grouper clustering
- TEannot Step 3: Score threshold is now the 95th percentile of random match scores instead of the 99th, which is too stringent
- TEannot Step 7: Optimized run time of the long-join procedure

REPET release 2.3 (release date 17/10/2014)

Bug fixes:
- MySQL connection port provided by the user was not used when exporting some tables to files
- MySQL connection port provided by the user was not used when using a configuration file in certain conditions
- A few tables were not using MyISAM as an engine
- TEdenovo Step 6: corrected bad RepeatScout header formatting with "reversed" tag
- PASTEClassifier: now sets environment variables from configuration file

New features:
- improved database performance
- overall performance improvement
- overall memory usage decrease
- Tools - SplicerTesFromAnnotation.py: tool to splice annotations from genome.
  These annotations are Full Length Copies or Full Length Fragments according to the consensus (or TE) library
- Tools - TELibFormatter.py: tool to format headers of TE banks (in fasta format) to be usable by REPET pipelines
- TEdenovo Step 1: Now removes N-stretches from chunks prior to alignment
- TEdenovo Step 5: Added possibility to use several nucleotide, amino-acid and protein HMM banks at the same time
- TEannot Step 1: Now removes N-stretches from chunks prior to alignment
- TEannot Step 1: Now creates more homogeneous batches
- TEannot Step 2: Default Blaster sensitivity is set to 2 instead of 4
- TEannot Step 3: Score threshold is now the 99th percentile of random match scores instead of the 95th
- TEannot Step 8: New way to eliminate overlaps between annotations in the GFF file
- PASTEClassifier: Replaced the -r (reverse complement) and -w (add Wicker code) command-line options with the rev_complement and add_wicker_code attributes in the configuration file
- PASTEClassifier: Added the add_noCat_bestHitClassif attribute in the configuration file to classify noCat consensus using well-classified consensus

REPET release 2.2 (release date 13/09/13)

- improved job and database management: table engine is set to MyISAM
- MCL: now integrated into PostAnalyseTElib as an alternative clustering method
- LaunchMCL: Added options to set query coverage and handle matcher inputs
- Added LaunchRepeatScout tool wrapping the different required steps of RepeatScout

Bug fixes:
- TEannot - FilterAlign: added subject length filtering
- PASTEClassifier - limit_job_nb option correction

New features:
- TEannot - Step 3: Set all Blaster E-values to 0 to be consistent with RepeatMasker and Censor
- TEannot - Step 4: Added TRFmaxPeriod option for TRF in configuration file
- SegDup_pipe - Initial release
- Tallymer_pipe - Initial release

REPET release 2.1 (release date 01/02/13)

Bug fixes:
- TEdenovo step 3 struct: orient sequences in each cluster
- TEdenovo step 3: correction of the n longest sequences
  per group filter for Grouper
- TEdenovo step 4: handle empty files after filter clustering
- TEdenovo step 6: handle PASTEC coverage parameters from configuration file
- TEannot step 7: handle identity > 100 from Censor
- correction of the cleaning system when job files are copied on node

New features:
- TEdenovo step 2 struct: add structural parameters for LTRharvest in configuration file
- TEdenovo step 6: improve PASTEC classification and add threshold parameters in configuration file
- TEdenovo step 6: improve consensus redundancy removal
- TEdenovo step 7: add filtering parameters in configuration file
- TEannot step 2: allow launching Censor with NCBI-BLAST
- TEannot step 2: allow launching RepeatMasker with NCBI-BLAST
- TEannot step 3: improve the statistical filter of matches based on random hits
- TEannot step 4: allow launching RepeatMaskerSSR with crossmatch
- TEannot step 8: remove redundant features in the GFF3 file and add GFF3 formatting option in configuration file
- add TallymerPipe tool as pre-processing tool for fast repetition discovery
- add PostAnalyseTElib tool as post-processing tool (start to merge ClusterConsensus and GiveInfoTEannot tools)
- add SegDup pipeline as a tool to discover segmental duplications in genomes
- improve database management
- improve job management
- overall performance improvement, especially in TEdenovo step 6

REPET release 2.0 (release date 23/02/12)

Bug fixes:
- TEdenovo step 1: check headers
- TEdenovo step 3: correction of the n longest sequences per group filter
- TEdenovo step 4: in headers, convert coordinates from chunks to initial sequences
- TEdenovo step 4: filter cluster with only one sequence if N% > 40%
- TEdenovo step 6: in headers, always 3 characters for classification
- TEdenovo step 6: classification statistics
- TEdenovo step 8: headers must begin with classification
- TEannot step 2: no more "low complexity" filter for WU-BLAST (when called by Censor)
- TEannot step 3: filter
  match if N% > 90%
- PASTEC: avoid "MySQL server has gone away" error when there are too many consensus
- change jobs table to record computing resources information (to re-submit with the same resources)
- allow specifying more than one computing resource in configuration file
- information about job submission in standard output
- catch help message from WU-BLAST
- variance calculation

New features:
- TEdenovo step 2 struct: allow changing the similarity parameter for LTRHarvest
- TEdenovo: allow using only the structural search until the end of the pipeline
- PASTEC: allow changing the job number limit in configuration file
- TEannot: allow another job manager: Torque (launcher migration)
- TEannot: clean and drop_tables options in configuration file
- configuration: add "repet_job_manager" option and set environment variables from configuration file as much as possible

REPET release 1.4.2 (release date 02/12/11)

Bug fixes:
- force the use of only one CPU for each blast
- no more "low complexity" filter for WU-BLAST

New features:
- TEdenovo step 6: change classification tool: PASTEC integration
- TEdenovo step 6: add Wicker's code in headers
- TEdenovo step 6: reverse-complement consensus sequences if necessary
- TEdenovo: allow another job manager: Torque (launcher migration)
- TEannot step 8: add identity in GFF3 files
- update Blaster parameters to use Blast+ (change NCBI-BLAST "gapopen" parameter to match NCBI-BLAST+)
- configuration: add "copy" option

REPET release 1.4.1 (release date 23/09/11)

Bug fixes:
- TEdenovo step 6: correction when headers are shortened

New features:
- TEdenovo step 1: add this step to prepare data (all following steps are shifted)
- TEdenovo step 2: bank copy in "tmpDir" (node directory) for Blaster
- TEdenovo step 3: change Grouper parameters (gap penalty=1 and gap distance=10)
- TEdenovo step 2+3: add structural search: LTRHarvest and Blastclust integration

REPET release 1.4 (release date 15/04/11)

Bug fixes:
- update
  NCBI-BLAST parameters (since blastall 2.2.21, not all combinations of "penalty" and "gapextend" are allowed)
- in configuration, the "cluster" option is no longer available: pipelines can't be launched without a job scheduler (SGE)

New features:
- in configuration, change "queue" option to "resources" option (not mandatory) to specify whether specific resources are needed (memory or time)

REPET release 1.3.13 (not released)

Bug fixes:
- in TEdenovo and TEannot, job submission management
- in TEdenovo and TEannot, when too many files are produced: change file management
- in TEdenovo and TEannot, when too many files are produced: change job recording in database
- in TEdenovo step 5, correction of header formatting, to allow project names containing "_"
- update WU-BLAST parameters to match NCBI-BLAST (when sensitivity=3), and use only one CPU

New features:
- improvement of GiveInfoTEannot.py (script which computes statistics from the annotation results)
- limit to 10000 the number of jobs in "waiting" status.
- in TEdenovo, add step 6: filter of SSR and noCat consensus built from fewer than 10 fragments
- in TEdenovo, add step 7: clustering of final consensus

REPET release 1.3.12.1 (release date 20/10/2010)

Bug fix:
- in TEdenovo and TEannot, handle MySQL timeout

REPET release 1.3.12 (release date 08/10/2010)

Bug fixes:
- in TEannot step 3, update all match scores when more than one alignment method is used to avoid fragmented matches in MATCHER
- in TEannot step 7, manage DB connection

New features:
- in TEdenovo step 1, add support for megablast by BLASTER (for large genomes)
- in TEdenovo step 2, add possibility to cluster in GROUPER by assessing overlaps in terms of number of nucleotides
- in TEdenovo step 2, better handle identical members in GROUPER (likely to be palindromic repeats)
- in TEdenovo step 4, add the profile HMM search
- in TEdenovo step 4 and 5, add support for 'TEclass' (Abrusan et al., 2009)
- in both pipelines, improve management of job memory shortage, job timeout and node failure
- in both pipelines, each step is run on cluster node(s)

REPET release 1.3.11 (release date 28/07/2010)

Bug fixes:
- in TEdenovo and TEannot, give summary of execution times with only one job
- in TEannot, filter matches even if minimum score is given as a float
- in TEannot, properly compute weighted identity for long joins when some matches are overlapping

New features:
- in TEannot, allow choosing what to clean via a command-line option
- in TEannot, allow launching RepeatMasker with default sensitivity
- add several tools in the public releases

REPET release 1.3.10 (release date 06/07/2010)

Bug fixes:
- avoid using "/tmp" when handling large data sets from MySQL tables

New features:
- in TEannot step 8, allow deleting MySQL tables once GFF3 or gameXML files have been made
- in TEdenovo and TEannot, provide summary of execution times for jobs launched in parallel
- in TEannot, add options to tune RepeatMasker options

REPET release 1.3.9 (release date 05/05/2010)

Bug fixes:
- in TEdenovo step 5, when setting "cluster: yes" in the configuration file
- in TEdenovo step 4, handle WU-BLAST exceptions when using tblastx and blastx

New features:
- in TEdenovo step 4 and 5, use super-family information when available in Repbase formatted for REPET
- in TEdenovo step 5, improve the host's gene filter

REPET release 1.3.8 (release date 06/04/2010)

Bug fixes:
- in TEannot step 1, take into account the kind of BLAST when preparing the banks

New features:
- in TEannot, add option in configuration file to specify BLAST (NCBI or WU)

Development:
- new package for data persistence called pyRepetUnit.commons.sql

REPET release 1.3.7 (release date 26/02/2010)

Bug fixes:
- in TEdenovo step 5, properly remove redundancy within consensus of the same category

New features:
- in TEannot step 7, remove copies (chains of matches) shorter than a given length (20 bp by default)

Development:
- when launching jobs in parallel, free temporary directory in case of program error
- new package for sequence manipulation called pyRepetUnit.commons.seq

Warning:
- problem of portability on Solaris for Grouper (TEdenovo)

REPET release 1.3.6 (release date 29/01/2010)

New features:
- in TEdenovo step 1, remove redundant matches due to parallelized all-by-all
- in TEdenovo step 2, for Grouper, set stringent behavior when merging
- in TEdenovo step 5, use blastn results when classifying consensus
- in TEannot, add comparison with an amino-acid data-bank via blastx
- in TEdenovo & TEannot, add checks (binaries, input files, parameters and SGE)

Development:
- new package for coordinate manipulation called pyRepetUnit.commons.coord
Warning:
- problem of portability on Solaris for Grouper (TEdenovo)

REPET release 1.3.5 (release date 17/12/2009)

New features:
- none

Development:
- generation of super-agent output
- multifasta parser only with SNP (length = 1)
- in package coord, migration of SetUtils

Warning:
- problem of portability on Solaris for Grouper (TEdenovo)

REPET release 1.3.4 (release date 02/12/2009)

New features:
- in TEannot step 7, when trying to join two fragments, compute the weighted identity
- when launching jobs in parallel, use a unique directory to avoid conflicts between users

Development:
- create utility classes for Path and Set objects (add a specific sort)

Warning:
- problem of portability on Solaris for Grouper (TEdenovo)

REPET release 1.3.3.1 (release date 10/11/2009)

Patch:
- fix erratic problems in TEdenovo step 4 when launching tblastx

New features:
- when launching jobs in parallel, check there is at least 1Gb in temporary directory

REPET release 1.3.3 (release date 06/11/2009)

New features:
- allow passing options to qsub via configuration file
- in TEdenovo step 2, improve scalability on large, complex genomes

Development:
- "coord" API: 80% tested
- "launcher" API: first template and implementations (Map, Mafft, RepeatMasker)

Warning:
- problem of portability on Solaris for Grouper (TEdenovo)
- erratic problems in TEdenovo step 4 when launching tblastx

REPET release 1.3.2 (release date 05/10/2009)

New features:
- improve redundancy removal in Grouper
- clean temporary directory in pipelines

Development:
- choose Doxygen for documentation
- document tools and APIs (see Makefile)

Warning:
- problem of portability on Solaris for Grouper (TEdenovo)

REPET release 1.3.1 (release date 08/09/2009)

New features:
- specify tmpDir for job execution in TEdenovo.cfg & TEannot.cfg

Development:
- continuous integration (unit and functional tests)

REPET release 1.3 (release date 21/07/2009)

New features:
- in TEdenovo, in step 2, options to launch the clustering programs
  on a cluster node from within the pipeline
- in TEdenovo, in step 2, when launching Grouper, build only the groups and not the clusters, which speeds things up considerably
- in TEannot, in step 3, when launching Matcher, do the cleaning procedure after joining the HSPs
- in TEannot, in step 7, use the weighted identity to approximate the age of a chain of TE fragments

Development:
- new methodology: test-driven development (TDD)
- add some code with Python unit tests in both pipelines

REPET release 1.2.3 (release date 16/03/2009)

New features:
- in TEdenovo, options to filter HSPs after the clustering
- in TEdenovo, option to choose the minimum number of bases to edit a consensus
- in TEdenovo, speed up step 4 to handle a larger number of consensus
- in TEdenovo, option to launch step 5 in parallel
- in TEannot, option to compare the TE library against the randomized genomic sequences
- in TEannot, option to force the usage of default values when filtering false positives

Bug fixes:
- in TEdenovo, correctly parse the joins made by GROUPER
- in TEannot, update the connected fragments after the jobs were launched in parallel

REPET release 1.2.2 (release date 31/10/2008)

New features:
- in TEdenovo, add option for GROUPER in the configuration file
- in TEannot, use C++ parsers in some time-limiting steps
- in TEannot, add option for BLASTER's sensitivity in the configuration file
- in TEannot, allow skipping tblastx alignments
- in both, verbose options have been added in all Python scripts (not yet for the C++ programs)

Bug fixes:
- in TEannot, rename all sequence headers internally to prevent errors (also for GFF3 export)

REPET release 1.2.1 (release date 09/10/2008)

Bug fixes:
- in TEdenovo, as a precaution, remove the space in sequence headers between "class" and "I" or "II" in the classification step
- in TEannot, when necessary, take the default threshold from the configuration file
- in TEannot, add option "join" in step 3 when using MATCHER

REPET release 1.2.0 (release date 30/09/2008)

New feature:
- in TEannot, add option -c in step 3 to allow different combinations of alignment programs

Bug fixes:
- in TEannot, rename all headers of the reference TE library to prevent crashes of RepeatMasker when they are longer than 50 characters
- in pyRepet, don't use Unix's sed anymore, as some sequence headers can be interpreted as regular expressions

REPET release 1.1.0 (release date 25/09/2008)

New features:
- in TEdenovo, increase BLASTER stringency to speed up the redundancy removal procedure
- in TEdenovo, change the default redundancy removal procedure in the configuration file
- in TEannot, manage IUPAC character replacement before launching MREPS
- in TEannot, decrease BLASTER sensitivity to speed up the computations

Bug fixes:
- in TEdenovo, fix problems when parsing outputs from RECON and PILER
- in TEdenovo, keep N stretches when building chunks

REPET release 1.0.0 (release date 31/07/2008)