README
The PASTEC package is distributed under the CeCILL license (see http://www.cecill.info/index.en.html). Please read the distributed LICENSE file.
It has been deposited to the Agence de Protection des Programmes (APP) under the Inter Deposit Digital Number FR 001 480007 000 R P 2008 000 31 235.
Proper usage of PASTEClassifier requires a Unix-like system (64 bits) running on a cluster with the following, widely used components.
Full usage of the PASTEClassifier package requires installing external programs.
For each program, the lists below give a quick description, the name under which it is known, the version with which it has been tested and, where available, the URL to download it.
For each of these programs, you are strongly advised to read its installation procedure carefully, as bugs may come not from REPET but from a faulty installation of an external program.
Required:
- programming language interpreter, Python 2.6 or 2.7
- Python module, MySQLdb
- Python module, yaml
- database management system, MySQL, v >= 5.0, mysql
- programming language, Awk, GNU version 3.1.5
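The two required Python modules can be checked before launching anything. The following is a minimal sketch, not part of the PASTEC distribution; the function name missing_modules is hypothetical, and importlib.import_module is assumed to be available (it is in Python 2.7 and later, but not in a stock Python 2.6).

```python
import importlib


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # MySQLdb and yaml are the two Python modules listed as required above.
    for name in missing_modules(["MySQLdb", "yaml"]):
        print("Missing Python module: %s" % name)
```

On a cluster, it may be worth running such a check on a compute node as well as on the submission node.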
Optional but highly recommended:
- pairwise alignment: NCBI-BLAST+ and/or NCBI-BLAST and/or WU-BLAST
- protein domains search: hmmer3 package (hmmpress and hmmscan)
- SSR detection program: TRF, version 4.00
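Whether the external programs are reachable can be checked by searching the PATH. The sketch below is an illustration only, not a PASTEC tool: find_executable is a hypothetical helper, and the executable names for BLAST and TRF depend on how they were installed, so only the names given in this README (mysql, hmmpress, hmmscan) are used.

```python
import os


def find_executable(name, path=None):
    """Return the full path of `name` if it is an executable file found in
    one of the directories of `path` (default: $PATH), else None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    # Names taken from this README; add the BLAST and TRF executables
    # under whatever names they have on your system.
    for program in ["mysql", "hmmpress", "hmmscan"]:
        location = find_executable(program)
        print("%s: %s" % (program, location or "NOT FOUND"))
```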
Optional but highly recommended banks:
- Repbase Update, the well-known data-bank of known repeats. To obtain it, go to http://www.girinst.org/server/RepBase/index.php and download the REPET edition.
- If you want to search for protein domains by HMM profiles in your TE library, you need an appropriate bank of HMM profiles. A bank formatted for REPET is available.
Warning: MATCHER (which is part of the BLASTER suite distributed with PASTEClassifier) is also the name of an EMBOSS program, so name conflicts are possible.
Install
Most parts of the PASTEClassifier package are written in Python, a programming language that does not require compilation. Some parts are written in C++, and their binaries are included in the PASTEC package.
The binaries can only be used on 64-bit Linux computers. If you would like to run PASTEClassifier on a different architecture, please contact us at urgi-repet[[@]]inra.fr.
Before using PASTEClassifier, please READ the tutorial "doc/PASTEClassifier_tuto.txt" or PASTEClassifier tuto.
Documentation
Help for launching PASTEClassifier is in the file "doc/PASTEClassifier_tuto.txt" or PASTEClassifier tuto. It contains advice on configuring the program, exploiting the output and adding elements to banks.
The file "Decision_rules.txt" explains how the classification is done.
The file "BLASTERsuite_doc.txt" in the REPET package (docs folder) gives more details about BLASTER and MATCHER programs.
Parallel computations
Nowadays, it is common to work with large amounts of data. Hence, whenever possible, we parallelized our pipelines to save computer time and reduce software memory requirements.
The REPET package works with a job scheduler such as Slurm (slurm.schedmd.com), SGE (Sun Grid Engine) or TORQUE (formerly OpenPBS), three free batch-queuing systems.
To this end, we developed a specific Python module that manages these tasks: launching the jobs in parallel, tracking the errors and re-launching each failed job up to two times. Errors can be due to a power failure, a full disk, etc.
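The re-launch policy described above (one initial run, then at most two re-launches of a failed job) can be illustrated as follows. This is a simplified, single-process sketch and not the REPET module itself: run_with_relaunch and its arguments are hypothetical names, and a job is modelled as a plain Python callable that raises an exception on failure rather than as a job submitted to a scheduler.

```python
def run_with_relaunch(job, max_relaunches=2):
    """Call job(); if it raises, re-launch it up to `max_relaunches` times.

    Returns a tuple (succeeded, attempts), where `attempts` counts the
    initial run plus any re-launches.
    """
    attempts = 0
    while attempts <= max_relaunches:
        attempts += 1
        try:
            job()
            return True, attempts
        except Exception:
            # A real launcher would record the error here (for instance
            # in a jobs table) before deciding whether to re-launch.
            pass
    return False, attempts
```

A job that fails twice and then succeeds therefore runs three times in total, while a job that always fails is abandoned after its third attempt.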
All the job details are stored in a MySQL database table named "jobs". For this reason, if you use REPET on a computer cluster, the Python package "MySQLdb" has to be reachable from the master AND slave nodes.
Besides "squeue" (from Slurm) or "qstat" (from SGE and TORQUE), you can query the "jobs" table directly.
Here are the kinds of SQL commands you may need:
mysql> DESCRIBE jobs;
mysql> SELECT DISTINCT groupid FROM jobs;
mysql> SELECT status, count(*) FROM jobs WHERE groupid="exRepet_Blaster_Piler_Map" GROUP BY status;
mysql> UPDATE jobs SET status="error" WHERE groupid="exRepet_Blaster_Piler_Map" AND status="waiting";
mysql> DELETE FROM jobs WHERE groupid="exRepet_Blaster_Piler_Map";
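The status summary query above can also be run from Python. The sketch below is an assumption-laden illustration: it uses the standard sqlite3 module with an in-memory table as a stand-in for the MySQL database so that it runs anywhere, and the table layout (a hypothetical "jobname" column next to the "groupid" and "status" columns used in the commands above) is not the real schema. With MySQLdb, the placeholder in the query would be %s instead of ?.

```python
import sqlite3


def count_by_status(conn, groupid):
    """Return {status: number_of_jobs} for the given groupid."""
    cur = conn.cursor()
    cur.execute(
        "SELECT status, COUNT(*) FROM jobs WHERE groupid = ? GROUP BY status",
        (groupid,),
    )
    return dict(cur.fetchall())


def demo_connection():
    """Build an in-memory stand-in for the "jobs" table (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (jobname TEXT, groupid TEXT, status TEXT)")
    rows = [
        ("job1", "exRepet_Blaster_Piler_Map", "waiting"),
        ("job2", "exRepet_Blaster_Piler_Map", "waiting"),
        ("job3", "exRepet_Blaster_Piler_Map", "error"),
        ("job4", "another_group", "waiting"),
    ]
    conn.executemany("INSERT INTO jobs VALUES (?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    connection = demo_connection()
    print(count_by_status(connection, "exRepet_Blaster_Piler_Map"))
```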
Authors and contributors
(in alphabetical order)
Tina Alaeitabar
Francoise Alfama
Sandie Arnoux
Marc Bras
Laetitia Brigitte
Timothee Chaumier
Johann Confais
Timothee Flutre
Emeric Henrion
Claire Hoede
Olivier Inizan
Veronique Jamilloux
Jonathan Kreplak
Nacer Mohellibi
Mark Moissette
Erwan Ortie
Eric Penneçot
Hadi Quesneville
Mariène Wan
References
Below is a non-exhaustive list of publications related to the REPET package and the programs it integrates:
* Flutre T, Duprat E, Feuillet C, Quesneville H (2011), 'Considering transposable element diversification in de novo annotation approaches.', PLoS ONE 6(1): e16526.
doi:10.1371/journal.pone.0016526
* Quesneville H, Bergman C, Andrieu O, Autard D, Nouaud D, Ashburner M, Anxolabehere D (2005), 'Combined evidence annotation of transposable elements in genome sequences.', PLoS Comput Biol 1(2): e22.
doi:10.1371/journal.pcbi.0010022
* REPBASE: Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J (2005), 'Repbase Update, a database of eukaryotic repetitive elements.', Cytogenet Genome Res 110(1-4): 462-467.
* TRF: Benson G (1999), 'Tandem repeats finder: a program to analyze DNA sequences.', Nucleic Acids Res 27(2): 573-580.