Tutorial for TEdenovo included in REPET package v3.0
We recommend you start by running the TEdenovo pipeline on the example provided in the directory "db/" rather than directly on your own genomic sequences.
Thus, from now on, the project name is "DmelChr4".
For more information about the tools mentioned below, see the README.
Set up your working environment
Set environment variables
REPET_PATH gives the absolute path to the directory in which REPET has been installed (e.g. "$HOME/src/repet_pipe/").
Add the path to the REPET programs to your PATH:
If you want to use tools from the REPET package, you will have to set some other variables.
In this case, set them in a setEnv.sh file based on "$REPET_PATH/config/setEnv.sh", and source it.
Ask your system administrator for your database and job scheduler settings and add them to setEnv.sh.
Create your project directory (for instance "DmelChr4_TEdenovo/") and go into it:
Genomic sequence
Copy the input fasta file, or create a symbolic link to it, in your project directory. It has to be named <project_name>.fa. The <project_name> must contain only letters, numbers and underscores ('_'), and be at most 15 characters long:
ADVICE: For your own project, check the fasta file format: each sequence line should contain 60 bp or fewer.
For the sequence headers, it is highly advised to write them as ">XX_i", with XX standing for letters and i standing for numbers.
Please avoid spaces (" ") and symbols such as "=", ";", ":", "|"...
We recommend using the PreProcess.py tool.
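The naming and formatting advice above can be checked before launching anything. The sketch below is only an illustration of those rules: it is not part of REPET and does not replace PreProcess.py. The function names and the two regular expressions are assumptions made for this example.

```python
import re

# Illustrative reading of the rules stated above (not code from REPET).
PROJECT_NAME_RE = re.compile(r"[A-Za-z0-9_]{1,15}")  # letters, digits, '_', max. 15
HEADER_RE = re.compile(r">[A-Za-z]+_[0-9]+")         # ">XX_i"
MAX_LINE_LENGTH = 60


def check_project_name(name):
    """Return True if the project name follows the rule given in the tutorial."""
    return PROJECT_NAME_RE.fullmatch(name) is not None


def check_fasta_lines(lines):
    """Return a list of (line_number, message) for lines that break the advice."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith(">"):
            if HEADER_RE.fullmatch(line) is None:
                problems.append((number, "header is not of the form >XX_i"))
        elif len(line) > MAX_LINE_LENGTH:
            problems.append((number, "sequence line longer than 60 bp"))
    return problems


if __name__ == "__main__":
    example = [">chr_4", "ACGT" * 15, ">chr 4|bad", "A" * 61]
    print(check_project_name("DmelChr4"))
    for number, message in check_fasta_lines(example):
        print(number, message)
```

To check a real file, pass an open file handle to check_fasta_lines(), for instance the handle on "DmelChr4.fa".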
Configuration file:
Copy $REPET_PATH/config/TEdenovo.cfg into your project directory.
Edit "TEdenovo.cfg" to adapt it to your own situation. Below, the 'Adjustable parameters' are organized in several [sections] corresponding to the pipeline steps.
Run the pipeline
The standard output is rather self-explanatory. The programs from REPET almost always begin with the sentence "beginning of ..." and end with "... finished successfully".
Each program that launches another one continues only when EXIT_SUCCESS (usually "0") is returned. Otherwise the sentence "*** Error: 'program X' returned 256" is written and the whole pipeline stops.
To avoid killing the main process of the pipeline when you disconnect from your session, it is highly advised to use the Unix command "nohup". This program keeps a command running even if the session is disconnected or the user logs out. For more details, read the manual ("$ man nohup"). Here is an example:
To speed up the process, jobs are launched in parallel. In each section of the configuration file, you can set the option:
Introduction
The TEdenovo pipeline follows these three main steps:
After that, other processes are launched:
TEdenovo can look for repeated sequences by similarity search and/or by structural search (optional, see § 'Specific use'). You can run either one or both detection methods.
Please see the step descriptions below for command examples.
Regular use
Quick description of TEdenovo's steps for SIMILARITY BRANCH:
Step 1 genomic sequence preparation
In this step, the input genomic sequences are cut into chunks (threshold at 200 kb, with 10 kb of overlap).
If the length of a genomic sequence is below the threshold, the sequence is kept whole as a single chunk; in other words, a chunk is never a concatenation of two different input sequences.
If you have a very large number of small sequences (e.g. 70000 input sequences with a mean size of 100 kb), it is still advised to keep the threshold at 200 kb:
several chunks can be put into the same batch (the batches being launched in parallel), which keeps the number of jobs reasonable.
There will be up to "chunk_length" x "min_nb_seq_per_batch" nucleotides in each batch.
See TEdenovo.cfg section [prepare_batches].
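To picture how a sequence is cut, here is a minimal sketch of overlapping chunks using the values quoted above (200 kb chunks, 10 kb overlap). It is not the REPET implementation: the exact coordinates REPET produces may differ, and the function names are assumptions made for this example.

```python
def cut_into_chunks(seq_length, chunk_length=200_000, overlap=10_000):
    """Return (start, end) pairs, 0-based and end-exclusive, covering one sequence.

    Each input sequence is cut on its own, so a chunk never spans two sequences.
    """
    if overlap >= chunk_length:
        raise ValueError("overlap must be smaller than chunk_length")
    if seq_length <= chunk_length:
        # Sequences at or below the threshold are kept as a single chunk.
        return [(0, seq_length)]
    step = chunk_length - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_length, seq_length)
        chunks.append((start, end))
        if end == seq_length:
            break
        start += step
    return chunks


def max_nucleotides_per_batch(chunk_length, min_nb_seq_per_batch):
    """Upper bound on the batch size, as stated in the tutorial."""
    return chunk_length * min_nb_seq_per_batch


if __name__ == "__main__":
    print(cut_into_chunks(400_000))
    print(max_nucleotides_per_batch(200_000, 5))
```

With these assumptions, a 400 kb sequence gives three chunks, each sharing 10 kb with its neighbour.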
When you are ready, launch the following command:
Results: In the example, a DmelChr4_db directory is created, in which all fasta files containing the chunks are written as Batches/batch_*.fa.
See REPET_v3.0_OutPutsPipelines.xlsx (64.21 kB), TEdenovo Sim tab, for more details.
The second step aligns the genomic sequences of interest (in the example "DmelChr4.fa") with themselves in order to identify high-scoring segment pairs (HSPs) corresponding to repeats.
You can specify an option that may reduce the computing time: copy {yes|no} (default: no). If 'yes', the genomic sequence is copied into the tmpDir specified previously.
WARNING: if you specify 'yes', it improves computing performance ONLY if the tmpDir you specified is a directory on the computing node (e.g. "/scratch").
You also have to make sure that neither a password nor a passphrase is required to connect to the computing nodes from the submission node. Please ask your system administrator about these two crucial points before using this option.
The program BLASTER is used with stringent parameters.
Edit TEdenovo.cfg section [self_align]:
After BLASTER has run, HSPs can be filtered with "filter_HSP: yes". Even if thresholds have already been defined above, you may want to be more stringent after the BLAST.
Moreover, it is not possible during the BLAST to filter on a maximal HSP size (e.g. to remove matches corresponding to segmental duplications):
WARNING: Step 2 generates lots of files (by 'lots' we mean up to dozens of GB, depending of course on the size of the input data bank). Thus it is advised to keep only the useful files ("clean: yes"). To see the difference, launch step 2 on the example with and without this option.
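The idea behind the post-BLASTER filter (being more stringent, and discarding very long matches such as segmental duplications) can be sketched as follows. This is only an illustration: the Hsp record, the threshold names and the values used here are hypothetical and do not correspond to actual TEdenovo.cfg parameters or to the BLASTER output format.

```python
from collections import namedtuple

# Hypothetical, simplified HSP record; the real BLASTER output has its own columns.
Hsp = namedtuple("Hsp", ["query", "subject", "length", "identity"])


def filter_hsps(hsps, min_length, max_length, min_identity):
    """Keep HSPs whose length lies in [min_length, max_length] and whose
    identity (in percent) is at least min_identity."""
    return [
        hsp
        for hsp in hsps
        if min_length <= hsp.length <= max_length and hsp.identity >= min_identity
    ]


if __name__ == "__main__":
    hsps = [
        Hsp("chunk1", "chunk2", 1_200, 95.0),   # kept
        Hsp("chunk1", "chunk3", 80, 99.0),      # too short
        Hsp("chunk2", "chunk4", 45_000, 98.0),  # too long (e.g. segmental duplication)
        Hsp("chunk3", "chunk4", 900, 70.0),     # identity too low
    ]
    # Threshold values below are arbitrary examples.
    for hsp in filter_hsps(hsps, min_length=100, max_length=20_000, min_identity=90.0):
        print(hsp)
```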
When you are ready, launch the following command:
Results:
In the example, a DmelChr4_Blaster directory is created, in which all the results (list of HSPs) are stored, usually in a tab-delimited file called DmelChr4.align.not_over.filtered (HSPs due to chunk overlaps were removed, and the filter was applied).
See REPET_v3.0_OutPutsPipelines.xlsx (64.21 kB), TEdenovo Sim tab, for more details.
The third step clusters the HSPs from step 2 to build clusters of repeats. Three clustering methods are available: GROUPER, RECON and PILER. It is better to launch all three methods in order to be able to combine the results afterwards. These clustering methods are independent, so you have to launch three step 3 commands: one for Grouper, one for Recon and one for Piler. As these programs have different running times, this allows you to launch the corresponding step 4 as soon as one program has finished. This is especially useful because Recon (and sometimes Grouper as well) usually takes longer than Piler. Nevertheless, as the clustering programs usually require large resources, they are launched on a cluster node within the pipeline.
Edit TEdenovo.cfg section [cluster_HSPs] if you need to change the default parameters.
For Grouper clustering program only, parameters are:
When you are ready, launch the following commands:
Results: In the example, for each clustering method, a directory DmelChr4_Blaster_<method> is created. It contains several files, including DmelChr4<XX>_filtered.log, which records some statistics about this step, and DmelChr4_Blaster_<method_name>_3elem_20seq.fa, which contains the sequences, with headers indicating which cluster each sequence belongs to.
See REPET_v3.0_OutPutsPipelines.xlsx (64.21 kB), TEdenovo Sim tab, for more details.
This step builds a multiple alignment for each cluster from each clustering method used. The available multiple sequence alignment (MSA) program is Map. This program implements a global multiple alignment algorithm that specifically takes long gaps into account. Thus it always completes on clusters from Recon, whereas MUSCLE sometimes never finishes. Moreover, it seems to give better alignments than MAFFT.
Note that, although the Map algorithm described in Huang (1994) remains unchanged, the program has been slightly improved to handle fasta files with several sequences more efficiently. Thus, on the command line, it is now called "rpt_map" instead of "map".
Once the MSA is built, a consensus is derived by taking the most frequent base at each site. Moreover, if only one sequence has a base at a specific site, all the others having a gap (a unique insertion, for instance), then the site is not taken into account for the consensus (the minimal number of bases required at a site is set by the minBasesPerSite parameter).
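The consensus rule described above can be illustrated with a few lines of Python. This is a sketch of the principle only, not the REPET code: the function name, the default value of 2 for the minimal number of bases per site and the alphabetical tie-break are assumptions made for this example.

```python
from collections import Counter


def build_consensus(aligned_seqs, min_bases_per_site=2, gap="-"):
    """Derive a consensus from aligned sequences of equal length.

    A site is skipped when fewer than min_bases_per_site sequences have a base
    there (e.g. a unique insertion). Otherwise the most frequent base is kept;
    ties are broken alphabetically (an assumption of this sketch).
    """
    if not aligned_seqs:
        return ""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*aligned_seqs):
        bases = [base for base in column if base != gap]
        if len(bases) < min_bases_per_site:
            continue
        counts = Counter(bases)
        best = min(counts, key=lambda base: (-counts[base], base))
        consensus.append(best)
    return "".join(consensus)


if __name__ == "__main__":
    msa = ["ACG-T", "ACGAT", "AT--T"]
    print(build_consensus(msa))  # the fourth site has a single base and is skipped
```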
Edit TEdenovo.cfg section [build_consensus] if you need to change the default parameters.
When you are ready, launch the following commands:
These commands are independent, so they can be launched at the same time. Please note that, if you do so, the jobs may compete against each other on the computing cluster.
Results: In the example, for each clustering method, a directory DmelChr4_Blaster_<method>_Map is created containing the consensus file (DmelChr4_Blaster_Grouper_Map_consensus_shortH.fa in the case of Grouper).
At this stage the consensus header becomes an identifier following this nomenclature: <projectName>-<selfAlignmentTool>-<clusteringTool><clusterNumber>-<msaTool><clusterMemberNumber>
(e.g. DmelChr4-B-G1-Map20, with <projectName>: DmelChr4, <selfAlignmentTool>: B for Blaster, <clusteringTool>: G for Grouper, <clusterNumber>: 1, <msaTool>: Map, <clusterMemberNumber>: 20, the number of sequences used to build the consensus). Legend for the clustering tool: G for Grouper, R for Recon, P for Piler.
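If you need to read these identifiers in your own scripts, the nomenclature can be split with a regular expression. The sketch below is an assumption-based illustration (the function and field names are made up for this example) and only handles the plain form shown above, without any suffix added by later steps.

```python
import re

# Pattern written from the nomenclature given in the tutorial.
HEADER_RE = re.compile(
    r"(?P<project>[A-Za-z0-9_]+)-"
    r"(?P<self_align>[A-Za-z]+)-"
    r"(?P<clustering>[GRP])(?P<cluster_number>\d+)-"
    r"(?P<msa>[A-Za-z]+)(?P<member_number>\d+)"
)
CLUSTERING_TOOLS = {"G": "Grouper", "R": "Recon", "P": "Piler"}


def parse_consensus_header(header):
    """Split a consensus identifier such as 'DmelChr4-B-G1-Map20' into fields."""
    match = HEADER_RE.fullmatch(header.lstrip(">").strip())
    if match is None:
        raise ValueError("unexpected consensus header: %r" % header)
    fields = match.groupdict()
    fields["clustering"] = CLUSTERING_TOOLS[fields["clustering"]]
    fields["cluster_number"] = int(fields["cluster_number"])
    fields["member_number"] = int(fields["member_number"])
    return fields


if __name__ == "__main__":
    print(parse_consensus_header(">DmelChr4-B-G1-Map20"))
```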
See REPET_v3.0_OutPutsPipelines.xlsx (64.21 kB), TEdenovo Sim tab, for more details.
Step 5 Consensus detect features
Then, we launch the first step of the PASTEClassifier, i.e. the detection of features on the consensus.
Edit TEdenovo.cfg section [detect_feature] if you need to change the default parameters.
Several programs can be launched to look for:
You can choose the blast program between NCBI-BLAST "blast: ncbi", NCBI-BLAST+ "blast: blastplus" and WU-BLAST "blast: wu".
You can also use RepeatScout to generate additional input consensuses for this step. In order to do this you need:
PASTEClassifier also looks for matches between the consensus and known TEs (e.g. Repbase Update). Repbase Update (Jurka J. et al., Cytogenetic and Genome Research, 2005) is a well-known databank of known repeats. To use it, you will have to register on "www.girinst.org". Once you are registered, you can download a compressed archive with Repbase Update specifically formatted for REPET. The archive contains two fasta files, one with nucleotide sequences given to BLASTER with tblastx ("TE_BLRtx: yes") and the other with amino acid sequences given to BLASTER with blastx ("TE_BLRx: yes"). If you have your own databank of known repeats, you can use it instead of Repbase or concatenate it at the end of Repbase. Take care with the way the sequence headers are formatted. Furthermore, you can provide other databanks:
Make sure you put the databanks in your root project directory (copy or soft link) and indicate the name of each databank in TEdenovo.cfg section [detect_feature].
You can also adjust "TRFmaxPeriod": the maximum period size of the tandem repeats to be reported by TRF.
The programs listed above are launched in parallel. Up to 1500 jobs can be launched (if there are 15000 consensus, each job will deal with 100 consensus).
When you are ready, launch the following command:
If you want to first generate additional consensuses using RepeatScout, please use the following command:
If you want to use only detection by similarity, you must have run the corresponding previous steps. Please launch the following command:
Results:
In the example, a directory DmelChr4_Blaster_GrpRecPil_Map_TEclassif/detectFeatures is created. It contains one folder per program launched, with the corresponding results: ORF, polyA, Profiles, rDNA_BLRn, SSR, TE_BLRn, TE_BLRtx, TE_BLRx, TR. The corresponding MySQL tables are also created: DmelChr4_sim_polyA_set, DmelChr4_sim_TRF_set, DmelChr4_sim_ORF_map, DmelChr4_sim_TE_BLRtn_path, DmelChr4_sim_BLRtx_path, DmelChr4_sim_BLRx_path, DmelChr4_sim_SSR_set, DmelChr4_sim_TR_set.
See REPET_v3.0_OutPutsPipelines.xlsx (64.21 kB), TEdenovo Sim tab, for more details.
Step 6 Consensus classification
This step classifies the consensus according to their features detected at step 5. The classification is made by the PASTEClassifier (or PASTEC).
For each consensus, PASTEC retrieves its features from the MySQL tables: "structural" features (LTR, TIR, polyA tails, SSR-like tails) and "coding" ones (matches with known TEs, host genes, rDNA or HMM profiles).
During this step, several post-treatments are also available.
PASTEC classifies the consensus into several groups. TEs are classified at the order level and, for some LTRs, at the superfamily level, using Wicker's classification (Wicker et al., Nat. Rev. Genet., 2007). Non-TE groups are also used: Short simple repeats (SSR), Potential Host Gene (PHG) and Potential ribosomal DNA (PrDNA).
If PASTEC doesn't find any features, the consensus is 'Unclassified'.
If PASTEC finds more than one classification, the consensus is tagged 'Confused'.
If the consensus sequence is on the negative strand, the consensus is tagged 'reversed'.
PASTEC characterizes a consensus as a TE by its 'Class', 'Order', 'Super Family', Wicker trigram and features.
Edit TEdenovo.cfg section [classif_consensus] if you need to change the default parameters.
The default values of the following parameters are defined from our experience with the Drosophila melanogaster genome and from the paper "A unified classification system for eukaryotic transposable elements", Wicker et al., Nat. Rev. Genet., 2007.
When you are ready, launch the following command:
Results:
In the example, a directory DmelChr4_Blaster_GrpRecPil_Map_TEclassif/classifConsensus is created containing several output files from all post-treatments.
Details about the classification are in several classification files (*.classif) associated with the fasta files of the consensus libraries:
A pre-curated classification is proposed in the DmelChr4_sim_denovoLibTEs_PC.classif file, together with DmelChr4_sim_denovoLibTEs_PC.classif_stats.txt. In this file, LTR consensus (RLX) are classified at the superfamily level, Copia or Gypsy, if all their TE_BLRx, TE_BLRtx and TE_BLRn features are exclusively Copia or exclusively Gypsy.
The classification file is tab-delimited with 12 columns, like the classification table:
WARNING: if a consensus is confused, the 'class', 'order', 'Wcode', 'sFamily' and 'CI' fields will contain all the information, separated by |.
See REPET_v3.0_OutPutsPipelines.xlsx (64.21 kB), TEdenovo Sim tab, for more details.
This step filters out the SSRs, as well as the consensus classified as "unclassified" when they were built from fewer than 10 sequences.
Indeed, before using the consensus library in the TEannot pipeline, you may want to filter it. For instance, you may want to remove the consensus classified as SSR, HostGene, confused and unclassified.
To filter the consensus classified as "unclassified" only when they were built from fewer than 10 sequences, we use the "MSA program number". This number, found in the header of each consensus after the name of the MSA program, corresponds to the number of sequences in the multiple alignment from which the consensus was derived.
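The filtering rule can be pictured with the following sketch. It is not the REPET implementation: the function names, the "SSR" and "Unclassified" labels as written here, and the reading that SSRs are always removed are assumptions of this example. Only the threshold of 10 sequences and the use of the number following the MSA program name come from the text above.

```python
import re

# The number of sequences is the integer that follows the MSA program name
# at the end of the consensus header (e.g. 'Map20' -> 20).
MEMBER_COUNT_RE = re.compile(r"-[A-Za-z]+(\d+)$")


def nb_sequences(header):
    """Return the 'MSA program number' found at the end of a consensus header."""
    match = MEMBER_COUNT_RE.search(header)
    if match is None:
        raise ValueError("no MSA program number in header: %r" % header)
    return int(match.group(1))


def keep_consensus(header, classification, min_sequences=10):
    """Return False for SSRs and for unclassified consensus built from too few sequences."""
    if classification == "SSR":
        return False
    if classification == "Unclassified" and nb_sequences(header) < min_sequences:
        return False
    return True


if __name__ == "__main__":
    library = [
        ("DmelChr4-B-G1-Map20", "Unclassified"),
        ("DmelChr4-B-R12-Map3", "Unclassified"),
        ("DmelChr4-B-P2-Map15", "SSR"),
    ]
    for header, classification in library:
        print(header, classification, keep_consensus(header, classification))
```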
Edit TEdenovo.cfg section [filter_consensus] if you need to change the default parameters.
When you are ready, launch the following command:
Results:
In the example, the directory "DmelChr4_Blaster_*_Map_TEclassif_Filtered" is created containing the output file "DmelChr4_denovoLibTEs_filtered.fa" which can be directly used in the TEannot pipeline.
The corresponding classif and stats files are also provided: DmelChr4_sim_denovoLibTEs_PC_filtered.classif and DmelChr4_sim_denovoLibTEs_PC_filtered.classif_stats.txt
See REPET_v3.0_OutPutsPipelines.xlsx (64.21 kB), TEdenovo Sim tab, for more details.
ADVICE: Before using the TEannot pipeline, please READ the file "TEannot_tuto.txt".
For the last step, it is useful to investigate the relationships among the de novo consensus that have been built, by grouping them into clusters (i.e. "TE families").
This step launches either Blastclust or MCL, depending on the "-f" option.
Edit TEdenovo.cfg section [cluster_consensus] if you need to change the default parameters.
When you are ready, launch the following command if you use Blastclust as clustering tool:
Results: In the example and depending on the chosen clustering method (-f option) either:
See REPET_v3.0_OutPutsPipelines.xlsx (64.21 kB), TEdenovo Sim tab, for more details.