Tutorial for TEannot included in REPET package v3.0
We advise running the TEdenovo pipeline first, but this is not compulsory: you can annotate your genome with your own TE library.
We suppose you begin by running the TEannot pipeline on the example provided in the directory "db/" rather than directly on your own genomic sequences.
Thus, from now on, the project name is "DmelChr4".
For any information about the tools mentioned below, see the README.
Setup your working environment
If you already ran the TEdenovo pipeline, you don't have to repeat all the following tasks: skip to the task where the TE library is renamed to <project_name>_refTEs.fa.
Set environment variables.
REPET_PATH gives the absolute path to the directory in which REPET has been installed (e.g. "$HOME/src/repet_pipe/").
Add the path towards REPET programs to your path:
If you want to use tools from REPET package, you will have to set some other variables.
In this case, you can set the variables in setEnv.sh, based on "$REPET_PATH/config/setEnv.sh", and source it.
Ask your system administrator for your database and scheduler information, and add it to setEnv.sh.
Create your project directory (for instance "DmelChr4_TEannot/") and go into it:
Genomic sequences to annotate
Copy the input fasta file, or create a symbolic link to it, in your project directory. It has to be named <project_name>.fa. The <project_name> must only contain letters, numbers and underscores ('_'), and be at most 15 characters long.
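The naming rule above can be checked before launching anything. The following is a minimal sketch, not part of REPET: the function name and the regular expression are illustrative assumptions derived only from the rule as stated (letters, numbers, underscore, at most 15 characters).

```python
import re

# Assumed translation of the rule: only letters, digits and '_',
# between 1 and 15 characters. This pattern is not taken from REPET itself.
PROJECT_NAME_RE = re.compile(r"[A-Za-z0-9_]{1,15}")


def is_valid_project_name(name):
    """Return True if `name` satisfies the project-name constraints."""
    return PROJECT_NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("DmelChr4", "Dmel-Chr4", "a_very_long_project_name"):
        print(candidate, is_valid_project_name(candidate))
```

With this check, "DmelChr4" is accepted, whereas a name containing "-" or longer than 15 characters is rejected.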
ADVICE: For your own project, verify the fasta file format: each nucleotide line must contain 60 bp or less.
About the sequence headers, it is highly advised to write them like this: ">XX_i", with XX standing for letters and i standing for numbers.
Please avoid spaces (" ") and symbols such as "=", ";", ":", "|"...
We recommend using the PreProcess.py tool.
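If you want to inspect a fasta file by hand before running any tool, the two pieces of advice above (lines of at most 60 bp, headers of the form ">XX_i") can be tested with a few lines of Python. This is only a sketch under stated assumptions: the function name and the header pattern are hypothetical, and it does not reproduce what PreProcess.py does.

```python
import re

# Assumed reading of the advised ">XX_i" form: letters, '_', then digits.
HEADER_RE = re.compile(r">[A-Za-z]+_[0-9]+")
MAX_LINE_LENGTH = 60


def check_fasta_lines(lines):
    """Return (line_number, message) pairs for lines that break the advice."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if not HEADER_RE.fullmatch(line):
                problems.append((number, "header is not of the form >XX_i"))
        elif len(line) > MAX_LINE_LENGTH:
            problems.append((number, "sequence line longer than 60 bp"))
    return problems


if __name__ == "__main__":
    example = [">chr_4\n", "ACGT" * 15 + "\n", ">chr 5\n", "A" * 61 + "\n"]
    for number, message in check_fasta_lines(example):
        print(f"line {number}: {message}")
```

To check a real file, pass an open file handle, for instance `check_fasta_lines(open("DmelChr4.fa"))`.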
TE library used for the annotation
Rename your TE library, in fasta format, to <project_name>_refTEs.fa.
If you already ran the TEdenovo pipeline, we advise you to use the filtered denovo TEs library, in our example:
Configuration file:
Edit "TEannot.cfg" in order to adapt it to your personal situation.
Run the pipeline
The standard output is rather self-explanatory. The programs from REPET almost always begin with the sentence "beginning of ..." and end with the sentence "... finished successfully". Each program launching another one goes on only when EXIT_SUCCESS (usually "0") is returned. Otherwise the sentence "*** Error: 'program X' returned 256" is written and the whole pipeline stops.
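The "go on only when 0 is returned" behaviour can be illustrated as follows. This is a generic sketch using the Python standard library, not REPET's actual code: the function name `run_step` is hypothetical, and the error message only imitates the sentence quoted above. Note that "256" is consistent with the raw status that a call such as os.system reports on Unix for exit code 1, whereas subprocess reports the exit code itself.

```python
import subprocess
import sys


def run_step(command):
    """Run `command` (a list of arguments) and stop if it does not return 0."""
    completed = subprocess.run(command)
    if completed.returncode != 0:
        # Imitates the pipeline message; the wording is an assumption.
        raise RuntimeError(
            f"*** Error: '{command[0]}' returned {completed.returncode}"
        )
    return completed.returncode


if __name__ == "__main__":
    # A command that succeeds, then one that fails with exit code 1.
    run_step([sys.executable, "-c", "print('finished successfully')"])
    try:
        run_step([sys.executable, "-c", "import sys; sys.exit(1)"])
    except RuntimeError as error:
        print(error)
```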
To avoid killing the main process of the pipeline when disconnecting from your session, it is highly advised to use the Unix command "nohup". This program keeps a command running even if the session is disconnected or the user logs out. For more details, read the manual ("$ man nohup"). Here is an example:
To speed up the process, jobs are launched in parallel. In each section of the configuration file, you can set the option:
Introduction
TEannot is able to annotate a genome using a library of DNA sequences. This library can be a predicted TE library built by TEdenovo.
Please have a look at the step descriptions below for command examples. Quick description of TEannot's steps:
Methodological advice
In order to obtain the best TE annotation of a genome, it is highly advised to follow this method:
Step 1 genomic sequence and data banks preparation
The first step prepares all the data banks required in the next steps.
In this step, the input genomic sequences are cut into chunks; by default, the chunk length threshold is 200kb, with a 10kb overlap.
If the length of a genomic sequence is below the threshold, it is not cut, and a chunk will never be a concatenation of two different input sequences. If you have a very high number of small sequences (e.g. 70000 input sequences of mean size 100kb), it is still advised to keep the threshold at 200kb: several chunks can be put into the same batch (the batches being launched in parallel), which keeps the number of jobs reasonable. There will be up to "chunk_length" x "min_nb_seq_per_batch" nucleotides in each batch.
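To make the chunking rule concrete, here is a minimal sketch of how chunk coordinates could be derived from a sequence length. It is an illustration only: the function names are hypothetical, and the exact coordinate convention (1-based, each chunk starting 10kb before the end of the previous one) is an assumption, not REPET's documented behaviour.

```python
def chunk_coordinates(seq_length, chunk_length=200_000, overlap=10_000):
    """Return 1-based, inclusive (start, end) chunk coordinates for one sequence."""
    if seq_length <= 0:
        raise ValueError("seq_length must be positive")
    if not 0 <= overlap < chunk_length:
        raise ValueError("overlap must be smaller than chunk_length")
    # A sequence no longer than the threshold is kept as a single chunk.
    if seq_length <= chunk_length:
        return [(1, seq_length)]
    chunks = []
    start = 1
    while True:
        end = min(start + chunk_length - 1, seq_length)
        chunks.append((start, end))
        if end == seq_length:
            return chunks
        # Assumed convention: the next chunk re-reads the last `overlap` bases.
        start = end - overlap + 1


def max_nucleotides_per_batch(chunk_length, min_nb_seq_per_batch):
    """Upper bound on batch size, as stated in the text above."""
    return chunk_length * min_nb_seq_per_batch


if __name__ == "__main__":
    print(chunk_coordinates(150_000))
    print(chunk_coordinates(450_000))
```

Under these assumptions, a 150kb sequence yields a single chunk, and a 450kb sequence yields three chunks, the first two sharing 10kb.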
In order to remove false positives, we apply an empirical statistical filter by comparing the reference TE library with the genomic sequences that have been randomized.
To enable or disable this filter, set the "make_random_chunks" parameter. At step 3 (see below), you will use either the thresholds calculated from the random chunks or your own values.
Edit "TEannot.cfg" if you need to change the default parameters in the [prepare_data] section.
You may need to change parameters in [align_refTEs_with_genome] section too, because the reference TEs library will be prepared according to the blast program you choose for step 2 (see below).
When you are ready, launch the following command :
Results:
The "DmelChr4_db/" directory is created, in which there are all the prepared data and two subdirectories "batches/" and "batches_rnd/".
In the database, it also creates tables called "DmelChr4_chr_seq", "DmelChr4_chk_seq", "DmelChr4_chk_map", "DmelChr4_refTEs_seq" and "DmelChr4_refTEs_map".
See REPET_v3.0_OutPutsPipelines.xlsx, TEannot tab, for more details.
Step 2 Align the reference TE sequences on each chunk
The second step aligns the reference TE sequences (DmelChr4_refTEs.fa) on each genomic chunk via BLASTER (high sensitivity, followed by MATCHER) and/or REPEATMASKER (cutoff at 200) and/or CENSOR (high sensitivity). For each program, you can do the same on the randomized chunks (option "-r").
Edit TEannot.cfg if you need to change the default parameters in [align_refTEs_with_genome] section :
When you are ready, launch the following commands :
In order to compute a statistical filter in step 3, you can use the randomized chunks (optional) by launching step 2 again with "-r" option:
Results:
Two directories are created, DmelChr4_TEdetect/ and DmelChr4_TEdetect_rnd/, each with three subdirectories, one per alignment program (BLR for Blaster, RM for RepeatMasker and CEN for Censor), where the results are stored. This step generates a lot of files (up to tens of GB, depending on the size of the input data bank).
See REPET_v3.0_OutPutsPipelines.xlsx, TEannot tab, for more details.
Step 3 Filter and combine the HSPs
The third step filters and combines the HSPs from step 2 to obtain the TE annotations (copies).
First, for each alignment program specified with option "-c" (by default, the 3 programs used at step 2), it determines the highest score obtained on the randomized chunks. This requires that step 2 has been launched with option "-r". More precisely, it uses the 95th percentile of the distribution of the highest scores obtained on each chunk. Then it filters the HSPs obtained on the "natural" chunks by keeping only those with a score higher than this threshold. For short input sequences, it may happen that a program (Blaster, Censor and/or RepeatMasker) doesn't find any HSP on the randomized chunks. In that case, a "Warning" is raised, a default value is used (from the configuration file) and "TEannot.py" goes on.
If you don't want to use the filter values found on the randomized chunks, you can force the usage of your own values in the configuration file ("force_default_values: yes" in the [filter] section).
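The statistical filter described above can be sketched in a few lines. This is an illustration under assumptions, not REPET's implementation: the function names and the dictionary layout of an HSP are hypothetical, and the nearest-rank definition of the percentile is one of several possible conventions.

```python
import math


def percentile_threshold(max_scores, percentile=95.0):
    """Nearest-rank percentile of the highest score found on each randomized chunk."""
    if not max_scores:
        # Mirrors the case where no HSP is found on the randomized chunks:
        # the caller should then fall back to a default value.
        raise ValueError("no score available on the randomized chunks")
    ordered = sorted(max_scores)
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def filter_hsps(hsps, threshold):
    """Keep only the HSPs whose score is strictly higher than the threshold."""
    return [hsp for hsp in hsps if hsp["score"] > threshold]


if __name__ == "__main__":
    random_max_scores = list(range(1, 21))  # one highest score per randomized chunk
    threshold = percentile_threshold(random_max_scores)
    natural_hsps = [{"score": 18}, {"score": 19}, {"score": 25}]
    print(threshold, filter_hsps(natural_hsps, threshold))
```

Forcing your own values, as described above, would amount to skipping `percentile_threshold` and passing a fixed threshold to `filter_hsps`.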
Next, for each batch, the 3 files (one per program) are concatenated and MATCHER is used to remove overlapping HSPs and to connect fragments with its short "join" procedure.
When you are ready, launch the following command:
Results:
A subdirectory is created in "DmelChr4_TEdetect/Comb", with the DmelChr4_TEannot_Matcher_path annotation file. Two MySQL tables are also created: "DmelChr4_chk_allTEs_path" and "DmelChr4_chr_allTEs_path".
See REPET_v3.0_OutPutsPipelines.xlsx, TEannot tab, for more details.
Step 4 Search for satellites
The fourth step searches for satellites in the genomic sequences via TRF, Mreps and RepeatMasker (looking only for simple repeats).
If you are not interested in satellites detection, you can skip STEP 4 and STEP 5.
Edit TEannot.cfg if you need to change the default parameters in [SSR_detect] section
When you are ready, launch the following command:
Results:
A directory is created, "DmelChr4_SSRdetect/", containing three subdirectories (TRF, Mreps and RMSSR) with the annotation results. They are also loaded into MySQL tables called "DmelChr4_chk_TRF", "DmelChr4_chk_Mreps" and "DmelChr4_chk_RMSSR".
See REPET_v3.0_OutPutsPipelines.xlsx, TEannot tab, for more details.
Step 5 Combine the SSR annotations
This step combines and merges the SSR annotations from the 3 programs used at step 4. For instance, an SSR detected by TRF with coordinates (100,500) and another detected by Mreps with coordinates (80,450) are merged into one SSR with coordinates (80,500).
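The merge in the example above corresponds to taking the union of overlapping intervals. A minimal sketch follows; the function name is hypothetical and the real pipeline works on MySQL tables, so this only illustrates the coordinate arithmetic.

```python
def merge_overlapping(intervals):
    """Merge overlapping (start, end) intervals into their union."""
    merged = []
    # Normalise each interval so that start <= end, then sort by start.
    for start, end in sorted((min(s, e), max(s, e)) for s, e in intervals):
        if merged and start <= merged[-1][1]:
            # Overlap with the previous interval: extend it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    trf = (100, 500)    # SSR detected by TRF
    mreps = (80, 450)   # SSR detected by Mreps
    print(merge_overlapping([trf, mreps]))
```

Intervals that do not overlap are left untouched, so distinct SSRs remain separate annotations.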
If you are not interested in SSR detection and annotation, you can skip steps 4 and 5.
When you are ready, launch the following command:
Results:
New MySQL tables are created, called "DmelChr4_chk_allSSRs_set" and "DmelChr4_chr_allSSRs_set".
See REPET_v3.0_OutPutsPipelines.xlsx, TEannot tab, for more details.
Step 6 Comparison with data banks
This optional step compares a data bank (nucleotides or amino acids, in fasta format, e.g. Repbase Update) with each input genomic sequence via BLASTER using tblastx or blastx, followed by MATCHER.
This step is independent from steps 2 & 3, 4 & 5 and 7.
Edit TEannot.cfg to set the data bank names in the [align_other_banks] section.
When you are ready, launch the following command:
Results: Subdirectories are created in "DmelChr4_TEdetect/bankBLR(t)x", containing the results. They are also loaded into the MySQL tables "DmelChr4_chk_bankBLR(t)x_path", "DmelChr4_chr_bankBLR(t)x_path" and "DmelChr4_bankBLR(t)x_(nt/prot)_seq".
Step 7 Remove spurious HSPs and long join procedure
This step performs successive procedures on the MySQL tables, such as the removal of duplicate TE annotations, the removal of SSR annotations included in TE annotations, and the "long join procedure" (described below).
At the end of the step, there are 2 post-treatments. First, as copies are detected by 3 methods (BLR, RM and CEN), the identity percentage is recalculated homogeneously (with NWalign) for all copies. Second, several statistics are computed on this annotation.
Because the input genomic sequences may contain large regions of heterochromatin, some TEs are expected to be nested. As a given copy can be interrupted by several other TEs inserted more recently, we expect to find distant fragments belonging to the same copy. MATCHER is used at step 3, not only to filter overlapping HSPs, but also to join them. However, it relies on a scoring scheme that, in some extreme cases (deep nesting, distant fragmentation), appears to be insufficient. Therefore we implemented a "long join procedure" aimed at recovering the joins between such fragments that MATCHER sometimes misses.
Fragments involved in nesting patterns must respect the following three constraints: (i) be co-linear; (ii) have the same age; and (iii) be separated by younger TE insertions. The identity percentage with a reference consensus sequence is used to estimate the age of a copy. Consecutive fragments on both the genome and the same reference TE are automatically joined if they respect these constraints. We call this a "nest join".
Sometimes large non-TE sequence insertions can be observed in a TE copy. They are suspected to appear by gene conversion. To deal with these cases, we also join fragments if they are separated by an insert of less than 5kb and/or less than 500bp of mismatches, and have the same age. We call this a "simple join".
Young copies are expected to keep longer fragments than old copies, because deletions accumulate with time. This provides a final control of nested patterns, based on a different assumption than the consensus nucleotide identity percentage (see above). Thus, at the end, nested TEs are split if inner TE fragments are longer than outer joined fragments. They are reported as "split".
Based on the Drosophila melanogaster genome (release 4), we chose conservative parameter settings in order to join only unambiguous cases (Bergman et al., Genome Biology 2006, 7:R112).
A "deny long join" occurs when the ages of the fragments differ by more than 2% ("join_id_tolerance" parameter). This rejection is frequent compared to the other events, highlighting the importance of this constraint (i.e. considering the age of the fragments to join). A "too long join" occurs when the fragments to be joined are more than 100kb apart. This appears to be very marginal.
A "deny nest join" occurs when either the TE coverage of the insert is not high enough (it must be above 95%, "join_TEinsert_cov" parameter) or older TEs are inserted between the fragments. This appears to occur rarely.
Some "simple joins" are performed, but their number remains low compared to the number of fragments treated. This is a consequence of the efficiency of the MATCHER join, indicating that a "simple join" is needed only rarely. The same conclusion can be drawn for "splits". One could set the parameters to less conservative values and thus obtain more "long joins", but we felt that these cases would be too ambiguous and we preferred to keep our results conservative.
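The rejection rules for nested fragments can be summarised as a small decision function. This is a deliberately simplified sketch, not the algorithm implemented in REPET: the function name, the argument names, the `max_distance` name and the order in which the tests are applied are all assumptions; only "join_id_tolerance", "join_TEinsert_cov" and the 2%, 100kb and 95% values come from the text. The "simple join" and "split" cases are not modelled.

```python
def nest_join_decision(identity_a, identity_b, distance, insert_te_coverage,
                       older_te_in_insert, join_id_tolerance=2.0,
                       max_distance=100_000, join_TEinsert_cov=95.0):
    """Classify a candidate join between two co-linear fragments of the same TE.

    Identities and coverage are percentages; distance is in bp.
    """
    # The identity to the consensus is used as a proxy for the age of a copy.
    if abs(identity_a - identity_b) > join_id_tolerance:
        return "deny long join"
    if distance > max_distance:
        return "too long join"
    # The insert must be covered by TEs above the threshold, all of them younger.
    if insert_te_coverage <= join_TEinsert_cov or older_te_in_insert:
        return "deny nest join"
    return "nest join"


if __name__ == "__main__":
    print(nest_join_decision(98.0, 97.0, 20_000, 99.0, False))
    print(nest_join_decision(98.0, 95.0, 20_000, 99.0, False))
```

Lowering `join_id_tolerance` or raising `join_TEinsert_cov` in this sketch makes the joins more conservative, which is the direction chosen for the default settings.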
Edit TEannot.cfg if you need to change the default parameters in the [annot_processing] section.
When you are ready, launch the following command:
Results:
In the database, several tables are created: "DmelChr4_chk_allTEs_nr_path", "DmelChr4_chk_allTEs_nr_noSSR_path" and finally "DmelChr4_chr_allTEs_nr_noSSR_join_path".
There is also the "DmelChr4_chr_allTEs_nr_noSSR_join_path_align" table, with the new identity percentage of all copies, used in step 8 below.
See REPET_v3.0_OutPutsPipelines.xlsx, TEannot tab, for more details.
Step 8 Export annotations
This step exports annotations from the final MySQL table to gameXML or GFF3 format. These two annotation formats can be imported into Apollo and GBrowse, respectively.
Further details are available on the web:
Edit "TEannot.cfg" if you need to change the default parameters in the [export] section.
When you are ready, launch one of the following commands:
Results:
A directory is created, "DmelChr4_GFF3" or "DmelChr4_gameXML", containing the annotation files, one per sequence (chromosome or chunk).
See REPET_v3.0_OutPutsPipelines.xlsx, TEannot tab, for more details.