TEannot tutorial
Tutorial for TEannot, included in the REPET package v3.0
We advise running the TEdenovo pipeline first, but it is not compulsory: you can annotate your genome with your own TE library.
We assume you begin by running the TEannot pipeline on the example provided in the directory "db/" rather than directly on your own genomic sequences.
Thus, from now on, the project name is "DmelChr4".
For any information about the tools mentioned below, see the README.
Set up your working environment
If you have already run the TEdenovo pipeline, you do not have to repeat all the following tasks: skip directly to the point where your TE library is renamed <project_name>_refTEs.fa.
Set environment variables.
REPET_PATH gives the absolute path to the directory in which REPET has been installed (e.g. "$HOME/src/repet_pipe/").
- export REPET_PATH=$HOME/src/repet_pipe/
Add the path to the REPET programs to your PATH:
- export PATH=$REPET_PATH/bin:...:$PATH
If you want to use tools from the REPET package, you will have to set some other variables.
In this case, you can set the variables in setEnv.sh, based on "$REPET_PATH/config/setEnv.sh", and source it.
Ask your system administrator for your database and scheduler information and add it to setEnv.sh.
Create your project directory (for instance "DmelChr4_TEannot/") and go into it:
- cd $HOME/work/
- mkdir DmelChr4_TEannot
- cd DmelChr4_TEannot
Genomic sequences to annotate
Copy the input fasta file into your project directory, or create a symbolic link to it there. It has to be named <project_name>.fa. The <project_name> must contain only letters, numbers and underscores ('_'), and be at most 15 characters long. In our example:
- ln -s $REPET_PATH/db/DmelChr4.fa .
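The naming rule above can be checked before launching anything. The following is a minimal sketch, not part of REPET: the function names are hypothetical, and it only encodes the constraints stated in this tutorial (letters, numbers, underscores, at most 15 characters).

```python
import re

# Rule quoted in this tutorial: letters, digits and underscores only,
# at most 15 characters long.
PROJECT_NAME_RE = re.compile(r"[A-Za-z0-9_]{1,15}")


def is_valid_project_name(name):
    """Return True if 'name' respects the project name constraints."""
    return PROJECT_NAME_RE.fullmatch(name) is not None


def genome_file_name(name):
    """Return the file name expected for the genomic sequences."""
    return name + ".fa"
```

For instance, "DmelChr4" is accepted, whereas a name containing a hyphen or longer than 15 characters is rejected.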
ADVICE: For your own project, check the fasta file format: each sequence line should contain at most 60 bp.
For the sequence headers, it is highly advised to write them as ">XX_i", with XX standing for letters and i standing for numbers.
Please avoid spaces (" ") and symbols such as "=", ";", ":", "|"...
We recommend using the PreProcess.py tool.
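If you want a quick look at your own fasta file before using PreProcess.py, the advice above can be sketched as a small check. This is only an illustration under our own assumptions (the function name and the messages are hypothetical), not a replacement for PreProcess.py, whose behaviour is not described here.

```python
import re

# ">XX_i" pattern advised above: letters, an underscore, then digits.
HEADER_RE = re.compile(r">[A-Za-z]+_[0-9]+")
MAX_LINE_LENGTH = 60


def check_fasta(lines):
    """Return a list of (line_number, message) for lines breaking the advice."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # ignore empty lines
        if line.startswith(">"):
            if HEADER_RE.fullmatch(line) is None:
                problems.append((number, "header does not look like >XX_i"))
        elif len(line) > MAX_LINE_LENGTH:
            problems.append((number, "sequence line longer than 60 bp"))
    return problems
```

The function accepts any iterable of lines, so an open file handle can be passed directly, e.g. "check_fasta(open('DmelChr4.fa'))".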
TE library used to annotate
Rename your TE library, in fasta format, to <project_name>_refTEs.fa.
If you have already run the TEdenovo pipeline, we advise you to use the filtered de novo TE library. In our example:
- ln -s $HOME/work/DmelChr4_TEdenovo/DmelChr4_Blaster_GrpRecPil_Map_TEclassif_Filtered/DmelChr4_denovoLibTEs_filtered DmelChr4_refTEs.fa
Configuration file:
- cp $REPET_PATH/config/TEannot.cfg .
Edit "TEannot.cfg" in order to adapt it to your personal situation.
- In the section [repet_env], indicate (ask your system administrator):
  - the host name of your MySQL database
  - your MySQL login
  - your MySQL password
  - the name of your MySQL database
  - the name of the job manager running on the computing cluster you are using ("SGE" or "TORQUE")
- In the section [project], indicate:
  - the name of your project (here: DmelChr4)
  - the absolute path to your project directory (here: $HOME/work/DmelChr4_TEannot)
Run the pipeline
The standard output is rather self-explanatory. The REPET programs almost always begin with the sentence "beginning of ..." and end with the sentence "... finished successfully". Each program that launches another one continues only when EXIT_SUCCESS (usually "0") is returned. Otherwise the sentence "*** Error: 'program X' returned 256" is written and the whole pipeline stops.
To avoid killing the main process of the pipeline when you disconnect from your session, it is highly advised to use the Unix command "nohup". This program keeps a command running even if the session is disconnected or the user logs out. For more details, read the manual ("$ man nohup"). Here is an example:
- nohup TEannot.py -P ... -S 1 >& step1.txt &
To speed up the process, jobs are launched in parallel. In each section of the configuration file, you can set the following options:
- Resources (optional): according to your data, you may need some specific resources (e.g. "mem_free=8G" if you need 8G of memory per job).
- tmpDir (optional): according to the computing cluster, give the name of the temporary directory of the nodes (e.g. "/scratch"). WARNING: if you leave this parameter empty (the default), do not use 'yes' for the copy parameter described below.
- copy {yes|no} (default: no): if 'yes', the genomic sequence is copied into the tmpDir specified previously (for now, it is only used in step 2). WARNING: specifying 'yes' improves computing performance ONLY if you specified a tmpDir, and if this tmpDir is a directory on the computing nodes (e.g. "/scratch"). You also have to make sure that neither a password nor a passphrase is required to connect to the computing nodes from the submission node.
- clean {yes|no} (default: yes): cleaning of the temporary files
- all the parameters used below are set in the TEannot.cfg file with their default values
Introduction
TEannot annotates a genome using a library of DNA sequences. This library can be a predicted TE library built by TEdenovo.
Please see the step descriptions below for command examples. Quick description of the TEannot steps:
- Step 1: Cut the genomic sequences into batches and prepare shuffled batches.
- Step 2: Align the TE library on the genomic sequences using 3 methods.
- Step 3: Combine and filter the results of the 3 methods used at step 2.
- Step 4: Look for SSR sequences using 3 methods.
- Step 5: Merge the SSR annotations of the 3 methods used at step 4.
- Step 6: Compare with data banks (nucleotides or amino-acids, in fasta format, e.g. Repbase Update).
- Step 7: Chain TE fragments and manage nested TE copies.
- Step 8: Export the TE annotations.
Methodological advice
In order to obtain the best TE annotation of a genome, it is highly advised to proceed as follows:
- First, run a quick TEannot, using only steps 1, 2, 3 and 7. You can use the output multifasta file from the TEdenovo pipeline as TE library.
- Then, select only the consensus with a Full Length Copy (FLC) or a Full Length Fragment (FLF). These consensus have at least one perfect match in the genome and are considered as validated consensus. We call this library the "validated TE library".
- Run another, complete TEannot (steps 1, 2, 3, 4, 5, 7 and 8; step 6 is optional) on your original genome using this validated TE library.
Step 1 Genomic sequence and data banks preparation
The first step prepares all the data banks required in the next steps.
- Cut the input genomic sequences into chunks and load them in MySQL tables.
- Randomize the chunks by shuffling them while preserving both mono- and di-symbol composition, and load them in a MySQL table.
- Load the reference TE library (e.g. from the TEdenovo pipeline) in a MySQL table and prepare it for Blaster (blastn).
In this step, the input genomic sequences are cut into chunks. By default, the length threshold is 200 kb, with an overlap of 10 kb.
If the length of a genomic sequence is below the threshold, it is kept as a single chunk, i.e. a chunk will never be a concatenation of two different input sequences. If you have a very high number of small sequences (e.g. 70000 input sequences of mean size 100 kb), it is still advised to keep the threshold at 200 kb: since several chunks can be put into the same batch (the batches being launched in parallel), the number of jobs remains reasonable. There will be up to "chunk_length" x "min_nb_seq_per_batch" nucleotides in each batch.
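To get a feeling for how many chunks a sequence will produce, the cutting can be sketched as follows. This is a simplified illustration under our own assumption that each chunk starts "chunk_overlap" bases before the end of the previous one; the exact coordinates computed by REPET may differ, and the function name is hypothetical.

```python
def chunk_coordinates(seq_length, chunk_length=200000, chunk_overlap=10000):
    """Return 1-based inclusive (start, end) pairs covering a sequence.

    Assumption: consecutive chunks share exactly 'chunk_overlap' bases.
    """
    if chunk_overlap >= chunk_length:
        raise ValueError("chunk_overlap must be smaller than chunk_length")
    if seq_length <= chunk_length:
        # A short sequence is kept as a single chunk.
        return [(1, seq_length)]
    chunks = []
    start = 1
    while True:
        end = min(start + chunk_length - 1, seq_length)
        chunks.append((start, end))
        if end == seq_length:
            break
        start = end - chunk_overlap + 1
    return chunks
```

With the default values, a 400 kb sequence gives three chunks, (1, 200000), (190001, 390000) and (380001, 400000), and a batch of 5 chunks holds at most 5 x 200000 = 1000000 nucleotides.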
In order to remove false positives, we apply an empirical statistical filter by comparing the reference TE library with the genomic sequences that have been randomized.
Use the "make_random_chunks" parameter to enable or disable this filter. At step 3, you will use either the threshold calculated from the random chunks or your own values (see below).
Edit "TEannot.cfg" if you need to change the default parameters in the [prepare_data] section:
- length threshold ("chunk_length: 200000")
- overlap length ("chunk_overlap: 10000")
- number of chunks per batch launched in parallel ("min_nb_seq_per_batch: 5")
- Set "make_random_chunks: yes" to use the statistical filter based on the randomized chunks. If you do not want to use this filter, set "make_random_chunks: no"; you will then use your own filtering values at step 3 (see below).
You may need to change parameters in [align_refTEs_with_genome] section too, because the reference TEs library will be prepared according to the blast program you choose for step 2 (see below).
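If you prefer to inspect or test parameter values from a script rather than by eye, the "key: value" layout of the [prepare_data] section can be read with Python's standard configparser module. The sketch below is only an illustration: the example text contains just the three options quoted above, the function name is hypothetical, and a real TEannot.cfg contains many more sections and options.

```python
import configparser

# Minimal example text, limited to options quoted in this tutorial.
EXAMPLE_CFG = """\
[prepare_data]
chunk_length: 200000
chunk_overlap: 10000
make_random_chunks: yes
"""


def override_prepare_data(cfg_text, **overrides):
    """Parse cfg_text and override options of the [prepare_data] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(cfg_text)
    for option, value in overrides.items():
        parser.set("prepare_data", option, str(value))
    return parser
```

Note that writing the parser back to disk with its write() method would use "=" as separator by default, so this sketch is meant for reading and checking values, not for regenerating the configuration file.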
When you are ready, launch the following command:
- TEannot.py -P DmelChr4 -C TEannot.cfg -S 1
Results:
The "DmelChr4_db/" directory is created. It contains all the prepared data and two subdirectories, "batches/" and "batches_rnd/".
In the database, the tables "DmelChr4_chr_seq", "DmelChr4_chk_seq", "DmelChr4_chk_map", "DmelChr4_refTEs_seq" and "DmelChr4_refTEs_map" are also created.
See REPET_v3.0_OutPutsPipelines.xlsx (TEannot tab) for more details.
Step 2 Align the reference TE sequences on each chunk
The second step aligns the reference TE sequences (DmelChr4_refTEs.fa) on each genomic chunk via BLASTER (high sensitivity, followed by MATCHER) AND/OR REPEATMASKER (cutoff at 200) AND/OR CENSOR (high sensitivity). For each program, you can do the same on the randomized chunks (option "-r").
Edit TEannot.cfg if you need to change the default parameters in the [align_refTEs_with_genome] section:
- Blast used by BLASTER: "BLR_blast: wu" for WU-BLAST, "ncbi" for NCBI BLAST or "blastplus" for NCBI BLAST+.
- BLASTER sensitivity: "BLR_sensitivity: 2". By decreasing the BLASTER sensitivity, from 4 to 0, you will annotate fewer but more specific copies.
- Engine used by RepeatMasker: "RM_engine: wu" for WU-BLAST, "cm" for crossmatch or "ncbi" for NCBI BLAST.
- RepeatMasker sensitivity: "RM_sensitivity: s". By decreasing the sensitivity, from s to qq, you will annotate fewer but more specific copies.
- Blast used by CENSOR: "CEN_blast: wu" for WU-BLAST or "ncbi" for NCBI BLAST.
When you are ready, launch the following commands:
- TEannot.py -P DmelChr4 -C TEannot.cfg -S 2 -a BLR
- TEannot.py -P DmelChr4 -C TEannot.cfg -S 2 -a RM
- TEannot.py -P DmelChr4 -C TEannot.cfg -S 2 -a CEN
In order to compute a statistical filter in step 3, you can use the randomized chunks (optional) by launching step 2 again with "-r" option:
- TEannot.py -P DmelChr4 -C TEannot.cfg -S 2 -a BLR -r
- TEannot.py -P DmelChr4 -C TEannot.cfg -S 2 -a RM -r
- TEannot.py -P DmelChr4 -C TEannot.cfg -S 2 -a CEN -r
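The six commands above differ only by the alignment program and the "-r" option, so they can be generated from a script. The following is a minimal sketch using only the options shown in this tutorial; the function names are hypothetical, and running the commands one after the other (rather than submitting them in parallel) is our own choice for the example.

```python
import subprocess


def step2_commands(project, config, programs=("BLR", "RM", "CEN"), randomized=False):
    """Build the step 2 command lines, one per alignment program."""
    commands = []
    for program in programs:
        command = ["TEannot.py", "-P", project, "-C", config,
                   "-S", "2", "-a", program]
        if randomized:
            command.append("-r")  # align on the randomized chunks
        commands.append(command)
    return commands


def run_all(commands):
    """Run the commands sequentially and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

For instance, "run_all(step2_commands('DmelChr4', 'TEannot.cfg'))" followed by the same call with "randomized=True" reproduces the six commands listed above.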
Results:
Two directories are created, DmelChr4_TEdetect/ and DmelChr4_TEdetect_rnd/, each with three subdirectories, one per alignment program (BLR for Blaster, RM for RepeatMasker and CEN for Censor), where the results are stored. This step generates a lot of files (up to tens of GB, depending on the size of the input data bank).
See REPET_v3.0_OutPutsPipelines.xlsx (TEannot tab) for more details.
Step 3 Filter and combine the HSPs
The third step filters and combines the HSPs from step 2 to obtain the TE annotations (copies).
First, for each alignment program specified with the option "-c" (by default, the 3 programs used at step 2), it determines the highest score obtained on the randomized chunks. Of course, this requires that step 2 has been launched with the option "-r". More precisely, it uses the 95th percentile of the distribution of the highest scores obtained on each chunk. Then it filters the HSPs obtained on the "natural" chunks by keeping only the ones with a score higher than this threshold. For short input sequences, it may happen that a program (Blaster, Censor and/or RepeatMasker) does not find any HSP on the randomized chunks. In that case, a "Warning" is raised, a default value is used (from the configuration file) and "TEannot.py" goes on.
If you do not want to use the filter values found on the randomized chunks, you can force the use of your own values in the configuration file ("force_default_values: yes" in the [filter] section).
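The threshold computation described above can be sketched as follows. This is an illustration only, not the REPET code: the function names and the data layout (a dictionary of scores per randomized chunk) are hypothetical, and the nearest-rank definition of the percentile is our own assumption, since the tutorial does not say which definition is used.

```python
def percentile_nearest_rank(values, percent):
    """Nearest-rank percentile of a non-empty list ('percent' is an integer)."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    # Integer ceiling of percent * n / 100, at least 1.
    rank = max(1, (percent * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def score_threshold(random_scores_by_chunk, default_threshold, percent=95):
    """Threshold from the highest score found on each randomized chunk.

    Falls back on 'default_threshold' when no HSP was found at all.
    """
    highest = [max(scores) for scores in random_scores_by_chunk.values() if scores]
    if not highest:
        return default_threshold
    return percentile_nearest_rank(highest, percent)


def filter_hsps(hsps, threshold):
    """Keep only the HSPs whose score is strictly above the threshold."""
    return [hsp for hsp in hsps if hsp["score"] > threshold]
```

The fallback on a default value mirrors the "Warning" case mentioned above, where a program finds no HSP on the randomized chunks.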
Next, for each batch, the 3 files (each from a different program) are concatenated and MATCHER is used to remove overlapping HSPs and make connections with the little "join" procedure.
When you are ready, launch the following command:
- TEannot.py -P DmelChr4 -C TEannot.cfg -S 3 -c BLR+RM+CEN
Results:
A subdirectory is created in "DmelChr4_TEdetect/Comb" with the DmelChr4_TEannot_Matcher_path annotation file. Two MySQL tables are also created: "DmelChr4_chk_allTEs_path" and "DmelChr4_chr_allTEs_path".
See REPET_v3.0_OutPutsPipelines.xlsx (TEannot tab) for more details.
Step 4 SSR detection
The fourth step searches for satellites on the genomic sequences via TRF, Mreps and RepeatMasker (looking only for simple repeats).
If you are not interested in satellite detection, you can skip steps 4 and 5.
Edit TEannot.cfg if you need to change the default parameters in the [SSR_detect] section:
- "RMSSR_engine": engine used by RepeatMasker, "wu" or "cm" (crossmatch)
- "TRFmaxPeriod: 15": maximum period size of the tandem repeats reported by TRF
When you are ready, launch the following commands:
- TEannot.py -P DmelChr4 -C TEannot.cfg -S 4 -s TRF
- TEannot.py -P DmelChr4 -C TEannot.cfg -S 4 -s Mreps
- TEannot.py -P DmelChr4 -C TEannot.cfg -S 4 -s RMSSR
Results:
A directory is created, "DmelChr4_SSRdetect/", containing three subdirectories (TRF, Mreps and RMSSR) with the annotation results. They are also loaded into MySQL tables called "DmelChr4_chk_TRF", "DmelChr4_chk_Mreps" and "DmelChr4_chk_RMSSR".
See REPET_v3.0_OutPutsPipelines.xlsx (TEannot tab) for more details.
Step 5 Merge the SSR annotations
This step combines and merges the SSR annotations from the 3 programs used at step 4. For instance, an SSR detected by TRF with coordinates (100,500) and another detected by Mreps with coordinates (80,450) are merged into one SSR with coordinates (80,500).
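The merge illustrated by this example amounts to a classical union of overlapping intervals. Here is a minimal sketch, assuming SSRs are given as (start, end) pairs on the same sequence; the function name is hypothetical and the actual REPET implementation may apply additional rules.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals located on the same sequence."""
    merged = []
    # Normalize each pair so that start <= end, then sort by start.
    for start, end in sorted((min(s, e), max(s, e)) for s, e in intervals):
        if merged and start <= merged[-1][1]:
            # Overlap with the previous interval: extend it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Calling "merge_intervals([(100, 500), (80, 450)])" returns "[(80, 500)]", as in the example above, whereas intervals that do not overlap are left untouched.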
If you are not interested in SSR detection and annotation, you can skip steps 4 and 5.
When you are ready, launch the following command:
- TEannot.py -P DmelChr4 -C TEannot.cfg -S 5
Results:
New MySQL tables are created, called "DmelChr4_chk_allSSRs_set" and "DmelChr4_chr_allSSRs_set".
See REPET_v3.0_OutPutsPipelines.xlsx (TEannot tab) for more details.
Step 6 Comparison with data banks
This optional step compares a data bank (nucleotides or amino-acids, in fasta format, e.g. Repbase Update) with each input genomic sequence via BLASTER, using tblastx (nucleotide bank) or blastx (amino-acid bank), followed by MATCHER.
This step is independent from steps 2 and 3, 4 and 5, and 7.
Edit TEannot.cfg to set the data bank names in the [align_other_banks] section:
- To compare with a nucleotide bank, set "bankBLRtx: <nucleotide_sequences_bank_name>"
- To compare with an amino-acid bank, set "bankBLRx: <amino-acids_sequences_bank_name>"
When you are ready, launch the following commands:
- TEannot.py -P DmelChr4 -C TEannot.cfg -S 6 -b tblastx
- TEannot.py -P DmelChr4 -C TEannot.cfg -S 6 -b blastx
Results:
Subdirectories are created in "DmelChr4_TEdetect/bankBLR(t)x", containing the results. They are also loaded in the MySQL tables "DmelChr4_chk_bankBLR(t)x_path", "DmelChr4_chr_bankBLR(t)x_path" and "DmelChr4_bankBLR(t)x_(nt/prot)_seq".
Step 7 Remove spurious HSPs and long join procedure
This step performs successive procedures on the MySQL tables, such as the removal of duplicated TE annotations, the removal of SSR annotations included in TE annotations and the "long join procedure" (described below).
At the end of the step, there are 2 post-treatments. First, as the copies are detected by 3 methods (BLR, RM and CEN), the identity percentage is recalculated homogeneously (with NWalign) for all the copies. Second, several statistics are computed on this annotation.
Because the input genomic sequences may contain large regions of heterochromatin, some TEs are expected to be nested. As a given copy can be interrupted by several other TEs inserted more recently, we expect to find distant fragments belonging to the same copy. MATCHER is used at step 3, not only to filter overlapping HSPs, but also to join them. However, it relies on a scoring scheme that, in some extreme cases (deep nesting, distant fragmentation), appears to be insufficient. Therefore we implemented a "long join procedure" aimed at recovering the joins between fragments that are sometimes missed by MATCHER.
Fragments involved in nesting patterns must respect the three following constraints: (i) be co-linear; (ii) have the same age; and (iii) be separated by younger TE insertions. The identity percentage with a reference consensus sequence is used to estimate the age of a copy. Consecutive fragments on both the genome and the same reference TE are automatically joined if they respect these constraints. We call this a "nest join". Sometimes large non-TE sequence insertions can be observed in a TE copy. They are suspected to appear by gene conversion. In order to deal with these cases, we also join fragments if they are separated by an insert of less than 5kb and/or less than 500bp of mismatches, and have the same age. We call this a "simple join".
Young copies are expected to keep longer fragments than old copies, because deletions accumulate with time. This provides a final control of the nested patterns, based on a different assumption than the consensus nucleotide identity percentage (see above). Thus, at the end, nested TEs are split if the inner TE fragments are longer than the outer joined fragments. They are reported as "split".
Based on the Drosophila melanogaster genome (release 4), we chose conservative parameter settings in order to join only unambiguous cases (Bergman et al., Genome Biology 2006,7:R112).
A "deny long join" occurs when the ages of the fragments differ by more than 2% ("join_id_tolerance" parameter). This rejection is frequent compared to the other events, which highlights the importance of this constraint (i.e. considering the age of the fragments to join). A "too long join" occurs when the fragments to be joined are more than 100kb apart. This appears to be very marginal.
A "deny nest join" occurs when either the TE coverage of the insert is not high enough (it must be >95%, "join_TEinsert_cov" parameter) or older TEs are inserted. This appears to occur rarely.
Some "simple joins" are performed, but their number remains low compared to the number of fragments treated. This is a consequence of the efficiency of the MATCHER join, indicating that a "simple join" is only rarely needed. The same conclusion can be drawn for "splits". One could set the parameters to less conservative values and thus obtain more "long joins", but we felt that these cases could be too ambiguous and we preferred to keep our results conservative.
Edit TEannot.cfg if you need to change the default parameters in the [annot_processing] section:
- Copies with a length below "min_size" bp are removed ("min_size: 20").
- If the distance between two fragments exceeds "join_max_gap_size: 5000", the fragments are not connected.
- If the mismatch length (bp) between two fragments (in the dynamic programming algorithm, see Quesneville et al. 2005) exceeds "join_max_mismatch_size: 500", the fragments are not connected.
- If the age difference between two fragments (identity percentage) exceeds "join_id_tolerance: 2", the fragments are not connected.
- If the distance between two fragments exceeds "join_max_gap_size" and if at least "join_TEinsert_cov: 0.95" (i.e. 95%) of the genomic sequence between the fragments is composed of younger TEs, the fragments are connected.
- If the size (bp) of the overlap between two fragments exceeds "join_overlap: 15", the fragments are not connected.
- If the nested TE is older than the flanking fragments but its size exceeds "join_minlength_split: 100", the fragments are not connected.
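To make the interplay between these parameters more concrete, here is a simplified sketch of the two join decisions. It is not the REPET implementation: the function and argument names are hypothetical, we assume that all the conditions must hold together, the 100kb limit is the "too long join" distance quoted above, and co-linearity, overlaps and splits are ignored.

```python
def same_age(identity_a, identity_b, id_tolerance=2.0):
    """Two fragments have the 'same age' if their identities are close enough."""
    return abs(identity_a - identity_b) <= id_tolerance


def simple_join_allowed(gap_size, mismatch_size, identity_a, identity_b,
                        max_gap_size=5000, max_mismatch_size=500,
                        id_tolerance=2.0):
    """Sketch of a 'simple join': small gap, few mismatches, same age."""
    return (gap_size <= max_gap_size
            and mismatch_size <= max_mismatch_size
            and same_age(identity_a, identity_b, id_tolerance))


def nest_join_allowed(gap_size, younger_te_coverage, identity_a, identity_b,
                      max_gap_size=5000, min_te_insert_cov=0.95,
                      id_tolerance=2.0, max_long_join=100000):
    """Sketch of a 'nest join': large gap mostly filled by younger TEs."""
    return (max_gap_size < gap_size <= max_long_join
            and younger_te_coverage >= min_te_insert_cov
            and same_age(identity_a, identity_b, id_tolerance))
```

For instance, two fragments with 95% and 96% identity separated by 20kb are joined by the second function only if at least 95% of the insert is covered by younger TEs.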
When you are ready, launch the following command:
- TEannot.py -P DmelChr4 -C TEannot.cfg -S 7
Results:
In the database, several tables are created: "DmelChr4_chk_allTEs_nr_path", "DmelChr4_chk_allTEs_nr_noSSR_path" and finally "DmelChr4_chr_allTEs_nr_noSSR_join_path".
There is also the "DmelChr4_chr_allTEs_nr_noSSR_join_path_align" table, with the new identity percentage of all the copies, used in step 8 below.
See REPET_v3.0_OutPutsPipelines.xlsx (TEannot tab) for more details.
Step 8 TE annotation export
This step exports the annotations from the final MySQL table to gameXML or GFF3 format. These two annotation formats can be imported in Apollo and GBrowse, respectively.
Further details are available on the web:
- gameXML: http://www.fruitfly.org/annot/gamexml.dtd.txt
- GFF3: http://www.sequenceontology.org/gff3.shtml
- Apollo: http://gmod.org/wiki/index.php/Apollo
- JBrowse: https://jbrowse.org/
- GBrowse: http://gmod.org/wiki/index.php/Gbrowse
Edit "TEannot.cfg" if you need to change the default parameters in the [export] section.
- To export the annotations on the input sequences, set "sequences: chromosomes"; to export them on the chunks, set "sequences: chunks".
- To add the SSR annotations, set "add_SSRs: yes". To add the annotations found via tblastx, set "add_tBx: yes", and via blastx, set "add_Bx: yes" (assuming you launched step 6 before).
- To remove the overlap between 2 copies from the same consensus, set "rmv_overlapping_annotations: yes".
- To keep the gff3 files corresponding to the input genomic sequences without TE annotation, set "keep_gff3_files_without_annotations: yes". In this case, the corresponding gff3 file will be empty unless "gff3_with_genomic_sequence" is set to "yes".
- In the gff3 file, to merge redundant matches (same start, same end, same score and on the same sequence), set "gff3_merge_redundant_features: yes". The names of the other consensus are tagged 'other targets' in the attributes field (ninth field).
- To generate a match part for each match, set "gff3_compulsory_match_part: yes".
- To add the annotated genomic sequence at the end of the gff3 files, set "gff3_with_genomic_sequence: yes".
- To add the TE length in the attributes field (ninth field) for each match, set "gff3_with_TE_length: yes".
- To add the TE classification information in the attributes field (ninth field) with the tag "TargetDescription" for each match, set "gff3_with_classif_info: yes" and give the name of the TE table by setting "classif_table_name: <name_of_TEs_table>" (default if empty: "<project_name>_consensus_classif" from TEdenovo).
- To get gff3 files compatible with a chado database, set "gff3_chado: yes".
- If you set "drop_tables: yes", be careful because all the MySQL tables will be deleted. Do it only if you are sure you do not need them anymore.
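The effect of "gff3_merge_redundant_features" described above can be pictured with the following sketch. It only illustrates the grouping rule (same sequence, start, end and score); the dictionary keys, the function name and the consensus names in the usage note are hypothetical, and the exact way REPET writes the 'other targets' tag in the gff3 file is not reproduced here.

```python
def merge_redundant_matches(matches):
    """Merge matches sharing the same sequence, start, end and score.

    Each match is a dict with the keys 'seq', 'start', 'end', 'score' and
    'target' (consensus name). The first match of each group is kept and the
    names of the other consensus are collected in 'other_targets'.
    """
    merged = {}
    for match in matches:
        key = (match["seq"], match["start"], match["end"], match["score"])
        if key not in merged:
            merged[key] = dict(match, other_targets=[])
        else:
            kept = merged[key]
            target = match["target"]
            if target != kept["target"] and target not in kept["other_targets"]:
                kept["other_targets"].append(target)
    return list(merged.values())
```

For instance, two matches of the hypothetical consensus "TE_A" and "TE_B" at the same position and with the same score are reported as a single match on "TE_A", with "TE_B" listed as other target.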
When you are ready, launch one of the following commands:
- TEannot.py -P DmelChr4 -C TEannot.cfg -S 8 -o GFF3
- TEannot.py -P DmelChr4 -C TEannot.cfg -S 8 -o gameXML
Results:
A directory is created, "DmelChr4_GFF3" or "DmelChr4_gameXML", containing the annotation files, one per sequence (chromosome or chunk).
See REPET_v3.0_OutPutsPipelines.xlsx (TEannot tab) for more details.