TEdenovo tuto
Tutorial for TEdenovo included in REPET package v3.0
We recommend you start by running the TEdenovo pipeline on the example provided in the directory "db/" rather than directly on your own genomic sequences.
Thus, from now on, the project name is "DmelChr4".
For information about the tools mentioned below, see the README.
Set up your working environment
Set environment variables
REPET_PATH gives the absolute path to the directory in which REPET has been installed (e.g. "$HOME/src/repet_pipe/").
- export REPET_PATH=$HOME/src/repet_pipe/
Add the path towards REPET programs to your path:
- export PATH=$REPET_PATH/bin:...:$PATH
If you want to use tools from the REPET package, you will have to set some other variables.
In this case, set them in a setEnv.sh file based on "$REPET_PATH/config/setEnv.sh", and source it.
Ask your system administrator for your database and scheduler information and add it to setEnv.sh.
Create your project directory (for instance "DmelChr4_TEdenovo/") and go into it:
- cd $HOME/work/
- mkdir DmelChr4_TEdenovo
- cd DmelChr4_TEdenovo
Genomic sequence
Copy the input fasta file, or create a symbolic link to it, in your project directory. It has to be named <project_name>.fa. The <project_name> must contain only letters, numbers and underscores ('_'), and be at most 15 characters long. In our example:
- ln -s $REPET_PATH/db/DmelChr4.fa
ADVICE: For your own project, check the fasta file format: each line of nucleotides should contain 60 bp or fewer.
It is highly advised to write the sequence headers as ">XX_i", with XX standing for letters and i standing for numbers.
Please avoid spaces (" ") and symbols such as "=", ";", ":", "|"...
We recommend using the PreProcess.py tool.
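The naming and formatting rules above can be checked before launching anything. The following is a minimal sketch, not a replacement for PreProcess.py: the function names are hypothetical, and the set of forbidden header characters contains only the symbols explicitly listed above (the list in the text is open-ended).

```python
import re

# Rule from the text: letters, digits and underscore only, at most 15 characters.
PROJECT_NAME_RE = re.compile(r"[A-Za-z0-9_]{1,15}")
# Only the characters explicitly named in the text; the real list may be longer.
FORBIDDEN_IN_HEADER = set(' =;:|')


def is_valid_project_name(name):
    """True if the name follows the <project_name> rule."""
    return PROJECT_NAME_RE.fullmatch(name) is not None


def check_fasta_lines(lines, max_width=60):
    """Return a list of (line_number, problem) tuples for fasta lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith(">"):
            if FORBIDDEN_IN_HEADER & set(line[1:]):
                problems.append((number, "forbidden character in header"))
        elif len(line) > max_width:
            problems.append((number, "sequence line longer than %d" % max_width))
    return problems
```

For instance, is_valid_project_name("DmelChr4") is true, while a name containing a hyphen or longer than 15 characters is rejected.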
Configuration file:
Copy $REPET_PATH/config/TEdenovo.cfg into your project directory:
- cp $REPET_PATH/config/TEdenovo.cfg .
Edit "TEdenovo.cfg" to adapt it to your situation. The adjustable parameters described below are organized in several [sections] corresponding to the pipeline steps.
- In [repet_env], indicate (ask your system administrator):
  - repet_host: the host name of your MySQL database
  - repet_user: your MySQL login
  - repet_pw: your MySQL password
  - repet_db: your MySQL database name
  - repet_port: your MySQL port
  - repet_job_manager: the name of the job manager running on your computer or cluster, e.g. Slurm, SGE, TORQUE or PBS
- In [project], indicate:
  - project_name: the name of your project (here: DmelChr4)
  - project_dir: the absolute path to your project directory (here: $HOME/work/DmelChr4_TEdenovo)
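Before running the pipeline, it can help to confirm that none of these options was left empty. The sketch below assumes that TEdenovo.cfg can be read by Python's standard configparser module (the "key: value" layout shown in this tutorial is compatible with it); the helper name missing_options is hypothetical and is not part of REPET.

```python
import configparser

# Options this tutorial asks you to fill in, grouped by section.
REQUIRED = {
    "repet_env": ["repet_host", "repet_user", "repet_pw", "repet_db",
                  "repet_port", "repet_job_manager"],
    "project": ["project_name", "project_dir"],
}


def missing_options(cfg_text):
    """Return the '[section] option' entries that are absent or empty."""
    # interpolation=None so that a '%' in a password is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    missing = []
    for section, options in REQUIRED.items():
        for option in options:
            value = parser.get(section, option, fallback="").strip()
            if not value:
                missing.append("[%s] %s" % (section, option))
    return missing
```

Calling missing_options(open("TEdenovo.cfg").read()) from the project directory returns an empty list when every option above has a value.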
Run the pipeline
The standard output is rather self-explanatory. REPET programs almost always begin with the sentence "beginning of ..." and end with "... finished successfully".
Each program that launches another one continues only when EXIT_SUCCESS (usually "0") is returned. Otherwise the sentence "*** Error: 'program X' returned 256" is written and the whole pipeline stops.
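The value 256 in that message is consistent with a raw Unix wait status, in which the exit code of the child process is stored in the high byte (so 256 corresponds to exit code 1). This reading is an assumption about how the pipeline reports errors, not something stated by REPET. The sketch below decodes such a status and shows the "stop at the first failure" behaviour with hypothetical helper names.

```python
import subprocess
import sys


def wait_status_to_exit_code(status):
    """Decode a raw Unix wait status of a process that exited normally."""
    return (status >> 8) & 0xFF


def run_steps(commands):
    """Run each command in turn; stop at the first non-zero exit code.

    Returns (failed_command, exit_code), or (None, 0) if all succeeded.
    """
    for argv in commands:
        code = subprocess.call(argv)
        if code != 0:
            return argv, code
    return None, 0


# Usage with harmless placeholder commands (not REPET programs):
ok = [sys.executable, "-c", "pass"]
failed, code = run_steps([ok, ok])
```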
To avoid killing the main process of the pipeline when you disconnect from your session, it is highly advised to use the Unix command "nohup". This program keeps a command running even if the session is disconnected or the user logs out. For more details, read the manual ("$ man nohup"). Here is an example:
- nohup TEdenovo.py -P ... -S 1 >& step1.txt &
To speed up the process, jobs are launched in parallel. In each section of the configuration file, you can set the following options:
- resources (optional): depending on your data, you may need specific resources (e.g. "mem_free=8G" if you need 8G of memory per job).
- tmpDir (optional): depending on the computing cluster, give the name of the temporary directory on the nodes (e.g. "/scratch"). WARNING : if you leave this parameter empty (the default), don't use 'yes' for the copy parameter described in step 2.
- clean {yes|no} (default: yes): if 'yes', temporary files are removed.
All parameters used below are set in the TEdenovo.cfg file with their default values.
Introduction
The TEdenovo pipeline follows these three main steps:
- Detection of repeated sequences (potential TEs)
- Clustering of these sequences
- Building of a consensus sequence for each cluster, representing the ancestral TE
After that, other processes are launched:
- Each consensus sequence is characterized and classified using Wicker's TE classification
- The TE consensus bank is filtered on several criteria
- The TEs are grouped by families
TEdenovo is able to look for repeated sequences by similarity and/or by structural search (optional, see § 'Specific use'). You can run either one or both means of detection.
Please see the step descriptions below for command examples.
Regular use
Quick description of TEdenovo's steps for SIMILARITY BRANCH:
- Step 1 : Genomic sequences are cut into batches
- Step 2 : The genome is aligned to itself using Blast
- Step 3 : The repetitive HSPs from BLAST are clustered by Recon, Grouper and/or Piler
- Step 4 : A multiple alignment is computed for each cluster, and a consensus sequence is derived from each multiple alignment
- Step 5 : Particular features are detected on each consensus, such as structural features or homology with known TEs, HMM profiles or host genes
- Step 6 : The consensus are classified using Wicker's classification
- Step 7 : SSR and under-represented unclassified consensus are filtered
- Step 8 : The consensus are clustered into families using Blastclust or MCL, to facilitate manual curation
Step 1 genomic sequence preparation
In this step, the input genomic sequences are cut into chunks (threshold at 200 kb, with a 10 kb overlap).
If the length of a genomic sequence is below the threshold, it is kept as a single chunk, i.e. a chunk will never be a concatenation of two different input sequences.
If you have a very high number of small sequences (e.g. 70000 input sequences of mean size 100 kb), it is still advised to keep the threshold at 200 kb: since several chunks can be put into the same batch (the batches being launched in parallel), the number of jobs remains reasonable.
There will be up to "chunk_length" x "min_nb_seq_per_batch" nucleotides in each batch.
See TEdenovo.cfg section [prepare_batches]:
- length threshold ("chunk_length: 200000")
- overlap length ("chunk_overlap: 10000")
- number of chunks per batch launched in parallel ("min_nb_seq_per_batch: 5")
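To make the effect of these three parameters concrete, here is a minimal sketch of chunking and batching. It assumes that consecutive chunks share exactly "chunk_overlap" bases and that the last chunk is simply shorter; how REPET actually places chunk boundaries is not described here, and the function names are hypothetical. With the default values, a batch holds at most 200000 x 5 = 1000000 nucleotides.

```python
def cut_into_chunks(seq_length, chunk_length=200000, chunk_overlap=10000):
    """Return 1-based inclusive (start, end) coordinates of the chunks."""
    if seq_length <= chunk_length:
        return [(1, seq_length)]  # short sequences are kept whole
    chunks = []
    start = 1
    step = chunk_length - chunk_overlap  # assumed spacing between chunk starts
    while True:
        end = min(start + chunk_length - 1, seq_length)
        chunks.append((start, end))
        if end == seq_length:
            break
        start += step
    return chunks


def make_batches(chunks, chunks_per_batch=5):
    """Group chunks into batches of at most chunks_per_batch chunks."""
    return [chunks[i:i + chunks_per_batch]
            for i in range(0, len(chunks), chunks_per_batch)]
```

Under these assumptions, a 400000 bp sequence gives three chunks: 1-200000, 190001-390000 and 380001-400000.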
When you are ready, launch the following command:
- TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 1
Results: In the example, the DmelChr4_db directory is created, in which all fasta files containing the chunks are written as Batches/batch_*.fa.
See REPET_v3.0_OutPutsPipelines.xlsx, TEdenovo Sim tab, for more details.
The second step aligns the genomic sequences of interest (in the example "DmelChr4.fa") with themselves in order to identify high-scoring segment pairs (HSPs) corresponding to repeats.
You can specify an option that may improve the computing time:
- copy {yes|no} (default: no): if 'yes', the genomic sequence is copied into the tmpDir specified previously.
WARNING : specifying 'yes' improves computing performance ONLY if you specified a tmpDir that is a directory on the computing nodes (e.g. "/scratch"). You also have to make sure that neither a password nor a passphrase is required to connect to the computing nodes from the submission node. Please ask your system administrator about these two crucial points before using this option.
The program BLASTER is used with stringent parameters.
Edit TEdenovo.cfg section [self_align] :
- you can choose the blast program between NCBI-BLAST, NCBI-BLAST+ and WU-BLAST ("blast: ncbi", "blast: blastplus", "blast: wu").
- BLAST returns only HSPs having an E-value below 1e-300 by setting "Evalue: 1e-300"
- BLAST returns only HSPs having a length above 100 (bp) by setting "length: 100"
- BLAST returns only HSPs having an identity percentage above 90 (in %) by setting "identity: 90"
After BLASTER has run, HSPs can be filtered by setting "filter_HSP: yes". Even if thresholds have already been defined above, you may want to be more stringent after the BLAST.
Moreover, it isn't possible to filter on a maximal HSP size during the BLAST (e.g. to remove matches corresponding to segmental duplications). The filter parameters are:
- keep only HSPs having an E-value below 1e-300 by setting "min_Evalue: 1e-300"
- keep only HSPs having an identity percentage above 90 (%) by setting "min_identity: 90"
- keep only HSPs having a length above 100 (bp) by setting "min_length: 100"
- keep only HSPs having a length below 20000 (bp) by setting "max_length: 20000"
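The four filter thresholds combine as a simple conjunction. The sketch below illustrates this on HSPs represented as plain dictionaries; this representation, the function name and the choice of inclusive comparisons are assumptions made for the example, not the behaviour of BLASTER itself.

```python
def keep_hsp(hsp, max_evalue=1e-300, min_identity=90.0,
             min_length=100, max_length=20000):
    """Return True if an HSP passes all four thresholds.

    max_evalue plays the role of the "min_Evalue" option: HSPs with a
    larger E-value are discarded. Comparisons are inclusive (assumption).
    """
    return (hsp["evalue"] <= max_evalue
            and hsp["identity"] >= min_identity
            and min_length <= hsp["length"] <= max_length)


# Made-up HSPs, for illustration only.
hsps = [
    {"evalue": 0.0, "identity": 95.0, "length": 1500},    # kept
    {"evalue": 0.0, "identity": 95.0, "length": 50},      # too short
    {"evalue": 0.0, "identity": 95.0, "length": 25000},   # too long
    {"evalue": 0.0, "identity": 85.0, "length": 1500},    # identity too low
    {"evalue": 1e-50, "identity": 95.0, "length": 1500},  # E-value too high
]
kept = [h for h in hsps if keep_hsp(h)]
```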
WARNING : Step 2 generates lots of files (up to dozens of GB, depending on the size of the input data bank). It is thus advised to keep only the useful files ("clean: yes"). To see the difference, launch step 2 on the example with and without this option.
When you are ready, launch the following command:
- TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 2 -s Blaster
Results: In the example, the DmelChr4_Blaster directory is created, in which all the results (list of HSPs) are stored, usually in a tabulated file called DmelChr4.align.not_over.filtered (HSPs due to chunk overlaps were removed, and the filter was applied).
See REPET_v3.0_OutPutsPipelines.xlsx, TEdenovo Sim tab, for more details.
The third step clusters the HSPs from step 2 to build clusters of repeats. Three clustering methods are available: GROUPER, RECON and PILER. It is better to launch all three methods in order to be able to combine the results afterwards. These clustering methods are independent, so you have to launch three step 3 commands: one for Grouper, one for Recon and one for Piler. As these programs have different running times, this allows you to launch the corresponding step 4 as soon as one program has finished. This is especially useful as Recon (and sometimes Grouper) usually takes longer than Piler. Still, as the clustering programs usually require large resources, they are launched on a cluster node within the pipeline.
Edit TEdenovo.cfg section [cluster_HSPs] if you need to change the default parameters.
- minNbSeqPerGroup, default 3 : minimum number of sequences per group, to avoid most of the segmental duplications.
- nbLongestSeqPerGroup, default 20 : select the "nbLongestSeqPerGroup" longest sequences of each group.
- maxSeqLength, default 20000 : max sequence length (bp) in groups.
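The three parameters above can be read as successive filters applied to each group. The sketch below works on sequence lengths only and assumes one particular order (drop over-long sequences, then drop small groups, then keep the longest members); the order actually used by the pipeline is not documented here, and the function name is hypothetical.

```python
def select_group_members(lengths, min_nb_seq_per_group=3,
                         nb_longest_seq_per_group=20, max_seq_length=20000):
    """Return the sequence lengths kept for one group ([] if dropped)."""
    # 1. Discard sequences longer than maxSeqLength.
    kept = [n for n in lengths if n <= max_seq_length]
    # 2. Discard the whole group if too few sequences remain.
    if len(kept) < min_nb_seq_per_group:
        return []
    # 3. Keep only the nbLongestSeqPerGroup longest sequences.
    return sorted(kept, reverse=True)[:nb_longest_seq_per_group]
```

With the defaults, a group with lengths 500, 700 and 25000 bp is dropped, because only two sequences remain after the length filter.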
For Grouper clustering program only, parameters are:
- Grouper_nbGroup, default 1 : allows you to run Grouper as multiple jobs, sending "Grouper_nbGroup" jobs in parallel to the cluster nodes (use 1 for the regular Grouper).
- Grouper_coverage, default 0.95 : coverage between all sequences in a group is at least "Grouper_coverage".
- Grouper_include, default 2 : keep groups where at least "Grouper_include" members are not included in other groups.
- Grouper_maxJoinLength, default 30000 : maximum length of a join. If distance between 2 TEs is above "Grouper_maxJoinLength", TEs will not be joined.
When you are ready, launch the following commands:
- TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 3 -s Blaster -c Grouper
- TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 3 -s Blaster -c Recon
- TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 3 -s Blaster -c Piler
Results: In the example, for each clustering method, a directory DmelChr4_Blaster_<method> is created. It contains several files, including DmelChr4<XX>_filtered.log, which records some statistics about this step, and DmelChr4_Blaster_<method_name>_3elem_20seq.fa, which contains the sequences, each header indicating the cluster to which the sequence belongs.
See REPET_v3.0_OutPutsPipelines.xlsx, TEdenovo Sim tab, for more details.
The fourth step builds a multiple alignment for each cluster from each clustering method used. The available multiple sequence alignment (MSA) program is Map. This program implements a global multiple alignment algorithm that specifically takes long gaps into account. Thus it always completes on clusters from Recon, whereas MUSCLE sometimes never finishes. Moreover, it seems to give better alignments than MAFFT.
Note that, although the Map algorithm described in Huang (1994) remains unchanged, the program has been slightly improved to manage fasta files with several sequences more efficiently. Thus, on the command line, it is now called "rpt_map" instead of "map".
Once the MSA is built, a consensus is derived by taking the most frequent base at each site. Moreover, if only one sequence has a base at a specific site, all the others having a gap (a unique insertion, for instance), then the site is not taken into account for the consensus (the minimal number of bases required at a site is given by the minBasesPerSite parameter).
Edit TEdenovo.cfg section [build_consensus] if you need to change the default parameters.
- The minimal number of bases required at a site to include it in the consensus is "minBasesPerSite: 2"
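The consensus rule described above (most frequent base per site, sites with too few bases skipped) can be sketched as follows. This is an illustration under simple assumptions, not the REPET implementation: the alignment is a list of equal-length strings, "-" is the gap character, and ties between equally frequent bases are broken arbitrarily.

```python
from collections import Counter


def build_consensus(aligned_seqs, min_bases_per_site=2, gap="-"):
    """Derive a consensus from aligned sequences of equal length."""
    consensus = []
    for column in zip(*aligned_seqs):
        bases = [b.upper() for b in column if b != gap]
        # Skip sites supported by fewer than min_bases_per_site bases,
        # e.g. an insertion present in a single sequence.
        if len(bases) < min_bases_per_site:
            continue
        consensus.append(Counter(bases).most_common(1)[0][0])
    return "".join(consensus)


# Made-up alignment: the third column is a unique insertion.
msa = ["AC-GT",
       "ACTGT",
       "AC-GA"]
result = build_consensus(msa)
```

Here the third site is ignored because only one sequence has a base there, and the last site takes "T", the majority base.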
When you are ready, launch the following commands:
- TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 4 -s Blaster -c Grouper -m Map
- TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 4 -s Blaster -c Recon -m Map
- TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 4 -s Blaster -c Piler -m Map
These commands are independent, so they can be launched at the same time. Please note that, if you do so, the jobs may compete against each other on the computing cluster.
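Since the three commands differ only by the "-c" value, they can be generated rather than typed. The helper below is a hypothetical convenience, not part of REPET; it only builds argument lists from the options shown in this tutorial and does not run anything.

```python
def step_commands(project, config, step,
                  clusterers=("Grouper", "Recon", "Piler"), msa=None):
    """Build one TEdenovo.py argument list per clustering method."""
    commands = []
    for clusterer in clusterers:
        argv = ["TEdenovo.py", "-P", project, "-C", config,
                "-S", str(step), "-s", "Blaster", "-c", clusterer]
        if msa is not None:
            argv += ["-m", msa]  # steps 4 and later also name the MSA tool
        commands.append(argv)
    return commands


step4 = step_commands("DmelChr4", "TEdenovo.cfg", 4, msa="Map")
```

Each list can then be passed to subprocess.Popen to start the three runs concurrently, or to subprocess.call to run them one after the other.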
Results: In the example, for each clustering method, a directory DmelChr4_Blaster_<method>_Map is created, containing the consensus file (e.g. DmelChr4_Blaster_Grouper_Map_consensus_shortH.fa for Grouper).
At this stage each consensus header becomes an identifier following this nomenclature: <projectName>-<selfAlignmentTool>-<clusteringTool><clusterNumber>-<msaTool><clusterMemberNumber>
(e.g. DmelChr4-B-G1-Map20 with <projectName>: DmelChr4, <selfAlignmentTool>: B for Blaster, <clusteringTool>: G for Grouper, <clusterNumber>: 1, <msaTool>: Map, <clusterMemberNumber>: 20, i.e. 20 cluster members were used to build the consensus). Legend for the clustering tool: G for Grouper, R for Recon, P for Piler.
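The nomenclature lends itself to a regular expression. The sketch below parses identifiers shaped like the example; it assumes that the self-alignment and MSA tool names contain letters only and that project names follow the rule given earlier (letters, digits, underscore). The function name is hypothetical.

```python
import re

HEADER_RE = re.compile(
    r"^(?P<project>[A-Za-z0-9_]+)"
    r"-(?P<self_align>[A-Za-z]+)"
    r"-(?P<clusterer>[GRP])(?P<cluster>\d+)"
    r"-(?P<msa>[A-Za-z]+?)(?P<members>\d+)$"
)
CLUSTERERS = {"G": "Grouper", "R": "Recon", "P": "Piler"}


def parse_consensus_header(header):
    """Split a consensus identifier into its parts, or return None."""
    match = HEADER_RE.match(header)
    if match is None:
        return None
    parts = match.groupdict()
    parts["clusterer"] = CLUSTERERS[parts["clusterer"]]
    parts["cluster"] = int(parts["cluster"])
    parts["members"] = int(parts["members"])
    return parts


parsed = parse_consensus_header("DmelChr4-B-G1-Map20")
```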
See REPET_v3.0_OutPutsPipelines.xlsx, TEdenovo Sim tab, for more details.
Step 5 Consensus detect features
Then, we launch the first step of the PASTEClassifier, i.e. the detection of features on the consensus.
Edit TEdenovo.cfg section [detect_feature] if you need to change the default parameters.
Several programs can be launched to look for:
- terminal repeats with TRsearch by setting "term_rep: yes"
- tandem repeats with TRF by setting "tand_rep: yes"
- open reading frames with dbORF.py by setting "orf: yes"
- poly-A tails with polyAtail by setting "polyA: yes"
- TEclass: please do not change this option as it is experimental
You can choose the blast program between NCBI-BLAST "blast: ncbi", NCBI-BLAST+ "blast: blastplus" and WU-BLAST "blast: wu".
You can also use RepeatScout to generate additional input consensus for this step. To do so, you need:
- to use the provided LaunchRepeatScout tool
- in the configuration file, to set "RepScout: yes" and "RepScout_bank: <bank_of_RepeatScout>". Make sure that either the <bank_of_RepeatScout> file is in your root project directory (copy or soft link), or that you provide a valid absolute path to it. To use this feature, you have to install RepeatScout first. It is advised to download and install the latest stable version from the UCSD website ("http://bix.ucsd.edu/repeatscout/").
PASTEClassifier also looks for matches between the consensus and known TEs (e.g. Repbase Update). Repbase Update (Jurka J. et al., Cytogenetic and Genome Research, 2005) is a well-known databank of known repeats. To use it, you will have to register on "www.girinst.org". Once you are registered, you can download a compressed archive with Repbase Update specifically formatted for REPET. The archive contains two fasta files, one with nucleotide sequences given to BLASTER with tblastx ("TE_BLRtx: yes") and the other with amino-acid sequences given to BLASTER with blastx ("TE_BLRx: yes").
If you have your own databank of known repeats, you can use it instead of Repbase or concatenate it at the end of Repbase. Take care of the way the sequence headers are formatted.
Furthermore, you can provide other data banks:
- HMM profiles ("TE_HMMER: yes"): it is possible to search for HMM profiles in the consensus via hmmer2 or hmmer3. This is very useful for PASTEC (see below). You can download the profile bank for REPET, ProfilesBankForREPET_Pfam27.0_GypsyDB.hmm, which comes from the Pfam database (M. Punta et al., Nucleic Acids Research, 2012) and is formatted for REPET. WARNING : this bank can only be used with hmmer3. You can also use your own bank, but each profile name has to be well formatted (<ACC>_<NAME>_<type according to key words found in DESC>_<GA>).
- cDNA from the host genome ("HG_BLRn: yes"): it is possible to compare them with the consensus via BLASTER with blastn.
- rDNA ("rDNA_BLRn: yes"): it is possible to compare them with the consensus via BLASTER with blastn.
- tRNA ("tRNA_scan: yes"): launches 'TRNAscanSE' to detect tRNAs.
Make sure you put the databanks in your root project directory (copy or soft link) and indicate the name of each data bank in TEdenovo.cfg section [detect_feature].
You can also adjust "TRFmaxPeriod": the maximum period size of the tandem repeats reported by TRF.
The programs listed above are launched in parallel. It can launch up to 1500 jobs (if there are 15000 consensus, each job will deal with 100 consensus).
If you first want to generate additional consensus using RepeatScout, use the following command:
- LaunchRepeatScout.py -i <inputFastaFileName>
If you want to use only detection by similarity, you must have run the corresponding previous steps. When you are ready, launch the following command:
- TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 5 -s Blaster -c GrpRecPil -m Map
Results: In the example, a directory DmelChr4_Blaster_GrpRecPil_Map_TEclassif/detectFeatures is created, containing folders with the results of the different programs that have been launched. The folder list is ORF, polyA, Profiles, rDNA_BLRn, SSR, TE_BLRn, TE_BLRtx, TE_BLRx, TR. The corresponding MySQL tables are also created: DmelChr4_sim_polyA_set, DmelChr4_sim_TRF_set, DmelChr4_sim_ORF_map, DmelChr4_sim_TE_BLRtn_path, DmelChr4_sim_BLRtx_path, DmelChr4_sim_BLRx_path, DmelChr4_sim_SSR_set, DmelChr4_sim_TR_set.
See REPET_v3.0_OutPutsPipelines.xlsx, TEdenovo Sim tab, for more details.
Step 6 Consensus classification
This step classifies the consensus according to their features detected at step 5. The classification is made by the PASTEClassifier (or PASTEC).
For each consensus, PASTEC retrieves its features from the MySQL tables: "structural" features (LTR, TIR, polyA tails, SSR-like tails) and "coding" ones (matches with known TEs, host genes, rDNA or HMM profiles).
During this step, several post-treatments are also available.
PASTEC classifies consensus into several groups: TEs are classified at the order level, and for some LTRs at the superfamily level, using Wicker's classification (Wicker et al., Nat.Rev.Genet., 2007); the non-TE groups are simple sequence repeats (SSR), Potential Host Gene (PHG) and Potential ribosomal DNA (PrDNA).
If PASTEC doesn't find any features, the consensus is 'Unclassified'.
If PASTEC finds more than one classification, the consensus is tagged 'Confused'.
If the consensus sequence is on the negative strand, the consensus is tagged 'reversed'.
PASTEC characterizes a consensus as a TE by its 'Class', 'Order', 'Super Family', Wicker trigram and features.
Edit TEdenovo.cfg section [classif_consensus] if you need to change the default parameters.
The default values of the following parameters are based on our experience with the Drosophila melanogaster genome and on the paper "A unified classification system for eukaryotic transposable elements", Wicker et al., Nat.Rev.Genet., 2007.
- "limit_job_nb: 0" <- limits the number of jobs for PASTEC (0 = no limit). Each job is a PASTEC process, and thus one connection to the MySQL server, since PASTEC starts by retrieving the results from the database. So, depending on the amount of data in the database and your computing cluster configuration (allowing, for example, 700 jobs to run at the same time), the MySQL server can be overloaded. You may want to limit the number of simultaneous connections to the MySQL server for PASTEC.
- "max_profiles_evalue: 1e-3" <- only matches on the profiles bank below this e-value are kept
- "min_TE_profiles_coverage: 20" <- minimal coverage between consensus and profiles for TEs
- "min_HG_profiles_coverage: 75" <- minimal coverage between consensus and profiles for host genes (if no other classification was found)
- "max_helitron_extremities_evalue: 1e-3" <- above this e-value, the match is not considered for the helitron classification
- "min_TE_bank_coverage: 5" <- min coverage below which a match is disregarded
- "min_HG_bank_coverage: 95" <- min coverage above which a consensus is considered as host gene
- "min_HG_bank_identity: 90" <- min identity above which a consensus is considered as host gene (used in conjunction with the coverage threshold above)
- "min_rDNA_bank_coverage: 95" <- min coverage above which a consensus is considered as rDNA
- "min_rDNA_bank_identity: 90" <- min identity above which a consensus is considered as rDNA (used in conjunction with the coverage threshold above)
- "min_SSR_coverage: 75" <- minimal percentage of SSR in the consensus
- "max_SSR_size: 100" <- max size to consider consensus as SSR
- "remove_redundancy: yes" <- to remove consensus classified as TEs considered as identical
- "min_redundancy_identity: 95" <- minimal identity beyond which two consensus are considered identical
- "min_redundancy_coverage: 98" <- minimal coverage beyond which two consensus are considered identical
- "rev_complement: yes" <- reverse sequence on negative strand
- "add_noCat_bestHitClassif: no" <- if yes, for each 'unclassified' consensus, PASTEC will specify the classification of closest TE consensus found by BLAST.
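Several of these thresholds work in pairs (coverage together with identity). The sketch below shows how such pairs could be applied, using the default values listed above. It is only an illustration: PASTEC's real decision rules are richer than two comparisons, the function names are hypothetical, and treating the thresholds as inclusive is an assumption.

```python
def is_redundant(identity, coverage,
                 min_redundancy_identity=95.0, min_redundancy_coverage=98.0):
    """Two consensus are considered identical if both thresholds are met."""
    return (identity >= min_redundancy_identity
            and coverage >= min_redundancy_coverage)


def matches_host_gene(coverage, identity,
                      min_hg_bank_coverage=95.0, min_hg_bank_identity=90.0):
    """A bank match supports 'host gene' only if both thresholds are met."""
    return (coverage >= min_hg_bank_coverage
            and identity >= min_hg_bank_identity)
```

For example, a pair of consensus with 99% identity but only 90% coverage would not be treated as redundant with the default values.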
When you are ready, launch the following command:
- TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 6 -s Blaster -c GrpRecPil -m Map
Results: In the example, a directory DmelChr4_Blaster_GrpRecPil_Map_TEclassif/classifConsensus is created, containing several output files from all post-treatments.
Details about the classification are in several classification files (*.classif) associated with the fasta files of the consensus libraries:
- DmelChr4_sim_denovoLibTEs.fa is associated with the classification file DmelChr4_sim_denovoLibTEs.classif and the statistics file DmelChr4_sim_denovoLibTEs.classif_stats.txt.
- the MySQL table DmelChr4_sim_consensus_classif is also created along the way.
A pre-curated classification is proposed in the DmelChr4_sim_denovoLibTEs_PC.classif file, in association with DmelChr4_sim_denovoLibTEs_PC.classif_stats.txt: LTR consensus (RLX) are classified at the superfamily level as Copia or Gypsy if all their TE_BLRx, TE_BLRtx and TE_BLRn features are exclusively Copia or Gypsy.
- the MySQL table DmelChr4_sim_consensus_PC_classif is also created.
The classification file is tab-delimited with 12 columns, like the classification table:
- Consensus name in 'Seq_name',
- Consensus length (bp) in 'length',
- Consensus orientation in 'strand' (values: + or -),
- Is the consensus confused? in 'confused' (values: True or False),
- Consensus class in 'class' (values: I or II if the consensus is classified as a TE, NA otherwise),
- Consensus order in 'order' (values: any order from class I or class II if the consensus is classified as a TE, NA otherwise),
- Wicker code attributed to the consensus in 'Wcode' (values: see Wicker et al., Nat.Rev.Genet., 2007 if the consensus is classified as a TE, PHG/SSR/PrDNA otherwise),
- Consensus superfamily in 'sFamily' (values: see Wicker et al., Nat.Rev.Genet., 2007 if the consensus is classified as a TE, NA otherwise),
- Confidence index in 'CI' (value calculated by PASTEC using the decision rules from DecisionRule.yaml),
- Features found by homology in 'coding',
- Structural features in 'struct',
- Features from profiles tagged as 'other' and/or from other classification of confused consensus in 'other'
WARNING: if consensus is confused, the 'class', 'order', 'Wcode', 'sFamily' and 'CI' fields will contain all information separated by |.
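A classification line can be turned into a dictionary keyed by the twelve column names above. The sketch below assumes that the fields are separated by tabs and appear in the order listed; the sample line is made up for the test and does not come from a real .classif file, and the function name is hypothetical.

```python
COLUMNS = ["Seq_name", "length", "strand", "confused", "class", "order",
           "Wcode", "sFamily", "CI", "coding", "struct", "other"]
# Fields that hold several '|'-separated values for confused consensus.
MULTI_VALUED = ("class", "order", "Wcode", "sFamily", "CI")


def parse_classif_line(line):
    """Parse one tab-delimited classification line into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError("expected %d columns, got %d"
                         % (len(COLUMNS), len(fields)))
    record = dict(zip(COLUMNS, fields))
    record["length"] = int(record["length"])
    record["confused"] = record["confused"] == "True"
    if record["confused"]:
        for name in MULTI_VALUED:
            record[name] = record[name].split("|")
    return record


# Made-up line, for illustration only.
sample = "\t".join(["cons1", "1200", "+", "True", "I|II", "LTR|TIR",
                    "RLX|DTX", "NA|NA", "50|50", "NA", "NA", "NA"])
record = parse_classif_line(sample)
```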
See REPET_v3.0_OutPutsPipelines.xlsx, TEdenovo Sim tab, for more details.
The seventh step filters out the SSR, and the consensus classified as "unclassified" only when they were built from fewer than 10 sequences.
Indeed, before using the consensus library in the TEannot pipeline, you may want to filter it. For instance, you may want to remove the consensus classified as SSR, HostGene, confused or unclassified.
To filter the "unclassified" consensus built from fewer than 10 sequences, we use the "MSA program number". This number, found in the header of each consensus after the name of the MSA program, is the number of sequences in the multiple alignment from which the consensus was derived.
Edit TEdenovo.cfg section [filter_consensus] if you need to change the default parameters.
- "filter_SSR: yes" <- if set to yes, filter consensus classified as SSR using the parameter below
- "length_SSR: 0" <- length below which an SSR is filtered (e.g. 300; default=0 : all SSR are filtered)
- "filter_unclassified: yes" <- if set to yes, filter unclassified consensus using the parameter below
- "filter_unclassified_max_fragments: 10" <- number of sequences in the MSA below which an unclassified consensus is filtered (default=10)
- "filter_host_gene: no" <- if set to yes, filter consensus classified as potential host genes (PHG)
- "filter_confused: no" <- if set to yes, filter consensus classified as confused
- "filter_rDNA: no" <- if set to yes, filter consensus classified as potential rDNA (PrDNA)
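The filtering options can be summarised as one decision per consensus. The sketch below uses the default values listed above. The category labels ("SSR", "Unclassified", "PHG", "PrDNA"), the function name and the way the options are combined are assumptions made for the example, not the pipeline's code.

```python
DEFAULTS = {
    "filter_SSR": True,
    "length_SSR": 0,
    "filter_unclassified": True,
    "filter_unclassified_max_fragments": 10,
    "filter_host_gene": False,
    "filter_confused": False,
    "filter_rDNA": False,
}


def is_filtered_out(category, length, nb_msa_seqs, confused=False,
                    cfg=DEFAULTS):
    """Return True if a consensus would be removed from the library."""
    if cfg["filter_confused"] and confused:
        return True
    if category == "SSR":
        # length_SSR == 0 means that every SSR is filtered.
        return cfg["filter_SSR"] and (cfg["length_SSR"] == 0
                                      or length < cfg["length_SSR"])
    if category == "Unclassified":
        return (cfg["filter_unclassified"]
                and nb_msa_seqs < cfg["filter_unclassified_max_fragments"])
    if category == "PHG":
        return cfg["filter_host_gene"]
    if category == "PrDNA":
        return cfg["filter_rDNA"]
    return False
```

With the defaults, an unclassified consensus built from 9 sequences is removed, while one built from 10 sequences is kept.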
When you are ready, launch the following command:
- TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 7 -s Blaster -c GrpRecPil -m Map
Results: In the example, the directory "DmelChr4_Blaster_*_Map_TEclassif_Filtered" is created, containing the output file "DmelChr4_denovoLibTEs_filtered.fa", which can be used directly in the TEannot pipeline.
The corresponding classif and stats files are also provided: DmelChr4_sim_denovoLibTEs_PC_filtered.classif and DmelChr4_sim_denovoLibTEs_PC_filtered.classif_stats.txt
See REPET_v3.0_OutPutsPipelines.xlsx, TEdenovo Sim tab, for more details.
ADVICE: Before using the TEannot pipeline, please READ the file "TEannot_tuto.txt".
For the last step, it is useful to investigate the relationships among the de novo consensus that have been built, by grouping them into clusters (i.e. "TE families").
This step launches the Blastclust or MCL program, according to the "-f" option.
Edit TEdenovo.cfg section [cluster_consensus] if you need to change the default parameters.
- "Blastclust_identity: 0" <- Score coverage threshold (bit score / length if < 3.0, percentage of identities otherwise)
- "Blastclust_coverage: 80" <- length coverage threshold above which consensus are regrouped in the same cluster
- "MCL_inflation: 1.5" <- Low inflation leads to coarser clusterings, high inflation leads to fine-grained clusterings.
- "MCL_coverage: 0.0" <- length coverage threshold above which consensus are regrouped in the same cluster
When you are ready, launch the following command if you use Blastclust as clustering tool:
- TEdenovo.py -P DmelChr4 -C TEdenovo.cfg -S 8 -s Blaster -c GrpRecPil -m Map -f Blastclust
Results: In the example, depending on the chosen clustering method (-f option), either:
- the DmelChr4_Blaster_Map_TEclassif_Filtered_Blastclust folder is created, containing the output files DmelChr4_denovoLibTEs_filtered_Blastclust.fa, in which the consensus headers are rewritten with the Blastclust cluster number, and DmelChr4_denovoLibTEs_filtered_Blastclust.tab, the clusters file, in which each line lists the consensus headers belonging to the same cluster; or
- the DmelChr4_Blaster_Map_TEclassif_Filtered_MCL folder is created, containing the output files DmelChr4_denovoLibTEs_filtered_MCL.fa, in which the consensus headers are rewritten with the MCL cluster number, and DmelChr4_denovoLibTEs_filtered_MCL.tab, the clusters file, in which each line lists the consensus headers belonging to the same cluster.
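A clusters file with one cluster per line is easy to load for manual curation. The sketch below assumes that the headers on a line are separated by whitespace (tabs or spaces) and numbers the clusters by line order; both points are assumptions about the .tab layout, and the function names are hypothetical.

```python
def read_clusters(lines):
    """Return a list of clusters, each a list of consensus headers."""
    clusters = []
    for line in lines:
        headers = line.split()
        if headers:  # ignore empty lines
            clusters.append(headers)
    return clusters


def cluster_of(clusters):
    """Map each consensus header to its 1-based cluster number."""
    return {header: number
            for number, members in enumerate(clusters, start=1)
            for header in members}


# Made-up content, for illustration only.
lines = ["DmelChr4-B-G1-Map20\tDmelChr4-B-R3-Map5\n",
         "\n",
         "DmelChr4-B-P2-Map8\n"]
clusters = read_clusters(lines)
membership = cluster_of(clusters)
```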
See REPET_v3.0_OutPutsPipelines.xlsx, TEdenovo Sim tab, for more details.