This tutorial describes PASTEC (Pseudo Agent System of Transposable Element Classification), helps users to set parameters and gives the command lines. It is included in the delivery PASTEClassifier-2.0
Note: PASTEC is the REPET package Transposable Elements classifier used in the TEdenovo pipeline, and the corresponding stand alone tool is PASTEClassifier.py
The PASTEClassifier follows two main steps:
After that, other processes are launched:
More information
From now on, the project name is "ProjectName".
Setup your working environment
Set environment variables
REPET_PATH gives the absolute path to the srcDirectory where PASTEC has been installed.
export REPET_PATH=$HOME/<srcDirectory>/repet_pipe/
Add the path towards REPET programs to your path:
export PATH=$REPET_PATH/bin:...:$PATH
In the config directory, we find a template of setEnv.sh shell script in which these 2 environment variables can be set and source it.
Create your project directory
(for instance "ProjectName") and go into it:
Copy the input fasta file recording the genomic sequences, suspected to be transposable elements (TEs).
Format your fasta file with only 60 bps (or less) by line for each sequence.
About the sequence headers, it is highly advised to write them like this : ">XX_i" with XX standing for letters and i standing for numbers.
Please, avoid space (" ") or symbols such as "=", ";", ":", "|"...
Copy the configuration file:
Edit the configuration file "PASTEClassifier.cfg" in order to adapt it to your personal situation.
Run PASTEClassifier
The standard output is rather self-explaining.
To avoid killing the main process of the pipeline by disconnecting from your session, it is highly advised to use the Unix command "nohup".
This program runs a command even if the session is disconnected or the user logs out.
To have more details, read the manual ("$ man nohup").
Command Line :
If you want to use a computing cluster to parallelize treatments:
It is possible to launch the 2 PASTEC steps one by one:
PASTEClassifier.py -i ProjectName.fa -C PASTEClassifier.cfg -S 1 -p >& PASTEC_step1.log &
and then
PASTEClassifier.py -i ProjectName.fa -C PASTEClassifier.cfg -S 2 -p >& PASTEC_step2.log &
To launch all steps of PASTEClassifier.py without parallelization
PASTEClassifier looks for structural features, such as terminal repeats (termR), polyA tail, tandem repeats (to detect SSR and polyA tail), open reading frames (ORF).
It also looks for homology with known TEs, host genes, rDNA using blast and with HMM profiles using hmmer.
You can choose the programs to launch in the configuration file
PASTEClassifier.cfg
(see below). But be carefull of the dependencies (blast, hmmer, banks).
It is highly advised to read the "Banks" part, at the end of the tutorial, before using any bank.
Make sure you copy (or soft link) the databanks in your root project directory (here: $HOME/work/ProjectName) and indicate the file name of each data bank in the configuration file.
Adjustable parameters in
[detect_features]
section:
Results:
A directory is created for each selected features, containing the results files.
In database, tables are also created for each selected features and will be used in the following step.
This step classifies the sequences according to their features detected previously. The classification is made by the PASTEC tool.
For each sequence, PASTEC retrieves its features from the MySQL tables: "structural" features (LTR, TIR, polyA tails, SSR-like tails) and "coding" features (matches with known TEs, host genes, rDNA or HMM profiles).
PASTEC classifies elements based on the Wicker's classification (Wicker et al., Nat.Rev.Genet., 2007).
Adjustable parameters in
[classif_consensus]
section:
You can apply several post processes on your classified library:
Results:
The results are in three files types : fasta file (.fa extension) with TEs sequences, classification file (.classif extension see "Classification output" for details) containing classication information for each TE and statistics file (.classif_stats.txt extension) about TEs library. ProjectName.fa, ProjectName.classif and ProjectName.classif_stats.txt.
If parameter remove_redundancy: yes, there are ProjectName_withoutRedundancy.fa, ProjectName_withoutRedundancy.classif and ProjectName_withoutRedundancy.classif_stats.txt.
If parameter rev_complement: yes, the headers of the reverse-complemented sequences is end tagged by "_reversed" in the fasta file ProjectName_withoutRedundancy_negStrandReversed.fa, and in classif file ProjectName_withoutRedundancy_negStrandReversed.classif.
If parameter add_noCat_bestHitClassif: yes (no by default), the header of each sequence contains the best hit information in fasta file <fileName>_noCatBestHit.fa
==>The final classified consensus library is ProjectName_denovoTEsLib.fa, ProjectName_denovoTEsLib.classif, ProjectName_denovoTEsLib.classif_stats.txt and the corresponding ProjectName_consensus_classif table in database.
WARNING
: It also proposed a pre-curated library, where LTR sequences will be classified at super-family Gypsy/copia if they have only Gypsy or Copia coding features in ProjectName_denovoLibTEs_PC.classif file and ProjectName_denovoLibTEs_PC.classif_stats.txt as stats file.
Classification file is tabulated on 12 colomns as the classification table:
WARNING: if consensus is confused, the 'class', 'order', 'Wcode', 'sFamily' and 'CI' fields will contain all information separated by |.
Classification Examples:
blumeria_B_G590_Map20 391 . False Unclassified Unclassified NA NA NA coding=() struct=() other=()
blumeria_B_G28830_Map3 5742 - False I LTR RLX NA 50 coding=(TE_BLRtx: GYPSY10_I:classI:LTR_retrotransposon: 9.61%; TE_BLRx: ORF1p_TCN4#2:classI:LTR_retrotransposon: 35.45%; profiles: PF00078.19_RVT_1_RT_32.0: 90.91%, PF00078.19_RVT_1_RT_35.2: 90.91%, PF00665.18_rve_INT_32.0: 97.63%, PF00098.15_zf-CCHC_GAG_19.8: 83.33%, PF00665.18_rve_INT_18.8: 97.63%, PF00098.15_zf-CCHC_GAG_17.9: 83.33%) struct=(TElength: >4000bps; TermRepeats: termLTR: 224; ORF: >1000bps, >1000bps) other=(NA)
PN_sequenceA 8834 . False I TRIM RXX-TRIM NA 40 coding=(NA) struct=(TElength: <700bps; TermRepeats: termLTR: 4156) other=(NA)
Alyr_B_R1138_Map20 8098 - False II TIR DTX NA 50 coding=(profiles: PF03004.6_Transposase_24_Tase_14.0: 98.05%, PF02992.6_Transposase_21_Tase_25.0: 98.71%, PF03004.6_Transposase_24_Tase_25.0: 98.05%) struct=(TElength: >1000bps; TermRepeats: termTIR: 33; ORF: >1000bps) other=(NA)
Alyr_B_G7351_Map4 16061 - True I|II|I LTR|Maverick|DIRS RLX|DMX|RYX NA 28|25|25 coding=(TE_BLRtx: GYPSY10_I:classI:LTR_retrotransposon: 9.15%, GYPSY10_I:classI:LTR_retrotransposon: 9.15%; TE_BLRx: GYPSY10_AG2p:classI:LTR_retrotransposon: 55.92%, GYPSY10_AG2p:classI:LTR_retrotransposon: 58.64%; profiles: PF00078.19_RVT_1_RT_32.0: 98.76%, PF00075.16_RnaseH_RH_7.5: 50.30%, PF00078.19_RVT_1_RT_35.2: 98.76%, PF03732.9_Retrotrans_gag_GAG_27.2: 98.00%, PF00665.18_rve_INT_32.0: 98.22%, PF03732.9_Retrotrans_gag_GAG_14.8: 98.00%, PF00665.18_rve_INT_18.8: 98.22%, PF00075.16_RnaseH_RH_21.9: 50.30%) struct=(TElength: >10000bps; ORF: >1000bps, >1000bps, >1000bps, >1000bps, >3000bps) other=(TE_BLRtx: GYPSY10_I:classI:LTR_retrotransposon: 9.15%, GYPSY10_I:classI:LTR_retrotransposon: 9.15%; TE_BLRx: GYPSY10_AG2p:classI:LTR_retrotransposon: 55.92%, GYPSY10_AG2p:classI:LTR_retrotransposon: 58.64%; profiles: PF00078.19_RVT_1_RT_32.0: 98.76%, PF00075.16_RnaseH_RH_7.5: 50.30%, PF00078.19_RVT_1_RT_35.2: 98.76%, PF03732.9_Retrotrans_gag_GAG_27.2: 98.00%, PF00665.18_rve_INT_32.0: 98.22%, PF03732.9_Retrotrans_gag_GAG_14.8: 98.00%, PF00665.18_rve_INT_18.8: 98.22%, PF00075.16_RnaseH_RH_21.9: 50.30%; TElength: >15000bps; TermRepeats: termTIR: 11; ORF: >1000bps, >1000bps, >1000bps, >1000bps, >3000bps)
MelaMite 324 . False II MITE DXX-MITE NA 20 coding=(NA) struct=(TElength: <700bps; TermRepeats: termTIR: 23) other=(NA)
Seq_SSR 63 . False SSR NA NA NA 100 coding=(NA) struct=(TElength: <100bps; SSRCoverage=0.00) other=(NA)
DmelCaf1_2_2-L-B230-Map1_reversed 7448 + False I LTR RLX NA 64 coding=(TE_BLRtx: GYPSY10_I:classI:LTR_retrotransposon: 21.37%; TE_BLRx: GYPSY10_AG2p:classI:LTR_retrotransposon: 50.72%; profiles: PF00078.19_RVT_1_RT_32.0: 100.00%, PF00078.19_RVT_1_RT_35.2: 100.00%, PF00077.12_RVP_AP_30.9: 70.54%, PF00665.18_rve_INT_32.0: 97.63%, PF00665.18_rve_INT_18.8: 97.63%, PF00077.12_RVP_AP_1.0: 70.54%) struct=(TElength: >4000bps; TermRepeats: termLTR: 531; ORF: >1000bps RT RT INT INT, >1000bps) other=(NA)
denovoFragaria-B-G1199-Map5 1530 . False NA NA PHG NA 100 coding=(Other_profiles: PF00664.18_ABC_membrane_NA_OTHER_22.0: 26.55%, PF00005.22_ABC_tran_NA_OTHER_23.1: 100.00%) struct=(NA) other=(NA)
Frag-B-R4858-Map5 2282 + False II TIR DTX NA 62 coding=(TE_BLRtx: hAT-13_VVi:ClassII:TIR:hAT:?: 30.58%; TE_BLRx: hAT-13_VVi_1p:ClassII:TIR:hAT:?: 41.55%, hAT-1_PPe_1p:ClassII:TIR:hAT:?: 12.13%) struct=(TElength: >1000bps) other=(Other_profiles: PF02892.10_zf-BED_NA_OTHER_20.6: 97.78%)
Banks
Known TEs:
You can use
Repbase
Update, which is a famous databank of know repeats (Jurka J. et al., Cytogentic and Genome Research, 2005).
To use it, you will have to register on "www.girinst.org".
Once you are registered, you can download a compressed archive with Repbase Update specifically formatted for REPET.
The archive contains two fasta files, one with nucleotide sequences given to BLASTER with blastn parameter "TE_BLRn: yes" either/or tblastx parameter "TE_BLRtx: yes". This bank file name should replace "<bank_of_TE_nucleotide_sequences_such_as_Repbase>" in "TE_nucl_bank: " parameter.
The other file with aminoacid sequences given to BLASTER with blastx parameter "TE_BLRx: yes". This bank file name should replace "<bank_of_TE_amino-acid_sequences_such_as_Repbase>" in "TE_prot_bank:" parameter.
If you have your own databank of known repeats, you can use it instead of Repbase or concatenate it at the end of Repbase. Take care of the way the sequence headers are formatted.
FORMAT : fasta with additional information in headers
Wicker's classification should be added at the end of each header, using ":" as separator. The pattern is "original header", "ClassI" or "ClassII", "order name", "superfamily name". For example, if the original header is ">TEName" and it's a TE classified as L1, the new header is : >TEName:ClassI:LINE:L1
When the classification is not available at one level, a "?" can be specified (e.g. ">TEName:?:?:?" or ">TEName:ClassI:?:?").
WARNING
: the original header should not have any ":".
HMM profiles:
You can download the bank ProfilesBankForREPET_Pfam26.0_GypsyDB.hmm at http://urgi.versailles.inra.fr/download/repet/, which comes from Pfam database (M. Punta, et al., Nucleic Acids Research, 2012) and also from
GypsyDB
().
This version is specially formatted for REPET.
WARNING
: this bank can only be used with hmmer3.
You should set "TE_HMM_profiles:" parameter with this bank. It is possible to search HMM profiles via hmmer2 or hmmer3, according to the format of the bank. You can also use your own bank, but each profile name has to be well formatted.
FORMAT : hmmer (.hmm) with additional information in NAME
According to the Pfam field, the original NAME is replaced by: <ACC>_<NAME>_<type according to key words found in DESC>_<GA>
For example, if the Pfam format is:
//
HMMER3/b [3.0 | March 2010]
NAME DUF3701
ACC PF12482.3
DESC Phage integrase protein
LENG 96
ALPH amino
RF no
CS no
MAP yes
DATE Tue Sep 27 23:45:10 2011
NSEQ 38
EFFN 2.017822
CKSUM 1455990630
GA 21.50 21.50;
...
the new NAME is : PF12482.3_DUF3701_INT_21.5
Details about TE profile type (and its abbreviation) can be found in the paper of Wicker et al. (see above).
When no information is found in DESC (or NAME) to determine the profile type, we consider the profile not TE specific, and replace <type> by OTHER. These profiles can be used to classify sequences as "potential host genes".
Host genes:
If you have a bank of cDNA from the host genome, you can specify the file name in "HG_nucl_bank:" parameter.
FORMAT : fasta
rDNA:
If you have a bank of rDNA from eukaryota, you can specify the file name in "rDNA_bank:" parameter.
FORMAT : fasta