PASTEClassifier tuto

This tutorial describes PASTEC (Pseudo Agent System of Transposable Element Classification), helps users to set parameters and gives the command lines. It is included in the delivery PASTEClassifier-2.0

Note: PASTEC is the REPET package Transposable Elements classifier used in the TEdenovo pipeline, and the corresponding stand alone tool is PASTEClassifier.py

The PASTEClassifier follows two main steps:

  • Step 1 : Detection of TEs features
  • Step 2 : Classification and post-treatments

After that, other processes are launched:

  • Remove redundancy between sequences
  • compute some statistics about the classification
  • compute reverse complement of the nucleotide sequences, if the classification process considers these sequences in negative strand (-r option)
  • Unclassified sequences clustering with classified ones
  • pre-curation on sequences classified LTR  (RLX)

More information


From now on, the project name is "ProjectName".

Setup your working environment

Set environment variables
REPET_PATH gives the absolute path to the srcDirectory where PASTEC has been installed.
    export REPET_PATH=$HOME/<srcDirectory>/repet_pipe/
Add the path towards REPET programs to your path:
    export PATH=$REPET_PATH/bin:...:$PATH

In the config directory, we find a template of setEnv.sh shell script in which these 2 environment variables can be set and source it.
   
Create your project directory (for instance "ProjectName") and go into it:

  •   cd $HOME/work/
  •   mkdir ProjectName
  •   cd ProjectName

Copy the input fasta file recording the genomic sequences, suspected to be transposable elements (TEs).
Format your fasta file with only 60 bps (or less) by line for each sequence.
About the sequence headers, it is highly advised to write them like this : ">XX_i" with XX standing for letters and i standing for numbers.
Please, avoid space (" ") or symbols such as "=", ";", ":", "|"...

Copy the configuration file:

  •   cp $HOME/<srcDirectory>/config/PASTEClassifier.cfg .

Edit the configuration file "PASTEClassifier.cfg" in order to adapt it to your personal situation.

  • In the section [repet_env] , indicate (ask your system administrator):
    • the host name of your MySQL database
    • your MySQL login
    • your MySQL password
    •  the name of your MySQL database
  • In the section [project] , indicate:
    • the name of your project (here: ProjectName)
    • the absolute path to your project directory (here: $HOME/work/ProjectName)
  • In [detect_features] and [classif_consensus] these sections indicate:
    • 'clean: yes' parameter will remove some temporary files.
    • 'resources' (optional): according to your data, you may need some specific resources (e.g. "mem_free=8G" if you need 8G of memory per job).
    • 'tmpDir' (optional): according to the computing cluster, give the name of the temporary directory of nodes (e.g. "/scratch").

Run PASTEClassifier

The standard output is rather self-explaining.
To avoid killing the main process of the pipeline by disconnecting from your session, it is highly advised to use the Unix command "nohup".
This program runs a command even if the session is disconnected or the user logs out.
To have more details, read the manual ("$ man nohup").

Command Line :
If you want to use a computing cluster to parallelize treatments:

  • PASTEClassifier.py -i ProjectName.fa -C PASTEClassifier.cfg -p >& PASTEC.log &

It is possible to launch the 2 PASTEC steps one by one:
    PASTEClassifier.py -i ProjectName.fa -C PASTEClassifier.cfg -S 1 -p >& PASTEC_step1.log &
    and then
    PASTEClassifier.py -i ProjectName.fa -C PASTEClassifier.cfg -S 2 -p >& PASTEC_step2.log &

To launch all steps of PASTEClassifier.py without parallelization

  • PASTEClassifier.py -i inpuFile.fa -C PASTEClassifier.cfg >& PASTEC.log &

 

Step 1 : "Features Detection"

PASTEClassifier looks for structural features, such as terminal repeats (termR), polyA tail, tandem repeats (to detect SSR and polyA tail), open reading frames (ORF).
It also looks for homology with known TEs, host genes, rDNA using blast and with HMM profiles using hmmer.
You can choose the programs to launch in the configuration file PASTEClassifier.cfg (see below). But be carefull of the dependencies (blast, hmmer, banks).
It is highly advised to read the "Banks" part, at the end of the tutorial, before using any bank.
Make sure you copy (or soft link) the databanks in your root project directory (here: $HOME/work/ProjectName) and indicate the file name of each data bank in the configuration file.
Adjustable parameters in [detect_features] section:

  • terminal repeats with TRsearch, set "term_rep: yes"
  • tandem repeats with TRF, set "tand_rep: yes"
  • open reading frames with dbORF.py, set "orf: yes"
  • poly-A tails with polyAtail, set "polyA: yes"
  • homology using blast, select the program by setting:
    • "blast: ncbi" for NCBI-BLAST
    •  "blast: blastplus" for NCBI-BLAST+
    •  "blast: wu" for WU-BLAST
  • helitron extremities, set "TE_BLRn: yes" and write the file name of the known TE bank in "TE_nucl_bank:" (nucleotide sequences)
  • homology with known TEs using tblastx, set "TE_BLRtx: yes" and write the file name of the known TE bank in "TE_nucl_bank:" (nucleotide sequences)
  • homology with known TEs using blastx, set "TE_BLRx: yes" and write the file name of the known TE bank in "TE_prot_bank:" (amino-acid sequences)
  • homology with host genes, set "HG_BLRn: yes" and write the file name of the cDNA bank in "HG_nucl_bank:"
  • homology with rDNA, set "rDNA_BLRn: yes" and write the file name of the rDNA bank in "rDNA_bank:"
  • homology with HMM profiles, set "TE_HMMER: yes" and write the file name of the HMM profiles bank in "TE_HMM_profiles:". You can also change the E-value threshold used by hmmer.
  • tRNA with TRNAscanSE, set "tRNA_scan: yes". WARNING: these results are not used for the classification. This functionality is still in exploration.

Results:
A directory is created for each selected features, containing the results files.   
In database, tables are also created for each selected features and will be used in the following step.

Step 2 : "classification"

This step classifies the sequences according to their features detected previously. The classification is made by the PASTEC tool.
For each sequence, PASTEC retrieves its features from the MySQL tables: "structural" features (LTR, TIR, polyA tails, SSR-like tails) and "coding" features (matches with known TEs, host genes, rDNA or HMM profiles).
PASTEC classifies elements based on the Wicker's classification (Wicker et al., Nat.Rev.Genet., 2007).
Adjustable parameters in [classif_consensus] section:

  •  "limit_job_nb: 0" <- parameter to limit the jobs number for PASTEC. Each job represents a PASTEC process, so one connection. But at the beginning, PASTEC retrieves the results from database.
    So depending on the amount of data in database and your computing cluster configuration (allowing, per example, 700 jobs running at the same time), MySQL server can be overloaded.
    You may want to limit the simultaneous connections to MySQL server for PASTEC (0 = no limit).
  •  "max_profiles_evalue: 1e-3" <- only matches on profiles bank below this e-value are kept
  •  "min_TE_profiles_coverage: 20"  <- minimal coverage between consensus and profiles for TEs
  •  "min_HG_profiles_coverage: 75"  <- minimal coverage between consensus and profiles host genes (if no other classif was found)
  •  "max_helitron_extremities_evalue: 1e-3" <- above this evalue, do not consider the match in regards to helitron classification
  •  "min_TE_bank_coverage: 5" <- min coverage above which match gets disregarded
  •  "min_HG_bank_coverage: 95" <- min coverage above which a consensus is considered as host gene
  •  "min_HG_bank_identity: 90" <- min identity above which a consensus is considered as host gene (used in conjunction with the coverage threshold above)
  •  "min_rDNA_bank_coverage: 95" <- min coverage above which a consensus is considered as rDNA
  •  "min_rDNA_bank_identity: 90" <- min identity above which a consensus is considered as rDNA (used in conjunction with the coverage threshold above)
  •  "min_SSR_coverage: 75" <- minimal percentage of SSR in the consensus
  •  "max_SSR_size: 100" <- max size to consider consensus as SSR
  •  "remove_redundancy: yes" <- to remove consensus classified as TEs considered as identical
  •  "min_redundancy_identity: 95" <- minimal identity beyond which two consensus are considered identical
  •  "min_redundancy_coverage: 98" <- minimal coverage beyond which two consensus are considered identical
  •  "rev_complement: yes" <- reverse sequence on negative strand
  •  "add_noCat_bestHitClassif: no" <- if yes, for each 'unclassified' consensus, PASTEC will specify the classification of closest TE consensus found by BLAST.

You can apply several post processes on your classified library:

  •  Remove the redundancy between consensus considered identical
    • "remove_redundancy: yes"
    • "min_redundancy_identity: 95"
    • "min_redundancy_coverage: 98"
  •  Add reverse in header sequence if it is the case  "rev_complement: yes"

 

  • Add information to noCat classified sequences. These one are compared with valid classified sequences and new classification is proposed.
    "add_noCat_bestHitClassif: yes"

Results:
The results are in three files types : fasta file (.fa extension) with TEs sequences, classification file (.classif extension see "Classification output" for details) containing classication information for each TE and statistics file   (.classif_stats.txt extension) about TEs library. ProjectName.fa, ProjectName.classif and ProjectName.classif_stats.txt.
If parameter remove_redundancy: yes, there are ProjectName_withoutRedundancy.fa, ProjectName_withoutRedundancy.classif and ProjectName_withoutRedundancy.classif_stats.txt.
If parameter rev_complement: yes, the headers of the reverse-complemented sequences is end tagged by "_reversed" in the fasta file         ProjectName_withoutRedundancy_negStrandReversed.fa, and in classif file ProjectName_withoutRedundancy_negStrandReversed.classif.
If parameter add_noCat_bestHitClassif: yes (no by default), the header of each sequence contains the best hit information in fasta file  <fileName>_noCatBestHit.fa
 ==>The final classified consensus library is ProjectName_denovoTEsLib.fa, ProjectName_denovoTEsLib.classif, ProjectName_denovoTEsLib.classif_stats.txt and the corresponding ProjectName_consensus_classif table in database.
WARNING : It also proposed a pre-curated library, where LTR sequences will be classified at super-family Gypsy/copia if they have only Gypsy or Copia coding features in ProjectName_denovoLibTEs_PC.classif file and ProjectName_denovoLibTEs_PC.classif_stats.txt as stats file.

Classification output

Classification file is tabulated on 12 colomns as the classification table:

  •  Consensus name in 'Seq_name',
  •  Consensus length (bp) in 'length',
  •  Consensus orientation in 'strand' values + or -,
  •  Is consensus confused? in 'confused' values True or False,
  •  Consensus class in 'class' values I or II if consensus classified as TE, value NA if consensus classified as not TE,
  •  Consensus order in 'order' values all order from Class I and class II if consensus classified as TE, value NA if consensus classified as not TE,
  •  Wicker code attributed to consensus in 'Wcode' values see in Wicker et al., Nat.Rev.Genet., 2007 if consensus classified as TE, value PHG/SSR/PrDNA if consensus classified as not TE,
  •  Consensus super family in 'sFamily' values see in Wicker et al., Nat.Rev.Genet., 2007 if consensus classified as TE, value NA if consensus classified as not TE,
  •  Confidence index in 'CI' value calculated by PASTEC using the decision rules from DecisionRule.yaml,
  •  Features found by homology in 'coding',
  •  Structural features in 'struct',
  •  Features from profiles tagged as 'other' and/or from other classification of confused consensus in 'other'
  • "NA" means "Not Available"

WARNING: if consensus is confused, the 'class', 'order', 'Wcode', 'sFamily' and 'CI' fields will contain all information separated by |.

Classification Examples:
blumeria_B_G590_Map20    391    .    False    Unclassified    Unclassified    NA    NA    NA    coding=()    struct=()    other=()
blumeria_B_G28830_Map3    5742    -    False    I    LTR    RLX    NA    50    coding=(TE_BLRtx: GYPSY10_I:classI:LTR_retrotransposon: 9.61%; TE_BLRx: ORF1p_TCN4#2:classI:LTR_retrotransposon: 35.45%; profiles: PF00078.19_RVT_1_RT_32.0: 90.91%, PF00078.19_RVT_1_RT_35.2: 90.91%, PF00665.18_rve_INT_32.0: 97.63%, PF00098.15_zf-CCHC_GAG_19.8: 83.33%, PF00665.18_rve_INT_18.8: 97.63%, PF00098.15_zf-CCHC_GAG_17.9: 83.33%)    struct=(TElength: >4000bps; TermRepeats: termLTR: 224; ORF: >1000bps, >1000bps)    other=(NA)
PN_sequenceA    8834    .    False    I    TRIM    RXX-TRIM    NA    40    coding=(NA)    struct=(TElength: <700bps; TermRepeats: termLTR: 4156)    other=(NA)
Alyr_B_R1138_Map20    8098    -    False    II    TIR    DTX    NA    50    coding=(profiles: PF03004.6_Transposase_24_Tase_14.0: 98.05%, PF02992.6_Transposase_21_Tase_25.0: 98.71%, PF03004.6_Transposase_24_Tase_25.0: 98.05%)    struct=(TElength: >1000bps; TermRepeats: termTIR: 33; ORF: >1000bps)    other=(NA)
Alyr_B_G7351_Map4    16061    -    True    I|II|I    LTR|Maverick|DIRS    RLX|DMX|RYX    NA    28|25|25    coding=(TE_BLRtx: GYPSY10_I:classI:LTR_retrotransposon: 9.15%, GYPSY10_I:classI:LTR_retrotransposon: 9.15%; TE_BLRx: GYPSY10_AG2p:classI:LTR_retrotransposon: 55.92%, GYPSY10_AG2p:classI:LTR_retrotransposon: 58.64%; profiles: PF00078.19_RVT_1_RT_32.0: 98.76%, PF00075.16_RnaseH_RH_7.5: 50.30%, PF00078.19_RVT_1_RT_35.2: 98.76%, PF03732.9_Retrotrans_gag_GAG_27.2: 98.00%, PF00665.18_rve_INT_32.0: 98.22%, PF03732.9_Retrotrans_gag_GAG_14.8: 98.00%, PF00665.18_rve_INT_18.8: 98.22%, PF00075.16_RnaseH_RH_21.9: 50.30%)    struct=(TElength: >10000bps; ORF: >1000bps, >1000bps, >1000bps, >1000bps, >3000bps)    other=(TE_BLRtx: GYPSY10_I:classI:LTR_retrotransposon: 9.15%, GYPSY10_I:classI:LTR_retrotransposon: 9.15%; TE_BLRx: GYPSY10_AG2p:classI:LTR_retrotransposon: 55.92%, GYPSY10_AG2p:classI:LTR_retrotransposon: 58.64%; profiles: PF00078.19_RVT_1_RT_32.0: 98.76%, PF00075.16_RnaseH_RH_7.5: 50.30%, PF00078.19_RVT_1_RT_35.2: 98.76%, PF03732.9_Retrotrans_gag_GAG_27.2: 98.00%, PF00665.18_rve_INT_32.0: 98.22%, PF03732.9_Retrotrans_gag_GAG_14.8: 98.00%, PF00665.18_rve_INT_18.8: 98.22%, PF00075.16_RnaseH_RH_21.9: 50.30%; TElength: >15000bps; TermRepeats: termTIR: 11; ORF: >1000bps, >1000bps, >1000bps, >1000bps, >3000bps)
MelaMite    324    .    False    II    MITE    DXX-MITE    NA    20    coding=(NA)    struct=(TElength: <700bps; TermRepeats: termTIR: 23)    other=(NA)
Seq_SSR    63    .    False    SSR    NA    NA    NA    100    coding=(NA)    struct=(TElength: <100bps; SSRCoverage=0.00)    other=(NA)
DmelCaf1_2_2-L-B230-Map1_reversed    7448    +    False    I    LTR    RLX    NA    64    coding=(TE_BLRtx: GYPSY10_I:classI:LTR_retrotransposon: 21.37%; TE_BLRx: GYPSY10_AG2p:classI:LTR_retrotransposon: 50.72%; profiles: PF00078.19_RVT_1_RT_32.0: 100.00%, PF00078.19_RVT_1_RT_35.2: 100.00%, PF00077.12_RVP_AP_30.9: 70.54%, PF00665.18_rve_INT_32.0: 97.63%, PF00665.18_rve_INT_18.8: 97.63%, PF00077.12_RVP_AP_1.0: 70.54%)    struct=(TElength: >4000bps; TermRepeats: termLTR: 531; ORF: >1000bps RT RT INT INT, >1000bps)    other=(NA)
denovoFragaria-B-G1199-Map5    1530    .    False    NA    NA    PHG    NA    100    coding=(Other_profiles: PF00664.18_ABC_membrane_NA_OTHER_22.0: 26.55%, PF00005.22_ABC_tran_NA_OTHER_23.1: 100.00%)    struct=(NA)    other=(NA)
Frag-B-R4858-Map5    2282    +    False    II    TIR    DTX    NA    62    coding=(TE_BLRtx: hAT-13_VVi:ClassII:TIR:hAT:?: 30.58%; TE_BLRx: hAT-13_VVi_1p:ClassII:TIR:hAT:?: 41.55%, hAT-1_PPe_1p:ClassII:TIR:hAT:?: 12.13%)    struct=(TElength: >1000bps) other=(Other_profiles: PF02892.10_zf-BED_NA_OTHER_20.6: 97.78%)
   
Banks

Known TEs:

You can use Repbase Update, which is a famous databank of know repeats (Jurka J. et al., Cytogentic and Genome Research, 2005).
To use it, you will have to register on "www.girinst.org".
Once you are registered, you can download a compressed archive with Repbase Update specifically formatted for REPET.
The archive contains two fasta files, one with nucleotide sequences given to BLASTER with blastn parameter "TE_BLRn: yes" either/or tblastx parameter "TE_BLRtx: yes". This bank file name should replace "<bank_of_TE_nucleotide_sequences_such_as_Repbase>" in "TE_nucl_bank: " parameter.
The other file with aminoacid sequences given to BLASTER with blastx parameter "TE_BLRx: yes". This bank file name should replace "<bank_of_TE_amino-acid_sequences_such_as_Repbase>" in "TE_prot_bank:" parameter.
If you have your own databank of known repeats, you can use it instead of Repbase or concatenate it at the end of Repbase. Take care of the way the sequence headers are formatted.
   
FORMAT : fasta with additional information in headers
Wicker's classification should be added at the end of each header, using ":" as separator. The pattern is "original header", "ClassI" or "ClassII", "order name", "superfamily name". For example, if the original header is ">TEName" and it's a TE classified as L1, the new header is :     >TEName:ClassI:LINE:L1
When the classification is not available at one level, a "?" can be specified (e.g. ">TEName:?:?:?" or ">TEName:ClassI:?:?").
WARNING : the original header should not have any ":".

HMM profiles:

You can download the bank ProfilesBankForREPET_Pfam26.0_GypsyDB.hmm at http://urgi.versailles.inra.fr/download/repet/, which comes from Pfam database (M. Punta, et al., Nucleic Acids Research, 2012) and also from GypsyDB ().
This version is specially formatted for REPET.
WARNING : this bank can only be used with hmmer3.
You should set "TE_HMM_profiles:" parameter with this bank. It is possible to search HMM profiles via hmmer2 or hmmer3, according to the format of the bank. You can also use your own bank, but each profile name has to be well formatted.

FORMAT : hmmer (.hmm) with additional information in NAME
According to the Pfam field, the original NAME is replaced by: <ACC>_<NAME>_<type according to key words found in DESC>_<GA>
For example, if the Pfam format is:
    //
    HMMER3/b [3.0 | March 2010]
    NAME  DUF3701
    ACC   PF12482.3
    DESC  Phage integrase protein
    LENG  96
    ALPH  amino
    RF    no
    CS    no
    MAP   yes
    DATE  Tue Sep 27 23:45:10 2011
    NSEQ  38
    EFFN  2.017822
    CKSUM 1455990630
    GA    21.50 21.50;
    ...
    the new NAME is  : PF12482.3_DUF3701_INT_21.5
    Details about TE profile type (and its abbreviation) can be found in the paper of Wicker et al. (see above).
    When no information is found in DESC (or NAME) to determine the profile type, we consider the profile not TE specific, and replace <type> by OTHER. These profiles can be used to classify sequences as "potential host genes".

Host genes:
If you have a bank of cDNA from the host genome, you can specify the file name in "HG_nucl_bank:" parameter.
FORMAT : fasta

rDNA:
If you have a bank of rDNA from eukaryota, you can specify the file name in "rDNA_bank:" parameter.
FORMAT : fasta

eZ Publish™ copyright © 1999-2024 eZ Systems AS