REPET practical course urgi

Practical course: Transposable Elements identification with The REPET package

Use case: 4th Chromosome of Arabidopsis thaliana

Run the REPET pipelines

Set up the REPET package environment

  • Connect to the virtual machine containing the REPET installation:
ssh -XY -p $port centos@localhost
  • By default, your home directory is "/home/centos"
  • To start a new project, create a folder with the project name « ThalChr4 » :
mkdir ThalChr4
  • Change directory
cd ThalChr4
  • Check the database parameters in the « setEnv.sh » configuration file:
more ~/data/setEnv.sh
export REPET_HOST="localhost"
export REPET_USER="orepet"
export REPET_PW="repet_pw"
export REPET_DB="repet"
export REPET_PORT="3306"

export REPET_PATH="/usr/local/REPET_linux-x64-2.5"
export PYTHONPATH=$REPET_PATH
export REPET_JOBS=MySQL
export REPET_JOB_MANAGER=slurm
export REPET_QUEUE=slurm
export SMART_PATH=$REPET_PATH/SMART/Java/Python
export PATH=$SMART_PATH:$REPET_PATH/bin:$PATH
...
  • Source the environment before launching a REPET pipeline:
. ~/data/setEnv.sh
  • Test the connection to the MySQL database:
mysql -h $REPET_HOST -u $REPET_USER -p$REPET_PW $REPET_DB
  • Exit the MySQL client:
quit
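
After sourcing setEnv.sh, it can be useful to confirm that the connection variables are really defined in the current shell. The sketch below is only an illustration: the function name check_repet_env is hypothetical (it is not part of REPET), and the variable names are the ones listed in setEnv.sh above.

```shell
# Report which of the REPET variables defined in setEnv.sh are unset or empty.
# Returns 0 when all of them are set, 1 otherwise.
check_repet_env() {
    missing=0
    for name in REPET_HOST REPET_USER REPET_PW REPET_DB REPET_PORT REPET_PATH; do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return $missing
}
```

Typical use would be ". ~/data/setEnv.sh && check_repet_env" before testing the MySQL connection.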

Start TEdenovo pipeline

  • Create a directory to launch TEdenovo
mkdir TEdenovo; cd TEdenovo
  • Make a link (ln -s) to the input fasta file of the genomic sequences. The genome fasta file must be named “project_name.fa”:
ln -s ~/data/TA_Chr4.fa ThalChr4.fa
  • Make a link (ln -s) to access the databanks used in similarity based classification.
ln -s ~/data/ProfilesBankForREPET_Pfam27.0_GypsyDB.hmm
ln -s ~/data/repbase20.05_aaSeq_cleaned_TE.fsa
ln -s ~/data/repbase20.05_ntSeq_cleaned_TE.fsa
ln -s ~/data/rRNA_Eukaryota.fsa
  • Copy the configuration file « TEdenovo.cfg », into your TEdenovo working directory:
(The original TEdenovo.cfg is available at “$REPET_PATH/config/TEdenovo.cfg”)
cp ~/data/TEdenovo.cfg ./
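
A link whose target is missing only fails later, when a pipeline step tries to read the bank. The sketch below lists such dangling links in a directory. The function name find_broken_links is hypothetical and not part of REPET.

```shell
# Print every symbolic link in a directory whose target does not exist.
# $1 = directory to inspect (defaults to the current directory).
find_broken_links() {
    dir=${1:-.}
    for f in "$dir"/*; do
        # -L: the entry is a symlink; ! -e: its target cannot be resolved.
        if [ -L "$f" ] && [ ! -e "$f" ]; then
            echo "$f"
        fi
    done
}
```

Run from the TEdenovo working directory, "find_broken_links ." should print nothing if the links created above all resolve.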


  • Check if the configuration file is properly filled before launching TEdenovo:
gedit TEdenovo.cfg >/dev/null 2>&1 &
[repet_env]
repet_version: 2.5
repet_host: localhost
repet_user: orepet
repet_pw: repet_pw
repet_db: repet
repet_port: 3306
repet_job_manager: slurm
[project]
project_name: ThalChr4
project_dir: /home/centos/ThalChr4/TEdenovo
…
[detect_features]
…
TE_BLRn: yes
TE_BLRtx: yes
TE_nucl_bank: repbase20.05_ntSeq_cleaned_TE.fsa
TE_BLRx: yes
TE_prot_bank: repbase20.05_aaSeq_cleaned_TE.fsa
TE_HMMER: yes
TE_HMM_profiles:  ProfilesBankForREPET_Pfam27.0_GypsyDB.hmm
…
rDNA_BLRn: yes
rDNA_bank: rRNA_Eukaryota.fsa
  • The TEdenovo pipeline consists of 8 steps that can be launched with a single command line:
nohup launch_TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -f MCL >& TEdenovo.log &
P: project name
C: configuration file
f: clustering program used to find consensus families
  • Useful commands to follow the progress of steps

- job status (under slurm)

squeue

- the log files, e.g.:

more TEdenovo.log
tail TEdenovo.log


Alternatively, you can launch the TEdenovo pipeline step by step:

nohup TEdenovo.py -P name -C config.cfg -S step -[specific-step-param]


[Figure: TEdenovo steps 1-2]

  • Step 1: Genomic sequences are cut and grouped into batches
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 1 >& runS1.log &
  • Step 2: The genome is aligned to itself using BLAST
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 2 -s Blaster >& runS2.log &


[Figure: TEdenovo step 3]

  • Step 3: The repetitive HSPs from BLAST are clustered by Recon, Grouper and/or Piler
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 3 -s Blaster -c Grouper >& runS3G.log &
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 3 -s Blaster -c Recon >& runS3R.log &
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 3 -s Blaster -c Piler >& runS3P.log &


[Figure: TEdenovo step 4]

  • Step 4: A multiple alignment is computed for each cluster, and a consensus sequence is derived from each multiple alignment
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 4 -s Blaster -c Grouper -m Map >& runS4G.log &
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 4 -s Blaster -c Recon -m Map >& runS4R.log &
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 4 -s Blaster -c Piler -m Map >& runS4P.log &


[Figure: TEdenovo steps 5-6-7]

  • Step 5: Particular features are detected on each consensus, such as structural features or homology with known TEs, HMM profiles or host genes
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 5 -s Blaster -c GrpRecPil -m Map >& runS5.log &
MySQL tables are created: they contain the evidence for consensus annotation used by the PASTEC classifier
  • Step 6: The consensuses are classified according to Wicker's TE classification
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 6 -s Blaster -c GrpRecPil -m Map >& runS6.log &
  • Step 7: SSR and under-represented unclassified ("noCat") consensuses are filtered out
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 7 -s Blaster -c GrpRecPil -m Map >& runS7.log &
  • Step 8: The consensuses are clustered into families, using Blastclust or MCL, to facilitate manual curation
nohup TEdenovo.py -P ThalChr4 -C TEdenovo.cfg -S 8 -s Blaster -c GrpRecPil -m Map -f MCL >& runS8.log &


Post TEdenovo pipeline

Get all the annotations done by PASTEC on the Consensus

A GFF file is created for each analysis output of step 5 (feature detection). These GFF annotation files can be viewed in a genome browser such as IGV:

  • Copy the configuration file « CreateGFF3sForClassifFeatures.cfg » into your working directory:
cp ~/data/CreateGFF3sForClassifFeatures.cfg ./
  • Check if the configuration file is properly filled before launching CreateGFF3sForClassifFeatures:
gedit CreateGFF3sForClassifFeatures.cfg >/dev/null 2>&1 &
[repet_env]
repet_version: 2.5
repet_host: localhost
repet_user: orepet
repet_pw: repet_pw
repet_db: repet
repet_port: 3306
repet_job_manager: slurm

[project]
project_name: ThalChr4
project_dir: /home/centos/ThalChr4/TEdenovo


[gff3_TEdenovo_options]
add_classif_infos: yes
TR: yes
polyA: yes
ORF: yes
TE_BLRn: yes
TE_BLRtx: yes
TE_BLRx: yes
HG_BLRn: no
rDNA_BLRn: yes
tRNA: no
Profiles: yes
SSR: yes

[gff3_TEannot_options]
project_name_teannot: ThalChr4
annotated_copies: no

[other]
original_HSP: yes
  • Launch CreateGFF3sForClassifFeatures:
nohup CreateGFF3sForClassifFeatures.py -C CreateGFF3sForClassifFeatures.cfg -f ThalChr4_Blaster_GrpRecPil_Map_TEclassif_Filtered/ThalChr4_sim_denovoLibTEs_filtered.fa -v 3 >& CreateGFF3sForClassifFeatures.log &
C: Configuration file
f: Consensus sequences (fasta file) provided by TEdenovo.

A new directory "Visualization_Files" is created.

  • Reverse-complement the coordinates of the "*_reversed" consensuses

The annotations used to classify the consensuses are computed before step 6, where some consensuses are “reverse-complemented”. The coordinates of these annotations are not reversed in the database tables. The GFF files produced by CreateGFF3sForClassifFeatures.py in release 2.5 therefore need to be patched (the fix will be included in the next release, REPET v3).

- Create a new directory for reverse-complemented GFF
cd Visualization_Files/; mkdir gff_reversed

- Create a file with 2 columns: consensus name and length
cut -f1,2 ../ThalChr4_Blaster_GrpRecPil_Map_TEclassif_Filtered/classifFileFromList.classif > ThalChr4_sim_denovoLibTEs_filtered.len

- Reverse complement
for file in *.gff3;
do
# keep the GFF3 header lines
grep "^#" "$file" > "gff_reversed/$file";
while read -r TE len;
do
# te and len are passed with -v; index() matches the consensus name literally;
# the strand is swapped for "+" and "-" and kept as-is otherwise
gawk -F"\t" -v OFS="\t" -v te="$TE" -v len="$len" 'index($1,te){if($1 ~ /_reversed/){rstart=len-$5+1;rend=len-$4+1;rstr=$7; if($7=="+"){rstr="-"}; if($7=="-"){rstr="+"};print $1,$2,$3,rstart,rend,$6,rstr,$8,$9}else{print $0}}' "$file";
done < ThalChr4_sim_denovoLibTEs_filtered.len >> "gff_reversed/$file";
done
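
The coordinate arithmetic of this patch can be checked on a single feature. For a consensus of length len, a feature at [start, end] maps to [len - end + 1, len - start + 1] on the reverse-complemented sequence, and its strand is swapped. The sketch below applies that rule to one feature. The function name reverse_feature is hypothetical and only serves this illustration.

```shell
# Map one feature onto the reverse-complemented consensus.
# $1 = consensus length, $2 = start, $3 = end, $4 = strand (+, - or .)
# Prints: new_start <TAB> new_end <TAB> new_strand
reverse_feature() {
    awk -v len="$1" -v s="$2" -v e="$3" -v strand="$4" 'BEGIN {
        # Swap the strand; any other value (e.g. ".") is left unchanged.
        if (strand == "+") strand = "-"
        else if (strand == "-") strand = "+"
        printf "%d\t%d\t%s\n", len - e + 1, len - s + 1, strand
    }'
}
```

For example, "reverse_feature 1000 101 200 +" gives 801, 900 and "-"; applying the function again to that result returns the original coordinates.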

Get the multiple-alignment used to build the consensus

The "original_HSP: yes" option in the CreateGFF3sForClassifFeatures.cfg config file creates a new directory "Original_HSP_fastaAlignment" containing symbolic links to the multiple alignments used to build the consensuses.
These files can be loaded and browsed in Jalview. Note that they are not reversed, and that a base is kept in the consensus only if it is shared by at least 2 HSPs.

Visualization_Files/Original_HSP_fastaAlignment/*.fa_aln

Start TEannot pipeline

  • Copy the configuration file « TEannot.cfg » into your working directory:
(The original TEannot.cfg file is available at $REPET_PATH/config/TEannot.cfg)
cd ; cd ThalChr4
mkdir TEannot/; cd TEannot/
cp ~/data/TEannot.cfg ./
  • Check if the configuration file is properly filled before launching TEannot:
gedit TEannot.cfg >/dev/null 2>&1 &
[repet_env]
repet_version: 2.5
repet_host: localhost
repet_user: orepet
repet_pw: repet_pw
repet_db: repet
repet_port: 3306
repet_job_manager: slurm
[project]
project_name: ThalChr4
project_dir: /home/centos/ThalChr4/TEannot
…
[export]
…
gff3_merge_redundant_features: yes
gff3_compulsory_match_part: yes
gff3_with_genomic_sequence: no
gff3_with_TE_length: yes
gff3_with_classif_info: yes
classif_table_name: ThalChr4_sim_consensus_classif
  • Link to the TEdenovo consensus library
This library contains the consensuses that remain after removing the “noCat” consensuses built from fewer than 10 copies and the consensuses classified as SSR
ln -s ../TEdenovo/ThalChr4_Blaster_GrpRecPil_Map_TEclassif_Filtered/ThalChr4_sim_denovoLibTEs_filtered.fa ThalChr4_refTEs.fa
  • Link to the input fasta file of the genomic sequences
ln -s ../TEdenovo/ThalChr4.fa
  • Source the environment before launching the REPET pipeline (needed if you opened a new terminal window after TEdenovo)
. ~/data/setEnv.sh
  • The TEannot pipeline consists of 8 steps that you can launch with a single command line:
nohup launch_TEannot.py -P ThalChr4 -C TEannot.cfg -e >& TEannot.log &
P: project_name

Alternatively, you can launch the TEannot.py pipeline step by step:

nohup TEannot.py -P name -C config.cfg -S step -[specific-step-param]


[Figure: TEannot steps 1-2-3]

  • Step 1: prepares all the data banks required in the next steps
nohup TEannot.py -P ThalChr4 -C TEannot.cfg -S 1 >& runS1.log &
  • Step 2: aligns the reference TE sequences on each genomic chunk via BLASTER (high sensitivity, followed by MATCHER) AND/OR REPEATMASKER (cutoff at 200) AND/OR CENSOR (high sensitivity)
nohup TEannot.py -P ThalChr4 -C TEannot.cfg -S 2 -a BLR >& runS2BLR.log &
nohup TEannot.py -P ThalChr4 -C TEannot.cfg -S 2 -a RM >& runS2RM.log &
nohup TEannot.py -P ThalChr4 -C TEannot.cfg -S 2 -a CEN >& runS2CEN.log &
  • Step 2 bis: same as step 2, but on randomized sequences, to compute the filtering threshold
nohup TEannot.py -P ThalChr4 -C TEannot.cfg -S 2 -a BLR -r >& runS2BLRr.log &
nohup TEannot.py -P ThalChr4 -C TEannot.cfg -S 2 -a RM -r >& runS2RMr.log &
nohup TEannot.py -P ThalChr4 -C TEannot.cfg -S 2 -a CEN -r >& runS2CENr.log &
  • Step 3: filters and combines the HSPs obtained at step 2, i.e. the TE annotations
nohup TEannot.py -P ThalChr4 -C TEannot.cfg -S 3 -c BLR+RM+CEN >& runS3.log &


[Figure: TEannot steps 4-5]

  • Step 4: search for satellites on the genomic sequences via TRF, Mreps and RepeatMasker
nohup TEannot.py -P ThalChr4 -C TEannot.cfg -S 4 -s TRF >& runS4TRF.log &
nohup TEannot.py -P ThalChr4 -C TEannot.cfg -S 4 -s Mreps >& runS4Mreps.log &
nohup TEannot.py -P ThalChr4 -C TEannot.cfg -S 4 -s RMSSR >& runS4RMSSR.log &
  • Step 5: merges the SSR annotations from the 3 programs used at the previous step
nohup TEannot.py -P ThalChr4 -C TEannot.cfg -S 5 >& runS5.log &
  • Step 6: compares a data bank (nucleotides or amino-acids, in fasta format, e.g. Repbase Update) with the genomic sequences via tblastx or blastx
(not mandatory) - Useful when TEs are too degenerate to build "reliable" consensuses
nohup TEannot.py -P ThalChr4 -C TEannot.cfg -S 6 -b tblastx >& runS6btx.log &
nohup TEannot.py -P ThalChr4 -C TEannot.cfg -S 6 -b blastx >& runS6bx.log &


[Figure: TEannot step 7]

  • Step 7: performs successive procedures such as removal of redundant TE, removal of SSR annotations included into TE annotations and "long join procedure"
nohup TEannot.py -P ThalChr4 -C TEannot.cfg -S 7 >& runS7.log &
  • Step 8: export annotations to GFF3 format
nohup TEannot.py -P ThalChr4 -C TEannot.cfg -S 8 -o GFF3 >& runS8.log &

Post TEannot pipeline

Concatenate all GFF files of the genome annotation into one

The outputs of TEannot step 8 are genome annotations in GFF3 format (and/or gameXML), with one file per genomic sequence:
  • In this practical course, only chromosome 4 was annotated.
cat ThalChr4_GFF3chr/*.gff3 |grep -v "##" > ThalChr4_refTEs.gff

Compute statistics of TE genome annotation

  • Launch the "PostAnalyzeTELib.py" script to generate statistics about the annotation of the TEs identified during the TEdenovo pipeline.
nohup PostAnalyzeTELib.py -a 3 -g 18585056 -p ThalChr4_chr_allTEs_nr_noSSR_join_path -s ThalChr4_refTEs_seq -v 2 >& runPostAnalyze.log &
g: Genome length (here, the length of A. thaliana 4_CHROMOSOME).
p: Project name + "_chr_allTEs_nr_noSSR_join_path"
s: Project name + "_refTEs_seq"
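
The value given to -g is the total length of the annotated sequences. If it is not known in advance, it can be computed from the genome fasta file. The sketch below is one way to do it with awk. The function name fasta_length is hypothetical and not part of REPET.

```shell
# Total number of residues in a fasta file.
# Header lines (starting with ">") are skipped; blanks and carriage
# returns inside sequence lines are not counted.
fasta_length() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); total += length($0) }
         END   { print total + 0 }' "$1"
}
```

In this course, "fasta_length ThalChr4.fa" should return the value passed to -g above, provided the file contains only chromosome 4.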

Compute and plot the consensuses coverage

  • Launch the "plotCoverage.py" script. Each output image file (plotCoverage/*.png) corresponds to a plot of the coordinates of the copies on their respective TE consensus sequence.
mkdir plotCoverage
python $PYTHONPATH/SMART/Java/Python/plotCoverage.py -i ThalChr4_refTEs.gff -f gff3 -q ThalChr4_refTEs.fa --merge -l grey -o plotCoverage/ThalChr4 >& runPlotCoverage.log &
rm *.Rout
i: Genome annotation file (gff).
f: File format of the annotation file.
q: Consensus sequences used in TEannot.
o: Output directory and project_name prefix.

Select consensus for the second round of TEannot

  • Launch the "GetSpecificTELibAccordingToAnnotation.py" script to select 3 subsets of the consensus library used in the 1st TEannot
nohup GetSpecificTELibAccordingToAnnotation.py -i ThalChr4_chr_allTEs_nr_noSSR_join_path.annotStatsPerTE.tab -t ThalChr4_refTEs_seq -v1 >& GetSpecificTELibAccordingToAnnotation.log &
i: Output file of PostAnalyzeTELib.py (statistics per consensus).
t: MySQL table containing the consensus sequences
  • Get the number of consensuses in each category
egrep -c ">" ThalChr4_chr_allTEs_nr_noSSR_join_path.annotStatsPerTE_*.fa
ThalChr4_chr_allTEs_nr_noSSR_join_path.annotStatsPerTE_FullLengthCopy.fa:95
ThalChr4_chr_allTEs_nr_noSSR_join_path.annotStatsPerTE_FullLengthFrag.fa:93
ThalChr4_chr_allTEs_nr_noSSR_join_path.annotStatsPerTE_OneCopyAndMore.fa:160
  • Get the list of consensuses with at least one full-length fragment in the genome
egrep ">" ThalChr4_chr_allTEs_nr_noSSR_join_path.annotStatsPerTE_FullLengthFrag.fa |sed 's/>//' > ThalChr4_chr_allTEs_nr_noSSR_join_path.annotStatsPerTE_FullLengthFrag.lst
DHX-incomp-chim_ThalChr4-B-P3.24-Map4_reversed
DHX-incomp-chim_ThalChr4-B-R5-Map8_reversed
DHX-incomp_ThalChr4-B-G148-Map3
DHX-incomp_ThalChr4-B-G155-Map3_reversed
DHX-incomp_ThalChr4-B-G166-Map5
DHX-incomp_ThalChr4-B-G178-Map3
DHX-incomp_ThalChr4-B-G181-Map3
DHX-incomp_ThalChr4-B-G210-Map3_reversed
...
  • This list can then be used to restrict the previous result files to these consensuses
grep -F -f ThalChr4_chr_allTEs_nr_noSSR_join_path.annotStatsPerTE_FullLengthFrag.lst A_result_file > A_result_file_FLF
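
Note that grep -F -f matches substrings, so a name such as X-G1-Map3 also selects the lines of X-G1-Map3_reversed. When the consensus name is the first tab-separated column of the result file (as in the .tab and .classif files of this course), an exact comparison avoids this. The sketch below shows the idea. The function name filter_by_name is hypothetical.

```shell
# Keep only the lines of a tab-separated file whose first column is exactly
# one of the names listed in a list file (one name per line).
# $1 = list of names, $2 = tab-separated result file
filter_by_name() {
    awk -F'\t' 'NR == FNR { keep[$1]; next } $1 in keep' "$1" "$2"
}
```

It is used like the grep command above: "filter_by_name list_file A_result_file > A_result_file_FLF".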

Results analysis

TEdenovo most interesting output files

cd ~/ThalChr4/TEdenovo

Output directories

   ThalChr4_db                                                 -step1: chunks and batches
   ThalChr4_Blaster                                            -step2: Blaster results
   ThalChr4_Blaster_Grouper                                    -step3: Grouper clustering
   ThalChr4_Blaster_Recon                                      -step3: Recon clustering
   ThalChr4_Blaster_Piler                                      -step3: Piler clustering
   ThalChr4_Blaster_Grouper_Map                                -step4: Multiple alignment for each Grouper cluster
   ThalChr4_Blaster_Recon_Map                                  -step4: Multiple alignment for each Recon cluster
   ThalChr4_Blaster_Piler_Map                                  -step4: Multiple alignment for each Piler cluster
   ThalChr4_Blaster_GrpRecPil_Map_TEclassif/detectFeatures/    -step5: Output of all programs used to detect features
   ThalChr4_Blaster_GrpRecPil_Map_TEclassif/classifConsensus   -step6: consensus classification
   ThalChr4_Blaster_GrpRecPil_Map_TEclassif_Filtered/          -step7: consensus filtered for SSR and under-represented noCat
   ThalChr4_Blaster_GrpRecPil_Map_TEclassif_Filtered_MCL       -step8: MCL clustering of consensus

TEdenovo consensus library

ThalChr4_Blaster_GrpRecPil_Map_TEclassif/classifConsensus/ThalChr4_sim_withoutRedundancy_negStrandReversed_WickerH.fa
>RLX-incomp_ThalChr4-B-G1-Map20_reversed
TCGAGTAGAGTCCTTTTAAGCTCCTTCTGCACCTGAAAACACACCAAAACATGCAATGTG
…
>RLX-incomp_ThalChr4-B-G10-Map5
TGATGCCATTCCCTATCTATTAGAACCTGAACTAAATTTGCAATTATCATGTCTATGCAT
…

TEdenovo consensus library after removal of the “noCat” consensuses built from fewer than 10 copies and of the consensuses classified as SSR (this is the library used in the TEannot pipeline)

ThalChr4_Blaster_GrpRecPil_Map_TEclassif_Filtered/ThalChr4_sim_denovoLibTEs_filtered.fa

Classification of TEdenovo consensus library according to Wicker classification nomenclature

ThalChr4_Blaster_GrpRecPil_Map_TEclassif/classifConsensus/ThalChr4_sim_withoutRedundancy_negStrandReversed_WickerH.classif

- Legend

Seq_name	length	strand	status	class_classif	order_classif	completeness	evidence
RXX-LARD_ThalChr4-B-G100-Map9	11362	.	ok	I	LARD	NA	CI=20; struct=(TElength: >1000bps; TermRepeats: termLTR: 4854); other=(Other_profiles: PF06721.6_DUF1204_NA_OTHER_21.0: 398.71%(100.00%), PF01657.12_Stress-antifung_NA_OTHER_25.0: 185.85%(99.06%); TermRepeats: non-termLTR: 3248; SSRCoverage=0.07)
RXX-LARD_ThalChr4-B-G102-Map3	15048	.	ok	I	LARD	NA	CI=40; struct=(TElength: >4000bps; TermRepeats: termLTR: 5194); other=(TermRepeats: non-termLTR: 6164; SSR: (TAAACCC)19_end; SSRCoverage=0.46)
RXX-LARD_ThalChr4-B-G103-Map6_reversed	15894	-	ok	I	LARD	NA	CI=20; struct=(TElength: >1000bps; TermRepeats: termLTR: 6149); other=(TE_BLRtx: ATENSAT1:ClassII:?:?: 99.13%; TermRepeats:non-termLTR: 5902; SSRCoverage=0.25)
noCat_ThalChr4-B-G107-Map3	473	.	ok	noCat	noCat	NA	CI=NA; struct=(SSRCoverage=0.00)
RLX-incomp_ThalChr4-B-G10-Map5	1295	.	ok	I	LTR	incomplete	CI=14; coding=(TE_BLRtx: ATHILA0_I:ClassI:LTR:Gypsy: 10.33%, ATHILA3_LTR:ClassI:LTR:Gypsy: 6.89%); struct=(TElength: >700bps); other=(SSRCoverage=0.13)
RLX-incomp_ThalChr4-B-G110-Map4_reversed	381	-	ok	I	LTR	incomplete	CI=7; coding=(TE_BLRtx: ATCOPIA9LTR:ClassI:LTR:Copia: 92.43%); struct=(TElength: <700bps); other=(SSRCoverage=0.26)
RXX-TRIM_ThalChr4-B-G115-Map7	630	.	ok	I	TRIM	NA	CI=40; struct=(TElength: <700bps; TermRepeats: termLTR: 274); other=(TermRepeats: non-termLTR: 176; SSRCoverage=0.08)
DTX-comp_ThalChr4-B-G116-Map5	412	+	ok	II	TIR	complete	CI=12; coding=(TE_BLRtx: ATTIRX1C:ClassII:TIR:?: 99.75%); struct=(TElength: <700bps; TermRepeats: termTIR: 31); other=(SSRCoverage=0.11)

Classification statistics

ThalChr4_Blaster_GrpRecPil_Map_TEclassif/classifConsensus/ThalChr4_sim_withoutRedundancy_negStrandReversed_WickerH.classif_stats.txt
LARD total (RXX-LARD): 4 (2.23%)
LINE incomp: 2 (1.12%)
LINE potential chimeric*: 2 (1.12%)
LINE total (RIX): 2 (1.12%)
LTR comp: 5 (2.79%)
LTR incomp: 72 (40.22%)
LTR potential chimeric*: 3 (1.68%)
LTR total (RLX): 77 (43.02%)
SINE incomp: 2 (1.12%)
SINE potential chimeric*: 2 (1.12%)
SINE total (RSX): 2 (1.12%)
TRIM potential chimeric*: 1 (0.56%)
TRIM total (RXX-TRIM): 5 (2.79%)

ClassI + one order: 90 (50.28%)
ClassI potential chimeric*: 8 (4.47%)
ClassI total (RXX): 90 (50.28%)

Helitron incomp: 30 (16.76%)
Helitron potential chimeric*: 3 (1.68%)
Helitron total (DHX): 30 (16.76%)
Maverick incomp: 1 (0.56%)
Maverick total (DMX): 1 (0.56%)
TIR comp: 3 (1.68%)
TIR incomp: 32 (17.88%)
TIR potential chimeric*: 5 (2.79%)
TIR total (DTX): 35 (19.55%)

ClassII + noCat order: 3 (1.68%)
ClassII + one order: 66 (36.87%)
ClassII potential chimeric*: 8 (4.47%)
ClassII total (DXX): 69 (38.55%)

PotentialHostGene total: 5 (2.79%)
SSR total: 1 (0.56%)

Nb Potential chimeric*: 16 (8.94%)
Nb noCat at class and order levels (noCat): 14 (7.82%)
	-------------------------Summary--------------------------------
RXX: 90 (50.28%)
DXX: 69 (38.55%)
PotentialHostGene: 5 (2.79%)
SSR: 1 (0.56%)
noCat: 14 (7.82%)
TOTAL: 179 (100.00%)
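
Similar counts can be recomputed directly from the .classif table, whose 5th and 6th columns are class_classif and order_classif. The sketch below is a minimal tally, assuming a tab-separated file with the columns listed in the legend above. The function name count_by_order is hypothetical, and the result is not guaranteed to reproduce every line of the statistics file (for example the "potential chimeric" counts).

```shell
# Count consensuses per (class, order) pair in a .classif table.
# A possible header line starting with "Seq_name" is ignored.
# Prints: class <TAB> order <TAB> count, sorted by class then order.
count_by_order() {
    awk -F'\t' '$1 != "Seq_name" { n[$5 "\t" $6]++ }
                END { for (k in n) print k "\t" n[k] }' "$1" | LC_ALL=C sort
}
```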

MCL clustering output files

-Clustering statistics (the values [1, 2 .. n] of the 1st column correspond to the MCL clusters [MCL1, MCL2 .. MCLn]):
ThalChr4_Blaster_GrpRecPil_Map_TEclassif_Filtered_MCL/ThalChr4_sim_denovoLibTEs_filtered_MCL_statsPerCluster.tab
cluster	sequencesNb	sizeOfSmallestSeq	sizeOfLargestSeq	averageSize	medSize
1	11	1295	11109	4268	3248
2	10	732	12810	5590	4408
3	6	614	2113	1641	1861
4	5	8905	15048	10914	9793
5	5	381	9694	4181	2336
-Clustering global statistics:
ThalChr4_Blaster_GrpRecPil_Map_TEclassif_Filtered_MCL/ThalChr4_sim_denovoLibTEs_filtered_MCL_globalStatsPerCluster.txt
nb of clusters: 27
nb of clusters with 1 sequence: 0
nb of clusters with 2 sequences: 12
nb of clusters with >2 sequences: 15 (72 sequences)
nb of sequences: 96
nb of sequences in the largest cluster: 11
nb of sequences in the smallest cluster: 2
size of the smallest sequence: 381
size of the largest sequence: 15894
average sequences size: 2926
median sequences size: 1797
-Consensus Library with header containing the cluster name [MCL1, MCL2..MCLn]:
ThalChr4_Blaster_GrpRecPil_Map_TEclassif_Filtered_MCL/ThalChr4_sim_denovoLibTEs_filtered_MCL.fa
>RLX-incomp_MCL1_ThalChr4-B-G1-Map20_reversed
TCGAGTAGAGTCCTTTTAAGCTCCTTCTGCACCTGAAAACACACCAAAACATGCAATGTG
…
>RLX-incomp_MCL1_ThalChr4-B-G10-Map5
TGATGCCATTCCCTATCTATTAGAACCTGAACTAAATTTGCAATTATCATGTCTATGCAT
-Create a list (tabulated file) with 2 columns "Cluster_id TE_id":
cd ThalChr4_Blaster_GrpRecPil_Map_TEclassif_Filtered_MCL
gawk -F"_MCL|_ThalChr4" '{if(/>/){gsub(">","",$0);print "MCL\t"$2"\t"$1"_ThalChr4"$3}}' ThalChr4_sim_denovoLibTEs_filtered_MCL.fa \
| sort -nk2,2 \
| gawk -F"\t" '{print $1$2"\t"$3}' > ThalChr4_sim_denovoLibTEs_filtered_MCL.lst
ThalChr4_sim_denovoLibTEs_filtered_MCL.lst
MCL1	RLX-comp_ThalChr4-B-R139-Map20
MCL1	RLX-comp_ThalChr4-B-R51-Map20_reversed
MCL1	RLX-incomp_ThalChr4-B-G10-Map5
MCL1	RLX-incomp_ThalChr4-B-G1-Map20_reversed
MCL1	RLX-incomp_ThalChr4-B-G26-Map6
MCL1	RLX-incomp_ThalChr4-B-G55-Map3
MCL1	RLX-incomp_ThalChr4-B-G5-Map20
MCL1	RLX-incomp_ThalChr4-B-G60-Map11
MCL1	RLX-incomp_ThalChr4-B-G6-Map9
MCL1	RLX-incomp_ThalChr4-B-R163-Map3_reversed
MCL1	RLX-incomp_ThalChr4-B-R66-Map8_reversed
MCL2	RLX-comp_ThalChr4-B-G93-Map3_reversed
MCL2	RLX-incomp-chim_ThalChr4-B-G2-Map20_reversed
... 
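
The two-column list makes it easy to cross-check the cluster sizes against the sequencesNb column of the statsPerCluster.tab file. The sketch below counts the consensuses per cluster. The function name cluster_sizes is hypothetical.

```shell
# Count the number of consensuses in each MCL cluster.
# $1 = tab-separated list with columns "Cluster_id TE_id"
# Prints: cluster_id <TAB> number_of_sequences
cluster_sizes() {
    cut -f1 "$1" | LC_ALL=C sort | uniq -c | awk '{ print $2 "\t" $1 }'
}
```

With the listing above, "cluster_sizes ThalChr4_sim_denovoLibTEs_filtered_MCL.lst" should report 11 sequences for MCL1, in agreement with cluster 1 in the statistics table.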

TEannot most interesting output files

cd ~/ThalChr4/TEannot

Output directories

   ThalChr4_db             -step1: chunks and batches
   ThalChr4_TEdetect       -step2 to 7: Censor, RepeatMasker, Blaster on genome sequences and combined results
   ThalChr4_TEdetect_rnd   -step2: Censor, RepeatMasker, Blaster on random genome sequences and threshold file
   ThalChr4_SSRdetect      -step4 & 5: TRF, Mreps and RepeatMaskerSSR on genome sequences and combined SSR results
   ThalChr4_GFF3chr        -step8: A gff3 file for each genome sequence annotated
   ThalChr4_gameXMLchr     -step8: A gamexml file for each genome sequence annotated

Genome annotation file

ThalChr4_refTEs.gff
##gff-version 3
##sequence-region 4_CHROMOSOME 1 18585056
4_CHROMOSOME	ThalChr4_REPET_TEs	match	18250001	18250118	0.0	-	.	ID=ms2_4_CHROMOSOME_DHX-incomp_ThalChr4-B-G181-Map3;Target=DHX-incomp_ThalChr4-B-G181-Map3 342 458;TargetLength=782;TargetDescription=CI:30 coding:(TE_BLRtx: ATREP10D:ClassII:Helitron:Helitron: 100.00%) struct:(TElength: >700bps helitronExtremities: ATREP10D:ClassII:Helitron:Helitron: (0.0 | 4 | 779)) other:(SSRCoverage:0.23);Identity=84.4
4_CHROMOSOME	ThalChr4_REPET_TEs	match	18250935	18251291	0.0	-	.	ID=ms3_4_CHROMOSOME_DHX-incomp_ThalChr4-B-G181-Map3;Target=DHX-incomp_ThalChr4-B-G181-Map3 218 563;TargetLength=782;TargetDescription=CI:30 coding:(TE_BLRtx: ATREP10D:ClassII:Helitron:Helitron: 100.00%) struct:(TElength: >700bps helitronExtremities: ATREP10D:ClassII:Helitron:Helitron: (0.0 | 4 | 779)) other:(SSRCoverage:0.23)
4_CHROMOSOME	ThalChr4_REPET_TEs	match_part	18250935	18251134	0.0	-	.	ID=mp3-1_4_CHROMOSOME_DHX-incomp_ThalChr4-B-G181-Map3;Parent=ms3_4_CHROMOSOME_DHX-incomp_ThalChr4-B-G181-Map3;Target=DHX-incomp_ThalChr4-B-G181-Map3 370 563;Identity=74.2
4_CHROMOSOME	ThalChr4_REPET_TEs	match_part	18251248	18251291	0.0	-	.	ID=mp3-2_4_CHROMOSOME_DHX-incomp_ThalChr4-B-G181-Map3;Parent=ms3_4_CHROMOSOME_DHX-incomp_ThalChr4-B-G181-Map3;Target=DHX-incomp_ThalChr4-B-G181-Map3 218 260;Identity=74.2
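
In these lines, the Target attribute gives the consensus name followed by the start and end of the aligned region on that consensus. The sketch below extracts this information for every "match" feature, assuming the attribute layout shown above (attributes separated by ";" in the 9th column). The function name gff3_targets is hypothetical.

```shell
# For each "match" feature of a GFF3 file, print the consensus name and the
# start/end of the aligned region on the consensus (from the Target attribute).
# "match_part" features and header lines are ignored.
gff3_targets() {
    awk -F'\t' '$3 == "match" {
        n = split($9, attr, ";")
        for (i = 1; i <= n; i++) {
            # "Target=" only: TargetLength and TargetDescription are skipped.
            if (attr[i] ~ /^Target=/) {
                sub(/^Target=/, "", attr[i])
                split(attr[i], t, " ")
                print t[1] "\t" t[2] "\t" t[3]
            }
        }
    }' "$1"
}
```

A possible use is "gff3_targets ThalChr4_refTEs.gff | head" to inspect which part of each consensus is covered by the first annotated copies.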

Classification of TEdenovo consensus library corresponding to ThalChr4_refTEs.fa

gawk '{if(/>/){gsub(">","",$0);print}}' ThalChr4_refTEs.fa >ThalChr4_refTEs.lst
egrep -f ThalChr4_refTEs.lst ../TEdenovo/ThalChr4_Blaster_GrpRecPil_Map_TEclassif/classifConsensus/ThalChr4_sim_withoutRedundancy_negStrandReversed_WickerH.classif > ThalChr4_refTEs.classif

Genome annotation global statistics file

ThalChr4_chr_allTEs_nr_noSSR_join_path.globalAnnotStatsPerTE.txt
nb of sequences: 164
nb of matched sequences: 160
cumulative coverage: 2862833 bp
coverage percentage: 15.40%

total nb of TE fragments: 7393
total nb full-length fragments: 261 (3.53%)
total nb of TE copies: 6319
total nb full-length copies: 284 (4.49%)
families with full-length fragments: 93 (56.71%)
 with only one full-length fragment: 19
 with only two full-length fragments: 31
 with only three full-length fragments: 18
 with more than three full-length fragments: 25
families with full-length copies: 95 (57.93%)
 with only one full-length copy: 18
 with only two full-length copies: 27
 with only three full-length copies: 21
 with more than three full-length copies: 29
mean of median identity of all families: 82.76 +- 7.72
mean of median length percentage of all families: 22.12 +- 25.81
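
The percentages in this file are simple ratios and can be recomputed by hand. The coverage percentage is the cumulative coverage divided by the genome length given to -g, and the family percentages are relative to the number of sequences. The sketch below does the arithmetic. The function name percent is hypothetical.

```shell
# Percentage of "part" in "total", rounded to two decimals.
# $1 = part, $2 = total
percent() {
    LC_ALL=C awk -v part="$1" -v total="$2" \
        'BEGIN { printf "%.2f\n", 100 * part / total }'
}
```

With the values above, "percent 2862833 18585056" gives 15.40 (coverage percentage) and "percent 93 164" gives 56.71 (families with full-length fragments).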

TE annotation statistics per consensus

ThalChr4_chr_allTEs_nr_noSSR_join_path.annotStatsPerTE.tab
TE	length	covg	frags	fullLgthFrags	copies	fullLgthCopies	meanId	sdId	minId	q25Id	medId	q75Id	maxId	meanLgth	sdLgth	minLgth	q25Lgth	medLgth	q75Lgth	maxLgth	meanLgthPerc	sdLgthPerc	minLgthPerc	q25LgthPerc	medLgthPerc	q75LgthPerc	maxLgthPerc
DHX-incomp-chim_ThalChr4-B-G18-Map20	1907	17341	123	0	115	0	77.65	4.88	69.00	74.20	77.00	80.20	95.10	151.85	155.72	21	63.00	102.00	179.00	1152	7.96	8.17	1.10	3.30	5.35	9.39	60.41
…
RXX-LARD_ThalChr4-B-G100-Map9	11362	50882	45	4	44	4	81.32	9.76	72.70	74.30	76.20	88.00	100.00	1613.48	3554.16	83	137.00	216.00	297.00	11550	14.20	31.28	0.73	1.21	1.90	2.61	101.65
RXX-TRIM_ThalChr4-B-G115-Map7	630	39	1	0	1	0	87.50	0.00	87.50	87.50	87.50	87.50	87.50	39.00	0.00	39	39.00	39.00	39.00	39	6.19	0.00	6.19	6.19	6.19	6.19	6.19
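
This table can be queried with awk to spot consensuses that deserve a closer look, for instance those without any full-length copy. The sketch below assumes a tab-separated file whose first line is the header shown above, with copies in column 6 and fullLgthCopies in column 7. The function name no_full_length_copy is hypothetical.

```shell
# List the consensuses that have no full-length copy in the genome.
# $1 = annotStatsPerTE.tab file (tab-separated, header on the first line)
# Prints: consensus name <TAB> number of copies
no_full_length_copy() {
    awk -F'\t' 'NR > 1 && $7 == 0 { print $1 "\t" $6 }' "$1"
}
```

For the rows shown above, this would list DHX-incomp-chim_ThalChr4-B-G18-Map20 (115 copies) and RXX-TRIM_ThalChr4-B-G115-Map7 (1 copy), but not RXX-LARD_ThalChr4-B-G100-Map9, which has 4 full-length copies.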


Annexes

Additional commands

  • If you need to restart the REPET pipeline, you must delete all the folders created by REPET and clear the “jobs” table in the database
rm -r ThalChr4_*
mysql -h $REPET_HOST -u $REPET_USER -p$REPET_PW $REPET_DB
mysql> show tables;
mysql> select * from jobs;
mysql> delete from jobs;
mysql> quit
  • To delete all the tables (when relaunching both pipelines):
ListAndDropTables.py -l "*" -C TEdenovo.cfg -d "*" -v 3
  • To delete only the TEannot tables (when relaunching only TEannot):
ListAndDropTables.py -l "ThalChr4_chk_" -d "ThalChr4_chk_"
->Deleting 9 tables corresponding to 'ThalChr4_chk_'
ListAndDropTables.py -l "ThalChr4_chr_" -d "ThalChr4_chr_"
->Deleting 4 tables corresponding to 'ThalChr4_chr_'
ListAndDropTables.py -l "ThalChr4_refTEs" -d "ThalChr4_refTEs"
->Deleting 2 tables corresponding to 'ThalChr4_refTEs'