IterHMMBuild usage guideline#
Tip
Example input data can be found in cusProSe/iterhmmbuild/datas/:
cusProSe/iterhmmbuild/datas/
├── inputs
│   ├── A.fa
│   ├── KS.fa
│   └── PP.fa
└── mgg_70-15_8.fasta
Each fasta file in the inputs/ directory contains sequences of one of three different protein domains. The Magnaporthe oryzae proteome (mgg_70-15_8.fasta) is also provided and serves as the protein database in the examples below.
Quick start examples#
Build a single HMM profile#
iterhmmbuild -fa inputs/A.fa -protdb mgg_70-15_8.fasta
Build an HMM profile database#
iterhmmbuild -fa inputs/ -protdb mgg_70-15_8.fasta
Note
The user can also create an HMM profile database from a set of individual HMM profiles with the command create_hmmdb. Those HMM profiles must be placed in a single directory that is given as input.
For instance, let's say we have the following directory containing three HMM profiles:
my_hmm_dir/
├── A.hmm
├── KS.hmm
└── PP.hmm
Then the following command will generate an HMM profile database (a simple concatenation of the three HMM profiles) called mydb.hmm in an output directory named databases/:
create_hmmdb -hmmdir my_hmm_dir/ -dbname mydb.hmm -outdir databases/
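Since the database is described as a simple concatenation of the profiles, the same result can be approximated by hand. The sketch below is not how create_hmmdb is implemented; concat_hmm_profiles is a hypothetical helper, and the profile order (alphabetical, from the shell glob) and the absence of any extra processing are assumptions.

```shell
# Hypothetical helper (not part of cusProSe): concatenate every .hmm file of
# a directory into a single database file.
# Usage: concat_hmm_profiles HMM_DIR OUTPUT_FILE
concat_hmm_profiles() {
    # Create the output directory if it does not exist yet.
    mkdir -p "$(dirname "$2")"
    # The glob expands in alphabetical order (assumed to be acceptable here).
    cat "$1"/*.hmm > "$2"
}

# Example call with the directory layout shown above:
# concat_hmm_profiles my_hmm_dir databases/mydb.hmm
```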
Command line and parameters#
As illustrated by the commands above, two arguments are mandatory for IterHMMBuild: -fa and -protdb.
- The input following -fa can be either a fasta file with at least one protein sequence OR a directory where multiple individual fasta files are stored. The former is used to build a single HMM profile representative of the sequence(s) given as input. The latter is used to build a set of HMM profiles concatenated into an HMM profile database, with one HMM profile for each fasta file present in the directory.
- The input following -protdb is a fasta file of the protein database used to enrich the initial protein sequence(s) of interest.
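Because -fa expects at least one protein sequence per fasta file, it can be worth checking the inputs before a run. The following is a minimal sketch, not a feature of IterHMMBuild: count_fasta_sequences and check_fasta_dir are hypothetical helpers, and the .fa extension is assumed from the example data.

```shell
# Hypothetical helper: count the sequences of a fasta file by counting its
# header lines, which start with ">".
count_fasta_sequences() {
    # grep -c prints the number of matching lines; it exits with status 1
    # when the count is 0, so the status is deliberately ignored.
    grep -c '^>' "$1" || true
}

# Hypothetical helper: report every .fa file of a directory that holds no
# sequence at all.
check_fasta_dir() {
    for fasta in "$1"/*.fa; do
        # Skip the unexpanded pattern when the directory has no .fa file.
        [ -e "$fasta" ] || continue
        if [ "$(count_fasta_sequences "$fasta")" -eq 0 ]; then
            echo "no sequence found in $fasta"
        fi
    done
}

# Example calls with the example data:
# count_fasta_sequences inputs/A.fa
# check_fasta_dir inputs
```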
Help about the usage of IterHMMBuild and its parameters can be shown with the following command:
iterhmmbuild -h
usage: iterhmmbuild [-h] -fa [FA] -protdb [PROTDB] [-name [NAME]] [-out [OUT]] [-id ID] [-cov COV] [-cval CVAL] [-ival IVAL] [-acc ACC] [-delta DELTA] [-maxcount MAXCOUNT]

Iterative building of hmm profiles

optional arguments:
  -h, --help          show this help message and exit
  -fa [FA]            Fasta file of sequence(s) used as first seed or directory containing such files
  -protdb [PROTDB]    Sequences used to learn the hmm profile (fasta format)
  -name [NAME]        Name for the HMM profile (fasta name by default).
  -out [OUT]          Output directory
  -id ID              Sequence identity threshold to remove redundancy in seeds'sequences (0.9)
  -cov COV            Minimum percentage of coverage alignment between hmm hit and hmm profile (0.0)
  -cval CVAL          HMMER conditional e-value cutoff (0.01)
  -ival IVAL          HMMER independant e-value cutoff (0.01)
  -acc ACC            HMMER mean probability of the alignment accuracy between each residues of the target and the corresponding hmm state (0.6)
  -delta DELTA        Convergence criteria: difference in the number of sequences found between two consecutive iterations to consider a non-significant change between between two consecutive iterations (1)
  -maxcount MAXCOUNT  Convergence criteria: maximum number of times a non-significant change (conv_delta) is accepted before considering a convergence (3)
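The two convergence parameters can be read as a simple stopping rule. The sketch below is one possible interpretation of the help text, not code taken from IterHMMBuild: it assumes that a change is non-significant when the number of sequences differs by at most -delta from the previous iteration, that such events are counted cumulatively, and that the run stops once -maxcount of them have been seen. The function name iterations_until_convergence is hypothetical.

```shell
# Assumed semantics, NOT taken from the IterHMMBuild source code.
# Usage: iterations_until_convergence DELTA MAXCOUNT COUNT_1 COUNT_2 ...
# where COUNT_i is the number of sequences found at iteration i.
iterations_until_convergence() {
    delta=$1
    maxcount=$2
    shift 2
    prev=""
    stable=0
    iter=0
    for n in "$@"; do
        iter=$((iter + 1))
        if [ -n "$prev" ]; then
            # Absolute difference with the previous iteration.
            diff=$((n - prev))
            if [ "$diff" -lt 0 ]; then
                diff=$((0 - diff))
            fi
            # Assumption: "non-significant" means a difference <= delta.
            if [ "$diff" -le "$delta" ]; then
                stable=$((stable + 1))
            fi
        fi
        prev=$n
        # Assumption: stop once maxcount non-significant changes were seen.
        if [ "$stable" -ge "$maxcount" ]; then
            echo "$iter"
            return 0
        fi
    done
    echo "not converged"
    return 1
}
```

With the default values (-delta 1, -maxcount 3) and made-up sequence counts of 10, 25, 31, 32, 32 and 33, `iterations_until_convergence 1 3 10 25 31 32 32 33` prints 6 under these assumptions.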
Output of IterHMMBuild#
After running IterHMMBuild an output directory will be generated in the following generic format:
iterhmmbuild_year-month-day_hour-min-sec/
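Because the directory name embeds a timestamp ordered from year down to second, and the examples in this page show zero-padded fields, the most recent run can be found with a plain sort. This is a small convenience sketch; latest_iterhmmbuild_dir is a hypothetical helper, not a cusProSe command.

```shell
# Hypothetical helper: print the most recent iterhmmbuild_* output directory
# found in PARENT_DIR (current directory by default).
# Usage: latest_iterhmmbuild_dir [PARENT_DIR]
latest_iterhmmbuild_dir() {
    for d in "${1:-.}"/iterhmmbuild_*/; do
        # Skip the unexpanded pattern when no output directory exists.
        if [ -d "$d" ]; then
            printf '%s\n' "${d%/}"
        fi
    done | sort | tail -n 1
    # Zero-padded timestamps make the lexicographic sort chronological.
}
```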
Output from the generation of a single HMM profile#
The output directory generated by the single-profile command of the quick start examples (-fa inputs/A.fa) will have the following architecture:
iterhmmbuild_2020-10-29_13-13-04/
├── A.hmm
├── A_seed.clw
├── A_seed.fa
├── info.log
├── iter_1/
├── iter_2/
├── ...
└── iter_6/
The three main files of interest are:
A.hmm | Final HMM profile |
A_seed.fa | Final sequences used to build A.hmm |
A_seed.clw | Multiple alignment (Clustal W format) of A_seed.fa |
info.log is a log summary of the computation. The subdirectories iter_i/ contain the files obtained at each iteration and are described in the section Overall procedure.
Output from the generation of an HMM profile database#
The output directory generated by the database command of the quick start examples (-fa inputs/) contains one subdirectory per input fasta file, each holding the output files described above. At its root you will find the file hmm_database.hmm, a concatenation of the HMM profiles built for the protein domains given as inputs.
iterhmmbuild_2021-03-02_12-39-38
├── hmm_database.hmm
├── info.log
├── A
│   ├── A.hmm
│   ├── A_seed.clw
│   └── A_seed.fa
├── KS
│   ├── KS.hmm
│   ├── KS_seed.clw
│   └── KS_seed.fa
└── PP
    ├── PP.hmm
    ├── PP_seed.clw
    └── PP_seed.fa
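To check which profiles ended up in hmm_database.hmm, the profile names can be listed directly from the file. The sketch below assumes the database is in the HMMER3 text format, where each profile carries a line starting with NAME; list_hmm_profile_names is a hypothetical helper, not a cusProSe command.

```shell
# Hypothetical helper: print the name of every profile stored in an HMM
# database, assuming the HMMER3 text format ("NAME  <profile name>" lines).
# Usage: list_hmm_profile_names HMM_DATABASE
list_hmm_profile_names() {
    awk '$1 == "NAME" { print $2 }' "$1"
}

# Example call on the output directory shown above:
# list_hmm_profile_names iterhmmbuild_2021-03-02_12-39-38/hmm_database.hmm
```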