IterHMMBuild usage guideline#

Tip

Example input data are provided in cusProSe/iterhmmbuild/datas/:

cusProSe/iterhmmbuild/datas/
    ├── inputs
    │   ├── A.fa
    │   ├── KS.fa
    │   └── PP.fa
    └── mgg_70-15_8.fasta

Each fasta file in the inputs/ directory contains sequences of one of three different protein domains (A, KS and PP). The Magnaporthe oryzae proteome (mgg_70-15_8.fasta) is also provided and is used as the protein database in the examples below.

Quick start examples#

Build a single HMM profile#

The following command builds a single HMM profile that is representative of the A domain sequences, using the Magnaporthe oryzae proteome as the protein database:

iterhmmbuild -fa inputs/A.fa -protdb mgg_70-15_8.fasta

Build an HMM profile database#

The following command builds an HMM profile database specific to the Magnaporthe oryzae proteome, with one profile for each domain fasta file present in the inputs/ directory:

iterhmmbuild -fa inputs/ -protdb mgg_70-15_8.fasta

Note

Note that an HMM profile database can also be created from a set of individual HMM profiles with the command create_hmmdb. These HMM profiles must be placed in a single directory, which is given as input.

For instance, let's say we have the following directory containing three HMM profiles:

my_hmm_dir/
        ├── A.hmm
        ├── KS.hmm
        └── PP.hmm

Then the following command generates an HMM profile database called mydb.hmm (a simple concatenation of the three HMM profiles) in an output directory named databases/:

create_hmmdb -hmmdir my_hmm_dir/ -dbname mydb.hmm -outdir databases/
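Since the database is described as a simple concatenation of the individual profiles, the idea can be illustrated with plain shell tools. This is only a sketch under that assumption: the temporary directory and the three-line placeholder files are made up for the illustration, real .hmm files are much longer, and create_hmmdb may perform additional checks that cat does not.

```shell
# Sketch only: mimic "a simple concatenation" of HMM profiles with cat.
workdir=$(mktemp -d)
mkdir -p "$workdir/my_hmm_dir" "$workdir/databases"

# Placeholder files standing in for real HMMER3 profiles
# (a real profile has many more lines between NAME and //).
for name in A KS PP; do
  printf 'HMMER3/f\nNAME  %s\n//\n' "$name" > "$workdir/my_hmm_dir/$name.hmm"
done

# Concatenate every profile of the directory into one database file.
cat "$workdir"/my_hmm_dir/*.hmm > "$workdir/databases/mydb.hmm"

# Each profile contributes exactly one NAME line.
n_profiles=$(grep -c '^NAME' "$workdir/databases/mydb.hmm")
echo "profiles in database: $n_profiles"
```

Counting the NAME lines is a quick way to check that a database holds the expected number of profiles.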

Command line and parameters#

As illustrated by the commands above, two arguments are mandatory for IterHMMBuild: -fa and -protdb.

The input following -fa is either a fasta file with at least one protein sequence or a directory in which multiple individual fasta files are stored. Use a fasta file to build a single HMM profile representative of the sequence(s) given as input. Use a directory to build a set of HMM profiles concatenated into an HMM profile database, with one HMM profile for each fasta file present in the directory.
The input following -protdb is a fasta file of the protein database used to enrich the initial protein sequence(s) of interest.
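Before launching a run, it can help to check what will be passed to -fa. The sketch below counts the sequences in a single fasta file or in every fasta file of a directory. It is not part of IterHMMBuild: the count_seqs helper, the toy sequences and the *.fa glob are assumptions made for the illustration (the file extensions accepted by the tool are not stated here).

```shell
# Sketch only: pre-flight check of a -fa input (file or directory).
workdir=$(mktemp -d)
mkdir "$workdir/inputs"
# Toy fasta files created for the illustration.
printf '>seq1\nMKTAY\n>seq2\nMLAVG\n' > "$workdir/inputs/A.fa"
printf '>seq1\nMGGSL\n' > "$workdir/inputs/KS.fa"

count_seqs() {
  # Every fasta record starts with a ">" header line.
  grep -c '^>' "$1"
}

fa="$workdir/inputs"   # could equally point to a single file such as inputs/A.fa
if [ -d "$fa" ]; then
  # Directory: one line per fasta file (assumes a .fa extension).
  for f in "$fa"/*.fa; do
    echo "$(basename "$f"): $(count_seqs "$f") sequence(s)"
  done
else
  # Single file: one profile would be built from these sequences.
  echo "$(basename "$fa"): $(count_seqs "$fa") sequence(s)"
fi
```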

Help about the usage of IterHMMBuild and its parameters can be shown with the following command: iterhmmbuild -h

usage: iterhmmbuild [-h] -fa [FA] -protdb [PROTDB] [-name [NAME]] [-out [OUT]] [-id ID] 
                    [-cov COV] [-cval CVAL] [-ival IVAL] [-acc ACC] [-delta DELTA]
                    [-maxcount MAXCOUNT]

Iterative building of hmm profiles

optional arguments:
  -h, --help          show this help message and exit
  -fa [FA]            Fasta file of sequence(s) used as first seed or directory containing such files
  -protdb [PROTDB]    Sequences used to learn the hmm profile (fasta format)
  -name [NAME]        Name for the HMM profile (fasta name by default).
  -out [OUT]          Output directory
  -id ID              Sequence identity threshold to remove redundancy in seeds'sequences (0.9)
  -cov COV            Minimum percentage of coverage alignment between hmm hit and hmm profile (0.0)
  -cval CVAL          HMMER conditional e-value cutoff (0.01)
  -ival IVAL          HMMER independant e-value cutoff (0.01)
  -acc ACC            HMMER mean probability of the alignment accuracy between each residues of the target and the 
                      corresponding hmm state (0.6)
  -delta DELTA        Convergence criteria: difference in the number of sequences found between two consecutive iterations            
                      to consider a non-significant change between between two consecutive iterations (1)
  -maxcount MAXCOUNT  Convergence criteria: maximum number of times a non-significant change (conv_delta) is accepted before
                      considering a convergence (3)
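
The two convergence parameters can be read as follows. The sketch below is one possible interpretation of the help text, not the actual implementation: the per-iteration sequence counts are invented, and both the use of "less than or equal to delta" and the reset of the counter after a significant change are assumptions.

```shell
# Sketch only: one reading of the -delta / -maxcount convergence criteria.
counts="12 30 41 45 46 46 47"   # hypothetical number of sequences per iteration
delta=1                         # mirrors -delta default
maxcount=3                      # mirrors -maxcount default

stable=0        # consecutive non-significant changes seen so far
iter=0
prev=""
converged_at=0  # stays 0 if convergence is never reached

for n in $counts; do
  iter=$((iter + 1))
  if [ -n "$prev" ]; then
    diff=$((n - prev))
    if [ "$diff" -lt 0 ]; then diff=$((0 - diff)); fi
    if [ "$diff" -le "$delta" ]; then
      stable=$((stable + 1))
    else
      stable=0   # assumption: a significant change resets the counter
    fi
    if [ "$stable" -ge "$maxcount" ]; then
      converged_at=$iter
      break
    fi
  fi
  prev=$n
done

echo "converged_at=$converged_at"
```

With these invented counts, the changes at iterations 5, 6 and 7 are all within delta, so the sketch reports convergence at iteration 7.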

Output of IterHMMBuild#

After running IterHMMBuild, an output directory named in the following generic format is generated: iterhmmbuild_year-month-day_hour-min-sec/
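
Because the timestamp goes from year down to second, directory names sort chronologically, which makes it easy to pick the most recent run. A minimal sketch, using a temporary directory and two empty example directories created only for the illustration:

```shell
# Sketch only: locate the most recent IterHMMBuild output directory.
workdir=$(mktemp -d)
mkdir "$workdir/iterhmmbuild_2020-10-29_13-13-04" \
      "$workdir/iterhmmbuild_2021-03-02_12-39-38"

# Lexicographic order matches chronological order for this name format.
latest=$(ls -d "$workdir"/iterhmmbuild_*/ | sort | tail -n 1)
echo "latest run: $(basename "$latest")"

# The same name format, built for the current time with date.
echo "iterhmmbuild_$(date +%Y-%m-%d_%H-%M-%S)/"
```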

Output from the generation of a single HMM profile#

The output directory generated by the single-profile command from the quick start examples (iterhmmbuild -fa inputs/A.fa -protdb mgg_70-15_8.fasta) has the following structure:

iterhmmbuild_2020-10-29_13-13-04/
├── A.hmm
├── A_seed.clw
├── A_seed.fa
├── info.log
├── iter_1/
├── iter_2/
├── ...
└── iter_6/

The three main files of interest are:

A.hmm	Final HMM profile
A_seed.fa	Final sequences used to build A.hmm
A_seed.clw	Multiple alignment (Clustal W format) of A_seed.fa

info.log is a log summary of the computation. The subdirectories iter_i/ contain files obtained at each iteration and are described in section Overall procedure.

Output from the generation of an HMM profile database#

The output directory generated by the database command from the quick start examples (iterhmmbuild -fa inputs/ -protdb mgg_70-15_8.fasta) contains one subdirectory per input fasta file, each holding the main files described above. At its root you will find the file hmm_database.hmm, a concatenation of the HMM profiles built for the protein domains given as input.

iterhmmbuild_2021-03-02_12-39-38
├── hmm_database.hmm
├── info.log
├── A
│   ├── A.hmm
│   ├── A_seed.clw
│   └── A_seed.fa
├── KS
│   ├── KS.hmm
│   ├── KS_seed.clw
│   └── KS_seed.fa
└── PP
    ├── PP.hmm
    ├── PP_seed.clw
    └── PP_seed.fa