IterHMMBuild usage guideline#

Tip

Example input data can be found in cusProSe/iterhmmbuild/datas/:

cusProSe/iterhmmbuild/datas/
    ├── inputs
    │   ├── A.fa
    │   ├── KS.fa
    │   └── PP.fa
    └── mgg_70-15_8.fasta
    
Each FASTA file in the inputs/ directory contains sequences of one of three different protein domains (A, KS and PP). The Magnaporthe oryzae proteome (mgg_70-15_8.fasta) is used as the protein database in the examples below.

Quick start examples#

Build a single HMM profile#

The following command builds a single HMM profile representative of the A domain sequences, using the Magnaporthe oryzae proteome as the protein database:
iterhmmbuild -fa inputs/A.fa -protdb mgg_70-15_8.fasta

Build an HMM profile database#

The following command builds an HMM profile database specific to the Magnaporthe oryzae proteome, with each profile representative of the domain sequences in one FASTA file of the directory `inputs/`:
iterhmmbuild -fa inputs/ -protdb mgg_70-15_8.fasta

Note

An HMM profile database can also be created from a set of existing individual HMM profiles with the create_hmmdb command. These HMM profiles must all be placed in a single directory, which is given as input.

For instance, let's say we have the following directory containing three HMM profiles:

my_hmm_dir/
        ├── A.hmm
        ├── KS.hmm
        └── PP.hmm
        
Then the following command will generate an HMM profile database (that is a simple concatenation of the three HMM profiles) called mydb.hmm in an output directory named databases/:

create_hmmdb -hmmdir my_hmm_dir/ -dbname mydb.hmm -outdir databases/
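
Since the database is described as a simple concatenation of the profiles, the operation can be pictured with a few lines of Python. This is only a minimal sketch under that assumption, not the code of create_hmmdb: the function name concat_hmm_profiles, the alphabetical ordering and the newline handling are all illustrative choices.

```python
from pathlib import Path


def concat_hmm_profiles(hmm_dir: str, db_name: str, out_dir: str) -> Path:
    """Concatenate every *.hmm file of hmm_dir into out_dir/db_name.

    Hypothetical helper: mimics the concatenation described above,
    it is not the actual create_hmmdb implementation.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    db_path = out_path / db_name

    # Sort the profiles so the database content is reproducible (assumed order).
    profiles = sorted(Path(hmm_dir).glob("*.hmm"))
    if not profiles:
        raise FileNotFoundError(f"no .hmm file found in {hmm_dir}")

    with db_path.open("w") as db:
        for profile in profiles:
            text = profile.read_text()
            # Make sure a profile ends with a newline before appending the next one.
            db.write(text if text.endswith("\n") else text + "\n")
    return db_path
```

Calling concat_hmm_profiles("my_hmm_dir/", "mydb.hmm", "databases/") would write databases/mydb.hmm from the three profiles of the example directory.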

Command line and parameters#

As illustrated by the commands above, two arguments are mandatory for IterHMMBuild: -fa and -protdb.

  • The input following -fa can be either a FASTA file with at least one protein sequence or a directory containing multiple individual FASTA files. Use a file to build a single HMM profile representative of the input sequence(s). Use a directory to build a set of HMM profiles, one per FASTA file in that directory, concatenated into an HMM profile database.
  • The input following -protdb is a FASTA file of the protein database used to enrich the initial protein sequence(s) of interest.
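
The two accepted forms of the -fa input can be illustrated with the following sketch, which turns a path into the list of seed FASTA files to process. It is an assumption about how such a dispatch could look, not the tool's own code: the helper name collect_seed_files and the set of recognized extensions are hypothetical.

```python
from pathlib import Path
from typing import List

# Assumption: these extensions identify FASTA files; the tool may use other rules.
FASTA_SUFFIXES = {".fa", ".fasta", ".faa"}


def collect_seed_files(fa_input: str) -> List[Path]:
    """Return the seed FASTA files implied by the value given to -fa."""
    path = Path(fa_input)
    if path.is_file():
        # Single file: one HMM profile is built from it.
        return [path]
    if path.is_dir():
        # Directory: one HMM profile per FASTA file, later merged into a database.
        seeds = sorted(
            p for p in path.iterdir()
            if p.is_file() and p.suffix.lower() in FASTA_SUFFIXES
        )
        if not seeds:
            raise ValueError(f"no FASTA file found in {path}")
        return seeds
    raise FileNotFoundError(f"{path} is neither a file nor a directory")
```

With the example data, collect_seed_files("inputs/A.fa") would return a single path, while collect_seed_files("inputs/") would return the three domain files.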

Help about the usage of IterHMMBuild and its parameters can be shown with the following command: iterhmmbuild -h

usage: iterhmmbuild [-h] -fa [FA] -protdb [PROTDB] [-name [NAME]] [-out [OUT]] [-id ID] 
                    [-cov COV] [-cval CVAL] [-ival IVAL] [-acc ACC] [-delta DELTA]
                    [-maxcount MAXCOUNT]

Iterative building of hmm profiles

optional arguments:
  -h, --help          show this help message and exit
  -fa [FA]            Fasta file of sequence(s) used as first seed or directory containing such files
  -protdb [PROTDB]    Sequences used to learn the hmm profile (fasta format)
  -name [NAME]        Name for the HMM profile (fasta name by default).
  -out [OUT]          Output directory
  -id ID              Sequence identity threshold to remove redundancy in seeds'sequences (0.9)
  -cov COV            Minimum percentage of coverage alignment between hmm hit and hmm profile (0.0)
  -cval CVAL          HMMER conditional e-value cutoff (0.01)
  -ival IVAL          HMMER independant e-value cutoff (0.01)
  -acc ACC            HMMER mean probability of the alignment accuracy between each residues of the target and the 
                      corresponding hmm state (0.6)
  -delta DELTA        Convergence criteria: difference in the number of sequences found between two consecutive iterations            
                      to consider a non-significant change between between two consecutive iterations (1)
  -maxcount MAXCOUNT  Convergence criteria: maximum number of times a non-significant change (conv_delta) is accepted before
                      considering a convergence (3)
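
The interplay of -delta and -maxcount can be sketched as follows. This is one possible reading of the help text, not the tool's actual logic: it assumes that a change is non-significant when the number of sequences differs by at most delta, and that non-significant changes must occur on consecutive iterations. The function name has_converged is hypothetical.

```python
from typing import Sequence


def has_converged(counts: Sequence[int], delta: int = 1, maxcount: int = 3) -> bool:
    """Tell whether a series of per-iteration sequence counts has converged.

    counts holds the number of sequences found at each iteration, in order.
    """
    stable = 0
    for previous, current in zip(counts, counts[1:]):
        if abs(current - previous) <= delta:
            # Non-significant change between two consecutive iterations.
            stable += 1
            if stable >= maxcount:
                return True
        else:
            # Assumption: a significant change resets the counter.
            stable = 0
    return False
```

Under these assumptions and the default values, a run finding 10, 25, 40, 41, 41 and 42 sequences over six iterations would be considered converged at the sixth iteration.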

Output of IterHMMBuild#

After running IterHMMBuild an output directory will be generated in the following generic format: iterhmmbuild_year-month-day_hour-min-sec/
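
The naming pattern corresponds to a timestamp, which can be reproduced or parsed with standard format codes. The sketch below is illustrative only; the function name output_dir_name is hypothetical and the format string is inferred from the pattern above.

```python
from datetime import datetime


def output_dir_name(now: datetime) -> str:
    """Build a directory name following iterhmmbuild_year-month-day_hour-min-sec."""
    return now.strftime("iterhmmbuild_%Y-%m-%d_%H-%M-%S")


# Example with a fixed date and time.
print(output_dir_name(datetime(2020, 10, 29, 13, 13, 4)))
```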

Output from the generation of a single HMM profile#

The output directory generated by the single-profile command of the quick start examples has the following structure:

iterhmmbuild_2020-10-29_13-13-04/
├── A.hmm
├── A_seed.clw
├── A_seed.fa
├── info.log
├── iter_1/
├── iter_2/
├── ...
└── iter_6/

The three main files of interest are:

A.hmm Final HMM profile
A_seed.fa Final sequences used to build A.hmm
A_seed.clw Multiple alignment (Clustal W format) of A_seed.fa

info.log is a log summary of the computation. The subdirectories iter_i/ contain files obtained at each iteration and are described in section Overall procedure.
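
A quick way to see how many sequences ended up in the final seed is to count the header lines of A_seed.fa. The helper below is a generic FASTA sketch, not part of IterHMMBuild; its name count_fasta_sequences is hypothetical.

```python
from pathlib import Path


def count_fasta_sequences(fasta_path: str) -> int:
    """Count the records of a FASTA file by counting its header lines."""
    with Path(fasta_path).open() as handle:
        # Each FASTA record starts with a line beginning with ">".
        return sum(1 for line in handle if line.startswith(">"))
```

For instance, count_fasta_sequences("iterhmmbuild_2020-10-29_13-13-04/A_seed.fa") would give the size of the final seed of the example run.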

Output from the generation of an HMM profile database#

The output directory generated by the database command of the quick start examples contains one subdirectory per input domain, each holding the main files described above. At its root is the file hmm_database.hmm, a concatenation of the HMM profiles built for the protein domains given as input.

iterhmmbuild_2021-03-02_12-39-38
├── hmm_database.hmm
├── info.log
├── A
│   ├── A.hmm
│   ├── A_seed.clw
│   └── A_seed.fa
├── KS
│   ├── KS.hmm
│   ├── KS_seed.clw
│   └── KS_seed.fa
└── PP
    ├── PP.hmm
    ├── PP_seed.clw
    └── PP_seed.fa
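
The layout above lends itself to simple scripting. As a minimal sketch, assuming each domain subdirectory is named after its profile as in the tree shown here, the per-domain profiles can be collected like this (the function name list_domain_profiles is hypothetical):

```python
from pathlib import Path
from typing import Dict


def list_domain_profiles(run_dir: str) -> Dict[str, Path]:
    """Map each domain name to its .hmm file inside an output directory."""
    profiles: Dict[str, Path] = {}
    for sub in sorted(Path(run_dir).iterdir()):
        # Assumed convention: the subdirectory <name>/ contains <name>.hmm.
        candidate = sub / f"{sub.name}.hmm"
        if sub.is_dir() and candidate.is_file():
            profiles[sub.name] = candidate
    return profiles
```

Applied to the example directory, the returned mapping would have the keys A, KS and PP.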