IterHMMBuild overall procedure#

IterHMMBuild pipeline overview.
Figure 2: Pipeline overview of IterHMMBuild

IterHMMBuild is an iterative search method based on the hmmer program, the aim of which is to provide users with a representative HMM protein profile of interest constructed by an iterative enrichment process starting from a small initial set of related protein sequences.

The IterHMMBuild procedure starts building an HMM profile from either a set of related protein sequences or a single query sequence. This initial profile is then used to identify homologous sequences in any user-specified protein sequence database. If sequences are found, they are added to the initial query sequences and a new HMM profile model is built. This process is repeated until no new sequences are found (i.e. convergence is reached).

Inputs#

Two inputs are required for IterHMMBuild.

The first input is either a fasta file with at least one protein sequence (Figure 2) OR a directory location where multiple individual fasta files are stored (Figure 1). In the first case, the output will contain an HMM profile representative of the sequences in the fasta file given as input; if a directory is given as input then the HMM profile in the output will be a concatenation of HMM profiles, each corresponding to the fasta files present in the directory.

The second input is a fasta file of protein database used to enrich initial protein sequence(s) of interest.

HMM building step#

The HMM building is a two-step procedure: a multiple sequence alignment is performed on the input sequences using muscle and the hmmbuild command from hmmer is then used to build the HMM profile from this alignment. However, note that at the first iteration the usearch command is performed on the input sequences to ensure that thoses sequences share no more than 90% (default value) of identity.

Sequence enrichment step#

The previously built HMM profile is searched against the protein database given as input using the hmmsearch command from hmmer. All matching sequences with a E-valuesa less than 0.01 (default value) and an expected accuracy per residue of the alignmentb above or equal to 0.6 (default value) are retrieved. Those sequences are then merged to the initial input sequences. To ensure that sequences are not redundant, usearch is applied with a threshold identity value of 0.90.

Note

  • a Both conditional and independent E-values from hmmer are evaluated
  • b Please see the hmmer documentation for more details about the accuracy

Convergence#

Basically, the convergence is reached when the number of sequences at iteration i+1 (Nseqi+1) is strictly equal to the number of sequences at iteration i (Nseqi). However, because the number of new sequences found after multiple consecutive iterations can be really low, or even negative, a counter is also used to evaluate the convergence status in order to prevent unnecessay iterations (i.e. iterations that will increase the computation time without significantly enrich the number of new sequences). This counter is incremented each time the difference between Nseqi+1 and Nseqi is negative or equal to 1 (default value). The convergence is then also reached when the value of this counter is equal to 3 (default value).

Outputs#

An example output is as shown below:

iterhmmbuild_2020-10-29_13-13-04/
├── X.hmm
├── X_seed.clw
├── X_seed.fa
├── info.log
├── iter_1
│   ├── X_enriched_nr.fasta
│   ├── X_nr.clw
│   ├── X_nr.domtblout
│   ├── X_nr.fa
│   └── X_nr.hmm
├── ...
└── iter_6
    ├── X_enriched_nr.clw
    ├── X_enriched_nr.domtblout
    ├── X_enriched_nr.fasta
    └── X_enriched_nr.hmm

The table below describes the content of the output directory:

File name Description
X.hmm Final HMM profile
X_seed.fa Final sequences used to build X.hmm
X_seed.clw Multiple alignment (clustal W format) of X_seed.fa
info.log Log summary of the computation
iter_1/X_nr.fa Non redundant sequences coming from usearch applied on the input X.fa (see IterHMMBuild procedure)
iter_1/X_nr.clw Multiple alignment of sequences in X_nr.fa
iter_1/X_nr.hmm HMM profile built with X_nr.clw as input
iter_1/X_nr.domtblout Output file of the hmmsearch command
iter_1/X_enriched_nr.fasta Non redundant sequences coming from homologous sequences identified with hmmsearch and initial sequences in X.fa