ProSeCDA Overall procedure#

ProSeCDA pipeline overview.
Figure 2: Pipeline overview of ProSeCDA

ProSeCDA is a tool that allows to search in a given protein database multiple proteins of interest defined by a user-specified set of domains.

The first step in ProSeCDA is to annotate a protein database of interest with domains from a user-specified HMM profile database. In the next step the annotated proteins are filtered following a set of rules defined by the user. The rule consists to define one or a set of protein families based on the presence and/or absence of domains. All proteins matching thoses rules are then finally accessible in the output.

Inputs#

ProSeCDA requires three inputs:

  • a protein database in a fasta format in which the user wants to search proteins of interest based on their domain architecture.
  • an HMM domain profile database used to annotate the input protein database. Any HMM domain profile database can be used as long as it is compatible with hmmersearch.
  • a YAML file containing the rules, that is a set of protein families defined by the presence and/or absence of specific domain(s). The format of the YAML file is described in Usage guideline. Please note that domains used to define proteins on interest must be present in the HMM domain profile database.

Annotation#

Selection of matching domains#

The annotation step is required to assign domains from the HMM profile database matching sequences in the protein database. The annotation procedure uses hmmsearch to search for each of the HMM domain profile present in the YAML rules file against the protein database of interest. All matching sequences with an E-valuea less than 0.01 (default value) and an expected accuracy per residue of the alignmentb above or equal to 0.6 (default value) are then retrieved.

Note

  • a Both conditional and independent E-values from hmmer are evaluated
  • b Please see the hmmer documentation for more details about the accuracy

Resolving overlapping domains#

To resolve overlapping domainsa, an approach similar to the heaviest weighted clique-finding method described in Orengo et al. is used. When multiple matching domains are found for a protein sequence and some of those domains overlap, all possible domain architectures defined by a set of non-overlapping domains are identified, with each domain being assigned a score corresponding to -log(E-value). An alternative score is also used in case the E-value associated with a match is equal to 0.0. In that case, the bit score from hmmer is assigned to each domain instead of the -log(E-value) which cannot be computed. The protein is then assigned the most-likely domain architecture which is defined as the combination of non-overlapping domains that gives the highest total score.

Note

  • a We consider that two domains are overlapping if at least 40% of the shortest domain sequence overlap with the other domain.

Filtering#

The filtering step searches in all precedently annotated proteins the domain architectures matching those in the set of rules defined by the user. A match with a user-defined family is valid for a protein if its most-likely domain architecture fits the mandatory domains and if no forbidden domains are present. Moreover, if an E-value threshold is specified in the rules for a given mandatory domain, this domain must match with an E-value at least below this threshold.

Output#

The output of ProSeCDA has the following architecture:

prosecda_yyyy-mm-dd_hh-mm-ss/
├── protein_database_filename.domtblout
├── info.log
└── results/
    ├── family_A1/
    │   ├── all_matching_proteins.pdf
    │   ├── protein_id_1.fa
    │   ├── protein_id_1.pdf
    │   ├── protein_id_1.xml
    │   ├── protein_id_2.fa
    │   ├── protein_id_2.pdf
    │   └── protein_id_2.xml
    ├── family_A2/
    │   ├── all_matching_proteins.pdf
    │   ├── protein_id_8.fa
    │   ├── protein_id_8.pdf
    │   ├── protein_id_8.xml
    │   └── ...
    ├── ...

The table below describes the content of the output directory:

File/Directory name Description
protein_database_filename.domtblout hmmsearch output in domtblout format
info.log Summary log of the computation
family_A1/ Subdirectory containing detailed files for proteins matching the rule defining 'family A1'
all_matching_proteins.pdf Figures of all proteins matching the family rule with their most-likely domain architecture
protein_id_1.fa Fasta sequence of the protein
protein_id_1.pdf Figure of the protein with all matching domains
protein_id_1.xml XML file describing the protein