ProSeCDA usage guideline#
Tip
To guide the user, input data files can be found in cusProSe/prosecda/datas/
:
cusProSe/prosecda/datas/ ├── database.hmm ├── mgg_70-15_8.fasta └── rules.yaml
-
Those are the three mandatory files required as inputs to run prosecda:
rules.yaml
is a YAML file in which proteins of interest are described based on the presence/absence of specific domainsdatabase.hmm
is a customized HMM profile database made of 24 HMM protein domain modelsmgg_70-15_8.fasta
is the magnaporthe orizae proteome that will be scanned against proteins of interest defined in the YAML rules file
Quick start example#
rules.yaml
(the HMM profile database database.hmm
is used to find domains of interest matching proteins in the magnaporthe orizae proteome):
prosecda -proteome mgg_70-15_8.fasta -hmmdb database.hmm -rules rules.yaml
Help about the usage of ProSeCDA and its parameters can be shown with the following command: prosecda -h
:
usage: prosecda [-h] -proteome [PROTEOME] -hmmdb [HMMDB] -rules [RULES] [-out [OUT]] [-cov COV] [-cevalue CEVALUE] [-ievalue IEVALUE] [-score SCORE] [-acc ACC] [--nopdf] Search proteins matching rules. optional arguments: -h, --help show this help message and exit -proteome [PROTEOME] Proteome file (.fasta) -hmmdb [HMMDB] HMM profile database -rules [RULES] Rules'file (.yaml) -out [OUT] Output directory -cov COV Minimum ratio between the length of the HMM profile stretch that matches a sequence and the overall length of the HMM profile (0.0) -cevalue CEVALUE HMMER conditional e-value cutoff (0.01) -ievalue IEVALUE HMMER independant e-value cutoff (0.01) -score SCORE HMMER score cutoff (3.0) -acc ACC HMMER mean probability of the alignment accuracy between each residues of the target and the corresponding hmm state (0.6) --nopdf Deactivate the generation of the pdf results (False)
Creating rules#
The proteins the user is interested in must be described in a file containing simple rules defining for each protein of interest which domains are mandatory and/or which one are forbidden. An E-value (corresponding to the independent domain E-value from hmmsearch) can also be specified for each mandatory domains. In that case, a mandatory domain must match a sequence with an E-value at least below the one specified. If no E-value is defined for mandatory domains, an E-value of 0.01 is used by default.
Editing rules from GUI#
The file in which rules are edited must respect a specific syntax in a YAML format. To make it simpler, a Graphical User Interface (GUI) has been implemented. You can access it through the command create_rules

As shown in the above image, a rule can be defined by five different inputs:
Name
(required): category/family name the proteins that'll match the rule will be assigned toComment
(optional): can be used to describe the ruleMandatory
(required): domain name the protein must containE-value
(optional): minimal E-value the associated domain must match a sequence (0.01 if not specified)Forbidden
(optional): domain name the protein must not contain
Once you have added all your mandatory/forbidden domains describing the domain architecture your proteins of interest must match, click on Add rule to the list
to validate the rule. You'll see it listed in the Rule list
panel. Repeat the procedure to create as many rules as you want.
Once all your desired rules are defined, click on Save rules
to save them in a YAML file. This YAML file will look like the one described below:
PKS: COMMENT: Polyketide Synthase CONDITION: mandatory: - KS - AT - PP-binding, 0.0001 forbidden: - C PKS-like: COMMENT: Polyketide Synthase like CONDITION: mandatory: - KS - AT forbidden: PKS_type3.0: COMMENT: Polyketide Syntase type III CONDITION: mandatory: - CHS_like forbidden:
Updating an existing rule file#
If the user wants to update an already existing rule file, he will have to manually edit it. In that case be aware that a YAML file must respect a specific syntax. The simplest way to add new rules in an already existing file might to copy paste an exisitng rule and to adapt its content. Note that all characters in blue in the above YAML file are required, and whitespace indentations must be respected.
Output of ProSeCDA#
Overall architecture#
The output directory generated from the command run in the Quick start example will be as described below:
prosecda_year-month-day_hour-min-sec/ ├── css/ ├── images/ ├── index.html ├── info.log ├── js/ ├── mgg_70-15_8.domtblout └── results/
with:
index.html
: an interactive web page allowing to visualize the results (described here)css/
,js/
andimages/
: folders containing files read by index.htmlinfo.log
: summary log of the computation runmgg_70-15_8.domtblout
: output file of hmmsearchresults/
: directory containing different output files for rules matching proteins
Content of the results/ directory#
The results/
directory contains a list of subdirectories, each one corresponding to a user-defined protein family for which match has been found (DMATS/
, Ent_kaurene_synthase/
...):
results/ ├── DMATS/ │ ├── all_DMATS.pdf │ ├── MGG_06540T0.fa │ ├── MGG_06540T0.pdf │ ├── MGG_06540T0.xml │ ├── MGG_10953T0.fa │ ├── MGG_10953T0.pdf │ ├── MGG_10953T0.xml │ ├── MGG_12480T0.fa │ ├── MGG_12480T0.pdf │ └── MGG_12480T0.xml ├── Ent_kaurene_synthase/ │ ├── all_Ent_kaurene_synthase.pdf │ ├── MGG_01949T0.fa │ ├── MGG_01949T0.pdf │ ├── MGG_01949T0.xml | ... ...
Inside each protein family folder, three files with the related protein ID as the basename (e.g. MGG_06540T0) are generated for each protein matching the user-defined family rule:
protein_ID.fa
: the protein sequence in fasta formatprotein_ID.xml
: details in XML format about the proteins and its matching domainsprotein_ID.pdf
: a graphical representation of all of the domains that matched the protein
The file whose basename follows the syntax all_xxx.pdf
(where xxx stands for the family name) is a graphical representation of the most-likely domains architecture for all the proteins that have been found to match the user-defined family rule. Please note that for a given protein the graphical representation in this file will be different from the one found in protein_ID.pdf
, the latter representing all the domains that have been found to match the protein. Some of those domains might not be visible in the most-likely domain architecture. This will be the case for a domain i that overlaps with a domain j and if any architecture in which domain i is present is not the most-likely architecture, that is the one with the highest score (see here for details about how the score is assigned).
Illustration of the output files#
The all_xxx.pdf file#
The file whose basename follows the syntax all_xxx.pdf
is a multipage pdf file containing graphical representations of the most-likely domain architecture of all the proteins that have been found to match the user-defined family rule.
An example of the first page of all_PKS.pdf
is shown below:

The first line of the figure represents all domains composing the most-likely domain architecture found for the protein MGG_00233T0. The next lines represent details information about each domain, namely its amino acid positions, name, independant e-value (i-evalue) and its score.
The protein_ID.pdf file#
The graphical representation found for a given protein in the file protein_ID.pdf
does not show any domain architecture; instead it represents all the domains that matched the protein sequence during the annotation step. The main function of this file is to keep track of domains that may have not been retained to be part of the most-likely architecture. It thus provides the user a way to check the validity for a protein to really belong to the assigned protein family.
To illustrate that point, let's compare both graphical representations for the protein MGG_04631T0 that has been assigned to the Sesquiterp_synthase family:

all_Sesquiterp_synthase.pdf
and (B) MGG_04631T0.pdf
.
The figure 4A shows the most-likely domain architecture assigned to the protein MGG_04631T0 (found in all_Sesquiterp_synthase.pdf
) which is composed of only one domain: the sesquiterpene domain.
The figure 4B shows all the domains that matched the protein sequence during the annotation step. We can see in this representation that both the sesquiterpene domain and the fusicocadiene domain matched the protein sequence. Only the sesquiterpene domain has been retained in the most-likely domain architecture based on its most favorable i-evalue and thus made the protein assigned as an Sesquiterp_synthase.
The protein_ID.xml file#
The protein_ID.xml file contains the main information about the protein. An example of the content of this xml file is shown below for the protein MGG_04631T0:
<protein>
<id>MGG_04631T0</id>
<sequence>MADRTCVTLPDMFRGFIVQEPKVNKHYEAVKPVSEKWLARICAFSPMMQKRVGACDFSYFCSIAAPEAPIHKLRTMCDWGNWVFPFDDMFDSGDLRSDLIVTRHVLDSLLADMKGHKFRGLKIPVVLAHDDIYRRLSEVETKNPSISGVQRRFARAMELYALGVAQHVQDFTESGLPCPQEMLETRRLSVGVAPLYHLVEYAHSIRLPDEVFEDPAIQTLERLGADMVIMSNDMLSYQKEEASTNNEGCPFNMVAACRMSGQSAQEAFDTVGALLEESYYEWEETMRQVPARGGDVERDVQRYIKGIQDVVQANITWSFRSKRYLGVHAPEVRRTKKFDVMTHPPYLDKDMAELR</sequence>
<sequence_length>356</sequence_length>
<class_name>Sesquiterp_synthase</class_name>
<most_likely_architecture>
<domain name="sesquiterpene">
<c-evalue>2e-36</c-evalue>
<i-evalue>3.3e-33</i-evalue>
<score>114.4</score>
<start>4</start>
<end>333</end>
<domain_length>330</domain_length>
</domain>
</most_likely_architecture>
<other_matching_domains>
<domain name="fusicocadiene">
<c-evalue>7.9e-06</c-evalue>
<i-evalue>0.0093</i-evalue>
<score>13.5</score>
<start>43</start>
<end>343</end>
<domain_length>301</domain_length>
</domain>
</other_matching_domains>
</protein>
Please note that if domains matched the protein sequence during the annotation step but don't belong to the most-likely architecture, those domains will be listed in the xml file under the tag <other_matching_domains>
.