Running example#

In this section, we present several examples of the use of the different programs of ORFmine:

Get started: downloading the examples and launching the container
Annotation and extraction of all the ORFs (coding and noncoding) of a genome with ORFtrack and its companion program ORFget
Prediction of the fold potential of all noncoding ORFs of a genome with ORFold
Characterization of the translation activity of all intergenic ORFs of a genome with ORFribo
Estimation of the ages of all the CDSs of a genome with ORFdate

Get started!#

To guide the user, all command line examples will rely on the examples directory that can be found here. If you want to follow the guidelines, please follow the procedure below:

# create a directory that will host the examples archive and go in it
mkdir ~/orfmine_tutorial
cd orfmine_tutorial

# download the example archive if not already done
wget https://zenodo.org/records/15056381/files/ORFmine.zip?download=1

# otherwise, if you have already downloaded the example archive, place it in the orfmine_tutorial/ directory

# untar the archive
unzip ORFmine.zip

# go into the example directory 
cd ORFmine/examples

Important: All ORFmine commands must include either --singularity or --docker, depending on your setup.These flags ensure that ORFmine executes those tools through the appropriate container.

Annotation and extraction of ORFs with ORFtrack#

Annotation of ORFs#

ORFtrack takes as input (i) a fasta file containing the nucleotide sequences of all chromosomes or contigs, and (ii) the corresponding annotations in a gff file (see the GFF3 documentation for more details on the GFF3 format). Please notice that we strongly recommend using GFF3 formats. The following command runs ORFtrack on the complete genome of Saccharomyces cerevisiae (Scer) available in the ORFmine/examples folder:

orftrack --fna database/Scer.fna --gff database/Scer.gff --singularity

ORFtrack generates a novel gff file mapping_orf_Scer.gff that contains the annotations of all the identified ORFs (see here for more details on the annotation process). A copy of the output can be found in the (ORFmine/examples/database/ folder). Please note that ORFtrack also generates a summary.log file that lists all the ORF categories it has identified.

Extraction of the sequences of noncoding ORFs in a fasta format#

The amino acid sequences of all annotated ORFs or specific subsets of ORFs (i.e. only noncoding intergenic ORFs for example) can be extracted and written in a fasta file with ORFget, a tool provided with ORFtrack. The following instruction writes the amino acid sequences of all yeast noncoding ORFs (including noncoding intergenic ORFs and those that overlap with a genomic feature). ORFget therefore needs the genome sequence (nucleotides) from which the ORF sequences will be extracted and translated into amino acids and the ORFtrack output gff file with the annotations and coordinates of all the identified ORFs. The "--features_include" option indicates the features to be matched in the names of the ORFs that we want to extract. In this case, we are interested in all noncoding ORFs (i.e. all ORFs containing the flag "nc" - see here for more details on the annotation rules of the ORFs), so we indicate it to ORFget through the "--features_include nc" option (see here for more examples).

orfget --fna database/Scer.fna --gff database/mapping_orf_Scer.gff --features_include nc --singularity

ORFget produces a fasta file (mapping_orf_Scer_nc.pfasta) containing the amino acid sequences of all noncoding ORFs annotated by ORFtrack (see here for more examples on the use of ORFget). The output is a fasta file is stored the current directory or in the defined directory using --out based on the gff file rootname and the include features. The user can also extract the nucleotide sequences through the "--type" option (--type nucl) or both amino acid and nucleotide sequences (--type both).

Extraction of the sequences of the CDSs of yeast in a fasta format#

In addition, ORFget enables the reconstruction of all coding sequences (CDSs) of the input genome (in this examlpe, Scer) according to their annotation in the original gff file. When different isoforms are described, it will generate all of them according to the gff file. It, therefore, needs as input the original gff file (not the one created by ORFtrack which only contains STOP-to-STOP ORFs) and the genome sequence in FASTA format.

orfget --fna database/Scer.fna --gff database/Scer.gff --features_include CDS --singularity

An example of this output is available in the ORFmine/examples/database/ directory.

Characterization of the fold potential and disorder and aggregation propensities of the noncoding ORFs with ORFold#

Running ORFold#

ORFold only needs a fasta file containing the amino acid sequences to characterize. In the following example, ORFold will estimate the foldability and the disorder and aggregation propensities, all the amino acid sequences potentially encoded in the noncoding ORFs of yeast. ORFold doesn't require

orfold --faa database/mapping_orf_Scer_nc.pfasta  --options H --singularity

with --options H to indicate that we want to estimate the foldability (H for HCA).

ORFold generates one table in the current directory or in the output directory defined using --out argument. A copy of the output can be found in ORFmine/examples/workdir/orfold directory. The table contains for each input sequence, the fold potential as calculated by HCA [1] (see here for more details on the calculation of this score).

In addition, a gff file containing the annotation of the input sequences can be provided as well. In this case, ORFold writes the fold potential and disorder and aggregation propensities in the 9th column of three gff files respectively. The latter can be subsequently uploaded on a genome viewer like IGV for a visual inspection of the distribution of these properties along the genome.

The following instruction enables the user to probe the fold potential and disorder and aggregation propensities of all the yeast noncoding ORFs annotated by ORFtrack and takes as inputs the corresponding fasta file and the gff file generated by ORFtrack (mapping_orf_Scer.gff located in /database/).

orfold --faa database/mapping_orf_Scer_nc.pfasta --options H --gff database/mapping_orf_Scer.gff --singularity

In addition to the score table, ORFold has generated a new gff file containing for each ORF its fold potential. The output gff file is stored output directory and can be subsequently uploaded on a genome viewer.

Visualization of the distribution of the fold potential with ORFplot#

ORFplot enables the visualization of the distribution of the fold potential of the amino acid sequences potentially encoded in noncoding ORFs along with the one of a reference dataset of globular proteins taken from Mészáros et al. [2]. It takes as input the table generatd by ORFold that must be located in the /workdir/orfold/ directory of the container.

 orfplot --tab workdir/orfold/mapping_orf_Scer_nc.tab --labels  "Yeast noncoding ORFs" --singularity

When running orfplot inside a Singularity container, sometimes you must set the QT_QPA_PLATFORM=offscreen environment variable to avoid Qt-related errors, while this step is not required when using Docker.

 QT_QPA_PLATFORM=offscreen orfplot --tab workdir/orfold/mapping_orf_Scer_nc.tab --labels  "Yeast noncoding ORFs" --singularity

The output graphic is generated in output directory (see here for other examples of use of ORFplot). Each distribution is compared with the one of the globular protein dataset with a Kolmogorov Smirnov test. Asterisks on the plot denote level of significance: * < 0.05, ** < 0.01, *** < 0.001.

Characterization of the translation activity of all intergenic ORFs of yeast#

ORFribo, based on ribosome-profiling data, probes the translation activity of all the ORFs of interest of a genome (be they coding or noncoding) that are annotated in a gff file. We strongly recommend using ORFtrack to generate the gff file since ORFtrack and ORFribo were jointly developed so that the output of ORFtrack fits the prerequisites of ORFribo. In this example, we investigate the translation of all intergenic ORFs of yeast annotated as "nc_intergenic" by ORFtrack in its output mapping_orf_Scer.gff. An example of this file that has been generated by ORFtrack and that will be used as input by ORFribo is available in the ORFmine/examples/database/ folder.

Here is presented a simple example that enables one to run ORFribo from inputs and a pre-configurated config.yaml file which are all available in the ORFmine/examples/ folder. However, before running ORFribo on this example dataset, we recommend the user read in details the phylosophy and main steps of ORFribo.

Input files#

ORFribo relies on the fastq file(s). an example of such file is available in the ORFmine/examples/fastq folder

/fastq/SRR1520313_17031088.fastq.gz

and 4 additional files. Examples of such files are available in the ORFmine/examples/database/ folder.

Scer.fna: fasta file containing the nucleotide sequences of the chromosomes of the yeast genome
Scer.gff: the original annotation of yeast in a gff file
mapping_orf_Scer.gff: the gff file containing all the ORFs detected by ORFtrack and including the nc_intergenic ORFs for which we want to estimate the translation activity. It therefore involves that you have previously generated this file as presented here
Scer_rRNA.fa: A fasta file with the sequences that we want to remove from the mapping step, here rRNA sequences

Running ORFribo#

ORFribo is launched like this:

orfribo --fna database/Scer.fna --gff database/Scer.gff --gff-intergenic database/mapping_orf_Scer.gff --fastq fastq/ --rna-to-exclude database/Scer_rRNA.fa --out ORFribo --trimmed --singularity

Be careful to have enough memory on your computer/cluster using the parameter --ram for you analysis as a lack of RAM would lead the pipeline to crash without giving any error message. So if the program stops without any obvious reason this might be the reason why, and one should restart the pipeline with more resources.

Main outputs#

ORFribo generates many intermediate files that can be useful for further analysis. Their description is available here. The main output is the table named all_samples_genome.mean_median70_reads_concatenated.tab that summarizes the results for each ORF of interest (e.g. nb and fractions of F0, F1 and F2 reads). An example of this output table can be found in the examples.zip archive in the ORFmine/examples/ORFribo/RESULTS/Genome directory. A full description of the summary table can be found here.

Running ORFdate#

ORFdate estimates the evolutionary ages of a set of ORFs (coding but also noncoding) based on phylostratigraphy. It takes as inputs:

a distance phylogenetic tree in newick format, containing the species of interest (focal species) and its neighbors (see ORFmine/examples/database/orfdate_inputs/ for an example). All branches must be associated with their relative distance on which the age estimation will be based.
a two columns csv file (i.e. columns separated by commas) with the path to the sequence fasta files of each species as first column, and the associated species names as indicated in the phylogenetic tree as second column (see ORFmine/examples/database/orfdate_inputs/ for an example). It, therefore, involves the fasta files of the complete proteomes (amino acid sequences) of the focal and its neighbors to be stored in the corresponding directories.
the fasta files containing all the amino acid sequences of the focal and its neighboring species.

A subdirectory can be created for more clarity (e.g. /database/orfdate_inputs/). The following command line will estimate the ages of 42 yeast CDSs (subsample of yeast CDSs that can be found in the file ORFmine/examples/database/orfdate_inputs/sample_42_Scer_CDS.pfasta) based on the phylogenetic tree of the Saccharomyces sensu stricto (i.e. S. cerevisiae, S. paradoxus, S. mikatae, S. kudrievezii, S. arboricola and S. bayanus )(the corresponding tree can be found in ORFmine/examples/database/orfdate_inputs/Saccharomycew_species.nwk). Please note that the last node of the tree will be used as the upper limit for the age estimation. Consequently, all sequences with a match in the farest species, with respect to the focal, are associated with the same upper bounded estimated age regardless of whether they may have other matches in more distant species outside the input tree (see Papadopoulos et al, [3] for more details).

orfdate --target Scer --tree database/orfdate_inputs/Saccharomyces_species.nwk --mapping database/orfdate_inputs/names.csv --singularity

with the --target option corresponding to the name of the focal species as indicated in the tree. Please note that additional arguments including BLAST parameters can be provided. See here for more details.

An example of the names.csv file is available in the ORFmine/examples/database/orfdate_inputs/ directory as well as the 5 fasta files containing all the sequences of the 5 neighboring species of the focal.

ORFdate generates two outputs in csv format. Examples of these outputs can be found in the ORFmine/exmaples/workdir/orfdate/ folder.

sample_42_Scer_CDS_hits.csv which reports the number of significant matches for all sequences of the focal species (rows) across the neighbors species of the tree (columns)
sample_42_Scer_CDS_dated.csv which reports for each sequence (col 1), the farest neighbor species for which a significant math has been found (col2), and the time of divergence between this neighbor and the focal species in Mya (col3)

ORFdate generates also intermediate files (blast databases and blast outputs). Examples of such outputs can be found in the ORFmine/examples/workdir/orfdate/ directory.

References#

Bitard-Feildel, T. & Callebaut, I. HCAtk and pyHCA: A Toolkit and Python API for the Hydrophobic Cluster Analysis of Protein Sequences. bioRxiv 249995 (2018).
Mészáros B, Erdős G, Dosztányi Z (2018) IUPred2A: context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic acids research 46:W329–W3372.
Papadopoulos, C., Arbes, H., Chevrollier, N., Blanchet, S., Cornu, D., Roginski, P., Rabier, C., Atia, S., Lespinet, O., Namy, O., Lopes, A. (submitted).