Extraction and writing of ORF sequences with ORFget#

ORFget is a tool provided with ORFtrack that extracts the protein and/or nucleotide sequences of specific subsets of ORFs according to their annotation categories (see here for a description of all ORF categories). ORFget selects ORFs by matching patterns against their annotations, so the same mechanism handles both broad and fine-grained selections.

ORFget has two principal options:

  • -features_include: list of motifs used to select the ORFs that will be included in the fasta output. The sequences whose annotations contain these patterns are retained in the output fasta file.
  • -features_exclude: list of motifs used to select the ORFs that will be excluded from the fasta output. The sequences whose annotations contain these patterns are not written to the output fasta file.

The searched patterns can be specific (for a finer selection) or more general.

Note

For example, the motif "nc", which refers to all noncoding ORFs, appears in the features nc_intergenic, nc_ovp_same-mRNA, nc_ovp_opp-mRNA and nc_ovp_same-tRNA. As a result, the option -features_include nc will keep all four feature categories.

The option -features_include nc_ovp will keep:

  • nc_ovp_same-mRNA
  • nc_ovp_opp-mRNA
  • nc_ovp_same-tRNA

The option -features_include nc_ovp_same will keep:

  • nc_ovp_same-mRNA
  • nc_ovp_same-tRNA

The option -features_include mRNA will keep:

  • nc_ovp_same-mRNA
  • nc_ovp_opp-mRNA

The option -features_exclude opp will eliminate nc_ovp_opp-mRNA and will keep:

  • nc_intergenic
  • nc_ovp_same-mRNA
  • nc_ovp_same-tRNA
  • etc...
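
The selection rules in this note can be modelled as a substring test on the feature name. The following is a minimal Python sketch under that assumption, not ORFget's actual code: `select_features` is a hypothetical helper, and the way include and exclude motifs combine when both are given is also an assumption.

```python
def select_features(features, include=None, exclude=None):
    """Keep names matching any include motif and no exclude motif.

    Assumption: a motif matches when it is a substring of the feature name.
    With no include motifs, every feature is a candidate.
    """
    kept = []
    for name in features:
        if include and not any(motif in name for motif in include):
            continue
        if exclude and any(motif in name for motif in exclude):
            continue
        kept.append(name)
    return kept


# Feature categories taken from the examples in this note, plus c_CDS.
FEATURES = ["c_CDS", "nc_intergenic", "nc_ovp_same-mRNA",
            "nc_ovp_opp-mRNA", "nc_ovp_same-tRNA"]

print(select_features(FEATURES, include=["nc_ovp_same"]))
print(select_features(FEATURES, exclude=["opp"]))
```

Note that, in this model, excluding "opp" leaves c_CDS in the selection, because exclusion starts from all the features.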

Here are presented some examples of selection of ORFs with ORFget.

Extraction of the sequences of all the ORFs of a GFF file#

The following command writes the amino acid sequences of all the ORFs of the gff file annotated by ORFtrack.

orfget -fna /database/genome.fasta -gff /database/mapping_orf_genome.gff

ORFget generates a fasta file containing all the corresponding amino acid sequences. The output fasta file is written in the /database/ directory of the container.

Note

ORFget can also handle gff files that were not generated by ORFtrack. In this case, the motifs given to the -features_include and -features_exclude options must correspond to the feature names found in the 3rd column of the input gff file, so the user should check these names beforehand.
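
To check which feature names a gff file actually contains, the 3rd column can be tallied before running ORFget. The sketch below is one way to do it in Python; `count_feature_types` is a hypothetical helper written for this illustration, and the example lines are made up.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Count the values found in the 3rd column of tab-separated GFF lines."""
    counts = Counter()
    for line in gff_lines:
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) >= 3:
            counts[columns[2]] += 1
    return counts


# Made-up GFF content, for illustration only.
EXAMPLE_GFF = [
    "##gff-version 3\n",
    "chr1\tsrc\tgene\t1\t900\t.\t+\t.\tID=gene1\n",
    "chr1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=mRNA1;Parent=gene1\n",
    "chr1\tsrc\tCDS\t1\t300\t.\t+\t0\tParent=mRNA1\n",
    "chr1\tsrc\tCDS\t600\t900\t.\t+\t0\tParent=mRNA1\n",
]

print(count_feature_types(EXAMPLE_GFF))
```

For a real file, the same function can be given an open file handle, for instance `count_feature_types(open("/database/genome.gff"))`.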

Extraction of the sequences of all noncoding ORFs identified with ORFtrack#

Each of the following commands writes the amino acid sequences of all noncoding ORFs, whatever their status (i.e. intergenic or overlapping) (see here for a description of all ORF categories).

orfget -fna /database/genome.fasta -gff /database/mapping_orf_genome.gff -features_include nc

or

orfget -fna /database/genome.fasta -gff /database/mapping_orf_genome.gff -features_include nc_intergenic nc_ovp

or

orfget -fna /database/genome.fasta -gff /database/mapping_orf_genome.gff -features_exclude c_CDS

The output fasta file is written in the /database/ directory of the container and named after the gff file rootname and the include features.

Extraction of the sequences of a specific subset of ORFs according to their annotation#

The following instruction writes the amino acid sequences of the noncoding ORFs that overlap CDSs on the same strand or on the opposite strand.

orfget -fna /database/genome.fasta -gff /database/mapping_orf_genome.gff -features_include nc_ovp_same-CDS nc_ovp_opp-CDS

Note that -features_exclude operates on all genomic features except those that are excluded. Consequently, to select all noncoding sequences except those overlapping CDSs, mRNAs, tRNAs, and rRNAs, the user must exclude the coding ORFs (c_CDS) as well. Otherwise, they will be kept.

orfget -fna /database/genome.fasta -gff /database/mapping_orf_genome.gff -features_exclude c_CDS nc_ovp_same-mRNA nc_ovp_same-tRNA nc_ovp_same-rRNA nc_ovp_opp-mRNA nc_ovp_opp-tRNA nc_ovp_opp-rRNA

The output fasta file is written in the /database/ directory of the container and named after the gff file rootname and the include features.

Extraction of the sequences of a random subset of ORFs#

Sometimes, for reasons of computation time or storage, the user does not want to deal with all the ORFs of a specific category. ORFget can provide a subset of N randomly selected ORFs from a specific ORF category, where N is defined by the user. The following instruction writes the sequences of 10000 randomly selected noncoding intergenic ORFs.

orfget -fna /database/genome.fasta -gff /database/mapping_orf_genome.gff -features_include nc_intergenic -n 10000
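
Conceptually, this amounts to drawing N ORFs from the selected category. The Python sketch below illustrates the idea with sampling without replacement; it is not ORFget's implementation. `random_subset` and its `seed` parameter are hypothetical, and returning the whole pool when N exceeds its size is an assumption.

```python
import random


def random_subset(orf_ids, n, seed=None):
    """Draw n distinct ORF identifiers at random (without replacement).

    Assumption: if n is at least the pool size, the whole pool is returned.
    """
    pool = list(orf_ids)
    if n >= len(pool):
        return pool
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(pool, n)


# Made-up identifiers standing in for nc_intergenic ORFs.
intergenic_orfs = [f"orf_{i}" for i in range(50000)]
subset = random_subset(intergenic_orfs, 10000, seed=1)
print(len(subset))
```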

Reconstruction of protein sequences#

In addition, ORFget can reconstruct all the protein sequences of a genome (including all isoforms) as they are defined in the original gff file. The following instruction writes the resulting sequences in a fasta file. Please note that in this case the input gff file is the original one, not the one generated by ORFtrack: the ORFtrack file is ORF-centered and contains c_CDS ORFs instead of the exact CDSs (i.e. ATG-STOP, including isoforms) described in the original gff file.

orfget -fna /database/genome.fasta -gff /database/genome.gff -features_include CDS
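
As an illustration of what such a reconstruction involves, the Python sketch below joins the CDS segments that share a parent, reverse-complements them on the minus strand, and translates the result with the standard genetic code. It is a simplified model, not ORFget's code: `rebuild_proteins` and its input layout are hypothetical, the phase column is ignored, and the stop codon is kept as "*".

```python
from collections import defaultdict

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(nucl):
    """Translate complete codons; unknown codons become 'X'."""
    usable = len(nucl) - len(nucl) % 3
    return "".join(CODON_TABLE.get(nucl[i:i + 3], "X")
                   for i in range(0, usable, 3))


def rebuild_proteins(genome, cds_records):
    """genome: {chrom: sequence}; cds_records: (chrom, start, end, strand,
    parent) tuples with 1-based inclusive coordinates, as in a gff file."""
    segments = defaultdict(list)
    for chrom, start, end, strand, parent in cds_records:
        segments[(parent, chrom, strand)].append((start, end))
    proteins = {}
    for (parent, chrom, strand), parts in segments.items():
        spliced = "".join(genome[chrom][s - 1:e].upper()
                          for s, e in sorted(parts))
        if strand == "-":
            spliced = reverse_complement(spliced)
        proteins[parent] = translate(spliced)
    return proteins


# Made-up example: two CDS segments separated by a 6-nt intron.
GENOME = {"chr1": "CCATGGCTGTAAGTAAATAGGG"}
CDS = [("chr1", 15, 20, "+", "mRNA1"), ("chr1", 3, 8, "+", "mRNA1")]
print(rebuild_proteins(GENOME, CDS))  # {'mRNA1': 'MAK*'}
```

Grouping by parent is what lets each isoform produce its own sequence, since isoforms of the same gene carry different parent identifiers.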

Writing amino acid or nucleotide sequences#

By default, ORFget generates the amino acid sequences of the selected ORFs in a fasta file with the extension .pfasta. To generate the nucleotide sequences, or both the nucleotide and amino acid sequences, the user must use the option -type nucl or -type both, respectively.