Extraction and writing of ORF sequences with ORFget#

ORFget is a tool provided with ORFtrack that allows the user to extract the protein and/or nulceotide sequences of specific subsets of ORFs according to their annotation categories (see here for a description of all ORF categories). ORFget deals with annotation patterns, thereby allowing different levels of annotation in a very easy fashion.

ORFget has two principal options:

  • --features_include: list of motifs that will be used to define the ORFs that will be included in the FASTA output. The sequences whose annotations include these patterns will be retained in the output FASTA file
  • --features_exclude: list of motifs that will be used to define the ORFs that will be excluded in the FASTA output. The sequences whose annotations include these patterns will not be written in the output FASTA file

The searched patterns can be specific (for a finer selection) or more general.

Note

For example, the motif "nc" appears in the features: nc_intergenic, nc_ovp_same-mRNA, nc_ovp_opp-mRNA and nc_ovp_same-tRNA. As a result, the option -feature_include nc will keep all the four feature categories.

The option -feature_include nc_ovp will keep:

  • nc_ovp_same-mRNA
  • nc_ovp_opp-mRNA
  • nc_ovp_same-tRNA
The option -feature_include nc_ovp_same will keep:
  • nc_ovp_same-mRNA
  • nc_ovp_same-tRNA
The option -feature_include mRNA will keep:
  • nc_ovp_same-mRNA
  • nc_ovp_opp-mRNA
The option -feature_exclude opp will eliminate the nc_ovp_opp-mRNA and will keep:
  • nc_intergenic
  • nc_ovp_same-mRNA
  • nc_ovp_same-tRNA
  • etc...

Here are presented some examples of selection of ORFs with ORFget.

Extraction of the sequences of all the ORFs of a GFF file#

The following command writes the amino acid sequences of all ORFs annotated in the input GFF file.

orfget --fna genome.fasta --gff mapping_orf_genome.gff --singularity 

ORFget generates a FASTA file containing all the corresponding amino acid sequences.

Extraction of the sequences of all noncoding ORFs identified with ORFtrack#

The following commands, each enable the user to write the amino acid sequences of all noncoding ORFs no matter their status (i.e. intergenic or overlapping) (see here for a description of all ORF categories).

orfget --fna genome.fasta --gff mapping_orf_genome.gff --features_include nc --singularity 

or

orfget --fna genome.fasta --gff mapping_orf_genome.gff --features_include nc_intergenic nc_ovp --singularity 

or

orfget --fna genome.fasta --gff mapping_orf_genome.gff --features_exclude c_CDS  --singularity 

Extraction of the sequences of a specific subset of ORFs according to their annotation#

The following instruction writes the amino acid sequences of the ORFs which overlap with CDS on the same, or on the opposite strand.

orfget --fna genome.fasta --gff mapping_orf_genome.gff --features_include nc_ovp_same-CDS nc_ovp_opp-CDS  --singularity 

Notice that using the argument "features_exclude" assumes that the selection operates on all genomic features except those that are excluded. Consequently, if the user wants to select all noncoding sequences except those overlapping CDS, mRNAs, tRNAs, and rRNAs, he must exclude the coding ORFs (c_CDS) as well. Otherwise, they will be kept.

orfget --fna genome.fasta --gff mapping_orf_genome.gff --features_exclude c_CDS nc_same_ovp-tRNA nc_same_ovp-rRNA nc_opp_ovp-mRNA nc_opp_ovp-tRNA nc_opp_ovp-rRNA nc_opp_ovp-mRNA  --singularity 

Extraction of the sequences of a random subset of ORFs#

Sometimes, for computational time or storage reasons, the user does not want to deal with all the ORFs of a specific category. ORFget can provide the user with a subset of N (to be defined by the user) randomly selected ORFs from a specific ORF category. The last instruction writes the sequences of 10000 randomly selected noncoding intergenic ORFs.

orfget --fna genome.fasta --gff mapping_orf_genome.gff --features_include nc_intergenic -n 10000  --singularity 

Reconstruction of protein sequences#

In addition, ORFget enables the reconstruction of all protein sequences of a genome (i.e. all isoforms) according to their definition in the original GFF file. The following instruction writes all the resulting sequences in a FASTA file.

orfget --fna genome.fasta --gff genome.gff --features_include CDS  --singularity 

Writing amino acid or nucleotide sequences#

By default, ORFget will generate the amino acid sequences of the desired ORFs in a FASTA file with the extension .pfasta. If the user wishes to generate the nucleotide or even both nucleotide and amino acids sequences, he must use the option --type nucl and --type both, respectively.