Before launching ORFribo: prepare the config.yaml

ORFribo is easy to run: it only needs a configuration file, config.yaml, to be edited beforehand. This file contains the parameters that the user can adjust. Here we describe how to fill it in (a pre-filled configuration file is available in the ORFmine/examples/workdir/ directory and corresponds to the example dataset provided in ORFmine/examples/). Please note that every file name must be written without its path, and that the files must be placed in the local folder linked to the /database/ directory of the container. The fastq files must be stored in your local fastq/ directory, which was linked to the /fastq/ directory of the container when the container was launched. We strongly recommend reading the How it works? page before completing the configuration file.

Please place the config.yaml file in the /workdir/ directory of your container and edit it as follows (WARNING: all parameter values must be written between quotes, as in the examples):

Project name

project_name: "intergenic_ORF_translation"
(the name of your project without spaces or special characters)
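The naming rule above can be checked before launching the pipeline. The sketch below is only an illustration: the helper name is hypothetical, and since "special characters" is not defined precisely here, it assumes that only letters, digits and underscores are safe.

```python
import re

# Assumption: "no spaces or special characters" is read here as
# "letters, digits and underscores only". This is not ORFribo's own rule.
_PROJECT_NAME_RE = re.compile(r"[A-Za-z0-9_]+")


def is_valid_project_name(name: str) -> bool:
    """Return True if the name contains only letters, digits and underscores."""
    return bool(_PROJECT_NAME_RE.fullmatch(name))
```

With this reading, "intergenic_ORF_translation" is accepted, while "my project" is rejected because of the space.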

Input names

You must enter the full name (with extension and without the path) of the sequence file and of the annotation file of the genome to be processed. These two files must be stored in the /database/ folder of the container. Examples of such files can be found in the ORFmine/examples/database/ folder.

fasta: "Scer.fna"
(i.e. reference_genome_sequences.fa: fasta file containing the nucleotide sequence of the complete genome)

gff: "Scer.gff"
(i.e. reference_genome_annotations.gff: annotation file)

gff_intergenic: "mapping_orf_Scer.gff"
(i.e. ORFtrack_output.gff: name of the ORFtrack output gff file containing the ORF coordinates for which you want to study the translation activity)

fasta_outRNA: "Scer_rRNA.fa"
(i.e. RNA_sequences_to_remove.fa: sequences to which you do not want Ribo-Seq reads to be mapped, usually rRNAs etc.)
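Because the four input entries must be bare file names, a quick check can catch a path left in by mistake. This is a minimal sketch and not part of ORFribo: the function names are hypothetical, and the configuration is assumed to have already been loaded into a plain dictionary.

```python
# Keys taken from the "Input names" section of config.yaml.
INPUT_KEYS = ("fasta", "gff", "gff_intergenic", "fasta_outRNA")


def is_bare_file_name(value: str) -> bool:
    """True if value is a non-empty file name without any directory part."""
    return bool(value) and "/" not in value and "\\" not in value


def find_path_like_entries(config: dict) -> list:
    """Return the input keys whose value is missing or contains a path."""
    return [key for key in INPUT_KEYS
            if not is_bare_file_name(str(config.get(key, "")))]
```

An empty list means that every input entry looks like a plain file name; any key returned should be edited so that only the file name remains.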

Pipeline option selection

During an ORFribo run, reads are trimmed and then selected according to their length.

already_trimmed: "no"
(If your reads have already been trimmed of their adapter, set this option to "yes". Otherwise, set it to "no")

adapt_sequence: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
(If the reads are not trimmed, you should specify the adapter sequence between the quotes, for example "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA". If you leave the quotes empty while "already_trimmed" is set to "no", ORFribo will try to find the adapter itself, but this can sometimes lead to a wrong adapter sequence)
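The interplay between the two options can be summarised as a small decision function. This sketch only restates the three cases described above; the function name and the returned labels are hypothetical and do not correspond to ORFribo internals.

```python
def trimming_mode(already_trimmed: str, adapt_sequence: str) -> str:
    """Map the two config values to one of the three documented behaviours."""
    if already_trimmed not in ("yes", "no"):
        raise ValueError('already_trimmed must be "yes" or "no"')
    if already_trimmed == "yes":
        return "skip_trimming"            # reads already free of adapter
    if adapt_sequence.strip():
        return "trim_with_given_adapter"  # adapter supplied by the user
    return "auto_detect_adapter"          # empty quotes: adapter is guessed
```

The last case is the one to watch: an empty adapter with already_trimmed set to "no" relies on automatic detection, which may pick a wrong sequence.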

Kmers (or read sizes) to analyze

You have to define the range of read sizes (kmers) to be tested. By default, ORFribo tests the quality of kmers/reads from 25 to 35 bases long and retains those of good quality, i.e. those that mostly map to the coding phase (P0) of CDSs. "Mostly" corresponds by default to a median value of 70%, which can be modified by the user (see Statistical settings).

readsLength_min: "25"
(minimum read size (default 25))

readsLength_max: "35"
(maximum read size (default 35))
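To make the range concrete, the sketch below lists the read sizes implied by the two values. It assumes that both bounds are inclusive, which is how "from 25 to 35 bases long" reads but is not stated explicitly; the function name is hypothetical.

```python
def read_sizes(config: dict) -> list:
    """List the read sizes to test, assuming inclusive bounds."""
    # Values are quoted strings in config.yaml, hence the int() conversion.
    shortest = int(config.get("readsLength_min", "25"))
    longest = int(config.get("readsLength_max", "35"))
    if shortest > longest:
        raise ValueError("readsLength_min must not exceed readsLength_max")
    return list(range(shortest, longest + 1))
```

Under this assumption the default settings give eleven read sizes, from 25 to 35.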

Names of coding features for the detection of kmers of good quality (Step 2)

You might also need to specify the names of the "CDS" feature and of the "gene" attribute, as they appear in the original gff file (not the one generated by ORFtrack), so that ORFribo can identify the features/attributes that correspond to CDSs and genes. That said, we strongly recommend using the "CDS" and "gene" keywords for CDSs and genes respectively, as they are the ones used by ORFtrack in its gff output. Using different keywords might generate conflicts between the detection of good quality kmers (Step 2) and the read mapping step (Step 3), which is performed on the ORFtrack output or equivalent. To avoid such conflicts, we recommend renaming the attributes/features used for genes and CDSs in the original gff file to "gene" and "CDS" respectively.

gff_cds_feature: "CDS"
(Feature name (column 3) corresponding to CDSs in the original annotation gff file, which will be used for the second step, "Detection of kmers of good quality". CDSs are usually referred to as "CDS" in gff files, so "CDS" is the default value. However, CDSs may be referred to as "ORF" in some gff files. In this case, this option should be set to "ORF", though we strongly recommend editing the initial gff file and replacing the feature name "ORF" with "CDS".)

gff_name_attribut: "Name"
(Name of the gene name attribute in the initial gff file (column 9). The default is "Name" but, as for CDSs, please check which attribute holds the gene name in your file and modify this option accordingly (some gff files refer to gene name attributes as "gene_name"), though we strongly recommend editing the initial gff file and replacing the attribute name "gene_name" with "gene".)
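Before setting these two options, it can help to count how many lines of the original gff file actually match them. The following sketch is not an ORFribo tool: the function name is hypothetical, and it assumes GFF3-style attributes in column 9 (key=value pairs separated by semicolons).

```python
def count_matching_features(gff_lines, cds_feature="CDS", name_attribute="Name"):
    """Count CDS-like lines with and without the gene name attribute.

    Returns a (with_attribute, without_attribute) tuple.
    """
    with_attr = 0
    without_attr = 0
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != cds_feature:
            continue  # column 3 holds the feature name
        # Column 9: assumed GFF3 syntax "key=value;key=value".
        keys = {field.split("=", 1)[0].strip()
                for field in cols[8].split(";") if field.strip()}
        if name_attribute in keys:
            with_attr += 1
        else:
            without_attr += 1
    return with_attr, without_attr
```

A result of (0, 0) suggests that the feature name does not occur in column 3, while a large second number suggests that the attribute name should be revised.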

Names of features or ORF categories to be analyzed (noncoding but also coding)

final_counts: "nc_intergenic"
(List of ORF categories for which you want to investigate the translation activity (Step 3). The ORF categories listed here must correspond to those provided by ORFtrack in the 3rd column of its output gff file (the ORF categories identified in your input genome can also be found in the summary.log of ORFtrack). Examples of these two files can be found in the ORFmine/examples/database/ directory as the mapping_orf_Scer.gff and summary.log files. By default, ORFribo will probe the translation activity of all intergenic ORFs, which are referred to as "nc_intergenic" (see the documentation on ORF categories and the annotation process for more details). If you also want to probe, for example, noncoding ORFs lying in the alternative frames of CDSs on the same strand, you must add the "nc_ovp_same-CDS" flag separated by a space, as follows: "nc_intergenic nc_ovp_same-CDS".)
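Since the categories are given as one space-separated string and must match column 3 of the ORFtrack gff file, a typo in a category name is easy to make. The sketch below, with hypothetical function names, splits the string and reports the categories that do not occur in the gff lines provided.

```python
def parse_final_counts(value: str) -> list:
    """Split the space-separated final_counts string into category names."""
    return value.split()


def unknown_categories(value: str, gff_lines) -> list:
    """Return the requested categories absent from column 3 of the gff lines."""
    present = set()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) >= 3:
            present.add(cols[2])
    return [cat for cat in parse_final_counts(value) if cat not in present]
```

An empty result means that every requested category appears at least once in the file.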

Statistical settings

orfstats_mean_threshold: "70"
(Minimum mean percentage of in-frame reads in coding regions required to select a given read size/kmer (default is 70))

orfstats_median_threshold: "70"
(Minimum median percentage of in-frame reads in coding regions required to select a given read size/kmer (default is 70))
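As an illustration of how such thresholds act on one read size, the sketch below takes the percentages of in-frame reads observed over a set of CDSs and compares their mean and median to the two settings. It rests on two assumptions that are not stated here: that the values are computed per CDS, and that both thresholds must be met. The function name is hypothetical.

```python
from statistics import mean, median


def passes_thresholds(in_frame_percentages, mean_threshold=70.0,
                      median_threshold=70.0) -> bool:
    """True if both the mean and the median reach their thresholds."""
    if not in_frame_percentages:
        return False  # no coding region covered: nothing to assess
    return (mean(in_frame_percentages) >= mean_threshold
            and median(in_frame_percentages) >= median_threshold)
```

The two statistics can disagree: with values of 95, 95 and 10, the median is 95 but the mean is below 70, so this read size would be rejected under the assumption above.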