Output folder architecture (/workdir/orfribo/)

Here is the folder architecture of the ORFribo output stored in /workdir/orfribo.

The following output example is based on the example provided in the /ORFmine/examples/ directory with the fastq SRR1520313_17031088.

/workdir/orfribo/
├── dag_all.svg
├── dag_last_run.svg
├── RESULTS/
    ├── config.yaml
    ├── ORFribo_yeast_example.Analysis_Report.txt
    ├── adapter_lists/
        └── SRR1520313_17031088.txt
    ├── Bam2Reads_genome_output/
        ├── all_samples_genome.25-35.mean70_median70_reads_concatenated.tab
        └── SRR1520313_17031088
            ├── genome.25-35.mean70_median70_reads_concatenated.tab
            ├── length_25
                ├── genome.25-35.mean70_median70_reads.tab
                ├── genome.25-35.mean70_median70_periodicity_all.tab
                ├── genome.25-35.mean70_median70_periodicity_start.tab
                └── genome.25-35.mean70_median70_periodicity_stop.tab
            ├── length_26
                ├── genome.25-35.mean70_median70_reads.tab
            ├ ...
            └── length_35
                └── genome.25-35.mean70_median70_reads.tab
    ├── annex_database/
        ├── NamedCDS_Scer.gff
        ├── index_bowtie2.1.bt2
        ├ ...
        ├── index_hisat2.1.ht2
        ├ ...
        ├── outRNA_bowtie2.1.bt2
        └ ...
    ├── fastqc/
        ├── SRR1520313_17031088_fastqc.html
        └── SRR1520313_17031088_fastqc.zip
    ├── selected_tables/
        └── threshold_mean70_median70
            └── SRR1520313_17031088.txt
    ├── BAM/
        ├── SRR1520313_17031088.25-35.bam
        └── SRR1520313_17031088.25-35.bam.bai
    ├── no-outRNA/
        └── SRR1520313_17031088.25-35.no-outRNA.fastq.gz
    └── ORFribo/
        ├── database/
            ├── exome.nfasta
            ├── exome_elongated.nfasta
            ├── exome_elongated.nfasta.fai
            ├── exome_elongated.gff
            └── exome_elongated_with_gene_features.gff
        ├── annex_database/
            ├── exome_elongated.exons_Scer.fna
            ├── exome_elongated.exons_Scer.gff.gtf
            ├── exome_elongated.exome_index_bowtie2.1.bt2
            ├── ...
            ├── exome_elongated.exome_index_hisat2.1.ht2
            └── ...
        ├── BAM_exome/
            ├── exome_elongated.SRR1520313_17031088.25-35.bam
            └── exome_elongated.SRR1520313_17031088.25-35.bam.bai
        ├── Bam2Reads_exome_output/
            ├── SRR1520313_17031088_25
                ├── exome.25-35_reads.tab
                ├── exome.25-35_periodicity_all.tab
                ├── exome.25-35_periodicity_start.tab
                ├── exome.25-35_periodicity_stop.tab
                └── exome.25-35_reads.stats
            ├── ...
            └── SRR1520313_17031088_35
                ├── exome.25-35_reads.tab
                ├── exome.25-35_periodicity_all.tab
                ├── exome.25-35_periodicity_start.tab
                ├── exome.25-35_periodicity_stop.tab
                └── exome.25-35_reads.stats
        └── riboWaltz/
              ├── psite_offset.csv
              ├── best_offset.txt
              └── exome_elongated.SRR1520313_17031088
                    ├── 21.tiff
                    ├── 22.tiff
                    ├── ...
                    └── 35.tiff
├── benchmarks/
    ├── adapt_trimming
        └── SRR1520313_17031088.benchmark.txt
    └── ...
├── logs/
    ├── RiboDoc_package_versions.txt
    ├── adapt_trimming
       ├── SRR1520313_17031088_cutadapt.log
       └── SRR1520313_17031088_trim_value.log
   └── ...
└── logsTmp/
    ├── SRR1520313_17031088_adapt_trimming.log
    ├── SRR1520313_17031088_bowtie2_run_outRNA.log
    ├── SRR1520313_17031088_run_mapping_bowtie2.log
    └── SRR1520313_17031088_run_mapping_hisat2.log

Main output table

The main output table can be found in the RESULTS/Bam2Reads_genome_output/ folder and is named all_samples_genome.25-35.meanXX_medianYY_reads_concatenated.tab (with XX and YY standing for the mean and median thresholds, here 70 and 70 respectively). This table summarizes the results for each ORF (i.e. the numbers and fractions of reads in the three frames of the ORF), calculated over all the retained kmers of each input dataset (in this example, there is only one dataset, SRR1520313_17031088).

When ORFribo is run on multiple datasets, it also provides, for each dataset, a folder named after the dataset. This folder contains a table with the same format as the all_samples_genome.25-35.meanXX_medianYY_reads_concatenated.tab table, but calculated only on the retained kmers of the corresponding dataset: dataset_XYZ/genome.25-35.mean70_median70_reads_concatenated.tab. The directory of each dataset also contains the intermediate tables calculated for each retained kmer. They are stored in the RESULTS/Bam2Reads_genome_output/dataset_XYZ/length_i/ directory, where i stands for the kmer size. When only one dataset has been provided to ORFribo (as in this example), the output table in the dataset folder is the same as the all_samples_genome.25-35.meanXX_medianYY_reads_concatenated.tab table in the Bam2Reads_genome_output directory.

The summary table all_samples_genome.25-35.meanXX_medianYY_reads_concatenated.tab contains for each ORF of the studied category(ies), the number and fraction of reads in its three frames (frame 0 (F0 or P0) and its two alternative frames +1 and +2 (P1/F1 and P2/F2 respectively)). This table has 8 columns:

Seq_ID : Identifier of the ORF
Num_reads : Number of reads having a P-site aligned on the ORF
Num_p0 : Number of reads with their P-site in phase 0 of the ORF
Num_p1 : Number of reads with their P-site in phase 1 of the ORF
Num_p2 : Number of reads with their P-site in phase 2 of the ORF
Perc_p0 : Percentage of reads aligned on the ORF with their P-site in phase 0 (i.e. Num_p0 divided by the total number of reads mapping on the ORF)
Perc_p1 : Percentage of reads aligned on the ORF with their P-site in phase 1
Perc_p2 : Percentage of reads aligned on the ORF with their P-site in phase 2
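
To illustrate how these columns relate to each other, here is a minimal Python sketch that reads such a table and recomputes the phase-0 share from the counts. It assumes the file is tab-separated and starts with a header row carrying the column names listed above; that layout, the function name read_phase_table and the sample rows are assumptions made for illustration and are not part of ORFribo. The ratio is recomputed from Num_p0 and Num_reads rather than read from Perc_p0, so the sketch does not depend on how the percentage is scaled in the file.

```python
import csv
import io


def read_phase_table(handle):
    """Yield (Seq_ID, Num_reads, recomputed P0 fraction) for each ORF row.

    Assumes a tab-separated table with a header row naming the columns
    (Seq_ID, Num_reads, Num_p0, ...); this layout is an assumption.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # Counts are parsed through float() in case they are written as "100.0".
        total = int(float(row["Num_reads"]))
        p0 = int(float(row["Num_p0"]))
        # Avoid dividing by zero for ORFs without any mapped P-site.
        frac_p0 = p0 / total if total else 0.0
        yield row["Seq_ID"], total, frac_p0


# Made-up rows for illustration only (not real ORFribo output).
example = io.StringIO(
    "Seq_ID\tNum_reads\tNum_p0\tNum_p1\tNum_p2\tPerc_p0\tPerc_p1\tPerc_p2\n"
    "orf_A\t100\t80\t12\t8\t80.0\t12.0\t8.0\n"
    "orf_B\t0\t0\t0\t0\t0.0\t0.0\t0.0\n"
)
rows = list(read_phase_table(example))
```

To use it on a real run, pass an open file handle on the all_samples_genome.25-35.meanXX_medianYY_reads_concatenated.tab table instead of the in-memory example.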

Output folder details

  • The dag files (dag_all.svg and dag_last_run.svg) represent the analysis steps applied to your datasets.

  • The logs/ folder gathers the error output messages of all the tools used in the ORFribo analysis pipeline. In the event of an error, it lets you identify the problematic step (and give us feedback if needed).

  • The RESULTS/ folder contains these files and folders:

  • i) PROJECT_NAME.Analysis_Report.txt gathers the standard output of each tool of the analysis pipeline. It shows how many reads are present at each step of the analysis: a) raw reads, b) reads after trimming and length selection, c) reads after out RNA depletion, d) reads after double alignment on the reference genome.

  • ii) config.yaml: a backup of the parameters used for the run.

  • I) Bam2Reads_genome_output/: It contains the final outputs of the analysis (explained above).

  • II) annex_database/: It contains the indexes for the genome alignment and the gff with all CDSs named.

  • III) fastqc/: It contains data quality controls.

  • IV) adapter_lists/: It contains, for each dataset, a text file listing the adapters that were either found in the config.yaml file or determined from the data if the user did not provide any adapter sequence in the configuration file.

  • V) selected_tables/: It contains one subfolder for every threshold the user chose for the alignment on CDSs step (median or mean of P0 proportions), as multiple thresholds may be tried. Each subfolder holds one file per dataset, which records the read lengths that passed the threshold and were kept for the alignment on all ORFs.

  • VI) BAM/: It contains a BAM file for each dataset (allows visualization on tools such as IGV).

  • VII) no-outRNA/: It contains the trimmed fastq files after removal of the reads aligned on unwanted sequences.

  • VIII) ORFribo/:

    • I) Bam2Reads_exome_output/: It contains one subfolder per kmer (dataset_XYZ_i, with i standing for the kmer size), which itself contains a table exome.25-35_reads.tab. The information in this table is the same as in the all_samples_genome.25-35.meanXX_medianYY_reads_concatenated.tab table described above, but only for the CDSs annotated in the original gff file. Each CDS is associated with the fractions of reads that map on its coding frame (i.e. in-frame reads, also named P0 reads) or on its alternative frames. These tables are used for the detection of good quality kmers (Step 2) but can also be used to probe the translation activity of CDSs. In that case, you just need to merge the tables of all retained kmers (the list of retained kmers can be found in the file /workdir/orfribo/RESULTS/selected_tables/threshold_meanXX_medianYY/dataset_XYZ.txt) into a global table (see here for more details). This final table will contain, for each CDS, the number and fraction of reads mapping in-frame or in its +1 and +2 frames.
    • II) database/: It contains the re-formatted fasta and gff with artificially elongated CDSs, so that reads aligning on the borders of CDSs (i.e. on the start and stop codons) are not missed.
    • III) annex_database/: It contains the indexes for the exome alignments (alignments on all exons of CDSs as indicated in the original gff file) and the gtf for riboWaltz.
    • IV) BAM_exome/: It contains a BAM file for each dataset, corresponding to the alignment on the transcript-by-transcript fasta (exome).
    • V) riboWaltz/: It contains the P-site offsets file.
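
The merging of per-kmer tables mentioned for Bam2Reads_exome_output/ can be sketched in Python as follows. This is only an illustration under stated assumptions: the tables are taken to be tab-separated with a header row and the same column names as the main output table, the retained kmer sizes are supplied by the caller instead of being parsed from the selected_tables file (whose exact format is not described here), and the names merge_kmer_tables and Frac_p0/Frac_p1/Frac_p2 are hypothetical.

```python
import csv
from collections import defaultdict
from pathlib import Path

COUNT_COLUMNS = ("Num_reads", "Num_p0", "Num_p1", "Num_p2")


def merge_kmer_tables(exome_dir, dataset, lengths, table_name="exome.25-35_reads.tab"):
    """Sum read counts per CDS over the given kmer sizes.

    exome_dir: path to RESULTS/ORFribo/Bam2Reads_exome_output/
    dataset:   dataset name used as subfolder prefix (dataset_XYZ)
    lengths:   retained kmer sizes, supplied by the caller
    Assumes tab-separated tables with a header row (an assumption).
    """
    totals = defaultdict(lambda: dict.fromkeys(COUNT_COLUMNS, 0))
    for length in lengths:
        table = Path(exome_dir) / f"{dataset}_{length}" / table_name
        with open(table, newline="") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                for column in COUNT_COLUMNS:
                    totals[row["Seq_ID"]][column] += int(float(row[column]))
    merged = {}
    for seq_id, counts in totals.items():
        n = counts["Num_reads"]
        merged[seq_id] = dict(counts)
        # Recompute the per-frame fractions from the summed counts.
        for phase in ("p0", "p1", "p2"):
            merged[seq_id][f"Frac_{phase}"] = counts[f"Num_{phase}"] / n if n else 0.0
    return merged
```

Summing the counts first and recomputing the fractions afterwards weights each kmer by its number of reads, which differs from averaging the per-kmer percentages.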

Particular case of multiple inputs

ORFribo can be launched on multiple datasets (fastqs) as long as the input files concern the same organism - i.e. share the same reference fasta and gff files (more details here). For this, you just have to put all the fastq files in the /fastq/ folder and launch ORFribo as you usually do with:

orfribo CPU MEMORY

The output architecture will be the same as the one obtained for a single input, except that all intermediate files will be stored in subdirectories corresponding to each input dataset as follows:

orfribo/
├── dag_all.svg
├── dag_last_run.svg
├── RESULTS/
    ├── config.yaml
    ├── adapter_lists/
        └── one_file_by_sample.txt
    ├── Bam2Reads_genome_output/
        ├── all_samples_genome.25-35.mean70_median70_reads_concatenated.tab
        └── one_folder_by_sample
            ├── concatenated_results_table.tab
            └── one_folder_by_length
                └── Bam2Reads_results_for_all_ORFs_alignment
    ├── annex_database/
        ├── gff_files_with_named_CDSs.gff
        ├── indexes_for_bowtie2_alignments.bt2
        └── indexes_for_hisat2_alignments.ht2
    ├── fastqc/
        ├── one_html_by_sample.html
        └── one_zip_by_sample.zip
    ├── selected_tables/
        └── one_folder_by_sample
    ├── BAM/
        ├── one_bam_by_sample.bam
        └── one_bai_by_bam.bai
    ├── no-outRNA/
        └── one_file_by_sample.fastq.gz
    └── ORFribo/
        ├── database/
            ├── intermediate_fasta_files.fa
            └── intermediate_gff_files.gff
        ├── annex_database/
            ├── intermediate_fasta_files.fa
            ├── intermediate_gtf_file_for_riboWaltz.gtf
            ├── indexes_for_bowtie2_alignments.bt2
            └── indexes_for_hisat2_alignments.ht2
        ├── BAM_exome/
            ├── one_bam_by_sample.bam
            └── one_bai_by_bam.bai
        ├── Bam2Reads_exome_output/
            └── one_folder_by_sample_and_length
                └── Bam2Reads_results_for_CDS_alignment
        └── riboWaltz/
              └── riboWaltz's qualitative analysis results
├── benchmarks/
    ├── one_benchmark_folder_by_rule
        └── one_benchmark_file_by_job
├── logs/
    └── one_log_folder_by_rule
        └── one_log_file_by_job_and_command
└── logsTmp/
    └── one_file_by_steps_of_interest_for_alignment_stats
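
With several datasets, the per-sample tables can be collected programmatically. The Python sketch below walks the layout shown above and maps each sample folder to its concatenated table. It assumes that the per-sample table name ends with _reads_concatenated.tab, as in the single-dataset example; the function name per_sample_tables is hypothetical.

```python
from pathlib import Path


def per_sample_tables(results_dir):
    """Map each sample folder name to its concatenated genome table(s).

    results_dir: path to the RESULTS/ folder of an ORFribo run.
    Relies on one folder per sample inside Bam2Reads_genome_output/,
    each holding a *_reads_concatenated.tab file (assumed file suffix).
    """
    genome_dir = Path(results_dir) / "Bam2Reads_genome_output"
    tables = {}
    for sample_dir in sorted(p for p in genome_dir.iterdir() if p.is_dir()):
        found = sorted(sample_dir.glob("*_reads_concatenated.tab"))
        if found:
            tables[sample_dir.name] = found
    return tables
```

For example, per_sample_tables("/workdir/orfribo/RESULTS") would return one entry per dataset; the all_samples table is skipped because it sits directly in Bam2Reads_genome_output/ and not in a sample folder.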