BEAURIS general documentation

BEAURIS: an automated system for the creation of genome portals

This repository contains the common code needed to deploy any BEAURIS genome portal. To make use of it, you will need to create another GitLab repository, following the provided example.

Have a look at the BEAURIS presentation given at JOBIM 2023.

Big picture

BEAURIS is a system that automates the generation of genomic web portals from raw genomic data (genome sequences, annotations, RNA-Seq alignments, etc.).

BEAURIS takes as input:

  • Raw data files (FASTA, GFF, BAM, TSV, …) stored on a storage system

  • YAML files describing the raw data: their location and some metadata (a hypothetical sketch follows this list)

  • A configuration file (beauris.yml) defining what you want to do with this data
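
As an illustration, an organism description file could look roughly like the sketch below. This is a hypothetical example: the field names and layout are assumptions made for readability, not the actual BEAURIS schema; the provided example repository documents the real format.

    # Purely illustrative organism description file (e.g. ./organisms/my_species.yml).
    # Field names are assumptions for the sake of the example; see the provided
    # example repository for the actual BEAURIS schema.
    genus: Mymonas
    species: exampliensis
    assemblies:
      - version: "1.0"
        file: /storage/mymonas/genome_v1.fasta
        annotations:
          - version: "OGS1.0"
            file: /storage/mymonas/ogs1.gff
        tracks:
          - name: rnaseq_liver
            file: /storage/mymonas/rnaseq_liver.bam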

Based on these inputs, BEAURIS runs a complete workflow:

  • Validate the content of the raw data files (e.g. detect broken GFFs) and of the yml files (e.g. detect invalid syntax)

  • Try to correct obvious errors in raw data files (if possible)

  • Launch a selection of data manipulation tasks to produce “derived” data

    • e.g. functional annotation pipelines, indexing tasks, format conversions, pre-generated web content

    • these tasks can run on a Galaxy server, or on a DRMAA-compatible computing cluster (e.g. Slurm)

  • Raw and derived data are then “locked”, i.e. placed in a stable disk location for long-term storage

  • Finally, web applications are deployed by launching Docker containers

    • Each container can mount raw and/or derived data as needed

    • They are currently deployed on a Docker Swarm cluster; support for Kubernetes should be implemented at some point.

BEAURIS is implemented as follows:

  • The code itself is written in Python, in the BEAURIS module

  • Users who want to use it with their own data need to create a site-specific GitLab repository, following the provided example

  • The YAML files referencing raw data are hosted in these site-specific GitLab repositories (which can be private or public)

  • The whole BEAURIS workflow is executed using GitLab CI, on a site-specific GitLab Runner (see the sketch after this list)

  • The GitLab Runner needs to have access to a DRMAA-compatible cluster, to a Docker Swarm cluster, and to some shared storage space
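
To give an idea of how a site-specific repository could hook into the shared code, here is a minimal sketch of a .gitlab-ci.yml relying on GitLab's include mechanism. The project path, ref and file name below are placeholders, not the actual BEAURIS repository layout; the provided example repository shows the real setup.

    # Hypothetical .gitlab-ci.yml for a site-specific repository.
    # The included project path and file are placeholders.
    include:
      - project: "some-group/beauris"
        ref: "main"
        file: "/ci/beauris-pipeline.yml"

    variables:
      # Site-specific settings (storage paths, cluster endpoints, ...) would
      # typically be set here or as GitLab CI/CD variables.
      BEAURIS_CONFIG: "beauris.yml"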

BEAURIS is modular; the following modules are already available (a configuration sketch follows the list):

  • Handle FASTA/GFF/BAM/TSV formats with indexing and QC checking

  • Generate functional annotation with state-of-the-art tools (InterProScan, EggNOG-Mapper, …)

  • Generate JBrowse instances using the Galaxy JBrowse tool

  • Load organisms into an Apollo instance

  • Deploy a Blast form

  • Deploy a simple data download page

  • Deploy a GeneNoteBook web server (WIP)

  • Manage access permissions based on an LDAP server
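
As a rough idea of how this modularity could translate into configuration, the following beauris.yml sketch enables a subset of modules. The keys and values are purely illustrative assumptions, not the real BEAURIS configuration schema; refer to the provided example repository for the actual options.

    # Purely illustrative beauris.yml; keys are assumptions, not the real schema.
    services:
      jbrowse: true
      blast: true
      download: true
      apollo: false
      genenotebook: false
    tasks:
      - interproscan
      - eggnog_mapper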

The main benefits of using BEAURIS are:

  • Users can deploy complete web portals simply by filling in YAML files, without having to care about how BEAURIS will deploy them

  • The use of Merge Requests allows:

    • Manual validation of YAML file content by administrators

    • Deployment of web portals in a staging environment, to check the result before deploying to production

  • BEAURIS takes great care to run only the processes that are actually needed every time the YAML files are modified

More details

Data processing

Adding a new genome means writing a new yml file in the ./organisms directory and proposing it in a Merge Request.

The data generation jobs are launched before merging, because these steps might produce errors that we want to fix before merging (see “Data Locking” below). An MR-specific temp directory is created, where all temporary and derived files are stored.

We avoid rerunning steps for data that:

  • have already been processed in a previous MR

  • have already been partially processed, in the case of an error in the CI workflow.

This is done by comparing the content of the yml files in ./organisms and ./locked, and checking which files are already present in the MR-specific temp directory.

Adding a new annotation to an existing genome means modifying the existing yml file in ./organisms and proposing the change in a Merge Request. BEAURIS will then run tasks depending on the added raw files (see the sketch below).
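
For instance, a second annotation could be declared under an existing assembly, roughly as in the hypothetical snippet below (the field names are the same illustrative assumptions as above, not the real schema):

    # Hypothetical addition to an existing ./organisms/my_species.yml
    assemblies:
      - version: "1.0"
        annotations:
          - version: "OGS1.0"
            file: /storage/mymonas/ogs1.gff
          - version: "OGS2.0"                  # newly added annotation
            file: /storage/mymonas/ogs2.gff

In such a case, only the tasks that depend on the new GFF file (validation, functional annotation, etc.) would typically need to run; the rest of the portal data would be reused as-is.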

Job execution

Jobs can run in several ways:

  1. Slurm (or other DRMAA cluster) jobs (e.g.: functional annotation)

  2. Galaxy tool (and/or workflow) invocation (e.g. JBrowse, GeneNoteBook)

  3. Nextflow workflows on a Slurm (or other HPC) infrastructure (e.g. functional annotation at SEBIMER)

Some jobs can be very long (>1 day), exceeding the timeout of a CI job. BEAURIS is able to catch up with any currently running or already finished job at any time.

Modules

Have a look at BEAURIS modules for more details about available modules.

Authors

BEAURIS was initially developed by several contributors; see the up-to-date contributors list.

FAQ

Why GitLab instead of GitHub?

Mainly because many research institutes are now hosting their own GitLab instances and CI runners, with specific sets of features and security policies.

Although we designed and deployed BEAURIS using GitLab, the code itself was written to be quite generic and adaptable to other platforms like GitHub. If anyone is interested, we would be happy to help test it.

Why is it named BEAURIS?

No reason: it’s not an acronym. It is written in capital letters because it is meant to be pronounced LOUD.

(After long investigations, it seems like BEAURIS came from a mishearing of another project name)