Chapter 2 Reproducing Computation Environment

What is a Computation Environments?

Computation Environments is an interface from where a tools/command can be called if it is installed in that environment.

Example:

Whenever you open a bash terminal there are certain tools/commands available to you for use. Those are most probably present in your .bashrc or a path is exported.

If you using lots of different bioinformatics tools at some point of time must have downloaded a tool, complied (if not already in binary) and then exported path.

export PATH="tool-path:$PATH"

Or if you are lucky, then downloaded using some system package manager, which automatically exports the path for you.

apt-get install fastqc

Basically you have told the computer to make available the tool in you current environment. So that you can just call the tool name instead of giving the whole path every time.

But maintaining this environments becomes very difficult when you working with lots of tools or in a shared computer(HPC) and Cloud.

And what if tomorrow if you wants to change your system or you tell you collaborator to run the whole thing what you have run. Then in both the case all the tools need to be installed again from scratch.

In that case following tools/platforms comes handy

  • Conda - An open-source package manager and environment management system.
  • Docker - A light weight platform which helps create portable virtual environment.

2.1 A basic conda guide

Conda is a package manager and environment management system. That means you can install any tool using this and at the same time you can use this to manager those tools in different environments.

Tools/Packages in conda are distributed in different channels. Read more about conda channels here

2.1.1 Use case

Lets say a Sam is a researcher, who recently sequenced some data and received FASTQ files. He want to do quality check of those data in a reproducible, so that if a college of his try to reproduce the results he can easily do so.

fastqc tool is going to be used here. Let install that using conda.

A simple google of fastqc conda will land you at this fastqc conda home page

Or from command line -

conda search fastqc -c bioconda

Loading channels: done
# Name                       Version           Build  Channel             
fastqc                        0.10.1               0  bioconda            
fastqc                        0.10.1               1  bioconda            
fastqc                        0.11.2      pl5.22.0_0  bioconda            
fastqc                        0.11.3               0  bioconda            
fastqc                        0.11.3               1  bioconda            
fastqc                        0.11.7               5  bioconda            
fastqc                        0.11.7               6  bioconda            
fastqc                        0.11.7      pl5.22.0_0  bioconda            
fastqc                        0.11.7      pl5.22.0_2  bioconda            
fastqc                        0.11.8               2  bioconda            
fastqc                        0.11.9               0  bioconda    

Most of the tools used by Bioinformatics applications are available in bioconda channel. So If you looking for a tool just google it followed by bioconda, and in most cases the first or second link will take to its conda home page. There you will get information about how to install it. Also, you can search for the tool from command line as well.

Here, the output shows different version. If you want you can mention a particular version while installing (such as fastqc=0.11.9) otherwise conda itself choice a particular latest version based on suitable dependency.

lets install fastp in an isolated conda environment qc

conda install fastqc -c bioconda -n qc

This may take few minutes depending upon the number of dependency

Why isolated enviroment is required? Because this makes sure clash of dependency for the tools doesn’t happen.

To use this newly created environment, we need to enter into it

conda activate qc

Here we can normally use the tools/commands.

Now Sam can share this complete environment to his college by just generating a yml file.

conda env export > qc.yml

Read about conda in reproducible-research

2.2 A basic docker guide

Docker is a tool which helps you create light weight containers, which are portable.

You can think a container as a Computer system where all you tools are installed and ready to use.

The logic to make a container can be write in a Dockerfile

An example Dockerfile

FROM ubuntu:19.10

RUN apt-get update
RUN apt-get install -y fastqc 

Built the docker image

cd docker
docker build -t fastqc:latest .

Run the docker container

docker run --rm fastqc:latest fastqc --help

You should see the fastqc help menu

Read about docker in reproducible-research

TODO: advance - alphin images, docker-compose

TODO: packrat/renv

Read more about Reproducible environments in The Turing Way Project