Chapter 2 Reproducing Computation Environment
What is a Computation Environments?
Computation Environments is an interface from where a tools/command can be called if it is installed in that environment.
Example:
Whenever you open a bash terminal there are certain tools/commands available to you for use. Those are most probably present in your
.bashrc
or a path is exported.
If you using lots of different bioinformatics tools at some point of time must have downloaded a tool, complied (if not already in binary) and then exported path.
Or if you are lucky, then downloaded using some system package manager, which automatically exports the path for you.
Basically you have told the computer to make available the tool in you current environment. So that you can just call the tool name instead of giving the whole path every time.
But maintaining this environments becomes very difficult when you working with lots of tools or in a shared computer(HPC) and Cloud.
And what if tomorrow if you wants to change your system or you tell you collaborator to run the whole thing what you have run. Then in both the case all the tools need to be installed again from scratch.
In that case following tools/platforms comes handy
- Conda - An open-source package manager and environment management system.
- Docker - A light weight platform which helps create portable virtual environment.
2.1 A basic conda guide
Conda is a package manager and environment management system. That means you can install any tool using this and at the same time you can use this to manager those tools in different environments.
Tools/Packages in conda are distributed in different channels. Read more about conda channels here
2.1.1 Use case
Lets say a Sam is a researcher, who recently sequenced some data and received FASTQ files. He want to do quality check of those data in a reproducible, so that if a college of his try to reproduce the results he can easily do so.
fastqc tool is going to be used here. Let install that using conda.
A simple google of fastqc conda
will land you at this fastqc conda home page
Or from command line -
conda search fastqc -c bioconda
Loading channels: done
# Name Version Build Channel
fastqc 0.10.1 0 bioconda
fastqc 0.10.1 1 bioconda
fastqc 0.11.2 pl5.22.0_0 bioconda
fastqc 0.11.3 0 bioconda
fastqc 0.11.3 1 bioconda
fastqc 0.11.7 5 bioconda
fastqc 0.11.7 6 bioconda
fastqc 0.11.7 pl5.22.0_0 bioconda
fastqc 0.11.7 pl5.22.0_2 bioconda
fastqc 0.11.8 2 bioconda
fastqc 0.11.9 0 bioconda
Most of the tools used by Bioinformatics applications are available in bioconda channel. So If you looking for a tool just google it followed by bioconda, and in most cases the first or second link will take to its conda home page. There you will get information about how to install it. Also, you can search for the tool from command line as well.
Here, the output shows different version. If you want you can mention a particular version while installing (such as fastqc=0.11.9
) otherwise conda itself choice a particular latest version based on suitable dependency.
lets install fastp in an isolated conda environment qc
This may take few minutes depending upon the number of dependency
Why isolated enviroment is required? Because this makes sure clash of dependency for the tools doesn’t happen.
To use this newly created environment, we need to enter into it
Here we can normally use the tools/commands.
Now Sam can share this complete environment to his college by just generating a yml file.
Read about conda in reproducible-research
2.2 A basic docker guide
Docker is a tool which helps you create light weight containers, which are portable.
You can think a container as a Computer system where all you tools are installed and ready to use.
The logic to make a container can be write in a Dockerfile
An example Dockerfile
FROM ubuntu:19.10
RUN apt-get update
RUN apt-get install -y fastqc
Built the docker image
Run the docker container
You should see the fastqc
help menu
Read about docker in reproducible-research
TODO: advance - alphin images, docker-compose
TODO: packrat/renv
Read more about Reproducible environments in The Turing Way Project