Docker for Reproducible Analyses in R


Running analyses in Docker to reproduce results

Reproducibility is one of the cornerstones of scientific research, ensuring that findings can be independently verified and built upon. Open science best practices cover many related concepts, but this tutorial focuses on reproducibility: using the original researcher's data and code to recreate the results. There are many degrees of reproducibility, and we should aim for the gold standard of full replication.


In this tutorial, we'll walk through the process of setting up a reproducible research environment using R, Docker, and GitHub. We'll focus on a specific analysis related to inflammation markers, providing scripts and resources for replication.

Prerequisites

  • Basic familiarity with the R programming language
  • Understanding of version control with Git and GitHub
  • Familiarity with Docker concepts

Step 1: Cloning the Repository

  1. Open a terminal or command prompt.
  2. Clone the repository to your local machine using Git:
git clone https://github.com/nghuixin/reproducible_research.git

Step 2: Setting Up the Analysis Environment with Docker

  1. Docker is a platform that allows you to package, distribute, and run applications in containers. Containers are lightweight, standalone, and portable environments that include everything needed to run an application, including the system libraries, code, runtime, and system tools. Docker ensures that the analysis environment remains consistent across different machines.
  2. Pull the Docker image for the analysis environment by running the following command:
docker pull nghuixin/infl_marker_analysis:1.0.1
  • This command fetches the Docker image named nghuixin/infl_marker_analysis from Docker Hub, a public registry for Docker images.
  • The 1.0.1 tag pins the version of the image to pull, so everyone who runs the command gets the same image.
  3. Navigate to the cloned repository directory in your terminal.
  4. Run the Docker container with the following command:
docker run --rm -p 8787:8787 -e DISABLE_AUTH=true \
 -v "$(pwd)/data":/home/rstudio/data nghuixin/infl_marker_analysis:1.0.1
  • docker run is the command to run a Docker container.
  • The --rm flag removes the container once it's stopped, ensuring a clean environment.
  • The -p 8787:8787 flag maps port 8787 on your local machine to port 8787 in the container. This is necessary for accessing the RStudio server running inside the container.
  • The -e DISABLE_AUTH=true flag disables authentication for the RStudio server, allowing easy access.
  • The -v "$(pwd)/data":/home/rstudio/data flag mounts the data subfolder of the current directory on your local machine to the /home/rstudio/data directory inside the container. This enables sharing files between your local machine and the container. In bash or sh, $(pwd) expands to the current directory; in PowerShell, write ${PWD} instead.
  • nghuixin/infl_marker_analysis:1.0.1 specifies the Docker image to run.
  5. Access the RStudio session via your browser at localhost:8787. This environment comes pre-configured with all necessary dependencies and packages.
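
A common stumbling block with the -v flag is how the host path gets expanded. The sketch below assumes a POSIX shell such as bash or sh and compares the candidate spellings without starting a container. The variable names (braces, subshell, builtin_var, mount_arg) are illustrative only.

```shell
# Compare ways of writing the host side of the -v mount in bash/sh.
# No container is started; the script only prints the resulting paths.
unset pwd                       # make sure no variable called "pwd" exists

braces="${pwd-}/data"           # ${pwd} is a (normally unset) shell variable
subshell="$(pwd)/data"          # $(pwd) runs the pwd command
builtin_var="$PWD/data"         # $PWD is maintained by the shell itself

echo "\${pwd}/data expands to: $braces"
echo "\$(pwd)/data expands to: $subshell"
echo "\$PWD/data   expands to: $builtin_var"

# Assemble the mount argument from the absolute path and show the full command.
mount_arg="$subshell:/home/rstudio/data"
echo "docker run --rm -p 8787:8787 -e DISABLE_AUTH=true -v \"$mount_arg\" nghuixin/infl_marker_analysis:1.0.1"
```

In bash or sh the first spelling collapses to /data, which is not the repository's data folder, so the container would see an unrelated (likely empty) directory.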

Step 3: Replicating the Analysis

  1. Explore the provided datasets and scripts within the repository.
  2. If you have access to the original study dataset infl_231010.csv, you can replicate the analysis performed in inflam_marker_part1.R.
    • This script generates figures and linear regression output based on the original dataset.
  3. If you don't have access to the original dataset, you can still view the output of the analysis (analysis/mod1.rda) and the generated figures.
  4. You can also use inflam_marker_part2.R to simulate a dataset from the statistical properties of the original dataset, which works without access to the original data. If you do have access to the original study dataset, you can additionally replicate the simulation and the simulated dataset included in the container.
    • This script replicates the analysis performed on simulated data, including figures and linear regression models.
    • The Docker image pre-populates the data folder at build time with CSV files containing the statistical properties (e.g., mean, standard deviation) of the original study dataset. These files are not included in the GitHub repo due to privacy concerns. See the section about the Dockerfile for further explanation.
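
The choice between the two scripts can be sketched as a small shell helper. This is only an illustration: choose_script is a hypothetical function, and it assumes the scripts live under analysis/ and that the command is run from /home/rstudio inside the container (for example in the RStudio terminal).

```shell
# Hypothetical helper: pick a script depending on whether the restricted
# dataset is present in the given data folder.
choose_script() {
  data_dir="$1"
  if [ -f "$data_dir/infl_231010.csv" ]; then
    # Original dataset available: replicate the original analysis.
    echo "analysis/inflam_marker_part1.R"
  else
    # No original dataset: fall back to the simulation-based analysis.
    echo "analysis/inflam_marker_part2.R"
  fi
}

script="$(choose_script data)"
echo "Run inside the container with: Rscript $script"
```

The part2 script can of course also be run when the original dataset is present; the helper only picks a sensible default.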

Step 4: Collaborating and Sharing via Docker

If you want to share your new analyses added to the original script within the Docker container, you can create a new Docker image with your modifications and share it. Here's how you can do it:

  1. Make your modifications: Within the Docker container, make the necessary changes to the scripts or analysis files.
  2. Commit your changes and create a new Docker image: After making the modifications, commit the container's state to a new Docker image that includes them. Because the container was started with --rm, it is deleted as soon as it stops, so run this from a second terminal while the container is still running:
docker commit [container_id] [new_image_name]

    Replace [container_id] with the ID of the Docker container where you made your changes (docker ps lists it), and [new_image_name] with the desired name for your new Docker image. To push to Docker Hub, the name must start with your Docker Hub username, in the form [username]/[image].

  3. Tag the new image (optional) and push it to Docker Hub: You can tag your new Docker image with a version or any other relevant identifier and share it via Docker Hub so others can pull and use it.
docker tag [new_image_name] [new_image_name]:[tag]
docker push [new_image_name]:[tag]
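
The sharing steps can be strung together as in the sketch below. It is a dry run by default: the hypothetical run helper only prints each command unless DRY_RUN=0 is set. The repository name yourname/infl_marker_analysis and the tag 1.0.2 are placeholders, not names used by the original project.

```shell
# Print each command; execute it only when DRY_RUN=0 is set in the environment.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

container_id="<container_id>"               # placeholder: copy the ID from `docker ps`
new_image="yourname/infl_marker_analysis"   # hypothetical Docker Hub repository
tag="1.0.2"                                 # hypothetical version tag

run docker ps                                          # list running containers and their IDs
run docker commit "$container_id" "$new_image:$tag"    # snapshot the container as a tagged image
run docker push "$new_image:$tag"                      # upload that tag (needs a prior `docker login`)
```

Passing name:tag directly to docker commit tags the image in one step, so a separate docker tag call is only needed when adding further tags later.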

Now that we know how to reproduce the original analyses from a Docker image, let’s examine how the image was built and why we use renv together with it.

Background

Building a Docker image with Dockerfile

The Dockerfile used to build this image is shared on GitHub as well. Let's break it down section by section and explain what each line means:

  1. Base R image:
    FROM rocker/verse:4.3.2

    This line specifies the base Docker image to use as the starting point. rocker/verse:4.3.2 is an image that provides R and various popular R packages, suitable for data analysis tasks. We specify version 4.3.2 here.

  2. Install R dependencies:
    RUN R -e "install.packages(c('renv'), repos = 'https://cloud.r-project.org')"

    This line installs the renv package from CRAN with install.packages(). renv is a package management tool for R that helps ensure reproducibility by managing project-specific R libraries.

  3. Set the working directory:
    WORKDIR /home/rstudio

    This line sets the working directory inside the Docker container to /home/rstudio. All subsequent instructions are executed relative to this directory.

  4. Copy the R script and data:
    COPY analysis/ ./analysis/
    COPY data/  ./data/
    COPY figures/ ./figures/

    These lines copy the analysis scripts, data files, and figures from the local filesystem into the Docker image. This ensures that all necessary files for the analysis are available within the container. However, to avoid copying the original study dataset file (since that will be manually shared with collaborators who have authorized access), we list data/infl_231010.csv in .dockerignore.

  5. Copy renv files:
    COPY renv.lock renv.lock
    COPY renv renv

    These lines copy the renv.lock file and the renv directory from the local filesystem (which you originally cloned from the GitHub repo) into the Docker image. renv.lock contains the exact specifications of the R packages required for the analysis, and the renv directory typically contains additional configuration files for renv.

  6. Restore R packages using renv:
    RUN R -e "renv::restore()"

    This command runs while the image is being built and restores the R packages specified in the renv.lock file that is shared on GitHub. It ensures that the correct versions of R packages are installed, maintaining the reproducibility of the analysis environment.
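
Put together, the instructions discussed above form the file sketched below. To avoid touching the real repository, the sketch writes the Dockerfile and a matching .dockerignore into a scratch directory (demo_dir is an illustrative name) and only prints a build command. That build command is an assumption about how the image could be built, not necessarily the exact command used for the published image.

```shell
# Write the Dockerfile and .dockerignore into a scratch directory for inspection.
demo_dir="$(mktemp -d)"

cat > "$demo_dir/Dockerfile" <<'EOF'
FROM rocker/verse:4.3.2
RUN R -e "install.packages(c('renv'), repos = 'https://cloud.r-project.org')"
WORKDIR /home/rstudio
COPY analysis/ ./analysis/
COPY data/ ./data/
COPY figures/ ./figures/
COPY renv.lock renv.lock
COPY renv renv
RUN R -e "renv::restore()"
EOF

# Keep the restricted dataset out of the build context, and thus out of the image.
printf 'data/infl_231010.csv\n' > "$demo_dir/.dockerignore"

cat "$demo_dir/Dockerfile"
echo "Assumed build command (run from the repository root):"
echo "docker build -t nghuixin/infl_marker_analysis:1.0.1 ."
```

Because .dockerignore is applied before any COPY instruction runs, the COPY data/ line picks up the summary-statistics CSV files but never sees infl_231010.csv.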

Freezing the R environment and packages using renv

renv is an R package that enables the creation of reproducible environments, ensuring that the computational environment used for an analysis can be easily replicated. When I first created the R scripts on my local machine, I ran renv::init() and then renv::snapshot() to freeze the analysis environment. You do not have to do this in the Docker container, because the environment is already restored for you at build time, as we have seen above.

Key features of renv include:

  1. Package Management: renv allows users to easily record the packages used for a specific project. This information is stored in a file called renv.lock, which lists all the packages and their versions.
  2. Environment Isolation: renv creates a per-project library that is completely isolated from the main R library on the machine. This means that packages installed for one project do not affect packages used in other projects.
  3. Dependency Management: renv automatically detects and records package dependencies, ensuring that all required packages are included in the environment.
  4. Version Control Integration: renv integrates seamlessly with version control systems like Git. The renv.lock file can be shared in a GitHub repo, allowing collaborators to easily replicate the environment used for the analysis.
  5. Easy Restoration: renv makes it simple to restore the exact package library used for an analysis on another machine, in our case, in the Docker image when it is built.
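
To make the role of renv.lock more concrete, the sketch below lists the package versions pinned in a lockfile. The lockfile excerpt is invented for illustration (it is not the repository's actual renv.lock, and real lockfiles carry more fields per package), and the awk one-liner assumes the pretty-printed layout that renv writes, with one key per line.

```shell
# Illustrative lockfile excerpt; NOT the repository's actual renv.lock.
lock_file="$(mktemp)"
cat > "$lock_file" <<'EOF'
{
  "R": {
    "Version": "4.3.2"
  },
  "Packages": {
    "dplyr": {
      "Package": "dplyr",
      "Version": "1.1.4",
      "Source": "Repository",
      "Repository": "CRAN"
    },
    "ggplot2": {
      "Package": "ggplot2",
      "Version": "3.4.4",
      "Source": "Repository",
      "Repository": "CRAN"
    }
  }
}
EOF

# Split each line on double quotes: field 2 is the key, field 4 the value.
# Remember the package name, then print it alongside the next "Version" line.
pinned="$(awk -F'"' '
  $2 == "Package" { pkg = $4 }
  $2 == "Version" && pkg != "" { print pkg, $4; pkg = "" }
' "$lock_file")"

echo "$pinned"
```

Pointing lock_file at the renv.lock in the cloned repository instead of the temporary file should give a quick overview of what renv::restore() will install, provided the file follows the same layout.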

Resources

A few resources I’ve found helpful while creating this tutorial:

  1. Building reproducible analytical pipelines with R
    • Data scientists write, read, and review a lot of code and need to reproduce results from previous projects, yet we are rarely taught the tools that make our workflows easy to collaborate on. That is unfortunate, because commonplace software engineering practices can be easily adopted to help us data scientists solve our problems! This book goes in depth into the various DevOps practices for open science. On that note, I should flag a major caveat: Docker, as I’ve come to learn, might not facilitate reproducibility, and we might need to find ways to make the computational environment itself reproducible via other tools like Guix.
  2. R Reproducibility toolkit for the practical researcher
    • I like this resource because of its workshop-style organization; it walks through everything step by step. It is beginner-friendly while also very comprehensive, spanning git, GitHub, renv, and Zenodo.
  3. Setting up a transparent reproducible R environment with Docker + renv
    • Written by the same author as the workshop above. I used this resource heavily as a reference for this tutorial, given that the goal of the article was to illustrate how to reproduce analyses from a paper (the ultimate goal of open science!).