Part I: Reproducible analyses in R with Docker

📌
This article assumes that you:
  • have attempted the Docker 101 tutorial (i.e., know basic commands like build, run, kill, stop)
  • know the difference between an image and a container
  • have tried out a few examples of how to use rocker together with Docker to reproduce your R analyses
  • want to learn how to interact with your analyses in RStudio inside a Docker container, not just run your .Rmd file or .R script to get an output!
  • don't expect a perfect tutorial with perfect examples! This article is very much a WIP, just like my understanding of Docker 🙂

Locking down a reproducible environment in R for future reference and analyses

If your goal is to create a reproducible environment for your R Markdown file (which may produce output files like .csv, .xlsx, and .jpeg via graphics devices or libraries like xlsx), read on:

# Start with R version 4.3.2
FROM rocker/verse:4.3.2

# Set the working directory to /home/rstudio
WORKDIR /home/rstudio

# Install some Linux libraries that R packages need
RUN apt-get update && apt-get install -y libcurl4-openssl-dev libssl-dev libxml2-dev libgsl0-dev libhdf5-dev  libproj-dev libgdal-dev libudunits2-dev

# Install additional Linux packages needed for R functionality
RUN apt-get update \
  && apt-get install -y --no-install-recommends \
    wget \ 
    graphviz \ 
    texlive-latex-extra \ 
    lmodern \ 
    perl   \
    pandoc

# Install the 'remotes' package from CRAN using R
RUN R -e "install.packages('remotes', repos = c(CRAN = 'https://cloud.r-project.org'))"

# Copy necessary files from the local directory to the container
COPY renv.lock renv.lock
COPY renv/activate.R renv/activate.R
COPY .Rprofile .Rprofile

# Update package list and check system requirements using sysreqs
RUN sudo apt update \
 && R -e "system(sysreqs::sysreq_commands('DESCRIPTION', 'linux-x86_64-ubuntu-gcc'))" \
 && apt install -y libmagick++-dev

# Install R packages from CRAN with 'install2.r'
RUN install2.r  --error --skipinstalled --ncpus -1 \
 rmarkdown yaml dplyr tidyverse DescTools ggplot2 readxl lme4 ModelMetrics merTools lmtest renv ech emmeans haven lmerTest metaforest rstatix ggthemes scales pandoc labelled Matrix \
    && rm -rf /tmp/downloaded_packages

# Change ownership of the directory to the 'rstudio' user and activate 'renv'
RUN chown -R rstudio . \
 && sudo -u rstudio R -e 'source("renv/activate.R");  renv::restore()'

A little context for this article: I use R for statistical analyses like simple regressions, correlations, and linear mixed models, and for plotting and creating figures. Occasionally I need to save some output as a csv file. So the dependencies installed here cover the aforementioned use cases.

Attempt 1

In this Dockerfile I specified rocker/verse:4.3.2 as the base image, a rocker image pinned to R version 4.3.2 that bundles RStudio Server, the tidyverse, and publishing tools. Rocker provides several Docker images that are useful for R programming; there are also slimmer options like rocker/rstudio and rocker/tidyverse. You need to find the image that best suits your needs as the base image.

Next we install the system dependencies (Linux libraries required by the R packages). Although I do not have any spatial data, for some reason I could not get the image to build without Linux packages like libproj-dev.

libcurl4-openssl-dev: Required for web data handling and secure network communication.
libssl-dev: Enables Secure Sockets Layer (SSL) encryption for secure network connections.
libxml2-dev: Used for parsing and working with XML data.
libgsl0-dev: Provides mathematical functions for scientific and numerical computing.
libhdf5-dev: Supports storage and manipulation of large datasets in HDF5 format.
libproj-dev: Used for cartographic projections and transformations in geospatial applications.
libgdal-dev: Enables reading and writing of geospatial data formats through GDAL.
libudunits2-dev: Supports unit conversion for scientific and geospatial applications.
wget: Command-line tool for downloading files from the internet.
graphviz: Used for creating graphs and diagrams.
texlive-latex-extra: Provides additional LaTeX packages for document formatting.
lmodern: Modernized Latin Modern font for LaTeX documents.
perl: Required for specific text processing tasks.
pandoc: Used to convert documents between different formats, often used with Rmarkdown.

Next I install the remotes package because it is essential for managing and installing other R packages from remote sources during the container setup.

We then copy the renv.lock, renv/activate.R, and .Rprofile files. The first two are related to the renv package in R. I used renv in RStudio because it records the precise versions of the R packages used in the project and installs them into a project-specific directory, typically the renv subdirectory of the project. As a result, the packages are isolated: if I update my packages globally, I can be certain that these project-specific ones will not be affected. The .Rprofile is a config file which I did not modify.
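For context, here is a minimal sketch of the renv workflow that produces those files in the first place (run from the project root in the RStudio console; the package name is just an example):

# Set up renv for the project; this creates the renv/ folder, renv.lock, and a project .Rprofile
renv::init()

# Install or update packages for the project as usual, e.g.
install.packages("lme4")

# Record the exact package versions the project currently uses into renv.lock
renv::snapshot()

# Later (or inside the container), reinstall exactly those recorded versions
renv::restore()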

In my project folder, I also included a DESCRIPTION file that lists all the R packages required for the project. As a backup, I also added another line that uses the command install2.r. install2.r --error --skipinstalled --ncpus -1 installs R packages from a repository, raising an error on any installation failure, skipping already-installed packages, and using all available CPU cores for faster installation.
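For reference, the DESCRIPTION file is just a standard package-style metadata file whose dependency fields tools like sysreqs can read to work out system requirements; a minimal sketch might look like this (field values and package names are illustrative):

Package: myproject
Title: Dependencies For My Analysis Project
Version: 0.0.1
Description: Declares the R packages this analysis project depends on.
Imports:
    dplyr,
    ggplot2,
    lme4,
    rmarkdown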

# Update package list and check system requirements using sysreqs
RUN sudo apt update \
&& R -e "system(sysreqs::sysreq_commands('DESCRIPTION', 'linux-x86_64-ubuntu-gcc'))" \
&& apt install -y libmagick++-dev

# Install R packages from CRAN with 'install2.r'
RUN install2.r  --error --skipinstalled --ncpus -1 \
rmarkdown yaml dplyr tidyverse DescTools ggplot2 readxl lme4 ModelMetrics merTools lmtest renv ech emmeans haven lmerTest metaforest rstatix ggthemes scales pandoc labelled Matrix \
&& rm -rf /tmp/downloaded_packages

This Dockerfile is far from perfect, and as you can see in the chunk above, it contains redundancies in the R package installations, but I've chosen to share and discuss it here in the spirit of learning and thinking in public. The whole chunk of code above might not even be necessary, as I explain in the section below.

This final chunk of code is important because the rocker images run RStudio Server as the "rstudio" user, the same user you log in as when the server prompts for a username and password. It hands ownership of the project files to "rstudio" and restores the renv library as that user.

RUN chown -R rstudio . \
 && sudo -u rstudio R -e 'source("renv/activate.R");  renv::restore()'

Now what should follow is building the image from the Dockerfile:

docker build . -t username/project_name

🚨
This is where everything starts to fall apart 🫠🫠🫠
33.28 > system(sysreqs::sysreq_commands('DESCRIPTION', 'linux-x86_64-ubuntu-gcc'))
33.28 Error in loadNamespace(x) : there is no package called ‘sysreqs’
33.28 Calls: system ... loadNamespace -> withRestarts -> withOneRestart -> doWithOneRestart
33.28 Execution halted
------
Dockerfile:37
--------------------
  36 |     # Make sure 'sysreqs' is installed in your R environment as suggested earlier
  37 | >>> RUN sudo apt update \
  38 | >>>  && R -e "system(sysreqs::sysreq_commands('DESCRIPTION', 'linux-x86_64-ubuntu-gcc'))" \
  39 | >>>  && apt install -y libmagick++-dev
  40 |
--------------------
ERROR: failed to solve: process "/bin/sh -c sudo apt update  && R -e \"system(sysreqs::sysreq_commands('DESCRIPTION', 'linux-x86_64-ubuntu-gcc'))\"  && apt install -y libmagick++-dev" did not complete successfully: exit code: 1

I get the error above when I attempt the build with this Dockerfile. While I know it has something to do with the sysreqs package never being installed in the image, I have not been successful at resolving the issue after multiple attempts. So I decided to take a different approach…
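For what it's worth, a likely fix (not verified in this build) would be to install sysreqs before calling it, since it is a GitHub-only package and never gets installed by the Dockerfile above, for example by adding a line like this after the remotes installation:

# Possible fix (untested here): sysreqs only lives on GitHub, so install it before using it
RUN R -e "remotes::install_github('r-hub/sysreqs')"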

Attempt 2

I decided to run a detached container named rs_server based on the rocker/verse:4.3.2 image. I also mount my entire project directory into the container.

docker run --name rs_server  -e DISABLE_AUTH=true --rm -d -p 8787:8787   -v ${pwd}:/home/rstudio -v /home/rstudio/data  -v /home/rstudio/lisa_ppt -v /home/rstudio/knitted_rmd rocker/verse:4.3.2

Now if you open your browser and go to localhost:8787, you will see all the folders from the project. But let's break down the command above:

  1. docker run: This is the Docker command used to create and run a new container from a specified Docker image.
  2. --name rs_server: This option specifies a name for the container. In this case, the container will be named "rs_server."
  3. -e DISABLE_AUTH=true: This option sets an environment variable within the container. In this case, it sets an environment variable named "DISABLE_AUTH" to "true." This allows us to bypass the request for a username and password when we open up localhost:8787.
  4. --rm: This option indicates that the container should be automatically removed when it stops running.
  5. -d: This option runs the container in detached mode, which means it runs in the background and doesn't block the terminal. It allows you to continue using the terminal while the container runs.
  6. -p 8787:8787: This option specifies port mapping. It maps port 8787 on the host to port 8787 within the container. We use this to reach services like RStudio Server.
  7. -v ${pwd}:/home/rstudio: This option also mounts a volume, using the variable ${pwd} to represent the current working directory on the host. It mounts the current working directory on the local computer to /home/rstudio within the container. This is a common practice for sharing project files with a container (see the note after this list on how ${pwd} is spelled in different shells).
  8. -v /home/rstudio/[FOLDER NAME] simply specifies that an empty folder with that name is created in the container. Elio's blog post mentions this as well. I had mistakenly thought that by mounting the parent directory and pushing the container to Docker Hub, I would be sharing all the data files and scripts by committing it to an image.
  9. rocker/verse:4.3.2: This is the Docker image from which the container is created. It specifies the image named "rocker/verse" with the tag "4.3.2." Docker images are typically used as templates to create containers.
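A note on ${pwd}: that spelling expands to the current directory in PowerShell; in bash or zsh you would typically write $(pwd) or "$PWD" instead, so the equivalent command would look something like:

docker run --name rs_server -e DISABLE_AUTH=true --rm -d -p 8787:8787 -v "$(pwd)":/home/rstudio rocker/verse:4.3.2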
localhost:8787 will display the files in the RStudio Server.
Video example of how to get everything up and running locally. You can download the files rendered in the image by going to the 'Containers' tab in Docker and navigating to its path.

A somewhat successful attempt

To actually create a Docker image that is shareable:

  1. Make sure the Docker container is stopped if it's running, then commit the container to an image:

     docker stop rs_server
     docker commit rs_server your-image-name:your-tag

     Replace your-image-name and your-tag with the desired name and tag for your image.

  2. Tag the image with the desired registry address (aka your username, if you wish). I included a tag for version-control purposes.

     docker tag your-image-name:your-tag registry-address/your-image-name:your-tag

     Replace registry-address with the address of your chosen container registry.

  3. Log in to the container registry: if you're pushing the image to a container registry, log in using the docker login command and provide your credentials.

  4. Push the image to the registry:

     docker push registry-address/your-image-name:your-tag

     Now the image is available for everyone to use on Docker Hub! 💫

    Note that the data and R script files in the current working directory (${pwd}) are not automatically included when you push an image to a container registry, because those files were not baked in at build time (given that Attempt 1 above failed). For someone to replicate the environment in its entirety, they'd have to get access to the data and the script first, whether that's via a GitHub repo or a collaborator sending the files by email, and then run commands very similar to those from Attempt 2 in the terminal, in this instance with the actual names of the registry, image, and tag, which is now available on Docker Hub:

    docker pull nghuixin/enigma_bd_brainage:1.0.0
    docker run --name rs_server -v ${pwd}:/home/rstudio/project -e DISABLE_AUTH=true --rm -d -p 8787:8787 nghuixin/enigma_bd_brainage:1.0.0
    🚨
    Yet another caveat... This Docker environment from nghuixin/enigma_bd_brainage:1.0.0 isn't exactly the same as the one I wrote my scripts in, because it does not come with the R packages I have installed (e.g., "lmtest", "emmeans", "haven"), which are crucial for reproducing the results! Users can install those packages by running install.packages() in the RStudio browser, but wouldn't it be easier to have all those packages available as soon as you open up the browser? This will be addressed in the next Docker tutorial!
I was not able to attach/mount the library which contains all my installed packages from my computer to the container

Resources

Various blogs and articles that I found useful during my attempts to create a Docker image with RStudio:

Addendum

Edit (February 26, 2024):

To make it clear, the code written above was not a good solution at all for creating a reproducible environment, since:

  • The user of the container would have to obtain the data and R script files from somewhere else anyway, because they were not copied over to the container at build time
  • The user of the container would not have the same versions of the R packages used for the analyses immediately upon opening up the RStudio browser. Again, this is because the project state was not restored (using renv) at build time of the Docker container.
  • As we saw above, building the Docker image had failed, so I resorted to experimenting with creating a detached container rs_server and then committing it to a new image nghuixin/enigma_bd_brainage:1.0.0. Next I pulled that new image and ran these commands to test whether the script and the data were in the container:
docker run --name rs_server -e DISABLE_AUTH=true --rm -d -p 8787:8787 nghuixin/enigma_bd_brainage:1.0.0

But as it turns out, once -v ${pwd}:/home/rstudio/project is omitted, there are no data files or scripts in the container, except for the empty folders we had specified in the run command earlier.
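For comparison, here is a minimal sketch of what baking everything in at build time could look like (assuming the data, scripts, renv.lock, renv/activate.R, and .Rprofile sit next to the Dockerfile; untested here):

FROM rocker/verse:4.3.2
WORKDIR /home/rstudio
# Copy the whole project (data, scripts, renv.lock, renv/activate.R, .Rprofile) into the image
COPY . /home/rstudio
# Hand the files to the rstudio user and restore the exact package versions from renv.lock
RUN chown -R rstudio /home/rstudio \
 && sudo -u rstudio R -e "source('renv/activate.R'); renv::restore()"

With the files and packages baked in like this, pulling the image alone would be enough to reproduce the environment.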
