Part II: Reproducible analyses in R with Docker

📌
This article is a follow-up to Part I, where I illustrated my attempts at making my analyses in R reproducible. I ran into two issues in my previous attempt, which I have now resolved and will discuss below:
  • Even after installing the required libraries for the analyses via RUN R -e "install.packages(...)" in the Dockerfile, they did not seem to appear in the container
  • I wasn’t able to mount the installed libraries on my local computer/host machine into the container library; every time I ran .libPaths() it only showed the libraries that came with the rocker/verse image (see the check sketched below)
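
A quick way to check this, without opening RStudio at all, is to print the library paths and installed packages from inside the container (a sketch, using the image tag that appears later in this post):

# List the library paths and the packages visible inside the container
docker run --rm nghuixin/infl_marker_analysis:1.0.0 R -e '.libPaths(); rownames(installed.packages())'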

Learning Docker

When I set out to share R analyses in a reproducible manner with my co-worker, I did not have the goal of learning Docker containerization or how volume mounting works specifically; my goal was simply to solve the problem at hand. But that is the nice part of project-driven learning: I inevitably learn how to do X in order to serve my main goal of advancing science and doing research!

Copy R script(s) to the container via specification in Dockerfile

# Base R image
FROM rocker/verse:4.3.2

# Install R dependencies
RUN R -e "install.packages(c('readr', 'plyr', 'tidyverse', 'lme4', 'car', 'nlme', 'ggplot2'))" 
# nlme already part of verse though

# Copy our R script to the container
COPY /analysis/infl_marker_huixin.R /home/rstudio/analysis/infl_marker_huixin.R

# Set the working directory
WORKDIR /home/rstudio
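
For completeness, this is roughly how the image gets built and inspected from the directory containing the Dockerfile (a sketch; the tag matches the image used in the docker run commands below):

# Build the image from the directory containing the Dockerfile
docker build -t nghuixin/infl_marker_analysis:1.0.0 .

# Confirm the script was copied into the image
docker run --rm nghuixin/infl_marker_analysis:1.0.0 ls /home/rstudio/analysis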

Understanding Docker Volume Mounting

Docker volume mounting is a feature that allows you to link a directory on your host machine with one inside a Docker container. However, it can also lead to unexpected behavior if not used correctly. In this article, I’ll clarify two common misconceptions about Docker volume mounting.

Issue: Overriding Files in the Container

When you mount a host directory onto a directory inside a Docker container, the mount hides whatever the container already had at that path: for as long as the mount is active, you only see the host’s files there. This can be problematic if you mount onto a directory that already contains files you need, because they effectively disappear from view. This is exactly the mistake I made below: all the folders and files from the directory ${pwd} appeared when I opened RStudio, which means the mount succeeded, but /analysis/infl_marker_huixin.R was nowhere to be found:

docker run -it -e DISABLE_AUTH=true -p 8787:8787 -v ${pwd}:/home/rstudio nghuixin/infl_marker_analysis:1.0.0

Essentially, whatever is in the current directory on my host machine hides the existing contents of the /home/rstudio directory in the container, including the analysis folder that was copied in at build time.
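
One way to see this shadowing directly (a sketch, reusing the image tag from the commands in this post) is to list /home/rstudio with and without the bind mount:

# Without a mount, the analysis folder copied at build time is visible
docker run --rm nghuixin/infl_marker_analysis:1.0.0 ls /home/rstudio

# With ${pwd} mounted over /home/rstudio, only the host's files are listed
docker run --rm -v ${pwd}:/home/rstudio nghuixin/infl_marker_analysis:1.0.0 ls /home/rstudio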

Solution

To avoid hiding the existing files in the container, we can mount the host directory to a different location within the container. For example, instead of mounting to /home/rstudio, mount it to a subdirectory like /home/rstudio/data.

Here's how you can do it:

docker run -it -e DISABLE_AUTH=true -p 8787:8787 -v ${pwd}/data:/home/rstudio/data nghuixin/infl_marker_analysis:1.0.0
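
With the mount moved into a subdirectory, the script copied at build time and the data from the host sit side by side (again a sketch; it assumes a data folder exists in the current directory on the host):

# Both the baked-in analysis folder and the mounted data folder are now visible
docker run --rm -v ${pwd}/data:/home/rstudio/data nghuixin/infl_marker_analysis:1.0.0 ls /home/rstudio /home/rstudio/data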

🚨
But wait, does mounting my data folder to the container automatically share my data or somehow make it publicly available?!

Issue: Misunderstanding About Data Sharing

I had the misconception that mounting a local directory into a Docker container automatically pushes its contents to Docker Hub or other cloud repositories.

Clarification

It's important to understand that mounting a local directory only links a directory on your host machine with one inside the container. It does not trigger automatic uploads to the cloud or facilitate sharing.

To share the data files, you would have to do something explicit, such as copying them into the image before pushing it to a registry, or syncing them with a cloud storage service like AWS S3 or Google Cloud Storage. Volume mounting alone doesn't achieve this.
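
For instance, sharing the contents of the data folder would take a deliberate step like the following (a sketch; the bucket name is made up):

# Explicitly sync the local data folder to a cloud bucket the team can access
aws s3 sync ./data s3://my-lab-bucket/infl_marker_data/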

I was able to double-check by going to the Containers tab to confirm that the data folder was empty:

The data folder is empty when we check its contents in the container.
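
The same thing can be confirmed from the command line: a throwaway container started from the image with no -v flag, which is what a co-worker pulling the image gets, contains none of my data (a sketch, assuming the image tag used above):

# A fresh container holds only what was baked in at build time;
# the mounted data never leaves my machine
docker run --rm nghuixin/infl_marker_analysis:1.0.0 ls /home/rstudio
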
🚨
One of the points that came up during discussion with lab members was: now that we can pull the image you have created, how can I save the results of my analyses that are in the browser?

That will probably be saved for Part III!