Part III: Reproducible analyses in R with Docker

How to save output after running new analyses in your docker container

Recently I gave a brief presentation introducing Docker to fellow scientists, covering how to create a reproducible environment for completed analyses and how to facilitate sharing between lab members. Among the questions that came up were concerns over data privacy and security (which I have addressed here), and issues with persisting outputs and new analyses added to the original script that came with the image. For instance, say I’ve shared an image (pull it here) on Docker Hub, and my colleague pulls it and runs the container to reproduce my results. They then decide to add some new analyses to the script (see code comments), and now they would like to save the newly added code and its output.

Snippet of R script that you can find by pulling the docker image I created:

# Load necessary libraries
library(readr)   # For reading CSV files
library(plyr)    # For data manipulation
library(tidyverse)  # For data wrangling and visualization
library(lme4)    # For linear mixed-effects models
library(car)     # For diagnostic plots
library(ggplot2) # For visualization
library(nlme)    # For fitting mixed-effects models

# Read the CSV file
data <- read_csv("data/infl_231010.csv")

# A bunch of data cleaning steps:
# ....
# ....

complete_data <- data[....]

# --- Analysis ---
# Fit the mixed-effects model using nlme::lme()
mod2 <- lme(lgvegf ~ time * (agem * dxgroup + gender) + dxgroup * (gender) ,
            random = ~ time | subnum, 
            data = complete_data)

# Print model summary
summary(mod2)

# --- Plots ---
# Create a ggplot to plot the data and fitted values
ggplot(complete_data, aes(x = time, y = lgvegf, color = dxgroup)) +
  geom_point(size = 0.9) +  # Add points for the observed data
  geom_smooth(method = "lm", se = FALSE) +  # Add regression line without confidence interval
  labs(x = "Time", y = "lgvegf", color = "dxgroup") +  # Labels
  theme_minimal()  # Theme

#### ----  NEW analyses and plots that were NOT already part of the original image and container  -------
# Fit a reduced mixed-effects model (mod2 without the dxgroup-by-gender interaction) using nlme::lme()
mod3 <- lme(lgvegf ~ time * (agem * dxgroup + gender) ,
            random = ~ time | subnum, 
            data = complete_data)

# Print model summary
summary(mod3)

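# Create a ggplot of the fitted values and save it as a PNG
# (fitted_values is assumed to already be a column of complete_data)
png('figures/plot3.png')  # png() so the file format matches the .png extension
p3 <- ggplot(complete_data, aes(x = time, y = fitted_values, color = gender, linetype = dxgroup)) +
  geom_point(size = 0.9) +  # Add points for the fitted values
  geom_smooth(method = "lm", se = TRUE) +  # Add regression line with confidence band
  labs(x = "Time", y = "Predicted lgvegf", color = "gender", linetype = "dxgroup") +  # Labels
  theme_minimal()  # Theme
print(p3)  # Print explicitly so the plot is drawn even when the script is source()d
dev.off()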

print('analyses completed')

Option 1: Create a new Docker image

The most common approach involves committing changes made within a container to a new image. This can be done using the docker commit command, which takes the ID (or name) of the container, as listed by docker container ls, and creates a new image that includes the changes. For example:

docker commit <container_id> <new_image_name>:<tag>
docker commit 5cb1dedcc204 soohyun/infl_marker_analysis:1.0.0

Upon running a container from the new image with docker run soohyun/infl_marker_analysis:1.0.0, the new code will be visible in the R script, the libraries and their versions will be unchanged, and the script will produce the expected outputs.
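To make the commit step concrete, here is a minimal sketch of the whole round trip: commit, push, and run. The container ID and image name are simply the example values from this post, the push step assumes you are logged in to Docker Hub, and the `run` helper and `DRY_RUN` switch are my own additions so that the script only prints the commands until you set `DRY_RUN=0`.

```shell
#!/bin/sh
# Sketch only: CONTAINER and NEW_IMAGE reuse this post's example values;
# replace them with your own container ID and Docker Hub repository.
CONTAINER="${CONTAINER:-5cb1dedcc204}"
NEW_IMAGE="${NEW_IMAGE:-soohyun/infl_marker_analysis:1.0.0}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# 1. Snapshot the container's filesystem (edited script included) as a new image.
run docker commit "$CONTAINER" "$NEW_IMAGE"

# 2. Optionally publish it so collaborators can pull it (requires docker login).
run docker push "$NEW_IMAGE"

# 3. Start a container from the new image; 8787 is the port shown by docker container ls.
run docker run --rm -p 8787:8787 "$NEW_IMAGE"
```

Keep in mind that docker commit does not include data stored in volumes mounted into the container, so anything you want to share this way has to live in the container's own filesystem.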

Option 2: Save the modified script and/or analysis results outside of Docker container

Save the modified script to your local machine

You can do so by running the following commands:

docker container ls
CONTAINER ID   IMAGE                                 COMMAND   CREATED             STATUS             PORTS                    NAMES
5cb1dedcc204   nghuixin/infl_marker_analysis:1.0.0   "/init"   About an hour ago   Up About an hour   0.0.0.0:8787->8787/tcp   eager_poincare

docker cp 5cb1dedcc204:/home/rstudio/analysis/infl_marker_huixin.R ./container_r.R

5cb1dedcc204 is the container ID, which can be obtained by running docker container ls, and ./container_r.R is the copy of the R script with the added lines of code. It is now saved in the directory on the host machine from which you ran the command (here, the root directory of the project).

🚨 However, this is not recommended: once the file is saved outside of the Docker container, there is no guarantee that the results of the analyses will be replicable, because the versions of R and the associated libraries on the local machine might not be the same as those in the container.
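If you do copy the script out anyway, one way to soften this risk is to save a record of the package versions from the container next to it. The following is a minimal sketch, assuming the container is still running and that Rscript is on its PATH; the file name package_versions.csv, the /tmp location, and the `run` helper with its `DRY_RUN` switch are illustrative choices of mine, not part of the original image.

```shell
#!/bin/sh
# Sketch only: CONTAINER reuses this post's example ID; the file names are illustrative.
CONTAINER="${CONTAINER:-5cb1dedcc204}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# R code that writes every installed package and its version to a CSV file.
# The version of the "base" package is the version of R itself.
R_CODE='write.csv(installed.packages()[, c("Package", "Version")], "/tmp/package_versions.csv", row.names = FALSE)'

# 1. Write the version list inside the running container.
run docker exec "$CONTAINER" Rscript -e "$R_CODE"

# 2. Copy the version list to the host, next to the copied script.
run docker cp "$CONTAINER:/tmp/package_versions.csv" ./package_versions.csv
```

This does not make the local run reproducible by itself, but it does let you compare your local setup against the container before trusting the results.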

Save the analyses results to your local machine (text output)

If, for some reason, you wish to save only the text output, such as summary(mod3) above, you can redirect it to a file with sink():

#### ----  New analyses that were NOT already part of the container  -------
sink('new_analyses_output.txt')
# Fit the mixed-effects model using nlme::lme()
mod3 <- lme(lgvegf ~ time * (agem * dxgroup + gender) , random = ~ time | subnum, data = complete_data)

# Print model summary (explicit print() so it is also captured when the script is source()d)
print(summary(mod3))
print('analyses completed')
sink()

Next, you can copy the txt file output from the container to your local machine.

docker cp <container_id>:/path/to/container/file /path/to/local/destination
docker cp 5cb1dedcc204:/home/rstudio/new_analyses_output.txt ./new_analyses_output.txt

new_analyses_output.txt can be seen in the home/rstudio folder.

Save the analyses results to your local machine (image output)

You can do the same for the plots you create. If there isn’t already a figures directory in the Docker container, you can create it manually in RStudio, just as you would on your local machine, or run dir.create('figures'):

Create a folder in your R project folder in the Docker container.

Then run the code for creating a new plot:

# Create a ggplot of the fitted values and save it as a PNG
# (fitted_values is assumed to already be a column of complete_data)
png('figures/plot3.png')  # png() so the file format matches the .png extension
p3 <- ggplot(complete_data, aes(x = time, y = fitted_values, color = gender, linetype = dxgroup)) +
  geom_point(size = 0.9) +  # Add points for the fitted values
  geom_smooth(method = "lm", se = TRUE) +  # Add regression line with confidence band
  labs(x = "Time", y = "Predicted lgvegf", color = "gender", linetype = "dxgroup") +  # Labels
  theme_minimal()  # Theme
print(p3)  # Print explicitly so the plot is drawn even when the script is source()d
dev.off()

plot3.png is now in the figures folder
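To get the plots onto your local machine, docker cp works on folders as well as single files. Below is a minimal sketch, assuming the figures folder sits in /home/rstudio (the same place the sink() output landed above; adjust SRC_DIR if your R project lives elsewhere). The `run` helper and `DRY_RUN` switch are my own additions so that the commands are only printed until you set `DRY_RUN=0`.

```shell
#!/bin/sh
# Sketch only: CONTAINER reuses this post's example ID; SRC_DIR is an assumed location.
CONTAINER="${CONTAINER:-5cb1dedcc204}"
SRC_DIR="${SRC_DIR:-/home/rstudio/figures}"
DEST_DIR="${DEST_DIR:-./figures}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Make sure the destination folder exists on the host.
run mkdir -p "$DEST_DIR"

# The trailing "/." copies the contents of SRC_DIR into DEST_DIR
# rather than nesting a second figures folder inside it.
run docker cp "$CONTAINER:$SRC_DIR/." "$DEST_DIR"
```

To copy just one file, point docker cp at it directly, for example `docker cp 5cb1dedcc204:/home/rstudio/figures/plot3.png ./plot3.png` under the same path assumption.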
🚨 Creating a folder manually is generally discouraged for several reasons, such as the risk of data loss or corruption during the export process, the challenge of managing folders and data without proper version control, and the potential for increased consumption of system resources. However, I decided to show the process here using point-and-click illustrations of navigating the RStudio Server because, as scientists, our focus is typically on applying the correct analysis technique to the subject of study rather than on mastering a tool like Docker. These illustrations serve as an intuitive guide for navigating the Docker and RStudio Server environment.