Creating reproducible analyses… when you’re not the author of the code

DevOps practices for Scientific Computing

People who analyze data (i.e., researchers, statisticians, analysts) often write, review, and read lots of code. Sometimes they need to reproduce the exact same results; sometimes they need to build on top of existing code. Yet, shockingly, they rarely, if ever, get trained on how to do so.

Yet another group of people (i.e., software engineers) who also write, review, and read lots of code have long since developed tools and processes that streamline writing, collaborating on, reviewing, and reproducing projects.

I've learned this the painful way — it is tempting to think that it doesn't matter if you work alone because 'no one else will read my code anyway'. But the truth is that other person is probably ... your future self.

Enter DevOps, a set of practices that originated in the software industry and holds great promise for scientific computing. DevOps, short for Development and Operations, streamlines the process of software development, from initial coding to deployment and maintenance. At its core, DevOps emphasizes collaboration, automation, and continuous improvement, with the goal of delivering high-quality software more efficiently and reliably.

In the realm of scientific computing, adopting DevOps practices means utilizing tools and processes often used by software engineers to improve code development, collaboration, and reproducibility in research contexts. I wish I had adopted these practices sooner.

Over the years, I ran into several issues that could have been easily avoided:

  • I executed R scripts that, upon rerunning, produced results differing by a few decimal places.
  • I attempted to run other people's scripts on my computer, only to be blocked by version incompatibilities.
  • While I could run a third-party ML model on my own computer, I was unable to help my collaborator run it on theirs.

Researchers especially don’t think of themselves as developers — though a subset of researchers do develop packages that serve specific needs (e.g., bioinformatics). But as long as you’re writing code, whatever the goal (e.g., a UI or a bioinformatics pipeline), you’re going to want to improve the way you set up projects and write code.

You have to see yourself as a developer.

It’s always far easier when you create a project from scratch and you’re the sole author. It’s far more difficult to make someone else’s pipeline reproducible, simply because as an end-user you likely don’t know how they built the model or which dependencies it requires.

But there always comes a time when you need to reproduce a pipeline that someone else created.

Case study time.

In this example, I will containerize an ML model I’ve been using for research, so that no matter what OS I’m using I’ll be able to run the model and reproduce its results.

Containerizing an existing Machine Learning model

From the outset, it seems fairly straightforward: download this model for predicting age from a brain scan, install the dependencies, and run it!

git clone https://github.com/MIDIconsortium/BrainAge.git
cd BrainAge
pip install -r requirements.txt
python run_inference.py --project_name [folder name] --csv_file [path/to/file] --skull_strip --sequence t1

Unfortunately, it wasn’t that straightforward: there were missing dependencies that I had to track down through the error messages and install one by one.

This could have been easily avoided had the project been containerized to begin with. To run it locally, and to avoid conflicts with the Python versions needed for other projects, I used pyenv, which I discuss separately in this article.
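For context, a minimal pyenv setup for a project like this looks roughly as follows (the exact patch version, 3.6.15, is my choice here; any interpreter matching the python:3.6 base image used later should behave the same):

pyenv install 3.6.15             # install a Python 3.6 interpreter
pyenv local 3.6.15               # pin it for this project via a .python-version file
python -m venv .venv             # create an isolated virtual environment
source .venv/bin/activate
pip install -r requirements.txt  # install the model's dependencies into it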

Now assume everything runs perfectly on your local machine and you want to containerize it. How do you go about reverse-engineering the Dockerfile?

Start with some educated guesses — the rest is trial and error.

FROM python:3.6

# Don't write .pyc files; flush output straight to the terminal
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

# Set working directory
WORKDIR /app

# Copy the requirements file, then the rest of the project
COPY requirements.txt /app/requirements.txt
COPY . /app/

# Install CMake and clone ANTsPy repository from GitHub
RUN apt-get update && apt-get install -y cmake git
#   && git clone https://github.com/ANTsX/ANTsPy /app/ANTsPy \
#   && python3 /app/ANTsPy/setup.py install

RUN pip install -r requirements.txt

The key part I struggled with was installing the ANTsPy package, even on my local machine. I wasn’t able to pip install it directly from PyPI, so I needed to explore alternatives.

My first attempt was installing ANTsPy at build time via the Dockerfile. But for reasons I still haven’t figured out, when I attempted to run the model it complained that the ‘ants’ module could not be found, and the user of the Docker image would still have to run python3 setup.py install before running the model. Installing ANTsPy also takes a long time, because pip has to build the wheel from source. After ruling out that the problem had anything to do with the name of the module (ants vs. antspyx), I decided to retrieve the package directly from git, just as we did for HD-BET.
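Pulling a package straight from its GitHub repository is a one-liner with pip; note that it still builds the wheel from source, so it takes just as long:

# install ANTsPy directly from GitHub instead of PyPI
pip install git+https://github.com/ANTsX/ANTsPy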

How did I know to include cmake and git in the Dockerfile?

Again, trial and error — while building the Dockerfile, the error messages showed which dependencies were missing before the rest of the file could run.

# Install CMake and clone ANTsPy repository from GitHub
RUN apt-get update && apt-get install -y cmake git
#   && git clone https://github.com/ANTsX/ANTsPy /app/ANTsPy \
#   && python3 /app/ANTsPy/setup.py install
#   (this did not work and required the image user to reinstall it after pulling)
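The loop itself is nothing fancy: build, read the error, add the missing system package to the apt-get line, and rebuild. A sketch (the :dev tag is just a scratch tag of my choosing):

# rebuild after each Dockerfile tweak; cached layers keep iterations fast
docker build -t nghuixin/midi_brain_age:dev .
# if the build fails because a tool is missing, add it to the
# apt-get install line above and build again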

Similarly, I had to extend the requirements.txt file the authors had originally shared in the repo to include the packages required by ANTsPy.

git+https://github.com/MIC-DKFZ/HD-BET
monai==0.4.0
nibabel==3.2.1
matplotlib==3.3.3
numpy==1.19.4
pandas==1.1.5
torch==1.7.1
xlrd==1.2.0
tqdm==4.62.3

# --- required by ANTsPy ----
git+https://github.com/ANTsX/ANTsPy
requests 
sklearn
pyyaml
chart_studio
statsmodels
webcolors
#antspyx==0.3.2
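One caveat if you try to rebuild this today: the sklearn name on PyPI has since been deprecated in favor of scikit-learn, so newer builds may need that line swapped out.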

Now I can build the Docker image and push it to the registry:

docker build -t nghuixin/midi_brain_age:2.0 .
docker push nghuixin/midi_brain_age:2.0
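If the push is rejected, you most likely just need to authenticate against Docker Hub first:

docker login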

… we’re ready to pull it so we can run it:

docker pull nghuixin/midi_brain_age:2.0
docker run -it -v $(pwd)/T1/:/app/T1/ -v $(pwd)/subject_id_file_path.csv:/app/subject_id_file_path.csv nghuixin/midi_brain_age:2.0 /bin/bash
  • The first command pulls the Docker image named nghuixin/midi_brain_age with the tag 2.0 from Docker Hub (under the nghuixin namespace).
  • The second command runs a container from the image nghuixin/midi_brain_age:2.0. Let's break down the options used:
    • -it: Combines the two flags -i (interactive) and -t (pseudo-TTY), allowing you to interact with the container's shell.
    • -v $(pwd)/T1/:/app/T1/: Mounts the local directory $(pwd)/T1/ (where $(pwd) is the current directory) to /app/T1/ inside the container. This is known as a bind mount and lets the container access files or directories on the host system.
    • -v $(pwd)/subject_id_file_path.csv:/app/subject_id_file_path.csv: Similarly, mounts the local file $(pwd)/subject_id_file_path.csv to /app/subject_id_file_path.csv inside the container.
    • nghuixin/midi_brain_age:2.0: Specifies the Docker image to use for creating the container.
    • /bin/bash: The command to run inside the container; here it starts an interactive Bash shell, letting you execute commands and explore the container's filesystem.
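Once inside the container's shell, I can sanity-check that the troublesome dependency finally resolves, and then run the model on the mounted data (the project name below is a placeholder; the other flags come from the repo's usage shown at the start):

# confirm the 'ants' module is importable inside the image
python -c "import ants"

# run inference on the bind-mounted CSV (project name is a placeholder)
python run_inference.py --project_name my_project --csv_file /app/subject_id_file_path.csv --skull_strip --sequence t1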

Resources