Renee Hui Xin Ng

Mental Models (with 🤖) for Making Reproducible Interactive Data Prototypes with Snakemake, Docker, CI, and PyShiny

📒

Ideally I would have done these projects in a topic or area of my interest (compbio). However, since my first project relied heavily on the guidance of this repo (which also comes with a YT video), which used data from the Global Historical Climatology Network-Daily Database, I decided to stick with this dataset for the other two projects as well. If this project were in the domain of bioinformatics, it could be something like a web app for exploring multimodal data that enables filtering by a clinical phenotype's pathology stage, diagnosis, cell type, or QC metrics - essentially an interface where scientists can quickly browse what data exists before downloading it.

In that case, Snakemake would orchestrate the reproducible data flow from raw data into QC summaries and precomputed metrics; Docker would freeze the environment so everyone runs the same pipeline and app dependencies across laptops, servers, and CI; GitHub Actions would automatically build and test the Docker image, run the Snakemake workflow on sample data or staged files, and publish validated artifacts or deploy the app when changes are pushed; and finally, PyShiny would serve as the interactive layer hosted on the web. In other words: Snakemake handles reproducible transformation, Docker handles environment consistency, GitHub Actions handles automation, verification, and deployment, and PyShiny facilitates exploration by a scientist who, say, wants to compare modalities for the same donor or tissue sample.

Goals

  • Learn by implementing a set of tools (GitHub Actions, Snakemake, PyShiny, and cloud computing) that I was introduced to during Neurohackademy in 2022
  • Develop a clearer mental model of how and why these tools are relevant for researchers by actually implementing them
  • Explore how quickly I can build an MVP using tools like Cursor, and how much I can rely on it to understand the decision criteria that need to be considered for software design

In the first project, where I built a drought index for Japan displayed on a map, I used Snakemake, GitHub Actions, and Shinylive to build a webpage hosted on GitHub Pages. In the second project, I deployed a barebones PyShiny app (an interactive table powered by Plotly) on Posit Cloud to learn what it's like to host and manage an app in that environment. In the final project, I containerized a PyShiny app with Docker, stored it in Google Artifact Registry, and deployed it via Cloud Run. This PyShiny app has an interactive scatterplot that can be filtered by year.

Docker + Google Cloud Platform

A couple of notes on the tech stack used here:

Docker packages both the code and its environment into an image, which can be instantiated as a container and run anywhere. If anyone wants to rerun the code and environment with a dataset, they can pull my Docker image, spin up their own container to run the code, and then modify the image and save it as another version. The data files are excluded from the Docker image entirely. A WEATHER_DATA_PATH environment variable in app.py tells the app where to find the data at runtime: a local path during development, a gs:// path in production.
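A minimal sketch of how this could look in app.py. The WEATHER_DATA_PATH variable name is the one described above; the local fallback path is an assumption for illustration:

```python
import os

# Hypothetical sketch: resolve the data location at runtime.
# The local fallback path is an assumption, not necessarily
# the one used in the actual project.
DEFAULT_LOCAL_PATH = "data/processed/japan_weekly_weather.csv"

def resolve_data_path() -> str:
    """Return the CSV location: a local file in development,
    a gs:// URI when the env var is set (e.g. on Cloud Run)."""
    return os.environ.get("WEATHER_DATA_PATH", DEFAULT_LOCAL_PATH)
```

With this pattern, the deploy command's `--set-env-vars` flag is the only thing that changes between environments; the application code stays identical.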

Google Cloud Storage (GCS) stores the data files independently of the code and can be made accessible to any authorized service. GCS handles data ingest. The CSV files are relatively small in this case, so reading from GCS does not affect performance; one could consider caching or using a database for larger datasets in the future.
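For loading, pandas can read local paths and gs:// URIs with the same call (GCS support additionally requires the gcsfs package); a hedged sketch, with hypothetical helper names:

```python
import pandas as pd  # reading gs:// URIs additionally requires gcsfs

def is_gcs_path(path: str) -> bool:
    """True when the path points at a Google Cloud Storage object."""
    return path.startswith("gs://")

def load_weather(path: str) -> pd.DataFrame:
    # pd.read_csv dispatches to gcsfs for gs:// URIs and to the
    # local filesystem otherwise, so one call covers both cases.
    return pd.read_csv(path)
```

This is why the app needs no branching logic between development and production: only the path supplied through the environment variable differs.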

Artifact Registry stores the Docker image. One could think of it as GitHub but for container images (without the collaborative workflows afforded by git, however). The image is pushed once after building it locally, and Cloud Run pulls it on every new deployment. Other users with access to this registry can pull the image, modify it, and save a new version of it.

Google Cloud Run takes the Docker image and serves it at a live URL; it incurs no compute cost while idle. Whether the URL is publicly accessible depends on the IAM settings in GCP.

In a hypothetical scenario where I would like to update the PyShiny app with new data every month, these are the commands I would run:

gcloud storage cp new_data.csv gs://[bucket-name]/data/processed/japan_weekly_weather.csv

gcloud run deploy scatter-app --region us-central1 ...

Schematic representation of the description above:

Local machine

├── docker build → creates the image (code only, no data)

├── docker push → sends image to Artifact Registry (GCP's image storage)

└── gcloud storage cp → uploads CSV to GCS bucket

Cloud Run does the following when a user visits the URL; otherwise it stays idle:

└── pulls image from Artifact Registry

└── app starts, reads CSV from GCS using env vars

└── serves the dashboard

More human control via plan mode in Cursor

Previously, in the second project (the one that displays an interactive data table), I had just used Agent or Ask mode in Cursor with a spec.md. That still resulted in a lot of back-and-forth changes to the spec.md, which was nonetheless helpful for keeping track of the changes made to the codebase, compared to scrolling through the chat.

Using Plan mode with Cursor rules allowed me to review the changes stepwise, and if new rules emerged that were generalizable to the entire codebase, I could add them to the Cursor rules. In this case, I ran into a persistent issue that I had also encountered during the second project. I had initially intended to include a scatterplot in the second project (which now only displays a data table), but given how much code was generated at once, and the iterations that involved changing many files, it was difficult for me to debug or identify the source of the issue, especially without prior knowledge or experience coding PyShiny from scratch.

Plan mode helped me identify the problem quickly: when things were implemented stepwise, there were fewer files to review. This incremental addition of new files let me narrow down the cause of the missing dots on the scatterplot more quickly. Eventually, the agent identified that the problem was incompatible dependencies between the plotly and shinywidgets packages.

As a result, I decided to include the following rules, reproduced at the end of this post, in the cursorrules file; they are generalizable beyond specific packages. However, I did find it difficult to decide when the language becomes too vague (i.e., how would the agent interpret what is “too small”?).

In any case, I significantly prefer this style over spec-driven development, partly because specs also seem subject to too much change given background knowledge (or the lack thereof). Letting the agent have more agency, as opposed to defining more specific requirements upfront, works better for me after all. It also helps me learn stepwise what gets added to the architecture progressively, and gives me more control and supervision over what the agent is doing at each step.

Notes on GitHub Actions and Snakemake

GitHub Actions is not entirely relevant here, given that the code is assumed to never change, nor do I need regular rebuilding of the Docker image. (Although it is possible to use GitHub Actions for regularly scheduled runs that check for OS vulnerabilities or vulnerabilities in system packages installed via apt-get in the base image.)
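A minimal sketch of such a scheduled workflow, assuming Trivy for the CVE scan; the workflow name, schedule, and image tag are all placeholders:

```yaml
# .github/workflows/scan.yml (hypothetical monthly rebuild-and-scan)
name: monthly-image-scan
on:
  schedule:
    - cron: "0 6 1 * *"  # 06:00 UTC on the first of each month
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Rebuild image to pick up patched base layers
        run: docker build -t scatter-app:scan .
      - name: Scan for known CVEs
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: scatter-app:scan
```

Rebuilding before scanning matters: scanning only the old pushed image would miss fixes already available in newer base layers.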

In the case that the codebase is updated frequently to add new features to the PyShiny app, setting up GitHub Actions to rebuild the Docker image and redeploy on every update would make even more sense.
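In that case, a build-and-redeploy workflow might look roughly like the following; the secret name, project path, and action versions are assumptions, not values from this project:

```yaml
# .github/workflows/deploy.yml (hypothetical redeploy on every push to main)
name: deploy-on-push
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - uses: google-github-actions/setup-gcloud@v2
      - name: Build, push, and redeploy
        run: |
          gcloud auth configure-docker us-central1-docker.pkg.dev --quiet
          IMAGE=us-central1-docker.pkg.dev/MY_PROJECT/scatter-app/scatter-app:latest
          docker build -t "$IMAGE" .
          docker push "$IMAGE"
          gcloud run deploy scatter-app --image "$IMAGE" --region us-central1
```

This automates exactly the manual build/push/deploy commands shown elsewhere in this post, triggered by a push instead of typed by hand.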

Snakemake is also irrelevant here, since it is a pipeline orchestrator: if this project involved pulling data from the source, aggregating it, and writing a CSV file, then Snakemake would make sense. For the sake of practice, I've decided to assume in this scenario that the data in GCS is updated monthly and manually, although I could extend the Snakemake pipeline from the first project, where the data is generated via a standardized pipeline.

In the following scenario, adding Snakemake and GitHub Actions would make sense:

💡

data ingest from source → CSV output → rebuild app and image → deploy app
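The ingest-to-CSV step of that pipeline could be expressed as a Snakemake rule; a sketch, assuming a hypothetical aggregation script and file layout:

```python
# Snakefile (hypothetical): raw station data -> weekly CSV for the app
rule all:
    input:
        "data/processed/japan_weekly_weather.csv"

rule aggregate_weekly:
    input:
        "data/raw/ghcnd_daily.csv"
    output:
        "data/processed/japan_weekly_weather.csv"
    shell:
        "python scripts/aggregate_weekly.py {input} {output}"
```

A GitHub Actions job could then run `snakemake --cores 1`, upload the output CSV to GCS, and trigger the rebuild-and-deploy steps.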

# Enable APIs
gcloud services enable run.googleapis.com artifactregistry.googleapis.com

# Create an Artifact Registry repository named "scatter-app"
gcloud artifacts repositories create scatter-app --repository-format=docker --location=us-central1

# Authenticate Docker with GCP
gcloud auth configure-docker us-central1-docker.pkg.dev

# Build and push image (run from local project folder)
cd "c:\Users\...\tool-dev\reproducible-scatter-plot"
docker build -t us-central1-docker.pkg.dev/project-[alpha-numeric-seq]/scatter-app/scatter-app:latest .
docker push us-central1-docker.pkg.dev/project-[alpha-numeric-seq]/scatter-app/scatter-app:latest

# Get project number for IAM, which is different from Project ID
gcloud projects describe project-[alpha-numeric-seq] --format="value(projectNumber)"

# Grant Cloud Run access to GCS (replace 123456789 with the project number from the previous command)
gcloud storage buckets add-iam-policy-binding gs://average-min-max-temperature-scatterplot-app `
  --member="serviceAccount:123456789-compute@developer.gserviceaccount.com" `
  --role="roles/storage.objectViewer"

# Deploy to Cloud Run with 2 GB of memory
gcloud run deploy scatter-app `
  --image us-central1-docker.pkg.dev/project-[alpha-numeric-seq]/scatter-app/scatter-app:latest `
  --region us-central1 `
  --allow-unauthenticated `
  --memory 2Gi `
  --set-env-vars "WEATHER_DATA_PATH=gs://average-min-max-temperature-scatterplot-app/data/processed/japan_weekly_weather.csv" `
  --set-env-vars "STATIONS_DATA_PATH=gs://average-min-max-temperature-scatterplot-app/data/processed/japan_weekly_weather_stations.csv"

The rules I added to the cursorrules file:

- When a library wraps or serializes another (e.g. UI framework + widget bridge + rendering backend), treat the combination as one versioned stack. Changing one package without checking the others is a first-class risk.
- Prefer explicit pins for packages that sit on compatibility boundaries (major upgrades, optional extras, or packages that compile against a specific ABI/API). Open-ended lower bounds alone are not enough unless CI or a lockfile proves compatibility.
- Prefer the documented integration path (official decorator/output pair, adapter, or plugin) over ad hoc mixing of APIs that “look similar.”
- When debugging “code looks correct but nothing shows / wrong state,” check dependency and API mismatch before refactoring application logic. However, do not introduce extra layers for small tasks unless complexity justifies it.
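A pinning style consistent with these rules might look like the following requirements fragment; the version numbers are placeholders, not the ones this project actually uses:

```text
# requirements.txt: pin the UI stack (framework + bridge + backend) as one unit
shiny==0.10.2        # UI framework
shinywidgets==0.3.2  # widget bridge
plotly==5.22.0       # rendering backend
pandas>=2.0,<3.0     # bounded range, not an open-ended lower bound
```

Pinning the three UI packages together makes the "one versioned stack" rule concrete: upgrading any one of them forces a deliberate look at the other two.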