Automating Reproducible Workflows
My goal for this project was to learn how to use Snakemake and GitHub Actions, and I relied heavily on this repo and YT video as references. The data come from the Global Historical Climatology Network daily (GHCNd), an integrated database of daily climate summaries from land surface stations across the world. The end product is the result of a series of prompts I entered into Cursor, so I can’t say I made informed architectural choices beyond knowing I wanted to combine Snakemake, GitHub Actions, and Shiny. My goals were two-fold:
- Use Snakemake and GitHub Actions to run a reproducible pipeline/workflow automatically every day and display the visualization/map using PyShiny
- Test how efficiently I can use AI coding tools to create what I envision with the minimal knowledge I have of the individual components involved
What Snakemake + GitHub Actions achieve
Snakemake is responsible for defining the logic of the data pipeline: what files need to be created, what steps produce them, and how those steps depend on each other. Instead of manually running scripts in the right order, you declare relationships like “to build a webpage index.html, I need this plot; to build the plot, I need this processed dataset,” and Snakemake figures out the correct execution order automatically. It also avoids unnecessary work: if an input hasn’t changed, it won’t recompute downstream steps.
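The declared relationships above can be sketched as a minimal Snakefile. This is a hypothetical example (the file names and scripts are made up for illustration, not taken from the repo): Snakemake starts from the final target in `rule all` and works backwards through the `input`/`output` declarations to decide what to run.

```python
# Hypothetical Snakefile sketch: Snakemake infers execution order
# from input/output declarations, not from the order rules appear.

rule all:
    input:
        "site/index.html"   # the final artifact we want

rule render_page:
    input:
        "plots/prcp_map.png"
    output:
        "site/index.html"
    shell:
        "python code/render_page.py {input} {output}"

rule make_plot:
    input:
        "data/processed/prcp.csv"
    output:
        "plots/prcp_map.png"
    shell:
        "python code/make_plot.py {input} {output}"
```

Running `snakemake site/index.html` would build the processed dataset's downstream steps only if their inputs are newer than their outputs, which is what makes re-runs cheap.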
GitHub Actions, on the other hand, is responsible for automation and execution. It answers questions like: When should this pipeline run? On which machine? With what environment? And what should we do with the results afterward? In practice, GitHub Actions provisions a clean virtual machine, installs dependencies, and then calls Snakemake to execute the pipeline. After Snakemake finishes, GitHub Actions can take additional steps such as committing outputs back to the repository (as in the original example) or deploying a website (via GitHub Pages in my example).
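A minimal workflow file for this division of labor might look like the sketch below. This is an illustrative fragment, not the repo's actual workflow; the cron time and Python version are arbitrary assumptions.

```yaml
# Hypothetical .github/workflows/build.yml: run the pipeline daily
# and allow manual runs from the Actions tab.
name: build-data
on:
  schedule:
    - cron: "0 3 * * *"   # daily at 03:00 UTC (arbitrary choice)
  workflow_dispatch:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install snakemake
      - run: snakemake --cores 1 app_data   # GitHub Actions only invokes Snakemake
```

Note how the workflow knows nothing about the pipeline's internals; the single `snakemake` call is the entire hand-off.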
Snakemake defines the pipeline; GitHub Actions runs it on a schedule or trigger and handles the outputs. Snakemake builds a dependency graph (a DAG) and executes all required intermediate steps in the correct order: fetching data, transforming it, and generating final artifacts. GitHub Actions does not need to know how those steps work; it only invokes Snakemake and then decides what to do with the results.

Interactive UI with Map of Precipitation in Japan 🇯🇵
In my project, I used Snakemake for the data pipeline, Shiny for Python (shiny + shinywidgets) for the app, and Plotly (plotly.express) with Mapbox tiles for the interactive map. Snakemake rules fetch csv.gz files, build manifests and metadata, precompute a monthly precipitation index (data/monthly/japan_monthly_prcp.csv), create a summary file with the latest daily precipitation (data/latest/japan_latest_prcp.csv), and then copy small, app-ready CSV files into code/app_data/ via build_app_bundle.py.
code/app.py loads everything from code/app_data/, computes z‑scores for the selected month and year for a given location, and renders the map and sidebar UI.
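The z-score step is the statistical core of the app: it asks how unusual a given month's precipitation was at a location relative to that month's history. Here is a self-contained sketch of the idea with toy data; the column names, station ID, and function are my own illustration, not the actual code in code/app.py.

```python
import pandas as pd

# Hypothetical layout: one row per station/year/month with total precipitation.
def monthly_zscore(df: pd.DataFrame, station: str, month: int, year: int) -> float:
    """Z-score of one month's precipitation against that calendar month's
    historical distribution across all years for the same station."""
    hist = df[(df["station"] == station) & (df["month"] == month)]
    mean = hist["prcp"].mean()
    std = hist["prcp"].std(ddof=0)  # population std over the historical record
    value = hist.loc[hist["year"] == year, "prcp"].iloc[0]
    return (value - mean) / std

# Toy data: three years of June totals (mm) for one made-up station ID.
toy = pd.DataFrame({
    "station": ["JP_TOY_001"] * 3,
    "year": [2021, 2022, 2023],
    "month": [6, 6, 6],
    "prcp": [100.0, 200.0, 300.0],
})
print(round(monthly_zscore(toy, "JP_TOY_001", 6, 2023), 3))  # → 1.225
```

In the real app this value would drive the color scale of each station marker on the Plotly map, so wet/dry anomalies are comparable across stations with very different baselines.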
The GitHub Actions CI runs snakemake app_data, sanity-checks the app_data outputs, and the deploy workflow runs shinylive export code build and publishes the static Shinylive bundle to GitHub Pages. Shinylive turns the app into a static site so GitHub Pages can host it with no server to maintain, unlike Shiny Server, which needs a running VM/container.
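The deploy side can be sketched as a few workflow steps. Again this is an illustrative fragment, not the repo's actual workflow; the step list and paths are assumptions based on the `shinylive export code build` command above.

```yaml
# Hypothetical deploy steps: export the app directory ("code") to a
# static bundle ("build"), then publish it with the GitHub Pages actions.
      - run: pip install shinylive
      - run: shinylive export code build
      - uses: actions/upload-pages-artifact@v3
        with:
          path: build
      - uses: actions/deploy-pages@v4
```

Because the exported bundle is pure static files (WebAssembly Python plus the app source), the Pages deployment needs no runtime beyond the browser.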
The whole pipeline (data build + app export) is fully defined in the repo and run by GitHub Actions; the deployed site is just the static output of snakemake app_data + shinylive export, which is ideal for reproducibility. However, the outputs are not directly visible in the repo unless explicitly saved (which is not currently the case), so debugging (e.g., verifying that the CSV files generated daily are truly the most up-to-date) requires inspecting build logs rather than files. On the other hand, the clean separation between code and generated data outputs keeps the repository lightweight.
Essentially, Snakemake produces intermediate data products, like the precomputed monthly average precipitation, then the Shinylive build step turns the app into a static site, and GitHub Actions deploys the built site to GitHub Pages.
Difference between original static site and Shinylive webapp
The original inspiration for this project had slightly different goals, since it produces a non-interactive map in PNG format. These outputs are small and self-contained: since the map is displayed on index.html, the repo is the website, and no separate deployment workflow is needed as in my case. GitHub Actions commits the map image and the rendered HTML back into the repository, and GitHub Pages serves them directly from the repo. Because the outputs are simple, the downsides of committing them are few; we don’t have to worry about artifacts polluting the git history. However, this configuration requires more care with push permissions and workflow triggers, and it can fail on days with no new data, because git commit exits with a non-zero status when there is nothing to commit (though there are ways to guard against this or force a commit anyhow). Ultimately, in the Shinylive app case, we have a clean separation between source code and the deployable output.
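The no-new-data failure mode has a common guard: check whether the staged outputs actually changed before committing. A hypothetical CI step (the file paths are made up for illustration) might look like:

```shell
# Hypothetical commit-back step: only commit and push when the
# generated outputs differ from what is already in the repo.
git add report/index.html report/map.png
if ! git diff --cached --quiet; then
  git commit -m "Automated data update"
  git push
fi
```

`git diff --cached --quiet` exits 0 when the staging area matches HEAD, so the commit and push are skipped cleanly on a no-op day instead of failing the workflow.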
