mushin¶
Boilerplate-free, reproducible ML experiment workflows built on PyTorch Lightning and hydra-zen.
mushin is the evaluate-and-report layer sitting on top of Lightning and
hydra-zen. Define your experiment as a function, sweep over parameters with
Hydra, and get results back as a labeled xarray.Dataset — not rows in a
dashboard you have to export.
Three pillars¶
Sweeps → datasets.
MultiRunMetricsWorkflow runs a Hydra multirun, collects your returned
metrics, and assembles them into a labeled xarray.Dataset keyed by
the swept parameters.
Compare with statistics.
benchmark.compare evaluates a set of trained models on a standard metric
battery (torchmetrics), then runs pairwise significance tests (scipy) with
Holm correction. The result is a BenchmarkResult with a paper-ready
.summary(), a tidy .comparisons DataFrame, and a labeled .data dataset.
Study.
Study orchestrates the full pipeline — multi-seed training sweep via Hydra,
then straight into compare — in one call. Study.from_checkpoints handles
the eval-only case when you already have checkpoints.
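The Holm correction named in the compare pillar can be illustrated on its own. The following is a minimal sketch of the textbook Holm step-down adjustment in plain Python. It is not mushin's implementation: `holm_adjust`, the model names, and the p-values are all hypothetical placeholders chosen for illustration.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by m - k + 1,
        # capped at 1, and kept non-decreasing along the sorted order.
        scaled = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, scaled)
        adjusted[idx] = running_max
    return adjusted


# Placeholder p-values for three hypothetical model pairs (not real results).
raw = {("a", "b"): 0.01, ("a", "c"): 0.04, ("b", "c"): 0.03}
adjusted = dict(zip(raw, holm_adjust(list(raw.values()))))
significant = [pair for pair, p in adjusted.items() if p <= 0.05]
```

All three raw p-values in this toy input are below 0.05, but after adjustment only the ("a", "b") pair remains below it. This is why correcting for multiple comparisons matters when several models are compared pairwise.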
Quick example¶
import torch as tr

from mushin import multirun
from mushin.workflows import MultiRunMetricsWorkflow


class LRSweep(MultiRunMetricsWorkflow):
    @staticmethod
    def task(lr: float, seed: int) -> dict:
        tr.manual_seed(seed)
        # ... train a model, then evaluate it ...
        acc = ...  # your validation accuracy
        return dict(accuracy=acc)


wf = LRSweep()
wf.run(lr=multirun([0.01, 0.1, 1.0]), seed=multirun([0, 1, 2]))  # 9 runs

ds = wf.to_xarray()
# <xarray.Dataset> Dimensions: (lr: 3, seed: 3)
# Data variables: accuracy (lr, seed)

ds["accuracy"].mean("seed")  # average over seeds, per learning rate
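To get a feel for the shape of the result without installing mushin or launching Hydra, the same 3 × 3 grid can be mimicked by hand. The sketch below is only a conceptual stand-in and says nothing about how the workflow is implemented: `task` uses an invented scoring formula in place of real training, and the dataset is assembled manually with plain xarray and NumPy calls.

```python
from itertools import product

import numpy as np
import xarray as xr


def task(lr: float, seed: int) -> dict:
    # Hypothetical stand-in for "train, then evaluate": a made-up score
    # that peaks at lr=0.1 and drops slightly with each seed.
    return {"accuracy": 0.9 - 0.5 * abs(lr - 0.1) - 0.01 * seed}


lrs = [0.01, 0.1, 1.0]
seeds = [0, 1, 2]

# One call per (lr, seed) combination, in row-major order: 3 x 3 = 9 calls.
scores = [task(lr=lr, seed=seed)["accuracy"] for lr, seed in product(lrs, seeds)]

# Reshape the flat list into a grid and label each axis with its swept values.
ds = xr.Dataset(
    {"accuracy": (("lr", "seed"), np.array(scores).reshape(len(lrs), len(seeds)))},
    coords={"lr": lrs, "seed": seeds},
)

mean_acc = ds["accuracy"].mean("seed")     # one value per learning rate
best_lr = float(mean_acc.idxmax("lr"))     # coordinate label of the best mean
per_seed = ds["accuracy"].sel(lr=best_lr)  # the three seeds at that learning rate
```

Because the axes are labeled, reductions and selections refer to `lr` and `seed` by name rather than by positional index, so downstream analysis code keeps working if the sweep later gains another dimension.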
Get started¶
- Install — pip, extras, and the support matrix
- Quickstart — run the flagship sweep example end-to-end
- Guides — workflows, compare, Study, segmentation, MCP
- API Reference — full auto-generated docs