mushin

Boilerplate-free, reproducible ML experiment workflows built on PyTorch Lightning and hydra-zen.

mushin is the evaluate-and-report layer that sits on top of Lightning and hydra-zen. Define your experiment as a function, sweep over parameters with Hydra, and get results back as a labeled xarray.Dataset rather than as rows in a dashboard you have to export.

Three pillars

Sweeps → datasets. MultiRunMetricsWorkflow runs a Hydra multirun, collects your returned metrics, and assembles them into a labeled xarray.Dataset keyed by the swept parameters.
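To make "keyed by the swept parameters" concrete, here is a minimal sketch of how per-run metrics can be arranged into a labeled xarray.Dataset by hand. It does not use mushin and is not a description of its internals: fake_task, the parameter lists, and the placeholder accuracy formula are all hypothetical stand-ins.

```python
import itertools

import numpy as np
import xarray as xr

# Hypothetical sweep axes and a stand-in task; neither comes from mushin.
lrs = [0.01, 0.1, 1.0]
seeds = [0, 1, 2]


def fake_task(lr: float, seed: int) -> dict:
    # Placeholder metric so the example runs without training anything.
    return {"accuracy": 1.0 / (1.0 + lr) + 0.01 * seed}


# Run every (lr, seed) combination, with lr as the outer loop.
results = [fake_task(lr, seed) for lr, seed in itertools.product(lrs, seeds)]

# Reshape the flat list of metrics into a grid with one axis per swept parameter.
accuracy = np.array([r["accuracy"] for r in results]).reshape(len(lrs), len(seeds))

ds = xr.Dataset(
    data_vars={"accuracy": (("lr", "seed"), accuracy)},
    coords={"lr": lrs, "seed": seeds},
)

# Select by parameter value rather than by positional index.
best_at_lr = ds["accuracy"].sel(lr=0.1).mean("seed")
```

Because the dimensions carry the parameter names and values, later analysis can refer to lr=0.1 or reduce over "seed" without tracking array positions.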

compare with statistics. benchmark.compare evaluates a set of trained models on a standard battery of metrics (torchmetrics), then runs pairwise significance tests (scipy) with Holm correction for multiple comparisons. The result is a BenchmarkResult with a paper-ready .summary(), a tidy .comparisons DataFrame, and a labeled .data dataset.
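For readers unfamiliar with Holm correction, the following is a minimal pure-Python sketch of the step-down adjustment applied to a set of raw p-values. The holm_adjust helper and the example p-values are hypothetical and are not part of mushin's API. In practice the raw p-values would come from pairwise tests such as those in scipy.stats.

```python
def holm_adjust(p_values: dict[str, float]) -> dict[str, float]:
    """Return Holm-adjusted p-values, keyed like the input."""
    m = len(p_values)
    # Step-down procedure: visit p-values from smallest to largest.
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    adjusted = {}
    running_max = 0.0
    for rank, (name, p) in enumerate(ordered):
        # The p-value at 0-based rank k is scaled by (m - k), capped at 1.
        scaled = min(1.0, (m - rank) * p)
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, scaled)
        adjusted[name] = running_max
    return adjusted


# Hypothetical raw p-values for three pairwise model comparisons.
raw = {"A vs B": 0.01, "A vs C": 0.04, "B vs C": 0.03}
adjusted = holm_adjust(raw)

# Flag comparisons that remain significant at the 0.05 level after adjustment.
significant = {name: p < 0.05 for name, p in adjusted.items()}
```

With these made-up inputs, the smallest p-value is multiplied by 3, the next by 2, and the largest by 1, so a comparison that looks significant on its own can lose significance once the whole family of comparisons is accounted for.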

Study. Study orchestrates the full pipeline in one call: a multi-seed training sweep via Hydra, followed directly by compare. Study.from_checkpoints handles the eval-only case when you already have checkpoints.

Quick example

import torch as tr
from mushin import multirun
from mushin.workflows import MultiRunMetricsWorkflow

class LRSweep(MultiRunMetricsWorkflow):
    @staticmethod
    def task(lr: float, seed: int) -> dict:
        tr.manual_seed(seed)
        # ... train a model, then evaluate it ...
        acc = ...  # your validation accuracy
        return dict(accuracy=acc)

wf = LRSweep()
wf.run(lr=multirun([0.01, 0.1, 1.0]), seed=multirun([0, 1, 2]))  # 9 runs

ds = wf.to_xarray()
# <xarray.Dataset> Dimensions: (lr: 3, seed: 3)
#   Data variables: accuracy (lr, seed)

ds["accuracy"].mean("seed")   # average over seeds, per learning rate

Get started

  • Install — pip, extras, and the support matrix
  • Quickstart — run the flagship sweep example end-to-end
  • Guides — workflows, compare, Study, segmentation, MCP
  • API Reference — full auto-generated docs