Core concepts¶

Workflows: sweep → dataset¶

A mushin workflow is a sweep-and-collect pattern. You define your experiment as a task(...) method and run it across a grid of hyperparameters. mushin uses Hydra under the hood to launch one job per configuration, each in its own directory, and assembles the returned metrics into a labeled xarray.Dataset.

The key insight: the dataset dimensions are your swept parameters (e.g. lr, seed), and the data variables are whatever your task returns (e.g. accuracy, loss). This gives you a structured, labeled result rather than a list of floats.

See Workflows & sweeps and the API Reference — workflows.

The (method × seed) dataset¶

Reproducible comparison requires running each method across multiple seeds. mushin structures the evaluation results as an xarray.Dataset with dimensions method and seed:

<xarray.Dataset> Dimensions: (method: 2, seed: 3)
  Coordinates:
    method  (method)  object  'cnn'  'mlp'
    seed    (seed)    int64   0  1  2
  Data variables:
    accuracy   (method, seed)  float64  ...
    f1         (method, seed)  float64  ...

This structure makes it natural to: - Compute per-method means: ds["accuracy"].mean("seed") - Slice a single method: ds.sel(method="cnn") - Export to NetCDF for later analysis: ds.to_netcdf("results.nc")

Statistical comparison: why seeds + significance¶

Training a model is stochastic (random initialization, data shuffling). A single run can produce an outlier. By running each method with multiple seeds, mushin captures the natural variance of training and uses it to answer the question: is the observed difference likely to hold up on a new seed?

mushin applies a pairwise significance test (Welch, Wilcoxon, or Mann-Whitney U) and corrects for multiple comparisons with the Holm–Bonferroni procedure. The result tells you not just which method scored higher on average, but whether that difference is statistically reliable.

See Understanding the statistics for details on test selection and the Holm correction.

The task registry seam¶

compare and Study accept a task= parameter that selects the metric battery and the default prediction logic:

`task=`	Battery	Default predict_fn
`"classification"`	accuracy, f1, precision, recall, auroc, ece	argmax + softmax
`"segmentation"`	miou, dice, pixel_acc, precision, recall	argmax + softmax over spatial dims

You can override either end: - Pass metrics= to replace the battery entirely. - Pass predict_fn= to adapt models that return dicts or non-standard tensors.

See Custom metrics & predict_fn.