benchmark¶

compare ¶

The compare facade: evaluate methods on a task battery and report significance.

compare ¶

compare(
    methods,
    data,
    task="classification",
    *,
    num_classes=None,
    predict_fn=None,
    metrics=None,
    prob_metrics=None,
    test="wilcoxon",
    alpha=0.05,
    ignore_index=None,
    device=None,
)

Compare methods on a standard battery and report significance.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `task` | `str` | `"classification"` or `"segmentation"`. | `'classification'` |
| `num_classes` | `int or None` | Required when `metrics` is not provided. | `None` |
| `ignore_index` | `int or None` | Label to exclude from segmentation metrics (e.g. a void/boundary class). | `None` |
| `prob_metrics` | `frozenset[str] or None` | Names of metrics that need probabilities; defaults to the task's set. | `None` |
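
To illustrate what excluding a label with `ignore_index` means for a segmentation metric, here is a minimal pure-Python sketch. The helper `pixel_accuracy` and the void label `255` are assumptions made for this example only; they are not part of the documented API, and the metrics that `compare` actually computes may differ.

```python
def pixel_accuracy(preds, targets, ignore_index=None):
    """Fraction of correctly labelled pixels, skipping `ignore_index` targets.

    Hypothetical helper for illustration; not the library's implementation.
    """
    correct = 0
    counted = 0
    for p, t in zip(preds, targets):
        if ignore_index is not None and t == ignore_index:
            continue  # void/boundary pixel: neither rewarded nor penalised
        counted += 1
        correct += int(p == t)
    if counted == 0:
        raise ValueError("no pixels left to score after applying ignore_index")
    return correct / counted


# Flattened label maps; 255 marks void pixels here (an assumed convention).
targets = [0, 1, 1, 255, 2, 255]
preds = [0, 1, 2, 1, 2, 0]

acc_all = pixel_accuracy(preds, targets)                      # void pixels count as errors
acc_valid = pixel_accuracy(preds, targets, ignore_index=255)  # void pixels are skipped
print(acc_all, acc_valid)
```

Without `ignore_index`, predictions on void pixels are scored as errors, which lowers the metric for every method for reasons unrelated to model quality.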

BenchmarkResult `dataclass` ¶

Holds benchmark results.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `data` | `Dataset` | Dims `(method, seed)`, one data variable per metric. |
| `comparisons` | `DataFrame` | Pairwise significance results (see `_stats.compare_methods`). |
| `alpha` | `float` | Significance level used. |
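
As a rough illustration of how pairwise significance results could be derived from per-seed scores, the sketch below runs an exact two-sided Wilcoxon signed-rank test on every pair of methods. It assumes that `test="wilcoxon"` refers to the paired signed-rank test; the function `wilcoxon_exact`, the row keys, and the scores are all invented for this example. The real schema of `comparisons` is defined by `_stats.compare_methods`, which is not shown here.

```python
from itertools import combinations, product


def wilcoxon_exact(x, y):
    """Two-sided exact Wilcoxon signed-rank p-value for paired samples."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Enumerate all 2**n sign assignments (only practical for a few seeds).
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            hits += 1
    return hits / 2 ** n


# Made-up scores: one value per seed, mirroring the (method, seed) layout.
scores = {
    "baseline": [80, 82, 79, 81, 83, 78],
    "method_a": [85, 88, 83, 87, 90, 84],
    "method_b": [81, 80, 80, 79, 84, 78],
}
alpha = 0.05

comparisons = []
for a, b in combinations(scores, 2):
    p = wilcoxon_exact(scores[a], scores[b])
    comparisons.append({"method_a": a, "method_b": b, "p_value": p, "significant": p < alpha})

for row in comparisons:
    print(row)
```

The sketch applies no multiple-comparison correction; whether the library does is not stated in this section.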

summary ¶

summary(reference=None)

Builds a publication-ready table with the mean and confidence interval for each method and metric. A "*" marks a method that differs significantly from `reference` (default: the first method in `data`).
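
The sketch below shows one way such a table could be assembled for a single metric. It is an illustration under stated assumptions, not the library's implementation: `summarize`, the normal-approximation interval (`z=1.96`), and the `significant` set of method pairs are all hypothetical, and the interval type that `summary` really reports is not specified in this section.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(scores, significant, reference=None, z=1.96):
    """Format mean +/- CI half-width per method, starring significant differences.

    `scores` maps method name -> per-seed values for one metric.
    `significant` is a set of frozenset pairs judged significantly different.
    """
    methods = list(scores)
    if reference is None:
        reference = methods[0]  # default: the first method
    rows = []
    for name in methods:
        vals = scores[name]
        m = mean(vals)
        half = z * stdev(vals) / sqrt(len(vals))  # normal approximation (assumed)
        differs = name != reference and frozenset((name, reference)) in significant
        star = "*" if differs else ""
        rows.append(f"{name:<10} {m:6.2f} +/- {half:.2f}{star}")
    return rows


# Made-up per-seed scores and a made-up set of significant pairs.
scores = {
    "baseline": [80, 82, 79, 81, 83, 78],
    "method_a": [85, 88, 83, 87, 90, 84],
    "method_b": [81, 80, 80, 79, 84, 78],
}
significant = {frozenset(("baseline", "method_a"))}

table = summarize(scores, significant)
print("\n".join(table))
```

Passing a different `reference` moves the marker: the star always describes a comparison against the chosen reference method, never against the best method overall.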
