Understanding the statistics
mushin's statistical comparison layer is designed to give you honest answers: not just which method scored higher on average, but whether that difference is reliable given the seed-to-seed variance of training. This page explains the tests, the Holm correction, and how to interpret the results.
The tests
Pass test= to compare or Study to select the pairwise significance test:
| test= | Underlying scipy call | Paired? | When to use |
|---|---|---|---|
| "wilcoxon" | scipy.stats.wilcoxon | Yes | Default; non-normal distributions, ordinal metrics, small n |
| "ttest_rel" | scipy.stats.ttest_rel | Yes | Paired t-test; per-seed differences approximately normal |
| "welch" | scipy.stats.ttest_ind(equal_var=False) | No | Gaussian metrics, unequal variance; good general choice |
| "ttest_ind" | scipy.stats.ttest_ind(equal_var=True) | No | Independent t-test, equal variance assumed |
| "mannwhitney" | scipy.stats.mannwhitneyu | No | Non-normal, independent samples |
Paired vs independent: Paired tests (wilcoxon, ttest_rel) compare
seed-matched values (seed 0 of method A vs seed 0 of method B). Use them when
the same seeds are used for both methods (the common case in mushin). Independent
tests (welch, ttest_ind, mannwhitney) treat the two groups as unrelated.
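To see why pairing matters, the following sketch computes the t statistics behind the paired t-test and Welch's test by hand on the same data. It is an illustration using only the standard library, not mushin's implementation; the accuracy values and the helper names `paired_t` and `welch_t` are made up for the example.

```python
import math
import statistics


def paired_t(a, b):
    """t statistic of the paired t-test, built from per-seed differences."""
    diffs = [x - y for x, y in zip(a, b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


def welch_t(a, b):
    """t statistic of Welch's test, which treats the groups as unrelated."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


# Hypothetical accuracies; index i holds seed i for both methods.
method_a = [0.90, 0.94, 0.92, 0.96, 0.93]
method_b = [0.89, 0.92, 0.91, 0.94, 0.92]

t_paired = paired_t(method_a, method_b)
t_welch = welch_t(method_a, method_b)
print(f"paired t = {t_paired:.2f}, Welch t = {t_welch:.2f}")
```

Because both methods rise and fall together across seeds, the per-seed differences vary far less than the raw scores, so the paired statistic comes out much larger than the Welch statistic on identical data.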
Holm–Bonferroni correction
When you compare K methods on M metrics, mushin runs K×(K-1)/2 pairwise tests
per metric. Without correction, running many tests inflates the probability of a
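false positive. mushin applies the Holm–Bonferroni step-down correction per
metric: it sorts the m raw p-values in ascending order and compares the i-th
smallest against alpha / (m - i + 1), stopping at the first test that fails;
that test and every later one are not significant. Equivalently, the i-th
smallest p-value is multiplied by (m - i + 1), forced to be non-decreasing,
and capped at 1. The corrected p-values are
stored in result.comparisons["p_corrected"].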
The family-wise error rate is controlled at your chosen alpha (default 0.05).
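A minimal sketch of the standard Holm step-down adjustment, written in plain Python for illustration. The function name `holm_adjust` and the example p-values are hypothetical, and mushin's internal code may be organised differently.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        scaled = (m - rank) * p_values[idx]
        # Adjusted values must not decrease as the raw p-values grow.
        running_max = max(running_max, scaled)
        adjusted[idx] = min(1.0, running_max)
    return adjusted


# Three methods give 3 * 2 / 2 = 3 pairwise tests for one metric.
raw = [0.010, 0.040, 0.030]
corrected = holm_adjust(raw)
significant = [p <= 0.05 for p in corrected]
print([round(p, 3) for p in corrected])  # [0.03, 0.06, 0.06]
print(significant)                       # [True, False, False]
```

Note that the raw p-values 0.030 and 0.040 are both below 0.05 on their own, but neither survives the correction.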
Effect size
In addition to the p-value, mushin reports Cohen's d (pooled-variance) as
result.comparisons["effect_size"]. This measures the magnitude of the
difference in units of standard deviations:
| \|d\| | Interpretation |
|---|---|
| < 0.2 | Negligible |
| 0.2 – 0.5 | Small |
| 0.5 – 0.8 | Medium |
| > 0.8 | Large |
A significant p-value with a small effect size means the difference is real but may not matter in practice. A large effect size with a non-significant p-value often means you have too few seeds.
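As a rough illustration, pooled-variance Cohen's d can be computed with the standard library as below. The helper names `cohens_d` and `magnitude`, the sample values, and the treatment of exact boundary values (for example, 0.8 counted as Medium) are assumptions made for this sketch, not mushin's documented behaviour.

```python
import math
import statistics


def cohens_d(a, b):
    """Cohen's d using the pooled sample variance of both groups."""
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / (n_a + n_b - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


def magnitude(d):
    """Map |d| to the labels in the interpretation table above."""
    size = abs(d)
    if size < 0.2:
        return "Negligible"
    if size < 0.5:
        return "Small"
    if size <= 0.8:
        return "Medium"
    return "Large"


# Hypothetical accuracies over five seeds.
method_a = [0.90, 0.94, 0.92, 0.96, 0.93]
method_b = [0.89, 0.92, 0.91, 0.94, 0.92]

d = cohens_d(method_a, method_b)
print(f"d = {d:.2f} ({magnitude(d)})")
```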
Single-seed behavior
With only one seed per method, there is no within-group variance. scipy tests return a NaN p-value in this case. mushin treats NaN as not significant rather than producing a false positive. NaN p-values are excluded from the Holm correction so they cannot corrupt the correction of valid pairs.
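The effect of excluding NaN p-values can be sketched as follows. This is an illustrative stand-in rather than mushin's code: `holm_skip_nan` is a hypothetical helper that applies the Holm adjustment to the valid p-values only and leaves NaN entries as NaN and not significant.

```python
import math


def holm_skip_nan(p_values, alpha=0.05):
    """Holm-adjust the valid p-values; NaN stays NaN and is never significant."""
    valid = [i for i, p in enumerate(p_values) if not math.isnan(p)]
    m = len(valid)  # the family size counts valid tests only
    order = sorted(valid, key=lambda i: p_values[i])
    adjusted = [math.nan] * len(p_values)
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * p_values[idx])
        adjusted[idx] = min(1.0, running_max)
    significant = [(not math.isnan(p)) and p <= alpha for p in adjusted]
    return adjusted, significant


# The middle pair involves a single-seed method, so its p-value is NaN.
adjusted, significant = holm_skip_nan([0.02, math.nan, 0.04])
print(adjusted)     # [0.04, nan, 0.04]
print(significant)  # [True, False, True]
```

Had the NaN entry been counted as a third test, the smallest p-value would have been multiplied by 3 instead of 2 and would no longer pass alpha=0.05.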
Underpowered-test warning
Some tests cannot reach a given alpha no matter how large the between-method
difference is, if the seed count is too low. For example, Wilcoxon over 3 seeds
has a best-case p-value of 0.25 — it can never reach the default alpha=0.05.
mushin warns you:
UserWarning: test='wilcoxon' cannot reach alpha=0.05 with 3 seeds
(best-case p=0.2500); use more seeds or a parametric test such as test='welch'.
Solutions:
- Switch to test="welch" (parametric; can reach significance with 3 seeds).
- Increase the number of seeds (6+ makes the two-sided Wilcoxon test viable at alpha=0.05; with 5 seeds the best-case p-value is still 0.0625).
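The best-case figure in the warning can be reproduced by hand. The sketch below assumes a two-sided exact Wilcoxon signed-rank test with no zero differences and no ties, in which case the most extreme outcome (every seed favours the same method) has probability 2 / 2^n. The function names are hypothetical and this is not necessarily how mushin performs the check.

```python
def wilcoxon_best_case_p(n_seeds):
    """Smallest two-sided exact p-value attainable with n_seeds paired values."""
    # Each difference is positive or negative with probability 1/2 under the
    # null hypothesis, so "all same sign" has probability 2 * (1/2) ** n.
    return 2 / 2 ** n_seeds


def min_seeds(alpha=0.05):
    """Smallest seed count whose best-case p-value does not exceed alpha."""
    n = 1
    while wilcoxon_best_case_p(n) > alpha:
        n += 1
    return n


for n in range(3, 8):
    print(n, wilcoxon_best_case_p(n))

print("minimum seeds for alpha=0.05:", min_seeds(0.05))
```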
Interpreting the summary table
result.summary()
# method | metric | mean | ci_low | ci_high | significant_vs_ref
# cnn | accuracy | 0.963 | 0.951 | 0.975 |
# mlp | accuracy | 0.941 | 0.928 | 0.954 | *
- mean: average metric value across seeds.
- ci_low / ci_high: 95% confidence interval (Student-t based).
- significant_vs_ref: "*" if the method differs significantly from the reference (first method listed) after Holm correction.
Pitfalls
- Wilcoxon with few seeds: Cannot reach p < 0.25 with 3 seeds. Use test="welch" instead.
- p-value vs effect size: Statistical significance does not imply practical significance. Check effect_size alongside p_corrected.
- Single seed: NaN p-value → not significant. Always use multiple seeds for meaningful comparisons.
See also
- Comparing methods: the compare API
- API Reference: benchmark