Python Statistical Tests Reference
This notebook is a practical reference for paired statistical comparison workflows in Python.
It covers:
- how to shape paired comparison tables
- how to aggregate repeated metric rows
- how to compute grouped means and standard deviations
- how to run Friedman tests across several conditions
- how to run Wilcoxon signed-rank tests for paired comparisons
- how to apply Bonferroni correction
- how to build model comparison tables
- how to save reusable outputs
The examples use generic units, models, conditions, windows, and metrics so the same patterns can be reused in model evaluation, forecasting, retrieval, experimentation, and small analysis pipelines.
Setup
The notebook uses pandas, numpy, and scipy.
The examples create local folders so each table can be saved and inspected after the notebook runs.
from pathlib import Path
import numpy as np
import pandas as pd
from scipy.stats import friedmanchisquare, wilcoxon
from itertools import combinations
data_dir = Path("data")
outputs_dir = Path("outputs") / "statistical_tests_reference"
data_dir.mkdir(parents=True, exist_ok=True)
outputs_dir.mkdir(parents=True, exist_ok=True)
# print(data_dir.resolve())
# print(outputs_dir.resolve())
Create sample paired data
Most paired tests need repeated measurements over the same units.
In this sample dataset:
unitis the repeated item being measuredmodelis the model being evaluatedconditionis the model setting or method conditionwindowis a repeated sub-groupmetricis a lower-is-better score
The same units appear across conditions, which makes paired testing possible.
units = [f"unit_{i:02d}" for i in range(1, 9)]
windows = ["short", "long"]
models = ["model_a", "model_b", "model_c"]
conditions = ["condition_1", "condition_2", "condition_3"]
rows = []
base_by_unit = {
"unit_01": 4.2,
"unit_02": 3.8,
"unit_03": 5.1,
"unit_04": 4.6,
"unit_05": 3.9,
"unit_06": 4.9,
"unit_07": 4.4,
"unit_08": 5.3,
}
condition_effect = {
"condition_1": 0.20,
"condition_2": -0.05,
"condition_3": -0.22,
}
model_effect = {
"model_a": 0.10,
"model_b": -0.05,
"model_c": 0.18,
}
window_effect = {
"short": -0.04,
"long": 0.04,
}
for unit in units:
for model in models:
for condition in conditions:
for window in windows:
metric = (
base_by_unit[unit]
+ model_effect[model]
+ condition_effect[condition]
+ window_effect[window]
)
rows.append({
"unit": unit,
"model": model,
"condition": condition,
"window": window,
"metric": round(metric, 3),
})
raw = pd.DataFrame(rows)
sample_path = data_dir / "statistical_tests_sample.csv"
raw.to_csv(sample_path, index=False)
print(raw.head(12))
print(raw.shape)
print(sample_path)
unit model condition window metric 0 unit_01 model_a condition_1 short 4.46 1 unit_01 model_a condition_1 long 4.54 2 unit_01 model_a condition_2 short 4.21 3 unit_01 model_a condition_2 long 4.29 4 unit_01 model_a condition_3 short 4.04 5 unit_01 model_a condition_3 long 4.12 6 unit_01 model_b condition_1 short 4.31 7 unit_01 model_b condition_1 long 4.39 8 unit_01 model_b condition_2 short 4.06 9 unit_01 model_b condition_2 long 4.14 10 unit_01 model_b condition_3 short 3.89 11 unit_01 model_b condition_3 long 3.97 (144, 5) data/statistical_tests_sample.csv
Basic comparison table shape
A paired comparison table has one row per unit and one column per condition.
This is the shape required by many repeated-measures tests.
simple_df = pd.DataFrame({
"unit": ["A", "B", "C", "D"],
"condition_1": [4.2, 3.6, 5.1, 4.8],
"condition_2": [3.9, 3.4, 4.8, 4.5],
"condition_3": [3.7, 3.5, 4.7, 4.4],
})
simple_df
| unit | condition_1 | condition_2 | condition_3 | |
|---|---|---|---|---|
| 0 | A | 4.2 | 3.9 | 3.7 |
| 1 | B | 3.6 | 3.4 | 3.5 |
| 2 | C | 5.1 | 4.8 | 4.7 |
| 3 | D | 4.8 | 4.5 | 4.4 |
The unit can be a subject, query set, fold group, region, product, dataset, or repeated item.
The condition can be a model, feature set, prompt, method, or treatment.
What paired observation means
A paired observation means the same unit is measured under two or more conditions.
The comparison happens within the same unit. This removes unit-level differences from the comparison.
paired_example = pd.DataFrame({
"unit": ["A", "B", "C"],
"condition_1": [4.2, 3.6, 5.1],
"condition_2": [3.9, 3.4, 4.8],
})
paired_example["paired_difference"] = paired_example["condition_1"] - paired_example["condition_2"]
paired_example
| unit | condition_1 | condition_2 | paired_difference | |
|---|---|---|---|---|
| 0 | A | 4.2 | 3.9 | 0.3 |
| 1 | B | 3.6 | 3.4 | 0.2 |
| 2 | C | 5.1 | 4.8 | 0.3 |
Use a paired test only when rows refer to matched units.
Do not compare unrelated units as if they were pairs.
Aggregate repeated metric rows
Real evaluation outputs often have multiple rows per unit and condition.
Before paired testing, create one value per unit and condition. In this example, the short and long windows are averaged first.
model_name = "model_a"
model_raw = raw[raw["model"] == model_name].copy()
agg = (
model_raw
.groupby(["unit", "condition"], as_index=False)["metric"]
.mean()
)
wide = (
agg
.pivot(index="unit", columns="condition", values="metric")
.dropna()
)
wide
| condition | condition_1 | condition_2 | condition_3 |
|---|---|---|---|
| unit | |||
| unit_01 | 4.5 | 4.25 | 4.08 |
| unit_02 | 4.1 | 3.85 | 3.68 |
| unit_03 | 5.4 | 5.15 | 4.98 |
| unit_04 | 4.9 | 4.65 | 4.48 |
| unit_05 | 4.2 | 3.95 | 3.78 |
| unit_06 | 5.2 | 4.95 | 4.78 |
| unit_07 | 4.7 | 4.45 | 4.28 |
| unit_08 | 5.6 | 5.35 | 5.18 |
After aggregation:
- each row is one paired unit
- each column is one condition
- each cell is the average metric for that unit and condition
This wide table is now ready for paired statistical tests.
Mean and standard deviation over grouped values
Grouped summary cells can show mean and standard deviation.
Be clear about what the standard deviation is across. In this example, the standard deviation is across the rows inside each model, condition, and window group.
summary = (
raw
.groupby(["model", "condition", "window"], as_index=False)["metric"]
.agg(metric_mean="mean", metric_std="std")
)
summary["cell"] = summary.apply(
lambda row: f"{row['metric_mean']:.3f} ± {row['metric_std']:.3f}",
axis=1
)
summary.head(12)
| model | condition | window | metric_mean | metric_std | cell | |
|---|---|---|---|---|---|---|
| 0 | model_a | condition_1 | long | 4.865 | 0.549675 | 4.865 ± 0.550 |
| 1 | model_a | condition_1 | short | 4.785 | 0.549675 | 4.785 ± 0.550 |
| 2 | model_a | condition_2 | long | 4.615 | 0.549675 | 4.615 ± 0.550 |
| 3 | model_a | condition_2 | short | 4.535 | 0.549675 | 4.535 ± 0.550 |
| 4 | model_a | condition_3 | long | 4.445 | 0.549675 | 4.445 ± 0.550 |
| 5 | model_a | condition_3 | short | 4.365 | 0.549675 | 4.365 ± 0.550 |
| 6 | model_b | condition_1 | long | 4.715 | 0.549675 | 4.715 ± 0.550 |
| 7 | model_b | condition_1 | short | 4.635 | 0.549675 | 4.635 ± 0.550 |
| 8 | model_b | condition_2 | long | 4.465 | 0.549675 | 4.465 ± 0.550 |
| 9 | model_b | condition_2 | short | 4.385 | 0.549675 | 4.385 ± 0.550 |
| 10 | model_b | condition_3 | long | 4.295 | 0.549675 | 4.295 ± 0.550 |
| 11 | model_b | condition_3 | short | 4.215 | 0.549675 | 4.215 ± 0.550 |
summary_path = outputs_dir / "grouped_metric_summary.csv"
summary.to_csv(summary_path, index=False)
print(summary_path)
outputs/statistical_tests_reference/grouped_metric_summary.csv
If the input rows are already averaged values, the standard deviation is not a raw observation-level standard deviation.
The table label should state the aggregation unit clearly.
Friedman test across several paired conditions
The Friedman test is a rank-based repeated-measures test.
Use it when the same units are measured under three or more conditions.
stat, p_value = friedmanchisquare(
wide["condition_1"].values,
wide["condition_2"].values,
wide["condition_3"].values,
)
print("Friedman statistic:", round(stat, 6))
print("p-value:", round(p_value, 6))
Friedman statistic: 16.0 p-value: 0.000335
The Friedman test gives one overall p-value.
It asks whether the condition columns have a systematic rank difference. It does not say which specific condition pair is different.
Friedman interpretation table
friedman_reference = pd.DataFrame({
"part": [
"Null hypothesis",
"Alternative",
"Input",
"Output",
"Next step",
],
"meaning": [
"The paired conditions have no systematic rank difference.",
"At least one condition differs in rank pattern.",
"Three or more paired columns.",
"One statistic and one p-value.",
"Use paired follow-up tests if the result is significant.",
],
})
friedman_reference
| part | meaning | |
|---|---|---|
| 0 | Null hypothesis | The paired conditions have no systematic rank ... |
| 1 | Alternative | At least one condition differs in rank pattern. |
| 2 | Input | Three or more paired columns. |
| 3 | Output | One statistic and one p-value. |
| 4 | Next step | Use paired follow-up tests if the result is si... |
Wilcoxon signed-rank test
The Wilcoxon signed-rank test is a paired two-condition test.
It starts with paired differences, ranks the absolute difference sizes, restores the signs, and checks whether signed ranks are balanced around zero.
a = wide["condition_1"]
b = wide["condition_2"]
stat, p_value = wilcoxon(a, b)
print("Wilcoxon statistic:", round(stat, 6))
print("p-value:", round(p_value, 6))
Wilcoxon statistic: 0.0 p-value: 0.007812
The default SciPy test is two-sided.
The null hypothesis is that the median paired difference is zero.
How Wilcoxon ranks differences
Wilcoxon does not only count positive and negative signs. It ranks the absolute difference sizes, then uses the signs.
example = pd.DataFrame({
"a": [4.2, 3.6, 5.1, 4.0, 3.9],
"b": [3.9, 3.4, 5.0, 4.5, 3.8],
})
example["diff"] = example["a"] - example["b"]
example["abs_diff"] = example["diff"].abs()
example["rank_abs_diff"] = example["abs_diff"].rank()
example
| a | b | diff | abs_diff | rank_abs_diff | |
|---|---|---|---|---|---|
| 0 | 4.2 | 3.9 | 0.3 | 0.3 | 4.0 |
| 1 | 3.6 | 3.4 | 0.2 | 0.2 | 3.0 |
| 2 | 5.1 | 5.0 | 0.1 | 0.1 | 1.0 |
| 3 | 4.0 | 4.5 | -0.5 | 0.5 | 5.0 |
| 4 | 3.9 | 3.8 | 0.1 | 0.1 | 2.0 |
Positive differences mean a is larger than b.
Negative differences mean a is smaller than b.
For lower-is-better metrics, a negative difference can mean a performs better than b.
Wilcoxon assumptions
wilcoxon_assumptions = pd.DataFrame({
"assumption": [
"Paired observations",
"Independent pairs",
"Rankable differences",
"Rough symmetry",
],
"meaning": [
"The same units appear in both conditions.",
"One unit should not duplicate another unit.",
"Differences should be numeric or ordinal enough to rank.",
"The paired difference distribution should be roughly symmetric around the median.",
],
})
wilcoxon_assumptions
| assumption | meaning | |
|---|---|---|
| 0 | Paired observations | The same units appear in both conditions. |
| 1 | Independent pairs | One unit should not duplicate another unit. |
| 2 | Rankable differences | Differences should be numeric or ordinal enoug... |
| 3 | Rough symmetry | The paired difference distribution should be r... |
Bonferroni correction
Bonferroni correction adjusts p-values when multiple tests are run.
It multiplies each raw p-value by the number of tests and caps the result at 1.0.
raw_p_values = [0.012, 0.030, 0.200, 0.004]
n_tests = len(raw_p_values)
corrected = [min(p * n_tests, 1.0) for p in raw_p_values]
pd.DataFrame({
"raw_p_value": raw_p_values,
"bonferroni_p_value": corrected,
})
| raw_p_value | bonferroni_p_value | |
|---|---|---|
| 0 | 0.012 | 0.048 |
| 1 | 0.030 | 0.120 |
| 2 | 0.200 | 0.800 |
| 3 | 0.004 | 0.016 |
Bonferroni reduces false positives from repeated testing.
It is simple and strict. For many pairwise comparisons, report both raw and adjusted p-values.
Friedman plus Wilcoxon workflow
A common workflow is:
- run Friedman as the overall test
- if significant, run selected Wilcoxon follow-up tests
- adjust follow-up p-values
conditions = ["condition_1", "condition_2", "condition_3"]
stat, p_overall = friedmanchisquare(*[wide[c].values for c in conditions])
rows = []
if p_overall < 0.05:
pairs = list(combinations(conditions, 2))
for left, right in pairs:
stat_pair, p_raw = wilcoxon(wide[left], wide[right])
p_adjusted = min(p_raw * len(pairs), 1.0)
rows.append({
"left": left,
"right": right,
"wilcoxon_stat": stat_pair,
"raw_p_value": p_raw,
"bonferroni_p_value": p_adjusted,
})
follow_up = pd.DataFrame(rows)
print("Friedman p-value:", p_overall)
follow_up
Friedman p-value: 0.0003354626279025119
| left | right | wilcoxon_stat | raw_p_value | bonferroni_p_value | |
|---|---|---|---|---|---|
| 0 | condition_1 | condition_2 | 0.0 | 0.007812 | 0.023438 |
| 1 | condition_1 | condition_3 | 0.0 | 0.007812 | 0.023438 |
| 2 | condition_2 | condition_3 | 0.0 | 0.007812 | 0.023438 |
follow_up_path = outputs_dir / "friedman_wilcoxon_follow_up.csv"
follow_up.to_csv(follow_up_path, index=False)
print(follow_up_path)
outputs/statistical_tests_reference/friedman_wilcoxon_follow_up.csv
Pairwise model comparison
To compare several models, build one column per model and keep rows matched by the same unit.
This example compares models under one selected condition.
selected_condition = "condition_2"
model_condition = (
raw[raw["condition"] == selected_condition]
.groupby(["unit", "model"], as_index=False)["metric"]
.mean()
)
model_scores = (
model_condition
.pivot(index="unit", columns="model", values="metric")
.dropna()
)
model_scores
| model | model_a | model_b | model_c |
|---|---|---|---|
| unit | |||
| unit_01 | 4.25 | 4.1 | 4.33 |
| unit_02 | 3.85 | 3.7 | 3.93 |
| unit_03 | 5.15 | 5.0 | 5.23 |
| unit_04 | 4.65 | 4.5 | 4.73 |
| unit_05 | 3.95 | 3.8 | 4.03 |
| unit_06 | 4.95 | 4.8 | 5.03 |
| unit_07 | 4.45 | 4.3 | 4.53 |
| unit_08 | 5.35 | 5.2 | 5.43 |
models = model_scores.columns.tolist()
pairwise_rows = []
for left, right in combinations(models, 2):
s1 = model_scores[left].dropna()
s2 = model_scores[right].dropna()
idx = s1.index.intersection(s2.index)
stat, p_value = wilcoxon(s1.loc[idx], s2.loc[idx])
pairwise_rows.append({
"left": left,
"right": right,
"n_units": len(idx),
"wilcoxon_stat": stat,
"raw_p_value": p_value,
})
pairwise_model_tests = pd.DataFrame(pairwise_rows)
pairwise_model_tests["bonferroni_p_value"] = pairwise_model_tests["raw_p_value"].apply(
lambda p: min(p * len(pairwise_model_tests), 1.0)
)
pairwise_model_tests
| left | right | n_units | wilcoxon_stat | raw_p_value | bonferroni_p_value | |
|---|---|---|---|---|---|---|
| 0 | model_a | model_b | 8 | 0.0 | 0.007812 | 0.023438 |
| 1 | model_a | model_c | 8 | 0.0 | 0.007812 | 0.023438 |
| 2 | model_b | model_c | 8 | 0.0 | 0.007812 | 0.023438 |
pairwise_path = outputs_dir / "pairwise_model_tests.csv"
pairwise_model_tests.to_csv(pairwise_path, index=False)
print(pairwise_path)
outputs/statistical_tests_reference/pairwise_model_tests.csv
Each row must refer to the same unit across models.
Use multiple-comparison correction when making strict significance claims across many pairs.
Choose best condition before model comparison
When each model has several conditions, one practical comparison is to choose each model's best average condition first.
Then compare those selected best configurations across matched units.
best_conditions = {}
for model_name in raw["model"].unique():
sub = raw[raw["model"] == model_name]
best_conditions[model_name] = (
sub
.groupby("condition", as_index=False)["metric"]
.mean()
.sort_values("metric")
.iloc[0]["condition"]
)
best_table = pd.DataFrame()
for model_name, condition_name in best_conditions.items():
sub = raw[(raw["model"] == model_name) & (raw["condition"] == condition_name)]
series = (
sub
.groupby("unit")["metric"]
.mean()
.rename(model_name)
)
best_table = pd.concat([best_table, series], axis=1)
print(best_conditions)
best_table
{'model_a': 'condition_3', 'model_b': 'condition_3', 'model_c': 'condition_3'}
| model_a | model_b | model_c | |
|---|---|---|---|
| unit_01 | 4.08 | 3.93 | 4.16 |
| unit_02 | 3.68 | 3.53 | 3.76 |
| unit_03 | 4.98 | 4.83 | 5.06 |
| unit_04 | 4.48 | 4.33 | 4.56 |
| unit_05 | 3.78 | 3.63 | 3.86 |
| unit_06 | 4.78 | 4.63 | 4.86 |
| unit_07 | 4.28 | 4.13 | 4.36 |
| unit_08 | 5.18 | 5.03 | 5.26 |
best_rows = []
for left, right in combinations(best_table.columns.tolist(), 2):
idx = best_table[left].dropna().index.intersection(best_table[right].dropna().index)
stat, p_value = wilcoxon(best_table.loc[idx, left], best_table.loc[idx, right])
best_rows.append({
"left": left,
"left_condition": best_conditions[left],
"right": right,
"right_condition": best_conditions[right],
"n_units": len(idx),
"wilcoxon_stat": stat,
"raw_p_value": p_value,
})
best_condition_tests = pd.DataFrame(best_rows)
best_condition_tests["bonferroni_p_value"] = best_condition_tests["raw_p_value"].apply(
lambda p: min(p * len(best_condition_tests), 1.0)
)
best_condition_tests
| left | left_condition | right | right_condition | n_units | wilcoxon_stat | raw_p_value | bonferroni_p_value | |
|---|---|---|---|---|---|---|---|---|
| 0 | model_a | condition_3 | model_b | condition_3 | 8 | 0.0 | 0.007812 | 0.023438 |
| 1 | model_a | condition_3 | model_c | condition_3 | 8 | 0.0 | 0.007812 | 0.023438 |
| 2 | model_b | condition_3 | model_c | condition_3 | 8 | 0.0 | 0.007812 | 0.023438 |