Ctrl K

Python Statistical Tests Reference

This notebook is a practical reference for paired statistical comparison workflows in Python.

It covers:

  • how to shape paired comparison tables
  • how to aggregate repeated metric rows
  • how to compute grouped means and standard deviations
  • how to run Friedman tests across several conditions
  • how to run Wilcoxon signed-rank tests for paired comparisons
  • how to apply Bonferroni correction
  • how to build model comparison tables
  • how to save reusable outputs

The examples use generic units, models, conditions, windows, and metrics so the same patterns can be reused in model evaluation, forecasting, retrieval, experimentation, and small analysis pipelines.

Setup

The notebook uses pandas, numpy, and scipy.

The examples create local folders so each table can be saved and inspected after the notebook runs.

In [1]:
from pathlib import Path

import numpy as np
import pandas as pd
from scipy.stats import friedmanchisquare, wilcoxon
from itertools import combinations

data_dir = Path("data")
outputs_dir = Path("outputs") / "statistical_tests_reference"

data_dir.mkdir(parents=True, exist_ok=True)
outputs_dir.mkdir(parents=True, exist_ok=True)

# print(data_dir.resolve())
# print(outputs_dir.resolve())

Create sample paired data

Most paired tests need repeated measurements over the same units.

In this sample dataset:

  • unit is the repeated item being measured
  • model is the model being evaluated
  • condition is the model setting or method condition
  • window is a repeated sub-group
  • metric is a lower-is-better score

The same units appear across conditions, which makes paired testing possible.

In [2]:
units = [f"unit_{i:02d}" for i in range(1, 9)]
windows = ["short", "long"]
models = ["model_a", "model_b", "model_c"]
conditions = ["condition_1", "condition_2", "condition_3"]

rows = []

base_by_unit = {
    "unit_01": 4.2,
    "unit_02": 3.8,
    "unit_03": 5.1,
    "unit_04": 4.6,
    "unit_05": 3.9,
    "unit_06": 4.9,
    "unit_07": 4.4,
    "unit_08": 5.3,
}

condition_effect = {
    "condition_1": 0.20,
    "condition_2": -0.05,
    "condition_3": -0.22,
}

model_effect = {
    "model_a": 0.10,
    "model_b": -0.05,
    "model_c": 0.18,
}

window_effect = {
    "short": -0.04,
    "long": 0.04,
}

for unit in units:
    for model in models:
        for condition in conditions:
            for window in windows:
                metric = (
                    base_by_unit[unit]
                    + model_effect[model]
                    + condition_effect[condition]
                    + window_effect[window]
                )
                rows.append({
                    "unit": unit,
                    "model": model,
                    "condition": condition,
                    "window": window,
                    "metric": round(metric, 3),
                })

raw = pd.DataFrame(rows)

sample_path = data_dir / "statistical_tests_sample.csv"
raw.to_csv(sample_path, index=False)

print(raw.head(12))
print(raw.shape)
print(sample_path)
       unit    model    condition window  metric
0   unit_01  model_a  condition_1  short    4.46
1   unit_01  model_a  condition_1   long    4.54
2   unit_01  model_a  condition_2  short    4.21
3   unit_01  model_a  condition_2   long    4.29
4   unit_01  model_a  condition_3  short    4.04
5   unit_01  model_a  condition_3   long    4.12
6   unit_01  model_b  condition_1  short    4.31
7   unit_01  model_b  condition_1   long    4.39
8   unit_01  model_b  condition_2  short    4.06
9   unit_01  model_b  condition_2   long    4.14
10  unit_01  model_b  condition_3  short    3.89
11  unit_01  model_b  condition_3   long    3.97
(144, 5)
data/statistical_tests_sample.csv

Basic comparison table shape

A paired comparison table has one row per unit and one column per condition.

This is the shape required by many repeated-measures tests.

In [3]:
simple_df = pd.DataFrame({
    "unit": ["A", "B", "C", "D"],
    "condition_1": [4.2, 3.6, 5.1, 4.8],
    "condition_2": [3.9, 3.4, 4.8, 4.5],
    "condition_3": [3.7, 3.5, 4.7, 4.4],
})

simple_df
Out[3]:
unit condition_1 condition_2 condition_3
0 A 4.2 3.9 3.7
1 B 3.6 3.4 3.5
2 C 5.1 4.8 4.7
3 D 4.8 4.5 4.4

The unit can be a subject, query set, fold group, region, product, dataset, or repeated item.

The condition can be a model, feature set, prompt, method, or treatment.

What paired observation means

A paired observation means the same unit is measured under two or more conditions.

The comparison happens within the same unit. This removes unit-level differences from the comparison.

In [4]:
paired_example = pd.DataFrame({
    "unit": ["A", "B", "C"],
    "condition_1": [4.2, 3.6, 5.1],
    "condition_2": [3.9, 3.4, 4.8],
})

paired_example["paired_difference"] = paired_example["condition_1"] - paired_example["condition_2"]

paired_example
Out[4]:
unit condition_1 condition_2 paired_difference
0 A 4.2 3.9 0.3
1 B 3.6 3.4 0.2
2 C 5.1 4.8 0.3

Use a paired test only when rows refer to matched units.

Do not compare unrelated units as if they were pairs.

Aggregate repeated metric rows

Real evaluation outputs often have multiple rows per unit and condition.

Before paired testing, create one value per unit and condition. In this example, the short and long windows are averaged first.

In [5]:
model_name = "model_a"

model_raw = raw[raw["model"] == model_name].copy()

agg = (
    model_raw
    .groupby(["unit", "condition"], as_index=False)["metric"]
    .mean()
)

wide = (
    agg
    .pivot(index="unit", columns="condition", values="metric")
    .dropna()
)

wide
Out[5]:
condition condition_1 condition_2 condition_3
unit
unit_01 4.5 4.25 4.08
unit_02 4.1 3.85 3.68
unit_03 5.4 5.15 4.98
unit_04 4.9 4.65 4.48
unit_05 4.2 3.95 3.78
unit_06 5.2 4.95 4.78
unit_07 4.7 4.45 4.28
unit_08 5.6 5.35 5.18

After aggregation:

  • each row is one paired unit
  • each column is one condition
  • each cell is the average metric for that unit and condition

This wide table is now ready for paired statistical tests.

Mean and standard deviation over grouped values

Grouped summary cells can show mean and standard deviation.

Be clear about what the standard deviation is across. In this example, the standard deviation is across the rows inside each model, condition, and window group.

In [6]:
summary = (
    raw
    .groupby(["model", "condition", "window"], as_index=False)["metric"]
    .agg(metric_mean="mean", metric_std="std")
)

summary["cell"] = summary.apply(
    lambda row: f"{row['metric_mean']:.3f} ± {row['metric_std']:.3f}",
    axis=1
)

summary.head(12)
Out[6]:
model condition window metric_mean metric_std cell
0 model_a condition_1 long 4.865 0.549675 4.865 ± 0.550
1 model_a condition_1 short 4.785 0.549675 4.785 ± 0.550
2 model_a condition_2 long 4.615 0.549675 4.615 ± 0.550
3 model_a condition_2 short 4.535 0.549675 4.535 ± 0.550
4 model_a condition_3 long 4.445 0.549675 4.445 ± 0.550
5 model_a condition_3 short 4.365 0.549675 4.365 ± 0.550
6 model_b condition_1 long 4.715 0.549675 4.715 ± 0.550
7 model_b condition_1 short 4.635 0.549675 4.635 ± 0.550
8 model_b condition_2 long 4.465 0.549675 4.465 ± 0.550
9 model_b condition_2 short 4.385 0.549675 4.385 ± 0.550
10 model_b condition_3 long 4.295 0.549675 4.295 ± 0.550
11 model_b condition_3 short 4.215 0.549675 4.215 ± 0.550
In [7]:
summary_path = outputs_dir / "grouped_metric_summary.csv"
summary.to_csv(summary_path, index=False)

print(summary_path)
outputs/statistical_tests_reference/grouped_metric_summary.csv

If the input rows are already averaged values, the standard deviation is not a raw observation-level standard deviation.

The table label should state the aggregation unit clearly.

Friedman test across several paired conditions

The Friedman test is a rank-based repeated-measures test.

Use it when the same units are measured under three or more conditions.

In [8]:
stat, p_value = friedmanchisquare(
    wide["condition_1"].values,
    wide["condition_2"].values,
    wide["condition_3"].values,
)

print("Friedman statistic:", round(stat, 6))
print("p-value:", round(p_value, 6))
Friedman statistic: 16.0
p-value: 0.000335

The Friedman test gives one overall p-value.

It asks whether the condition columns have a systematic rank difference. It does not say which specific condition pair is different.

Friedman interpretation table

In [9]:
friedman_reference = pd.DataFrame({
    "part": [
        "Null hypothesis",
        "Alternative",
        "Input",
        "Output",
        "Next step",
    ],
    "meaning": [
        "The paired conditions have no systematic rank difference.",
        "At least one condition differs in rank pattern.",
        "Three or more paired columns.",
        "One statistic and one p-value.",
        "Use paired follow-up tests if the result is significant.",
    ],
})

friedman_reference
Out[9]:
part meaning
0 Null hypothesis The paired conditions have no systematic rank ...
1 Alternative At least one condition differs in rank pattern.
2 Input Three or more paired columns.
3 Output One statistic and one p-value.
4 Next step Use paired follow-up tests if the result is si...

Wilcoxon signed-rank test

The Wilcoxon signed-rank test is a paired two-condition test.

It starts with paired differences, ranks the absolute difference sizes, restores the signs, and checks whether signed ranks are balanced around zero.

In [10]:
a = wide["condition_1"]
b = wide["condition_2"]

stat, p_value = wilcoxon(a, b)

print("Wilcoxon statistic:", round(stat, 6))
print("p-value:", round(p_value, 6))
Wilcoxon statistic: 0.0
p-value: 0.007812

The default SciPy test is two-sided.

The null hypothesis is that the median paired difference is zero.

How Wilcoxon ranks differences

Wilcoxon does not only count positive and negative signs. It ranks the absolute difference sizes, then uses the signs.

In [11]:
example = pd.DataFrame({
    "a": [4.2, 3.6, 5.1, 4.0, 3.9],
    "b": [3.9, 3.4, 5.0, 4.5, 3.8],
})

example["diff"] = example["a"] - example["b"]
example["abs_diff"] = example["diff"].abs()
example["rank_abs_diff"] = example["abs_diff"].rank()

example
Out[11]:
a b diff abs_diff rank_abs_diff
0 4.2 3.9 0.3 0.3 4.0
1 3.6 3.4 0.2 0.2 3.0
2 5.1 5.0 0.1 0.1 1.0
3 4.0 4.5 -0.5 0.5 5.0
4 3.9 3.8 0.1 0.1 2.0

Positive differences mean a is larger than b.

Negative differences mean a is smaller than b.

For lower-is-better metrics, a negative difference can mean a performs better than b.

Wilcoxon assumptions

In [12]:
wilcoxon_assumptions = pd.DataFrame({
    "assumption": [
        "Paired observations",
        "Independent pairs",
        "Rankable differences",
        "Rough symmetry",
    ],
    "meaning": [
        "The same units appear in both conditions.",
        "One unit should not duplicate another unit.",
        "Differences should be numeric or ordinal enough to rank.",
        "The paired difference distribution should be roughly symmetric around the median.",
    ],
})

wilcoxon_assumptions
Out[12]:
assumption meaning
0 Paired observations The same units appear in both conditions.
1 Independent pairs One unit should not duplicate another unit.
2 Rankable differences Differences should be numeric or ordinal enoug...
3 Rough symmetry The paired difference distribution should be r...

Bonferroni correction

Bonferroni correction adjusts p-values when multiple tests are run.

It multiplies each raw p-value by the number of tests and caps the result at 1.0.

In [13]:
raw_p_values = [0.012, 0.030, 0.200, 0.004]
n_tests = len(raw_p_values)

corrected = [min(p * n_tests, 1.0) for p in raw_p_values]

pd.DataFrame({
    "raw_p_value": raw_p_values,
    "bonferroni_p_value": corrected,
})
Out[13]:
raw_p_value bonferroni_p_value
0 0.012 0.048
1 0.030 0.120
2 0.200 0.800
3 0.004 0.016

Bonferroni reduces false positives from repeated testing.

It is simple and strict. For many pairwise comparisons, report both raw and adjusted p-values.

Friedman plus Wilcoxon workflow

A common workflow is:

  1. run Friedman as the overall test
  2. if significant, run selected Wilcoxon follow-up tests
  3. adjust follow-up p-values
In [14]:
conditions = ["condition_1", "condition_2", "condition_3"]

stat, p_overall = friedmanchisquare(*[wide[c].values for c in conditions])

rows = []

if p_overall < 0.05:
    pairs = list(combinations(conditions, 2))

    for left, right in pairs:
        stat_pair, p_raw = wilcoxon(wide[left], wide[right])
        p_adjusted = min(p_raw * len(pairs), 1.0)

        rows.append({
            "left": left,
            "right": right,
            "wilcoxon_stat": stat_pair,
            "raw_p_value": p_raw,
            "bonferroni_p_value": p_adjusted,
        })

follow_up = pd.DataFrame(rows)

print("Friedman p-value:", p_overall)

follow_up
Friedman p-value: 0.0003354626279025119
Out[14]:
left right wilcoxon_stat raw_p_value bonferroni_p_value
0 condition_1 condition_2 0.0 0.007812 0.023438
1 condition_1 condition_3 0.0 0.007812 0.023438
2 condition_2 condition_3 0.0 0.007812 0.023438
In [15]:
follow_up_path = outputs_dir / "friedman_wilcoxon_follow_up.csv"
follow_up.to_csv(follow_up_path, index=False)

print(follow_up_path)
outputs/statistical_tests_reference/friedman_wilcoxon_follow_up.csv

Pairwise model comparison

To compare several models, build one column per model and keep rows matched by the same unit.

This example compares models under one selected condition.

In [16]:
selected_condition = "condition_2"

model_condition = (
    raw[raw["condition"] == selected_condition]
    .groupby(["unit", "model"], as_index=False)["metric"]
    .mean()
)

model_scores = (
    model_condition
    .pivot(index="unit", columns="model", values="metric")
    .dropna()
)

model_scores
Out[16]:
model model_a model_b model_c
unit
unit_01 4.25 4.1 4.33
unit_02 3.85 3.7 3.93
unit_03 5.15 5.0 5.23
unit_04 4.65 4.5 4.73
unit_05 3.95 3.8 4.03
unit_06 4.95 4.8 5.03
unit_07 4.45 4.3 4.53
unit_08 5.35 5.2 5.43
In [17]:
models = model_scores.columns.tolist()
pairwise_rows = []

for left, right in combinations(models, 2):
    s1 = model_scores[left].dropna()
    s2 = model_scores[right].dropna()

    idx = s1.index.intersection(s2.index)

    stat, p_value = wilcoxon(s1.loc[idx], s2.loc[idx])

    pairwise_rows.append({
        "left": left,
        "right": right,
        "n_units": len(idx),
        "wilcoxon_stat": stat,
        "raw_p_value": p_value,
    })

pairwise_model_tests = pd.DataFrame(pairwise_rows)
pairwise_model_tests["bonferroni_p_value"] = pairwise_model_tests["raw_p_value"].apply(
    lambda p: min(p * len(pairwise_model_tests), 1.0)
)

pairwise_model_tests
Out[17]:
left right n_units wilcoxon_stat raw_p_value bonferroni_p_value
0 model_a model_b 8 0.0 0.007812 0.023438
1 model_a model_c 8 0.0 0.007812 0.023438
2 model_b model_c 8 0.0 0.007812 0.023438
In [18]:
pairwise_path = outputs_dir / "pairwise_model_tests.csv"
pairwise_model_tests.to_csv(pairwise_path, index=False)

print(pairwise_path)
outputs/statistical_tests_reference/pairwise_model_tests.csv

Each row must refer to the same unit across models.

Use multiple-comparison correction when making strict significance claims across many pairs.

Choose best condition before model comparison

When each model has several conditions, one practical comparison is to choose each model's best average condition first.

Then compare those selected best configurations across matched units.

In [19]:
best_conditions = {}

for model_name in raw["model"].unique():
    sub = raw[raw["model"] == model_name]
    best_conditions[model_name] = (
        sub
        .groupby("condition", as_index=False)["metric"]
        .mean()
        .sort_values("metric")
        .iloc[0]["condition"]
    )

best_table = pd.DataFrame()

for model_name, condition_name in best_conditions.items():
    sub = raw[(raw["model"] == model_name) & (raw["condition"] == condition_name)]
    series = (
        sub
        .groupby("unit")["metric"]
        .mean()
        .rename(model_name)
    )
    best_table = pd.concat([best_table, series], axis=1)

print(best_conditions)

best_table
{'model_a': 'condition_3', 'model_b': 'condition_3', 'model_c': 'condition_3'}
Out[19]:
model_a model_b model_c
unit_01 4.08 3.93 4.16
unit_02 3.68 3.53 3.76
unit_03 4.98 4.83 5.06
unit_04 4.48 4.33 4.56
unit_05 3.78 3.63 3.86
unit_06 4.78 4.63 4.86
unit_07 4.28 4.13 4.36
unit_08 5.18 5.03 5.26
In [20]:
best_rows = []

for left, right in combinations(best_table.columns.tolist(), 2):
    idx = best_table[left].dropna().index.intersection(best_table[right].dropna().index)

    stat, p_value = wilcoxon(best_table.loc[idx, left], best_table.loc[idx, right])

    best_rows.append({
        "left": left,
        "left_condition": best_conditions[left],
        "right": right,
        "right_condition": best_conditions[right],
        "n_units": len(idx),
        "wilcoxon_stat": stat,
        "raw_p_value": p_value,
    })

best_condition_tests = pd.DataFrame(best_rows)
best_condition_tests["bonferroni_p_value"] = best_condition_tests["raw_p_value"].apply(
    lambda p: min(p * len(best_condition_tests), 1.0)
)

best_condition_tests
Out[20]:
left left_condition right right_condition n_units wilcoxon_stat raw_p_value bonferroni_p_value
0 model_a condition_3 model_b condition_3 8 0.0 0.007812 0.023438
1 model_a condition_3 model_c condition_3 8 0.0 0.007812 0.023438
2 model_b condition_3 model_c condition_3 8 0.0 0.007812 0.023438