Python Statistical Tests Reference

This notebook is a practical reference for paired statistical comparison workflows in Python.

It covers:

how to shape paired comparison tables
how to aggregate repeated metric rows
how to compute grouped means and standard deviations
how to run Friedman tests across several conditions
how to run Wilcoxon signed-rank tests for paired comparisons
how to apply Bonferroni correction
how to build model comparison tables
how to save reusable outputs

The examples use generic units, models, conditions, windows, and metrics so the same patterns can be reused in model evaluation, forecasting, retrieval, experimentation, and small analysis pipelines.

Setup

The notebook uses pandas, numpy, and scipy.

The examples create local folders so each table can be saved and inspected after the notebook runs.

In [1]:

from pathlib import Path

import numpy as np
import pandas as pd
from scipy.stats import friedmanchisquare, wilcoxon
from itertools import combinations

data_dir = Path("data")
outputs_dir = Path("outputs") / "statistical_tests_reference"

data_dir.mkdir(parents=True, exist_ok=True)
outputs_dir.mkdir(parents=True, exist_ok=True)

# print(data_dir.resolve())
# print(outputs_dir.resolve())

Create sample paired data

Most paired tests need repeated measurements over the same units.

In this sample dataset:

unit is the repeated item being measured
model is the model being evaluated
condition is the model setting or method condition
window is a repeated sub-group
metric is a lower-is-better score

The same units appear across conditions, which makes paired testing possible.

In [2]:

units = [f"unit_{i:02d}" for i in range(1, 9)]
windows = ["short", "long"]
models = ["model_a", "model_b", "model_c"]
conditions = ["condition_1", "condition_2", "condition_3"]

rows = []

base_by_unit = {
    "unit_01": 4.2,
    "unit_02": 3.8,
    "unit_03": 5.1,
    "unit_04": 4.6,
    "unit_05": 3.9,
    "unit_06": 4.9,
    "unit_07": 4.4,
    "unit_08": 5.3,
}

condition_effect = {
    "condition_1": 0.20,
    "condition_2": -0.05,
    "condition_3": -0.22,
}

model_effect = {
    "model_a": 0.10,
    "model_b": -0.05,
    "model_c": 0.18,
}

window_effect = {
    "short": -0.04,
    "long": 0.04,
}

for unit in units:
    for model in models:
        for condition in conditions:
            for window in windows:
                metric = (
                    base_by_unit[unit]
                    + model_effect[model]
                    + condition_effect[condition]
                    + window_effect[window]
                )
                rows.append({
                    "unit": unit,
                    "model": model,
                    "condition": condition,
                    "window": window,
                    "metric": round(metric, 3),
                })

raw = pd.DataFrame(rows)

sample_path = data_dir / "statistical_tests_sample.csv"
raw.to_csv(sample_path, index=False)

print(raw.head(12))
print(raw.shape)
print(sample_path)

       unit    model    condition window  metric
0   unit_01  model_a  condition_1  short    4.46
1   unit_01  model_a  condition_1   long    4.54
2   unit_01  model_a  condition_2  short    4.21
3   unit_01  model_a  condition_2   long    4.29
4   unit_01  model_a  condition_3  short    4.04
5   unit_01  model_a  condition_3   long    4.12
6   unit_01  model_b  condition_1  short    4.31
7   unit_01  model_b  condition_1   long    4.39
8   unit_01  model_b  condition_2  short    4.06
9   unit_01  model_b  condition_2   long    4.14
10  unit_01  model_b  condition_3  short    3.89
11  unit_01  model_b  condition_3   long    3.97
(144, 5)
data/statistical_tests_sample.csv

Basic comparison table shape

A paired comparison table has one row per unit and one column per condition.

This is the shape required by many repeated-measures tests.

In [3]:

simple_df = pd.DataFrame({
    "unit": ["A", "B", "C", "D"],
    "condition_1": [4.2, 3.6, 5.1, 4.8],
    "condition_2": [3.9, 3.4, 4.8, 4.5],
    "condition_3": [3.7, 3.5, 4.7, 4.4],
})

simple_df

Out[3]:

	unit	condition_1	condition_2	condition_3
0	A	4.2	3.9	3.7
1	B	3.6	3.4	3.5
2	C	5.1	4.8	4.7
3	D	4.8	4.5	4.4

The unit can be a subject, query set, fold group, region, product, dataset, or repeated item.

The condition can be a model, feature set, prompt, method, or treatment.

What paired observation means

A paired observation means the same unit is measured under two or more conditions.

The comparison happens within the same unit. This removes unit-level differences from the comparison.

In [4]:

paired_example = pd.DataFrame({
    "unit": ["A", "B", "C"],
    "condition_1": [4.2, 3.6, 5.1],
    "condition_2": [3.9, 3.4, 4.8],
})

paired_example["paired_difference"] = paired_example["condition_1"] - paired_example["condition_2"]

paired_example

Out[4]:

	unit	condition_1	condition_2	paired_difference
0	A	4.2	3.9	0.3
1	B	3.6	3.4	0.2
2	C	5.1	4.8	0.3

Use a paired test only when rows refer to matched units.

Do not compare unrelated units as if they were pairs.

Aggregate repeated metric rows

Real evaluation outputs often have multiple rows per unit and condition.

Before paired testing, create one value per unit and condition. In this example, the short and long windows are averaged first.

In [5]:

model_name = "model_a"

model_raw = raw[raw["model"] == model_name].copy()

agg = (
    model_raw
    .groupby(["unit", "condition"], as_index=False)["metric"]
    .mean()
)

wide = (
    agg
    .pivot(index="unit", columns="condition", values="metric")
    .dropna()
)

wide

Out[5]:

condition	condition_1	condition_2	condition_3
unit
unit_01	4.5	4.25	4.08
unit_02	4.1	3.85	3.68
unit_03	5.4	5.15	4.98
unit_04	4.9	4.65	4.48
unit_05	4.2	3.95	3.78
unit_06	5.2	4.95	4.78
unit_07	4.7	4.45	4.28
unit_08	5.6	5.35	5.18

After aggregation:

each row is one paired unit
each column is one condition
each cell is the average metric for that unit and condition

This wide table is now ready for paired statistical tests.

Mean and standard deviation over grouped values

Grouped summary cells can show mean and standard deviation.

Be clear about what the standard deviation is across. In this example, the standard deviation is across the rows inside each model, condition, and window group.

In [6]:

summary = (
    raw
    .groupby(["model", "condition", "window"], as_index=False)["metric"]
    .agg(metric_mean="mean", metric_std="std")
)

summary["cell"] = summary.apply(
    lambda row: f"{row['metric_mean']:.3f} ± {row['metric_std']:.3f}",
    axis=1
)

summary.head(12)

Out[6]:

	model	condition	window	metric_mean	metric_std	cell
0	model_a	condition_1	long	4.865	0.549675	4.865 ± 0.550
1	model_a	condition_1	short	4.785	0.549675	4.785 ± 0.550
2	model_a	condition_2	long	4.615	0.549675	4.615 ± 0.550
3	model_a	condition_2	short	4.535	0.549675	4.535 ± 0.550
4	model_a	condition_3	long	4.445	0.549675	4.445 ± 0.550
5	model_a	condition_3	short	4.365	0.549675	4.365 ± 0.550
6	model_b	condition_1	long	4.715	0.549675	4.715 ± 0.550
7	model_b	condition_1	short	4.635	0.549675	4.635 ± 0.550
8	model_b	condition_2	long	4.465	0.549675	4.465 ± 0.550
9	model_b	condition_2	short	4.385	0.549675	4.385 ± 0.550
10	model_b	condition_3	long	4.295	0.549675	4.295 ± 0.550
11	model_b	condition_3	short	4.215	0.549675	4.215 ± 0.550

In [7]:

summary_path = outputs_dir / "grouped_metric_summary.csv"
summary.to_csv(summary_path, index=False)

print(summary_path)

outputs/statistical_tests_reference/grouped_metric_summary.csv

If the input rows are already averaged values, the standard deviation is not a raw observation-level standard deviation.

The table label should state the aggregation unit clearly.

Friedman test across several paired conditions

The Friedman test is a rank-based repeated-measures test.

Use it when the same units are measured under three or more conditions.

In [8]:

stat, p_value = friedmanchisquare(
    wide["condition_1"].values,
    wide["condition_2"].values,
    wide["condition_3"].values,
)

print("Friedman statistic:", round(stat, 6))
print("p-value:", round(p_value, 6))

Friedman statistic: 16.0
p-value: 0.000335

The Friedman test gives one overall p-value.

It asks whether the condition columns have a systematic rank difference. It does not say which specific condition pair is different.

Friedman interpretation table

In [9]:

friedman_reference = pd.DataFrame({
    "part": [
        "Null hypothesis",
        "Alternative",
        "Input",
        "Output",
        "Next step",
    ],
    "meaning": [
        "The paired conditions have no systematic rank difference.",
        "At least one condition differs in rank pattern.",
        "Three or more paired columns.",
        "One statistic and one p-value.",
        "Use paired follow-up tests if the result is significant.",
    ],
})

friedman_reference

Out[9]:

	part	meaning
0	Null hypothesis	The paired conditions have no systematic rank ...
1	Alternative	At least one condition differs in rank pattern.
2	Input	Three or more paired columns.
3	Output	One statistic and one p-value.
4	Next step	Use paired follow-up tests if the result is si...

Wilcoxon signed-rank test

The Wilcoxon signed-rank test is a paired two-condition test.

It starts with paired differences, ranks the absolute difference sizes, restores the signs, and checks whether signed ranks are balanced around zero.

In [10]:

a = wide["condition_1"]
b = wide["condition_2"]

stat, p_value = wilcoxon(a, b)

print("Wilcoxon statistic:", round(stat, 6))
print("p-value:", round(p_value, 6))

Wilcoxon statistic: 0.0
p-value: 0.007812

The default SciPy test is two-sided.

The null hypothesis is that the median paired difference is zero.

How Wilcoxon ranks differences

Wilcoxon does not only count positive and negative signs. It ranks the absolute difference sizes, then uses the signs.

In [11]:

example = pd.DataFrame({
    "a": [4.2, 3.6, 5.1, 4.0, 3.9],
    "b": [3.9, 3.4, 5.0, 4.5, 3.8],
})

example["diff"] = example["a"] - example["b"]
example["abs_diff"] = example["diff"].abs()
example["rank_abs_diff"] = example["abs_diff"].rank()

example

Out[11]:

	a	b	diff	abs_diff	rank_abs_diff
0	4.2	3.9	0.3	0.3	4.0
1	3.6	3.4	0.2	0.2	3.0
2	5.1	5.0	0.1	0.1	1.0
3	4.0	4.5	-0.5	0.5	5.0
4	3.9	3.8	0.1	0.1	2.0

Positive differences mean a is larger than b.

Negative differences mean a is smaller than b.

For lower-is-better metrics, a negative difference can mean a performs better than b.

Wilcoxon assumptions

In [12]:

wilcoxon_assumptions = pd.DataFrame({
    "assumption": [
        "Paired observations",
        "Independent pairs",
        "Rankable differences",
        "Rough symmetry",
    ],
    "meaning": [
        "The same units appear in both conditions.",
        "One unit should not duplicate another unit.",
        "Differences should be numeric or ordinal enough to rank.",
        "The paired difference distribution should be roughly symmetric around the median.",
    ],
})

wilcoxon_assumptions

Out[12]:

	assumption	meaning
0	Paired observations	The same units appear in both conditions.
1	Independent pairs	One unit should not duplicate another unit.
2	Rankable differences	Differences should be numeric or ordinal enoug...
3	Rough symmetry	The paired difference distribution should be r...

Bonferroni correction

Bonferroni correction adjusts p-values when multiple tests are run.

It multiplies each raw p-value by the number of tests and caps the result at 1.0.

In [13]:

raw_p_values = [0.012, 0.030, 0.200, 0.004]
n_tests = len(raw_p_values)

corrected = [min(p * n_tests, 1.0) for p in raw_p_values]

pd.DataFrame({
    "raw_p_value": raw_p_values,
    "bonferroni_p_value": corrected,
})

Out[13]:

	raw_p_value	bonferroni_p_value
0	0.012	0.048
1	0.030	0.120
2	0.200	0.800
3	0.004	0.016

Bonferroni reduces false positives from repeated testing.

It is simple and strict. For many pairwise comparisons, report both raw and adjusted p-values.

Friedman plus Wilcoxon workflow

A common workflow is:

run Friedman as the overall test
if significant, run selected Wilcoxon follow-up tests
adjust follow-up p-values

In [14]:

conditions = ["condition_1", "condition_2", "condition_3"]

stat, p_overall = friedmanchisquare(*[wide[c].values for c in conditions])

rows = []

if p_overall < 0.05:
    pairs = list(combinations(conditions, 2))

    for left, right in pairs:
        stat_pair, p_raw = wilcoxon(wide[left], wide[right])
        p_adjusted = min(p_raw * len(pairs), 1.0)

        rows.append({
            "left": left,
            "right": right,
            "wilcoxon_stat": stat_pair,
            "raw_p_value": p_raw,
            "bonferroni_p_value": p_adjusted,
        })

follow_up = pd.DataFrame(rows)

print("Friedman p-value:", p_overall)

follow_up

Friedman p-value: 0.0003354626279025119

Out[14]:

	left	right	raw_p_value	bonferroni_p_value
0	condition_1	condition_2	0.007812	0.023438
1	condition_1	condition_3	0.007812	0.023438
2	condition_2	condition_3	0.007812	0.023438

In [15]:

follow_up_path = outputs_dir / "friedman_wilcoxon_follow_up.csv"
follow_up.to_csv(follow_up_path, index=False)

print(follow_up_path)

outputs/statistical_tests_reference/friedman_wilcoxon_follow_up.csv

Pairwise model comparison

To compare several models, build one column per model and keep rows matched by the same unit.

This example compares models under one selected condition.

In [16]:

selected_condition = "condition_2"

model_condition = (
    raw[raw["condition"] == selected_condition]
    .groupby(["unit", "model"], as_index=False)["metric"]
    .mean()
)

model_scores = (
    model_condition
    .pivot(index="unit", columns="model", values="metric")
    .dropna()
)

model_scores

Out[16]:

model	model_a	model_b	model_c
unit
unit_01	4.25	4.1	4.33
unit_02	3.85	3.7	3.93
unit_03	5.15	5.0	5.23
unit_04	4.65	4.5	4.73
unit_05	3.95	3.8	4.03
unit_06	4.95	4.8	5.03
unit_07	4.45	4.3	4.53
unit_08	5.35	5.2	5.43

In [17]:

models = model_scores.columns.tolist()
pairwise_rows = []

for left, right in combinations(models, 2):
    s1 = model_scores[left].dropna()
    s2 = model_scores[right].dropna()

    idx = s1.index.intersection(s2.index)

    stat, p_value = wilcoxon(s1.loc[idx], s2.loc[idx])

    pairwise_rows.append({
        "left": left,
        "right": right,
        "n_units": len(idx),
        "wilcoxon_stat": stat,
        "raw_p_value": p_value,
    })

pairwise_model_tests = pd.DataFrame(pairwise_rows)
pairwise_model_tests["bonferroni_p_value"] = pairwise_model_tests["raw_p_value"].apply(
    lambda p: min(p * len(pairwise_model_tests), 1.0)
)

pairwise_model_tests

Out[17]:

	left	right	n_units	raw_p_value	bonferroni_p_value
0	model_a	model_b	8	0.007812	0.023438
1	model_a	model_c	8	0.007812	0.023438
2	model_b	model_c	8	0.007812	0.023438

In [18]:

pairwise_path = outputs_dir / "pairwise_model_tests.csv"
pairwise_model_tests.to_csv(pairwise_path, index=False)

print(pairwise_path)

outputs/statistical_tests_reference/pairwise_model_tests.csv

Each row must refer to the same unit across models.

Use multiple-comparison correction when making strict significance claims across many pairs.

Choose best condition before model comparison

When each model has several conditions, one practical comparison is to choose each model's best average condition first.

Then compare those selected best configurations across matched units.

In [19]:

best_conditions = {}

for model_name in raw["model"].unique():
    sub = raw[raw["model"] == model_name]
    best_conditions[model_name] = (
        sub
        .groupby("condition", as_index=False)["metric"]
        .mean()
        .sort_values("metric")
        .iloc[0]["condition"]
    )

best_table = pd.DataFrame()

for model_name, condition_name in best_conditions.items():
    sub = raw[(raw["model"] == model_name) & (raw["condition"] == condition_name)]
    series = (
        sub
        .groupby("unit")["metric"]
        .mean()
        .rename(model_name)
    )
    best_table = pd.concat([best_table, series], axis=1)

print(best_conditions)

best_table

{'model_a': 'condition_3', 'model_b': 'condition_3', 'model_c': 'condition_3'}

Out[19]:

	model_a	model_b	model_c
unit_01	4.08	3.93	4.16
unit_02	3.68	3.53	3.76
unit_03	4.98	4.83	5.06
unit_04	4.48	4.33	4.56
unit_05	3.78	3.63	3.86
unit_06	4.78	4.63	4.86
unit_07	4.28	4.13	4.36
unit_08	5.18	5.03	5.26

In [20]:

best_rows = []

for left, right in combinations(best_table.columns.tolist(), 2):
    idx = best_table[left].dropna().index.intersection(best_table[right].dropna().index)

    stat, p_value = wilcoxon(best_table.loc[idx, left], best_table.loc[idx, right])

    best_rows.append({
        "left": left,
        "left_condition": best_conditions[left],
        "right": right,
        "right_condition": best_conditions[right],
        "n_units": len(idx),
        "wilcoxon_stat": stat,
        "raw_p_value": p_value,
    })

best_condition_tests = pd.DataFrame(best_rows)
best_condition_tests["bonferroni_p_value"] = best_condition_tests["raw_p_value"].apply(
    lambda p: min(p * len(best_condition_tests), 1.0)
)

best_condition_tests

Out[20]:

	left	left_condition	right	right_condition	n_units	raw_p_value	bonferroni_p_value
0	model_a	condition_3	model_b	condition_3	8	0.007812	0.023438
1	model_a	condition_3	model_c	condition_3	8	0.007812	0.023438
2	model_b	condition_3	model_c	condition_3	8	0.007812	0.023438