KNN Regression Model Reference

This notebook is a practical reference for training and evaluating a k-nearest neighbors (KNN) regression model.

It shows:

  • how to load a small public regression dataset
  • how to split train and validation rows
  • why scaling matters for distance-based models
  • how to use StandardScaler inside a scikit-learn Pipeline
  • how to compare Manhattan and Euclidean distance
  • how to save predictions and metrics

Install packages

Run this once in your notebook environment if the packages are missing.

In [ ]:
%pip install pandas numpy scikit-learn matplotlib

Create folders

In [2]:
from pathlib import Path

data_dir = Path("data")
output_dir = Path("outputs")

data_dir.mkdir(parents=True, exist_ok=True)
output_dir.mkdir(parents=True, exist_ok=True)

# print(data_dir.resolve())
# print(output_dir.resolve())

Load a small public dataset

The diabetes regression dataset ships with scikit-learn, so loading it requires no network download. It is small (442 rows) and entirely numeric, with 10 features plus a continuous target, which makes it convenient for model reference notebooks.

In [3]:
import pandas as pd
from sklearn.datasets import load_diabetes

dataset = load_diabetes(as_frame=True)

df = dataset.frame.copy()
df.to_csv(data_dir / "diabetes_regression_sample.csv", index=False)

print(df.shape)
df.head()
(442, 11)
Out[3]:
        age       sex       bmi        bp        s1        s2        s3        s4        s5        s6  target
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401 -0.002592  0.019907 -0.017646   151.0
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412 -0.039493 -0.068332 -0.092204    75.0
2  0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356 -0.002592  0.002861 -0.025930   141.0
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038  0.034309  0.022688 -0.009362   206.0
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142 -0.002592 -0.031988 -0.046641   135.0

Prepare train and validation data

In [4]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=["target"])
y = df["target"]

X_train, X_valid, y_train, y_valid = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42
)

print(X_train.shape)
print(X_valid.shape)
(331, 10)
(111, 10)

Why scaling matters for KNN

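KNN predicts from the nearest training rows, so the distance calculation determines the result. If one feature has a much larger numeric range than another, it dominates the distance and therefore the neighbor search. Fit the scaler on the training data only, so that no information from the validation rows leaks into preprocessing. Placing StandardScaler inside a Pipeline handles this: fit() learns the scaling statistics from X_train, and predict() only applies them to X_valid. Note that load_diabetes returns features that are already mean-centered and scaled, so scaling changes little on this dataset; the pattern matters most on raw features with mixed units.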
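To make the dominance effect concrete, here is a minimal sketch on made-up data. The ages, incomes, and variable names are illustrative assumptions and are not part of the diabetes dataset. The sketch standardizes with statistics computed from the training rows only and then reuses them for the query row.

```python
import numpy as np

# Hypothetical toy data: columns are [age in years, income in dollars].
train_rows = np.array([
    [25.0, 50000.0],
    [60.0, 50100.0],
    [26.0, 52000.0],
])
query_row = np.array([26.0, 50200.0])

# Raw features: income differences swamp age differences.
raw_dist = np.sqrt(((train_rows - query_row) ** 2).sum(axis=1))
nearest_raw = int(np.argmin(raw_dist))

# Standardize with training statistics only, then reuse them for the query.
mean = train_rows.mean(axis=0)
std = train_rows.std(axis=0)  # population std (ddof=0), as StandardScaler uses
train_scaled = (train_rows - mean) / std
query_scaled = (query_row - mean) / std

scaled_dist = np.sqrt(((train_scaled - query_scaled) ** 2).sum(axis=1))
nearest_scaled = int(np.argmin(scaled_dist))

print(nearest_raw, nearest_scaled)
```

On the raw features the nearest row is index 1 (age 60), because its income is closest and the 34-year age gap barely registers. After standardizing, the nearest row is index 0 (age 25), which is close on both features.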

In [5]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

knn_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsRegressor(n_neighbors=7))
])

knn_pipeline.fit(X_train, y_train)

pred = knn_pipeline.predict(X_valid)

pred[:5]
Out[5]:
array([148.42857143, 147.28571429, 138.71428571, 239.        ,
       133.57142857])
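
For intuition about what the pipeline's final step computes: with uniform weights, a KNN regression prediction is the mean target of the k nearest training rows. The following is a rough sketch of that idea, not scikit-learn's implementation; the function name `knn_predict_one` and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def knn_predict_one(X_ref, y_ref, x_query, k=3, p=2):
    """Predict one value as the mean target of the k nearest reference rows."""
    # Minkowski distance from the query to every reference row.
    dists = np.sum(np.abs(X_ref - x_query) ** p, axis=1) ** (1.0 / p)
    # Indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    return y_ref[nearest].mean()

# Hypothetical one-feature toy data.
X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
y_toy = np.array([10.0, 20.0, 30.0, 100.0])

toy_pred = knn_predict_one(X_toy, y_toy, np.array([1.2]), k=3)
print(toy_pred)
```

The three rows nearest to 1.2 have targets 20, 30, and 10, so the prediction is their mean, 20.0. The distant row at 10.0 has no influence at k=3.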

Evaluate the model

In [6]:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_valid, pred)
rmse = np.sqrt(mean_squared_error(y_valid, pred))
r2 = r2_score(y_valid, pred)

metrics = pd.DataFrame([
    {
        "model": "KNN",
        "n_neighbors": 7,
        "mae": mae,
        "rmse": rmse,
        "r2": r2
    }
])

metrics
Out[6]:
  model  n_neighbors        mae       rmse        r2
0   KNN            7  41.189189  54.673767  0.459423
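
As a reading aid for the three metrics, the sketch below computes them by hand on a tiny made-up example. The arrays are illustrative assumptions and are unrelated to the diabetes results above.

```python
import numpy as np

# Hypothetical true values and predictions.
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.0, 5.0, 8.0, 12.0])

errors = y_true_toy - y_pred_toy

# MAE: average absolute error, in the units of the target.
mae_toy = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error; penalizes large errors more.
rmse_toy = np.sqrt(np.mean(errors ** 2))

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_toy = 1.0 - ss_res / ss_tot

print(mae_toy, rmse_toy, r2_toy)
```

Here the absolute errors are 1, 0, 1, and 3, so MAE is 1.25. The single error of 3 pushes RMSE to about 1.66, and R² is 1 - 11/20 = 0.45.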

Compare distance settings

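In KNeighborsRegressor, p is the power parameter of the Minkowski metric, which is the default metric. p=1 gives Manhattan distance and p=2 (the default) gives Euclidean distance, so the p=2 row below reproduces the metrics from the evaluation above.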

In [7]:
results = []

for p in [1, 2]:
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("model", KNeighborsRegressor(
            n_neighbors=7,
            weights="uniform",
            p=p
        ))
    ])

    model.fit(X_train, y_train)
    model_pred = model.predict(X_valid)

    results.append({
        "distance": "manhattan" if p == 1 else "euclidean",
        "p": p,
        "mae": mean_absolute_error(y_valid, model_pred),
        "rmse": np.sqrt(mean_squared_error(y_valid, model_pred)),
        "r2": r2_score(y_valid, model_pred)
    })

distance_results = pd.DataFrame(results)
distance_results
Out[7]:
    distance  p        mae       rmse        r2
0  manhattan  1  42.805663  54.485006  0.463150
1  euclidean  2  41.189189  54.673767  0.459423
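
The two rows above come from a single train/validation split, and the differences between them are small. One possible way to check whether the ordering holds up is k-fold cross-validation, sketched below. The choices of 5 folds and MAE as the score are assumptions for this sketch, and no results are shown here because they depend on the installed library versions. Because the scaler sits inside the Pipeline, it is refitted on the training portion of each fold.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = load_diabetes(return_X_y=True)

cv_mae = {}
for p in (1, 2):
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("model", KNeighborsRegressor(n_neighbors=7, p=p)),
    ])
    # scikit-learn scorers are "higher is better", so MAE is returned negated.
    scores = cross_val_score(
        pipe, X_all, y_all, cv=5, scoring="neg_mean_absolute_error"
    )
    cv_mae[p] = -scores.mean()

print(cv_mae)
```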

Manual Manhattan and Euclidean distance

In [8]:
a = np.array([1, 2, 3])
b = np.array([4, 6, 5])

manhattan = np.sum(np.abs(a - b))
euclidean = np.sqrt(np.sum((a - b) ** 2))

print(manhattan)
print(euclidean)
9
5.385164807134504
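
Both distances are special cases of the Minkowski distance, which raises each absolute difference to the power p, sums them, and takes the p-th root. The helper below is a minimal sketch of that formula; the name `minkowski_distance` is chosen for illustration and is not a scikit-learn function.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance between two 1-D arrays for a power p >= 1."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a = np.array([1, 2, 3])
b = np.array([4, 6, 5])

d_manhattan = minkowski_distance(a, b, p=1)  # same as sum(|a - b|)
d_euclidean = minkowski_distance(a, b, p=2)  # same as sqrt(sum((a - b)^2))

print(d_manhattan)
print(d_euclidean)
```

With the same vectors as above, the absolute differences are 3, 4, and 2, so p=1 gives 9 and p=2 gives the square root of 29, matching the manual results.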