KNN Regression Model Reference

This notebook is a practical reference for training and evaluating a k-nearest neighbors (KNN) regression model.

It shows:

how to load a small public regression dataset
how to split train and validation rows
why scaling matters for distance-based models
how to use StandardScaler inside a scikit-learn Pipeline
how to compare Manhattan and Euclidean distance
how to save predictions and metrics

Install packages

Run this once in your notebook environment if the packages are missing.

In [ ]:

# %pip installs into the environment of the running kernel
%pip install pandas numpy scikit-learn matplotlib

Create folders

In [2]:

from pathlib import Path

data_dir = Path("data")
output_dir = Path("outputs")

data_dir.mkdir(parents=True, exist_ok=True)
output_dir.mkdir(parents=True, exist_ok=True)

# print(data_dir.resolve())
# print(output_dir.resolve())

Load a small public dataset

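The diabetes regression dataset is included with scikit-learn. It is small (442 rows, 10 numeric features plus the target) and useful for model reference notebooks because it does not require a network download. The features are already mean-centered and scaled by scikit-learn, which is why the values in the preview below are all small.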

In [3]:

import pandas as pd
from sklearn.datasets import load_diabetes

dataset = load_diabetes(as_frame=True)

df = dataset.frame.copy()
df.to_csv(data_dir / "diabetes_regression_sample.csv", index=False)

print(df.shape)
df.head()

(442, 11)

Out[3]:

	age	sex	bmi	bp	s1	s2	s3	s4	s5	s6	target
0	0.038076	0.050680	0.061696	0.021872	-0.044223	-0.034821	-0.043401	-0.002592	0.019907	-0.017646	151.0
1	-0.001882	-0.044642	-0.051474	-0.026328	-0.008449	-0.019163	0.074412	-0.039493	-0.068332	-0.092204	75.0
2	0.085299	0.050680	0.044451	-0.005670	-0.045599	-0.034194	-0.032356	-0.002592	0.002861	-0.025930	141.0
3	-0.089063	-0.044642	-0.011595	-0.036656	0.012191	0.024991	-0.036038	0.034309	0.022688	-0.009362	206.0
4	0.005383	-0.044642	-0.036385	0.021872	0.003935	0.015596	0.008142	-0.002592	-0.031988	-0.046641	135.0

Prepare train and validation data

In [4]:

from sklearn.model_selection import train_test_split

X = df.drop(columns=["target"])
y = df["target"]

X_train, X_valid, y_train, y_valid = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42
)

print(X_train.shape)
print(X_valid.shape)

(331, 10)
(111, 10)

Why scaling matters for KNN

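KNN predicts from the training rows that are closest to each query row, so every feature enters the model through a distance calculation. If one feature has a much larger scale than another, it contributes most of the distance and can dominate the neighbor search. Fit the scaler on the training data only, so that validation rows do not influence the scaling statistics. Putting StandardScaler inside a Pipeline handles this: fit learns the scaling from the training rows, and predict reuses it on new rows.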
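The scale effect described above can be seen in isolation with a small sketch. The rows below are hypothetical toy values (an income column in dollars and an age column in years), not part of the diabetes data, and the helper `nearest_to_first` is introduced only for this illustration.

```python
import numpy as np

# Hypothetical toy rows: column 0 is income in dollars, column 1 is age in years.
X_toy = np.array([
    [50_000.0, 25.0],  # row 0: the query row
    [50_500.0, 60.0],  # row 1: close in income, far in age
    [51_000.0, 26.0],  # row 2: close in age, a bit further in income
    [48_000.0, 40.0],
    [53_000.0, 45.0],
])

def nearest_to_first(rows):
    """Return the index of the row closest to row 0 (Euclidean distance)."""
    dists = np.linalg.norm(rows[1:] - rows[0], axis=0 + 1)
    return int(np.argmin(dists)) + 1

# Raw units: income differences are in the hundreds, so income decides the neighbor.
raw_nearest = nearest_to_first(X_toy)

# Standardized units: each column has mean 0 and standard deviation 1,
# so age differences now count as much as income differences.
X_toy_scaled = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
scaled_nearest = nearest_to_first(X_toy_scaled)

print(raw_nearest)     # 1
print(scaled_nearest)  # 2
```

Before scaling, the 35-year age gap to row 1 barely registers next to a 500-dollar income gap. After scaling, row 2 becomes the nearest neighbor. For simplicity the toy statistics are computed over all five rows; in a real workflow they come from the training rows only.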

In [5]:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

knn_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsRegressor(n_neighbors=7))
])

knn_pipeline.fit(X_train, y_train)

pred = knn_pipeline.predict(X_valid)

pred[:5]

Out[5]:

array([148.42857143, 147.28571429, 138.71428571, 239.        ,
       133.57142857])
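The predictions above have fractional parts that are multiples of 1/7, because with uniform weights each prediction is the plain mean of the 7 nearest training targets. The sketch below checks this by hand. It rebuilds the same split and pipeline so it can run on its own; the names `pipe`, `neighbor_idx`, and `manual_pred` are chosen for this illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Rebuild the same split and pipeline so this cell is self-contained.
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42
)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsRegressor(n_neighbors=7)),
])
pipe.fit(X_train, y_train)

# Scale the validation rows with the training-set statistics, then look up
# the positions of the 7 nearest training rows for each validation row.
X_valid_scaled = pipe.named_steps["scaler"].transform(X_valid)
_, neighbor_idx = pipe.named_steps["model"].kneighbors(X_valid_scaled)

# With weights="uniform" (the default), the prediction is the mean
# of the neighbors' training targets.
manual_pred = y_train.to_numpy()[neighbor_idx].mean(axis=1)

# Expected to agree with the pipeline's own predictions.
print(np.allclose(manual_pred, pipe.predict(X_valid)))
```

Using `kneighbors` from the fitted model, rather than a separate distance calculation, keeps the neighbor choice identical to the one the model makes internally.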

Evaluate the model

In [6]:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_valid, pred)
rmse = np.sqrt(mean_squared_error(y_valid, pred))
r2 = r2_score(y_valid, pred)

metrics = pd.DataFrame([
    {
        "model": "KNN",
        "n_neighbors": 7,
        "mae": mae,
        "rmse": rmse,
        "r2": r2
    }
])

metrics

Out[6]:

	model	n_neighbors	mae	rmse	r2
0	KNN	7	41.189189	54.673767	0.459423
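For reference, the three metrics in the table can be written out directly with NumPy. This is a minimal sketch using hypothetical toy values rather than the diabetes predictions, so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Hypothetical toy values, not taken from the diabetes data.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 190.0, 270.0])

errors = y_true - y_hat  # [-10, 10, 10, -20]

# Mean absolute error: average size of the errors, in target units.
mae_manual = np.mean(np.abs(errors))

# Root mean squared error: like MAE, but large errors weigh more.
rmse_manual = np.sqrt(np.mean(errors ** 2))

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mae_manual)   # 12.5
print(rmse_manual)  # about 13.23, the square root of 175
print(r2_manual)    # about 0.944
```

MAE and RMSE are in the same units as the target, while R^2 is unitless: 1.0 means perfect predictions and 0.0 means no better than always predicting the mean.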

Compare distance settings

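KNeighborsRegressor uses the Minkowski metric by default, and the p parameter selects its order: p=1 gives Manhattan distance and p=2 gives Euclidean distance. Because p=2 is the default, the Euclidean row below reproduces the metrics from the previous section.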
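The role of p can be made concrete with a small sketch of the Minkowski distance, which sums the absolute coordinate differences raised to the power p and then takes the p-th root. The helper `minkowski_distance` is a hypothetical name written for this illustration, not a scikit-learn function.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two 1-D points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

origin = [0, 0]
point = [3, 4]

# p=1: sum of absolute differences (Manhattan), 3 + 4
print(minkowski_distance(origin, point, p=1))  # 7.0

# p=2: straight-line distance (Euclidean), sqrt(9 + 16)
print(minkowski_distance(origin, point, p=2))  # 5.0
```

Manhattan distance adds up the differences feature by feature, while Euclidean distance squares them first, so a single large difference has more influence when p=2.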

In [7]:

results = []

for p in [1, 2]:
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("model", KNeighborsRegressor(
            n_neighbors=7,
            weights="uniform",
            p=p
        ))
    ])

    model.fit(X_train, y_train)
    model_pred = model.predict(X_valid)

    results.append({
        "distance": "manhattan" if p == 1 else "euclidean",
        "p": p,
        "mae": mean_absolute_error(y_valid, model_pred),
        "rmse": np.sqrt(mean_squared_error(y_valid, model_pred)),
        "r2": r2_score(y_valid, model_pred)
    })

distance_results = pd.DataFrame(results)
distance_results

Out[7]:

	distance	p	mae	rmse	r2
0	manhattan	1	42.805663	54.485006	0.463150
1	euclidean	2	41.189189	54.673767	0.459423

Manual Manhattan and Euclidean distance

In [8]:

a = np.array([1, 2, 3])
b = np.array([4, 6, 5])

manhattan = np.sum(np.abs(a - b))
euclidean = np.sqrt(np.sum((a - b) ** 2))

print(manhattan)
print(euclidean)

9
5.385164807134504
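Save predictions and metrics

One way to save the validation predictions and the metrics table is sketched below. It rebuilds the split and the pipeline so it runs on its own. The file names `knn_predictions.csv` and `knn_metrics.csv` and the column names are assumptions made for this example.

```python
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

output_dir = Path("outputs")
output_dir.mkdir(parents=True, exist_ok=True)

# Same data, split, and model settings as the cells above.
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42
)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsRegressor(n_neighbors=7)),
])
pipe.fit(X_train, y_train)
pred = pipe.predict(X_valid)

# One row per validation example, keyed by the original row index.
predictions = pd.DataFrame(
    {"actual": y_valid, "predicted": pred}, index=y_valid.index
)
predictions["error"] = predictions["predicted"] - predictions["actual"]
predictions.to_csv(output_dir / "knn_predictions.csv", index_label="row_id")

# One-row metrics table, matching the layout used in the evaluation cell.
metrics = pd.DataFrame([{
    "model": "KNN",
    "n_neighbors": 7,
    "mae": mean_absolute_error(y_valid, pred),
    "rmse": np.sqrt(mean_squared_error(y_valid, pred)),
    "r2": r2_score(y_valid, pred),
}])
metrics.to_csv(output_dir / "knn_metrics.csv", index=False)
```

Keeping the original row index in the predictions file makes it possible to join the errors back to the feature rows later.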
