KNN Regression Model Reference
This notebook is a practical reference for training and evaluating a K-Nearest Neighbors (KNN) regression model.
It shows:
- how to load a small public regression dataset
- how to split train and validation rows
- why scaling matters for distance-based models
- how to use StandardScaler inside a scikit-learn Pipeline
- how to compare Manhattan and Euclidean distance
- how to save predictions and metrics
Install packages
Run this once in your notebook environment if the packages are missing.
In [ ]:
%pip install pandas numpy scikit-learn matplotlib
Create folders
In [2]:
from pathlib import Path
data_dir = Path("data")
output_dir = Path("outputs")
data_dir.mkdir(parents=True, exist_ok=True)
output_dir.mkdir(parents=True, exist_ok=True)
# print(data_dir.resolve())
# print(output_dir.resolve())
Load a small public dataset
The diabetes regression dataset is included with scikit-learn. It is small, numeric, and useful for model reference notebooks because it does not require a network download.
In [3]:
import pandas as pd
from sklearn.datasets import load_diabetes
dataset = load_diabetes(as_frame=True)
df = dataset.frame.copy()
df.to_csv(data_dir / "diabetes_regression_sample.csv", index=False)
print(df.shape)
df.head()
(442, 11)
Out[3]:
| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005670 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.025930 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 | 135.0 |
Prepare train and validation data
In [4]:
from sklearn.model_selection import train_test_split
X = df.drop(columns=["target"])
y = df["target"]
# Hold out 25% of the rows for validation; the fixed seed keeps the split reproducible.
X_train, X_valid, y_train, y_valid = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42,
)
print(X_train.shape)
print(X_valid.shape)
(331, 10)
(111, 10)
Why scaling matters for KNN
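KNN predicts from the nearest training rows, so its results depend entirely on how distance is measured. If one feature has a much larger numeric range than the others, it dominates the distance and the neighbor search effectively ignores the rest. The scaler should be fitted on the training data only, so that validation rows do not influence the scaling statistics. Placing StandardScaler inside a Pipeline handles this: fit learns the scaling from X_train, and predict only applies it to new rows. Note that the features returned by load_diabetes are already mean-centered and scaled to a common range, so the scaler should change little on this particular dataset; it matters far more on raw data with mixed units.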
In [5]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
# The scaler runs first, then KNN sees only the standardized features.
knn_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsRegressor(n_neighbors=7)),
])
knn_pipeline.fit(X_train, y_train)
pred = knn_pipeline.predict(X_valid)
pred[:5]
Out[5]:
array([148.42857143, 147.28571429, 138.71428571, 239. ,
133.57142857])
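To make the scaling point concrete, the sketch below uses a hypothetical five-row table with income in dollars and age in years. These values are invented for illustration and are not part of the diabetes data. It standardizes with plain NumPy (subtract the column mean, divide by the population standard deviation, which mirrors StandardScaler's default behavior) and shows the nearest neighbor of the first row changing once both columns are on the same scale.

```python
import numpy as np

# Hypothetical toy data: column 0 is income in dollars, column 1 is age in years.
X_toy = np.array([
    [50_000.0, 25.0],   # query row
    [51_000.0, 60.0],   # close in income, far in age
    [52_000.0, 26.0],   # slightly farther in income, close in age
    [90_000.0, 40.0],
    [30_000.0, 45.0],
])


def distances_to_first_row(rows):
    """Euclidean distance from row 0 to every other row."""
    return np.sqrt(((rows[1:] - rows[0]) ** 2).sum(axis=1))


# Raw features: the income column dominates because its values are much larger.
raw_nearest = int(np.argmin(distances_to_first_row(X_toy)))

# Standardized features: each column has mean 0 and standard deviation 1.
X_toy_scaled = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
scaled_nearest = int(np.argmin(distances_to_first_row(X_toy_scaled)))

# Indices are relative to the candidate rows (rows 1 to 4 of X_toy).
print("nearest candidate on raw data:   ", raw_nearest)
print("nearest candidate on scaled data:", scaled_nearest)
```

On the raw data the 60-year-old row wins because a 1,000 dollar gap looks smaller than a 2,000 dollar gap and the 35-year age difference barely registers. After standardizing, the row that is close in both income and age becomes the nearest neighbor.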
Evaluate the model
In [6]:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mae = mean_absolute_error(y_valid, pred)
rmse = np.sqrt(mean_squared_error(y_valid, pred))
r2 = r2_score(y_valid, pred)
metrics = pd.DataFrame([
    {
        "model": "KNN",
        "n_neighbors": 7,
        "mae": mae,
        "rmse": rmse,
        "r2": r2,
    }
])
metrics
Out[6]:
| | model | n_neighbors | mae | rmse | r2 |
|---|---|---|---|---|---|
| 0 | KNN | 7 | 41.189189 | 54.673767 | 0.459423 |
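For reference, the three metrics can be reproduced by hand. The sketch below uses four made-up target and prediction values (not taken from the diabetes data) and computes MAE, RMSE, and R² with NumPy only, following the standard definitions that the scikit-learn functions implement.

```python
import numpy as np

# Hypothetical values chosen so the arithmetic is easy to follow.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 190.0, 270.0])

errors = y_true - y_hat

# MAE: average absolute error, in the same units as the target.
mae_manual = np.mean(np.abs(errors))

# RMSE: square root of the average squared error; penalizes large misses more.
rmse_manual = np.sqrt(np.mean(errors ** 2))

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mae_manual)   # 12.5
print(rmse_manual)  # sqrt(175), about 13.23
print(r2_manual)    # 1 - 700 / 12500 = 0.944
```

An R² of 0 corresponds to always predicting the mean of the validation targets, which is a useful mental baseline when reading the table above.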
Compare distance settings
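KNeighborsRegressor uses the Minkowski metric by default, and its p parameter sets the exponent: p=1 gives Manhattan distance and p=2 (the default) gives Euclidean distance.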
In [7]:
results = []
for p in [1, 2]:
    # Same pipeline as before; only the Minkowski exponent p changes.
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("model", KNeighborsRegressor(
            n_neighbors=7,
            weights="uniform",
            p=p,
        )),
    ])
    model.fit(X_train, y_train)
    model_pred = model.predict(X_valid)
    results.append({
        "distance": "manhattan" if p == 1 else "euclidean",
        "p": p,
        "mae": mean_absolute_error(y_valid, model_pred),
        "rmse": np.sqrt(mean_squared_error(y_valid, model_pred)),
        "r2": r2_score(y_valid, model_pred),
    })
distance_results = pd.DataFrame(results)
distance_results
Out[7]:
| | distance | p | mae | rmse | r2 |
|---|---|---|---|---|---|
| 0 | manhattan | 1 | 42.805663 | 54.485006 | 0.463150 |
| 1 | euclidean | 2 | 41.189189 | 54.673767 | 0.459423 |
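The outputs folder created at the start can hold the predictions and metrics. The following is a minimal sketch of one way to save them. The file names knn_predictions.csv and knn_metrics.csv and the column names row_id, actual, and predicted are assumptions made for this example, and the cell rebuilds the split and pipeline so that it runs on its own.

```python
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

output_dir = Path("outputs")
output_dir.mkdir(parents=True, exist_ok=True)

# Rebuild the same split and pipeline as the cells above.
df = load_diabetes(as_frame=True).frame
X = df.drop(columns=["target"])
y = df["target"]
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42
)
knn_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsRegressor(n_neighbors=7)),
])
knn_pipeline.fit(X_train, y_train)
pred = knn_pipeline.predict(X_valid)

# One row per validation example; row_id is the original DataFrame index.
predictions = pd.DataFrame({
    "row_id": X_valid.index,
    "actual": y_valid.to_numpy(),
    "predicted": pred,
})
metrics = pd.DataFrame([{
    "model": "KNN",
    "n_neighbors": 7,
    "mae": mean_absolute_error(y_valid, pred),
    "rmse": np.sqrt(mean_squared_error(y_valid, pred)),
    "r2": r2_score(y_valid, pred),
}])

# Assumed file names; adjust to your own naming scheme.
predictions.to_csv(output_dir / "knn_predictions.csv", index=False)
metrics.to_csv(output_dir / "knn_metrics.csv", index=False)
```

Keeping row_id in the predictions file makes it possible to join the predictions back to the original rows later.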
Manual Manhattan and Euclidean distance
In [8]:
a = np.array([1, 2, 3])
b = np.array([4, 6, 5])
manhattan = np.sum(np.abs(a - b))
euclidean = np.sqrt(np.sum((a - b) ** 2))
print(manhattan)
print(euclidean)
9
5.385164807134504
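Both distances are special cases of the Minkowski distance, which is the metric that the p parameter of KNeighborsRegressor controls. The helper below, minkowski_distance, is a hypothetical name written for this notebook rather than a library function; it is a small sketch that reproduces the two manual results from a single formula.

```python
import numpy as np


def minkowski_distance(u, v, p):
    """Minkowski distance of order p: (sum of |u_i - v_i|^p) ** (1 / p)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.sum(np.abs(u - v) ** p) ** (1.0 / p))


a = np.array([1, 2, 3])
b = np.array([4, 6, 5])

manhattan_p1 = minkowski_distance(a, b, p=1)  # 3 + 4 + 2 = 9
euclidean_p2 = minkowski_distance(a, b, p=2)  # sqrt(9 + 16 + 4) = sqrt(29)

print(manhattan_p1)
print(euclidean_p2)
```

Larger values of p put more weight on the single largest coordinate difference, which is one reason the two settings can rank neighbors differently.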