XGBoost Regression Model Reference

This notebook is a practical reference for XGBoost regression.

It shows:

  • how to load a small public regression dataset
  • how boosted trees differ from Random Forest
  • how to train XGBRegressor
  • how to evaluate predictions
  • how to inspect feature importance
  • how to save outputs

Install packages

In [10]:
# pip install pandas numpy scikit-learn matplotlib xgboost

Create folders

In [2]:
from pathlib import Path

data_dir = Path("data")
output_dir = Path("outputs")

data_dir.mkdir(parents=True, exist_ok=True)
output_dir.mkdir(parents=True, exist_ok=True)

# print(data_dir.resolve())
# print(output_dir.resolve())

Load a small public dataset

In [3]:
import pandas as pd
from sklearn.datasets import load_diabetes

dataset = load_diabetes(as_frame=True)

df = dataset.frame.copy()
df.to_csv(data_dir / "diabetes_regression_sample.csv", index=False)

print(df.shape)
df.head()
(442, 11)
Out[3]:
        age       sex       bmi        bp        s1        s2        s3        s4        s5        s6  target
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401 -0.002592  0.019907 -0.017646   151.0
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412 -0.039493 -0.068332 -0.092204    75.0
2  0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356 -0.002592  0.002861 -0.025930   141.0
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038  0.034309  0.022688 -0.009362   206.0
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142 -0.002592 -0.031988 -0.046641   135.0

Prepare train and validation data

In [4]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=["target"])
y = df["target"]

X_train, X_valid, y_train, y_valid = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42
)

print(X_train.shape)
print(X_valid.shape)
(331, 10)
(111, 10)

Train an XGBoost regressor

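XGBoost builds trees sequentially: each new tree is fit to the errors left by the trees before it, and its prediction is scaled by learning_rate before being added to the running total. A Random Forest, by contrast, trains its trees independently on bootstrap samples and averages their predictions.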
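The sequential idea can be sketched by hand with plain scikit-learn trees. This is a simplified illustration of gradient boosting for squared error, not XGBoost's actual algorithm (XGBoost also uses second-order gradient information and regularization). The values of `learning_rate` and `n_rounds` below are illustrative assumptions, not settings taken from this notebook.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42
)

learning_rate = 0.1  # illustrative value (assumption)
n_rounds = 50        # illustrative value (assumption)

# Round 0: every prediction starts at the training mean.
base = y_train.mean()
train_pred = np.full(len(y_train), base)
valid_pred = np.full(len(y_valid), base)
train_rmse_start = np.sqrt(mean_squared_error(y_train, train_pred))

for _ in range(n_rounds):
    # Fit a shallow tree to what the current ensemble still gets wrong.
    residual = y_train - train_pred
    tree = DecisionTreeRegressor(max_depth=3, random_state=42)
    tree.fit(X_train, residual)

    # Add a shrunken copy of the tree's prediction to the running total.
    train_pred += learning_rate * tree.predict(X_train)
    valid_pred += learning_rate * tree.predict(X_valid)

train_rmse_end = np.sqrt(mean_squared_error(y_train, train_pred))
valid_rmse_end = np.sqrt(mean_squared_error(y_valid, valid_pred))

print(f"train RMSE: {train_rmse_start:.2f} -> {train_rmse_end:.2f}")
print(f"valid RMSE after {n_rounds} rounds: {valid_rmse_end:.2f}")
```

A Random Forest version of the same loop would fit every tree to `y_train` directly on a bootstrap sample and average the results, so no tree would depend on the ones before it.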

In [5]:
from xgboost import XGBRegressor

xgb = XGBRegressor(
    n_estimators=300,
    max_depth=3,
    learning_rate=0.03,
    subsample=0.9,
    colsample_bytree=0.9,
    reg_lambda=1.0,
    random_state=42,
    n_jobs=-1,
    tree_method="hist",
    objective="reg:squarederror"
)

xgb.fit(X_train, y_train)

pred = xgb.predict(X_valid)

pred[:5]
Out[5]:
array([152.07732, 201.66719, 159.84453, 278.6145 , 125.21937],
      dtype=float32)

Evaluate the model

In [6]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

metrics = pd.DataFrame([
    {
        "model": "XGBoost",
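        # Read the settings from the fitted model so the table cannot drift
        # from the hyperparameters that were actually used.
        "n_estimators": xgb.n_estimators,
        "max_depth": xgb.max_depth,
        "learning_rate": xgb.learning_rate,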
        "mae": mean_absolute_error(y_valid, pred),
        "rmse": np.sqrt(mean_squared_error(y_valid, pred)),
        "r2": r2_score(y_valid, pred)
    }
])

metrics
Out[6]:
     model  n_estimators  max_depth  learning_rate        mae       rmse        r2
0  XGBoost           300          3           0.03  43.667246  55.508984  0.442781
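As a reminder of what the three columns measure, here is a small worked example that computes MAE, RMSE, and R² by hand and checks them against scikit-learn. The four target values and predictions are made-up numbers chosen for easy arithmetic; they are not taken from the diabetes dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values (assumption), picked so the arithmetic is easy to follow.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 190.0, 280.0])

errors = y_true - y_hat  # [-10, 10, 10, -30]

# MAE: average absolute error, in the units of the target.
mae = np.mean(np.abs(errors))  # (10 + 10 + 10 + 30) / 4 = 15

# RMSE: square root of the average squared error; large misses weigh more.
rmse = np.sqrt(np.mean(errors ** 2))  # sqrt(1200 / 4) = sqrt(300)

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(errors ** 2)                    # 1200
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 12500
r2 = 1.0 - ss_res / ss_tot                      # 1 - 0.096 = 0.904

print(f"MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
# prints: MAE=15.000 RMSE=17.321 R2=0.904
```

Because RMSE squares the errors before averaging, the single miss of 30 pulls it above the MAE. The same gap appears in the results table above (RMSE 55.5 against MAE 43.7).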

Compare learning rates

In [7]:
rate_results = []

for rate in [0.01, 0.03, 0.05, 0.10]:
    model = XGBRegressor(
        n_estimators=300,
        max_depth=3,
        learning_rate=rate,
        subsample=0.9,
        colsample_bytree=0.9,
        reg_lambda=1.0,
        random_state=42,
        n_jobs=-1,
        tree_method="hist",
        objective="reg:squarederror"
    )

    model.fit(X_train, y_train)
    model_pred = model.predict(X_valid)

    rate_results.append({
        "learning_rate": rate,
        "mae": mean_absolute_error(y_valid, model_pred),
        "rmse": np.sqrt(mean_squared_error(y_valid, model_pred)),
        "r2": r2_score(y_valid, model_pred)
    })

learning_rate_comparison = pd.DataFrame(rate_results)
learning_rate_comparison
Out[7]:
   learning_rate        mae       rmse        r2
0           0.01  42.759235  53.331663  0.485637
1           0.03  43.667246  55.508984  0.442781
2           0.05  45.044640  57.439444  0.403350
3           0.10  45.636429  58.078716  0.389995

Inspect feature importance

In [8]:
feature_importance = pd.DataFrame({
    "feature": X_train.columns,
    "importance": xgb.feature_importances_
}).sort_values("importance", ascending=False)

feature_importance
Out[8]:
   feature  importance
2      bmi    0.268786
8       s5    0.189291
3       bp    0.097130
7       s4    0.079540
6       s3    0.075476
9       s6    0.070914
4       s1    0.066521
0      age    0.051980
5       s2    0.050685
1      sex    0.049677

Feature importance, lowest first

In [9]:
feature_importance.sort_values("importance", ascending=True)
Out[9]:
   feature  importance
1      sex    0.049677
5       s2    0.050685
0      age    0.051980
4       s1    0.066521
9       s6    0.070914
6       s3    0.075476
7       s4    0.079540
3       bp    0.097130
8       s5    0.189291
2      bmi    0.268786