Random Forest Regression Model Reference
This notebook is a practical reference for Random Forest regression.
It shows:
- how to load a small public regression dataset
- how to train a Random Forest baseline
- what the main parameters mean
- how to evaluate predictions
- how to inspect feature importance
- how to save outputs
Install packages
In [ ]:
pip install pandas numpy scikit-learn matplotlib
Create folders
In [1]:
from pathlib import Path
data_dir = Path("data")
output_dir = Path("outputs")
data_dir.mkdir(parents=True, exist_ok=True)
output_dir.mkdir(parents=True, exist_ok=True)
# print(data_dir.resolve())
# print(output_dir.resolve())
Load a small public dataset
In [2]:
import pandas as pd
from sklearn.datasets import load_diabetes
dataset = load_diabetes(as_frame=True)
df = dataset.frame.copy()
df.to_csv(data_dir / "diabetes_regression_sample.csv", index=False)
print(df.shape)
df.head()
(442, 11)
Out[2]:
| age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005670 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.025930 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 | 135.0 |
Prepare train and validation data
In [3]:
from sklearn.model_selection import train_test_split
X = df.drop(columns=["target"])
y = df["target"]
X_train, X_valid, y_train, y_valid = train_test_split(
X,
y,
test_size=0.25,
random_state=42
)
print(X_train.shape)
print(X_valid.shape)
(331, 10) (111, 10)
Train a Random Forest regressor
Random Forest builds many decision trees independently and averages their predictions.
In [4]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(
n_estimators=300,
max_depth=5,
min_samples_leaf=1,
random_state=42,
n_jobs=-1
)
rf.fit(X_train, y_train)
pred = rf.predict(X_valid)
pred[:5]
Out[4]:
array([147.99949333, 176.73596358, 149.98966143, 245.84321372,
118.44289185])
Evaluate the model
In [5]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
metrics = pd.DataFrame([
{
"model": "Random Forest",
"n_estimators": 300,
"max_depth": 5,
"min_samples_leaf": 1,
"mae": mean_absolute_error(y_valid, pred),
"rmse": np.sqrt(mean_squared_error(y_valid, pred)),
"r2": r2_score(y_valid, pred)
}
])
metrics
Out[5]:
| model | n_estimators | max_depth | min_samples_leaf | mae | rmse | r2 | |
|---|---|---|---|---|---|---|---|
| 0 | Random Forest | 300 | 5 | 1 | 42.191033 | 52.969084 | 0.492607 |
Inspect feature importance
Random Forest can report impurity-based feature importance. Treat it as a quick diagnostic, not as a complete causal explanation.
In [6]:
feature_importance = pd.DataFrame({
"feature": X_train.columns,
"importance": rf.feature_importances_
}).sort_values("importance", ascending=False)
feature_importance
Out[6]:
| feature | importance | |
|---|---|---|
| 2 | bmi | 0.403142 |
| 8 | s5 | 0.250489 |
| 3 | bp | 0.103107 |
| 9 | s6 | 0.052068 |
| 0 | age | 0.041128 |
| 6 | s3 | 0.040162 |
| 5 | s2 | 0.039256 |
| 4 | s1 | 0.036932 |
| 7 | s4 | 0.027597 |
| 1 | sex | 0.006120 |
Plot feature importance
In [7]:
feature_importance.sort_values("importance", ascending=True)
Out[7]:
| feature | importance | |
|---|---|---|
| 1 | sex | 0.006120 |
| 7 | s4 | 0.027597 |
| 4 | s1 | 0.036932 |
| 5 | s2 | 0.039256 |
| 6 | s3 | 0.040162 |
| 0 | age | 0.041128 |
| 9 | s6 | 0.052068 |
| 3 | bp | 0.103107 |
| 8 | s5 | 0.250489 |
| 2 | bmi | 0.403142 |
Compare depth settings
In [8]:
depth_results = []
for depth in [2, 3, 5, 8, None]:
model = RandomForestRegressor(
n_estimators=300,
max_depth=depth,
min_samples_leaf=1,
random_state=42,
n_jobs=-1
)
model.fit(X_train, y_train)
model_pred = model.predict(X_valid)
depth_results.append({
"max_depth": str(depth),
"mae": mean_absolute_error(y_valid, model_pred),
"rmse": np.sqrt(mean_squared_error(y_valid, model_pred)),
"r2": r2_score(y_valid, model_pred)
})
depth_comparison = pd.DataFrame(depth_results)
depth_comparison
Out[8]:
| max_depth | mae | rmse | r2 | |
|---|---|---|---|---|
| 0 | 2 | 43.480285 | 53.808550 | 0.476397 |
| 1 | 3 | 41.925885 | 52.445176 | 0.502595 |
| 2 | 5 | 42.191033 | 52.969084 | 0.492607 |
| 3 | 8 | 42.949263 | 53.643369 | 0.479607 |
| 4 | None | 43.096336 | 53.925084 | 0.474127 |