I have been investigating a "hand-rolled" version of a gradient boosted regression tree. I find that the results agree very well with sklearn's GradientBoostingRegressor until I increase the tree-building loop count above a certain value. I am not sure whether this is a bug in my code or a feature of the algorithm manifesting itself, so I am looking for guidance on what may be happening. My full code listing, which uses the Boston housing data, is shown below, followed by the output when I change the loop parameter.
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_boston
X, y = load_boston(return_X_y=True)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
y_train, y_test = train_test_split(y, test_size=0.2, random_state=42)  # same random_state keeps the X and y splits aligned
alpha = 0.5
loop = 44
yhi_1=0
ypT=0
for i in range(loop):
    dt = DecisionTreeRegressor(max_depth=2, random_state=42)
    ri = y_train - yhi_1              # residuals of the current ensemble
    dt.fit(X_train, ri)
    hi = dt.predict(X_train)
    yhi = yhi_1 + alpha * hi          # update training predictions
    ypi = dt.predict(X_test) * alpha
    ypT = ypT + ypi                   # accumulate test predictions
    yhi_1 = yhi
r2Loop= metrics.r2_score(y_test,ypT)
print("dtL: R^2 = ", r2Loop)
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=loop, learning_rate=alpha,random_state=42,init="zero")
gbrt.fit(X_train,y_train)
gbrt.loss
y_pred = gbrt.predict(X_test)
r2GBRT= metrics.r2_score(y_test,y_pred)
print("GBT: R^2 = ", r2GBRT)
print("R2loop - GBT: ", r2Loop - r2GBRT)
When the parameter loop = 44, the output is
dtL: R^2 = 0.8702681499951852
GBT: R^2 = 0.8702681499951852
R2loop - GBT: 0.0
and the two agree. If I increase the loop parameter to loop = 45, I get
dtL: R^2 = 0.8726215419913225
GBT: R^2 = 0.8720222156381275
R2loop - GBT: 0.0005993263531949289
So the agreement between the two algorithms suddenly drops from 15 to 16 decimal places to a difference in the fourth decimal place. Any thoughts?
I believe there are two sources of differences here. The biggest one is the randomness in the DecisionTreeRegressor.fit method. While you set your random seeds to 42 in both the GradientBoostingRegressor and in all of the DecisionTreeRegressors, your DecisionTreeRegressor training loop does not duplicate the way GradientBoostingRegressor handles the random seed. In your loop, you re-seed on every iteration; in the GradientBoostingRegressor.fit method, the seed is (I assume) set only once at the beginning of training. I've modified your code as follows:
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import load_boston
import numpy as np
X, y = load_boston(return_X_y=True)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
y_train, y_test = train_test_split(y, test_size=0.2, random_state=42)
alpha = 0.5
loop = 45
yhi_1=0
ypT=0
np.random.seed(42)
for i in range(loop):
    dt = DecisionTreeRegressor(max_depth=2)   # no per-tree random_state; the tree uses numpy's global state seeded above
    ri = y_train - yhi_1
    dt.fit(X_train, ri)
    hi = dt.predict(X_train)
    yhi = yhi_1 + alpha * hi
    ypi = dt.predict(X_test) * alpha
    ypT = ypT + ypi
    yhi_1 = yhi
r2Loop= metrics.r2_score(y_test,ypT)
print("dtL: R^2 = ", r2Loop)
np.random.seed(42)
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=loop, learning_rate=alpha,init="zero")
gbrt.fit(X_train,y_train)
gbrt.loss
y_pred = gbrt.predict(X_test)
r2GBRT= metrics.r2_score(y_test,y_pred)
print("GBT: R^2 = ", r2GBRT)
print("R2loop - GBT: ", r2Loop - r2GBRT)
The only difference is in how I set the random seeds: I now use numpy to set the global seed once before each training run (once before the hand-rolled loop and once before fitting the GradientBoostingRegressor), rather than re-seeding every tree. With this change, I get the following output with loop = 45:
dtL: R^2 = 0.8720222156381277
GBT: R^2 = 0.8720222156381275
R2loop - GBT: 1.1102230246251565e-16
which is within reason for floating-point error (the other source of differences I referred to in my first sentence), and for many values of loop I see no difference at all.
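If you want to check this yourself, a quick way is to sweep a few values of loop and print the gap between the two R^2 scores. Below is a minimal sketch of how I would do that, assuming X_train, X_test, y_train, y_test and alpha are already defined as above; the helper names hand_rolled_r2 and sklearn_r2 (and the particular loop values) are just ones I made up for this comparison:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import metrics

def hand_rolled_r2(n_trees):
    # hypothetical helper: rebuilds the hand-rolled ensemble for a given tree count
    np.random.seed(42)                      # seed once, before the whole loop
    yhi_1, ypT = 0, 0
    for _ in range(n_trees):
        dt = DecisionTreeRegressor(max_depth=2)
        dt.fit(X_train, y_train - yhi_1)    # fit the next tree to the residuals
        yhi_1 = yhi_1 + alpha * dt.predict(X_train)
        ypT = ypT + alpha * dt.predict(X_test)
    return metrics.r2_score(y_test, ypT)

def sklearn_r2(n_trees):
    # hypothetical helper: same settings as the GradientBoostingRegressor above
    np.random.seed(42)                      # seed once, before fitting
    gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=n_trees,
                                     learning_rate=alpha, init="zero")
    gbrt.fit(X_train, y_train)
    return metrics.r2_score(y_test, gbrt.predict(X_test))

for n in (10, 44, 45, 100):
    print(n, hand_rolled_r2(n) - sklearn_r2(n))

If seeding really is the only remaining difference, the printed gaps should be zero or on the order of 1e-16 for every value of n.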