Reputation: 2853
I'm trying to implement the Huber loss as a custom objective in LightGBM, as a step towards a customized MAPE loss. Below is my code. However, when I run it I get zeros for all predictions. What is wrong with the code? It seems that some scaling could help a bit with learning, but I can't find any guidelines on the internet about how it should be applied inside a custom loss. Could you please help me with that?
def my_loss(preds, dtrain):
    y_true = dtrain.get_label()
    d = (preds - y_true)
    h = 1  # h is delta in the graphic
    scale = 1 + (d / h) ** 2
    scale_sqrt = np.sqrt(scale)
    grad = d / scale_sqrt
    hess = 1 / scale / scale_sqrt
    hess = np.ones(len(preds))
    return grad, hess
metrics = []
for i in my_cv:
    X_train = X.loc[i[0], :]
    y_train = y.loc[i[0]]
    X_test = X.loc[i[1], :]
    y_test = y.loc[i[1]]

    dtrain = xgb.Dataset(X_train, label=y_train, free_raw_data=False)
    params = {'max_depth': 10, 'learning_rate': 0.05, 'objective': None,
              'num_leaves': 150, 'min_child_samples': 5, 'nround': 100,
              'monotone_constraints': lst_mon}
    mm = xgb.train(params, dtrain, fobj=my_loss)
    y_pred = mm.predict(X_train)
Upvotes: 2
Views: 2958
Reputation: 22021
The correct function:
def my_loss(preds, dtrain):
    y_true = dtrain.get_label()
    d = (preds - y_true)
    h = 1  # h is delta in the graphic
    scale = 1 + (d / h) ** 2
    scale_sqrt = np.sqrt(scale)
    grad = d / scale_sqrt
    hess = 1 / scale / scale_sqrt
    return grad, hess
I removed hess = np.ones(len(preds)), which was overwriting the real hessian.
Upvotes: 2
Reputation: 1425
Huber loss is defined as
L_delta(a) = 0.5 * a**2                   if |a| <= delta
L_delta(a) = delta * (|a| - 0.5 * delta)  otherwise
The loss you've implemented is its smooth approximation, the Pseudo-Huber loss:
L_delta(a) = delta**2 * (sqrt(1 + (a/delta)**2) - 1)
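For reference, its first and second derivatives (exactly the grad and hess computed in the code above) are
L'(a) = a / sqrt(1 + (a/delta)**2)
L''(a) = 1 / (1 + (a/delta)**2)**1.5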
The problem with this loss is that its second derivative gets very close to zero for large residuals. To speed up their algorithm, lightgbm uses Newton's method to approximate the optimal leaf value:
y = - L' / L''
(See this blogpost for details).
I.e. they find the point where a parabola with the same gradient and second derivative would reach its minimum. If the loss function is quadratic, this gives the exact optimal value. For the Pseudo-Huber loss, however, the Newton step always overshoots:
|- L'(a) / L''(a)| = (1 + (a/delta)**2) * |a| > |a|,
so the point you land on, a - L'(a)/L''(a) = -a * (a/delta)**2, is even farther from the minimum than the value you started with whenever |a| > delta, and the iteration diverges for large residuals.
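A quick numerical check of this (plain NumPy with delta = 1; nothing here is LightGBM-specific):
import numpy as np

delta = 1.0
a = np.array([0.5, 2.0, 10.0])        # current residuals preds - y_true
scale = 1 + (a / delta) ** 2
grad = a / np.sqrt(scale)             # L'(a)
hess = 1 / (scale * np.sqrt(scale))   # L''(a)
newton_point = a - grad / hess        # where the Newton step lands: -a * (a/delta)**2
print(newton_point)                   # approx. [-0.125, -8.0, -1000.0]: diverges once |a| > delta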
When you use np.ones for the hessian, the leaf value becomes -L'(a) instead; its magnitude is capped at delta, so the residuals shrink by at most about delta per boosting round and do not get anywhere near zero in a reasonable number of iterations either.
To properly implement gradient boosting with the Pseudo-Huber loss you have to give up on hessians and use plain gradient descent to find the optimal leaf value. You cannot do that through lightgbm's custom-loss interface, but lightgbm has a built-in Huber loss, so you can use that instead.
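For example, a minimal sketch using the built-in objective (alpha plays the role of delta here; X_train, y_train and X_test are the same placeholders as in the question):
import lightgbm as lgb

params = {'objective': 'huber',   # built-in Huber loss
          'alpha': 1.0,           # the Huber delta parameter
          'max_depth': 10, 'learning_rate': 0.05,
          'num_leaves': 150, 'min_child_samples': 5}
dtrain = lgb.Dataset(X_train, label=y_train, free_raw_data=False)
model = lgb.train(params, dtrain, num_boost_round=100)
y_pred = model.predict(X_test)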
Upvotes: 4
Reputation: 529
It may be the effect of the forced monotone_constraints. They should be set only after you have obtained acceptable results and want to improve them further, i.e. after a deep analysis of the data and the results.
Additionally (probably it is only an error made while copying the code to SO), in your loss function all hess values are constant throughout the whole training process because of hess = np.ones(len(preds)).
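For completeness, a minimal sketch of how that parameter is usually supplied (the lst_mon values below are hypothetical; the list needs one entry per feature, with 1 / -1 / 0 meaning increasing / decreasing / unconstrained):
# hypothetical constraint list: one entry per column of X_train
lst_mon = [1, 0, -1]

params = {'max_depth': 10, 'learning_rate': 0.05,
          'num_leaves': 150, 'min_child_samples': 5,
          'monotone_constraints': lst_mon}   # drop this key to train unconstrained first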
Upvotes: 0