Johnny Cheesecutter

Reputation: 2853

Implement custom huber loss in lightgbm

I'm trying to implement the Huber loss as a custom objective in lightgbm, as a step towards a customized MAPE loss. Below is my code. However, when I run it I get zeros for all predictions. What is wrong with the code? It seems that some scaling might help with learning, but I can't find any guidance online on how it should be applied inside a custom loss. Could you please help me with that?

def my_loss(preds, dtrain):

   y_true = dtrain.get_label()
   d = (preds - y_true)
   h = 1  #h is delta in the graphic
   scale = 1 + (d / h) ** 2
   scale_sqrt = np.sqrt(scale)
   grad = d / scale_sqrt 
   hess = 1 / scale / scale_sqrt 

   hess = np.ones(len(preds))

   return grad, hess

metrics = []
for i in my_cv:
   X_train = X.loc[i[0],:]
   y_train = y.loc[i[0]]
   X_test = X.loc[i[1],:]
   y_test = y.loc[i[1]]


   dtrain = xgb.Dataset(X_train, label=y_train, free_raw_data =False)


   params = {'max_depth': 10, 'learning_rate':0.05,'objective':None,
         'num_leaves':150, 'min_child_samples':5, 'nround':100,
         'monotone_constraints':lst_mon}

   mm = xgb.train(params, dtrain, fobj = my_loss)
   y_pred = mm.predict(X_train)

Upvotes: 2

Views: 2958

Answers (3)

Marco Cerliani

Reputation: 22021

The corrected function:

def my_loss(preds, dtrain):

   y_true = dtrain.get_label()
   d = (preds - y_true)
   h = 1  #h is delta in the graphic
   scale = 1 + (d / h) ** 2
   scale_sqrt = np.sqrt(scale)
   grad = d / scale_sqrt 
   hess = 1 / scale / scale_sqrt 

   return grad, hess

I removed hess = np.ones(len(preds)), which was overwriting the hessian computed just above it.
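For completeness, a minimal training sketch with this objective could look like the following (X_train, y_train and X_test are placeholders for your own data, the parameter values are only illustrative, and fobj is the older lightgbm API used in the question; newer releases expect the callable in params['objective'] instead):

import lightgbm as lgb

# placeholders: replace with your own data
dtrain = lgb.Dataset(X_train, label=y_train, free_raw_data=False)

# objective stays None in params because the custom loss is passed separately
params = {'learning_rate': 0.05, 'num_leaves': 150, 'objective': None}

model = lgb.train(params, dtrain, num_boost_round=100, fobj=my_loss)
y_pred = model.predict(X_test)   # raw scores produced by the custom objective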

Upvotes: 2

Viktoriya Malyasova

Reputation: 1425

Huber loss is defined as

L_delta(a) = 0.5 * a**2                    if |a| <= delta
L_delta(a) = delta * (|a| - 0.5 * delta)   otherwise

The loss you've implemented is its smooth approximation, the Pseudo-Huber loss:

L_delta(a) = delta**2 * (sqrt(1 + (a / delta)**2) - 1)
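A quick way to convince yourself that the gradient and hessian in the question really do belong to this Pseudo-Huber loss is to compare them against finite differences (delta = 1 here, matching h in the question):

import numpy as np

delta = 1.0
a = np.linspace(-5.0, 5.0, 11)   # residuals preds - y_true
eps = 1e-5

def pseudo_huber(a, delta=1.0):
    return delta ** 2 * (np.sqrt(1 + (a / delta) ** 2) - 1)

# analytic derivatives, same expressions as grad/hess in the question
grad = a / np.sqrt(1 + (a / delta) ** 2)
hess = 1 / (1 + (a / delta) ** 2) ** 1.5

# central finite differences of the loss
num_grad = (pseudo_huber(a + eps) - pseudo_huber(a - eps)) / (2 * eps)
num_hess = (pseudo_huber(a + eps) - 2 * pseudo_huber(a) + pseudo_huber(a - eps)) / eps ** 2

print(np.max(np.abs(grad - num_grad)))   # close to zero
print(np.max(np.abs(hess - num_hess)))   # close to zero, up to finite-difference noise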

The problem with this loss is that its second derivative gets too close to zero. To speed up the algorithm, lightgbm uses a Newton-method approximation to find the optimal leaf value:

y = - L' / L''

(See this blogpost for details).

That is, it finds the point where a parabola with the same gradient and second derivative would reach its minimum. If the loss function is quadratic, this gives the exact optimal value. For the Pseudo-Huber loss, however, Newton's method diverges everywhere:

|- L'(a) / L''(a)| = (1 + (a/delta)**2) * |a| > |a|,

so the approximation you get is always even farther from the minimum than the value you started with.
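A small numerical illustration of that overshoot, with delta = 1 and a starting residual of 2 (arbitrary example values; the learning rate is ignored):

import numpy as np

delta = 1.0
a = 2.0                                    # current residual preds - y_true
grad = a / np.sqrt(1 + (a / delta) ** 2)   # L'(a)
hess = 1 / (1 + (a / delta) ** 2) ** 1.5   # L''(a)

leaf_value = -grad / hess                  # Newton step: -(1 + (a/delta)**2) * a = -10.0
new_residual = a + leaf_value              # -8.0: farther from the minimum at 0 than a = 2.0

print(leaf_value, new_residual)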

When you use np.ones for the hessian, the leaf value becomes -L'(a) instead, which does not converge to zero either.

To properly implement gradient boosting with the Pseudo-Huber loss you would have to give up the hessian and use plain gradient descent to find the optimal leaf value. You cannot do that from lightgbm's custom-loss interface, but lightgbm has a built-in Huber loss, so you can use that instead.
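For example, something along these lines should work (alpha is lightgbm's name for the Huber delta; the values are placeholders, not tuned):

import lightgbm as lgb

dtrain = lgb.Dataset(X_train, label=y_train, free_raw_data=False)   # X_train / y_train: your data

params = {'objective': 'huber',
          'alpha': 1.0,              # delta of the Huber loss
          'learning_rate': 0.05,
          'num_leaves': 150}

model = lgb.train(params, dtrain, num_boost_round=100)
y_pred = model.predict(X_test)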

Upvotes: 4

Grzegorz Sionkowski

Reputation: 529

It may be the effect of the forced monotone_constraints. They should be set only after you have obtained acceptable results and want to improve them, and only after a deep analysis of the data and the results.

Additionally (probably just an error made while copying the code to SO), in your loss function all hess values stay constant throughout training because of hess = np.ones(len(preds)).

Upvotes: 0
