Reputation: 352
I am trying to use the 'is_unbalance' parameter when training a LightGBM model for a binary classification problem where the positive class is approximately 3%. If I set 'is_unbalance', the binary log loss drops in the first iteration but then keeps increasing. I notice this behavior only when 'is_unbalance' is enabled; otherwise, there is a steady drop in log_loss. I'd appreciate your help on this. Thanks.
Upvotes: 7
Views: 14745
Reputation: 684
When you do not balance the classes for such an unbalanced dataset, the objective value will obviously keep dropping, and the model will probably end up assigning every prediction to the majority class while still reporting a fantastic objective value.
Balancing the classes is necessary, but it doesn't mean you should stop at is_unbalance. You can also use scale_pos_weight, a custom metric, or per-sample weights, like the following:
import lightgbm as lgb

# per-class weight: the minority class gets 1.0, the majority class gets a proportionally smaller weight
WEIGHTS = y_train.value_counts(normalize=True).min() / y_train.value_counts(normalize=True)
TRAIN_WEIGHTS = y_train.map(WEIGHTS).values  # look up each sample's class weight by its label
train_data = lgb.Dataset(X_train, label=y_train, weight=TRAIN_WEIGHTS)
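For reference, a minimal training sketch on top of that weighted dataset; the parameter values are illustrative assumptions, and X_valid / y_valid stand for a held-out split that is not part of the original code:

params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'learning_rate': 0.05,  # illustrative value, not tuned
    'num_leaves': 31,       # LightGBM default
}
# evaluate on a held-out split so the log_loss curve reflects generalization
valid_data = lgb.Dataset(X_valid, label=y_valid, reference=train_data)
model = lgb.train(params, train_data, num_boost_round=200, valid_sets=[valid_data])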
Also, tuning the other hyperparameters should help resolve the increasing log_loss.
Upvotes: 4
Reputation: 388
When you set is_unbalance: true, the algorithm automatically balances the weight of the dominated label (using the pos/neg fraction in the training set).
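For example, a minimal parameter dict that turns this on (the other entries are assumptions, not from the answer):

params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'is_unbalance': True,  # LightGBM re-weights the minority class using the neg/pos ratio
}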
If you want to change scale_pos_weight (its default is 1, which assumes the positive and negative labels are equally weighted) for an unbalanced dataset, you can use the following formula (based on this issue on the LightGBM repository) to set it correctly.
scale_pos_weight = number of negative samples / number of positive samples
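A short sketch of computing it from the training labels (y_train is assumed to be the binary label Series from the question); note that LightGBM expects you to choose either is_unbalance or scale_pos_weight, not both:

n_pos = (y_train == 1).sum()
n_neg = (y_train == 0).sum()
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'scale_pos_weight': n_neg / n_pos,  # roughly 32 for a ~3% positive class
}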
Upvotes: 1