Ming Jun Lim

Reputation: 142

How to use the "is_unbalance" and "scale_pos_weight" parameters in LightGBM for an imbalanced (80:20) binary classification project

I currently have an imbalanced dataset, as shown in the diagram below: Distribution of target feature

I then set the 'is_unbalance' parameter to True when training the LightGBM model. The diagrams below show how I use this parameter.

Example of using the native API: [screenshot of the native API call]

Example of using the scikit-learn API: [screenshot of the scikit-learn API call]
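
In text form, the two snippets are roughly the following (a minimal sketch; X_train and y_train are placeholders for my training data):

    import lightgbm as lgb

    # Native API: pass is_unbalance in the params dict
    params = {
        "objective": "binary",
        "is_unbalance": True,  # let LightGBM reweight the positive class automatically
    }
    train_set = lgb.Dataset(X_train, label=y_train)
    booster = lgb.train(params, train_set, num_boost_round=100)

    # scikit-learn API: pass is_unbalance as a keyword argument
    clf = lgb.LGBMClassifier(objective="binary", is_unbalance=True)
    clf.fit(X_train, y_train)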

My questions are:

  1. Is the way I am using the is_unbalance parameter correct?
  2. How do I use scale_pos_weight instead of is_unbalance?
  3. Or should I balance the dataset with SMOTE-based techniques like SMOTE-ENN or SMOTE+Tomek?

Thanks!

Upvotes: 4

Views: 18770

Answers (2)

Loc Quan

Reputation: 91

Simply put, if you set is_unbalance=True, the model automatically applies scale_pos_weight with a value equal to (number of negative samples) / (number of positive samples); e.g., with 800 negative samples and 200 positive samples, scale_pos_weight = 4.
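
For example, a minimal sketch of computing that value yourself and passing it explicitly (X_train and y_train are placeholders for your training data, with labels 0 = negative and 1 = positive):

    import numpy as np
    import lightgbm as lgb

    # ratio of negative to positive samples, e.g. 800 / 200 = 4.0
    n_neg = int(np.sum(y_train == 0))
    n_pos = int(np.sum(y_train == 1))
    spw = n_neg / n_pos

    # pass it explicitly instead of is_unbalance (use one or the other, not both)
    clf = lgb.LGBMClassifier(objective="binary", scale_pos_weight=spw)
    clf.fit(X_train, y_train)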

So regarding your questions:

  1. Your usage of is_unbalance is correct (just don't use both is_unbalance and scale_pos_weight at the same time).
  2. If you want to set scale_pos_weight to a value different from the one above, there are, as far as I know, no other typical values. You can increase or decrease it and check whether the model improves; it all comes down to trial and error.
  3. I don't think you should try SMOTE (unless it's for educational purposes, or you really have spare time/resources). SMOTE usually doesn't help and often makes things worse by adding noise, which is also why many people are against using it. For most datasets it is better not to apply any resampling technique (paper: Do we need rebalancing strategies? A theoretical and empirical study around SMOTE and its variants), and trying to balance the data is generally not recommended (paper: To SMOTE, or not to SMOTE?). However, as the authors of those papers note, it can help model performance in some cases, so if you have time/resources to spare, why not try it out (see the sketch after this list)? Machine learning is largely trial and error, after all.
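
If you do decide to experiment with resampling, a minimal sketch using the imbalanced-learn package could look like this (assuming imbalanced-learn is installed; X_train and y_train are placeholders, and only the training split should be resampled, never the test split):

    import lightgbm as lgb
    from imblearn.combine import SMOTEENN, SMOTETomek

    # resample only the training data; evaluate on the untouched test split
    X_res, y_res = SMOTEENN(random_state=42).fit_resample(X_train, y_train)
    # or: X_res, y_res = SMOTETomek(random_state=42).fit_resample(X_train, y_train)

    # train without is_unbalance/scale_pos_weight, since the data is now roughly balanced
    clf = lgb.LGBMClassifier(objective="binary")
    clf.fit(X_res, y_res)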

Upvotes: 1

Vitor Pereira Barbosa

Reputation: 969

This answer might be helpful for your question about is_unbalance: Use of 'is_unbalance' parameter in Lightgbm

You're not necessarily using is_unbalance incorrectly, but scale_pos_weight gives you finer control over the weights of the minority and majority classes.

This link has a good explanation of how to use scale_pos_weight: https://stats.stackexchange.com/questions/243207/what-is-the-proper-usage-of-scale-pos-weight-in-xgboost-for-imbalanced-datasets

Basically, scale_pos_weight lets you set a configurable weight for the minority (positive) class of the target variable. There is a good discussion of this topic here: https://discuss.xgboost.ai/t/how-does-scale-pos-weight-affect-probabilities/1790/4.

As for SMOTE, I can't give you a theoretical argument, but in my experience, every time I tried to use it to improve a model's performance, it failed.

A better approach might be to carefully decide which metric to optimize. Better metrics for imbalanced problems are F1-score and recall; in general, AUC and accuracy are poor choices. The micro- and weighted-averaged metrics are also good objectives when searching for hyperparameters.

Machine Learning Mastery provides a good explanation and example code for micro, macro, and weighted metrics: https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/
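
For instance, a minimal sketch comparing these metrics with scikit-learn (y_test and y_pred are placeholders for the true and predicted labels):

    from sklearn.metrics import accuracy_score, f1_score, recall_score

    print("accuracy     :", accuracy_score(y_test, y_pred))  # can look high even for a useless model
    print("recall       :", recall_score(y_test, y_pred))    # fraction of positives actually caught
    print("f1 (binary)  :", f1_score(y_test, y_pred))        # F1 of the positive class
    print("f1 (macro)   :", f1_score(y_test, y_pred, average="macro"))
    print("f1 (micro)   :", f1_score(y_test, y_pred, average="micro"))
    print("f1 (weighted):", f1_score(y_test, y_pred, average="weighted"))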

Upvotes: 13
