Reputation: 142
I currently have an imbalanced dataset, as shown in the diagram below:
I then set the 'is_unbalance' parameter to True when training the LightGBM model. The diagrams below show how I use this parameter.
Example of using the scikit-learn API:
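Since the screenshots are not reproduced here, below is a minimal sketch (not the asker's exact code, with synthetic data) of how the parameter is typically passed through LightGBM's scikit-learn API:

```python
# Minimal sketch (not the original screenshot): pass is_unbalance via the
# scikit-learn wrapper; the data here is synthetic and imbalanced roughly 9:1.
import lightgbm as lgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

clf = lgb.LGBMClassifier(objective="binary", is_unbalance=True)
clf.fit(X, y)
```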
My questions are:
1. Is my use of the is_unbalance parameter correct?
2. Should I use scale_pos_weight instead of is_unbalance?
Thanks!
Upvotes: 4
Views: 18770
Reputation: 91
Simply put, if you set is_unbalance=True, the model automatically uses scale_pos_weight with a value equal to (number of negative samples) / (number of positive samples) (e.g. with 800 negative samples and 200 positive samples, scale_pos_weight = 4).
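For illustration (not part of the original answer), the equivalence can be written out with the 800/200 example:

```python
# Sketch of the equivalence described above, using the 800/200 example.
import numpy as np

y = np.array([0] * 800 + [1] * 200)                   # 800 negative, 200 positive labels
scale_pos_weight = (y == 0).sum() / (y == 1).sum()    # 800 / 200 = 4.0

# These two parameter sets are expected to behave the same way:
params_a = {"objective": "binary", "is_unbalance": True}
params_b = {"objective": "binary", "scale_pos_weight": scale_pos_weight}
```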
So regarding your questions:
1. Your use of is_unbalance is right (don't use both is_unbalance and scale_pos_weight at once).
2. If you want to set scale_pos_weight to a value different from the one I mentioned above: as far as I know there are no other typical values. You can try increasing or decreasing it and check whether the model improves; it's all down to trial and error (see the sketch below).
Upvotes: 1
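As a rough illustration of the trial-and-error suggestion in point 2 above (a sketch, not the answerer's code), one could sweep a few scale_pos_weight values and compare cross-validated F1:

```python
# Hedged sketch: try several scale_pos_weight values around n_neg/n_pos and
# compare cross-validated F1; the data and candidate values are arbitrary.
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)

for spw in [1, 2, 4, 8]:
    clf = lgb.LGBMClassifier(objective="binary", scale_pos_weight=spw)
    f1 = cross_val_score(clf, X, y, cv=5, scoring="f1").mean()
    print(f"scale_pos_weight={spw}: mean F1 = {f1:.3f}")
```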
Reputation: 969
This answer might be good for your question about is_unbalance: Use of 'is_unbalance' parameter in Lightgbm
You're not necessarily using is_unbalance incorrectly, but scale_pos_weight gives you better control over the weights of the minority and majority classes.
This link gives a good explanation of how to use scale_pos_weight: https://stats.stackexchange.com/questions/243207/what-is-the-proper-usage-of-scale-pos-weight-in-xgboost-for-imbalanced-datasets
Basically, scale_pos_weight lets you set a configurable weight for the minority class of the target variable. A good discussion of this topic is here: https://discuss.xgboost.ai/t/how-does-scale-pos-weight-affect-probabilities/1790/4.
About SMOTE: I can't give you a theoretical argument, but in my experience, every time I tried to use it to improve a model's performance, it failed.
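For context (not an endorsement, given the answerer's experience above), SMOTE is usually applied via the imbalanced-learn package; a minimal sketch might look like this:

```python
# Minimal sketch of SMOTE oversampling with imbalanced-learn (imblearn);
# the data is synthetic, and the answer above reports SMOTE did not help them.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)  # minority class is oversampled
print(y.sum(), y_res.sum())  # the resampled set contains many more positive labels
```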
A better approach might be to decide carefully which metric to optimize. Better metrics for imbalanced problems are F1-score and recall; in general, AUC and accuracy will be bad choices. The micro-averaged and weighted metrics are also good choices as the objective when searching for hyperparameters.
Machine Learning Mastery provides a good explanation and implementation code for micro, macro and weighted metrics: https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/
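As a small illustration of those averaging options (a sketch with toy labels, not taken from the linked article), scikit-learn's f1_score exposes them directly, and the same names can be used as scoring strings (e.g. "f1_weighted") in a hyperparameter search:

```python
# Toy illustration of binary / macro / micro / weighted F1 on imbalanced labels.
from sklearn.metrics import f1_score, recall_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # two positives out of ten
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(f1_score(y_true, y_pred))                      # positive class only (default)
print(f1_score(y_true, y_pred, average="macro"))     # unweighted mean over classes
print(f1_score(y_true, y_pred, average="micro"))     # pooled TP/FP/FN counts
print(f1_score(y_true, y_pred, average="weighted"))  # mean weighted by class support
print(recall_score(y_true, y_pred))                  # recall of the minority class
```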
Upvotes: 13