Greem666

Reputation: 949

ML with imbalanced binary dataset

I have a problem I am trying to solve:

- imbalanced dataset with 2 classes
- one class dwarfs the other one (923 vs 38)
- f1_macro score when the dataset is used as-is to train RandomForestClassifier stays in the 0.6 - 0.65 range for both TRAIN and TEST

While researching the topic yesterday, I read up on resampling, and on the SMOTE algorithm in particular. It seems to have worked wonders for my TRAIN score: after balancing the dataset with it, the score went from ~0.6 up to ~0.97. I applied it as follows:

My assumption is that the holdout data in the TEST set contained minority-class observations that were vastly different from the pre-SMOTE minority-class observations in the TRAIN set. The model learned to recognize the cases in the TRAIN set really well, but was thrown off balance by these few outliers in the TEST set.

What are the common strategies for dealing with this problem? Common sense would dictate that I should try to capture a very representative sample of the minority class in the TRAIN set, but I do not think that sklearn has any automated tools which allow that to happen?

Upvotes: 0

Views: 273

Answers (1)

secretive

Reputation: 2112

Your assumption is correct. Your model is overfitting to the training data, in which the same pattern is repeated for one class. The model learns that pattern and misses the other patterns that are present in the test data. This means that the model will not perform well on unseen, real-world data.

If SMOTE is not working, you can experiment with different machine learning models. Random forest generally performs well on this type of dataset, so try to tune your RF model by pruning it or tuning the hyperparameters. Another option is to assign class weights when training the model. You can also try penalized models, which impose an additional cost on the model when it misclassifies the minority class.
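As a rough illustration of the class-weight suggestion, here is a minimal sketch. The data is a synthetic stand-in generated with `make_classification` (an assumption, not the asker's dataset), and the hyperparameter values are placeholders to tune rather than recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 961 rows, roughly 4% minority class.
X, y = make_classification(
    n_samples=961, n_features=10, n_informative=5,
    weights=[0.96, 0.04], random_state=0,
)

# stratify=y keeps the class ratio the same in TRAIN and TEST.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" weights each class inversely to its frequency.
# min_samples_leaf stops leaves from isolating single rows, which acts
# as a simple form of pruning.
clf = RandomForestClassifier(
    n_estimators=200, class_weight="balanced",
    min_samples_leaf=5, random_state=0,
)
clf.fit(X_train, y_train)

train_f1 = f1_score(y_train, clf.predict(X_train), average="macro")
test_f1 = f1_score(y_test, clf.predict(X_test), average="macro")
print(f"f1_macro TRAIN={train_f1:.2f} TEST={test_f1:.2f}")
```

A large gap between the two printed scores is the overfitting symptom described above; the stratified split also goes some way toward giving TRAIN a representative share of the minority class.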

You can also try undersampling, since you have already tested oversampling, but it will most probably suffer from the same problem. Please also try simple random oversampling instead of SMOTE to see how your results change.
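Simple oversampling could look roughly like the sketch below, which uses `sklearn.utils.resample` to duplicate minority rows. The synthetic data and the choice of class 1 as the minority are assumptions for illustration. Only the TRAIN split is resampled, so the TEST split keeps the original class ratio.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the real data (class 1 is the minority here).
X, y = make_classification(
    n_samples=961, n_features=10, n_informative=5,
    weights=[0.96, 0.04], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

minority = y_train == 1
n_major = int((~minority).sum())

# Draw minority rows with replacement until both classes have the same count.
X_min_up, y_min_up = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=n_major, random_state=0,
)

# Balanced TRAIN set: all majority rows plus the oversampled minority rows.
X_bal = np.vstack([X_train[~minority], X_min_up])
y_bal = np.concatenate([y_train[~minority], y_min_up])
print(np.bincount(y_bal))
```

Undersampling follows the same shape: resample the majority rows with `replace=False` and `n_samples` set to the minority count.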

Another, more advanced method that you should experiment with is batching. Take all of your minority class and an equal number of entries from the majority class, and train a model on them. Repeat this for every batch of the majority class; in the end you will have multiple models, which you can then combine by voting.
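The batching idea might be sketched as follows. The data is again synthetic, and the helper `vote`, the batch sizing, and the tie-breaking rule are illustrative assumptions rather than a fixed recipe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (class 1 is the minority here).
X, y = make_classification(
    n_samples=961, n_features=10, n_informative=5,
    weights=[0.96, 0.04], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

rng = np.random.default_rng(0)
min_idx = np.flatnonzero(y_train == 1)
maj_idx = rng.permutation(np.flatnonzero(y_train == 0))

# Split the shuffled majority rows into batches of about the minority size.
n_batches = max(1, len(maj_idx) // len(min_idx))
batches = np.array_split(maj_idx, n_batches)

# Train one model per batch: every minority row plus one majority batch.
models = []
for batch in batches:
    idx = np.concatenate([min_idx, batch])
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_train[idx], y_train[idx])
    models.append(model)

def vote(models, X):
    """Majority vote over the per-batch models (ties go to class 1)."""
    votes = np.stack([m.predict(X) for m in models])  # (n_models, n_rows)
    return (votes.mean(axis=0) >= 0.5).astype(int)

y_pred = vote(models, X_test)
print("f1_macro TEST:", round(f1_score(y_test, y_pred, average="macro"), 2))
```

Each model sees a roughly balanced sample while the ensemble as a whole still uses every majority row, which is the main advantage over plain undersampling.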

Upvotes: 2
