Victor

Reputation: 1205

How to implement ratio-based SMOTE oversampling while CV-ing dataset

I'm dealing with a very imbalanced dataset (~5% minority class) on a binary classification problem. I'm chaining SMOTE and a Random Forest classifier in a pipeline so that the oversampling happens inside the cross-validation loop of a randomized hyperparameter search (as suggested here). You can see my implementation below:

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from imblearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

sm = SMOTE()
rf = RandomForestClassifier()

pipeline = Pipeline([('sm', sm), ('rf', rf)])

kf = StratifiedKFold(n_splits = 5)

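# Note: 'auto' was removed from max_features in scikit-learn 1.3
# (for classifiers it was equivalent to 'sqrt'), so 'log2' is used here instead.
params = {'rf__max_depth' : list(range(2,5)),
    'rf__max_features' : ['sqrt','log2'],
    'rf__bootstrap' : [True, False]
}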

grid = RandomizedSearchCV(pipeline, param_distributions = params, scoring = 'f1', cv = kf)

grid.fit(X, y)

However, this paper (see Table 4 page 7) suggests testing different resampling ratios to find out which one performs best. Right now, with sm = SMOTE(), I'm generating a 50-50% dataset, but I would like to loop over a list of candidate ratios (e.g. 5-95, 10-90, etc.). As far as I can tell, the ratio parameter in SMOTE doesn't accept a desired percentage, only a specific integer number of samples, and I don't think I can use that with k-fold CV because each fold may have a slightly different sample size. How could this be implemented?

Upvotes: 2

Views: 3866

Answers (1)

Vivek Kumar

Reputation: 36599

Although it isn't mentioned in the docs, I think you can pass a float to specify the ratio. Be aware, though, that it's deprecated and will be removed in future versions (I think because this only works for binary cases and not multiclass).

# 'auto' was removed from max_features in scikit-learn 1.3, so 'log2' is used instead.
params = {'sm__ratio' : [0.05, 0.10, 0.15],
          'rf__max_depth' : list(range(2,5)),
          'rf__max_features' : ['sqrt','log2'],
          'rf__bootstrap' : [True, False]
         }

grid = RandomizedSearchCV(pipeline, param_distributions = params, scoring = 'f1', cv = kf)

Also note that the ratio you specify here will be the ratio of minority to majority samples after upsampling the minority class.

So let's say you have original classes as follows:

  1:  75
  0:  25  

And you specify the ratio as 0.5. The majority class will not be touched, but 12 more synthetic samples of class 0 will be generated, so the final numbers are:

  1:  75
  0:  37  (25 + 12) 

And the final ratio is 37 / 75 ≈ 0.49, i.e. approximately the 0.5 you asked for (the target count is truncated to a whole number of samples).
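
The arithmetic above can be checked with a few lines of plain Python. This is a sketch, not the library's code: `synthetic_samples_needed` is a hypothetical helper, and truncating the target count to an integer is an assumption chosen to match the numbers in this example. The fold sizes in the loop are made up to show that a float ratio adapts to each fold, which is why it works inside k-fold CV where a fixed sample count would not.

```python
def synthetic_samples_needed(n_majority, n_minority, ratio):
    """Number of synthetic minority samples needed to reach the target ratio.

    Hypothetical helper: the target minority count is truncated to an integer,
    and the majority class is left untouched.
    """
    target_minority = int(n_majority * ratio)
    return max(target_minority - n_minority, 0)


# The example from the answer: 75 majority, 25 minority, ratio 0.5.
extra = synthetic_samples_needed(75, 25, 0.5)
final_minority = 25 + extra
print(extra, final_minority, round(final_minority / 75, 3))

# Made-up training folds with slightly different class counts: the same ratio
# yields a different number of synthetic samples in each fold.
for n_majority, n_minority in [(760, 40), (764, 36)]:
    print(n_majority, n_minority,
          synthetic_samples_needed(n_majority, n_minority, 0.25))
```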

Upvotes: 1
