Reputation: 1205
I'm dealing with a very imbalanced dataset (minority class ~5%) on a binary classification problem. I'm chaining SMOTE and a Random Forest classifier in a pipeline so that the oversampling happens inside the cross-validation loop of a randomized search (as suggested here). My implementation is below:
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from imblearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
sm = SMOTE()
rf = RandomForestClassifier()
pipeline = Pipeline([('sm', sm), ('rf', rf)])
kf = StratifiedKFold(n_splits = 5)
params = {'rf__max_depth' : list(range(2,5)),
          # 'auto' was an alias of 'sqrt' for classifiers and is rejected by recent scikit-learn
          'rf__max_features' : ['sqrt', 'log2'],
          'rf__bootstrap' : [True, False]
          }
grid = RandomizedSearchCV(pipeline, param_distributions = params, scoring = 'f1', cv = kf)
grid.fit(X, y)
However, this paper (see Table 4, page 7) suggests testing different resampling ratios to find which one performs best. Right now, sm = SMOTE() produces a balanced 50-50 dataset, but I would like to loop over a list of candidate ratios (e.g. 5-95, 10-90, etc.). As far as I can tell, the ratio parameter in SMOTE doesn't accept a target percentage, only a specific number of samples, which I can't fix in advance because of my k-fold CV (each fold may have a slightly different sample size). How could this be implemented?
Upvotes: 2
Views: 3866
Reputation: 36599
Although the docs don't mention it, I think you can pass a float to ratio to specify the target ratio. Be aware that ratio is deprecated and will be removed in future versions; from imblearn 0.4 onward the equivalent parameter is sampling_strategy, where a float is accepted only for binary problems, not multiclass.
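# Each value is a minority/majority ratio and must exceed the ratio already present in the data.
# On imblearn versions where ratio has been removed, use the key 'sm__sampling_strategy' instead.
params = {'sm__ratio' : [0.05, 0.10, 0.15],
          'rf__max_depth' : list(range(2,5)),
          # 'auto' was an alias of 'sqrt' for classifiers and is rejected by recent scikit-learn
          'rf__max_features' : ['sqrt', 'log2'],
          'rf__bootstrap' : [True, False]
          }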
grid = RandomizedSearchCV(pipeline, param_distributions = params, scoring = 'f1', cv = kf)
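The question lists its targets as percentage splits (5-95, 10-90), whereas the float above is a minority-to-majority ratio, so a 10-90 split corresponds to 0.10 / 0.90 ≈ 0.11 rather than 0.10. Below is a minimal sketch of that conversion, assuming the float is read as minority count divided by majority count after resampling. The names share_to_ratio and usable_ratios are hypothetical helpers, not part of imblearn, and the labels are toy data. The sketch also drops candidates that do not exceed the ratio already in the data, since oversampling can only add minority samples.

```python
from collections import Counter


def share_to_ratio(minority_share):
    """Convert a target minority share (e.g. 0.10 for a 10-90 split)
    into a minority/majority ratio (0.10 / 0.90)."""
    if not 0 < minority_share < 1:
        raise ValueError("minority_share must be strictly between 0 and 1")
    return minority_share / (1 - minority_share)


def usable_ratios(y, minority_shares):
    """Keep only target ratios above the current minority/majority ratio of y."""
    counts = Counter(y)
    current = min(counts.values()) / max(counts.values())
    ratios = [share_to_ratio(s) for s in minority_shares]
    return [round(r, 4) for r in ratios if r > current]


# Toy labels with a 5% minority class (hypothetical data).
y = [1] * 5 + [0] * 95
candidates = usable_ratios(y, [0.03, 0.10, 0.25, 0.50])
print(candidates)  # the 0.03 share is dropped: it is below the existing ratio
```

The resulting list could then be used as the values of the 'sm__ratio' entry in params (or 'sm__sampling_strategy' on newer imblearn versions).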
Also note that the ratio specified here is the ratio of minority to majority class counts after the minority class has been upsampled.
So let's say you have original classes as follows:
1: 75
0: 25
And you specify the ratio as 0.5. The majority class will not be touched, but 12 synthetic samples of class 0 will be generated, so the final numbers are:
1: 75
0: 37 (25 + 12)
And the final ratio is 37 / 75 ≈ 0.49, i.e. approximately the 0.5 you requested.
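The arithmetic in this example can be checked with a short sketch. It assumes the number of synthetic samples is the target minority count minus the existing one, truncated to an integer, which matches the 12 above. The function synthetic_needed is a hypothetical helper for illustration, not an imblearn function.

```python
def synthetic_needed(n_majority, n_minority, ratio):
    """Number of minority samples to synthesize so that
    minority / majority is approximately `ratio` after oversampling."""
    raw = n_majority * ratio - n_minority
    if raw < 0:
        # Oversampling cannot remove minority samples.
        raise ValueError("ratio is below the current minority/majority ratio")
    return int(raw)  # drop the fractional part


n_majority, n_minority = 75, 25
extra = synthetic_needed(n_majority, n_minority, 0.5)
final_minority = n_minority + extra
print(extra, final_minority, round(final_minority / n_majority, 3))
# prints: 12 37 0.493
```

Because the count is derived from each training fold's own class sizes, the same float can be reused across folds of slightly different sizes.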
Upvotes: 1