Reputation: 33
Let x contain the feature variables:
print(x)
Restaurant Cuisines Average_Cost Rating Votes Reviews Area
0 3.526361 0.693147 5.303305 1.504077 2.564949 1.609438 7.214504
1 1.386294 4.127134 4.615121 1.504077 2.484907 1.609438 5.905362
2 2.772589 1.386294 5.017280 1.526056 4.605170 3.433987 6.131226
3 3.912023 2.833213 5.525453 1.547563 5.176150 4.564348 7.643483
4 3.526361 2.708050 5.303305 1.435085 5.948035 5.046646 6.126869
... ... ... ... ... ... ... ...
11089 3.912023 0.693147 5.525453 1.648659 5.789960 5.046646 3.135494
11090 1.386294 6.028279 4.615121 1.526056 3.610918 2.833213 7.643483
11091 1.386294 2.397895 4.615121 1.504077 3.828641 2.944439 5.814131
11092 1.386294 6.028279 4.615121 1.410987 3.218876 2.302585 5.905362
11093 1.386294 6.028279 4.615121 1.029619 0.000000 0.000000 5.564520
11094 rows × 7 columns
And let y be the multi-class target variable:
print(y.value_counts())
30 minutes 7406
45 minutes 2665
65 minutes 923
120 minutes 62
20 minutes 20
80 minutes 14
10 minutes 4
Name: Delivery_Time, dtype: int64
After exploring y, we can see that the 30 minutes class has far more samples than the other classes.
To balance them, I tried SMOTETomek to resample the data, but I got an error:
from imblearn.combine import SMOTETomek
smk = SMOTETomek(ratio = 1)
x_res, y_res = smk.fit_sample(x,y)
Error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-54-426e8b86623d> in <module>()
1 from imblearn.combine import SMOTETomek
2 smk = SMOTETomek(ratio = 1)
----> 3 x_res, y_res = smk.fit_sample(x,y)
2 frames
/usr/local/lib/python3.6/dist-packages/imblearn/utils/_validation.py in _sampling_strategy_float(sampling_strategy, y, sampling_type)
311 if type_y != 'binary':
312 raise ValueError(
--> 313 '"sampling_strategy" can be a float only when the type '
314 'of target is binary. For multi-class, use a dict.')
315 target_stats = _count_class_sample(y)
ValueError: "sampling_strategy" can be a float only when the type of target is binary. For multi-class, use a dict.
Upvotes: 3
Views: 8260
Reputation: 76
I think you should keep the target classes in their original proportions. SMOTE may give you better results on the test set, but the model may still fail on new input data from users (live data).
It's up to you whether to apply SMOTE or not. If you do, you can use this code:
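from imblearn.over_sampling import SMOTE
# "minority" resamples only the smallest class
smote = SMOTE(sampling_strategy="minority")
# fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide
X, Y = smote.fit_resample(x_train_data, y_train_data)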
Upvotes: 1
Reputation: 1515
You can look at the actual implementation in imbalanced-learn:
https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/imblearn/utils/_validation.py#L355
You just need to pass a dictionary, as the error message says. That said, SMOTE also handles the multi-class setting internally when it is given a string strategy.
Do:
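from imblearn.over_sampling import SMOTE
# "minority" resamples only the smallest class
smote = SMOTE(sampling_strategy="minority")
# fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide
X, Y = smote.fit_resample(x_train, y_train)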
From the documentation of the sampling_strategy parameter: "When dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples for each targeted class."
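To make the dict form concrete, here is a minimal sketch using the class counts from the question. The helper names build_sampling_strategy and resample, and the target_count parameter, are made up for illustration and are not part of imbalanced-learn. The k_neighbors=3 default is also an assumption: as far as I know, SMOTE needs more samples in each resampled class than k_neighbors, and the smallest class here has only 4 rows.

```python
from collections import Counter


def build_sampling_strategy(y, target_count=None):
    """Return {class_label: desired_count} for every non-majority class.

    Hypothetical helper for illustration; not part of imbalanced-learn.
    """
    counts = Counter(y)
    majority = max(counts.values())
    target = majority if target_count is None else target_count
    # Over-sampling can only add samples, so never request fewer than exist.
    return {label: max(target, n) for label, n in counts.items() if n < majority}


def resample(x, y, k_neighbors=3):
    """Resample (x, y) with SMOTETomek using a dict sampling strategy."""
    # Imported here so the helper above can be used without imbalanced-learn.
    from imblearn.combine import SMOTETomek
    from imblearn.over_sampling import SMOTE

    strategy = build_sampling_strategy(y)
    # Assumption: k_neighbors is kept below the size of the smallest class.
    smote = SMOTE(sampling_strategy=strategy, k_neighbors=k_neighbors, random_state=0)
    smk = SMOTETomek(sampling_strategy=strategy, smote=smote, random_state=0)
    return smk.fit_resample(x, y)
```

Calling x_res, y_res = resample(x, y) would then ask SMOTE to bring each of the six smaller classes up to 7406 samples before the Tomek links are removed. That means synthesizing thousands of rows from as few as 4 real ones, so passing a smaller target_count to build_sampling_strategy may be the more cautious choice.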
Upvotes: 3