TTZ

Reputation: 883

Oversampling: SMOTE for binary and categorical data in Python

I would like to apply SMOTE to an unbalanced dataset that contains binary, categorical and continuous data. Is there a way to apply SMOTE to binary and categorical data?

Upvotes: 18

Views: 33339

Answers (3)

cph_sto

Reputation: 7607

As of January 2018, this has not been implemented in Python. The following is a reference from the team. In fact, they are open to proposals if someone wants to implement it.

For those with an academic interest in this ongoing issue, the paper (web archive) from Chawla & Bowyer addresses this SMOTE-Non Continuous sampling problem in section 6.1.

Update: This feature was implemented as of 21 October 2018, and the feature request is now closed.

Upvotes: 4

A.T.B

Reputation: 653

As per the documentation, this is now possible with the use of SMOTENC. SMOTE-NC is capable of handling a mix of categorical and continuous features.

Here is the code from the documentation:

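from imblearn.over_sampling import SMOTENC
# categorical_features lists the column indices of X that hold categorical values
smote_nc = SMOTENC(categorical_features=[0, 2], random_state=0)
# X is the feature matrix and y the class labels, both defined beforehand
X_resampled, y_resampled = smote_nc.fit_resample(X, y)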

Upvotes: 23

mank

Reputation: 924

According to the documentation, SMOTE does not yet support categorical data in Python and produces continuous outputs.

You can instead employ a workaround where you convert the categorical variables to integers and use SMOTE.

Then use np.round(X_train[categorical_variables]) to convert them back to the respective categorical values.
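To illustrate why the rounding step is needed, here is a minimal NumPy-only sketch. It does not call imblearn; it mimics the SMOTE-style interpolation between two minority samples by hand, and the sample values and column indices are made up for the example. With imblearn, the same np.round call would be applied to the categorical columns of the array returned by SMOTE's fit_resample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical minority-class samples.
# Columns: [continuous value, binary flag, integer-encoded category]
a = np.array([30.0, 0.0, 1.0])
b = np.array([40.0, 1.0, 2.0])

# SMOTE-style synthetic points: random positions on the segment from a to b
gaps = rng.uniform(0.0, 1.0, size=(5, 1))
synthetic = a + gaps * (b - a)  # categorical columns are now fractional

# Round only the categorical columns back to integer codes
categorical_columns = [1, 2]  # hypothetical column indices
synthetic[:, categorical_columns] = np.round(synthetic[:, categorical_columns])

print(synthetic)
```

Rounding is safest for binary or ordered codes. For an unordered category with more than two levels, a point interpolated between codes 0 and 2 can round to 1, a category that neither neighbour had.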

Upvotes: 4
