tsumaranaina

Reputation: 183

SMOTE in ML classification

I am running a classification algorithm in Jupyter, using sklearn. I want to use SMOTE, since one of my groups is only 35% the size of each of the other two groups. So I want to oversample that group (group 1), but I don't know how to integrate it. (Edit: I know about the SMOTE script, but I want to know where it fits in my script below.) Help?

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.svm import SVC
clf = SVC(kernel='linear')
clf.fit(X_train, y_train.ravel())
y_pred = clf.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=clf, X=X_train, y=y_train, cv=10)
accuracies.mean()
accuracies.std()
from sklearn.model_selection import GridSearchCV
parameters = [{'C': [1, 10, 100], 'kernel': ['linear']},
              {'C': [1, 10, 100],
               'kernel': ['rbf'],
               'gamma': [0.05, 0.001, 0.005]}]
grid_search = GridSearchCV(estimator=clf, param_grid=parameters, scoring='accuracy', cv=10)
grid_search = grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
print(best_accuracy)
best_parameters = grid_search.best_params_
print(best_parameters)

Upvotes: 1

Views: 1026

Answers (3)

Idriss

Reputation: 11

You should split your data into train and test sets before applying SMOTE, to avoid overfitting (data leakage). The right way is to oversample only the training data. https://beckernick.github.io/oversampling-modeling/

Upvotes: 1

Sreeram TP

Reputation: 11907

You have to apply SMOTE to your dataset and use the resulting balanced dataset for training your model.

So, load the data as you already do (the loading code is not shown in the question) and apply SMOTE to it.

In terms of code, it can be done like this:

X = # train data
y = # train labels

# applying SMOTE

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
# note: fit_sample was renamed fit_resample in newer imblearn versions
X_balanced, y_balanced = sm.fit_resample(X, y)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_balanced, y_balanced, test_size=0.1)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.svm import SVC
clf = SVC(kernel='linear')
clf.fit(X_train, y_train.ravel())
y_pred = clf.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=clf, X=X_train, y=y_train, cv=10)
accuracies.mean()
accuracies.std()
from sklearn.model_selection import GridSearchCV
parameters = [{'C': [1, 10, 100], 'kernel': ['linear']},
              {'C': [1, 10, 100],
               'kernel': ['rbf'],
               'gamma': [0.05, 0.001, 0.005]}]
grid_search = GridSearchCV(estimator=clf, param_grid=parameters, scoring='accuracy', cv=10)
grid_search = grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
print(best_accuracy)
best_parameters = grid_search.best_params_
print(best_parameters)

Upvotes: 1

Gambit1614

Reputation: 8801

You can use SMOTE from imbalanced-learn like this:

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
# X and y are your original features and labels;
# fit_sample was renamed fit_resample in newer imblearn versions
X_balanced, y_balanced = sm.fit_resample(X, y)

Then use X_balanced and y_balanced as your X and y, respectively.

Upvotes: 1
