Reputation: 183
I am running a classification algorithm in Jupyter using sklearn. I want to use SMOTE, since one of my groups is only 35% the size of the other two groups, so I want to oversample that group (group 1), but I don't know how to integrate it. (Edit: I know about the SMOTE script, but I want to know where it fits in my script below.) Help?
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

from sklearn.svm import SVC
clf = SVC(kernel='linear')
clf.fit(X_train, y_train.ravel())
y_pred = clf.predict(X_test)

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=clf, X=X_train, y=y_train, cv=10)
accuracies.mean()
accuracies.std()

from sklearn.model_selection import GridSearchCV
parameters = [{'C': [1, 10, 100], 'kernel': ['linear']},
              {'C': [1, 10, 100],
               'kernel': ['rbf'],
               'gamma': [0.05, 0.001, 0.005]}]
grid_search = GridSearchCV(estimator=clf, param_grid=parameters, scoring='accuracy', cv=10)
grid_search = grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
print(best_accuracy)
best_parameters = grid_search.best_params_
print(best_parameters)
Upvotes: 1
Views: 1026
Reputation: 11
You should split your data into train and test sets before applying SMOTE, to avoid data leakage. The right way is to oversample only the training data. https://beckernick.github.io/oversampling-modeling/
Upvotes: 1
Reputation: 11907
You have to apply SMOTE to your dataset and use the resulting balanced dataset to train your model.
So, load the data as you do in your code (the loading step is not shown in the question) and apply SMOTE to it.
In code it can be done like this:
X = ...  # your feature matrix
y = ...  # your labels

# applying SMOTE
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
# fit_sample was renamed to fit_resample in recent versions of imbalanced-learn
X_balanced, y_balanced = sm.fit_resample(X, y)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_balanced, y_balanced, test_size=0.1)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

from sklearn.svm import SVC
clf = SVC(kernel='linear')
clf.fit(X_train, y_train.ravel())
y_pred = clf.predict(X_test)

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=clf, X=X_train, y=y_train, cv=10)
accuracies.mean()
accuracies.std()

from sklearn.model_selection import GridSearchCV
parameters = [{'C': [1, 10, 100], 'kernel': ['linear']},
              {'C': [1, 10, 100],
               'kernel': ['rbf'],
               'gamma': [0.05, 0.001, 0.005]}]
grid_search = GridSearchCV(estimator=clf, param_grid=parameters, scoring='accuracy', cv=10)
grid_search = grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
print(best_accuracy)
best_parameters = grid_search.best_params_
print(best_parameters)
Upvotes: 1
Reputation: 8801
You can use SMOTE from imbalanced-learn like this:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_balanced, y_balanced = sm.fit_resample(X, y)  # X and y are your original features and labels
Then use X_balanced and y_balanced as your X and y respectively.
Upvotes: 1