Reputation: 653
I am working on imbalanced data for classification, and I previously used the Synthetic Minority Over-sampling Technique (SMOTE) to oversample the training data. This time, however, I think I also need Leave-One-Group-Out (LOGO) cross-validation, because I want to leave one subject out on each CV split.
I am not sure I can explain it well, but as I understand it, to do k-fold CV with SMOTE we can apply SMOTE inside the loop on every fold, as I saw in the code on another post. Below is an example of SMOTE applied within k-fold CV.
from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score

kf = KFold(n_splits=5)
for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]
    sm = SMOTE()
    X_train_oversampled, y_train_oversampled = sm.fit_resample(X_train, y_train)
    model = ...  # classification model example
    model.fit(X_train_oversampled, y_train_oversampled)  # fit on the oversampled training data
    y_pred = model.predict(X_test)
    print(f'For fold {fold}:')
    print(f'Accuracy: {model.score(X_test, y_test)}')
    print(f'f-score: {f1_score(y_test, y_pred)}')
Without SMOTE, I tried the following to do LOGO CV. But doing it this way, I end up training on a very imbalanced dataset.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

y = np.array(df.loc[:, df.columns == 'label'])
groups = df["cow_id"].values  # because I want to leave out all data from the cow with the same ID on each run

logo = LeaveOneGroupOut()
logo.get_n_splits(X, y, groups)
cv = logo.split(X, y, groups)

scores = []
for train_index, test_index in cv:
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
    model.fit(X_train, y_train.ravel())
    scores.append(model.score(X_test, y_test.ravel()))
How should I implement SMOTE inside a loop of leave-one-group-out CV? I am confused about how to define the group list for the synthetic training data.
Upvotes: 16
Views: 2057
Reputation: 155
The approach suggested here for LOOCV also makes sense for leave-one-group-out cross-validation: leave out one group as the test set and over-sample the remaining data, then train your classifier on the over-sampled training data and evaluate it on the held-out test set.
In your case, the following code would be the correct way to implement SMOTE inside the LOGO CV loop.
for train_index, test_index in cv:
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
    sm = SMOTE()
    # over-sample only the training split; the held-out group stays untouched
    X_train_oversampled, y_train_oversampled = sm.fit_resample(X_train, y_train.ravel())
    model.fit(X_train_oversampled, y_train_oversampled)
    scores.append(model.score(X_test, y_test.ravel()))
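Regarding the group list: you never need group labels for the synthetic samples, because SMOTE is applied after the split and only to the training data, so the groups array is defined only on the original (real) rows. If you prefer not to write the loop yourself, here is a minimal sketch, assuming your features are in X, labels in y, and cow IDs in groups, and using RandomForestClassifier purely as an example classifier, that wraps SMOTE in an imblearn Pipeline so it is refit on the training portion of every split automatically.

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier  # example classifier, swap in your own model
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# An imblearn Pipeline applies the sampler during fit only,
# so each CV split over-samples just its own training data
pipe = Pipeline([
    ("smote", SMOTE()),
    ("clf", RandomForestClassifier()),
])

logo = LeaveOneGroupOut()
# groups refers to the original samples; synthetic samples are created inside each fold
scores = cross_val_score(pipe, X, y.ravel(), groups=groups, cv=logo, scoring="f1")
print(scores)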
Upvotes: 1