npm

Reputation: 653

How to apply oversampling when doing Leave-One-Group-Out cross validation?

I am working on imbalanced data for classification, and I previously used the Synthetic Minority Over-sampling Technique (SMOTE) to oversample the training data. This time, however, I think I also need to use Leave-One-Group-Out (LOGO) cross-validation, because I want to leave one subject out in each CV split.

I am not sure I can explain it nicely, but as I understand it, to do k-fold CV with SMOTE we can apply SMOTE to each fold's training data inside the loop, as I saw in code from another post. Below is an example of SMOTE applied within k-fold CV.

from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score

kf = KFold(n_splits=5)

for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]
    # Oversample only the training fold; the test fold stays untouched
    sm = SMOTE()
    X_train_oversampled, y_train_oversampled = sm.fit_resample(X_train, y_train)
    model = ...  # classification model example
    model.fit(X_train_oversampled, y_train_oversampled)
    y_pred = model.predict(X_test)
    print(f'For fold {fold}:')
    print(f'Accuracy: {model.score(X_test, y_test)}')
    print(f'f-score: {f1_score(y_test, y_pred)}')

Without SMOTE, I tried the following for LOGO CV. But this way I am training on a heavily imbalanced dataset.

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# X is the (standardized) feature matrix
y = np.array(df.loc[:, df.columns == 'label'])
groups = df["cow_id"].values  # because I want to leave out all data from one cow ID in each run
logo = LeaveOneGroupOut()

logo.get_n_splits(X, y, groups)

cv = logo.split(X, y, groups)

scores = []
for train_index, test_index in cv:
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
    model.fit(X_train, y_train.ravel())
    scores.append(model.score(X_test, y_test.ravel()))

How should I implement SMOTE inside a loop of leave-one-group-out CV? I am confused about how to define the group list for the synthetic training data.

Upvotes: 16

Views: 2057

Answers (1)

Muhammad Arslan

Reputation: 155

The approach suggested here for LOOCV also makes sense for leave-one-group-out: leave out one group to use as the test set, over-sample only the remaining training set, train your classifier on the over-sampled training data, and evaluate it on the held-out group.

In your case, the following code would be the correct way to implement SMOTE inside the LOGO CV loop.

for train_index, test_index in cv:
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
    # Fit SMOTE on the training folds only, so no synthetic data leaks into the held-out group
    sm = SMOTE()
    X_train_oversampled, y_train_oversampled = sm.fit_resample(X_train, y_train)
    model.fit(X_train_oversampled, y_train_oversampled.ravel())
    scores.append(model.score(X_test, y_test.ravel()))
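
For completeness, here is a minimal end-to-end sketch of the same idea, assuming a DataFrame df with a binary "label" column and a "cow_id" group column as in the question, and using LogisticRegression as a stand-in classifier. It also answers the group-list confusion: the synthetic rows never need group labels, because groups are consumed by logo.split() when the folds are created, before SMOTE is ever applied.

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.linear_model import LogisticRegression  # placeholder classifier
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

X = df.drop(columns=["label", "cow_id"]).values
y = df["label"].values
groups = df["cow_id"].values

logo = LeaveOneGroupOut()
scores = []

for train_index, test_index in logo.split(X, y, groups):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Oversample only the training cows; the held-out cow stays untouched.
    # No group labels are needed for the synthetic samples, since the
    # splitting has already happened at this point.
    # Note: SMOTE's default k_neighbors=5 requires at least 6 minority
    # samples in the training folds.
    X_train_os, y_train_os = SMOTE().fit_resample(X_train, y_train)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train_os, y_train_os)
    y_pred = model.predict(X_test)
    scores.append(f1_score(y_test, y_pred))

print(f"Mean f-score across cows: {np.mean(scores):.3f}")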

Upvotes: 1
