Reputation: 393
I am working on an imbalanced dataset, created with the code below:
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1,
                           weights=[0.99], flip_y=0, random_state=1)
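For reference, the resulting class distribution is heavily skewed. A quick way to check it (a small illustrative snippet; the counts shown are approximate given weights=[0.99]):

import numpy as np
print(np.bincount(y))  # roughly [9900, 100], i.e. about a 99:1 imbalance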
I tried to remove the imbalance with SMOTE oversampling and then fit an ML model. I did this first the "normal" way and then with a pipeline.
Normal method
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# oversample the whole dataset up front, before any train/test splitting
oversampled_data = SMOTE(sampling_strategy=0.5)
X_over, y_over = oversampled_data.fit_resample(X, y)
logistic = LogisticRegression(solver='liblinear')
scoring = ['accuracy', 'precision', 'recall', 'f1']
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluating the model
scores = cross_validate(logistic, X_over, y_over, scoring=scoring, cv=cv, n_jobs=-1, return_train_score=True)
print('Accuracy: {:.2f}, Precision: {:.2f}, Recall: {:.2f}, F1: {:.2f}'.format(
    np.mean(scores['test_accuracy']), np.mean(scores['test_precision']),
    np.mean(scores['test_recall']), np.mean(scores['test_f1'])))
Output - Accuracy: 0.93, Precision: 0.92, Recall: 0.86, F1: 0.89
Pipeline
from imblearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import cross_validate

# SMOTE runs inside the pipeline, so it is re-fit on each training fold only
oversampled_data = SMOTE(sampling_strategy=0.5)
pipeline = Pipeline([('smote', oversampled_data), ('model', LogisticRegression())])
# pipeline = make_pipeline(oversampled_data, logistic)
scoring = ['accuracy', 'precision', 'recall', 'f1']
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluating the model
scores = cross_validate(pipeline, X, y, scoring=scoring, cv=cv, n_jobs=-1, return_train_score=True)
print('Accuracy: {:.2f}, Precision: {:.2f}, Recall: {:.2f}, F1: {:.2f}'.format(
    np.mean(scores['test_accuracy']), np.mean(scores['test_precision']),
    np.mean(scores['test_recall']), np.mean(scores['test_f1'])))
Output - Accuracy: 0.96, Precision: 0.19, Recall: 0.84, F1: 0.31
What am I doing wrong when using a Pipeline? Why are the precision and F1 scores so much worse with the pipeline?
Upvotes: 0
Views: 808
Reputation: 12698
In the first approach, you create the synthetic examples before splitting into training and test folds, whereas in the second (pipeline) approach the resampling happens after the split, inside each fold.
The former adds synthetic datapoints to the test folds, while the latter does not. Worse, the former inflates the scores through data leakage: the synthetic samples that land in a test fold are interpolated (in part) from datapoints in the training fold, so the model is evaluated on points it has effectively already seen. See e.g.
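To make the mechanism concrete, here is a minimal sketch of what the imblearn pipeline effectively does in each CV split (an illustration with my own variable names, not the library's internals): SMOTE is fit and applied to the training fold only, and the model is scored on the untouched, still-imbalanced test fold.

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=1)

for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=1).split(X, y):
    # resample the training fold only; the test fold keeps its real ~99:1 ratio
    X_res, y_res = SMOTE(sampling_strategy=0.5).fit_resample(X[train_idx], y[train_idx])
    model = LogisticRegression(solver='liblinear').fit(X_res, y_res)
    preds = model.predict(X[test_idx])
    print('test fold counts:', Counter(y[test_idx]),
          'precision:', round(precision_score(y[test_idx], preds), 2))

In your first approach, by contrast, fit_resample runs on the full dataset before any split, so every test fold contains synthetic points interpolated from training points, and precision is measured on a partially rebalanced (and leaky) sample rather than on the true 99:1 distribution. That is why precision drops from 0.92 to 0.19 in the pipeline run while recall barely changes: the pipeline's number is the honest one.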
Upvotes: 2