Deniss

Reputation: 21

Using pipeline, SMOTE, and GridSearchCV together

I write this code:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline   # imblearn's Pipeline resamples only the training folds
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

LR = LogisticRegression()

pipe_lr = Pipeline([
    ('oversampling', SMOTE()),
    ('LR', LR)
])

C_list_lr = [0.001, 0.01, 0.1, 1, 10, 100]
solver_list_lr = ['liblinear', 'newton-cg', 'saga']
penalty_list_lr = [None, 'elasticnet', 'l1', 'l2']
max_iter_list_lr = [100, 1000, 3000]
random_state_list_lr = [None, 20, 42]
param_grid_lr = {
    'LR__C': C_list_lr, 
    'LR__solver': solver_list_lr,
    'LR__penalty': penalty_list_lr,
    'LR__max_iter': max_iter_list_lr,
    'LR__random_state': random_state_list_lr
}

grid_lr = GridSearchCV(pipe_lr, param_grid_lr, cv=5, scoring='accuracy', return_train_score=False)
grid_lr.fit(x1_train, y1_train)

I have two questions:

  1. Is the code correct?
  2. Is it normal to obtain a lower accuracy score this way than when simply using LogisticRegression with parameters I chose myself, without oversampling?

I work with a dataset containing 4024 samples. It is a binary classification problem with ~3400 examples in one class and just 624 in the other. When I ran the same algorithm on the dataset without any over/under-sampling, I got an accuracy of 0.89, but after oversampling and GridSearchCV only 0.83.

Upvotes: 2

Views: 544

Answers (1)

Alexander L. Hayes

Reputation: 4273

Brief answers:

  1. The code does not make any egregious errors: using a Pipeline helps avoid most of the worst mistakes. However, the parameter grid contains invalid combinations (e.g. penalty="elasticnet" requires an l1_ratio, and liblinear/newton-cg do not support every penalty in the grid), so many fits will fail and fill cv_results_ with NaN scores. I've added suggestions below.
  2. This is possible, but remember: accuracy is not a good metric on imbalanced learning problems since it is sensitive to class proportions; on your data, always predicting the majority class already scores about 3400/4024 ≈ 0.85 (see the sanity check below). SMOTE also modifies the feature space during learning, so simpler baselines like ROS/RUS are worth testing (a sketch follows the grid-search example).
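
For instance, a minimal sanity check (assuming the x1_train and y1_train from the question) shows what accuracy alone hides on data this imbalanced:

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# A constant majority-class predictor scores ~3400/4024 ≈ 0.845 accuracy,
# but only 0.5 balanced accuracy.
dummy = DummyClassifier(strategy="most_frequent").fit(x1_train, y1_train)
print(accuracy_score(y1_train, dummy.predict(x1_train)))
print(balanced_accuracy_score(y1_train, dummy.predict(x1_train)))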

Here's a grid search using the saga solver (which supports all penalty parameters) that selects for balanced accuracy:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

pipe_lr = Pipeline([
    ('scale', StandardScaler()),     # L2 penalty is problematic with unscaled features
    ('oversampling', SMOTE()),
    ('LR', LogisticRegression(solver="saga")),
])

param_grid_lr = {
    'LR__C': [0.001, 0.01, 0.1],
    'LR__l1_ratio': [0.2, 0.4, 0.6, 0.8],
    'LR__penalty': [None, 'elasticnet', 'l1', 'l2'],
}

grid_lr = GridSearchCV(pipe_lr, param_grid_lr, cv=5, scoring='balanced_accuracy', verbose=3)

import warnings
with warnings.catch_warnings():
    # Warning filter is optional, but fit will warn when parameters go unused.
    warnings.simplefilter("ignore")
    grid_lr.fit(X, y)

print(grid_lr.best_params_)
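
As a baseline comparison, here is a sketch (same assumed imports and X, y as above) that swaps SMOTE for RandomOverSampler, which duplicates existing minority samples instead of synthesizing new ones:

from imblearn.over_sampling import RandomOverSampler

pipe_ros = Pipeline([
    ('scale', StandardScaler()),
    ('oversampling', RandomOverSampler(random_state=0)),
    ('LR', LogisticRegression(solver="saga")),
])

grid_ros = GridSearchCV(pipe_ros, param_grid_lr, cv=5, scoring='balanced_accuracy')
grid_ros.fit(X, y)
print(grid_ros.best_score_, grid_lr.best_score_)  # ROS vs. SMOTE best CV scores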

Upvotes: 1
