Reputation: 21
I wrote this code:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # imblearn Pipeline, so SMOTE is applied inside each CV fold

LR = LogisticRegression()
pipe_lr = Pipeline([
    ('oversampling', SMOTE()),
    ('LR', LR)
])

C_list_lr = [0.001, 0.01, 0.1, 1, 10, 100]
solver_list_lr = ['liblinear', 'newton-cg', 'saga']
penalty_list_lr = [None, 'elasticnet', 'l1', 'l2']
max_iter_list_lr = [100, 1000, 3000]
random_state_list_lr = [None, 20, 42]

param_grid_lr = {
    'LR__C': C_list_lr,
    'LR__solver': solver_list_lr,
    'LR__penalty': penalty_list_lr,
    'LR__max_iter': max_iter_list_lr,
    'LR__random_state': random_state_list_lr
}

grid_lr = GridSearchCV(pipe_lr, param_grid_lr, cv=5, scoring='accuracy', return_train_score=False)
grid_lr.fit(x1_train, y1_train)
I have two questions: is this a correct way to combine SMOTE with GridSearchCV, and why do I get worse accuracy than a plain LogisticRegression with parameters chosen by myself and without oversampling? I work with a set containing 4024 samples. It is a binary classification problem and I have ~3400 examples in one class and just 624 in the second one. When I ran the same algorithm on the dataset without any over-/under-sampling, I got an accuracy of 0.89, but after oversampling and GridSearchCV only 0.83.
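For reference, here is a quick sketch (assuming the same x1_train/y1_train split) of the majority-class baseline these accuracies should be compared against:

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Always predicting the majority class already scores about 3400/4024 ≈ 0.85 accuracy.
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(x1_train, y1_train)
pred = dummy.predict(x1_train)

print(accuracy_score(y1_train, pred))           # close to the 0.83-0.89 range above
print(balanced_accuracy_score(y1_train, pred))  # 0.5 for a majority-class predictor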
Upvotes: 2
Views: 544
Reputation: 4273
Brief answers:

- Using a Pipeline helps avoid most of the worst mistakes. However, the parameter grid contains combinations that will result in a large number of NaN scores (e.g. a missing l1_ratio for penalty="elasticnet"), so I've added suggestions below.
- SMOTE also modifies the feature space during learning, so simpler baselines like random over-/under-sampling (ROS/RUS) are worth testing; a sketch that swaps in RandomOverSampler follows the grid search below.

Here's a grid search using the saga solver (which supports all penalty options) that selects for balanced accuracy:
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

pipe_lr = Pipeline([
    ('scale', StandardScaler()),  # scale first: L1/L2 penalties are sensitive to feature scale
    ('oversampling', SMOTE()),
    ('LR', LogisticRegression(solver="saga")),
])

param_grid_lr = {
    'LR__C': [0.001, 0.01, 0.1],
    'LR__l1_ratio': [0.2, 0.4, 0.6, 0.8],
    'LR__penalty': [None, 'elasticnet', 'l1', 'l2'],
}

grid_lr = GridSearchCV(pipe_lr, param_grid_lr, cv=5, scoring='balanced_accuracy', verbose=3)

import warnings
with warnings.catch_warnings():
    # The warning filter is optional; fit will warn when a parameter goes unused
    # (e.g. l1_ratio for penalties other than 'elasticnet').
    warnings.simplefilter("ignore")
    grid_lr.fit(X, y)

print(grid_lr.best_params_)
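For the ROS baseline mentioned above, a minimal sketch (continuing from the snippet above) that reuses the same grid but swaps imbalanced-learn's RandomOverSampler in for SMOTE; RandomUnderSampler works the same way:

from imblearn.over_sampling import RandomOverSampler

pipe_ros = Pipeline([
    ('scale', StandardScaler()),
    ('oversampling', RandomOverSampler()),  # duplicates minority rows instead of synthesizing new points
    ('LR', LogisticRegression(solver="saga")),
])

grid_ros = GridSearchCV(pipe_ros, param_grid_lr, cv=5, scoring='balanced_accuracy', verbose=3)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    grid_ros.fit(X, y)

print(grid_ros.best_params_, grid_ros.best_score_)

If the best scores differ only marginally, SMOTE's synthetic samples aren't buying much over plain duplication.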
Upvotes: 1