Reputation: 598
I have written a simple benchmark that shows that using the GridSearchCV fit function in scikit-learn with the base classifier as LogisticRegression and only one set of possible hyperparameters takes at least 8 times and up to 19 times longer than just using the fit function of the base classifier. Any idea why this big difference is happening? Here's the code:
import time
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, ShuffleSplit
iris = load_iris()
logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200,
                              random_state=0, n_jobs=36)
distributions = dict(C=[1], penalty=['l1'])
ls = [next(ShuffleSplit(n_splits=1, test_size=.25, random_state=0).split(iris.data))]
train_X = iris.data[ls[0][0]]
train_Y = iris.target[ls[0][0]]
n = 20
tot_t = 0
for _ in range(n):
    t0 = time.time()
    clf = GridSearchCV(logistic, distributions, n_jobs=36, cv=ls)
    search = clf.fit(iris.data, iris.target)
    tot_t += time.time() - t0
print(f"avg = {tot_t / n}")
tot_t = 0
for _ in range(n):
    t0 = time.time()
    clf = logistic
    search = clf.fit(train_X, train_Y)
    tot_t += time.time() - t0
print(f"avg = {tot_t / n}")
The average times for clf = GSCV and clf = logistic are 0.324 and 0.017 seconds, respectively (GSCV is 19 times slower). Note that for GSCV there's only one possible set of hyperparameters (C=1, penalty='l1'), so GSCV has to fit only one clf, not several, and there's no real CV (it is given only one split), yet it takes much more time!
If I make iris.data and iris.target 100 times larger:
iris.data = np.repeat(iris.data, 100, axis=0)
iris.target = np.repeat(iris.target, 100, axis=0)
I get these results: 0.528 and 0.064 (8 times slower). With 1000 times larger iris.data and iris.target: 2.70 and 0.34 (8 times slower).
I also tested with the normal iris.data and iris.target using RandomizedSearchCV:
clf = RandomizedSearchCV(logistic, distributions, random_state=2, n_jobs=36, cv=ls)
and got these results: 0.337 and 0.013 (26 times slower).
Upvotes: 1
Views: 504
Reputation: 46888
I see that you are requesting 36 jobs for the logistic regression itself and then, on top of this, parallelizing again in GridSearchCV with n_jobs=36. This parallelizes twice and might slow down your processes.
For logistic regression, n_jobs only has an effect when parallelizing over classes, as the documentation says:
n_jobs int, default=None
Number of CPU cores used when parallelizing over classes if multi_class='ovr'.
If you have 2 or 3 classes, like in iris, it doesn't quite matter, so you can do:
logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200,random_state=0, n_jobs=None)
GridSearchCV does more than fit the model: it also calculates the score of the model on the held-out split and summarizes the scores across the CV splits. By default it then refits the best model on the full dataset, so you need to set refit=False to avoid that extra fit. Also, in this example, setting n_jobs=None ensures it doesn't run parallel processes. We can also include a test dataset so that we perform the scoring in the plain-classifier timing as well:
test_X = iris.data[ls[0][1]]
test_Y = iris.target[ls[0][1]]
For example, if I do:
n = 20
tot_t = 0
for _ in range(n):
    t0 = time.time()
    clf = GridSearchCV(logistic, distributions, n_jobs=None, cv=ls, refit=False)
    search = clf.fit(iris.data, iris.target)
    tot_t += time.time() - t0
print(f"avg = {tot_t / n}")
tot_t = 0
for _ in range(n):
    t0 = time.time()
    clf = logistic
    search = clf.fit(train_X, train_Y)
    score_test = clf.score(test_X, test_Y)
    tot_t += time.time() - t0
print(f"avg = {tot_t / n}")
I get:
avg = 0.0036052584648132322
avg = 0.0019430875778198241
So the difference is not so huge, once you account for the other calculations made by GridSearchCV.
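To see where the remaining time goes, one option is to read the timings that GridSearchCV records itself. The following is only a sketch under a few assumptions: it keeps refit=True so that refit_time_ is available, it omits penalty and n_jobs on the estimator for simplicity, and the variable names (fit_time, score_time, overhead) are just illustrative.

```python
import time
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ShuffleSplit

X, y = load_iris(return_X_y=True)
# One fixed train/test split, as in the question.
cv = list(ShuffleSplit(n_splits=1, test_size=0.25, random_state=0).split(X))

# Assumption: no n_jobs on the estimator, default penalty.
logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200,
                              random_state=0)
param_grid = {'C': [1.0]}

search = GridSearchCV(logistic, param_grid, cv=cv, n_jobs=None, refit=True)
t0 = time.perf_counter()
search.fit(X, y)
total = time.perf_counter() - t0

# Timings recorded by the search for the single candidate.
fit_time = search.cv_results_['mean_fit_time'][0]      # fit on the train split
score_time = search.cv_results_['mean_score_time'][0]  # scoring on the test split
refit_time = search.refit_time_                        # refit on the full data
overhead = total - fit_time - score_time - refit_time  # cloning, bookkeeping, etc.

print(f"total={total:.4f} fit={fit_time:.4f} score={score_time:.4f} "
      f"refit={refit_time:.4f} other={overhead:.4f}")
```

The absolute numbers depend on the machine and the scikit-learn version, so treat the breakdown as a rough guide rather than a precise profile.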
In your case, I believe the GridSearchCV slowdown is caused by setting n_jobs=36 in both LogisticRegression and GridSearchCV. Most likely you only want to set it once, in GridSearchCV, when you are testing many hyperparameters.
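As a rough illustration of parallelizing only at the search level, the sketch below leaves n_jobs unset on the estimator and sets it on GridSearchCV alone. The grid of C values, cv=5, and n_jobs=2 are assumptions chosen for the example, not values from the question.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# The estimator itself runs single-threaded (no n_jobs here).
logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200,
                              random_state=0)

# Hypothetical grid: several candidates, so there is work to distribute.
param_grid = {'C': [0.01, 0.1, 1.0, 10.0]}

# Parallelism is requested once, across candidates and CV splits.
search = GridSearchCV(logistic, param_grid, cv=5, n_jobs=2)
search.fit(X, y)

print(search.best_params_)
```

With only one candidate and one split, as in the question, there is nothing to distribute, so starting worker processes is pure overhead; the setting pays off only when the grid and the number of splits are larger.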
Upvotes: 1