Diamond

Reputation: 598

GridSearchCV fit takes 19 times longer than the base classifier even when there's only one possible set of params

I have written a simple benchmark that shows that using the GridSearchCV fit function in scikit-learn with the base classifier as LogisticRegression and only one set of possible hyperparameters takes at least 8 times and up to 19 times longer than just using the fit function of the base classifier. Any idea why this big difference is happening? Here's the code:

import time

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, ShuffleSplit

iris = load_iris()
logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200,
                              random_state=0, n_jobs=36)
distributions = dict(C=[1], penalty=['l1'])

ls = [next(ShuffleSplit(n_splits=1, test_size=.25, random_state=0).split(iris.data))]
train_X = iris.data[ls[0][0]]
train_Y = iris.target[ls[0][0]]

n = 20
tot_t = 0
for _ in range(n):
    t0 = time.time()
    clf = GridSearchCV(logistic, distributions, n_jobs=36, cv=ls)
    search = clf.fit(iris.data, iris.target)
    tot_t += time.time() - t0
print(f"avg = {tot_t / n}")

tot_t = 0
for _ in range(n):
    t0 = time.time()
    clf = logistic
    search = clf.fit(train_X, train_Y)
    tot_t += time.time() - t0
print(f"avg = {tot_t / n}")

The results for clf = GSCV and clf = logistic are 0.324 and 0.017 seconds, respectively (GSCV is 19 times slower). Note that for GSCV there is only one possible set of hyperparameters (C=1, penalty='l1'), so GSCV has to fit only one clf, and there is no real CV (it is given a single train/test split), yet it takes much more time!

If I make iris.data and iris.target 100 times larger:

iris.data = np.repeat(iris.data, 100, axis=0)
iris.target = np.repeat(iris.target, 100, axis=0)

I get these results: 0.528 and 0.064 (8 times slower). With 1000 times larger iris.data and iris.target: 2.70 and 0.34 (8 times slower).

I also tested RandomizedSearchCV with the original iris.data and iris.target:

clf = RandomizedSearchCV(logistic, distributions, random_state=2, n_jobs=36, cv=ls)

and got these results: 0.337 and 0.013 (26 times slower).

Upvotes: 1

Views: 504

Answers (1)

StupidWolf

Reputation: 46888

I see that you are requesting 36 jobs for the logistic regression itself and then, on top of this, parallelizing again in GridSearchCV with n_jobs=36. This parallelizes twice and might slow down your processes.

For logistic regression, n_jobs only has an effect when there are multiple classes, as stated in the documentation:

n_jobs int, default=None
Number of CPU cores used when parallelizing over classes if multi_class='ovr'.

If you have 2 or 3 classes, like in iris, it doesn't quite matter, so you can do:

logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200,
                              random_state=0, n_jobs=None)

GridSearchCV does more than fit the model: it also scores the model on each test split and summarizes the results across splits, and by default it refits the best model on the full dataset. For a fair comparison, set refit=False so that it skips this final refit. In this example, setting n_jobs=None also ensures that it doesn't run parallel processes. We can also build a test set so that the plain classifier performs the scoring step as well:

test_X = iris.data[ls[0][1]]
test_Y = iris.target[ls[0][1]]

For example, if I run:

n = 20
tot_t = 0
for _ in range(n):
    t0 = time.time()
    clf = GridSearchCV(logistic, distributions, n_jobs=None, cv=ls, refit = False)
    search = clf.fit(iris.data, iris.target)
    tot_t += time.time() - t0
print(f"avg = {tot_t / n}")

tot_t = 0
for _ in range(n):
    t0 = time.time()
    clf = logistic
    search = clf.fit(train_X, train_Y)
    score_test = clf.score(test_X,test_Y)
    tot_t += time.time() - t0
print(f"avg = {tot_t / n}")

I get:

avg = 0.0036052584648132322
avg = 0.0019430875778198241

So the difference is not so huge, once you account for the other calculations made by GridSearchCV.
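To see roughly where the time goes, one option is to read the timings that GridSearchCV records itself. The following is a minimal sketch, not a rigorous benchmark: it assumes refit=True so that refit_time_ is available, and it reduces the grid to C alone to keep the example small. The "overhead" figure is simply whatever remains of the wall-clock time after subtracting the recorded fit, score and refit times.

```python
import time

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ShuffleSplit

X, y = load_iris(return_X_y=True)

# One fixed train/test split, as in the question.
split = list(ShuffleSplit(n_splits=1, test_size=0.25, random_state=0).split(X))

# n_jobs is left at its default (None) in the estimator.
logistic = LogisticRegression(solver="saga", tol=1e-2, max_iter=200,
                              random_state=0)

# Assumption: a one-candidate grid over C only, to keep the sketch small.
param_grid = {"C": [1.0]}

search = GridSearchCV(logistic, param_grid, cv=split, n_jobs=None, refit=True)

t0 = time.perf_counter()
search.fit(X, y)
total = time.perf_counter() - t0

# Timings recorded by GridSearchCV for the single candidate.
fit_time = search.cv_results_["mean_fit_time"][0]
score_time = search.cv_results_["mean_score_time"][0]
refit_time = search.refit_time_  # only set when refit is not False

# Whatever is left: cloning, bookkeeping, building cv_results_, and so on.
overhead = total - (fit_time + score_time + refit_time)

print(f"total    = {total:.5f} s")
print(f"fit      = {fit_time:.5f} s")
print(f"score    = {score_time:.5f} s")
print(f"refit    = {refit_time:.5f} s")
print(f"overhead = {overhead:.5f} s")
```

On a dataset as small as iris, the fit itself may be only part of the total, which is consistent with the roughly 2x gap measured above.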

In your case, I believe the GridSearchCV slowdown is caused by setting n_jobs=36 in both LogisticRegression and GridSearchCV. Most likely you only want to set it once, in GridSearchCV, when testing many hyperparameter combinations.
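As a rough sketch of that setup, parallelism can be requested only at the search level while the estimator keeps its default n_jobs. The grid values, cv=5 and n_jobs=2 below are hypothetical choices for illustration, not taken from the question:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# The estimator stays single-job: n_jobs is left at its default (None).
logistic = LogisticRegression(solver="saga", tol=1e-2, max_iter=200,
                              random_state=0)

# Hypothetical grid with several candidates, so there is work to distribute.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Parallelism is requested once, at the GridSearchCV level.
search = GridSearchCV(logistic, param_grid, cv=5, n_jobs=2)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
print(n_candidates, search.best_params_)
```

With 4 candidates and 5 folds there are 20 fits to spread across workers. With a single candidate and a single split, as in the question, there is nothing to distribute and the worker start-up cost is pure overhead.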

Upvotes: 1
