GridSearchCV fit takes 19 times longer than the base classifier even when there's only one possible set of params

Question

I have written a simple benchmark that shows that using the GridSearchCV fit function in scikit-learn with the base classifier as LogisticRegression and only one set of possible hyperparameters takes at least 8 times and up to 19 times longer than just using the fit function of the base classifier. Any idea why this big difference is happening? Here's the code:

import time

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, ShuffleSplit

iris = load_iris()
logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200,
                              random_state=0, n_jobs=36)
distributions = dict(C=[1], penalty=['l1'])

# A single fixed train/test split, passed to the search as its only CV "fold"
ls = [next(ShuffleSplit(n_splits=1, test_size=.25, random_state=0).split(iris.data))]
train_X = iris.data[ls[0][0]]
train_Y = iris.target[ls[0][0]]

n = 20
tot_t = 0
for _ in range(n):
    t0 = time.time()
    clf = GridSearchCV(logistic, distributions, n_jobs=36, cv=ls)
    search = clf.fit(iris.data, iris.target)
    tot_t += time.time() - t0
print(f"avg = {tot_t / n}")

tot_t = 0
for _ in range(n):
    t0 = time.time()
    clf = logistic
    search = clf.fit(train_X, train_Y)
    tot_t += time.time() - t0
print(f"avg = {tot_t / n}")

The results for clf = GSCV and clf = logistic are 0.324 and 0.017 seconds, respectively (GSCV is 19 times slower). Note that for GSCV there is only one possible set of hyperparameters (C=1, penalty='l1'), so GSCV has to evaluate only one candidate rather than several, and there is no real cross-validation (only one train/test split is given to it), yet it takes much more time!
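A variation of the benchmark that might help narrow this down is sketched below. It rests on assumptions that are not measured above: that part of the gap comes from GridSearchCV's default refit=True (which fits the best candidate a second time, on all of the data passed to fit) and part from starting worker processes for n_jobs=36. The helper avg_fit_time and the dictionary keys are hypothetical names introduced for illustration, and the search runs with n_jobs=None so that process start-up is excluded.

```python
import time

from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ShuffleSplit

X, y = load_iris(return_X_y=True)

# Same single train/test split as in the benchmark above.
cv = list(ShuffleSplit(n_splits=1, test_size=0.25, random_state=0).split(X))
train_idx = cv[0][0]

base = LogisticRegression(solver="saga", tol=1e-2, max_iter=200, random_state=0)
grid = dict(C=[1], penalty=["l1"])


def avg_fit_time(make_estimator, X, y, n=5):
    """Average wall-clock time of fit() over n freshly built estimators."""
    total = 0.0
    for _ in range(n):
        est = make_estimator()
        t0 = time.perf_counter()
        est.fit(X, y)
        total += time.perf_counter() - t0
    return total / n


timings = {
    # Plain fit on the training part of the split only.
    "base": avg_fit_time(lambda: clone(base), X[train_idx], y[train_idx]),
    # Default behaviour: one fit on the split, scoring, then a refit on all data.
    "gscv_refit": avg_fit_time(
        lambda: GridSearchCV(base, grid, cv=cv, refit=True, n_jobs=None), X, y
    ),
    # Assumption being tested: skipping the refit removes one of the two fits.
    "gscv_no_refit": avg_fit_time(
        lambda: GridSearchCV(base, grid, cv=cv, refit=False, n_jobs=None), X, y
    ),
}

for name, seconds in timings.items():
    print(f"{name}: avg = {seconds:.4f}")
```

If the gscv_no_refit timing comes out close to the base timing, the remaining gap in the original benchmark would more likely be parallel-backend overhead than model fitting. That is a hypothesis to check, not a conclusion.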

If I make iris.data and iris.target 100 times larger:

iris.data = np.repeat(iris.data, 100, axis=0)
iris.target = np.repeat(iris.target, 100, axis=0)

I get these results: 0.528 and 0.064 (8 times slower). With 1000 times larger iris.data and iris.target: 2.70 and 0.34 (8 times slower).

I also tested with the normal iris.data and iris.target using RandomizedSearchCV:

clf = RandomizedSearchCV(logistic, distributions, random_state=2, n_jobs=36, cv=ls)

and got these results: 0.337 and 0.013 (26 times slower).
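To make the comparison easier to read, the slowdown factors and the absolute overhead can be recomputed from the averages quoted above. This is only arithmetic on the reported numbers (search time first, base classifier time second); the labels and variable names are hypothetical and introduced for illustration.

```python
# Reported average fit times in seconds: (search time, base classifier time).
reported = {
    "GridSearchCV, iris": (0.324, 0.017),
    "GridSearchCV, iris x100": (0.528, 0.064),
    "GridSearchCV, iris x1000": (2.70, 0.34),
    "RandomizedSearchCV, iris": (0.337, 0.013),
}

# Relative slowdown: how many times longer the search takes than the base fit.
ratios = {name: search / base for name, (search, base) in reported.items()}

# Absolute overhead: extra seconds spent by the search on top of the base fit.
overheads = {name: search - base for name, (search, base) in reported.items()}

for name in reported:
    print(f"{name}: {ratios[name]:.1f}x slower, {overheads[name]:.3f} s extra")
```

Looking at both columns is useful because the ratio shrinks as the data grows while the extra seconds increase, so the overhead is neither purely constant nor purely proportional.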
