Amelio Vazquez-Reina

Reputation: 96468

GridSearchCV extremely slow on small dataset in scikit-learn

This is odd. I can successfully run the example grid_search_digits.py. However, I am unable to do a grid search on my own data.

I have the following setup:

import sklearn
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import LeaveOneOut
from sklearn.metrics import auc_score

# ... Build X and y ....

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

loo = LeaveOneOut(len(y))
clf = GridSearchCV(SVC(C=1), tuned_parameters, score_func=auc_score)
clf.fit(X, y, cv=loo)
....
print clf.best_estimator_
....

But I never get past clf.fit (I left it running for ~1 hr).

I have tried also with

clf.fit(X, y, cv=10)

and with

from sklearn.cross_validation import StratifiedKFold
skf = StratifiedKFold(y, 2)
clf.fit(X, y, cv=skf)

and had the same problem (it never finishes the clf.fit statement). My data is simple:

> X.shape
(27,26)

> y.shape
27

> numpy.sum(y)
5

> y.dtype
dtype('int64')


>?y
Type:       ndarray
String Form:[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1]
Length:     27
File:       /home/jacob04/opt/python/numpy/numpy-1.7.1/lib/python2.7/site-packages/numpy/__init__.py
Docstring:  <no docstring>
Class Docstring:
ndarray(shape, dtype=float, buffer=None, offset=0,
        strides=None, order=None)

> ?X
Type:       ndarray
String Form:
       [[ -3.61238468e+03  -3.61253920e+03  -3.61290196e+03  -3.61326679e+03
           7.84590361e+02   0.0000 <...> 0000e+00   2.22389150e+00   2.53252959e+00 
           2.11606216e+00  -1.99613432e+05  -1.99564828e+05]]
Length:     27
File:       /home/jacob04/opt/python/numpy/numpy-1.7.1/lib/python2.7/site-packages/numpy/__init__.py
Docstring:  <no docstring>
Class Docstring:
ndarray(shape, dtype=float, buffer=None, offset=0,
        strides=None, order=None)

This is all with the latest version of scikit-learn (0.13.1) and:

$ pip freeze
Cython==0.19.1
PIL==1.1.7
PyXB==1.2.2
PyYAML==3.10
argparse==1.2.1
distribute==0.6.34
epc==0.0.5
ipython==0.13.2
jedi==0.6.0
matplotlib==1.3.x
nltk==2.0.4
nose==1.3.0
numexpr==2.1
numpy==1.7.1
pandas==0.11.0
pyparsing==1.5.7
python-dateutil==2.1
pytz==2013b
rpy2==2.3.1
scikit-learn==0.13.1
scipy==0.12.0
sexpdata==0.0.3
six==1.3.0
stemming==1.0.1
-e git+https://github.com/PyTables/PyTables.git@df7b20444b0737cf34686b5d88b4e674ec85575b#egg=tables-dev
tornado==3.0.1
wsgiref==0.1.2

The odd thing is that fitting a single SVM is extremely fast:

>  %timeit clf2 = svm.SVC(); clf2.fit(X,y)                                                                                                             
1000 loops, best of 3: 328 us per loop

Update

I have noticed that if I pre-scale the data with:

from sklearn import preprocessing
X = preprocessing.scale(X) 

the grid search is extremely fast.

Why? Why is GridSearchCV so sensitive to scaling while a regular svm.SVC().fit is not?

Upvotes: 8

Views: 27525

Answers (3)

user3666197

Reputation: 1

As noted already, for SVM-based classifiers (here y == np.int*, i.e. a classification task), preprocessing is a must; otherwise the estimator's predictive capability is lost, because skewed feature scales distort the decision function.

Regarding the processing times:

  • try to get a better view of your AI/ML model's overfit/generalisation landscape over [C, gamma]
  • try to add verbosity to the initial tuning process
  • try to add n_jobs to the number crunching
  • try to move to grid computing if the scale of the problem requires it


from sklearn.grid_search import GridSearchCV          # scikit-learn 0.13.x module path

aGrid = GridSearchCV( aClassifierOBJECT,               # e.g. SVC()
                      param_grid = aGrid_of_parameters,
                      cv         = cv,
                      n_jobs     = n_JobsOnMultiCpuCores,   # parallelise the fits
                      verbose    = 5 )                      # report per-fit progress

Sometimes, GridSearchCV() can indeed take a huge amount of CPU time and CPU resources, even after all of the above tips are applied.

So keep calm and do not panic, provided you are sure the feature engineering, data sanity checks, and feature-domain preprocessing were done correctly.

[GridSearchCV] ................ C=16777216.0, gamma=0.5, score=0.761619 -62.7min
[GridSearchCV] C=16777216.0, gamma=0.5 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=0.5, score=0.792793 -64.4min
[GridSearchCV] C=16777216.0, gamma=1.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=1.0, score=0.793103 -116.4min
[GridSearchCV] C=16777216.0, gamma=1.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=1.0, score=0.794603 -205.4min
[GridSearchCV] C=16777216.0, gamma=1.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=1.0, score=0.771772 -200.9min
[GridSearchCV] C=16777216.0, gamma=2.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=2.0, score=0.713643 -446.0min
[GridSearchCV] C=16777216.0, gamma=2.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=2.0, score=0.743628 -184.6min
[GridSearchCV] C=16777216.0, gamma=2.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=2.0, score=0.761261 -281.2min
[GridSearchCV] C=16777216.0, gamma=4.0 .........................................
[GridSearchCV] ............... C=16777216.0, gamma=4.0, score=0.670165 -138.7min
[GridSearchCV] C=16777216.0, gamma=4.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=4.0, score=0.760120 -97.3min
[GridSearchCV] C=16777216.0, gamma=4.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=4.0, score=0.732733 -66.3min
[GridSearchCV] C=16777216.0, gamma=8.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=8.0, score=0.755622 -13.6min
[GridSearchCV] C=16777216.0, gamma=8.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=8.0, score=0.772114 - 4.6min
[GridSearchCV] C=16777216.0, gamma=8.0 .........................................
[GridSearchCV] ................ C=16777216.0, gamma=8.0, score=0.717718 -14.7min
[GridSearchCV] C=16777216.0, gamma=16.0 ........................................
[GridSearchCV] ............... C=16777216.0, gamma=16.0, score=0.763118 - 1.3min
[GridSearchCV] C=16777216.0, gamma=16.0 ........................................
[GridSearchCV] ............... C=16777216.0, gamma=16.0, score=0.746627 -  25.4s
[GridSearchCV] C=16777216.0, gamma=16.0 ........................................
[GridSearchCV] ............... C=16777216.0, gamma=16.0, score=0.738739 -  44.9s
[Parallel(n_jobs=1)]: Done 2700 out of 2700 | elapsed: 5670.8min finished

As for the question above about "... a regular svm.SVC().fit", kindly notice that it uses the default [C, gamma] values and therefore tells you nothing about the behaviour of your model on your problem domain.

Re: Update

Oh yes indeed, regularisation/scaling of SVM inputs is a mandatory task for this AI/ML tool. scikit-learn has good instrumentation to produce and re-use a scaler object, both for a-priori scaling (before the dataset goes into .fit()) and for ex-post, ad-hoc scaling, whenever you need to re-scale a new example and send it to the predictor via

anSvmCLASSIFIER.predict( aScalerOBJECT.transform( aNewExampleX ) )

(Yes, aNewExampleX may be a matrix, so you can ask for "vectorised" processing of several predictions at once.)
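A minimal sketch of that scaler-reuse pattern, assuming a StandardScaler and an rbf SVC; the data shapes mirror the question, but all names and values here are illustrative:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative data: 27 examples x 26 features, deliberately badly scaled
X = np.random.randn(27, 26) * 1e4
y = np.array([0] * 22 + [1] * 5)

# a-priori scaling: fit the scaler on the training data and keep it for later re-use
aScalerOBJECT   = StandardScaler().fit(X)
anSvmCLASSIFIER = SVC(kernel='rbf', C=1.0, gamma=1e-3).fit(aScalerOBJECT.transform(X), y)

# ex-post scaling: pass any new example(s) through the *same* scaler before predicting
aNewExampleX = np.random.randn(3, 26) * 1e4
print(anSvmCLASSIFIER.predict(aScalerOBJECT.transform(aNewExampleX)))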

Performance relief of the O( M^2 · N^1 ) computational complexity

In contrast to the guess posted below, that the problem "width" (N, the number of SVM features in matrix X) is to blame for the overall computing time, the SVM classifier with an rbf kernel is by design an O( M^2 · N^1 ) problem.

So there is a quadratic dependence on the overall number of observations (examples) moved into the training ( .fit() ) or cross-validation phase, and one can hardly claim that the supervised-learning classifier will gain predictive power by "reducing" the (merely linear) "width" of the features, which carry the inputs into the SVM classifier's constructed predictive power, can one?
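To make the M-versus-N asymmetry concrete, here is a rough timing sketch (not from the answer; the sizes are arbitrary and the exact timings will vary by machine):

import time
import numpy as np
from sklearn.svm import SVC

def fit_seconds(M, N):
    # time one rbf-SVC fit on random data with M examples and N features
    rng = np.random.RandomState(0)
    X = rng.randn(M, N)
    y = rng.randint(0, 2, size=M)
    t0 = time.time()
    SVC(kernel='rbf', C=1.0, gamma=0.1).fit(X, y)
    return time.time() - t0

# growing M (examples) should roughly quadruple the cost each time it doubles ...
for M in (500, 1000, 2000):
    print('M=%5d  N=50    %.3f s' % (M, fit_seconds(M, 50)))

# ... while growing N (features) should scale only about linearly
for N in (50, 100, 200):
    print('M=1000   N=%4d  %.3f s' % (N, fit_seconds(1000, N)))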

Upvotes: 13

Heapify

Reputation: 2901

I just want to point out one thing that may speed up any GridSearchCV, in case you're not using it already:

There is a parameter called n_jobs in GridSearchCV which uses multiple cores of your processor and thus speeds up the process. For example:

GridSearchCV(clf, verbose=1, param_grid=tuned_parameters, n_jobs=-1)

Specifying -1 will use all available CPU cores. If you have four cores with hyper-threading, for example, that spins up 8 concurrent workers, as opposed to the single worker you get when you don't set this parameter.
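As a minimal, self-contained sketch of that (the parameter grid here is just an example, and the core-count check is purely illustrative):

import multiprocessing
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV   # sklearn.model_selection in newer releases

print('CPU cores available: %d' % multiprocessing.cpu_count())

tuned_parameters = {'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]}

# n_jobs=-1 dispatches one worker per available core; verbose=1 reports progress
clf = GridSearchCV(SVC(), param_grid=tuned_parameters, n_jobs=-1, verbose=1)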

Upvotes: 1

Santosh

Reputation: 121

Support Vector Machines are sensitive to scaling. It is most likely that your SVC is taking a long time to build each individual model. Grid search is basically a brute-force method which runs the base model with every combination of parameters. So, if your GridSearchCV is taking a long time, it is more likely due to one of the following (see the sketch after this list):

  1. A large number of parameter combinations (which is not the case here)
  2. Your individual model takes a long time to fit.
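A quick way to see how many individual fits a grid search will launch is to multiply the number of parameter combinations by the number of CV folds. A small sketch, using the parameter grid and leave-one-out split from the question (ParameterGrid lives in sklearn.grid_search for 0.13.x, sklearn.model_selection in newer releases):

# 8 rbf candidates (2 gammas x 4 Cs) + 4 linear candidates = 12 combinations;
# leave-one-out on 27 examples gives 27 folds, i.e. 12 * 27 = 324 SVC fits in total.
from sklearn.grid_search import ParameterGrid

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

n_candidates = len(ParameterGrid(tuned_parameters))
n_folds = 27                      # LeaveOneOut on the question's 27 examples
print('total fits: %d' % (n_candidates * n_folds))   # -> total fits: 324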

Upvotes: 3
