William Gottschalk

Reputation: 261

GridSearch with SVM producing IndexError

I'm building a classifier using an SVM and want to perform a Grid Search to help automate finding the optimal model. Here's the code:

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

X.shape     # (22343, 323)
y.shape     # (22343, 1)

X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.4, random_state=0
)

tuned_parameters = [
  {
    'estimator__kernel': ['rbf'],
    'estimator__gamma': [1e-3, 1e-4],
    'estimator__C': [1, 10, 100, 1000]
  },
  {
    'estimator__kernel': ['linear'], 
    'estimator__C': [1, 10, 100, 1000]
  }
]

model_to_set = OneVsRestClassifier(SVC(), n_jobs=-1)
clf = GridSearchCV(model_to_set, tuned_parameters)
clf.fit(X_train, y_train)

and I get the following error message (this isn't the whole stack trace, just the last 3 calls):

----------------------------------------------------
/anaconda/lib/python3.5/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
     88         X, y, groups = indexable(X, y, groups)
     89         indices = np.arange(_num_samples(X))
---> 90         for test_index in self._iter_test_masks(X, y, groups):
     91             train_index = indices[np.logical_not(test_index)]
     92             test_index = indices[test_index]

/anaconda/lib/python3.5/site-packages/sklearn/model_selection/_split.py in _iter_test_masks(self, X, y, groups)
    606 
    607     def _iter_test_masks(self, X, y=None, groups=None):
--> 608         test_folds = self._make_test_folds(X, y)
    609         for i in range(self.n_splits):
    610             yield test_folds == i

/anaconda/lib/python3.5/site-packages/sklearn/model_selection/_split.py in _make_test_folds(self, X, y, groups)
    593         for test_fold_indices, per_cls_splits in enumerate(zip(*per_cls_cvs)):
    594             for cls, (_, test_split) in zip(unique_y, per_cls_splits):
--> 595                 cls_test_folds = test_folds[y == cls]
    596                 # the test split can be too big because we used
    597                 # KFold(...).split(X[:max(c, n_splits)]) when data is not 100%

IndexError: too many indices for array

Also, when I reshape y to (22343,), the error goes away but the GridSearch never finishes, even if I set tuned_parameters to only the default values.

And here are the versions for all of the packages if that helps:

Python: 3.5.2

scikit-learn: 0.18

pandas: 0.19.0

Upvotes: 1

Views: 305

Answers (1)

MMF

Reputation: 5921

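Apart from the shape of y, there seems to be no error in your implementation. The stratified splitter used by GridSearchCV expects 1-D labels, so y needs to be (22343,) rather than (22343, 1), as you already tried.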
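The IndexError itself is consistent with `y` being a column vector: the failing line in the traceback, `test_folds[y == cls]`, applies a 2-D boolean mask to a 1-D array. Here is a minimal sketch with plain NumPy, using a made-up five-sample label array (`y_col`) in place of the real data:

```python
import numpy as np

# Hypothetical stand-in for y in the question: a column vector of labels.
y_col = np.array([[0], [1], [0], [1], [0]])    # shape (5, 1)
test_folds = np.zeros(len(y_col), dtype=int)   # 1-D, one entry per sample

try:
    # Same pattern as `test_folds[y == cls]` in the traceback:
    # the mask has two dimensions, the indexed array only one.
    test_folds[y_col == 0]
    raised = False
except IndexError:
    raised = True

# Flattening the labels makes the mask 1-D, so the indexing works.
y_flat = y_col.ravel()                         # shape (5,)
selected = test_folds[y_flat == 0]
```

In the question's code this would correspond to passing `y_train.ravel()` to `clf.fit`.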

However, as the scikit-learn documentation for SVC mentions, the "fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples".

In your case, you have 22343 samples, which can lead to computational and memory problems. That is why even a cross-validation with default parameters takes a long time. Try reducing your training set to 10000 samples or fewer.
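One way to try this is to draw a random subset before running the search. The sketch below is only an illustration: `subsample` is a hypothetical helper, the data comes from `make_classification` rather than the real dataset, and the sizes and grid are kept tiny so that it runs quickly. For the real data you would pass `max_samples=10000`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC


def subsample(X, y, max_samples, seed=0):
    """Return at most max_samples randomly chosen rows of X with 1-D labels."""
    y = np.ravel(y)  # also turns an (n, 1) column vector into shape (n,)
    rng = np.random.RandomState(seed)
    n_keep = min(max_samples, X.shape[0])
    idx = rng.choice(X.shape[0], size=n_keep, replace=False)
    return X[idx], y[idx]


# Synthetic stand-in for the real data (assumption: 3 classes, 20 features).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5,
    n_classes=3, random_state=0
)
y_demo = y_demo.reshape(-1, 1)  # mimic the (n, 1) shape from the question

X_small, y_small = subsample(X_demo, y_demo, max_samples=200)

# A reduced version of the grid from the question.
param_grid = {
    'estimator__kernel': ['rbf'],
    'estimator__gamma': [1e-3, 1e-4],
    'estimator__C': [1, 10],
}
search = GridSearchCV(OneVsRestClassifier(SVC()), param_grid, cv=3)
search.fit(X_small, y_small)
print(search.best_params_)
```

Once a promising region of the grid is found on the subset, the best candidates can be refit on more data if time allows.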

Upvotes: 4
