PCA on data and training with SVM with K-fold CV and Gridsearch

Question

I need to train an SVM model using LinearSVC with 10-fold cross-validation and an internal 2-fold grid search to optimize gamma and C. I also have to apply PCA to my data to reduce its dimensionality. Should I apply PCA before the loop where the CV and model training happen, or within it? In the latter case I would get a different number of principal components in each iteration; is there a disadvantage to that?

Antoine Dubuis · Accepted Answer

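The best solution is to create a sklearn Pipeline and put both steps (PCA and LinearSVC) in it. The pipeline is a single object that implements fit() and predict(), so it can be passed to GridSearchCV like any other estimator. PCA is then refit on the training portion of each split only, so nothing from the held-out fold leaks into the projection, and the number of components becomes one more hyperparameter the search can tune.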

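from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV

# Step 1 reduces dimensionality, step 2 classifies the projected data.
pipe = Pipeline([('pca', PCA()),
                 ('clf', LinearSVC())])

# '<step name>__<parameter>' addresses a parameter of a pipeline step.
# LinearSVC has no gamma parameter, so only C is tuned here.
params = {
    'pca__n_components': [2, 5, 10, 15],
    'clf__C': [0.5, 1, 5, 10],
}

# cv=2 gives the 2-fold inner grid search described in the question.
gs = GridSearchCV(estimator=pipe, param_grid=params, cv=2)
gs.fit(X_train, y_train)  # X_train, y_train: your own training data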
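
The question also asks for a 10-fold outer cross-validation around the grid search. One way to sketch that is to pass the GridSearchCV object itself to cross_val_score, so the whole search (PCA included) is repeated inside each outer training fold. The synthetic data from make_classification, the smaller parameter grid, the random_state values and max_iter=10000 are assumptions made only so the example runs on its own; substitute your own data and grid.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption): 200 samples, 20 features.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=8, random_state=0)

pipe = Pipeline([('pca', PCA()),
                 ('clf', LinearSVC(max_iter=10000))])
params = {
    'pca__n_components': [2, 5, 10],
    'clf__C': [0.5, 1, 5],
}

# Inner loop: 2-fold grid search that picks n_components and C.
inner_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
# Outer loop: 10-fold CV that estimates how well the tuned pipeline generalizes.
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

gs = GridSearchCV(estimator=pipe, param_grid=params, cv=inner_cv)

# Each outer fold reruns the full grid search on its own training part,
# so the chosen n_components and C may differ from fold to fold.
scores = cross_val_score(gs, X, y, cv=outer_cv)
print("mean accuracy: %.3f, std: %.3f" % (scores.mean(), scores.std()))
```

Different folds selecting different numbers of components is expected here and is not a problem: the outer scores evaluate the tuning procedure as a whole, not one fixed parameter setting.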
