Reputation: 23
I need to train an SVM model using LinearSVC with 10-fold cross-validation and an internal 2-fold grid search to optimize gamma and C. I also have to apply PCA to my data to reduce its dimensionality. Should I apply PCA before the loop, or within the loop where the CV and the training of the model happen? In the latter case I would have a different number of principal components in each iteration, but is there a disadvantage to that?
Upvotes: 1
Views: 804
Reputation: 5304
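The best solution would be to create a sklearn Pipeline and put both steps (PCA and LinearSVC) within it. This creates an object that implements fit() and predict() and that can be used within a GridSearchCV. That way PCA is refit on the training portion of each split only, so no information from the held-out fold leaks into the projection.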
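from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV

# Chain PCA and the classifier so both are fit together on each training split
pipe = Pipeline([('pca', PCA()),
                 ('clf', LinearSVC())])

# Parameters are addressed as <step name>__<parameter name>
params = {
    'pca__n_components': [2, 5, 10, 15],
    'clf__C': [0.5, 1, 5, 10],
}

# X_train and y_train are your own training data
gs = GridSearchCV(estimator=pipe, param_grid=params)
gs.fit(X_train, y_train)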
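To get the 10-fold outer CV with an internal 2-fold grid search described in the question, the grid search object can itself be passed to cross_val_score. The following is only a minimal sketch under some assumptions: the data is a synthetic stand-in from make_classification, the grid values are shortened, and the fold objects, random_state, and max_iter settings are illustrative choices, not part of the answer above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-in data (an assumption; replace with your own X and y)
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

pipe = Pipeline([("pca", PCA()),
                 ("clf", LinearSVC(max_iter=10000))])
params = {
    "pca__n_components": [2, 5, 10],
    "clf__C": [0.5, 1, 5],
}

# Inner loop: 2-fold grid search that tunes PCA and C together
inner_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
# Outer loop: 10-fold CV that scores the whole tuning procedure
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

gs = GridSearchCV(estimator=pipe, param_grid=params, cv=inner_cv)

# For each outer split, gs is refit (PCA included) on the training part only
scores = cross_val_score(gs, X, y, cv=outer_cv)
print("accuracy per outer fold:", scores)
print("mean accuracy:", scores.mean())
```

Each outer fold may select a different n_components and C. The outer scores estimate how well the whole procedure (PCA, tuning, and fitting) generalizes, not the performance of one fixed parameter set.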
Upvotes: 3