Reputation: 455
I have a dataset that I want to run sklearn's SVC model on. Some of the feature values have magnitudes in the range [0, 1e+7]. I have tried using SVC without preprocessing, and I either get unacceptably long compute times or 0 true positive predictions. I am therefore trying to add a preprocessing step, specifically MinMaxScaler.
My code so far:
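# Imports for the pre-0.18 scikit-learn API used below (newer releases
# provide StratifiedShuffleSplit and GridSearchCV in sklearn.model_selection).
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# feature_min, feature_max, features_train and labels_train are defined elsewhere.
selection_KBest = SelectKBest()
selection_PCA = PCA()
combined_features = FeatureUnion([("pca", selection_PCA),
                                  ("univ_select", selection_KBest)])
param_grid = dict(features__pca__n_components=range(feature_min, feature_max),
                  features__univ_select__k=range(feature_min, feature_max))
svm = SVC()
pipeline = Pipeline([("features", combined_features),
                     ("scale", MinMaxScaler(feature_range=(0, 1))),
                     ("svm", svm)])
param_grid["svm__C"] = [0.1, 1, 10]
cv = StratifiedShuffleSplit(y=labels_train,
                            n_iter=10,
                            test_size=0.1,
                            random_state=42)
grid_search = GridSearchCV(pipeline,
                           param_grid=param_grid,
                           verbose=1,
                           cv=cv)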
grid_search.fit(features_train, labels_train)
print("grid_search.best_estimator_:", grid_search.best_estimator_)
My question is specifically about this line:
pipeline = Pipeline([("features", combined_features),
                     ("scale", MinMaxScaler(feature_range=(0, 1))),
                     ("svm", svm)])
I would like to know the best order for the features, scale, and svm steps in pipeline. Specifically, I cannot decide whether features and scale should be swapped from their current order.
Note 1: I would like to use grid_search.best_estimator_ as my classifier model going forward for predictions.
Note 2: My concern is how to formulate pipeline correctly so that, at prediction time, the features are selected the same way as in the training step AND scaled.
Note 3: I notice that svm doesn't appear in my grid_search.best_estimator_ result. Does this mean it is not being invoked correctly?
Below are some results that indicate that order may matter:
pipeline = Pipeline([("scale", MinMaxScaler(feature_range=(0, 1))),
                     ("features", combined_features),
                     ("svm", svm)]):
Pipeline(steps=[('scale', MinMaxScaler(copy=True, feature_range=(0, 1)))
('features', FeatureUnion(n_jobs=1, transformer_list=[('pca', PCA(copy=True,
n_components=11, whiten=False)), ('univ_select', SelectKBest(k=2,
score_func=<function f_classif at 0x000000001ED61208>))],
transformer_weights=...f', max_iter=-1, probability=False,
random_state=None, shrinking=True, tol=0.001, verbose=False))])
Accuracy: 0.86247 Precision: 0.38947 Recall: 0.05550
F1: 0.09716 F2: 0.06699 Total predictions: 15000
True positives: 111 False positives: 174
False negatives: 1889 True negatives: 12826
pipeline = Pipeline([("features", combined_features),
                     ("scale", MinMaxScaler(feature_range=(0, 1))),
                     ("svm", svm)]):
Pipeline(steps=[('features', FeatureUnion(n_jobs=1,
transformer_list=[('pca', PCA(copy=True, n_components=1, whiten=False)),
('univ_select', SelectKBest(k=1, score_func=<function f_classif at
0x000000001ED61208>))],
transformer_weights=None)), ('scale', MinMaxScaler(copy=True, feature_range=
(0,...f', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False))])
Accuracy: 0.86680 Precision: 0.50463 Recall: 0.05450
F1: 0.09838 F2: 0.06633 Total predictions: 15000
True positives: 109 False positives: 107
False negatives: 1891 True negatives: 12893
EDIT 1 16041310:
Note 3 resolved. Use grid_search.best_estimator_.steps to get the full list of steps.
Upvotes: 3
Views: 2658
Reputation: 36555
GridSearchCV has a refit parameter (which defaults to True), which means that the best estimator will be refit against the full dataset passed to fit. You will then be able to access this estimator through best_estimator_, or simply call predict on your GridSearchCV object, which delegates to it.
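As a minimal sketch of that behaviour: the snippet below assumes a recent scikit-learn (where GridSearchCV lives in sklearn.model_selection) and uses synthetic data from make_classification in place of your features_train and labels_train. It only checks that predicting through the grid search object and through best_estimator_ gives the same labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real training data (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipeline = Pipeline([("scale", MinMaxScaler()), ("svm", SVC())])

# refit is left at its default (True), so the best pipeline is refit on all of X, y.
grid_search = GridSearchCV(pipeline, param_grid={"svm__C": [0.1, 1, 10]}, cv=3)
grid_search.fit(X, y)

# Predicting through the search object delegates to the refit best_estimator_.
same = np.array_equal(grid_search.predict(X),
                      grid_search.best_estimator_.predict(X))
print(same)
```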
best_estimator_ will be the full pipeline, so if you call predict on it, you'll get the same preprocessing steps as in your training stage.
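To make this concrete, here is a small sketch that compares pipeline.predict with applying the fitted steps by hand. It assumes a recent scikit-learn, synthetic data from make_classification, and arbitrary values for n_components and k; the step names mirror the ones in the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data; the last 100 rows play the role of unseen samples.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, y_train, X_new = X[:200], y[:200], X[200:]

features = FeatureUnion([("pca", PCA(n_components=3)),
                         ("univ_select", SelectKBest(k=2))])
pipeline = Pipeline([("features", features),
                     ("scale", MinMaxScaler()),
                     ("svm", SVC())])
pipeline.fit(X_train, y_train)

# Apply the fitted steps manually, in pipeline order, to the new samples.
step_features = pipeline.named_steps["features"]
step_scale = pipeline.named_steps["scale"]
step_svm = pipeline.named_steps["svm"]
manual = step_svm.predict(step_scale.transform(step_features.transform(X_new)))

# The pipeline's own predict runs the same transforms before the classifier.
matches = np.array_equal(pipeline.predict(X_new), manual)
print(matches)
```

The transformers are only fitted on the training rows; at prediction time the pipeline calls transform on each step, never fit, so the selection and scaling learned during training are reused as-is.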
If you want to print out all the steps, you could do
print(grid_search.best_estimator_.steps)
or
# steps is a list of (name, estimator) tuples, so unpack each pair
for name, step in grid_search.best_estimator_.steps:
    print(name, type(step))
    print(step.get_params())
Upvotes: 1