Reputation: 8018
I am creating a GridSearchCV classifier as follows:
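from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True)),
    ('clf', LogisticRegression())
])
# Empty grid: only the pipeline's default hyperparameters are evaluated
parameters = {}
gridSearchClassifier = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=1, scoring='accuracy')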
# Fit/train the gridSearchClassifier on Training Set
gridSearchClassifier.fit(Xtrain, ytrain)
This works well, and I can predict. However, I now want to retrain the classifier, and for this I want to call fit_transform() on some feedback data:
gridSearchClassifier.fit_transform(Xnew, yNew)
But I get this error:
AttributeError: 'GridSearchCV' object has no attribute 'fit_transform'
Basically, I am trying to call fit_transform() on the classifier's internal TfidfVectorizer. I know that I can access the Pipeline's internal components using the named_steps attribute. Can I do something similar for the gridSearchClassifier?
Upvotes: 5
Views: 3733
Reputation: 8270
@lejot is correct that you should call fit() on the gridSearchClassifier.
Provided refit=True is set on the GridSearchCV, which is the default, you can access best_estimator_ on the fitted gridSearchClassifier.
You can access the already fitted steps:
tfidf = gridSearchClassifier.best_estimator_.named_steps['vect']
clf = gridSearchClassifier.best_estimator_.named_steps['clf']
You can then transform new text in new_X using:
X_vec = tfidf.transform(new_X)
You can make predictions using this X_vec with:
y_pred = clf.predict(X_vec)
You can also make predictions for new text by passing it through the entire pipeline with:
y_pred = gridSearchClassifier.predict(new_X)
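The two prediction routes above should agree. Here is a minimal, self-contained sketch to check that, assuming a recent scikit-learn where GridSearchCV lives in sklearn.model_selection. The toy corpus, the labels, the clf__C grid and cv=2 are invented for illustration and are not from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy training data, made up purely for this sketch.
Xtrain = [
    "great product works well", "excellent quality very happy",
    "love it highly recommend", "fantastic value great buy",
    "terrible product broke quickly", "awful quality very unhappy",
    "hate it do not recommend", "horrible waste of money",
]
ytrain = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english", sublinear_tf=True)),
    ("clf", LogisticRegression()),
])
# Hypothetical small grid; refit=True (the default) populates best_estimator_.
gridSearchClassifier = GridSearchCV(
    pipeline, {"clf__C": [0.1, 1.0]}, cv=2, scoring="accuracy"
)
gridSearchClassifier.fit(Xtrain, ytrain)

# Pull the already fitted steps out of the refitted pipeline.
tfidf = gridSearchClassifier.best_estimator_.named_steps["vect"]
clf = gridSearchClassifier.best_estimator_.named_steps["clf"]

new_X = ["great quality", "broke quickly, waste of money"]

# Route 1: transform and predict step by step.
manual_pred = clf.predict(tfidf.transform(new_X))
# Route 2: let the grid search delegate to the whole fitted pipeline.
pipeline_pred = gridSearchClassifier.predict(new_X)

print((manual_pred == pipeline_pred).all())
```

Both routes use the same fitted vectorizer and classifier, so the step-by-step version is mainly useful when you need the intermediate TF-IDF matrix.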
Upvotes: 2
Reputation: 66825
Just call them step by step.
gridSearchClassifier.fit(Xnew, yNew)
transformed = gridSearchClassifier.transform(Xnew)
fit_transform is nothing more than these two lines of code; it is simply not implemented as a single method for GridSearchCV.
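To illustrate the "fit, then transform" equivalence in isolation, here is a small sketch using a standalone TfidfVectorizer on an invented corpus. One caveat: as far as I know, recent scikit-learn releases expose transform on a pipeline (and on a GridSearchCV wrapping it) only when the last step implements transform. For a pipeline ending in LogisticRegression, the transformation may therefore have to be taken from the vectorizer step instead.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented documents, for illustration only.
docs = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats"]

# Single call: learn the vocabulary/IDF weights and transform at once.
one_call = TfidfVectorizer().fit_transform(docs)

# Two calls: fit first, then transform the same documents.
vect = TfidfVectorizer()
vect.fit(docs)
two_calls = vect.transform(docs)

# Both routes should give the same TF-IDF matrix.
same = np.allclose(one_call.toarray(), two_calls.toarray())
print(same)
```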
From the comments it seems that you are a bit lost about what GridSearchCV actually does. It is a meta-method for fitting a model over multiple hyperparameter settings. Thus, once you call fit you get an estimator in the best_estimator_ field of your object. In your case it is a pipeline, and you can extract any part of it as usual:
gridSearchClassifier.fit(Xtrain, ytrain)
clf = gridSearchClassifier.best_estimator_
# do something with clf, its elements etc.
# for example: print(clf.named_steps['vect'])
You should not use GridSearchCV as a classifier; it is only a method of fitting hyperparameters. Once you have found them, you should work with best_estimator_ instead. However, remember that if you refit the TF-IDF vectorizer, your classifier will be useless: you cannot change the data representation and expect the old model to work well. You have to refit the whole classifier once your data changes (unless the change is carefully designed and you make sure the old dimensions mean exactly the same thing; sklearn does not support such operations, so you would have to implement this from scratch).
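A rough sketch of why refitting only the vectorizer breaks the old model: the refitted vocabulary defines a different feature space than the one the classifier's coefficients were learned on. The toy data and the build_pipeline helper are hypothetical. Retraining on the old and feedback data concatenated is one possible reading of "refit the whole classifier", not something the answer prescribes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Invented original and feedback data.
Xtrain = ["good movie", "great film", "bad movie", "awful film"]
ytrain = [1, 1, 0, 0]
Xnew = ["brilliant acting", "boring plot"]
yNew = [1, 0]

def build_pipeline():
    # Hypothetical helper: a fresh, unfitted vectorizer + classifier pipeline.
    return Pipeline([("vect", TfidfVectorizer()), ("clf", LogisticRegression())])

# The old model's coefficients are tied to the old vocabulary.
old_model = build_pipeline().fit(Xtrain, ytrain)
old_width = old_model.named_steps["clf"].coef_.shape[1]

# Refitting a vectorizer on the feedback data alone yields a different
# vocabulary, so its output no longer lines up with the old coefficients.
refit_vect = TfidfVectorizer().fit(Xnew)
new_width = len(refit_vect.vocabulary_)

# Assumed remedy: retrain the whole pipeline on old + feedback data together.
new_model = build_pipeline().fit(Xtrain + Xnew, ytrain + yNew)
combined_width = new_model.named_steps["clf"].coef_.shape[1]

print(old_width, new_width, combined_width)
```

In the setting of the question, the same idea would mean calling gridSearchClassifier.fit again on the combined data, so that the vectorizer and the classifier are refitted together.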
Upvotes: 4