Reputation: 9702
I have the following code which works as expected:
clf = Pipeline([
('vectorizer', DictVectorizer(sparse=False)),
('classifier', DecisionTreeClassifier(criterion='entropy'))
])
clf.fit(X[:size], y[:size])
score = clf.score(X_test, y_test)
I wanted to do the same logic without using Pipeline:
v = DictVectorizer(sparse=False)
Xdv = v.fit_transform(X[:size])
Xdv_test = v.fit_transform(X_test)
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(Xdv[:size], y[:size])
clf.score(Xdv_test, y_test)
But I receive the following error:
ValueError: Number of features of the model must match the input. Model n_features is 8251 and input n_features is 14303
It seems that DictVectorizer learns more features for the test set than for the training set. How does Pipeline handle this, and how can I accomplish the same thing without it?
Upvotes: 0
Views: 270
Reputation: 36599
Don't call fit_transform again on the test data.
Do this:
Xdv_test = v.transform(X_test)
When you call fit() or fit_transform(), the DictVectorizer discards the features learned during the previous call (on the training data) and refits from scratch on the new data, hence the different number of features.
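To make the refitting behaviour concrete, here is a minimal sketch on made-up toy data (the `train` and `test` lists are hypothetical and are not the asker's dataset). It compares the feature count after fitting on the training data, after calling transform on the test data, and after refitting on the test data.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy data: the test set contains values never seen in training.
train = [{"word": "cat", "len": 3}, {"word": "dog", "len": 3}]
test = [
    {"word": "horse", "len": 5},
    {"word": "cat", "len": 3},
    {"word": "emu", "len": 3},
]

v = DictVectorizer(sparse=False)

# Fitting on the training data learns: len, word=cat, word=dog
n_train = v.fit_transform(train).shape[1]

# transform() reuses the learned features; unseen ones are silently ignored
n_test_transform = v.transform(test).shape[1]

# Refitting on the test data learns a different feature set:
# len, word=cat, word=emu, word=horse
n_test_refit = DictVectorizer(sparse=False).fit_transform(test).shape[1]

print(n_train, n_test_transform, n_test_refit)
```

Only the transform() result has the same width as the training matrix, which is what the classifier expects.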
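Pipeline handles this for you: when you call clf.score(X_test, y_test) on the pipeline, it applies transform (not fit_transform) in the vectorizer step, reusing the features learned from the training data, and then passes the result to the classifier.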
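As a rough illustration of that equivalence, the sketch below runs the Pipeline version and the manual version side by side. The toy data and `random_state=0` are assumptions added so the comparison is reproducible; they are not part of the original question.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for X, y, X_test, y_test.
X_train = [{"w": "a"}, {"w": "b"}, {"w": "a"}, {"w": "b"}]
y_train = [0, 1, 0, 1]
X_test = [{"w": "a"}, {"w": "b"}, {"w": "c"}]  # "c" is unseen in training
y_test = [0, 1, 0]

# Pipeline version: fit() fits every step, score() only transforms.
pipe = Pipeline([
    ("vectorizer", DictVectorizer(sparse=False)),
    ("classifier", DecisionTreeClassifier(criterion="entropy", random_state=0)),
])
pipe.fit(X_train, y_train)
pipe_score = pipe.score(X_test, y_test)

# Manual version: fit_transform on training data, transform on test data.
v = DictVectorizer(sparse=False)
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(v.fit_transform(X_train), y_train)
manual_score = tree.score(v.transform(X_test), y_test)

print(pipe_score == manual_score)
```

With the same data and the same random seed, the two scores should agree, because both versions fit the vectorizer once and only transform the test set.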
Upvotes: 2