torayeff

Reputation: 9702

DictVectorizer learns more features for the test set

I have the following code which works as expected:

clf = Pipeline([
    ('vectorizer', DictVectorizer(sparse=False)),
    ('classifier', DecisionTreeClassifier(criterion='entropy'))
])

clf.fit(X[:size], y[:size])

score = clf.score(X_test, y_test)

I wanted to do the same logic without using Pipeline:

v = DictVectorizer(sparse=False)

Xdv = v.fit_transform(X[:size])
Xdv_test = v.fit_transform(X_test)

clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(Xdv[:size], y[:size])

clf.score(Xdv_test, y_test)

But I receive the following error:

ValueError: Number of features of the model must match the input. Model n_features is 8251 and input n_features is 14303

It seems that DictVectorizer learns more features for the test set than for the training set. I want to know how Pipeline handles this issue and how I can accomplish the same thing without it.

Upvotes: 0

Views: 270

Answers (1)

Vivek Kumar

Reputation: 36599

Don't call fit_transform again on the test data.

Do this:

Xdv_test = v.transform(X_test)

When you call fit() or fit_transform(), the DictVectorizer forgets the features learnt during the previous call (on the training data) and re-fits from scratch, hence the different number of features.
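For illustration, here is a minimal sketch (the toy dictionaries are invented for the example) showing that transform() keeps the feature space learnt during fit, while a second fit_transform() replaces it:

from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer(sparse=False)

X_train = [{'a': 1, 'b': 2}, {'a': 3}]   # training data: features a, b
X_test = [{'a': 1, 'c': 5, 'd': 2}]      # test data introduces new features c, d

Xdv = v.fit_transform(X_train)           # learns the feature space {a, b}
print(Xdv.shape)                         # (2, 2)

print(v.transform(X_test).shape)         # (1, 2) -- unseen features c, d are ignored
print(v.fit_transform(X_test).shape)     # (1, 3) -- feature space re-learnt as {a, c, d}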

Pipeline handles the test data automatically: when you call clf.score(X_test, y_test) on the pipeline, it only calls transform() (never fit) on the intermediate steps, so the feature space learnt during training is reused.
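Roughly speaking (a simplified sketch, not the actual scikit-learn source), the pipeline's score call is equivalent to:

# approximately what clf.score(X_test, y_test) does on the fitted pipeline:
Xt = clf.named_steps['vectorizer'].transform(X_test)  # transform only, no re-fit
clf.named_steps['classifier'].score(Xt, y_test)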

Upvotes: 2
