python sklearn pickled model doesn't have the same number of features

Question

I created an SVC model using sklearn and pickled it:

clf=LinearSVC(loss='l2', dual=False, tol=1e-3)
clf.fit(X_train, y_train)
#model_file_name='classify_pages_model'

with open('our_classifier.pkl', 'wb') as fid:
    cPickle.dump(clf, fid)

Then I try to load it and use it in another file:

with open('our_classifier.pkl', 'rb') as fid:
    clf = cPickle.load(fid)

X_test=tfidf_vectorizer.fit_transform((get_text(f) for f in urls))

pred=clf.predict(X_test)

It gives me this error:

ValueError: X has 664 features per sample; expecting 47387

How can I make sure the features in my test documents are the same as in the model?

----EDIT

The problem does not happen when I do the training and testing in the same script, only when I pickle the model and load it from another script.

The following code works correctly, but when I pickle clf I am unable to perform the testing part, because the number of features in X_test is not the same as in clf.

1- Training

X_train=tfidf_vectorizer.fit_transform((read(f) for f in train_files_paths))
clf=LinearSVC(loss='l2', dual=False, tol=1e-3)
clf.fit(X_train, y_train)

2- Testing

X_test=tfidf_vectorizer.transform((get_text(f) for f in urls))
pred=clf.predict(X_test)

BartoszKP · Accepted Answer

You can't call fit_transform on the test set. Doing so refits the vectorizer, so it builds a new vocabulary from the test documents alone (664 features here) instead of reusing the 47387-feature vocabulary the classifier was trained on. It is also a form of data snooping and is discouraged. Every step that learns from data, feature extraction included, should be fitted only on the training set.

You need to pickle the feature extractor as well, and call only transform on the test data. A linked answer suggests that there should be no problem with pickling the vectorizer.
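A minimal sketch of that approach, assuming Python 3 (so pickle rather than cPickle) and a recent scikit-learn. The toy documents, the labels, and the temporary file location are hypothetical stand-ins for the question's read(f), get_text(f), and y_train; loss is left at its default here.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the question's documents and labels.
train_docs = ["cheap pills buy now", "meeting agenda for monday",
              "buy cheap watches now", "monday project status meeting"]
y_train = [1, 0, 1, 0]
test_docs = ["buy pills now", "agenda for the status meeting"]

# --- Training script: fit the vectorizer and the classifier on training data only.
tfidf_vectorizer = TfidfVectorizer()
X_train = tfidf_vectorizer.fit_transform(train_docs)
clf = LinearSVC(dual=False, tol=1e-3)
clf.fit(X_train, y_train)

# Pickle both objects together so the learned vocabulary travels with the model.
model_path = os.path.join(tempfile.mkdtemp(), "our_classifier.pkl")
with open(model_path, "wb") as fid:
    pickle.dump((tfidf_vectorizer, clf), fid)

# --- Testing script: load both objects and call transform only, never fit_transform.
with open(model_path, "rb") as fid:
    loaded_vectorizer, loaded_clf = pickle.load(fid)

X_test = loaded_vectorizer.transform(test_docs)
pred = loaded_clf.predict(X_test)

# The test matrix now has the same number of columns as the training matrix.
print(X_train.shape[1], X_test.shape[1])
```

Words in the test documents that the vectorizer never saw during training are simply ignored by transform, which is why the column count stays fixed.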
