Reputation: 1502
I created an SVC model using sklearn and pickled it:
clf=LinearSVC(loss='l2', dual=False, tol=1e-3)
clf.fit(X_train, y_train)
#model_file_name='classify_pages_model'
with open('our_classifier.pkl', 'wb') as fid:
    cPickle.dump(clf, fid)
Then I try to load it and use it in another file:
with open('our_classifier.pkl', 'rb') as fid:
    clf = cPickle.load(fid)
X_test=tfidf_vectorizer.fit_transform((get_text(f) for f in urls))
pred=clf.predict(X_test)
It gives me this error:
ValueError: X has 664 features per sample; expecting 47387
How can I make sure the features in my test documents are the same as in the model?
----EDIT
The problem does not occur when I train and test in the same script. It only occurs when I pickle the model and load it from a different script.
The following code works correctly, but once I pickle clf and load it elsewhere, I cannot run the testing part because the number of features in X_test no longer matches what clf expects.
1- Training
X_train=tfidf_vectorizer.fit_transform((read(f) for f in train_files_paths))
clf=LinearSVC(loss='l2', dual=False, tol=1e-3)
clf.fit(X_train, y_train)
2- Testing
X_test=tfidf_vectorizer.transform((get_text(f) for f in urls))
pred=clf.predict(X_test)
Upvotes: 2
Views: 2867
Reputation: 35901
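You can't call fit_transform on the test set again. Refitting makes the vectorizer learn a new vocabulary from the test documents, which is why X_test ends up with 664 features instead of the 47387 the model was trained on. It is also a form of data snooping and is discouraged. Every step that involves learning (feature extraction being one of them) must be fitted on the training set only.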
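To illustrate the feature-count mismatch, here is a minimal sketch. The two toy corpora are made up for the example and stand in for the question's training files and URLs; only TfidfVectorizer and its fit_transform/transform methods come from scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora (hypothetical); the real text would come from read(f) / get_text(f).
train_docs = ["the cat sat on the mat", "dogs chase cats", "birds sing at dawn"]
test_docs = ["the cat sings", "a dog sat"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learns the training vocabulary

# Correct: reuse the fitted vocabulary, so the column count matches X_train.
X_test_ok = vectorizer.transform(test_docs)

# Incorrect: a refit learns a different vocabulary from the test documents.
X_test_bad = TfidfVectorizer().fit_transform(test_docs)

print(X_train.shape[1], X_test_ok.shape[1], X_test_bad.shape[1])
```

The first two numbers agree because transform maps the test text onto the training vocabulary. The third differs because the refit only sees the words in the test documents.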
You need to pickle the feature extractor as well, and call only transform on the test data. A linked answer suggests that pickling the vectorizer should not be a problem.
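A minimal sketch of that approach could look like the following. It assumes Python 3 (pickle instead of cPickle) and a recent scikit-learn, where the loss the question calls 'l2' is named 'squared_hinge' and is the default for LinearSVC. The toy documents and labels are invented for illustration, and storing both objects as one tuple in a single file is just one possible layout.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# --- Training script ---
# Hypothetical toy data; the real documents would come from read(f).
train_docs = ["cheap pills online", "meeting at noon",
              "win money now", "lunch with the team"]
y_train = [1, 0, 1, 0]

tfidf_vectorizer = TfidfVectorizer()
X_train = tfidf_vectorizer.fit_transform(train_docs)  # fit on training data only

clf = LinearSVC(dual=False, tol=1e-3)  # default loss is 'squared_hinge'
clf.fit(X_train, y_train)

# Save the fitted vectorizer together with the classifier.
with open("our_classifier.pkl", "wb") as fid:
    pickle.dump((tfidf_vectorizer, clf), fid)

# --- Prediction script (could live in another file) ---
with open("our_classifier.pkl", "rb") as fid:
    tfidf_vectorizer, clf = pickle.load(fid)

# transform, not fit_transform: reuse the vocabulary learned during training.
X_test = tfidf_vectorizer.transform(["free money online", "team meeting"])
pred = clf.predict(X_test)
```

Wrapping the vectorizer and classifier in a scikit-learn Pipeline and pickling that single object would be an alternative with the same effect.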
Upvotes: 5