I am pickling a model for later use, then loading it and running predict_proba on it. I get ValueError: X has 1 features per sample; expecting 319. I am not sure whether I am transforming the input correctly.
import csv, pickle
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.calibration import CalibratedClassifierCV
import numpy as np
import operator

train_data = []
train_labels = []
test_lables = []
test_lables.append("nah")

with open('training_file', 'r') as f:
    reader = csv.reader(f, dialect='excel', delimiter='\t')
    for row in reader:
        train_data.append(row[0])
        train_labels.append(row[1])

lables = []
for item in train_labels:
    if item in lables:
        continue
    else:
        lables.append(item)

def linear_svc(train_data, train_labels):
    vectorizer = TfidfVectorizer()
    train_vectors = vectorizer.fit_transform(train_data)
    classifier_linear = svm.LinearSVC()
    clf = CalibratedClassifierCV(classifier_linear)
    clf.fit(train_vectors, train_labels)
    with open('test', 'wb') as fi:
        pickle.dump(clf, fi)

def run_classifier():
    vectorizer = TfidfVectorizer()
    test_vectors = vectorizer.fit_transform(test_lables)
    with open('test', 'rb') as fi:
        clf = pickle.load(fi)
    prediction_linear = clf.predict_proba(test_vectors)
    return prediction_linear

#linear_svc(train_data, train_labels)
sorted_intent_probability = run_classifier()
print(sorted_intent_probability)
I first call linear_svc(), which pickles the model. Then I call run_classifier(). What am I doing wrong here? Also, when I combine both methods into one, it works fine:
def linear_svc(train_data, train_labels, test_lables):
    vectorizer = TfidfVectorizer()
    train_vectors = vectorizer.fit_transform(train_data)
    test_vectors = vectorizer.transform(test_lables)
    classifier_linear = svm.LinearSVC()
    clf = CalibratedClassifierCV(classifier_linear)
    clf.fit(train_vectors, train_labels)
    prediction_linear = clf.predict_proba(test_vectors)
    return prediction_linear
Do I need to pickle the vectorizer as well and reuse it later?
Upvotes: 0
Views: 525
I found the problem. When I create a new instance of TfidfVectorizer(), I am not using the same features that were used for training. I made the following change:
linear_svc_model = clf.fit(train_vectors, train_labels)
model_object = []
model_object.append(linear_svc_model)
model_object.append(vectorizer)
and pickled this model_object. At prediction time I unpickled both the classifier and the vectorizer and applied that same vectorizer to the input string. It worked.
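For anyone hitting the same error, here is a minimal self-contained sketch of the idea. It is not my actual script: the toy sentences, the two labels, the in-memory buffer, and cv=2 are all assumptions made so the example runs on a handful of rows. It first shows that a freshly fitted vectorizer produces a different number of features than the one used for training, then saves the fitted vectorizer next to the classifier and reuses it with transform() only.

```python
import io
import pickle

from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the rows of 'training_file'.
train_data = [
    "book a flight to paris", "reserve a plane ticket", "find me a flight",
    "buy a plane ticket today", "play some jazz music", "put on a rock song",
    "play my favourite song", "start the music please",
]
train_labels = ["travel"] * 4 + ["music"] * 4

vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(train_data)
# cv=2 is an assumption: the toy set is too small for the default 5 folds.
clf = CalibratedClassifierCV(LinearSVC(), cv=2)
clf.fit(train_vectors, train_labels)

# A fresh vectorizer fitted on the query alone learns its own tiny vocabulary,
# so its output width does not match what the classifier was trained on.
fresh_vectors = TfidfVectorizer().fit_transform(["play a song"])
n_train_features = train_vectors.shape[1]
n_fresh_features = fresh_vectors.shape[1]

# Pickle the fitted vectorizer together with the classifier.
buffer = io.BytesIO()  # in-memory stand-in for open('test', 'wb')
pickle.dump({"vectorizer": vectorizer, "classifier": clf}, buffer)

# Later: load both and call transform() only, never fit_transform().
buffer.seek(0)
saved = pickle.load(buffer)
test_vectors = saved["vectorizer"].transform(["play a song"])
probabilities = saved["classifier"].predict_proba(test_vectors)

print(n_train_features, n_fresh_features, probabilities.shape)
```

I used a dict here instead of a list only so the two objects can be looked up by name after loading. Another option, which I have not tried on my data, would be to wrap the vectorizer and classifier in a single sklearn Pipeline and pickle that one object.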
Upvotes: 1