Reputation: 75
I am trying to classify spam messages using scikit-learn. Once I dump both the vectorizer and the classifier into their respective .pkl files and import them in temp.py for prediction, I get this error:
raise NotFittedError(msg % {'name': type(estimator).__name__})
NotFittedError: CountVectorizer - Vocabulary wasn't fitted
After building the model, I saved it as my_model.pkl along with vectorizer.pkl and restarted my kernel, but when I load the saved model (sample.pkl) to predict on sample text, it gives the same "vocabulary wasn't fitted" error.
app.py:
import pickle
import pandas as pd
df = pd.read_csv('spam.csv', encoding="latin-1")
#Drop the columns not needed
df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)
#Create a new column label which has the same values as v1 then set the ham and spam values to 0 and 1 which is the standard format for our prediction
df['label'] = df['v1'].map({'ham': 0, 'spam': 1})
#Create a new column having the same values as v2 column
df['message'] = df['v2']
#Now drop the v1 and v2
df.drop(['v1', 'v2'], axis=1, inplace=True)
#print(df.head(10))
from sklearn.feature_extraction.text import CountVectorizer
bow_transformer = CountVectorizer().fit_transform(df['message'])
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
#Split the data
X_train, X_test, y_train, y_test = train_test_split(bow_transformer, df['label'], test_size=0.33, random_state=42)
#Naive Bayes Classifier
clf = MultinomialNB()
clf.fit(X_train,y_train)
clf.score(X_test,y_test)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
pickle.dump(bow_transformer, open("vector.pkl", "wb"))
pickle.dump(clf, open("my_model.pkl", "wb"))
temp.py (I am doing the prediction in this file):
import pickle
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer()
vectorizer = pickle.load(open("my_model.pkl", "rb"))
selector = pickle.load(open("vector.pkl", "rb"))
test_set=["heloo how are u"]
new_test=cv.transform(test_set)
Upvotes: 0
Views: 507
Reputation: 4264
In your app.py you are pickling the document-term matrix instead of the vectorizer:
pickle.dump(bow_transformer, open("vector.pkl", "wb"))
where bow_transformer is
bow_transformer = CountVectorizer().fit_transform(df['message'])
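To make the difference concrete, here is a minimal sketch, assuming two made-up messages in place of spam.csv. It shows that fit_transform returns a sparse matrix with no transform method, and that a freshly constructed CountVectorizer (like cv in temp.py) raises NotFittedError. The names messages, dtm and fresh are illustrative only.

```python
from scipy.sparse import issparse
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for df['message'].
messages = ["free prize call now", "are we meeting for lunch today"]

# fit_transform returns the document-term matrix, not the vectorizer.
dtm = CountVectorizer().fit_transform(messages)
is_matrix = issparse(dtm)
has_transform = hasattr(dtm, "transform")
print(type(dtm).__name__, "has transform():", has_transform)

# A brand-new CountVectorizer has no vocabulary, so transform fails.
fresh = CountVectorizer()
try:
    fresh.transform(["heloo how are u"])
    raised = False
except NotFittedError as exc:
    raised = True
    print("NotFittedError:", exc)
```

The exact wording of the error message can differ between scikit-learn versions, but the exception type is the same.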
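And in your temp.py, when you unpickle it, you just have the document-term matrix. On top of that, cv is a freshly created CountVectorizer that was never fitted, which is why cv.transform raises the NotFittedError. The right way to pickle it would be: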
bow_transformer = CountVectorizer().fit(df['message'])
bow_transformer_dtm = bow_transformer.transform(df['message'])
Now you can pickle your bow_transformer using
pickle.dump(bow_transformer, open("vector.pkl", "wb"))
which will be the fitted vectorizer instead of the document-term matrix. Remember to pass bow_transformer_dtm to train_test_split in place of the old matrix.
And in your temp.py you could unpickle it and use it as illustrated below:
selector = pickle.load(open("vector.pkl", "rb"))
test_set=["heloo how are u"]
new_test=selector.transform(test_set)
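Putting both halves together, a rough end-to-end sketch could look like the following. It assumes a tiny made-up dataset instead of spam.csv and writes the pickles to a temporary directory. The names messages, labels, workdir, loaded_vectorizer and loaded_clf are illustrative, not from the original scripts. Note that the original temp.py loads my_model.pkl into a variable called vectorizer and vector.pkl into selector, so the names here are chosen to match what each file actually holds.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for spam.csv (hypothetical data): 0 = ham, 1 = spam.
messages = [
    "win a free prize now",
    "free cash call now",
    "are we meeting for lunch",
    "see you at home tonight",
]
labels = [1, 1, 0, 0]

# Training side (app.py): keep the fitted vectorizer, not just its output.
vectorizer = CountVectorizer().fit(messages)
clf = MultinomialNB().fit(vectorizer.transform(messages), labels)

workdir = tempfile.mkdtemp()
vector_path = os.path.join(workdir, "vector.pkl")
model_path = os.path.join(workdir, "my_model.pkl")
with open(vector_path, "wb") as f:
    pickle.dump(vectorizer, f)
with open(model_path, "wb") as f:
    pickle.dump(clf, f)

# Prediction side (temp.py): load both objects, do not create a new vectorizer.
with open(vector_path, "rb") as f:
    loaded_vectorizer = pickle.load(f)
with open(model_path, "rb") as f:
    loaded_clf = pickle.load(f)

test_set = ["free prize now"]
new_test = loaded_vectorizer.transform(test_set)
prediction = loaded_clf.predict(new_test)
print(prediction)
```

Opening the files in with blocks closes them promptly, which the bare open(...) calls inside pickle.dump do not guarantee.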
Upvotes: 1