Reputation: 119
In a first project I trained and pickled a classification model that uses a bag of words with 2500 features; in this new project I want to classify new text.
How do I do that?
This is what I'm doing:
import pickle
# pickled TfidfVectorizer(max_features=2500)
vectorizer_in = open("vectorizer.pkl", "rb")
vectorizer = pickle.load(vectorizer_in)
# pickled RandomForestClassifier(n_estimators = 200, criterion = 'gini', class_weight="balanced")
classifier_in = open("classifier.pkl", "rb")
classifier = pickle.load(classifier_in)
# import libraries to clean the text
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('italian')
stopwords_set = set(stopwords.words('italian'))
# clean the input string
def cleanRow(row):
    row = re.sub('[\n\r]', ' ', row)  # note: '[\n|\r]' would also match literal '|'
    # regex here ...
    row = row.split()
    row = [stemmer.stem(word) for word in row if word not in stopwords_set]
    row = ' '.join(row)
    return row
def classify(summary, description):
    corpus = cleanRow(summary + " " + description)
    X_test = vectorizer.fit_transform([corpus]).toarray()
    print(vectorizer.get_feature_names())  # ['cas', 'computer', 'cos', 'funzion', 'part', 'pc', 'pi', 'tav']
    y_pred = classifier.predict(X_test)
    # TODO map y_pred to the right label
    return y_pred
out = classify("il computer non parte", "Stavo facendo cose a caso e non mi funziona più il pc.")
print(out)
This is the error generated:
X has 8 features per sample; expecting 2500
Indeed
vectorizer.get_feature_names()
# ['cas', 'computer', 'cos', 'funzion', 'part', 'pc', 'pi', 'tav']
but I want the original feature labels, in the same order as when the model was created and trained.
Should I pickle the original array of features and rebuild by hand a new bag-of-words table for the new text that I want to classify?
Upvotes: 1
Views: 226
Reputation: 119
As said in the comment: "in a classify function, you have to use vectorizer.transform and not fit_transform".
Using
X_test = vectorizer.transform([corpus]).toarray()
solves the problem: transform does not re-fit the vectorizer, but only builds the term matrix (using the vocabulary learned at training time) as input for the classifier.
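A minimal sketch of the difference, using a hypothetical training corpus (the three Italian sentences below are made up for illustration): transform reuses the vocabulary learned at fit time, so the feature count stays fixed, while calling fit_transform on the new document re-learns a vocabulary from that single document and produces a different feature space.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical training corpus
train_corpus = [
    "il computer non parte",
    "lo schermo resta nero",
    "la stampante non funziona",
]

vectorizer = TfidfVectorizer(max_features=2500)
X_train = vectorizer.fit_transform(train_corpus)  # learns the vocabulary

new_doc = ["non mi funziona il pc"]

# Correct: reuse the fitted vocabulary -> same number of columns as training
X_test = vectorizer.transform(new_doc)
assert X_test.shape[1] == X_train.shape[1]

# Wrong: fit_transform re-learns a vocabulary from the single new document,
# producing a different (much smaller) feature space
X_bad = TfidfVectorizer().fit_transform(new_doc)
assert X_bad.shape[1] != X_train.shape[1]
```

As for the TODO in the question: if the RandomForestClassifier was trained with the original label values as y, predict already returns those labels directly; classifier.classes_ gives the label order used by predict_proba.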
Upvotes: 1