Reputation: 145
I've written a sentiment analysis script that uses Naive Bayes to classify reviews. I trained and tested my model and saved it as a pickle file. Now I would like to run predictions on a new dataset, but I always get the following error message:
raise ValueError('dimension mismatch') ValueError: dimension mismatch
It pops up on this line:
preds = nb.predict(transformed_review)[0]
Can anyone tell me if I'm doing something wrong? I do not understand the error.
This is my script:
sno = SnowballStemmer("german")
stopwords = [word.decode('utf-8-sig') for word in stopwords.words('german')]
ta_review_files = glob.glob('C:/users/Documents/review?*.CSV')
review_akt_doc = max(ta_review_files, key=os.path.getctime)
ta_review = pd.read_csv(review_akt_doc)
sentiment_de_class= ta_review
x = sentiment_de_class['REV']
y = sentiment_de_class['SENTIMENT']
def text_process(text):
    nopunc = [char for char in text.decode('utf8') if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    noDig = ''.join(filter(lambda x: not x.isdigit(), nopunc))
    ## stemming
    stemmi = u''.join(sno.stem(unicode(x)) for x in noDig)
    stop = [word for word in stemmi.split() if word.lower() not in stopwords]
    stop = ' '.join(stop)
    return [word for word in stemmi.split() if word.lower() not in stopwords]
######################
# Matrix
######################
bow_transformer = CountVectorizer(analyzer=text_process).fit(x)
x = bow_transformer.transform(x)
######################
# Train and test data
######################
x_train, x_test, y_train, y_test = train_test_split(x,y, random_state=101)
print 'starting training ..'
######################
## first use
######################
#nb = MultinomialNB().fit(x_train,y_train)
#file = open(sentiment_MNB_path + 'sentiment_MNB_model.pickle', 'wb')
## dump information to that file
#pickle.dump(nb, file)
######################
## after train
######################
file = open(sentiment_MNB_path + 'sentiment_MNB_model.pickle', 'rb')
nb = pickle.load(file)
predis = []
######################
# Classify
######################
cols = ['SENTIMENT_CLASSIFY']
for sentiment in sentiment_de_class['REV']:
    transformed_review = bow_transformer.transform([sentiment])
    preds = nb.predict(transformed_review)[0]  ## right here I get the error
    predis.append(preds)
df = pd.DataFrame(predis, columns=cols)
Upvotes: 0
Views: 1805
Reputation: 36599
You need to save the CountVectorizer object too, just as you are saving nb.
When you call
CountVectorizer(analyzer=text_process).fit(x)
you are re-fitting the CountVectorizer on the new data, so the features (vocabulary) it finds will differ from those found at training time. The saved nb, which was trained on the earlier features, therefore complains about a dimension mismatch.
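A minimal sketch of why the widths differ. The toy German phrases and variable names are made up for illustration, and the default analyzer is used instead of text_process to keep it short:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora standing in for the old and the new review files
train_docs = ["gutes hotel", "schlechtes essen", "gutes essen"]
new_docs = ["sehr freundliches personal", "zimmer war sauber"]

# Vectorizer fitted at training time vs. one re-fitted on the new data
vec_train = CountVectorizer().fit(train_docs)
vec_new = CountVectorizer().fit(new_docs)

n_train = len(vec_train.vocabulary_)  # number of features the model was trained on
n_new = len(vec_new.vocabulary_)      # number of features after re-fitting

# Reusing the ORIGINAL vectorizer keeps the column count the model expects;
# words it has never seen are simply ignored
same_width = vec_train.transform(new_docs).shape[1]
# The re-fitted vectorizer produces a matrix with a different column count
refit_width = vec_new.transform(new_docs).shape[1]

print(n_train, n_new, same_width, refit_width)
```

Only the matrix produced by the original vectorizer has the column count that the saved classifier was fitted on.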
It is better to pickle them in separate files, but if you want, you can save them in the same file.
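A rough sketch of the separate-files variant. The file names, the temporary directory and the toy data are assumptions for illustration, not taken from the question:

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data
train_docs = ["gutes hotel", "tolles essen", "schlechtes zimmer", "schlechter service"]
train_labels = ["pos", "pos", "neg", "neg"]

# First run: fit the vectorizer and the classifier together
vectorizer = CountVectorizer().fit(train_docs)
model = MultinomialNB().fit(vectorizer.transform(train_docs), train_labels)

# Save each object to its own file (paths are placeholders)
out_dir = tempfile.mkdtemp()
vec_path = os.path.join(out_dir, "vectorizer.pickle")
model_path = os.path.join(out_dir, "model.pickle")
with open(vec_path, "wb") as f:
    pickle.dump(vectorizer, f)
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Later run: load both and only call transform(), never fit(), on new data
with open(vec_path, "rb") as f:
    loaded_vectorizer = pickle.load(f)
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

new_docs = ["gutes essen", "schlechtes hotel"]
predictions = loaded_model.predict(loaded_vectorizer.transform(new_docs))
print(predictions)
```

One caveat when a custom function such as text_process is passed as analyzer: pickle stores functions by reference, so the same function has to be defined or importable under the same name in the script that loads the vectorizer.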
To pickle both into the same file:
file = open(sentiment_MNB_path + 'sentiment_MNB_model.pickle', 'wb')
pickle.dump(bow_transformer, file)  # <=== add this
pickle.dump(nb, file)
To read both back in the next run:
file = open(sentiment_MNB_path + 'sentiment_MNB_model.pickle', 'rb')
bow_transformer = pickle.load(file)
nb = pickle.load(file)
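With several objects in one file, pickle.load returns them in the same order in which they were dumped. A small standard-library sketch of that round trip, where plain dictionaries are hypothetical stand-ins for the fitted vectorizer and classifier and the path is a temporary placeholder:

```python
import os
import pickle
import tempfile

# Stand-ins for the fitted objects (placeholders, not real sklearn objects)
bow_transformer = {"vocabulary": {"gut": 0, "schlecht": 1}}
nb = {"classes": ["neg", "pos"]}

path = os.path.join(tempfile.mkdtemp(), "sentiment_MNB_model.pickle")

# Dump order: vectorizer first, then classifier
with open(path, "wb") as f:
    pickle.dump(bow_transformer, f)
    pickle.dump(nb, f)

# Load order must match the dump order
with open(path, "rb") as f:
    loaded_transformer = pickle.load(f)
    loaded_nb = pickle.load(f)

print(loaded_transformer, loaded_nb)
```

Using with blocks also makes sure the file is closed, and therefore fully written, before it is read again.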
Please look at this answer for more detail: https://stackoverflow.com/a/15463472/3374996
Upvotes: 1