Reputation: 3
I am using a dataset of textual Yelp restaurant reviews and their "star" ratings. My data is in a pandas DataFrame (df) and looks like this:
Textual review             Numeric rating
"super cool restaurant"    5
"horrible experience"      1
I have built a MultinomialNB model that predicts the "star" rating of a review (1 stands for negative, 5 stands for positive; I am using only these two categories).
import pandas as pd
import numpy as np
from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report
from nltk.corpus import stopwords
import string
df = pd.read_csv('YELP_rev.csv')
# keep only the reviews at the two extremes of the rating scale
df_class = df[(df['Numeric rating'] == 1) | (df['Numeric rating'] == 5)]
X = df_class['Textual review']
y = df_class['Numeric rating']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
nb = MultinomialNB()
# fit the model on X_train, y_train
nb.fit(X_train, y_train)
# predict on the held-out test set
pred = nb.predict(X_test)
print(confusion_matrix(y_test, pred))
# the table below is the classification report
print(classification_report(y_test, pred))
              precision    recall  f1-score   support
           1       0.43      0.33      0.38         9
           5       0.90      0.93      0.92        61
   micro avg       0.86      0.86      0.86        70
   macro avg       0.67      0.63      0.65        70
weighted avg       0.84      0.86      0.85        70
What I'm trying to do is predict the "star" rating for a user-provided restaurant review. Here is my attempt:
test_review = input("Enter a review:")
def input_process(text):
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
new_x=vectorizer.transform(input_process(test_review))
test_review_rate = nb.predict(new_x)
print(test_review_rate)
I am not sure whether the output I am getting is correct, since I get an array of scores. Can someone help me interpret these scores? Do I just take the average, and that will be my "star" rating for the review?
>>Enter a review:We had dinner here for my birthday in Stockholm. The restaurant was very popular, so I would advise you book in advance.Blahblah
#my output
>>[5 5 5 5 5 5 5 5 5 1 5 1 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
5 5 5 5]
P.S. I realize that the sample data is poor and my model is biased towards positive ratings! Thanks in advance!
Upvotes: 0
Views: 477
Reputation: 1614
You need to join your words back into a single string. Right now the output of your input_process function is a list of words, so your model interprets each word as a separate input sample. That is why you get a score for each word in your review instead of one score for the whole text.
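To see the effect in isolation, here is a minimal sketch. The two-review corpus is made up for illustration (it is not your data); the point is only that transform produces one row per element of the list you pass in.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
corpus = ["super cool restaurant", "horrible experience"]
vectorizer = CountVectorizer()
vectorizer.fit(corpus)

words = ["super", "cool", "restaurant"]

# A list of three strings is treated as three separate documents
per_word = vectorizer.transform(words)
print(per_word.shape)  # (3, 5): three rows, so predict() returns three labels

# One joined string inside a list is treated as a single document
whole = vectorizer.transform([" ".join(words)])
print(whole.shape)  # (1, 5): one row, so predict() returns one label
```

Since predict returns one label per row, the length of your output array matches the number of words that survived your stopword filter.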
Some changes in your code:
def input_process(text):
    # One way to remove punctuation
    translator = str.maketrans('', '', string.punctuation)
    nopunc = text.translate(translator)
    # Build the stopword set once instead of once per word
    stop_words = set(stopwords.words('english'))
    words = [word for word in nopunc.split() if word.lower() not in stop_words]
    # Join the words back together and return a single string
    return ' '.join(words)
# vectorizer.transform expects an iterable of documents (e.g. a list),
# so wrap your single string in a list
new_x = vectorizer.transform([input_process(test_review)])
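As a possible alternative, assuming you are free to refit the vectorizer: CountVectorizer already lowercases text and drops punctuation with its default tokenizer, and it can filter English stop words itself via stop_words="english". That way the training text and the new review go through the same preprocessing. The sketch below uses a tiny invented training set (not your Yelp data) and also prints predict_proba, which may be easier to interpret than a bare label.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, invented for illustration only
texts = [
    "super cool restaurant",
    "great food and friendly staff",
    "horrible experience",
    "terrible food and rude staff",
]
stars = [5, 5, 1, 1]

# Let the vectorizer handle lowercasing, punctuation and stop words
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)
nb = MultinomialNB().fit(X, stars)

review = "Great food, super friendly staff!"
new_x = vectorizer.transform([review])  # one document -> one row

pred = nb.predict(new_x)         # array with a single label
proba = nb.predict_proba(new_x)  # one row of class probabilities

print(pred[0])      # predicted star rating for the whole review
print(nb.classes_)  # column order of the probabilities
print(proba[0])
```

With real data you would fit the vectorizer on the training split only and reuse the same fitted object for new reviews.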
Upvotes: 1