Reputation: 2014
I have about a corpus (30,000 customer reviews) in a csv file (or a txt file). This means each customer review is a line in the text file. Some examples are:
I want to change these texts to the following:
I have two separate lists (lexicons) of positive words and negative words. For example, a text file contains such positive words as:
And, a text file contains such negative words as:
So, I want the Python script that reads the customer review: when any of the positive words is found, then insert "POSITIVE" after the positive word; when any of the negative words is found, then insert "NEGATIVE" after the positive word.
Here is the code I have tested so far. This works (see my comments in the codes below), but it needs improvement to meet my needs described above.
Specifically, my_escaper
works (this code finds such words as cheap and good and replace them with cheap POSITIVE and good POSITIVE), but the problem is that I have two files (lexicons), each containing about thousand positive/negative words. So what I want is that the codes read those word lists from the lexicons, search them in the corpus, and replace those words in the corpus (for example, from "good" to "good POSITIVE", from "bad" to "bad NEGATIVE").
#adapted from http://stackoverflow.com/questions/6116978/python-replace-multiple-strings
import re
def multiple_replacer(*key_values):
replace_dict = dict(key_values)
replacement_function = lambda match: replace_dict[match.group(0)]
pattern = re.compile("|".join([re.escape(k) for k, v in key_values]), re.M)
return lambda string: pattern.sub(replacement_function, string)
def multiple_replace(string, *key_values):
return multiple_replacer(*key_values)(string)
#this my_escaper works (this code finds such words as cheap and good and replace them with cheap POSITIVE and good POSITIVE), but the problem is that I have two files (lexicons), each containing about thousand positive/negative words. So what I want is that the codes read those word lists from the lexicons, search them in the corpus, and replace those words in the corpus (for example, from "good" to "good POSITIVE", from "bad" to "bad NEGATIVE")
my_escaper = multiple_replacer(('cheap','cheap POSITIVE'), ('good', 'good POSITIVE'), ('avoid', 'avoid NEGATIVE'))
d = []
with open("review.txt","r") as file:
for line in file:
review = line.strip()
d.append(review)
for line in d:
print my_escaper(line)
Upvotes: 1
Views: 217
Reputation: 2659
A straightforward way to code this would be to load your positive and negative words from your lexicons into separate sets. Then, for each review, split the sentence into a list of words and look-up each word in the sentiment sets. Checking set membership is O(1) in the average case. Insert the sentiment label (if any) into the word list and then join to build the final string.
Example:
import re
reviews = [
"This bike is amazing, but the brake is very poor",
"This ice maker works great, the price is very reasonable, some bad smell from the ice maker",
"The food was awesome, but the water was very rude"
]
positive_words = set(['amazing', 'great', 'awesome', 'reasonable'])
negative_words = set(['poor', 'bad', 'rude'])
for sentence in reviews:
tagged = []
for word in re.split('\W+', sentence):
tagged.append(word)
if word.lower() in positive_words:
tagged.append("POSITIVE")
elif word.lower() in negative_words:
tagged.append("NEGATIVE")
print ' '.join(tagged)
While this approach is straightforward, there is a downside: you lose the punctuation due to the use of re.split()
.
Upvotes: 1
Reputation: 1936
If I understood correctly, you need something like:
if word in POSITIVE_LIST:
pattern.sub(replacement_function, word+" POSITIVE")
if word in NEGATIVE_LIST:
pattern.sub(replacement_function, word+" NEGATIVE")
Is it OK with you?
Upvotes: 0