Modifying corpus by inserting codewords using Python

Question

I have a corpus of about 30,000 customer reviews in a CSV file (or a text file), with one review per line. Some examples are:

This bike is amazing, but the brake is very poor
This ice maker works great, the price is very reasonable, some bad smell from the ice maker
The food was awesome, but the water was very rude

I want to change these texts to the following:

This bike is amazing POSITIVE, but the brake is very poor NEGATIVE
This ice maker works great POSITIVE, the price is very reasonable POSITIVE, some bad NEGATIVE smell from the ice maker
The food was awesome POSITIVE, but the water was very rude NEGATIVE

I have two separate lists (lexicons) of positive words and negative words. For example, a text file contains such positive words as:

amazing
great
awesome
very cool
reasonable
pretty
fast
tasty
kind

And, a text file contains such negative words as:

rude
poor
worst
dirty
slow
bad

So, I want a Python script that reads each customer review: when any of the positive words is found, insert "POSITIVE" after the positive word; when any of the negative words is found, insert "NEGATIVE" after the negative word.

Here is the code I have tested so far. It works (see my comments in the code below), but it needs improvement to meet the needs described above.

Specifically, my_escaper works: it finds words such as "cheap" and "good" and replaces them with "cheap POSITIVE" and "good POSITIVE". The problem is that I have two files (lexicons), each containing about a thousand positive or negative words. What I want is for the code to read the word lists from the lexicons, search for those words in the corpus, and replace them there (for example, "good" becomes "good POSITIVE" and "bad" becomes "bad NEGATIVE").

#adapted from http://stackoverflow.com/questions/6116978/python-replace-multiple-strings

import re

def multiple_replacer(*key_values):
    replace_dict = dict(key_values)
    replacement_function = lambda match: replace_dict[match.group(0)]
    pattern = re.compile("|".join([re.escape(k) for k, v in key_values]), re.M)
    return lambda string: pattern.sub(replacement_function, string)

def multiple_replace(string, *key_values):
    return multiple_replacer(*key_values)(string)

# my_escaper works for a hard-coded handful of words, but the real word lists
# live in two lexicon files of about a thousand entries each, so the
# (word, replacement) pairs should be built from those files instead.

my_escaper = multiple_replacer(('cheap','cheap POSITIVE'), ('good', 'good POSITIVE'), ('avoid', 'avoid NEGATIVE'))

d = []
with open("review.txt","r") as file:
    for line in file:
        review = line.strip()
        d.append(review) 

for line in d:
    print(my_escaper(line))

Matthew Nizol · Accepted Answer

A straightforward way to code this would be to load the positive and negative words from your lexicons into two separate sets. Then, for each review, split the sentence into a list of words and look up each word in the sentiment sets. Checking set membership is O(1) in the average case. Insert the sentiment label (if any) into the word list and then join the list to build the final string.

Example:

import re

reviews = [
    "This bike is amazing, but the brake is very poor",
    "This ice maker works great, the price is very reasonable, some bad smell from the ice maker",
    "The food was awesome, but the water was very rude"
    ]

positive_words = set(['amazing', 'great', 'awesome', 'reasonable'])
negative_words = set(['poor', 'bad', 'rude'])

for sentence in reviews:
    tagged = []
    for word in re.split(r'\W+', sentence):  # raw string so \W is passed to the regex engine unchanged
        tagged.append(word)
        if word.lower() in positive_words:
            tagged.append("POSITIVE")
        elif word.lower() in negative_words:
            tagged.append("NEGATIVE")
    print(' '.join(tagged))

While this approach is straightforward, there is a downside: you lose the punctuation due to the use of re.split().
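
One possible way around the lost punctuation, offered as a rough sketch rather than a definitive solution, is to leave the sentence intact and let re.sub() append the label to each match. The helper names load_lexicon and build_tagger are made up for this illustration, and the sketch assumes Python 3, one lexicon entry per line, and entries that start and end with a letter or digit (so that \b word boundaries behave). Because it matches whole entries rather than single tokens, it should also pick up multi-word entries such as "very cool".

```python
import re


def load_lexicon(path):
    """Read one word or phrase per line into a set of lowercase strings.

    Hypothetical helper: assumes a UTF-8 text file with one entry per line.
    """
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def build_tagger(positive_words, negative_words):
    """Return a function that appends POSITIVE/NEGATIVE after lexicon matches."""
    labels = {word: "POSITIVE" for word in positive_words}
    labels.update({word: "NEGATIVE" for word in negative_words})
    if not labels:
        # Nothing to match; return the text unchanged.
        return lambda text: text

    # Longest entries first, so a phrase wins over a shorter overlapping entry.
    alternatives = sorted(labels, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(entry) for entry in alternatives) + r")\b",
        re.IGNORECASE,
    )

    def tag(text):
        # Keep the matched text as written and append its label after it.
        return pattern.sub(
            lambda match: match.group(0) + " " + labels[match.group(0).lower()],
            text,
        )

    return tag


# In-memory sets for the demo; with real files this could instead be, e.g.,
# build_tagger(load_lexicon("positive.txt"), load_lexicon("negative.txt")),
# where both file names are placeholders.
tag = build_tagger(
    {"amazing", "great", "awesome", "reasonable", "very cool"},
    {"poor", "bad", "rude"},
)
print(tag("This bike is amazing, but the brake is very poor"))
# This bike is amazing POSITIVE, but the brake is very poor NEGATIVE
```

Compiling a single pattern once and reusing it for every review keeps the per-line work to one pass over the text. With lexicons of a few thousand entries this should be workable, although the set-lookup approach above may be faster when all entries are single words.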
