Reputation: 2012
Objective: To classify each tweet as positive or negative and write it to an output file which will contain the username, original tweet and the sentiment of the tweet.
Code:
import re,math
input_file="raw_data.csv"
fileout=open("Output.txt","w")
wordFile=open("words.txt","w")
expression=r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"
fileAFINN = 'AFINN-111.txt'
afinn = dict(map(lambda (w, s): (w, int(s)), [ws.strip().split('\t') for ws in open(fileAFINN)]))
pattern=re.compile(r'\w+')
pattern_split = re.compile(r"\W+")
words = pattern_split.split(input_file.lower())
print "File processing started"
with open(input_file,'r') as myfile:
for line in myfile:
line = line.lower()
line=re.sub(expression," ",line)
words = pattern_split.split(line.lower())
sentiments = map(lambda word: afinn.get(word, 0), words)
#print sentiments
# How should you weight the individual word sentiments?
# You could do N, sqrt(N) or 1 for example. Here I use sqrt(N)
"""
Returns a float for sentiment strength based on the input text.
Positive values are positive valence, negative value are negative valence.
"""
if sentiments:
sentiment = float(sum(sentiments))/math.sqrt(len(sentiments))
#wordFile.write(sentiments)
else:
sentiment = 0
wordFile.write(line+','+str(sentiment)+'\n')
fileout.write(line+'\n')
print "File processing completed"
fileout.close()
myfile.close()
wordFile.close()
Issue: Apparently the output.txt file is
abc some tweet text 0
bcd some more tweets 1
efg some more tweet 0
Question 1: How do I add a comma between the userid tweet-text sentiment? The output should be like;
abc,some tweet text,0
bcd,some other tweet,1
efg,more tweets,0
Question 2: The tweets are in Bahasa Melayu (BM) and the AFINN dictionary that I am using is of English words. So the classification is wrong. Do you know any BM dictionary that I can use?
Question 3: How do I pack this code in a JAR file?
Thank you.
Upvotes: 5
Views: 1071
Reputation: 596
Question 1:
output.txt
is currently simply composed of the lines you are reading in because of fileout.write(line+'\n')
. Since it is space separated, you can separate the line pretty easily
line_data = line.split(' ') # Split the line into a list, separated by spaces
user_id = line_data[0] # The first element of the list
tweets = line_data[1:-1] # The middle elements of the list
sentiment = line_data[-1] # The last element of the list
fileout.write(user_id + "," + " ".join(tweets) + "," + sentiment +'\n')
Question 2: A quick google search gave me this. Not sure if it has everything you will need though: https://archive.org/stream/grammardictionar02craw/grammardictionar02craw_djvu.txt
Question 3: Try Jython http://www.jython.org/archive/21/docs/jythonc.html
Upvotes: 1