Reputation: 23
The following are two examples of the many lines I need to analyze and extract specific words from:
[40.748330000000003, -73.878609999999995] 6 2011-08-28 19:52:47 Sometimes I wish my life was a movie; #unreal I hate the fact I feel lonely surrounded by so many ppl
[37.786221300000001, -122.1965002] 6 2011-08-28 19:55:26 I wish I could lay up with the love of my life And watch cartoons all day.
The coordinates, the number, and the timestamp are to be ignored.
The task is to find how many of the words in each tweet line are present in this keyword list:
['hate', 1]
['hurt', 1]
['hurting', 1]
['like', 5]
['lonely', 1]
['love', 10]
In addition, I need to find the sum of the values (e.g. ['love', 10]) of the keywords found in each tweet line.
For example, for the sentence
'I hate to feel lonely at times'
the sum of the sentiment values for hate=1 and lonely=1 is 2, and the number of words in the line is 7.
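In other words, for one hard-coded line I want the equivalent of the following sketch (the keywords dict here is just an illustration of the loaded keyword list):

keywords = {'hate': 1, 'hurt': 1, 'hurting': 1, 'like': 5, 'lonely': 1, 'love': 10}

line = 'I hate to feel lonely at times'
words = line.lower().split()

# Sum the values of the keywords that appear in the line
score = sum(keywords.get(word, 0) for word in words)
print(score, len(words))  # prints: 2 7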
I've tried the lists-within-lists approach and even stepping through each sentence and keyword individually, but neither worked: there are many tweets and keywords, so I need to do this with loops.
Appreciate your insight in advance!! :)
My Code:
try:
    KeywordFileName = input('Input keyword file name: ')
    KeywordFile = open(KeywordFileName, 'r')
except FileNotFoundError:
    print('The file you entered does not exist or is not in the directory')
    exit()

# Read the keyword file line by line; each line looks like "love,10"
KeyLine = KeywordFile.readline()
while KeyLine != '':
    KeyLine = KeyLine.rstrip()
    pair = KeyLine.split(',')         # e.g. ['love', '10']
    pair[1] = int(pair[1])            # convert the value to an integer
    print(pair)
    KeyLine = KeywordFile.readline()  # advance to the next line

try:
    TweetFileName = input('Input Tweet file name: ')
    TweetFile = open(TweetFileName, 'r')
except FileNotFoundError:
    print('The file you entered does not exist or is not in the directory')
    exit()

# Read the tweet file line by line (the counting part is where I'm stuck)
TweetLine = TweetFile.readline()
while TweetLine != '':
    TweetLine = TweetLine.rstrip()
    TweetLine = TweetFile.readline()
Upvotes: 0
Views: 793
Reputation: 550
You can use a simple regular expression to strip out the numbers, then use a tokenizer to count the number of occurrences of each word in your sample string.
from nltk.tokenize import word_tokenize
import collections
import re

text = '[40.748330000000003, -73.878609999999995] 6 2011-08-28 19:52:47 Sometimes I wish my life was a movie; #unreal I hate the fact I feel lonely surrounded by so many ppl'

# Remove the coordinates and other numbers before tokenizing
num_regex = re.compile(r"[+-]?\d+(?:\.\d+)?")
text = num_regex.sub('', text)

# Tokenize and count how often each token occurs
words = word_tokenize(text)
final_list = collections.Counter(words)
print(final_list)
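If you also need the sentiment sum from the question, one way (not part of the original answer; the keywords dict below is an assumed stand-in for the loaded keyword file) is to weight each keyword's count by its value:

# Assumed: the keyword list has been loaded into a plain dict
keywords = {'hate': 1, 'hurt': 1, 'hurting': 1, 'like': 5, 'lonely': 1, 'love': 10}

# Weight each keyword's count by its value and sum the results
sentiment = sum(keywords.get(word, 0) * count for word, count in final_list.items())
print(sentiment)  # 2 for the sample tweet (hate=1 + lonely=1)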
Upvotes: 1