Reputation: 1459
I enrolled in a data science course on Coursera, and one of the assignments led me to write this code.
import sys
import json
import re

def lines(fp):
    print str(len(fp.readlines()))

def main():
    tweet_file = open(sys.argv[1])
    word_frequency_count = {}
    for line in tweet_file:
        raw_data = json.loads(line)
        #print raw_data
        text = raw_data.get('text', "").lower().encode('utf-8')
        new_text=re.findall(r"[\w']+", text)
        print new_text
        #print text
    for word in new_text:
        word_frequency_count[word] = 'Test'
    print word_frequency_count.items()

if __name__ == '__main__':
    main()
The print statement for new_text outputs lines like the ones below. There are thousands of them, so this is just a sample.
['rt', 'fuadagus2', 'presiden', 'sby', 'belilah', 'nuklir', 'kpd', 'korut', 'luncurkan', 'ke', 'israel', 'tunjukan', 'kalau', 'kamu', 'islam', 'prayforgaza']
['not', 'letting', 'nothing', 'else', 'get', 'in', 'my', 'way']
What bothers me is the final print of the dictionary's key-value pairs, which outputs only 5 pairs. I am a Java developer and this is my first foray into Python. Am I missing something obvious here?
Upvotes: 0
Views: 133
Reputation: 122061
Your indentation is wrong:
for line in tweet_file:
    ...
for word in new_text:
    word_frequency_count[word] = 'Test'
The second loop happens outside the first loop, so it only processes the new_text list from the last line in tweet_file. It should instead be:
for line in tweet_file:
    ...
    for word in new_text:
        word_frequency_count[word] = 'Test'
However, note that Python comes with "batteries included"; in this case, collections.Counter will make your life much easier.
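As a rough sketch of the Counter approach, assuming the same one-JSON-object-per-line input as in the question: the count_words helper and the sample lines are made up for illustration, and the code is written for Python 3, where print is a function and the text does not need to be encoded before matching.

```python
import json
import re
from collections import Counter


def count_words(lines):
    """Count word occurrences across an iterable of JSON-encoded tweets.

    count_words is a hypothetical helper name, not part of the assignment.
    """
    counts = Counter()
    for line in lines:
        raw_data = json.loads(line)
        # Entries without a 'text' field contribute no words.
        text = raw_data.get('text', '').lower()
        # Counter.update adds one to the count of every word in the list.
        counts.update(re.findall(r"[\w']+", text))
    return counts


# Made-up sample input standing in for the tweet file.
sample = [
    '{"text": "Not letting nothing else get in my way"}',
    '{"text": "Get out of my way"}',
    '{"delete": {"status": {"id": 1}}}',
]
word_counts = count_words(sample)
print(word_counts.most_common(3))
```

Since a file object iterates over its lines, the same helper should also accept an open tweet file in place of the sample list.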
Upvotes: 1
Reputation: 6428
The second for loop (for word in new_text) is outside your main loop, which loops over the lines in the file. That means it will only be executed once, AFTER the main loop has ended. At that point new_text will only contain the words from the last line.
Try moving your second loop inside the main loop.
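To see the difference on a small scale, here is a sketch that uses two made-up lines instead of the tweet file. The lines list and both function names are hypothetical, and the code is written for Python 3.

```python
import re

# Two made-up lines standing in for the tweet texts.
lines = ["not letting nothing else", "get in my way"]


def words_loop_outside(lines):
    """Inner loop placed after the main loop, as in the question."""
    seen = {}
    new_text = []
    for line in lines:
        new_text = re.findall(r"[\w']+", line)
    # Runs once, after the main loop: new_text holds only the last line.
    for word in new_text:
        seen[word] = 'Test'
    return seen


def words_loop_inside(lines):
    """Inner loop nested in the main loop, so it runs for every line."""
    seen = {}
    for line in lines:
        new_text = re.findall(r"[\w']+", line)
        for word in new_text:
            seen[word] = 'Test'
    return seen


print(sorted(words_loop_outside(lines)))  # ['get', 'in', 'my', 'way']
print(len(words_loop_inside(lines)))      # 8
```

The first version ends up with only the four words of the last line, while the nested version collects all eight.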
Upvotes: 1