Venki

Reputation: 1459

Loading data into dictionary

I enrolled in a data science course on Coursera, and working on one of the assignments led me to write this code.

import sys
import json
import re

def lines(fp):
    print str(len(fp.readlines()))

def main():
        tweet_file = open(sys.argv[1])
        word_frequency_count = {}

        for line in tweet_file:
                raw_data = json.loads(line)
                #print raw_data
                text = raw_data.get('text', "").lower().encode('utf-8')
                new_text=re.findall(r"[\w']+", text)
                print new_text
                #print text
        for word in new_text:
                word_frequency_count[word] = 'Test'

        print word_frequency_count.items()


if __name__ == '__main__':
    main()

The print statement for new_text outputs lines like the ones below. There are thousands of them; this is just a sample.

['rt', 'fuadagus2', 'presiden', 'sby', 'belilah', 'nuklir', 'kpd', 'korut', 'luncurkan', 'ke', 'israel', 'tunjukan', 'kalau', 'kamu', 'islam', 'prayforgaza']
['not', 'letting', 'nothing', 'else', 'get', 'in', 'my', 'way']

What bothers me is the final print of the dictionary's key-value pairs, which outputs only 5 pairs. I am a Java developer and this is my first foray into Python. Am I missing something obvious here?

Upvotes: 0

Views: 133

Answers (2)

jonrsharpe

Reputation: 122061

Your indentation is wrong:

for line in tweet_file:
    ...
for word in new_text:
    word_frequency_count[word] = 'Test'

The second loop runs outside the first loop, so it only processes the new_text list from the last line in tweet_file. It should instead be:

for line in tweet_file:
    ...
    for word in new_text:
        word_frequency_count[word] = 'Test'

However, note that Python comes with "batteries included"; in this case, collections.Counter (a dict subclass for counting hashable items) will make your life much easier.
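As a rough illustration of the collections.Counter suggestion, here is a minimal sketch. The helper name count_words and the sample lines are made up for this example and are not from the question; the print statement is left out so the snippet runs on both Python 2 and Python 3. It assumes each input line is a JSON object that may or may not carry a 'text' key.

```python
import json
import re
from collections import Counter


def count_words(lines):
    """Return a Counter mapping each word to its number of occurrences.

    `lines` is any iterable of JSON strings, for example an open tweet file.
    (count_words is a hypothetical helper, not part of the original code.)
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # Entries without a 'text' key contribute no words.
        text = tweet.get('text', '').lower()
        # update() adds one to the count of every word in the list.
        counts.update(re.findall(r"[\w']+", text))
    return counts


# Made-up sample data standing in for the tweet file.
sample = [
    '{"text": "Not letting nothing else get in my way"}',
    '{"text": "get out of my way"}',
    '{"delete": {"status": {"id": 1}}}',
]
word_counts = count_words(sample)
```

With the real file, something like count_words(open(sys.argv[1])) should work the same way, and word_counts.most_common(10) would then list the ten most frequent words.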

Upvotes: 1

rje

Reputation: 6428

The second for loop

for word in new_text:

is outside your main loop (which loops over the lines in the file). That means it will only be executed once, AFTER the main loop has ended. At that point new_text will only contain the words from the last line.

Try moving your second loop inside the main loop.
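To see the difference on a small scale, here is a sketch that runs both loop placements over two made-up lines (plain strings rather than JSON tweets, to keep it short). The names sample_lines, seen_outside and seen_inside are invented for this example.

```python
import re

# Two made-up lines standing in for the tweet texts.
sample_lines = ["not letting nothing else", "get in my way"]

# Second loop placed after the main loop: by the time it runs,
# new_text holds only the words of the last line.
seen_outside = {}
for line in sample_lines:
    new_text = re.findall(r"[\w']+", line.lower())
for word in new_text:
    seen_outside[word] = 'Test'

# Second loop nested inside the main loop: it runs once per line,
# so the words of every line are recorded.
seen_inside = {}
for line in sample_lines:
    new_text = re.findall(r"[\w']+", line.lower())
    for word in new_text:
        seen_inside[word] = 'Test'
```

Here seen_outside ends up with only the four words of the last line, while seen_inside holds all eight, which matches the handful of pairs seen in the question.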

Upvotes: 1
