nasia jaffri

Reputation: 823

Extract individual tweets from a text file with no line breaks using Python

I am trying to read tweets from a text file at a URL:

http://rasinsrv07.cstcis.cti.depaul.edu/CSC455/assignment5.txt

Tweets in the file are listed on a single line (there are no line breaks) and separated by the string “EndOfTweet”. I am reading the file using the following code:

import urllib2
wfd = urllib2.urlopen('http://rasinsrv07.cstcis.cti.depaul.edu/CSC455/assignment5.txt')
data = wfd.read()

I understand that I have to use split on "EndOfTweet" in order to separate the tweets, but since there is only one line, I do not understand how to loop through the file and separate each tweet. This is what I tried:

for line in data:
    line = data.split('EndOfTweet')

Upvotes: 0

Views: 252

Answers (1)

VooDooNOFX

Reputation: 4762

You're so close!

By the time you've called wfd.read(), data contains the raw text of that file as one string. Looping with for line in data walks over that string one character at a time; it is looping over a file object (for line in wfd) that splits on newlines. Either way, your data doesn't contain the normal newline terminator. Instead, the file uses the text EndOfTweet to separate the tweets, so call split once and loop over the list it returns. Here's what you should have done:

import json
import urllib2

wfd = urllib2.urlopen('http://rasinsrv07.cstcis.cti.depaul.edu/CSC455/assignment5.txt')
data = wfd.read()
for line in data.split('EndOfTweet'):
    # Each piece should hold a single tweet, which appears to be a JSON-parsable structure.
    line = line.strip()
    if not line:
        # Skip the empty piece left after a trailing EndOfTweet, if any.
        continue
    decoded_line = json.loads(line)
    # Now, let's print out the text of the tweet to show we can.
    print decoded_line.get(u'text')
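
If you want to try the splitting logic without hitting the network, here is a rough sketch that wraps it in a function. The name split_tweets and the sample string are made up for illustration; the sample only assumes the same shape as the assignment file (JSON objects separated by EndOfTweet) and is not real data. It also counts pieces that fail to parse instead of crashing on them, which is my own addition and may or may not be needed for the real file.

```python
import json


def split_tweets(raw, delimiter="EndOfTweet"):
    """Split raw text on the delimiter and decode each piece as JSON.

    Returns (tweets, skipped): the decoded objects and the number of
    non-empty pieces that could not be parsed.
    """
    tweets = []
    skipped = 0
    for chunk in raw.split(delimiter):
        chunk = chunk.strip()
        if not chunk:
            # A trailing delimiter leaves an empty final piece.
            continue
        try:
            tweets.append(json.loads(chunk))
        except ValueError:
            # Malformed or truncated record: count it and move on.
            skipped += 1
    return tweets, skipped


# Hypothetical sample in the assumed shape of the file (not real data).
sample = '{"text": "first"}EndOfTweet{"text": "second"}EndOfTweet{"text": bad}EndOfTweet'
tweets, skipped = split_tweets(sample)
texts = [tweet.get("text") for tweet in tweets]
print(texts, skipped)
```

To use it on the real file, pass the string returned by wfd.read() as raw. Catching ValueError works on both Python 2 and Python 3, since json.JSONDecodeError is a subclass of ValueError.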

Upvotes: 1
