Python - Readline skipping characters

Question

I ran into a curious problem while parsing JSON objects in large text files, and the workaround I found doesn't make much sense to me. I was working with the following script, which copies bz2 files, decompresses them, then parses each line as a JSON object.

import os, sys, json

# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
#              USER INPUT
# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

args = sys.argv
extractDir = outputDir = ""

if (len(args) >= 2):
    extractDir = args[1]
else:
    extractDir = raw_input('Directory to extract from: ')

if (len(args) >= 3):
    outputDir = args[2]
else:
    outputDir = raw_input('Directory to output to: ')

# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=   
#            RETRIEVE FILE
# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=   

tweetModel = [u'id', u'text', u'lang', u'created_at', u'retweeted', u'retweet_count', u'in_reply_to_user_id', u'coordinates', u'place', u'hashtags', u'in_reply_to_status_id']

filenames = next(os.walk(extractDir))[2]
for file in filenames:
    if file[-4:] != ".bz2":
        continue

    os.system("cp " + extractDir + '/' + file + ' ' + outputDir)
    os.system("bunzip2 " + outputDir + '/' + file)

# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=   
#            PARSE DATA
# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=   

    input = open (outputDir + '/' + file[:-4], 'r')
    output = open (outputDir + '/p_' + file[:-4], 'w+')

    for line in input.readlines():
        try:
            tweet = json.loads(line)
            for field in enumerate(tweetModel):
                if tweet.has_key(field[1]) and tweet[field[1]] != None:
                    if field[0] != 0:
                        output.write('\t')
                    fieldData = tweet[field[1]]
                    if not isinstance(fieldData, unicode):
                        fieldData = unicode(str(fieldData), "utf-8")

                    output.write(fieldData.encode('utf8'))
                else:
                    output.write('\t')

        except ValueError as e:
            print ("Parse Error: " + str(e))
            print line
            line = input.readline()
            quit()
            continue

        print "Success! " + str(len(line))
        input.flush()

        output.write('\n')

# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=   
#          REMOVE OLD FILE
# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=   

    os.system("rm " +  outputDir + '/' + file[:-4])

In the loop that starts with for line in input.readlines(): some lines would occasionally be truncated at inconsistent locations. Because the newline character was lost along with the rest of the line, the reader would keep going until it reached the newline at the end of the next JSON object. The result was an incomplete JSON object followed by a complete one, all treated as a single line by the parser. I could not find the cause of this, but I did find that changing the loop to

filedata = input.read()
for line in filedata.splitlines():

worked. Does anyone know what is going on here?
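
One way to narrow this down might be to split the same bytes both ways and compare the results. The sketch below is only an illustration: compare_line_splitting is a hypothetical helper that is not part of the script above, and it opens the file in binary mode so that no newline translation is involved. It does not explain the truncation by itself; it only reports whether the two methods disagree for a given file.

```python
import io


def compare_line_splitting(path):
    """Return (readlines_count, splitlines_count) for the file at `path`.

    Hypothetical diagnostic helper, not part of the original script.
    """
    # Binary mode: readlines() breaks only on b'\n'.
    with io.open(path, 'rb') as f:
        by_readlines = f.readlines()

    # splitlines() on the raw bytes also breaks on a bare b'\r'.
    with io.open(path, 'rb') as f:
        by_splitlines = f.read().splitlines()

    return len(by_readlines), len(by_splitlines)
```

If the two counts differ, the file contains line boundaries that splitlines() recognises but readlines() does not (a bare carriage return, for example), which would at least confirm that the two loops are not equivalent for that input.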
