Reputation: 898
I ran into a curious problem while parsing JSON objects in large text files, and the fix I found doesn't make much sense to me. I was working with the following script, which copies bz2 files, unzips them, and then parses each line as a JSON object.
import os, sys, json

# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
# USER INPUT
# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
args = sys.argv
extractDir = outputDir = ""
if (len(args) >= 2):
    extractDir = args[1]
else:
    extractDir = raw_input('Directory to extract from: ')
if (len(args) >= 3):
    outputDir = args[2]
else:
    outputDir = raw_input('Directory to output to: ')

# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
# RETRIEVE FILE
# =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
tweetModel = [u'id', u'text', u'lang', u'created_at', u'retweeted', u'retweet_count', u'in_reply_to_user_id', u'coordinates', u'place', u'hashtags', u'in_reply_to_status_id']
filenames = next(os.walk(extractDir))[2]
for file in filenames:
    if file[-4:] != ".bz2":
        continue
    os.system("cp " + extractDir + '/' + file + ' ' + outputDir)
    os.system("bunzip2 " + outputDir + '/' + file)

    # =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    # PARSE DATA
    # =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    input = open (outputDir + '/' + file[:-4], 'r')
    output = open (outputDir + '/p_' + file[:-4], 'w+')
    for line in input.readlines():
        try:
            tweet = json.loads(line)
            for field in enumerate(tweetModel):
                if tweet.has_key(field[1]) and tweet[field[1]] != None:
                    if field[0] != 0:
                        output.write('\t')
                    fieldData = tweet[field[1]]
                    if not isinstance(fieldData, unicode):
                        fieldData = unicode(str(fieldData), "utf-8")
                    output.write(fieldData.encode('utf8'))
                else:
                    output.write('\t')
        except ValueError as e:
            print ("Parse Error: " + str(e))
            print line
            line = input.readline()
            quit()
            continue
        print "Success! " + str(len(line))
        input.flush()
        output.write('\n')

    # =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    # REMOVE OLD FILE
    # =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    os.system("rm " + outputDir + '/' + file[:-4])
While reading certain lines in the loop over input.readlines(), the lines would occasionally be truncated at inconsistent locations. Since the newline character was lost as well, the reader would keep going until it found the newline at the end of the next JSON object. The result was an incomplete JSON object followed by a complete one, all treated as a single line by the parser. I could not find the cause, but I did find that changing the loop to
filedata = input.read()
for line in filedata.splitlines():
worked. Does anyone know what is going on here?
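One way to narrow down a symptom like this is to record which lines fail to parse and how long they are, instead of quitting at the first error. The following is only a minimal sketch under stated assumptions: find_bad_lines is a hypothetical helper name, the sample data is made up to imitate a truncated object glued to the next one, and it is written for Python 3 rather than the Python 2 used in the script above.

```python
import json

def find_bad_lines(lines):
    """Return (index, length, error message) for each line that is not valid JSON."""
    bad = []
    for index, line in enumerate(lines):
        if not line.strip():
            continue  # ignore blank lines
        try:
            json.loads(line)
        except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
            bad.append((index, len(line), str(exc)))
    return bad

# Made-up sample: the second entry imitates a truncated object
# followed directly by a complete one on the same "line".
sample = [
    '{"id": 1, "text": "ok"}\n',
    '{"id": 2, "text": "trunc{"id": 3, "text": "ok"}\n',
    '{"id": 4, "text": "ok"}\n',
]

report = find_bad_lines(sample)
for index, length, message in report:
    print("line %d (%d chars): %s" % (index, length, message))
```

Running the same helper over both input.readlines() and input.read().splitlines() for one file would show whether the two really disagree on where the lines end.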
Upvotes: 2
Views: 589
Reputation: 496
After looking at the source code for file.readlines and string.splitlines, I think I see what's up. Note: this is the Python 2.7 source, so if you're using another version, this answer may or may not apply.
readlines uses the function Py_UniversalNewlineFread to test for a newline, while splitlines uses a constant, STRINGLIB_ISLINEBREAK, that just tests for \n or \r. I suspect Py_UniversalNewlineFread is picking up some character in the file stream as a line break when it isn't intended as one; it could come from the encoding, but I don't know. When you dump the same data into a string, splitlines checks it against \r and \n, finds no match, and moves on until the real line break is encountered, so you get the line you intended.
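Whether two line splitters actually disagree on a given piece of data can be checked directly. The sketch below does not confirm the explanation above; it is written for Python 3, where the internals differ from the 2.7 source discussed here, and compare_splitters and the sample bytes are made up for illustration. It only shows that a binary-mode readlines and bytes.splitlines can count lines differently when a stray \r is present.

```python
import io

def compare_splitters(data):
    """Split the same bytes two ways and return both lists of lines."""
    # Binary-mode readlines() ends a line only at b"\n".
    via_readlines = [line.rstrip(b"\n") for line in io.BytesIO(data).readlines()]
    # bytes.splitlines() ends a line at b"\n", b"\r", or b"\r\n".
    via_splitlines = data.splitlines()
    return via_readlines, via_splitlines

# Made-up sample with a bare carriage return between two objects.
sample = b'{"a": 1}\r{"b": 2}\n{"c": 3}\n'

via_readlines, via_splitlines = compare_splitters(sample)
print(len(via_readlines), len(via_splitlines))
```

If the two counts differ for a real file, comparing the lists entry by entry shows which bytes one splitter treats as a line break and the other does not.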
Upvotes: 2