Reputation: 11561
I have a file that contains a stream of JSON dictionaries like this:
{"menu": "a"}{"c": []}{"d": [3, 2]}{"e": "}"}
It also includes nested dictionaries and it looks like I cannot rely on a newline being a separator. I need a parser that could be used like this:
for d in getobjects(f):
    handle_dict(d)
Ideally, iteration would happen only at the root level, so that each top-level dictionary is yielded whole. Is there a Python parser that handles all of JSON's quirks? I am interested in a solution that works on files that do not fit into RAM.
Upvotes: 6
Views: 1699
Reputation: 76715
Here you go: a tested solution based on the answer from @Brien.
This should handle input files of arbitrary size. It is a generator, so it yields dictionary objects one at a time as it parses them out of the JSON input file.
If you run it stand-alone, it runs three test cases (in the if __name__ == "__main__" block).
Of course, to make this read from standard input, you would simply pass sys.stdin as the input file argument.
import json

_DECODER = json.JSONDecoder()
_DEFAULT_CHUNK_SIZE = 4096
_MB = (1024 * 1024)
_LARGEST_JSON_OBJECT_ACCEPTED = 16 * _MB  # default to 16 megabytes


def json_objects_from_file(input_file,
                           chunk_size=_DEFAULT_CHUNK_SIZE,
                           max_size=_LARGEST_JSON_OBJECT_ACCEPTED):
    """
    Read an input file, and yield up each JSON object parsed from the file.
    Allocates minimal memory so should be suitable for large input files.
    """
    buf = ''
    while True:
        temp = input_file.read(chunk_size)
        if not temp:
            break

        # Accumulate more input to the buffer.
        #
        # The decoder is confused by leading white space before an object.
        # So, strip any leading white space if any.
        buf = (buf + temp).lstrip()
        while True:
            try:
                # Try to decode a JSON object.
                x, i = _DECODER.raw_decode(buf)
            except ValueError:
                # Either the input is garbage or we got a partial JSON object.
                # If it's a partial, maybe appending more input will finish it,
                # so catch the error and keep handling input lines.

                # Note that if you feed in a huge file full of garbage, this will grow
                # very large.  Blow up before reading an excessive amount of data.
                if len(buf) >= max_size:
                    raise ValueError("either bad input or too-large JSON object.")
                break

            # Only dicts are expected at the top level. Fail loudly on anything
            # else; otherwise the value would never be removed from the buffer
            # and this loop would spin forever.
            if not isinstance(x, dict):
                raise TypeError("expected a JSON object, got {}".format(type(x).__name__))

            # We got a whole JSON object. Chop it out of the buffer,
            # strip any leading white space, and yield it.
            buf = buf[i:].lstrip()
            yield x

    buf = buf.strip()
    if buf:
        if len(buf) > 70:
            buf = buf[:70] + '...'
        raise ValueError('leftover stuff from input: "{}"'.format(buf))


if __name__ == "__main__":
    try:
        from StringIO import StringIO  # Python 2
    except ImportError:
        from io import StringIO  # Python 3

    # Test 1: objects with newlines inside and between them, read in tiny chunks.
    jstring = '{"menu":\n"a"}{"c": []\n}\n{\n"d": [3,\n 2]}{\n"e":\n "}"}'
    f = StringIO(jstring)
    correct = [{u'menu': u'a'}, {u'c': []}, {u'd': [3, 2]}, {u'e': u'}'}]
    result = list(json_objects_from_file(f, chunk_size=3))
    assert result == correct

    # Test 2: a large file of nothing but white space yields no objects.
    f = StringIO(' ' * (17 * _MB))
    correct = []
    result = list(json_objects_from_file(f, chunk_size=_MB))
    assert result == correct

    # Test 3: a large file of garbage raises ValueError once max_size is reached.
    f = StringIO('x' * (17 * _MB))
    correct = "ok"
    try:
        result = list(json_objects_from_file(f, chunk_size=_MB))
    except ValueError:
        result = correct
    assert result == correct
Upvotes: 2
Reputation: 11561
Here is a partial solution, but it keeps slowing down as more input is read:
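#!/usr/bin/env pypy
import json
import sys

try:
    from cStringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3


def main():
    BUFSIZE = 10240
    f = sys.stdin
    decoder = json.JSONDecoder()
    io = StringIO()
    do_continue = True
    while True:
        read = f.read(BUFSIZE)
        if len(read) < BUFSIZE:
            # A short read is treated as the end of the input.
            do_continue = False
        io.write(read)
        try:
            # Only one object is decoded per read, and the whole buffer
            # is copied each time.
            data, offset = decoder.raw_decode(io.getvalue())
            print(data)
            rest = io.getvalue()[offset:]
            if rest.startswith('\n'):
                rest = rest[1:]
            io = StringIO()
            io.write(rest)
        except ValueError:
            # Incomplete object: fall through and read more input.
            # (A `continue` here would skip the end-of-input check below
            # and loop forever once the input is exhausted.)
            pass
        if not do_continue:
            break


if __name__ == '__main__':
    main()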
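One likely reason for the slowdown is that only one object is consumed per read while the whole buffer is copied each time, so the buffer keeps growing whenever objects are smaller than the read size. A rough sketch of one way around that, written for Python 3: decode every complete value in the buffer after each read, tracking an offset instead of re-slicing. The names iter_json_values and chunk_size are made up for illustration, and passing a start index as the second argument to raw_decode relies on the signature CPython's json module implements, so treat that as an assumption on other implementations. It also assumes top-level values are objects or arrays, which are self-delimiting.

```python
import io
import json
import re

# JSON permits only these four whitespace characters between values.
_WS = re.compile(r"[ \t\n\r]*")


def iter_json_values(fp, chunk_size=65536):
    """Yield each top-level JSON value read from a text-mode file object."""
    decoder = json.JSONDecoder()
    buf = ""
    while True:
        chunk = fp.read(chunk_size)
        buf += chunk
        pos = 0
        while True:
            # raw_decode does not skip leading whitespace, so do it here.
            pos = _WS.match(buf, pos).end()
            if pos == len(buf):
                break
            try:
                value, pos = decoder.raw_decode(buf, pos)
            except ValueError:
                # Probably an incomplete value; wait for more input.
                break
            yield value
        # Drop the consumed prefix once per chunk, not once per value.
        buf = buf[pos:]
        if not chunk:
            break
    if buf.strip():
        raise ValueError("trailing data is not valid JSON: %r" % buf[:70])


# Small demonstration with a deliberately tiny chunk size.
stream = io.StringIO('{"menu": "a"}\n{"c": []} {"d": [3, 2]}{"e": "}"}')
for value in iter_json_values(stream, chunk_size=5):
    print(value)
```

A single object larger than the chunk size is still re-parsed from its start on every read, so a larger chunk_size is probably the better default for big objects.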
Upvotes: 0
Reputation: 6693
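I think JSONDecoder.raw_decode may be what you're looking for. It decodes a single JSON document from the start of a string and returns both the decoded object and the index where that document ended, so you can call it repeatedly on the remainder. You may have to strip whitespace (such as newlines) between objects yourself, but with a bit of work you should be able to get something working. For example: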
import json

jstring = '{"menu": "a"}{"c": []}{"d": [3, 2]}{"e": "}"}'
substr = jstring
decoder = json.JSONDecoder()
while len(substr) > 0:
    # raw_decode returns the first object and the index just past it.
    data, index = decoder.raw_decode(substr)
    print(data)
    substr = substr[index:]
Gives output (under Python 2; Python 3 prints the same dictionaries without the u prefixes):
{u'menu': u'a'}
{u'c': []}
{u'd': [3, 2]}
{u'e': u'}'}
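If the objects are separated by newlines or spaces, the loop above fails, because raw_decode raises ValueError when the string starts with whitespace. A minimal sketch of the "string formatting" step, assuming the whole input fits in memory: strip leading whitespace before each call. The function name decode_all is made up for this example.

```python
import json


def decode_all(text):
    """Decode every concatenated JSON value in text, tolerating whitespace."""
    decoder = json.JSONDecoder()
    results = []
    rest = text.lstrip()
    while rest:
        value, index = decoder.raw_decode(rest)
        results.append(value)
        # Strip whitespace before the next value, since raw_decode
        # raises ValueError on leading whitespace.
        rest = rest[index:].lstrip()
    return results


print(decode_all('{"menu": "a"}\n{"c": []}  {"d": [3, 2]}\n{"e": "}"}\n'))
```

Slicing copies the remainder on every step, so this is only reasonable for small inputs.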
Upvotes: 6