Reputation: 53
I have a file that contains an array of JSON objects. The file is over 1GB, so I can't load it into memory all at once. I need to parse each of the individual objects. I tried using ijson, but that loads the entire array as one object, effectively doing the same thing as a plain json.load() would.
Is there another way to do it?
Edit: Nevermind, just use ijson.items() and set the prefix parameter to "item".
Upvotes: 3
Views: 989
Reputation: 36249
You can parse the JSON file once to find the positions of each level-1 separator, i.e. a comma that is part of the top-level object, and then divide the file into sections indicated by these positions. For example:
{"a": [1, 2, 3], "b": "Hello, World!", "c": {"d": 4, "e": 5}}
        ^      ^            ^        ^             ^
        |      |            |        |             |
     level-2   |         quoted      |          level-2
               |                     |
            level-1               level-1
Here we want to find the level-1 commas, which separate the objects contained by the top-level object. We can use a generator that parses the JSON stream and keeps track of descending into and stepping out of nested containers. When it encounters an unquoted level-1 comma, it yields the corresponding position:
def find_sep_pos(stream, *, sep=','):
    level = 0
    quoted = False     # inside a JSON string
    backslash = False  # previous character was an escape backslash
    for pos, char in enumerate(stream):
        if backslash:
            backslash = False
        elif char == '\\':
            backslash = True
        elif char == '"':
            quoted = not quoted
        elif quoted:
            pass  # brackets and commas inside strings don't count
        elif char in '{[':
            level += 1
        elif char in ']}':
            level -= 1
        elif char == sep and level == 1:
            yield pos
Used on the example data above, this gives list(find_sep_pos(example)) == [15, 37].
Then we can divide the file into sections that correspond to the separator positions and load each section individually via json.loads:
import itertools as it
import json

with open('example.json') as fh:
    # Iterating over `fh` yields lines, so we chain them to get characters.
    sep_pos = tuple(find_sep_pos(it.chain.from_iterable(fh)))
    fh.seek(0)  # reset to the beginning of the file
    stream = it.chain.from_iterable(fh)
    opening_bracket = next(stream)
    closing_bracket = dict(('{}', '[]'))[opening_bracket]
    offset = 1  # the bracket we just consumed adds an offset of 1
    for pos in sep_pos:
        json_str = (
            opening_bracket
            + ''.join(it.islice(stream, pos - offset))
            + closing_bracket
        )
        # For a top-level array this is a one-element list holding the object.
        obj = json.loads(json_str)
        next(stream)      # step over the separator
        offset = pos + 1  # adjust where we are in the stream right now
        print(obj)
    # The last section, including the file's own closing bracket, still
    # remains in the stream, so we load it here.
    obj = json.loads(opening_bracket + ''.join(stream))
    print(obj)
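As a quick self-contained check, the same logic can be exercised on an in-memory stream, with io.StringIO standing in for the large file (the sample data here is illustrative):

```python
import io
import itertools as it
import json


def find_sep_pos(stream, *, sep=','):
    # Track nesting depth and string state to find level-1 separators.
    level = 0
    quoted = False
    backslash = False
    for pos, char in enumerate(stream):
        if backslash:
            backslash = False
        elif char == '\\':
            backslash = True
        elif char == '"':
            quoted = not quoted
        elif quoted:
            pass  # brackets and commas inside strings don't count
        elif char in '{[':
            level += 1
        elif char in ']}':
            level -= 1
        elif char == sep and level == 1:
            yield pos


data = '[{"a": 1}, {"b": [2, 3]}, {"c": "x,y"}]'
fh = io.StringIO(data)
sep_pos = tuple(find_sep_pos(it.chain.from_iterable(fh)))  # (9, 24)
fh.seek(0)
stream = it.chain.from_iterable(fh)
opening_bracket = next(stream)
closing_bracket = dict(('{}', '[]'))[opening_bracket]
offset = 1
objects = []
for pos in sep_pos:
    json_str = (
        opening_bracket
        + ''.join(it.islice(stream, pos - offset))
        + closing_bracket
    )
    # Each section loads as a one-element list, so extend() unwraps it.
    objects.extend(json.loads(json_str))
    next(stream)  # step over the separator
    offset = pos + 1
# The last section already ends with the file's closing bracket.
objects.extend(json.loads(opening_bracket + ''.join(stream)))
print(objects)
```

Note that the quoted comma in "x,y" and the nested comma in [2, 3] are correctly skipped, so only the two level-1 separators split the stream.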
Upvotes: 3
Reputation: 1936
Two options:
Parse on the command line using a tool like jq, then take the output to Python for further processing.
Parse using PySpark (Databricks Community Edition gives you a free workspace).
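For the jq route, a minimal sketch (the file name is illustrative; jq -c '.[]' prints each top-level array element on its own line, which Python can then read line by line):

```shell
# Illustrative input; a real use would point jq at the 1GB file.
printf '[{"a": 1}, {"b": 2}]' > /tmp/example.json
jq -c '.[]' /tmp/example.json
# Plain '.[]' still slurps the whole document; for files too large for
# that, jq's --stream mode processes the input incrementally:
#   jq -cn --stream 'fromstream(1 | truncate_stream(inputs))' /tmp/example.json
```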
Upvotes: 0