Reputation: 255
I'm trying to parse a really large JSON file in Python. The file has 6523440 lines and consists of many separate top-level JSON objects.
The structure looks like this:
[
{
"projects": [
...
]
}
]
[
{
"projects": [
...
]
}
]
....
....
....
and it goes on and on...
Every time I try to load it using json.load() I get an error:
ValueError: Extra data: line 2247 column 1 - line 6523440 column 1 (char 101207 - 295464118)
The error points to the line where the first object ends and the second one starts. Is there a way to load the objects separately, or anything similar?
Upvotes: 3
Views: 2799
Reputation: 40843
Try using json.JSONDecoder.raw_decode. It still requires you to have the entire document in memory, but it allows you to iteratively decode many objects from one string.
import re
import json
document = """
[
1,
2,
3
]
{
"a": 1,
"b": 2,
"c": 3
}
"""
not_whitespace = re.compile(r"\S")
decoder = json.JSONDecoder()

items = []
index = 0
while True:
    # Find the next non-whitespace character; stop when none remain.
    match = not_whitespace.search(document, index)
    if not match:
        break
    # raw_decode returns the decoded value and the index just past it.
    item, index = decoder.raw_decode(document, match.start())
    items.append(item)

print(items)
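As a sketch of how this could be applied to input shaped like the file in the question, the same loop can be wrapped in a generator so each top-level object is yielded as it is decoded (the `iter_json_objects` name and the sample string are illustrative, not from the original post):

```python
import json
import re

def iter_json_objects(text):
    """Yield each top-level JSON value found in `text`, in order."""
    decoder = json.JSONDecoder()
    not_whitespace = re.compile(r"\S")
    index = 0
    while True:
        # Skip whitespace between concatenated JSON documents.
        match = not_whitespace.search(text, index)
        if not match:
            return
        # Decode one value and continue from where it ended.
        obj, index = decoder.raw_decode(text, match.start())
        yield obj

# Two concatenated top-level arrays, mimicking the structure in the question.
doc = '[{"projects": [1]}]\n[{"projects": [2]}]'
objects = list(iter_json_objects(doc))
# objects[0] is the first array, objects[1] the second
```

For a file, you would still read the whole thing into a string first (e.g. `open(path).read()`) and pass it to the generator, since raw_decode operates on an in-memory string.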
Upvotes: 0