Reputation: 110203
I have downloaded 5MB of a very large JSON file, and I need to load that 5MB to generate a preview of the file. However, the downloaded portion will probably be incomplete. Here's an example of what it may look like:
[{
"first": "bob",
"address": {
"street": 13301,
"zip": 1920
}
}, {
"first": "sarah",
"address": {
"street": 13301,
"zip": 1920
}
}, {"first" : "tom"
From here, I'd like to "rebuild it" so that it can parse the first two objects (and ignore the third).
Is there a JSON parser that can infer or cut off the end of the string to make it parsable? Or perhaps to 'stream' the parsing of the JSON array, so that when it fails on the last object, I can exit the loop? If not, how could the above be accomplished?
Upvotes: 9
Views: 5597
Reputation: 36249
For the specific structure in the example we can walk through the string and track occurrences of curly brackets and their closing counterparts. If at the end one or more curly brackets remain unmatched, we know that this indicates an incomplete object. We can then strip any intermediate characters such as commas or whitespace and close the resulting string with a square bracket.
This method ensures that the string is scanned only twice, once manually and once by the JSON parser, which might be advantageous for large text files (where the incomplete object consists of many characters). Note that it assumes curly brackets never appear inside string values, since those would be counted as well.
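brackets = []  # indices of '{' characters that have not been closed yet
for i, c in enumerate(string):
    if c == '{':
        brackets.append(i)
    elif c == '}':
        brackets.pop()
if brackets:
    # Cut everything from the outermost unclosed '{' onwards.
    string = string[:brackets[0]]
# Drop the trailing separator and whitespace, then close the array.
string = string.rstrip(', \n')
if not string.endswith(']'):
    string += ']'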
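As a rough sketch, the same idea can be wrapped in a function that also skips curly brackets inside string values. The function name `close_truncated_array` and the sample string are made up for illustration, and the sketch still assumes the input is a truncated array of objects.

```python
import json


def close_truncated_array(text):
    """Cut a truncated JSON array of objects back to its last complete object."""
    open_braces = []   # indices of '{' not yet matched by '}'
    in_string = False  # True while scanning inside a JSON string literal
    escaped = False    # True if the previous character was a backslash
    for i, c in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif c == '\\':
                escaped = True
            elif c == '"':
                in_string = False
        elif c == '"':
            in_string = True
        elif c == '{':
            open_braces.append(i)
        elif c == '}' and open_braces:
            open_braces.pop()
    if open_braces:
        # Drop everything from the outermost unclosed object onwards.
        text = text[:open_braces[0]]
    # Remove the trailing separator and whitespace, then close the array.
    text = text.rstrip(', \t\r\n')
    if not text.endswith(']'):
        text += ']'
    return text


# Hypothetical sample: the first value contains a '{' inside a string.
sample = '[{"a": "x{y"}, {"b": {"c": 1}}, {"d": "to'
preview = json.loads(close_truncated_array(sample))
print(preview)
```

Here the third object is cut off, so only the first two end up in `preview`.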
Upvotes: 1
Reputation: 11232
If your data will always look somewhat similar, you could do something like this:
import json
json_string = """[{
"first": "bob",
"address": {
"street": 13301,
"zip": 1920
}
}, {
"first": "sarah",
"address": {
"street": 13301,
"zip": 1920
}
}, {"first" : "tom"
"""
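while True:
    if not json_string:
        raise ValueError("Couldn't fix JSON")
    try:
        # Try to parse with the missing closing bracket appended.
        data = json.loads(json_string + "]")
    except json.decoder.JSONDecodeError:
        # Still invalid: drop the last character and try again.
        json_string = json_string[:-1]
        continue
    break
print(data)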
This assumes that the data is a list of dicts. Step by step, the last character is removed and a missing ] is appended. If the new string can be interpreted as JSON, the infinite loop breaks. Otherwise the next character is removed, and so on. If there are no characters left, ValueError("Couldn't fix JSON") is raised.
For the above example, it prints:
[{'first': 'bob', 'address': {'zip': 1920, 'street': 13301}}, {'first': 'sarah', 'address': {'zip': 1920, 'street': 13301}}]
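If re-parsing the whole string on every attempt becomes too slow for a 5MB preview, one possible alternative is to decode the array one item at a time with the standard library's `json.JSONDecoder.raw_decode` and stop at the first item that fails. This is only a sketch: the helper name `parse_complete_items` is invented here, and it assumes the text contains a top-level array whose items are objects (a truncated bare number at the end could still decode as a shorter number).

```python
import json


def parse_complete_items(text):
    """Return the array items that can be decoded before the data runs out."""
    decoder = json.JSONDecoder()
    items = []
    pos = text.index('[') + 1  # start just after the opening bracket
    while True:
        # Skip whitespace and the comma separating two items.
        while pos < len(text) and text[pos] in ' \t\r\n,':
            pos += 1
        if pos >= len(text) or text[pos] == ']':
            break  # reached the end of the data or of the array
        try:
            # raw_decode returns the decoded value and the index where it ended.
            item, pos = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            break  # the remaining item is incomplete, so stop here
        items.append(item)
    return items


truncated = '[{"first": "bob"}, {"first": "sarah"}, {"first" : "tom"'
print(parse_complete_items(truncated))
```

Each complete object is parsed exactly once, and the incomplete third object is simply skipped.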
Upvotes: 8