Reputation: 13
I have lots of JSON files to parse, each 1-2 MB in size. Ordinarily I would have no issue loading data from a JSON file as a dict using json.load(json_file). However, in this case each file contains a string of multiple nested dictionaries, all on one line.
Dictionaries are not delimited by "," as they would be in a list. I just have one very long string of nested dictionaries per file. For example, in the snippet below I have two nested dictionaries, each with a single key at the outer level of the dict ("GGGGHH" and "GGGHGH" for the first and second dictionaries, respectively).
{"GGGGHH": {"b2": {"spectrum_89": ["115.0502"]}, "b3": {"spectrum_89": ["172.0716"], "spectrum_107": ["172.0717"]}, "b4": {"spectrum_89": ["229.0934"]}, "b5": {"spectrum_89": ["366.1527"], "spectrum_107": ["366.1537"]}, "y1": {"spectrum_89": ["156.0769"], "spectrum_107": ["156.0769"]}, "y2": {"spectrum_89": ["293.1353"]}, "y3": {"spectrum_89": ["372.1407"], "spectrum_107": ["350.1563"]}, "a4": {"spectrum_89": ["202.1087"]}, "ImH": {"spectrum_89": ["110.0715"], "spectrum_107": ["110.0715"]}}}{"GGGHGH": {"b2": {"spectrum_89": ["115.0502"]}, "b3": {"spectrum_89": ["172.0716"], "spectrum_107": ["172.0717"]}, "b4": {"spectrum_89": ["309.1312"], "spectrum_107": ["309.1314"]}, "b5": {"spectrum_89": ["366.1527"], "spectrum_107": ["366.1537"]}, "y1": {"spectrum_89": ["156.0769"], "spectrum_107": ["156.0769"]}, "y2": {"spectrum_89": ["213.0985"], "spectrum_107": ["213.0985"]}, "y3": {"spectrum_89": ["372.1407"], "spectrum_107": ["350.1563"]}, "ImH": {"spectrum_89": ["110.0715"], "spectrum_107": ["110.0715"]}}}
I have seen examples of parsing multiple JSON objects, but only when they are in an array.
Can anyone help with this? I have no control over the format of the JSON files, so regenerating the data in an easier format is not an option. Apologies if this question has been answered before - I couldn't see any answers that would work for this particular case.
Upvotes: 1
Views: 1592
Reputation: 14233
This looks very much like malformed ndjson (newline-delimited JSON). You can replace }{ with }\n{ and then parse the result with the ndjson package:
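import ndjson

# Read the whole file; at 1-2 MB it fits comfortably in memory.
with open('spam.json') as f:
    source = f.read()

# Put each top-level object on its own line so the text becomes valid ndjson.
source = source.replace('}{', '}\n{')
data = ndjson.loads(source)  # list with one dict per line
print(data)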
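If installing a third-party package is not an option, the same idea can be sketched with only the standard library: after the replacement, each line is a complete JSON document, so json.loads can parse the lines one at a time. The function name parse_concatenated_lines and the shortened sample string are made up for illustration, and the sketch assumes that "}{" never appears inside a string value.

```python
import json

def parse_concatenated_lines(source):
    """Split back-to-back objects onto separate lines, then parse each line."""
    # Same replacement as above: a newline between adjacent objects.
    # Assumption: "}{" never occurs inside a string value.
    lines = source.replace('}{', '}\n{').split('\n')
    return [json.loads(line) for line in lines if line.strip()]

# Shortened, hypothetical sample in the same shape as the question's data.
sample = ('{"GGGGHH": {"b2": {"spectrum_89": ["115.0502"]}}}'
          '{"GGGHGH": {"b2": {"spectrum_89": ["115.0502"]}}}')
records = parse_concatenated_lines(sample)
print([list(r) for r in records])  # outer keys of each parsed dict
```

This returns a list of dicts, just as ndjson.loads does.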
Upvotes: 1
Reputation: 25489
Your string is invalid json, but it looks like it's just a bunch of valid json dictionaries joined back-to-back without commas.
Just add commas between the dictionaries by replacing any occurrences of "}{" with "}, {", stick it in between "[" and "]" to make it valid json for a list of dictionaries, and you're good to json.loads!
s = '{"GGGGHH": {"b2": {"spectrum_89": ["115.0502"]}, "b3": {"spectrum_89": ["172.0716"], "spectrum_107": ["172.0717"]}, "b4": {"spectrum_89": ["229.0934"]}, "b5": {"spectrum_89": ["366.1527"], "spectrum_107": ["366.1537"]}, "y1": {"spectrum_89": ["156.0769"], "spectrum_107": ["156.0769"]}, "y2": {"spectrum_89": ["293.1353"]}, "y3": {"spectrum_89": ["372.1407"], "spectrum_107": ["350.1563"]}, "a4": {"spectrum_89": ["202.1087"]}, "ImH": {"spectrum_89": ["110.0715"], "spectrum_107": ["110.0715"]}}}{"GGGHGH": {"b2": {"spectrum_89": ["115.0502"]}, "b3": {"spectrum_89": ["172.0716"], "spectrum_107": ["172.0717"]}, "b4": {"spectrum_89": ["309.1312"], "spectrum_107": ["309.1314"]}, "b5": {"spectrum_89": ["366.1527"], "spectrum_107": ["366.1537"]}, "y1": {"spectrum_89": ["156.0769"], "spectrum_107": ["156.0769"]}, "y2": {"spectrum_89": ["213.0985"], "spectrum_107": ["213.0985"]}, "y3": {"spectrum_89": ["372.1407"], "spectrum_107": ["350.1563"]}, "ImH": {"spectrum_89": ["110.0715"], "spectrum_107": ["110.0715"]}}}'
import json

json.loads("[" + s.replace("}{", "}, {") + "]")
Output:
[{'GGGGHH': {'b2': {'spectrum_89': ['115.0502']},
   'b3': {'spectrum_89': ['172.0716'], 'spectrum_107': ['172.0717']},
   'b4': {'spectrum_89': ['229.0934']},
   'b5': {'spectrum_89': ['366.1527'], 'spectrum_107': ['366.1537']},
   'y1': {'spectrum_89': ['156.0769'], 'spectrum_107': ['156.0769']},
   'y2': {'spectrum_89': ['293.1353']},
   'y3': {'spectrum_89': ['372.1407'], 'spectrum_107': ['350.1563']},
   'a4': {'spectrum_89': ['202.1087']},
   'ImH': {'spectrum_89': ['110.0715'], 'spectrum_107': ['110.0715']}}},
 {'GGGHGH': {'b2': {'spectrum_89': ['115.0502']},
   'b3': {'spectrum_89': ['172.0716'], 'spectrum_107': ['172.0717']},
   'b4': {'spectrum_89': ['309.1312'], 'spectrum_107': ['309.1314']},
   'b5': {'spectrum_89': ['366.1527'], 'spectrum_107': ['366.1537']},
   'y1': {'spectrum_89': ['156.0769'], 'spectrum_107': ['156.0769']},
   'y2': {'spectrum_89': ['213.0985'], 'spectrum_107': ['213.0985']},
   'y3': {'spectrum_89': ['372.1407'], 'spectrum_107': ['350.1563']},
   'ImH': {'spectrum_89': ['110.0715'], 'spectrum_107': ['110.0715']}}}]
For a more general case (for example, if there can be whitespace between two dictionaries), use a regular expression to do the replacement:
import re

json.loads("[" + re.sub(r"\}\s*\{", "}, {", s) + "]")
where the regex "\}\s*\{" matches }, followed by 0 or more whitespace characters, followed by {.
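One caveat with both replacements: they also rewrite any "}{" that happens to sit inside a string value. That changes the data, although the result still parses. If that is a concern, a possible alternative is to let the parser find the object boundaries itself with json.JSONDecoder.raw_decode, which parses one document starting at a given index and returns the object together with the index where it ended. The sketch below is only one way to do it; the helper name iter_json_objects and the sample string are invented for illustration.

```python
import json

def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated documents."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # Returns the parsed value and the index just past it in `text`.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Hypothetical sample: the first object has "}{" inside a string value.
sample = '{"a": {"note": "has }{ inside"}} \n {"b": [1, 2]}'
objs = list(iter_json_objects(sample))
print(objs)
```

Because the decoder tracks string boundaries itself, the "}{" inside the string is left untouched, and whitespace between objects is handled without a regex.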
Upvotes: 0