David Doran

Reputation: 13

Python: Parse JSON string with multiple nested dicts in single line

I have lots of JSON files to parse, each 1-2 MB in size. Ordinarily I would have no issue loading the data from a JSON file as a dict using json.load(json_file). However, in this case each file is a string of multiple nested dictionaries, all on one line.

Dictionaries are not delimited by "," as they would be in a list. I just have one very long string of nested dictionaries per file. For example, in the snippet below I have two nested dictionaries, each with a single key at the outer level of the dict ("GGGGHH" and "GGGHGH" for the first and second dictionaries, respectively).

{"GGGGHH": {"b2": {"spectrum_89": ["115.0502"]}, "b3": {"spectrum_89": ["172.0716"], "spectrum_107": ["172.0717"]}, "b4": {"spectrum_89": ["229.0934"]}, "b5": {"spectrum_89": ["366.1527"], "spectrum_107": ["366.1537"]}, "y1": {"spectrum_89": ["156.0769"], "spectrum_107": ["156.0769"]}, "y2": {"spectrum_89": ["293.1353"]}, "y3": {"spectrum_89": ["372.1407"], "spectrum_107": ["350.1563"]}, "a4": {"spectrum_89": ["202.1087"]}, "ImH": {"spectrum_89": ["110.0715"], "spectrum_107": ["110.0715"]}}}{"GGGHGH": {"b2": {"spectrum_89": ["115.0502"]}, "b3": {"spectrum_89": ["172.0716"], "spectrum_107": ["172.0717"]}, "b4": {"spectrum_89": ["309.1312"], "spectrum_107": ["309.1314"]}, "b5": {"spectrum_89": ["366.1527"], "spectrum_107": ["366.1537"]}, "y1": {"spectrum_89": ["156.0769"], "spectrum_107": ["156.0769"]}, "y2": {"spectrum_89": ["213.0985"], "spectrum_107": ["213.0985"]}, "y3": {"spectrum_89": ["372.1407"], "spectrum_107": ["350.1563"]}, "ImH": {"spectrum_89": ["110.0715"], "spectrum_107": ["110.0715"]}}}

I have seen examples of parsing multiple JSON objects, but only when they are in an array.

Can anyone help with this? I have no control over the format of the JSON files, so regenerating the data in an easier format is not an option. Apologies if this question has been answered before - I couldn't see any answers that would work for this particular case.

Upvotes: 1

Views: 1592

Answers (2)

buran

Reputation: 14233

This looks very much like malformed NDJSON (newline-delimited JSON). You can replace }{ with }\n{ and then parse the result with the third-party ndjson package:

import ndjson
with open('spam.json') as f:
    source = f.read()
    source = source.replace('}{', '}\n{')
    data = ndjson.loads(source)

print(data)
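
If installing ndjson is not an option, the same idea can probably be sketched with the standard library alone: split the objects onto separate lines, then parse each line with json.loads. This is only a minimal sketch. The function name parse_split_lines is made up for illustration, and it assumes each file is a single line and that "}{" never appears inside a string value.

```python
import json


def parse_split_lines(source):
    """Parse back-to-back JSON objects by putting each one on its own line.

    Assumption: "}{" only occurs at the boundary between two top-level
    objects, never inside a string value, and the input has no other
    line breaks inside an object.
    """
    lines = source.replace('}{', '}\n{').splitlines()
    # Skip blank lines (e.g. a trailing newline at the end of the file).
    return [json.loads(line) for line in lines if line.strip()]


# Hypothetical shortened sample in the same shape as the question's data.
sample = '{"GGGGHH": {"b2": {"spectrum_89": ["115.0502"]}}}{"GGGHGH": {"b2": {"spectrum_89": ["115.0502"]}}}'
data = parse_split_lines(sample)
print(data)
```

The result is a list of dicts, one per top-level object, which is the same shape ndjson.loads returns.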

Upvotes: 1

pho

Reputation: 25489

Your string is invalid JSON, but it looks like it's just a bunch of valid JSON dictionaries joined back-to-back without commas.

Just add commas between the dictionaries by replacing every occurrence of "}{" with "}, {", wrap the result in "[" and "]" to make it valid JSON for a list of dictionaries, and you're good to json.loads!

s = '{"GGGGHH": {"b2": {"spectrum_89": ["115.0502"]}, "b3": {"spectrum_89": ["172.0716"], "spectrum_107": ["172.0717"]}, "b4": {"spectrum_89": ["229.0934"]}, "b5": {"spectrum_89": ["366.1527"], "spectrum_107": ["366.1537"]}, "y1": {"spectrum_89": ["156.0769"], "spectrum_107": ["156.0769"]}, "y2": {"spectrum_89": ["293.1353"]}, "y3": {"spectrum_89": ["372.1407"], "spectrum_107": ["350.1563"]}, "a4": {"spectrum_89": ["202.1087"]}, "ImH": {"spectrum_89": ["110.0715"], "spectrum_107": ["110.0715"]}}}{"GGGHGH": {"b2": {"spectrum_89": ["115.0502"]}, "b3": {"spectrum_89": ["172.0716"], "spectrum_107": ["172.0717"]}, "b4": {"spectrum_89": ["309.1312"], "spectrum_107": ["309.1314"]}, "b5": {"spectrum_89": ["366.1527"], "spectrum_107": ["366.1537"]}, "y1": {"spectrum_89": ["156.0769"], "spectrum_107": ["156.0769"]}, "y2": {"spectrum_89": ["213.0985"], "spectrum_107": ["213.0985"]}, "y3": {"spectrum_89": ["372.1407"], "spectrum_107": ["350.1563"]}, "ImH": {"spectrum_89": ["110.0715"], "spectrum_107": ["110.0715"]}}}'
import json

json.loads("[" + s.replace("}{", "}, {") + "]")

Output:

[{'GGGGHH': {'b2': {'spectrum_89': ['115.0502']},
   'b3': {'spectrum_89': ['172.0716'], 'spectrum_107': ['172.0717']},
   'b4': {'spectrum_89': ['229.0934']},
   'b5': {'spectrum_89': ['366.1527'], 'spectrum_107': ['366.1537']},
   'y1': {'spectrum_89': ['156.0769'], 'spectrum_107': ['156.0769']},
   'y2': {'spectrum_89': ['293.1353']},
   'y3': {'spectrum_89': ['372.1407'], 'spectrum_107': ['350.1563']},
   'a4': {'spectrum_89': ['202.1087']},
   'ImH': {'spectrum_89': ['110.0715'], 'spectrum_107': ['110.0715']}}},
 {'GGGHGH': {'b2': {'spectrum_89': ['115.0502']},
   'b3': {'spectrum_89': ['172.0716'], 'spectrum_107': ['172.0717']},
   'b4': {'spectrum_89': ['309.1312'], 'spectrum_107': ['309.1314']},
   'b5': {'spectrum_89': ['366.1527'], 'spectrum_107': ['366.1537']},
   'y1': {'spectrum_89': ['156.0769'], 'spectrum_107': ['156.0769']},
   'y2': {'spectrum_89': ['213.0985'], 'spectrum_107': ['213.0985']},
   'y3': {'spectrum_89': ['372.1407'], 'spectrum_107': ['350.1563']},
   'ImH': {'spectrum_89': ['110.0715'], 'spectrum_107': ['110.0715']}}}]

For a more general case (for example, if there can be whitespace between two dictionaries), use a regular expression for the replacement:

import re

json.loads("[" + re.sub(r"\}\s*\{", "}, {", s) + "]")

where the regex "\}\s*\{" matches }, followed by 0 or more whitespace characters, followed by {.
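
One caveat with any textual replacement of "}{" is that it would also rewrite that sequence if it ever appeared inside a string value. An alternative that avoids touching the text is to let the standard library decoder consume one object at a time with json.JSONDecoder.raw_decode, which returns the parsed value together with the index where it stopped. The following is a rough sketch, not a tested drop-in; the helper name iter_json_objects and the sample string are made up for illustration.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated values."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos == end:
            break
        # Returns the decoded value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical sample: the first object contains "}{" inside a string value,
# and there is whitespace between the two objects.
sample = '{"a": {"k": "}{"}} \n{"b": {"spectrum_89": ["115.0502"]}}'
for obj in iter_json_objects(sample):
    print(obj)
```

Because this is a generator, list(iter_json_objects(s)) gives the same list of dictionaries as the replace-based approach, and malformed input still raises json.JSONDecodeError at the point where decoding fails.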

Upvotes: 0
