Coldchain9
Coldchain9

Reputation: 1745

Flattening Nested JSON Dict with Variable Amount of Dictionaries and List Elements

I have an example output JSON in dictionary form:

d = {'httpStatus': 200,
     'httpStatusMessage': 'Success',
     'timestamp': '2022-02-16T19:06:00.1924563Z',
     'response': {'header': {'companyId': '1000',
     'companyName': 'Bobs Groceries'},
     'body': {'dataSources': [{'dataSource': 3,
     'employees': [{'employeeId': '25',
       'employeeReference': '1500',
       'activeFlag': True,
       'createdDate': '2022-01-27T15:20:38.2700000Z',
       'lastUpdate': '2022-01-27T15:20:38.3500000Z',
       'firstName': 'Bob',
       'lastName': 'Brantley'},
      {'employeeId': '28',
       'employeeReference': '1505',
       'activeFlag': True,
       'createdDate': '2022-01-27T15:20:24.2400000Z',
       'lastUpdate': '2022-01-27T15:20:24.2400000Z',
       'firstName': 'Jeffrey',
       'lastName': 'Johnson'}]}]}}}

I would like to recursively parse this JSON and create a master dictionary with the list elements exploded into new dictionary keys with the values as lists. So essentially I want to convert lists of dictionaries into dictionaries of lists.

For the above example I desire the output to be something like:

output = {'httpStatus': 200,
          'httpStatusMessage': 'Success',
          'timestamp': '2022-02-16T19:06:00.1924563Z',
          'response_header_companyId': '1000',
          'companyName': 'Bobs Groceries',
          'dataSource': 3,
          'employeeId': ['25','28'],
          'employeeReference': ['1500','1505'],
          'activeFlag': [True, True],
          'createdDate': ['2022-01-27T15:20:38.2700000Z','2022-01-27T15:20:24.2400000Z'],
          'lastUpdate': ['2022-01-27T15:20:38.3500000Z','2022-01-27T15:20:24.2400000Z'],
          'firstName': ['Bob','Jeffrey'],
          'lastName': ['Brantley','Johnson']}

I can come somewhat close using this function:

def flatten_nested_json(d: dict)-> dict:
    """
    Accepts Dictionary argument which can have nested dictionaries/lists within.
    Output will be a flat dictionary that can be converted to a pandas dataframe
    """
    out = {}
    def flatten(x, name: str=''):
        # handles dictionaries with elements
        if type(x) is dict:
            for k in x:
                flatten(x[k], name + k + "_")
        # handles lists with elements
        elif type(x) is list:
            for j in x:
                if type(j) is dict:
                    for y, z in j.items():
                        out[y] = z
                else:
                    out[j] = x
        else: 
            out[name[:-1]] = x
    flatten(d)
    return out

However, the list of dictionaries with employeeId, etc. remains as a dictionary. I need to find a way to add the recursion for that part so I can explode the dictionaries into new keys with the values as lists. Basically I want to combine all of those dictionaries with similar keys in a list into a single dictionary with list elements containing the values of each dictionary.

A dynamic approach without pandas would be desired. Thanks!

Upvotes: 1

Views: 370

Answers (1)

quasi-human
quasi-human

Reputation: 1928

jsonpath-ng can parse even such a nested json object very easily. It can be installed by the following command:

pip install --upgrade jsonpath-ng

Code:

from collections import defaultdict
import jsonpath_ng as jp

def flatten_nested_json(d: dict)-> dict:
    expr = jp.parse('$..*')
    dd = defaultdict(list)
    for m in expr.find(d):
        if not isinstance(m.value, (dict, list)):
            dd[str(m.path)].append(m.value)
    for k, v in dd.items():
        if len(v) == 1:
            dd[k] = v[0]
    return dict(dd)

result = flatten_nested_json(d)

Explanation:

The key point is the following line:

expr = jp.parse('$..*')

By the above expression, you can search target elements at any level of the json object. README.rst tells you what each syntax stands for:

Syntax Meaning
$ The root object
jsonpath1 .. jsonpath2 All nodes matched by jsonpath2 that descend from any node matching jsonpath1
* any field

Output:

{'httpStatus': 200,
 'httpStatusMessage': 'Success',
 'timestamp': '2022-02-16T19:06:00.1924563Z',
 'companyId': '1000',
 'companyName': 'Bobs Groceries',
 'dataSource': 3,
 'employeeId': ['25', '28'],
 'employeeReference': ['1500', '1505'],
 'activeFlag': [True, True],
 'createdDate': ['2022-01-27T15:20:38.2700000Z',
  '2022-01-27T15:20:24.2400000Z'],
 'lastUpdate': ['2022-01-27T15:20:38.3500000Z',
  '2022-01-27T15:20:24.2400000Z'],
 'firstName': ['Bob', 'Jeffrey'],
 'lastName': ['Brantley', 'Johnson']}

[EDIT]

If you want a full path to each value, try the following code:

Code (A full path as a key):

from collections import defaultdict
import jsonpath_ng as jp
import re

auto_id_field = 'json_path'
jp.jsonpath.auto_id_field = auto_id_field

def flatten_nested_json(d: dict)-> dict:
    expr = jp.parse(f'$..*.{auto_id_field}')
    pat = re.compile(r'(?:\.\[\d+]\.|\.)')
    dd = defaultdict(list)
    for m in expr.find(d):
        if not isinstance(m.datum.value, (dict, list)):
            dd[pat.sub('_', m.value)].append(m.datum.value)
    for k, v in dd.items():
        if len(v) == 1:
            dd[k] = v[0]
    return dict(dd)

result = flatten_nested_json(d)

Output (A full path as a key):

{'httpStatus': 200,
 'httpStatusMessage': 'Success',
 'timestamp': '2022-02-16T19:06:00.1924563Z',
 'response_header_companyId': '1000',
 'response_header_companyName': 'Bobs Groceries',
 'response_body_dataSources_dataSource': 3,
 'response_body_dataSources_employees_employeeId': ['25', '28'],
 'response_body_dataSources_employees_employeeReference': ['1500', '1505'],
 'response_body_dataSources_employees_activeFlag': [True, True],
 'response_body_dataSources_employees_createdDate': ['2022-01-27T15:20:38.2700000Z',
  '2022-01-27T15:20:24.2400000Z'],
 'response_body_dataSources_employees_lastUpdate': ['2022-01-27T15:20:38.3500000Z',
  '2022-01-27T15:20:24.2400000Z'],
 'response_body_dataSources_employees_firstName': ['Bob', 'Jeffrey'],
 'response_body_dataSources_employees_lastName': ['Brantley', 'Johnson']}

Upvotes: 1

Related Questions