Dan

Reputation: 487

Pandas: How to transform nested json with dynamic keys and arrays to pandas dataframe

How to transform nested json with dynamic keys and arrays to a pandas dataframe?

For the solutions I tried, please see the useful links below.

input json:

{
    "id": 1,
    "data": {
        "data_1_a": {
            "name_1a": {
                "label": "label_1",
                "units": {"A": [{"date": 2020, "val": 1}]}
            }
        },
        "data_1000_xyz": {
            "name_1b": {
                "label": "null",
                "units": {"B": [{"date": 2019, "val": 2},
                                {"date": 2020, "val": 3}]}
            },
            "name_10000_xyz": {
                "label": "null",
                "units": {"A": [{"date": 2018, "val": 4, "num": "str"},
                                {"date": 2019, "val": 5},
                                {"date": 2020, "val": 6, "num": "str"}]}
            }
        }
    }
}

required output df:

+---+--------------+----------------+---------+-------+------+-----+------+
|id |level_1       |level_2         |level_3  |level_4| date | val | num  |
+---+--------------+----------------+---------+-------+------+-----+------+
|1  |data_1_a      | name_1a        | units   | A     | 2020 |  1  | null |
|1  |data_1000_xyz | name_1b        | units   | B     | 2019 |  2  | null |
|1  |data_1000_xyz | name_1b        | units   | B     | 2020 |  3  | null |
|1  |data_1000_xyz | name_10000_xyz | units   | A     | 2018 |  4  | str  |
|1  |data_1000_xyz | name_10000_xyz | units   | A     | 2019 |  5  | null |
|1  |data_1000_xyz | name_10000_xyz | units   | A     | 2020 |  6  | str  |
+---+--------------+----------------+---------+-------+------+-----+------+

Useful links:

Upvotes: 0

Views: 1022

Answers (1)

Dan

Reputation: 487

Python Pandas solution:

    import pandas as pd

    # json_1 holds the input json from the question

    # 1) flatten the nested json into dotted column names
    df = pd.json_normalize(json_1)
    df_dic = df.to_dict('records')

    # 2) split each dotted key into levels
    data = []
    for row in df_dic:
        for key, value in row.items():
            if key == 'id':
                row_id = value
            else:
                keys = key.split('.')
                k = {i: s for i, s in enumerate(keys)}
                k.update({'value': value, 'id': row_id})
                data.append(k)

    df = (pd.DataFrame(data)[['id', 1, 2, 3, 4, 'value']]
          .rename(columns={1: 'level_1', 2: 'level_2', 3: 'level_3', 4: 'level_4'}))
    # drop the label rows, which have no fourth level
    df = df.loc[~df['level_4'].isnull()]

    # 3) explode each list of records into one row per record
    dfe = df.explode('value', ignore_index=True)

    # 4) pop the value column, build a new dataframe from its dicts,
    #    then join that frame back onto the exploded frame
    output_df = dfe.join(pd.DataFrame([*dfe.pop('value')], index=dfe.index))
    
    
        id  level_1         level_2         level_3 level_4 date    val num
    0   1   data_1_a        name_1a         units   A      2020     1   NaN
    1   1   data_1000_xyz   name_1b         units   B      2019     2   NaN
    2   1   data_1000_xyz   name_1b         units   B      2020     3   NaN
    3   1   data_1000_xyz   name_10000_xyz  units   A      2018     4   str
    4   1   data_1000_xyz   name_10000_xyz  units   A      2019     5   NaN
    5   1   data_1000_xyz   name_10000_xyz  units   A      2020     6   str
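Since the nesting depth here is fixed (only the key names are dynamic), the same rows can also be built with a plain recursive-style walk over the dict, skipping the flatten/split/explode round trip. A minimal sketch, using a reduced version of the question's input; `iter_rows` is a hypothetical helper, not part of pandas:

```python
import pandas as pd

# Reduced sample in the shape of the question's input json.
json_1 = {
    "id": 1,
    "data": {
        "data_1_a": {
            "name_1a": {
                "label": "label_1",
                "units": {"A": [{"date": 2020, "val": 1}]},
            }
        },
    },
}

def iter_rows(doc):
    """Yield one flat dict per unit record, walking the dynamic keys."""
    for level_1, names in doc["data"].items():          # e.g. "data_1_a"
        for level_2, body in names.items():             # e.g. "name_1a"
            for level_4, records in body["units"].items():  # e.g. "A"
                for rec in records:                     # e.g. {"date": 2020, "val": 1}
                    yield {"id": doc["id"],
                           "level_1": level_1,
                           "level_2": level_2,
                           "level_3": "units",
                           "level_4": level_4,
                           **rec}

df = pd.DataFrame(iter_rows(json_1))
```

This keeps missing keys like `num` as NaN automatically, since `pd.DataFrame` unions the dict keys across rows.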

Upvotes: 1
