Alana
Alana

Reputation: 89

Convert deeply nested JSON to Pandas dataframe

My data in JSON format:

[
    {
        "UNIT": "UNIT1",
        "PROJECTS": [
            {
                "PROJECT": "A",
                "PERIODS": [
                    {
                        "PERIOD": "2019",
                        "TEAMS": [
                            {
                                "TEAM": "Team A",
                                "MEMBERS": [
                                    {
                                        "NAME": "FANNY",
                                        "ID": 111
                                    },
                                    {
                                        "NAME": "TANG",
                                        "ID": 222
                                    }
                                ]
                            },
                            {
                                "TEAM": "Team B",
                                "MEMBERS": [
                                    {
                                        "NAME": "TIM",
                                        "ID": 444
                                    },
                                    {
                                        "NAME": "PAUL",
                                        "ID": 555
                                    }
                                ]
                            }
                        ]
                    }
                ]
            },
            {
                "PROJECT": "B",
                "PERIODS": [
                    {
                        "PERIOD": "2021",
                        "TEAMS": [
                            {
                                "TEAM": "Team A",
                                "MEMBERS": [
                                    {
                                        "NAME": "BENNY",
                                        "ID": 121
                                    },
                                    {
                                        "NAME": "JENNY",
                                        "ID": 122
                                    }
                                ]
                            },
                            {
                                "TEAM": "Team B",
                                "MEMBERS": [
                                    {
                                        "NAME": "CHRIS",
                                        "ID": 123
                                    },
                                    {
                                        "NAME": "TANG",
                                        "ID": 124
                                    }
                                ]
                            }
                        ]
                    }
                ]
            }
        ]
    }
]

Expected output dataframe

    UNIT PROJECT PERIOD   NAME   ID
0  UNIT1       A   2019  FANNY  111
1  UNIT1       A   2019   TANG  222
2  UNIT1       A   2019    TIM  444
3  UNIT1       A   2019   PAUL  555
4  UNIT1       B   2021  BENNY  121
5  UNIT1       B   2021  JENNY  122
6  UNIT1       B   2021  CHRIS  123
7  UNIT1       B   2021   TANG  124

I would like to store the data as formatted in the above JSON. In the future, the data structure could be big, so in order to make it nested, I choose the above way to store the data. However, I found it could be difficult to convert it back to dataframe.

The above JSON is deeply nested, I have tried pd.json_normalize but I cannot achieve the expected output.

Upvotes: 1

Views: 487

Answers (2)

You can do this and the advantage of this solution is that you never need to care about the paths in your json.

Defined the following function (it works on any json):

import json
import pandas as pd
def flatten_nested_json_df(df):
    df = df.reset_index()
    s = (df.applymap(type) == list).all()
    list_columns = s[s].index.tolist()
    
    s = (df.applymap(type) == dict).all()
    dict_columns = s[s].index.tolist()

    
    while len(list_columns) > 0 or len(dict_columns) > 0:
        new_columns = []

        for col in dict_columns:
            horiz_exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
            horiz_exploded.index = df.index
            df = pd.concat([df, horiz_exploded], axis=1).drop(columns=[col])
            new_columns.extend(horiz_exploded.columns) # inplace

        for col in list_columns:
            #print(f"exploding: {col}")
            df = df.drop(columns=[col]).join(df[col].explode().to_frame())
            new_columns.append(col)

        s = (df[new_columns].applymap(type) == list).all()
        list_columns = s[s].index.tolist()

        s = (df[new_columns].applymap(type) == dict).all()
        dict_columns = s[s].index.tolist()
    return df

and then do the following:

with open(your_json_file) as f:
    data = json.load(f)
df = pd.json_normalize(data)

outdf = flatten_nested_json_df(df)

which returns:

   index   UNIT PROJECTS.PROJECT PROJECTS.PERIODS.PERIOD  \
0       0  UNIT1                A                    2019   
0       0  UNIT1                A                    2019   
0       0  UNIT1                A                    2019   
0       0  UNIT1                A                    2019   
0       0  UNIT1                A                    2019   
..    ...    ...              ...                     ...   
0       0  UNIT1                B                    2021   
0       0  UNIT1                B                    2021   
0       0  UNIT1                B                    2021   
0       0  UNIT1                B                    2021   
0       0  UNIT1                B                    2021   

   PROJECTS.PERIODS.TEAMS.TEAM PROJECTS.PERIODS.TEAMS.MEMBERS.NAME  \
0                       Team A                               FANNY   
0                       Team A                                TANG   
0                       Team A                                 TIM   
0                       Team A                                PAUL   
0                       Team A                               BENNY   
..                         ...                                 ...   
0                       Team B                                PAUL   
0                       Team B                               BENNY   
0                       Team B                               JENNY   
0                       Team B                               CHRIS   
0                       Team B                                TANG   

    PROJECTS.PERIODS.TEAMS.MEMBERS.ID  
0                                 111  
0                                 222  
0                                 444  
0                                 555  
0                                 121  
..                                ...  
0                                 555  
0                                 121  
0                                 122  
0                                 123  
0                                 124  

[2048 rows x 7 columns]

Upvotes: 0

LMark
LMark

Reputation: 58

You can use the Pandas library json_normalize function with some parameters.

It should like something like this

df = pd.json_normalize(
    name_of_the_file,
    meta=[
        'unit',
        ['unit', 'projects', 'project'],
        ['unit', 'projects', 'periods', 'period'],
        ['unit', 'projects', 'periods', 'teams', 'members', 'name'],
        ['unit', 'projects', 'periods', 'teams', 'members', 'id']
    ]
)

Into the meta parameter you should write the json path of each field which you want to show in your pandas dataframe.

Upvotes: 0

Related Questions