Reputation: 89
My data in JSON format:
[
  {
    "UNIT": "UNIT1",
    "PROJECTS": [
      {
        "PROJECT": "A",
        "PERIODS": [
          {
            "PERIOD": "2019",
            "TEAMS": [
              {
                "TEAM": "Team A",
                "MEMBERS": [
                  {
                    "NAME": "FANNY",
                    "ID": 111
                  },
                  {
                    "NAME": "TANG",
                    "ID": 222
                  }
                ]
              },
              {
                "TEAM": "Team B",
                "MEMBERS": [
                  {
                    "NAME": "TIM",
                    "ID": 444
                  },
                  {
                    "NAME": "PAUL",
                    "ID": 555
                  }
                ]
              }
            ]
          }
        ]
      },
      {
        "PROJECT": "B",
        "PERIODS": [
          {
            "PERIOD": "2021",
            "TEAMS": [
              {
                "TEAM": "Team A",
                "MEMBERS": [
                  {
                    "NAME": "BENNY",
                    "ID": 121
                  },
                  {
                    "NAME": "JENNY",
                    "ID": 122
                  }
                ]
              },
              {
                "TEAM": "Team B",
                "MEMBERS": [
                  {
                    "NAME": "CHRIS",
                    "ID": 123
                  },
                  {
                    "NAME": "TANG",
                    "ID": 124
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
]
Expected output dataframe:
    UNIT  PROJECT  PERIOD   NAME   ID
0  UNIT1        A    2019  FANNY  111
1  UNIT1        A    2019   TANG  222
2  UNIT1        A    2019    TIM  444
3  UNIT1        A    2019   PAUL  555
4  UNIT1        B    2021  BENNY  121
5  UNIT1        B    2021  JENNY  122
6  UNIT1        B    2021  CHRIS  123
7  UNIT1        B    2021   TANG  124
I would like to store the data in the JSON format shown above. The data structure could grow large in the future, which is why I chose this nested layout. However, I am finding it difficult to convert it back into a DataFrame.
The JSON is deeply nested; I have tried pd.json_normalize, but I cannot achieve the expected output.
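For reference, a bare call like the one below (a minimal sketch of such an attempt; 'data.json' is a hypothetical file name holding the JSON above) only flattens the top level and leaves PROJECTS as a column of lists:

import json
import pandas as pd

# 'data.json' is a hypothetical file containing the JSON shown above
with open('data.json') as f:
    data = json.load(f)

df = pd.json_normalize(data)
# -> a single row with the columns UNIT and PROJECTS,
#    where PROJECTS still holds the nested list of dicts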
Upvotes: 1
Views: 487
Reputation: 11522
You can do this, and the advantage of this solution is that you never need to care about the paths in your JSON.
Define the following function (it works on any JSON):
import json
import pandas as pd
def flatten_nested_json_df(df):
    df = df.reset_index()

    # find the columns whose values are all lists / all dicts
    s = (df.applymap(type) == list).all()
    list_columns = s[s].index.tolist()
    s = (df.applymap(type) == dict).all()
    dict_columns = s[s].index.tolist()

    while len(list_columns) > 0 or len(dict_columns) > 0:
        new_columns = []

        for col in dict_columns:
            # expand dicts horizontally, prefixing the new columns with the parent name
            horiz_exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
            horiz_exploded.index = df.index
            df = pd.concat([df, horiz_exploded], axis=1).drop(columns=[col])
            new_columns.extend(horiz_exploded.columns)  # inplace

        for col in list_columns:
            # expand lists vertically, one row per list element
            #print(f"exploding: {col}")
            df = df.drop(columns=[col]).join(df[col].explode().to_frame())
            new_columns.append(col)

        # check whether the freshly created columns still contain lists or dicts
        s = (df[new_columns].applymap(type) == list).all()
        list_columns = s[s].index.tolist()
        s = (df[new_columns].applymap(type) == dict).all()
        dict_columns = s[s].index.tolist()

    return df
and then do the following:
with open(your_json_file) as f:
    data = json.load(f)

df = pd.json_normalize(data)
outdf = flatten_nested_json_df(df)
which returns:
    index   UNIT PROJECTS.PROJECT PROJECTS.PERIODS.PERIOD  \
0       0  UNIT1                A                    2019
0       0  UNIT1                A                    2019
0       0  UNIT1                A                    2019
0       0  UNIT1                A                    2019
0       0  UNIT1                A                    2019
..    ...    ...              ...                     ...
0       0  UNIT1                B                    2021
0       0  UNIT1                B                    2021
0       0  UNIT1                B                    2021
0       0  UNIT1                B                    2021
0       0  UNIT1                B                    2021

   PROJECTS.PERIODS.TEAMS.TEAM PROJECTS.PERIODS.TEAMS.MEMBERS.NAME  \
0                       Team A                               FANNY
0                       Team A                                TANG
0                       Team A                                 TIM
0                       Team A                                PAUL
0                       Team A                               BENNY
..                         ...                                 ...
0                       Team B                                PAUL
0                       Team B                               BENNY
0                       Team B                               JENNY
0                       Team B                               CHRIS
0                       Team B                                TANG

   PROJECTS.PERIODS.TEAMS.MEMBERS.ID
0                                111
0                                222
0                                444
0                                555
0                                121
..                               ...
0                                555
0                                121
0                                122
0                                123
0                                124

[2048 rows x 7 columns]
Upvotes: 0
Reputation: 58
You can use the pandas json_normalize function with some parameters.
It should look something like this:
df = pd.json_normalize(
    data,  # the parsed JSON (e.g. from json.load), not the file name
    record_path=['PROJECTS', 'PERIODS', 'TEAMS', 'MEMBERS'],
    meta=[
        'UNIT',
        ['PROJECTS', 'PROJECT'],
        ['PROJECTS', 'PERIODS', 'PERIOD']
    ]
)
record_path points at the innermost list whose elements become the rows (here the MEMBERS entries), and in the meta parameter you list the JSON path of each additional field that you want to show in your pandas DataFrame.
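With the keys from the posted JSON, the record columns come out as NAME and ID, while the meta columns keep their dotted prefixes (PROJECTS.PROJECT, PROJECTS.PERIODS.PERIOD). If the goal is exactly the header shown in the question, a rename and reorder afterwards should get there (a minimal sketch, continuing from the df built above):

df = df.rename(columns={
    'PROJECTS.PROJECT': 'PROJECT',
    'PROJECTS.PERIODS.PERIOD': 'PERIOD'
})[['UNIT', 'PROJECT', 'PERIOD', 'NAME', 'ID']]
# one row per member: UNIT, PROJECT, PERIOD, NAME, ID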
Upvotes: 0