Reputation: 11793
I have a three-level hierarchical dataset, like below:
pd.DataFrame({'level1': ['A', 'A', 'A', 'A', 'B'],
'level2': ['A1', 'A1', 'A2', 'A2', 'B1' ],
'level3': ['a', 'b', 'c', 'd', 'e'],
'value': [44, 125, 787, 99, 111],
'pctChg': [0.3, -0.9, -10.0, 12, -0.2]})
level1 level2 level3 value pctChg
0 A A1 a 44 0.3
1 A A1 b 125 -0.9
2 A A2 c 787 -10.0
3 A A2 d 99 12.0
4 B B1 e 111 -0.2
For a specific level1 category like A
, there are level2 category like A1, A2
.
under levels two, there are many level3 categories. For example, 'a' and 'b' are under 'A1',
'c' and 'd' are under A2. This data is just an example. For each combination, there are value and percent info (percentage change from last month).
I need to tranform this data into a nested json data. It needs to be the format like below:
{
name: “root”,
value: 1166,
pctChg: xx%,
children: [
{
name: 'A',
value: 956,
pctChg: 'xx%'',
children: [{
name: 'A1',
value: 169,
pctChg: 'xx%'',
children: [{name: 'a', value: 44, pctChg: '30%'},
{name:'b', value:125, pctChg: '-90%'},
{name:'c', value:787, pctChg: '-10%'}
]
}, .....]
},
.....…
]
}
We also need to aggregate the value for a level from all childrens at one level down. Value can be aggragated obviously. One tricky part is percent. We might not want to simply aggregate the percentage.
This looks like a pretty difficult task. Not like some simple nested json data. I'm not sure how I can approach that. Appreciate if anyone can help. Thanks a lot in advance.
Upvotes: 1
Views: 504
Reputation: 30991
The first step is to reformat pctChg column to percentage string:
df.pctChg = (df.pctChg * 100).astype(int).astype(str) + '%'
(I assumed multiply by 100 formula).
Then define 2 functions computing children of the second and first level:
def chld2(grp):
return grp.rename(columns={'level3': 'name'}).groupby('level2')\
.apply(lambda grp: pd.Series({'name': grp.iloc[0,1], 'value': grp.value.sum(),
'pctChg': 'xx%', 'children': grp[['name', 'value', 'pctChg']].to_dict('r') }))\
.to_dict('r')
def chld1(df):
return df.groupby('level1').apply(lambda grp: pd.Series({
'name': grp.iloc[0,0], 'value': grp.value.sum(), 'pctChg': 'xx%',
'children': chld2(grp)})).to_dict('r')
And to generate the result, run:
pd.Series({'name': 'root', 'value': df.value.sum(), 'pctChg': 'xx%',
'children': chld1(df)}).to_json()
The result (with manually added indentation for readability) is:
{ "name":"root", "value":1166, "pctChg":"xx%",
"children":[
{ "name":"A", "value":1055, "pctChg":"xx%",
"children":[
{ "name":"A1", "value":169, "pctChg":"xx%",
"children":[
{"name":"a", "value":44, "pctChg":"30%"},
{"name":"b", "value":125, "pctChg":"-90%"}
]
},
{ "name":"A2", "value":886, "pctChg":"xx%",
"children":[
{"name":"c", "value":787, "pctChg":"-1000%"},
{"name":"d", "value":99, "pctChg":"1200%"}
]
}
]
},
{ "name":"B", "value":111, "pctChg":"xx%",
"children":[
{ "name":"B1", "value":111, "pctChg":"xx%",
"children":[
{"name":"e", "value":111, "pctChg":"-20%"}
]
}
]
}
]
}
Upvotes: 1