Reputation: 145
Lets say I have the following input data coming from a database.
data = [{'abc': 'None',
'cde': 4,
'efg': 1,
'blah': {'k': 23,
'y': 26,
'u':48}
},{'abc': 'hdf',
'cde':10,
'efg': 2,
'blah': {'k': 244,
'y': 747,
'u':75,
'g':88},
},
]
The following is the look of the pandas data frame which I would like to produce:
blah
cde abc g k u y
efg 1 4 'None' N/A 23 48 26
2 10 'hdf' 88 244 75 747
key features:
Attempts: When I take the data into a pandas data frame I get the following:
abc cde efg blah
k None 4 1 23
u None 4 1 48
y None 4 1 26
g hdf 10 2 88
k hdf 10 2 244
u hdf 10 2 75
y hdf 10 2 747
I suspect pandas is taking the nested dict keys as an index, so i reset it which gives the following:
index abc cde efg blah
0 k None 4 1 23
1 u None 4 1 48
2 y None 4 1 26
3 g hdf 10 2 88
4 k hdf 10 2 244
5 u hdf 10 2 75
6 y hdf 10 2 747
So I pivot this table by the pandas supplied 'index' column, i would expect it to have a stacked cross tab with the 'blah' parent as the top most item, with sub columns being its nested dict ('k','u', 'y', 'g'). In addition I expect the other columns to stay intact, with only the 'blah' section having merged cells. But, we get this instead:
>>> t_pivot = t_concat.pivot_table(columns='index')
>>> t_pivot
index g k u y
blah 88.0 133.5 61.5 386.5
cde 10.0 7.0 7.0 7.0
efg 2.0 1.5 1.5 1.5
Ok not great, So maybe i need to be explicit about what data we should index by:
t_pivot = t_concat.pivot_table(index = 'efg',columns=['index', 'cde'])
>>> t_pivot
blah
index g k u y
cde 10 4 10 4 10 4 10
efg
1 NaN 23.0 NaN 48.0 NaN 26.0 NaN
2 88.0 NaN 244.0 NaN 75.0 NaN 747.0
Still not right.
So any ideas on how to form a data frame from a dict with a nested dict? Primarily making cells merge properly for parents and their respective sub elements. The end result will go into an html document.
Other things I have tried: - turning the inner dict keys into tuples - using the stack() method, but cant seem to get anything looking right
Upvotes: 3
Views: 4614
Reputation: 17349
df = pd.DataFrame(data).set_index('efg')
sub_df = df.blah.apply(pd.Series)
sub_df = pd.concat({'blah': sub_df}, axis=1, names=['l1', 'l2'])
df = pd.concat({'': df}, axis=1, names=['l1', 'l2'])
df = pd.concat((df, sub_df), axis=1)
df = df.drop(('', 'blah'), axis=1)
print df
l1 blah
l2 abc cde g k u y
efg
1 None 4 NaN 23.0 48.0 26.0
2 hdf 10 88.0 244.0 75.0 747.0
The hierarchical column solution is based on this answer.
Upvotes: 1