Reputation: 711
I've come across this problem very often now: I've got a nested dict (typically from an API / JSON payload) and need to flatten in for consumption as tabular data.
Example:
invoices = [
{
'id': 1,
'currency': 'GBP',
'lines': [
{'line_id': 1, 'product': 'pencils', 'quantity': 100, 'price': 20},
{'line_id': 2, 'product': 'erasers', 'quantity': 50, 'price': 10}
]
},
{
'id': 2,
'currency': 'USD',
'lines': [
{'line_id': 1, 'product': 'TV', 'quantity': 2, 'price': 800}
]
}
]
pd.DataFrame(invoices)
gives me a DataFrame with two lines, i.e. one per invoice. I can expand that to a per-line DataFrame as pd.DataFrame(invoices).explode('lines')
, but then I have dict
s in the lines
column. How do I split/explode those dict
keys into DataFrame columns?
Upvotes: 0
Views: 3324
Reputation: 3720
A refinement to my previous answer. This one retains the identifying columns:
df = pd.DataFrame(invoices).explode('lines')
df.apply(lambda x: x[['id','currency']].append(pd.Series(x['lines'])), axis=1)
Which delivers:
id currency line_id product quantity price
0 1 GBP 1 pencils 100 20
0 1 GBP 2 erasers 50 10
1 2 USD 1 TV 2 800
Upvotes: 2
Reputation: 711
Turns out that the latest version of pandas allows custom accessors, which you can use to make this possible:
# create per-line dataframe, as in the question
df = pd.DataFrame(invoices).explode('lines')
pd.concat([
df.drop(columns=['lines']), # remove nested column
df['lines'].dict.explode() # add flattened columns
], axis=1)
In order to be able to do this, you first need to define the .dict
accessor:
@pd.api.extensions.register_series_accessor('dict')
class DictAccessor:
def __init__(self, pandas_obj):
self._obj = pandas_obj
def __getitem__(self, key):
return self._obj.apply(curried.get(key))
def explode(self):
return self._obj.apply(pd.Series)
This accessor also allows you to access individual keys in the dict
, e.g.:
df = pd.DataFrame(invoices).explode('lines')
# total quantity across all invoices
df['lines'].dict['quantity'].sum()
# total cost per invoice
df.groupby('id').apply(lambda group: group['lines'].dict['price'].sum())
Upvotes: 2