Elias Mi
Elias Mi

Reputation: 711

Pandas: Flatten dict values in DataFrame

I've come across this problem very often now: I've got a nested dict (typically from an API / JSON payload) and need to flatten in for consumption as tabular data.

Example:

invoices = [
    {
        'id': 1,
        'currency': 'GBP',
        'lines': [
            {'line_id': 1, 'product': 'pencils', 'quantity': 100, 'price': 20},
            {'line_id': 2, 'product': 'erasers', 'quantity': 50, 'price': 10}
        ]
    },
    {
        'id': 2,
        'currency': 'USD',
        'lines': [
            {'line_id': 1, 'product': 'TV', 'quantity': 2, 'price': 800}
        ]
    }
]

pd.DataFrame(invoices) gives me a DataFrame with two lines, i.e. one per invoice. I can expand that to a per-line DataFrame as pd.DataFrame(invoices).explode('lines'), but then I have dicts in the lines column. How do I split/explode those dict keys into DataFrame columns?

Upvotes: 0

Views: 3324

Answers (2)

sitting_duck
sitting_duck

Reputation: 3720

A refinement to my previous answer. This one retains the identifying columns:

df = pd.DataFrame(invoices).explode('lines')
df.apply(lambda x: x[['id','currency']].append(pd.Series(x['lines'])), axis=1)

Which delivers:

   id currency  line_id  product  quantity  price
0   1      GBP        1  pencils       100     20
0   1      GBP        2  erasers        50     10
1   2      USD        1       TV         2    800

Upvotes: 2

Elias Mi
Elias Mi

Reputation: 711

Turns out that the latest version of pandas allows custom accessors, which you can use to make this possible:

# create per-line dataframe, as in the question
df = pd.DataFrame(invoices).explode('lines')

pd.concat([
    df.drop(columns=['lines']),  # remove nested column
    df['lines'].dict.explode()   # add flattened columns
], axis=1)

In order to be able to do this, you first need to define the .dict accessor:

@pd.api.extensions.register_series_accessor('dict')
class DictAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj
        
    def __getitem__(self, key):
        return self._obj.apply(curried.get(key))
    
    def explode(self):
        return self._obj.apply(pd.Series)

This accessor also allows you to access individual keys in the dict, e.g.:

df = pd.DataFrame(invoices).explode('lines')

# total quantity across all invoices
df['lines'].dict['quantity'].sum()

# total cost per invoice
df.groupby('id').apply(lambda group: group['lines'].dict['price'].sum())

Upvotes: 2

Related Questions