constructing pandas dataframe from csv that has columns of dictionaries

Question

I have a csv that contains multiple columns filled with a single dict. There are thousands of rows. I want to pull these dicts out and make columns out of their keys and fill the cells in with their values, filling in NaN where a value is missing. So that:

   id                            attributes
0   255RSSSTCHL-QLTDGLZD-BLK     {"color": "Black", "hardware": "Goldtone"}
1   C3ACCRDNFLP-QLTDS-S-BLK      {"size": "Small", "color": "Black"}

Becomes:

   id                            size   color   hardware  
0   255RSSSTCHL-QLTDGLZD-BLK     NaN    Black   Goldtone
1   C3ACCRDNFLP-QLTDS-S-BLK      Small  Black   NaN

There are several columns like 'id' that I would like to keep untouched in the resulting DataFrame and there are several columns like 'attributes' that are filled with dicts that I want to blow out into columns. I truncated them to the example above for illustration.

MaxU - stand with Ukraine · Accepted Answer

Source DF:

In [172]: df
Out[172]:
                         id                               attributes                       attr2
0  255RSSSTCHL-QLTDGLZD-BLK  {"color":"Black","hardware":"Goldtone"}  {"aaa":"aaa", "bbb":"bbb"}
1   C3ACCRDNFLP-QLTDS-S-BLK         {"size":"Small","color":"Black"}               {"ccc":"ccc"}

Solution 1:

import ast

attr_cols = ['attributes','attr2']

def f(df, attr_col):
    return df.join(df.pop(attr_col) \
             .apply(lambda x: pd.Series(ast.literal_eval(x))))


for col in attr_cols:
    df = f(df, col)

Solution 2: thanks to @DYZ for the hint:

import json

attr_cols = ['attributes','attr2']

def f(df, attr_col):
    return df.join(df.pop(attr_col) \
             .apply(lambda x: pd.Series(json.loads(x))))

for col in attr_cols:
    df = f(df, col)

Result:

In [175]: df
Out[175]:
                         id  color  hardware   size  aaa  bbb  ccc
0  255RSSSTCHL-QLTDGLZD-BLK  Black  Goldtone    NaN  aaa  bbb  NaN
1   C3ACCRDNFLP-QLTDS-S-BLK  Black       NaN  Small  NaN  NaN  ccc

Timing: for 20.000 rows DF:

In [198]: df = pd.concat([df] * 10**4, ignore_index=True)

In [199]: df.shape
Out[199]: (20000, 3)

In [201]: %paste
def f_ast(df, attr_col):
    return df.join(df.pop(attr_col) \
             .apply(lambda x: pd.Series(ast.literal_eval(x))))

def f_json(df, attr_col):
    return df.join(df.pop(attr_col) \
             .apply(lambda x: pd.Series(json.loads(x))))
## -- End pasted text --

In [202]: %%timeit
     ...: for col in attr_cols:
     ...:     f_ast(df.copy(), col)
     ...:
1 loop, best of 3: 33.1 s per loop

In [203]:

In [203]: %%timeit
     ...: for col in attr_cols:
     ...:     f_json(df.copy(), col)
     ...:
1 loop, best of 3: 30 s per loop

In [204]: df.shape
Out[204]: (20000, 3)

constructing pandas dataframe from csv that has columns of dictionaries

Answers (2)

Related Questions