Fat_princess
Fat_princess

Reputation: 11

Flatten list of dict (array) as new columns not working with json normalize

Apologies as I'm quite inexperienced with Python and looking for assistance here 🙏

I am trying to flatten a list of dicts into new columns in an existing dataframe, with using key as column names and values as cell content. Now json_normalize is successful in unwrapping the data, but fails to place dict*value*1 as column header, and dict*value*2 as cell content. In the examples I saw in other posts, it did work, but somehow not for me.

The column originally looks like this :

View of the column (1D array) before unwrapping

What I obtain is something like this:

View after unwrapping with json_normalize:

View after unwrapping with json_normalize

Obviously what I'd like to achieve is something like this:

Desired Outpu:

Desired Output

I have tried pd.DataFrame(form_records(df\["fields"\]), and also tried pd.json_normalize(df\["fields"\]) which both came to the same output. The returned dataframe has a dict in each cell which is not what I am trying to get. What I'd like is "id" as column name, and "value" as cell content for every new column created.

Eventually I tried after json_normalize to extract the "ids" to rename my columns, like so :

#extracting row 1
my_list = fields_values.iloc[:1].values.flatten().tolist()

# getting all values under the "id" key 
ids = [d['id'] for d in my_list]

#extracting all current column names 
col_names = fields_values.columns.tolist()

#zipping the extract "ids" with the current col names 
new_col_names = dict(zip(col_names, ids))

#renaming the columns with the ids 
fields_values.rename(columns=new_col_names, inplace=True)

This eventually gave me the desired col names, but I still have the cell content with the dicts instead of only the value under "value" ... So I'm not sure if this is the best practice :

Correct col names but incorrect cell content:

correct col names but incorrect cell content

Does any one has any advice on this? Is there something I'm missing in the json_normalize function arguments, or shall I continue with extracting per row/col the dict values and replace in each cell (I find this to be such a pain!)? Isn't there a more efficient way to do this?

Upvotes: 1

Views: 210

Answers (1)

Bushmaster
Bushmaster

Reputation: 4608

You can use explode() to convert list elements to new rows. Then use pd.Series to convert dict type values to new columns. Then use transpose to transpose index and column:

df = df.explode("fields")["fields"].apply(pd.Series)
df = df.groupby("id")["value"].apply(list).to_frame().T
df = df.explode(df.columns.tolist(),ignore_index=True)

Example:

df = pd.DataFrame({"fields":[[{"id":2753537689,"value":"abc"},{"id":27537849,"value":123}],
                                  [{"id":2753537689,"value":"bcd"},{"id":27537849,"value":456}]]})
'''
    fields
0   [{'id': 2753537689, 'value': 'abc'}, {'id': 27537849, 'value': 123}]
1   [{'id': 2753537689, 'value': 'bcd'}, {'id': 27537849, 'value': 456}]

'''
df = df.explode("fields")["fields"].apply(pd.Series)
df = df.groupby("id")["value"].apply(list).to_frame().T
df = df.explode(df.columns.tolist(),ignore_index=True)

Out:

id    27537849   2753537689
0        123        abc
1        456        bcd

Note: explode() may cause performance or memory problems in large datasets.

Upvotes: 1

Related Questions