Reputation: 11
Apologies as I'm quite inexperienced with Python and looking for assistance here 🙏
I am trying to flatten a list of dicts
into new columns in an existing dataframe
, with using key as column names and values as cell content. Now json_normalize
is successful in unwrapping the data, but fails to place dict*value*1
as column header, and dict*value*2
as cell content. In the examples I saw in other posts, it did work, but somehow not for me.
The column originally looks like this :
What I obtain is something like this:
View after unwrapping with json_normalize:
Obviously what I'd like to achieve is something like this:
Desired Outpu:
I have tried pd.DataFrame(form_records(df\["fields"\])
, and also tried pd.json_normalize(df\["fields"\])
which both came to the same output. The returned dataframe
has a dict
in each cell which is not what I am trying to get. What I'd like is "id" as column name, and "value" as cell content for every new column created.
Eventually I tried after json_normalize
to extract the "ids" to rename my columns, like so :
#extracting row 1
my_list = fields_values.iloc[:1].values.flatten().tolist()
# getting all values under the "id" key
ids = [d['id'] for d in my_list]
#extracting all current column names
col_names = fields_values.columns.tolist()
#zipping the extract "ids" with the current col names
new_col_names = dict(zip(col_names, ids))
#renaming the columns with the ids
fields_values.rename(columns=new_col_names, inplace=True)
This eventually gave me the desired col names, but I still have the cell content with the dicts instead of only the value under "value" ... So I'm not sure if this is the best practice :
Correct col names but incorrect cell content:
Does any one has any advice on this? Is there something I'm missing in the json_normalize
function arguments, or shall I continue with extracting per row/col the dict values and replace in each cell (I find this to be such a pain!)? Isn't there a more efficient way to do this?
Upvotes: 1
Views: 210
Reputation: 4608
You can use explode()
to convert list elements to new rows. Then use pd.Series
to convert dict type values to new columns. Then use transpose
to transpose index and column:
df = df.explode("fields")["fields"].apply(pd.Series)
df = df.groupby("id")["value"].apply(list).to_frame().T
df = df.explode(df.columns.tolist(),ignore_index=True)
Example:
df = pd.DataFrame({"fields":[[{"id":2753537689,"value":"abc"},{"id":27537849,"value":123}],
[{"id":2753537689,"value":"bcd"},{"id":27537849,"value":456}]]})
'''
fields
0 [{'id': 2753537689, 'value': 'abc'}, {'id': 27537849, 'value': 123}]
1 [{'id': 2753537689, 'value': 'bcd'}, {'id': 27537849, 'value': 456}]
'''
df = df.explode("fields")["fields"].apply(pd.Series)
df = df.groupby("id")["value"].apply(list).to_frame().T
df = df.explode(df.columns.tolist(),ignore_index=True)
Out:
id 27537849 2753537689
0 123 abc
1 456 bcd
Note: explode()
may cause performance or memory problems in large datasets.
Upvotes: 1