Reputation: 71
I'm working with a column in a pandas dataframe that I would like to flatten into columns. The issue is that some columns are single dictionaries and others are lists of dictionaries.
So for instance the array in a cell in the column may look like this:
{'a': 'x', 'b': 'y', 'c': 'z'}
or like:
[{'a': 'x', 'b': 'y', 'c': 'z'}, {'a': 'd', 'b': 'e', 'c': 'f'}]
I've been playing around with the standard json_normalize function but the output can't seem to handle there being two different types of array:
json_flattened = pd.io.json.json_normalize(df.json_cell.apply(json.loads))
This gives me AttributeError: 'list' object has no attribute 'items'
What I'm hoping to get out the other end is a dataframe that will end up with as many columns as there are repeated elements in the array.
.-------.---.---.---.-----.-----.------.
| index | a | b | c | a_1 | b_1 | c_1 |
:-------+---+---+---+-----+-----+------:
| 0 | x | y | z | NaN | NaN | NaN |
:-------+---+---+---+-----+-----+------:
| 1 | x | y | z | d | e | f |
'-------'---'---'---'-----'-----'------'
Upvotes: 2
Views: 3241
Reputation: 2579
Because of the inconsistent format in the column you have no choice but to loop through the column and inspect each cell.
import pandas as pd
import numpy as np
from pandas.io.json import json_normalize
data = pd.DataFrame({'json':[{'a': 'x', 'b': 'y', 'c': 'z'},[{'a': 'x', 'b': 'y', 'c': 'z'}, {'a': 'd', 'b': 'e', 'c': 'f'}]]})
dflist = []
for index, row in data.iterrows():
norm = json_normalize(data['json'][index])
dflist.append(norm)
grouped = pd.concat(dflist, keys = np.arange(len(dflist)))
df2 = grouped.reset_index()
df2 = df2.pivot('level_0','level_1')
df2.index.name = None
df2.columns.names = [None, None]
df2.columns = df2.columns.map('{0[0]}_{0[1]}'.format)
#output
a_0 a_1 b_0 b_1 c_0 c_1
0 x NaN y NaN z NaN
1 x d y e z f
We loop through the column, reading each into row into its own df. Next we concat them together but assign each keys to keep them grouped. Next we pivot on the keys and the index. Finally we manipulate the multilevel columns generated from the pivot to get what we need.
Upvotes: 1