ndevito1
ndevito1

Reputation: 71

Flatten JSON in Pandas DataFrame with Inconsistent Format

I'm working with a column in a pandas dataframe that I would like to flatten into columns. The issue is that some columns are single dictionaries and others are lists of dictionaries.

So for instance the array in a cell in the column may look like this:

{'a': 'x', 'b': 'y', 'c': 'z'}

or like:

[{'a': 'x', 'b': 'y', 'c': 'z'}, {'a': 'd', 'b': 'e', 'c': 'f'}]

I've been playing around with the standard json_normalize function but the output can't seem to handle there being two different types of array:

json_flattened = pd.io.json.json_normalize(df.json_cell.apply(json.loads))

This gives me AttributeError: 'list' object has no attribute 'items'

What I'm hoping to get out the other end is a dataframe that will end up with as many columns as there are repeated elements in the array.

.-------.---.---.---.-----.-----.------.
| index | a | b | c | a_1 | b_1 | c_1  |
:-------+---+---+---+-----+-----+------:
|     0 | x | y | z | NaN | NaN | NaN  |
:-------+---+---+---+-----+-----+------:
|     1 | x | y | z | d   | e   | f    |
'-------'---'---'---'-----'-----'------'

Upvotes: 2

Views: 3241

Answers (1)

Ben Pap
Ben Pap

Reputation: 2579

Because of the inconsistent format in the column you have no choice but to loop through the column and inspect each cell.

import pandas as pd
import numpy as np
from pandas.io.json import json_normalize


data = pd.DataFrame({'json':[{'a': 'x', 'b': 'y', 'c': 'z'},[{'a': 'x', 'b': 'y', 'c': 'z'}, {'a': 'd', 'b': 'e', 'c': 'f'}]]})

dflist = []
for index, row in data.iterrows():
    norm = json_normalize(data['json'][index])
    dflist.append(norm)

grouped = pd.concat(dflist, keys = np.arange(len(dflist)))

df2 = grouped.reset_index()

df2 = df2.pivot('level_0','level_1')

df2.index.name = None
df2.columns.names = [None, None]
df2.columns = df2.columns.map('{0[0]}_{0[1]}'.format) 

#output

    a_0  a_1    b_0  b_1    c_0  c_1
0   x    NaN    y    NaN    z    NaN
1   x    d      y    e      z    f

We loop through the column, reading each into row into its own df. Next we concat them together but assign each keys to keep them grouped. Next we pivot on the keys and the index. Finally we manipulate the multilevel columns generated from the pivot to get what we need.

Upvotes: 1

Related Questions