Roberto

Reputation: 729

How to transform pandas JSON column into dataframe?

I have a .csv file with a mix of columns, some of which contain entries in JSON syntax (nested). I want to extract the relevant data from these columns to obtain a more data-rich dataframe for further analysis. I've checked this tutorial on Kaggle but failed to obtain the desired result.

To better explain my problem, I've prepared a dummy version of the database below.

import pandas as pd

raw = {"team":["Team_1","Team_2"],
       "who":[[{"name":"Andy", "age":22},{"name":"Rick", "age":30}],[{"name":"Oli", "age":19},{"name":"Joe", "age":21}]]}

df = pd.DataFrame(raw)

I'd like to generate the following columns (or equivalent):

team      name_1   name_2   age_1    age_2
Team_1    Andy     Rick     22       30
Team_2    Oli      Joe      19       21

I've tried the following.

Code 1:

test_norm = json_normalize(data=df)
AttributeError: 'str' object has no attribute 'values'

Code 2:

test_norm = json_normalize(data=df, record_path='who')
TypeError: string indices must be integers

Code 3:

test_norm = json_normalize(data=df, record_path='who', meta=[team])
TypeError: string indices must be integers
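
I suspect that, because the real data comes from a .csv file, the who column is read in as plain strings rather than lists of dicts, which would explain the string-related errors above. If so, the column probably needs to be parsed first, e.g. (with my_file.csv as a placeholder for my actual file):

import ast
import pandas as pd

df = pd.read_csv("my_file.csv")
# turn the JSON-like strings back into lists of dicts (json.loads also works for strict JSON)
df["who"] = df["who"].apply(ast.literal_eval)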

Is there any way to do this effectively? I've looked for a solution in other Stack Overflow topics and cannot find a working one based on json_normalize.

Upvotes: 0

Views: 2281

Answers (3)

panktijk

Reputation: 1614

You can iterate through each element of raw['who'] separately, but if you do this the resulting dataframe will put the two team members in separate rows.

Example:

json_normalize(raw['who'][0])

Output:

age     name
22      Andy
30      Rick

You can flatten these out into a single row and then concatenate all the rows to get your final output.

import pandas as pd
from pandas.io.json import json_normalize

def flatten(df_temp):
    # make the index values strings so they can be joined into the column names
    df_temp.index = df_temp.index.astype(str)
    # unstack into a single row, one (column, row-index) pair per cell
    flattened_df = df_temp.unstack().to_frame().sort_index(level=1).T
    # build column names like 'age_0', 'name_0', 'age_1', 'name_1'
    flattened_df.columns = flattened_df.columns.map('_'.join)
    return flattened_df

df = pd.concat([flatten(pd.DataFrame(json_normalize(x))) for x in raw['who']])
df['team'] = raw['team']

Output:

age_0   name_0  age_1   name_1  team
22      Andy    30      Rick    Team_1
19      Oli     21      Joe     Team_2
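
If you want the column names and order from the question (team first, 1-based suffixes), a small rename and reorder on top of the above should do it; a sketch, assuming exactly two people per team:

# rename the 0-based suffixes to 1-based ones and put team first
df = df.rename(columns={'name_0': 'name_1', 'age_0': 'age_1',
                        'name_1': 'name_2', 'age_1': 'age_2'})
df = df[['team', 'name_1', 'name_2', 'age_1', 'age_2']]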

Upvotes: 0

Joery

Reputation: 366

One option would be to unpack the dictionary yourself. Like so:

import pandas as pd
from pandas.io.json import json_normalize

raw = {"team":["Team_1","Team_2"],
       "who":[[{"name":"Andy", "age":22},{"name":"Rick", "age":30}],[{"name":"Oli", "age":19},{"name":"Joe", "age":21}]]}


# add the corresponding team to the dictionary containing the person information
for idx, list_of_people in enumerate(raw['who']):
    for person in list_of_people:
        person['team'] = raw['team'][idx]

# flatten the dictionary
list_of_dicts = [dct for list_of_people in raw['who'] for dct in list_of_people]

# normalize to dataframe
json_normalize(list_of_dicts)

# due to unpacking of dict, this results in the same as doing
pd.DataFrame(list_of_dicts)

The output looks a bit different: each person ends up on their own row (a long format). In my experience this is often more convenient for further analysis.

Output:

age name    team
22  Andy    Team_1
30  Rick    Team_1
19  Oli     Team_2
21  Joe     Team_2
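
If you do need the wide, one-row-per-team shape from the question, one way to get there from this long format is to number the people within each team and then unstack; a sketch, reusing list_of_dicts from above:

# number the people within each team, then pivot to one row per team
long_df = pd.DataFrame(list_of_dicts)
long_df['member'] = long_df.groupby('team').cumcount() + 1
wide = long_df.set_index(['team', 'member'])[['name', 'age']].unstack('member')
wide.columns = ['{}_{}'.format(col, num) for col, num in wide.columns]
wide = wide.reset_index()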

Upvotes: 0

James Dellinger

Reputation: 1261

I also had trouble using json_normalize on the lists of dicts that were contained in the who column. My workaround was to reformat each row into a Dict with unique keys (name_1, age_1, name_2, etc.) for each team member's name/age. After this, creating a dataframe with your desired structure was trivial.

Here are my steps. Beginning with your example:

raw = {"team":["Team_1","Team_2"],
       "who":[[{"name":"Andy", "age":22},{"name":"Rick", "age":30}],[{"name":"Oli", "age":19},{"name":"Joe", "age":21}]]}

df = pd.DataFrame(raw)
df

    team    who
0   Team_1  [{'name': 'Andy', 'age': 22}, {'name': 'Rick',...
1   Team_2  [{'name': 'Oli', 'age': 19}, {'name': 'Joe', '...
  1. Write a method to reformat each list as a dict and apply it to each row of the who column:
def reformat(x):
    # build one flat dict per row: name_1/age_1, name_2/age_2, ...
    res = {}
    for i, item in enumerate(x):
        res['name_' + str(i+1)] = item['name']
        res['age_' + str(i+1)] = item['age']
    return res

df['who'] = df['who'].apply(lambda x: reformat(x))
df

    team    who
0   Team_1  {'name_1': 'Andy', 'age_1': 22, 'name_2': 'Ric...
1   Team_2  {'name_1': 'Oli', 'age_1': 19, 'name_2': 'Joe'...
  2. Use json_normalize on the who column. Then ensure the columns of the normalized dataframe appear in the desired order:
import pandas as pd 
from pandas.io.json import json_normalize

n = json_normalize(data = df['who'], meta=['team'])
n = n.reindex(sorted(n.columns, reverse=True, key=len), axis=1)
n

    name_1  name_2  age_1   age_2
0   Andy    Rick       22      30
1   Oli     Joe        19      21
  3. Join the dataframe created by json_normalize back to the original df, and drop the who column:
df = df.join(n).drop('who', axis=1)
df

    team    name_1  name_2  age_1   age_2
0   Team_1  Andy    Rick       22      30
1   Team_2  Oli     Joe        19      21

If your real .csv file has a very large number of rows, my solution may be a bit too expensive, since it iterates over each row and then over each entry inside the list contained in that row. If (hopefully) this isn't the case, perhaps my approach will be good enough.
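
As a rough sketch of a variant that avoids the per-row apply and json_normalize overhead (my own addition, assuming the original df built from raw, before the who column is rewritten in step 1), the wide rows can also be built in a single pass and handed to the DataFrame constructor:

# build the wide rows in one pass, then construct the dataframe once
records = []
for team, members in zip(df['team'], df['who']):
    row = {'team': team}
    for i, member in enumerate(members, start=1):
        row['name_' + str(i)] = member['name']
        row['age_' + str(i)] = member['age']
    records.append(row)

wide_df = pd.DataFrame(records)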

Upvotes: 1
