Reputation: 729
I have a .csv file with a mix of columns, some of which contain entries in JSON syntax (nested). I want to extract the relevant data from these columns to obtain a more data-rich dataframe for further analysis. I've checked this tutorial on Kaggle but failed to obtain the desired result.
In order to better explain my problem I've prepared a dummy version of a database below.
import pandas as pd

raw = {"team": ["Team_1", "Team_2"],
       "who": [[{"name": "Andy", "age": 22}, {"name": "Rick", "age": 30}],
               [{"name": "Oli", "age": 19}, {"name": "Joe", "age": 21}]]}
df = pd.DataFrame(raw)
I'd like to generate the following columns (or equivalent):
team name_1 name_2 age_1 age_2
Team_1 Andy Rick 22 30
Team_2 Oli Joe 19 21
I've tried the following.
Code 1:
test_norm = json_normalize(data=df)
AttributeError: 'str' object has no attribute 'values'
Code 2:
test_norm = json_normalize(data=df, record_path='who')
TypeError: string indices must be integers
Code 3:
test_norm = json_normalize(data=df, record_path='who', meta=[team])
TypeError: string indices must be integers
Is there any way to do this effectively? I've looked for a solution in other Stack Overflow topics but cannot find a working approach with json_normalize.
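For what it's worth, I suspect the cells may be read from the .csv as plain strings, which would explain the 'str' errors above. A minimal sketch of parsing them back into Python objects first with ast.literal_eval, in case that's the issue:

```python
import ast
import io

import pandas as pd

# Simulate a .csv where the JSON-like column was saved as text
csv_text = ('team,who\n'
            'Team_1,"[{\'name\': \'Andy\', \'age\': 22}, '
            '{\'name\': \'Rick\', \'age\': 30}]"\n')
df = pd.read_csv(io.StringIO(csv_text))

# The cells arrive as strings -- this reproduces the errors above
assert isinstance(df.loc[0, 'who'], str)

# Parse each cell back into a list of dicts before normalizing
df['who'] = df['who'].apply(ast.literal_eval)
assert isinstance(df.loc[0, 'who'], list)
```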
Upvotes: 0
Views: 2281
Reputation: 1614
You can iterate through each element of raw['who'] separately, but then the resultant dataframe will have the two team members in separate rows.
Example:
json_normalize(raw['who'][0])
Output:
age name
22 Andy
30 Rick
You can flatten these out into a single row and then concatenate all the rows to get your final output.
import pandas as pd
from pandas.io.json import json_normalize  # pandas >= 1.0: from pandas import json_normalize

def flatten(df_temp):
    df_temp.index = df_temp.index.astype(str)
    flattened_df = df_temp.unstack().to_frame().sort_index(level=1).T
    flattened_df.columns = flattened_df.columns.map('_'.join)
    return flattened_df

df = pd.concat([flatten(pd.DataFrame(json_normalize(x))) for x in raw['who']])
df['team'] = raw['team']
Output:
age_0 name_0 age_1 name_1 team
22 Andy 30 Rick Team_1
19 Oli 21 Joe Team_2
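Note the suffixes here are 0-based; to match the 1-based column names in the question, they can be shifted afterwards (a small optional tweak, not part of the answer above):

```python
import re

import pandas as pd

# Wide dataframe with 0-based suffixes, as produced above
df = pd.DataFrame({'age_0': [22, 19], 'name_0': ['Andy', 'Oli'],
                   'age_1': [30, 21], 'name_1': ['Rick', 'Joe'],
                   'team': ['Team_1', 'Team_2']})

# Shift each trailing numeric suffix up by one (age_0 -> age_1, and so on)
df = df.rename(columns=lambda c: re.sub(
    r'_(\d+)$', lambda m: '_' + str(int(m.group(1)) + 1), c))
```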
Upvotes: 0
Reputation: 366
One option would be to unpack the dictionary yourself. Like so:
import pandas as pd
from pandas.io.json import json_normalize  # pandas >= 1.0: from pandas import json_normalize

raw = {"team": ["Team_1", "Team_2"],
       "who": [[{"name": "Andy", "age": 22}, {"name": "Rick", "age": 30}],
               [{"name": "Oli", "age": 19}, {"name": "Joe", "age": 21}]]}

# add the corresponding team to each dictionary containing the person information
for idx, list_of_people in enumerate(raw['who']):
    for person in list_of_people:
        person['team'] = raw['team'][idx]

# flatten the nested lists into a single list of dicts
list_of_dicts = [dct for list_of_people in raw['who'] for dct in list_of_people]

# normalize to dataframe
json_normalize(list_of_dicts)

# due to the unpacking of the dicts, this gives the same result as
pd.DataFrame(list_of_dicts)
This output is a little different: one row per person rather than one row per team. In my experience the long format is often more convenient for further analysis.
Output:
age name team
22 Andy Team_1
30 Rick Team_1
19 Oli Team_2
21 Joe Team_2
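If the wide, one-row-per-team layout from the question is still needed, the long format above can be pivoted afterwards; a sketch, assuming a reasonably recent pandas:

```python
import pandas as pd

# Long-format output from the answer above
long_df = pd.DataFrame({
    'age': [22, 30, 19, 21],
    'name': ['Andy', 'Rick', 'Oli', 'Joe'],
    'team': ['Team_1', 'Team_1', 'Team_2', 'Team_2']})

# Number the members within each team, then pivot to one row per team
long_df['member'] = long_df.groupby('team').cumcount() + 1
wide = long_df.pivot(index='team', columns='member', values=['name', 'age'])
wide.columns = ['{}_{}'.format(col, num) for col, num in wide.columns]
wide = wide.reset_index()
```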
Upvotes: 0
Reputation: 1261
I also had trouble using json_normalize on the lists of dicts contained in the who column. My workaround was to reformat each row into a dict with unique keys (name_1, age_1, name_2, etc.) for each team member's name/age. After this, creating a dataframe with your desired structure was trivial.
Here are my steps. Beginning with your example:
import pandas as pd

raw = {"team": ["Team_1", "Team_2"],
       "who": [[{"name": "Andy", "age": 22}, {"name": "Rick", "age": 30}],
               [{"name": "Oli", "age": 19}, {"name": "Joe", "age": 21}]]}
df = pd.DataFrame(raw)
df
team who
0 Team_1 [{'name': 'Andy', 'age': 22}, {'name': 'Rick',...
1 Team_2 [{'name': 'Oli', 'age': 19}, {'name': 'Joe', '...
Reformat each row of the who column:

def reformat(x):
    res = {}
    for i, item in enumerate(x):
        res['name_' + str(i+1)] = item['name']
        res['age_' + str(i+1)] = item['age']
    return res

df['who'] = df['who'].apply(lambda x: reformat(x))
df
team who
0 Team_1 {'name_1': 'Andy', 'age_1': 22, 'name_2': 'Ric...
1 Team_2 {'name_1': 'Oli', 'age_1': 19, 'name_2': 'Joe'...
Apply json_normalize to the reformatted who column, then ensure the columns of the normalized dataframe appear in the desired order:

import pandas as pd
from pandas.io.json import json_normalize
n = json_normalize(data = df['who'], meta=['team'])
n = n.reindex(sorted(n.columns, reverse=True, key=len), axis=1)
n
name_1 name_2 age_1 age_2
0 Andy Rick 22 30
1 Oli Joe 19 21
Finally, join the normalized dataframe back and drop the original who column:

df = df.join(n).drop('who', axis=1)
df
team name_1 name_2 age_1 age_2
0 Team_1 Andy Rick 22 30
1 Team_2 Oli Joe 19 21
If your real .csv file has many rows, my solution may be a bit too expensive, since it iterates over each row and then over each entry in the list contained in that row. If (hopefully) this isn't the case, perhaps my approach will be good enough.
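For larger files, one way to avoid the per-row Python loop (an alternative sketch, assuming pandas >= 1.0, where explode and a top-level json_normalize are available) is to explode to one row per person, normalize all dicts at once, and pivot back to wide:

```python
import pandas as pd

raw = {"team": ["Team_1", "Team_2"],
       "who": [[{"name": "Andy", "age": 22}, {"name": "Rick", "age": 30}],
               [{"name": "Oli", "age": 19}, {"name": "Joe", "age": 21}]]}
df = pd.DataFrame(raw)

# One row per person, normalize the dicts in a single call
long_df = df.explode('who').reset_index(drop=True)
people = pd.json_normalize(long_df['who'].tolist())
long_df = pd.concat([long_df.drop(columns='who'), people], axis=1)

# Number members within each team and pivot to one row per team
long_df['member'] = long_df.groupby('team').cumcount() + 1
wide = long_df.pivot(index='team', columns='member', values=['name', 'age'])
wide.columns = ['{}_{}'.format(col, num) for col, num in wide.columns]
wide = wide.reset_index()
```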
Upvotes: 1