Reputation: 11

Reading nested json to pandas dataframe

I have below URL that has a JSON response. I need to read this json into a pandas dataframe and perform operations on top of it . This is a case of nested JSON which consists of multiple lists and dicts within dicts.

URL: 'http://api.nobelprize.org/v1/laureate.json'

I have tried below code:

import json, pandas as pd,requests
resp=requests.get('http://api.nobelprize.org/v1/laureate.json')
df=pd.json_normalize(json.loads(resp.content),record_path =['laureates'])
print(df.head(5))

Output-

  id       firstname    surname        born        died  \
0  1  Wilhelm Conrad    Röntgen  1845-03-27  1923-02-10   
1  2      Hendrik A.    Lorentz  1853-07-18  1928-02-04   
2  3          Pieter     Zeeman  1865-05-25  1943-10-09   
3  4           Henri  Becquerel  1852-12-15  1908-08-25   
4  5          Pierre      Curie  1859-05-15  1906-04-19   

             bornCountry bornCountryCode                bornCity  \
0  Prussia (now Germany)              DE  Lennep (now Remscheid)   
1        the Netherlands              NL                  Arnhem   
2        the Netherlands              NL              Zonnemaire   
3                 France              FR                   Paris   
4                 France              FR                   Paris   

       diedCountry diedCountryCode   diedCity gender  \
0          Germany              DE     Munich   male   
1  the Netherlands              NL        NaN   male   
2  the Netherlands              NL  Amsterdam   male   
3           France              FR        NaN   male   
4           France              FR      Paris   male   

                                              prizes  
0  [{'year': '1901', 'category': 'physics', 'shar...  
1  [{'year': '1902', 'category': 'physics', 'shar...  
2  [{'year': '1902', 'category': 'physics', 'shar...  
3  [{'year': '1903', 'category': 'physics', 'shar...  
4  [{'year': '1903', 'category': 'physics', 'shar...

But in this prizes comes as a list. If I create a separate dataframe for prizes, it has affiliations as list.I want all columns to come as separate columns. Some entires may/may not have prizes. So need to handle that case as well.

I went through this article https://towardsdatascience.com/all-pandas-json-normalize-you-should-know-for-flattening-json-13eae1dfb7dd. Looks like we'll have to use meta and error=ignore here, but not able to fix it. Appreciate your inputs here. Thanks.

Upvotes: 0

Answers (2)

Shivangi._k

Reputation: 11

Found an alternate solution as well with lesser code. This works.

from flatten_json import flatten
data = winners['laureates']
dict_flattened = (flatten(record, '.') for record in data)
df = pd.DataFrame(dict_flattened)
print(df.shape)

(968, 43)

Upvotes: 0

Akshay Sehgal

Reputation: 19307

You would have to do this in few steps.

The first step would be to extract the first record_path = ['laureates']
The second one would be record_path = ['laureates', 'prizes'] for the nested json records with meta path as the id from the parent record
Combine the two datasets by joining on the id column.
Drop the unnecessary columns and store

import json, pandas as pd, requests

resp = requests.get('http://api.nobelprize.org/v1/laureate.json')
df0 = pd.json_normalize(json.loads(resp.content),record_path = ['laureates'])
df1 = pd.json_normalize(json.loads(resp.content),record_path = ['laureates','prizes'], meta = [['laureates','id']])
output = pd.merge(df0, df1, left_on='id', right_on='laureates.id').drop(['prizes','laureates.id'], axis=1, inplace=False)

print('Shape of data ->',output.shape)
print('Columns ->',output.columns)

Shape of data -> (975, 18)

Columns -> Index(['id', 'firstname', 'surname', 'born', 'died', 'bornCountry',
       'bornCountryCode', 'bornCity', 'diedCountry', 'diedCountryCode',
       'diedCity', 'gender', 'year', 'category', 'share', 'motivation',
       'affiliations', 'overallMotivation'],
      dtype='object')

Upvotes: 1

Reading nested json to pandas dataframe

Answers (2)

Related Questions