Daniel Hutchinson
Daniel Hutchinson

Reputation: 165

Python & Pandas: Flattening nested json with pd.json_normalize

New to Python and Pandas, working on getting the hang of jsons. Any help appreciated.

Via an API I'm pulling a nested json. The structure of the json is below. The fields I'm after are in view, labeled user_id and message, and then under the nested field replies the subfields user_id and message. The desired fields are labeled below with <<<

  ],
  "view": [
    {
      "id": 109205,
      "user_id": 6354, # <<<< this field
      "parent_id": null,
      "created_at": "2020-11-03T23:32:49Z",
      "updated_at": "2020-11-03T23:32:49Z",
      "rating_count": null,
      "rating_sum": null,
      "message": "message text1", # <<< this field
      "replies": [
        {
          "id": 109298,
          "user_id": 5457, # <<< this field
          "parent_id": 109205,
          "created_at": "2020-11-04T19:42:59Z",
          "updated_at": "2020-11-04T19:42:59Z",
          "rating_count": null,
          "rating_sum": null,
          "message": "message text2" # <<< this field
        },
        {
         #json continues

I can successfully pull the top level fields under view, but I'm having difficulty flattening the nested json field replies with json_normalize. Here's my working code:

import pandas as pd

d = r.json() # json pulled from API

df = pd.json_normalize(d['view'], record_path=['replies'])

print(df)

Which results in the following KeyError:

Traceback (most recent call last):
  File "C:\Users\danie\AppData\Local\Temp\atom_script_tempfiles\2021720-13268-1xuqx61.3oh2g", line 53, in <module>
    df = pd.json_normalize(d['view'], record_path=['replies'])
  File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 336, in _json_normalize
    _recursive_extract(data, record_path, {}, level=0)
  File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 309, in _recursive_extract
    recs = _pull_records(obj, path[0])
  File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 248, in _pull_records
    result = _pull_field(js, spec)
  File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 239, in _pull_field
    result = result[spec]
KeyError: 'replies'

What am I missing here? All suggestions welcome and appreciated.

Upvotes: 1

Views: 824

Answers (1)

Cimbali
Cimbali

Reputation: 11395

You’re trying to flatten 2 different “depths” in the json file, which can’t be done in a single json_normalize call. You could simply use 2 pd.json_normalize since all entries contain ids to match all the parsed data later:

>>> pd.json_normalize(d, record_path='view')
       id  user_id parent_id            created_at            updated_at rating_count rating_sum        message                                            replies
0  109205     6354      None  2020-11-03T23:32:49Z  2020-11-03T23:32:49Z         None       None  message text1  [{'id': 109298, 'user_id': 5457, 'parent_id': ...
>>> pd.json_normalize(d, record_path=['view', 'replies'])
       id  user_id  parent_id            created_at            updated_at rating_count rating_sum        message
0  109298     5457     109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z         None       None  message text2
1  109299     5457     109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z         None       None  message text3

(I’ve added as second reply to your example with same data and id incremented by 1 so we can see what happens for several replies per view.)

Alternately, you can use your second pd.json_normalize on the replies column of your previous result, which is probably less work. This is more interesting if you .explode() the column first to get one row per reply:

>>> pd.json_normalize(view['replies'].explode())
       id  user_id  parent_id            created_at            updated_at rating_count rating_sum        message
0  109298     5457     109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z         None       None  message text2
1  109299     5457     109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z         None       None  message text3

So here’s a way to construct a single dataframe with all the info:

>>> view = pd.json_normalize(d, record_path='view')
>>> df = pd.merge(
...     view.drop(columns=['replies']),
...     pd.json_normalize(view['replies'].explode()),
...     left_on='id', right_on='parent_id', how='right',
...     suffixes=('_view', '_reply')
... )
>>> df
   id_view  user_id_view parent_id_view       created_at_view       updated_at_view rating_count_view rating_sum_view   message_view  id_reply  user_id_reply  parent_id_reply      created_at_reply      updated_at_reply rating_count_reply rating_sum_reply  message_reply
0   109205          6354           None  2020-11-03T23:32:49Z  2020-11-03T23:32:49Z              None            None  message text1    109298           5457           109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z               None             None  message text2
1   109205          6354           None  2020-11-03T23:32:49Z  2020-11-03T23:32:49Z              None            None  message text1    109299           5457           109205  2020-11-04T19:42:59Z  2020-11-04T19:42:59Z               None             None  message text3
>>> df[['user_id_view', 'message_view', 'user_id_reply', 'message_reply']]
   user_id_view   message_view  user_id_reply  message_reply
0          6354  message text1           5457  message text2
1          6354  message text1           5457  message text3

Upvotes: 3

Related Questions