Reputation: 165
New to Python and Pandas, working on getting the hang of jsons. Any help appreciated.
Via an API I'm pulling a nested json. The structure of the json is below. The fields I'm after are in view
, labeled user_id
and message
, and then under the nested field replies
the subfields user_id
and message
. The desired fields are labeled below with <<<
],
"view": [
{
"id": 109205,
"user_id": 6354, # <<<< this field
"parent_id": null,
"created_at": "2020-11-03T23:32:49Z",
"updated_at": "2020-11-03T23:32:49Z",
"rating_count": null,
"rating_sum": null,
"message": "message text1", # <<< this field
"replies": [
{
"id": 109298,
"user_id": 5457, # <<< this field
"parent_id": 109205,
"created_at": "2020-11-04T19:42:59Z",
"updated_at": "2020-11-04T19:42:59Z",
"rating_count": null,
"rating_sum": null,
"message": "message text2" # <<< this field
},
{
#json continues
I can successfully pull the top level fields under view
, but I'm having difficulty flattening the nested json field replies
with json_normalize
. Here's my working code:
import pandas as pd
d = r.json() # json pulled from API
df = pd.json_normalize(d['view'], record_path=['replies'])
print(df)
Which results in the following KeyError:
Traceback (most recent call last):
File "C:\Users\danie\AppData\Local\Temp\atom_script_tempfiles\2021720-13268-1xuqx61.3oh2g", line 53, in <module>
df = pd.json_normalize(d['view'], record_path=['replies'])
File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 336, in _json_normalize
_recursive_extract(data, record_path, {}, level=0)
File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 309, in _recursive_extract
recs = _pull_records(obj, path[0])
File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 248, in _pull_records
result = _pull_field(js, spec)
File "C:\Users\danie\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 239, in _pull_field
result = result[spec]
KeyError: 'replies'
What am I missing here? All suggestions welcome and appreciated.
Upvotes: 1
Views: 824
Reputation: 11395
You’re trying to flatten 2 different “depths” in the json file, which can’t be done in a single json_normalize
call. You could simply use 2 pd.json_normalize
since all entries contain id
s to match all the parsed data later:
>>> pd.json_normalize(d, record_path='view')
id user_id parent_id created_at updated_at rating_count rating_sum message replies
0 109205 6354 None 2020-11-03T23:32:49Z 2020-11-03T23:32:49Z None None message text1 [{'id': 109298, 'user_id': 5457, 'parent_id': ...
>>> pd.json_normalize(d, record_path=['view', 'replies'])
id user_id parent_id created_at updated_at rating_count rating_sum message
0 109298 5457 109205 2020-11-04T19:42:59Z 2020-11-04T19:42:59Z None None message text2
1 109299 5457 109205 2020-11-04T19:42:59Z 2020-11-04T19:42:59Z None None message text3
(I’ve added as second reply
to your example with same data and id
incremented by 1 so we can see what happens for several replies per view.)
Alternately, you can use your second pd.json_normalize
on the replies
column of your previous result, which is probably less work. This is more interesting if you .explode()
the column first to get one row per reply:
>>> pd.json_normalize(view['replies'].explode())
id user_id parent_id created_at updated_at rating_count rating_sum message
0 109298 5457 109205 2020-11-04T19:42:59Z 2020-11-04T19:42:59Z None None message text2
1 109299 5457 109205 2020-11-04T19:42:59Z 2020-11-04T19:42:59Z None None message text3
So here’s a way to construct a single dataframe with all the info:
>>> view = pd.json_normalize(d, record_path='view')
>>> df = pd.merge(
... view.drop(columns=['replies']),
... pd.json_normalize(view['replies'].explode()),
... left_on='id', right_on='parent_id', how='right',
... suffixes=('_view', '_reply')
... )
>>> df
id_view user_id_view parent_id_view created_at_view updated_at_view rating_count_view rating_sum_view message_view id_reply user_id_reply parent_id_reply created_at_reply updated_at_reply rating_count_reply rating_sum_reply message_reply
0 109205 6354 None 2020-11-03T23:32:49Z 2020-11-03T23:32:49Z None None message text1 109298 5457 109205 2020-11-04T19:42:59Z 2020-11-04T19:42:59Z None None message text2
1 109205 6354 None 2020-11-03T23:32:49Z 2020-11-03T23:32:49Z None None message text1 109299 5457 109205 2020-11-04T19:42:59Z 2020-11-04T19:42:59Z None None message text3
>>> df[['user_id_view', 'message_view', 'user_id_reply', 'message_reply']]
user_id_view message_view user_id_reply message_reply
0 6354 message text1 5457 message text2
1 6354 message text1 5457 message text3
Upvotes: 3