Reputation: 55
I have a JSON file structured like [1] below; as you can see, multiple keywords are attached to one newspaper article. I want to normalize the JSON into a structure (DataFrame) like [2]. I've tried json_normalize, but that didn't work out as intended; I also did some multi-indexing, but I can't save the results in CSV format and it makes everything more complex. What I want is to get the data into a structure I can analyze, so I can label the whole article as positive or negative based on the extracted keywords.
[2]
╔═══════════════╦════════════╦═══════════════╗
║ url           ║ date       ║ entities.name ║
╠═══════════════╬════════════╬═══════════════╣
║ http://ww.... ║ 2018-12-31 ║ 2018          ║
║               ║            ║ Bill Cosby    ║
║               ║            ║ Actress       ║
║               ║            ║ ...           ║
╚═══════════════╩════════════╩═══════════════╝
[1]
{'lang': 'ENGLISH',
 'date': '2018-12-31T23:46:18Z',
 'url': 'http://www.newschannel6now.com/2018/12/31/cosby-kanye-box-office-diversity-biggest-entertainment-stories/',
 'entities': [{'avgSalience': 1,
               'wikipediaEntry': '2018',
               'type': 'DATE',
               'numMentions': 4,
               'name': '2018',
               'nameNorm': '2018'},
              {'wikipediaEntry': 'Actor',
               'type': 'COMMON',
               'numMentions': 4,
               'avgSalience': 0.72,
               'nameNorm': 'actres',
               'name': 'Actress'},
              {'wikipediaEntry': 'Bill Cosby',
               'type': 'PROPER',
               'numMentions': 2,
               'avgSalience': 0.57,
               'nameNorm': 'bill cosby',
               'name': 'Bill Cosby'},
              {'name': 'music superstar',
               'nameNorm': 'music superstar',
               'avgSalience': 0.02,
               'type': 'COMMON',
               'numMentions': 1}]}
I managed by using groupby and joining the values into a single column:
# df holds the flattened table: one row per (article, entity) pair
df.groupby(['url', 'date'], as_index=False).agg({
    'name': lambda x: ', '.join(x),
    'numMentions': lambda x: ', '.join(map(str, x)),
    'avgSalience': lambda x: ', '.join(map(str, x))
})
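For reference, a minimal runnable sketch of how df can be built so the snippet above runs, assuming data holds the parsed dict from [1]:

import pandas as pd

# Flatten manually: one row per entity, carrying url and date along.
rows = [{'url': data['url'], 'date': data['date'], **ent}
        for ent in data['entities']]
df = pd.DataFrame(rows)

The grouped result is a plain DataFrame, so it can be written out with .to_csv('articles.csv', index=False) (the file name is illustrative).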
Upvotes: 1
Views: 184
Reputation: 28644
You can use json_normalize:
from pandas import json_normalize

# 'entities' is the record path; url and date are repeated as metadata.
json_normalize(data, 'entities', ['url', 'date']).filter(['url', 'date', 'name'])
url date name
0 http://www.newschannel6now.com/2018/12/31/cosb... 2018-12-31T23:46:18Z 2018
1 http://www.newschannel6now.com/2018/12/31/cosb... 2018-12-31T23:46:18Z Actress
2 http://www.newschannel6now.com/2018/12/31/cosb... 2018-12-31T23:46:18Z Bill Cosby
3 http://www.newschannel6now.com/2018/12/31/cosb... 2018-12-31T23:46:18Z music superstar
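If the file holds a list of such records rather than a single dict, json_normalize accepts the list directly; a sketch, with the file name articles.json as an assumption:

import json
import pandas as pd

with open('articles.json') as f:  # hypothetical file name
    records = json.load(f)        # a list of dicts shaped like [1]

df = pd.json_normalize(records, 'entities', ['url', 'date'])
df.filter(['url', 'date', 'name']).to_csv('entities.csv', index=False)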
Here is another option. I am relying on a library called nested_lookup to pull the data:
import pandas as pd
from nested_lookup import nested_lookup

# Pull every value stored under each key, however deeply nested.
keys = ['url', 'date', 'name']
res = [nested_lookup(key, data) for key in keys]

# The three lists have different lengths, so concat pads with NaN.
df = pd.concat([pd.DataFrame(ent) for ent in res], axis=1)
df = df.set_axis(['url', 'date', 'entities.name'], axis='columns')
df
url date entities.name
0 http://www.newschannel6now.com/2018/12/31/cosb... 2018-12-31T23:46:18Z 2018
1 NaN NaN Actress
2 NaN NaN Bill Cosby
3 NaN NaN music superstar
Note how json_normalize repeats the url and date on every row, while the nested_lookup option leaves NaNs in the shorter columns instead.
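If the repeated-url shape is preferred, those gaps can be forward-filled afterwards, e.g.:

# Carry the article-level values down over the NaN rows.
df[['url', 'date']] = df[['url', 'date']].ffill()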
Upvotes: 1