Reputation: 912
I have the following JSON snippet:
{'search_metadata': {'completed_in': 0.027,
'count': 2},
'statuses': [{'contributors': None,
'coordinates': None,
'created_at': 'Wed Mar 31 19:25:16 +0000 2021',
'text': 'The text',
'truncated': True,
'user': {'contributors_enabled': False,
'screen_name': 'abcde',
'verified': False
}
}
,{...}]
}
The information I'm interested in is all in the statuses array. With pandas I can turn it into a DataFrame like this:
df = pd.DataFrame(Data['statuses'])
Then I extract a subset out of this dataframe with
dfsub = df[['created_at', 'text']]
display(dfsub)
shows exactly what I expect.
But I also want to include [user][screen_name] in the subset.
dfs = df[[ 'user', 'created_at', 'text']]
is syntactically correct, but user contains too much information.
How do I add only the screen_name to the subset?
I have tried things like the following, but none of them work:
[user][screen_name]
user.screen_name
user:screen_name
Upvotes: 1
Views: 82
Reputation: 26251
You can use pd.Series.str. The docs don't do justice to all the wonderful things .str can do, such as accessing list and dict items. Case in point, you can access dict elements like this:
df['user'].str['screen_name']
That said, I agree with @VladimirGromes that a better way is to normalize your data into a flat table.
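To show the .str approach end to end, here is a minimal sketch. The statuses list is made-up sample data shaped like the question's snippet, and it relies on .str indexing working on an object column of dicts, which is how pandas has behaved in the versions I've used rather than something the docs spell out.

```python
import pandas as pd

# Hypothetical sample data mirroring the shape of Data['statuses'] in the question.
statuses = [
    {"created_at": "Wed Mar 31 19:25:16 +0000 2021", "text": "The text",
     "user": {"screen_name": "abcde", "verified": False}},
    {"created_at": "Wed Mar 31 19:30:00 +0000 2021", "text": "Another text",
     "user": {"screen_name": "fghij", "verified": True}},
]
df = pd.DataFrame(statuses)

# .str[...] indexes into each element of the column; here each element is a dict.
dfs = df[["created_at", "text"]].assign(screen_name=df["user"].str["screen_name"])
print(dfs)
```

Using assign on the two-column subset returns a new DataFrame, so the original df is left untouched.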
Upvotes: 0
Reputation: 421
You can drill down step by step: DataFrame, then Series, then dict.
df['user'] # user column = Series
df['user'][0] # first item of the Series = dict
df['user'][0]['screen_name'] # screen_name in that dict
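Note that the chained lookup above returns the value for a single row only. To get screen_name for every row, one option is to apply the same dict lookup to each element. A rough sketch, using made-up sample data in place of Data['statuses']:

```python
import pandas as pd

# Hypothetical sample data mirroring the shape of Data['statuses'] in the question.
statuses = [
    {"created_at": "Wed Mar 31 19:25:16 +0000 2021", "text": "The text",
     "user": {"screen_name": "abcde", "verified": False}},
    {"created_at": "Wed Mar 31 19:30:00 +0000 2021", "text": "Another text",
     "user": {"screen_name": "fghij", "verified": True}},
]
df = pd.DataFrame(statuses)

# Chained access: value for the row labelled 0 only.
first = df["user"][0]["screen_name"]

# Same lookup applied to every dict in the column.
dfs = df[["created_at", "text"]].copy()
dfs["screen_name"] = df["user"].apply(lambda u: u["screen_name"])
print(dfs)
```

If some rows might lack the key, u.get("screen_name") would give None instead of raising a KeyError.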
Upvotes: 0
Reputation: 106
I would normalize the data before constructing the DataFrame. Take a look here: https://stackoverflow.com/a/41801708/14596032
A working example that answers your question:
df = pd.json_normalize(Data['statuses'], sep='_')
dfs = df[[ 'user_screen_name', 'created_at', 'text']]
print(dfs)
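For anyone who wants to run this without the original data, here is a self-contained version. The statuses list is invented sample data standing in for Data['statuses']; the nested user keys become flat columns prefixed with user_ because of sep='_'.

```python
import pandas as pd

# Hypothetical sample data mirroring the shape of Data['statuses'] in the question.
statuses = [
    {"created_at": "Wed Mar 31 19:25:16 +0000 2021", "text": "The text",
     "user": {"screen_name": "abcde", "verified": False}},
    {"created_at": "Wed Mar 31 19:30:00 +0000 2021", "text": "Another text",
     "user": {"screen_name": "fghij", "verified": True}},
]

# Flatten nested dicts: user -> user_screen_name, user_verified.
df = pd.json_normalize(statuses, sep="_")
dfs = df[["user_screen_name", "created_at", "text"]]
print(dfs)
```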
Upvotes: 3