Reputation: 912

Basic pandas dataframe manipulation question

I have the following JSON snippet:

{'search_metadata': {'completed_in': 0.027,
                     'count': 2},
 'statuses': [{'contributors': None,
               'coordinates': None,
               'created_at': 'Wed Mar 31 19:25:16 +0000 2021',
               'text': 'The text',
               'truncated': True,
               'user': {'contributors_enabled': False,
                        'screen_name': 'abcde',
                        'verified': False
                        }
               }
               ,{...}]
}

The information I need is all in the statuses array. With pandas I can turn it into a DataFrame like this:

df = pd.DataFrame(Data['statuses'])

Then I extract a subset out of this dataframe with

dfsub = df[['created_at', 'text']]

display(dfsub) shows exactly what I expect.

But I also want to include [user][screen_name] in the subset.

dfs = df[[ 'user', 'created_at', 'text']]

is syntactically correct, but user contains too much information.

How do I add only the screen_name to the subset? I have tried things like the following, but none of them work:

[user][screen_name]
user.screen_name
user:screen_name

Upvotes: 1

Answers (3)

Pierre D

Reputation: 26251

You can use pd.Series.str. The docs don't do justice to all the wonderful things .str can do, such as accessing list and dict items. Case in point, you can access dict elements like this:

df['user'].str['screen_name']

That said, I agree with @VladimirGromes that a better way is to normalize your data into a flat table.
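
To show how the .str lookup could feed the subset from the question, here is a minimal sketch. The `statuses` list is a hypothetical stand-in for Data['statuses'], and its second row is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for Data['statuses']; the second row is invented.
statuses = [
    {"created_at": "Wed Mar 31 19:25:16 +0000 2021", "text": "The text",
     "user": {"screen_name": "abcde", "verified": False}},
    {"created_at": "Wed Mar 31 19:30:00 +0000 2021", "text": "Other text",
     "user": {"screen_name": "fghij", "verified": True}},
]
df = pd.DataFrame(statuses)

# .str[...] indexes into each element of the column, here each user dict.
screen_names = df["user"].str["screen_name"]

# Build the subset with only the wanted columns plus the extracted name.
dfs = df[["created_at", "text"]].assign(screen_name=screen_names)
print(dfs)
```

This keeps the nested user column out of the result while leaving the original df untouched.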

Upvotes: 0

Simon

Reputation: 421

You can access the DataFrame column, then the Series element, then the dict key:

df['user']                   # user column = Series
df['user'][0]                # 1st (only) item of the Series = dict
df['user'][0]['screen_name'] # screen_name in dict
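
The lookup above reads a single row. One way to repeat it for every row is Series.apply, sketched below. The `statuses` list is a hypothetical stand-in for Data['statuses'], and the sketch assumes every user dict has a screen_name key.

```python
import pandas as pd

# Hypothetical stand-in for Data['statuses']; the second row is invented.
statuses = [
    {"created_at": "Wed Mar 31 19:25:16 +0000 2021", "text": "The text",
     "user": {"screen_name": "abcde", "verified": False}},
    {"created_at": "Wed Mar 31 19:30:00 +0000 2021", "text": "Other text",
     "user": {"screen_name": "fghij", "verified": True}},
]
df = pd.DataFrame(statuses)

# Single-row lookup, as in the answer: column -> row label 0 -> dict key.
first = df["user"][0]["screen_name"]

# Same dict lookup applied to each row of the user column.
# Assumes every dict contains 'screen_name'; a missing key raises KeyError.
names = df["user"].apply(lambda user: user["screen_name"])

# Copy the subset first so the new column is not written into a view of df.
dfs = df[["created_at", "text"]].copy()
dfs["screen_name"] = names
print(dfs)
```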

Upvotes: 0

Vladimir Gromes

Reputation: 106

I would normalize the data before constructing the DataFrame. Take a look here: https://stackoverflow.com/a/41801708/14596032

A working example for your question:

df = pd.json_normalize(Data['statuses'], sep='_')
dfs = df[[ 'user_screen_name', 'created_at', 'text']]
print(dfs)
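
For anyone who wants to run this without the original data, here is a self-contained sketch. The `data` dict is a hypothetical, trimmed version of the structure in the question, with a made-up second status.

```python
import pandas as pd

# Hypothetical, trimmed version of the question's data; second status is invented.
data = {
    "search_metadata": {"completed_in": 0.027, "count": 2},
    "statuses": [
        {"created_at": "Wed Mar 31 19:25:16 +0000 2021", "text": "The text",
         "user": {"screen_name": "abcde", "verified": False}},
        {"created_at": "Wed Mar 31 19:30:00 +0000 2021", "text": "Other text",
         "user": {"screen_name": "fghij", "verified": True}},
    ],
}

# json_normalize flattens nested dicts into columns; sep='_' joins the
# parent and child keys, so user -> screen_name becomes user_screen_name.
df = pd.json_normalize(data["statuses"], sep="_")
print(df.columns.tolist())

dfs = df[["user_screen_name", "created_at", "text"]]
print(dfs)
```

With the default sep the column would be named user.screen_name instead, which still works with bracket indexing but not with attribute access.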

Upvotes: 3
