Reputation: 73
I want to do aggregations on a pandas DataFrame by word.
There are three columns: a click count, an impression count, and the corresponding phrase. I would like to split each phrase into tokens and then sum the clicks/impressions per token, to decide which tokens perform relatively well or badly.
Expected input: a pandas DataFrame like the one below
   click_count  impression_count            text
1           10               100           pizza
2           20               200   pizza italian
3            1                 1  italian cheese
Expected output:
   click_count  impression_count    token
1           30               300    pizza   // 30 = 20 + 10, 300 = 200 + 100
2           21               201  italian   // 21 = 20 + 1
3            1                 1   cheese   // cheese only appears once, in "italian cheese"
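For reproducibility, the sample input can be built like this (a sketch; it assumes the leftmost column above is the index):

```python
import pandas as pd

# Rebuild the sample input; the leftmost column in the question is the index.
df = pd.DataFrame(
    {
        "click_count": [10, 20, 1],
        "impression_count": [100, 200, 1],
        "text": ["pizza", "pizza italian", "italian cheese"],
    },
    index=[1, 2, 3],
)
```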
Upvotes: 3
Views: 416
Reputation: 77027
You could do
In [3091]: s = df.text.str.split(expand=True).stack().reset_index(drop=True, level=-1)
In [3092]: df.loc[s.index].assign(token=s).groupby('token',sort=False,as_index=False).sum()
Out[3092]:
     token  click_count  impression_count
0    pizza           30               300
1  italian           21               201
2   cheese            1                 1
Details
In [3093]: df
Out[3093]:
   click_count  impression_count            text
1           10               100           pizza
2           20               200   pizza italian
3            1                 1  italian cheese
In [3094]: s
Out[3094]:
1      pizza
2      pizza
2    italian
3    italian
3     cheese
dtype: object
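On pandas 0.25 and later, the same stack-and-align idea can be written with `DataFrame.explode`; a minimal sketch, assuming the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "click_count": [10, 20, 1],
        "impression_count": [100, 200, 1],
        "text": ["pizza", "pizza italian", "italian cheese"],
    },
    index=[1, 2, 3],
)

# Turn each phrase into a list of tokens, give every token its own row,
# then sum the counts per token (sort=False keeps first-appearance order).
out = (
    df.assign(token=df["text"].str.split())
      .explode("token")
      .groupby("token", sort=False, as_index=False)[["click_count", "impression_count"]]
      .sum()
)
print(out)
```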
Upvotes: 0
Reputation:
This builds the token column like piRSquared's answer, but the stacked tokens are merged back with the original DataFrame:
(df['text'].str.split(expand=True).stack().reset_index(level=1, drop=True)
 .to_frame('token').merge(df, left_index=True, right_index=True)
 .groupby('token')[['click_count', 'impression_count']].sum())
Out:
         click_count  impression_count
token
cheese             1                 1
italian           21               201
pizza             30               300
If you break this down, it merges this:
df['text'].str.split(expand=True).stack().reset_index(level=1, drop=True).to_frame('token')
Out:
     token
1    pizza
2    pizza
2  italian
3  italian
3   cheese
with the original DataFrame on their indices. The resulting DataFrame is:
(df['text'].str.split(expand=True).stack().reset_index(level=1, drop=True)
.to_frame('token').merge(df, left_index=True, right_index=True))
Out:
     token  click_count  impression_count            text
1    pizza           10               100           pizza
2    pizza           20               200   pizza italian
2  italian           20               200   pizza italian
3  italian            1                 1  italian cheese
3   cheese            1                 1  italian cheese
The rest is grouping by the token column.
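Put together and run on the sample data, the pipeline looks like this (a sketch; note the double-bracket column selection, since selecting groupby columns with a bare tuple is no longer supported in current pandas):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "click_count": [10, 20, 1],
        "impression_count": [100, 200, 1],
        "text": ["pizza", "pizza italian", "italian cheese"],
    },
    index=[1, 2, 3],
)

# One row per (original row, token), aligned on the original index.
tokens = (
    df["text"].str.split(expand=True).stack()
      .reset_index(level=1, drop=True).to_frame("token")
)

# Merge on the index, then sum the counts per token.
result = (
    tokens.merge(df, left_index=True, right_index=True)
          .groupby("token")[["click_count", "impression_count"]].sum()
)
print(result)
```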
Upvotes: 1
Reputation: 294506
# One column per token position: token_0, token_1, ...
tokens = df.text.str.split(expand=True)
token_cols = ['token_{}'.format(i) for i in range(tokens.shape[1])]
tokens.columns = token_cols

# Attach the token columns to the counts
df1 = pd.concat([df.drop('text', axis=1), tokens], axis=1)

# Collapse token_0, token_1, ... into a single 'tokens' column
df2 = pd.lreshape(df1, {'tokens': token_cols})

df2.groupby('tokens').sum()
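A self-contained run of this `pd.lreshape` approach on the sample data (a sketch; `lreshape`'s default `dropna=True` removes the NaN padding rows that `split(expand=True)` creates for shorter phrases):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "click_count": [10, 20, 1],
        "impression_count": [100, 200, 1],
        "text": ["pizza", "pizza italian", "italian cheese"],
    },
    index=[1, 2, 3],
)

# One column per token position: token_0, token_1, ...
tokens = df.text.str.split(expand=True)
token_cols = ['token_{}'.format(i) for i in range(tokens.shape[1])]
tokens.columns = token_cols

# Attach the token columns to the counts, then collapse them into a
# single 'tokens' column; NaN padding rows are dropped by default.
df1 = pd.concat([df.drop('text', axis=1), tokens], axis=1)
df2 = pd.lreshape(df1, {'tokens': token_cols})
print(df2.groupby('tokens').sum())
```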
Upvotes: 1