eugene
eugene

Reputation: 41665

pandas, compute a value per group?

I'm trying to go from df to df2
I'm grouping by review_meta_id, age_bin then calculate a ctr from sum(click_count)/ sum(impression_count)

In [69]: df
Out[69]:
   review_meta_id  age_month  impression_count  click_count age_bin
0               3          4                10            3       1
1               3         10                 5            2       2
2               3         20                 5            3       3
3               3          8                 9            2       2
4               4          9                 9            5       2

In [70]: df2
Out[70]:
   review_meta_id       ctr  age_bin
0               3  0.300000        1
1               3  0.285714        2
2               3  0.600000        3
3               4  0.555556        2



import pandas as pd

bins = [0, 5, 15, 30]
labels = [1,2,3]

l = [dict(review_meta_id=3, age_month=4, impression_count=10, click_count=3), dict(review_meta_id=3, age_month=10, impression_count=5, click_count=2), dict(review_meta_id=3, age_month=20, impression_count=5, cli\
ck_count=3), dict(review_meta_id=3, age_month=8, impression_count=9, click_count=2), dict(review_meta_id=4, age_month=9, impression_count=9, click_count=5)]

df = pd.DataFrame(l)
df['age_bin'] = pd.cut(df['age_month'], bins=bins, labels=labels)


grouped = df.groupby(['review_meta_id', 'age_bin'])

Is there an elegant way of doing the following?

data = []
for name, group in grouped:
    ctr = group['click_count'].sum() / group['impression_count'].sum()
    review_meta_id, age_bin = name
    data.append(dict(review_meta_id=review_meta_id, ctr=ctr, age_bin=age_bin))


df2 = pd.DataFrame(data)

Upvotes: 2

Views: 572

Answers (2)

Azzedine
Azzedine

Reputation: 505

you can use apply function after you grouping the dataframe by 'review_meta_id', 'age_bin' in order to calculate 'ctr', the result will be a pandas series in order to convert it to a dataframe we use reset_index() and provide name='ctr', The name of the column corresponding to the Series values.

def divide_two_cols(df_sub):
    return df_sub['click_count'].sum() / float(df_sub['impression_count'].sum())

df2 = df.groupby(['review_meta_id', 'age_bin']).apply(divide_two_cols).reset_index(name='ctr')
new_df

Upvotes: 0

jezrael
jezrael

Reputation: 862641

You can first aggregate goth columns by sum, then divide columns with DataFrame.pop for use and remove columns and last convert MultiIndex to columns with remove rows with missing values by DataFrame.dropna:

df2 = df.groupby(['review_meta_id', 'age_bin'])[['click_count','impression_count']].sum()
df2['ctr'] = df2.pop('click_count') / df2.pop('impression_count')
df2 = df2.reset_index().dropna()
print (df2)
   review_meta_id age_bin       ctr
0               3       1  0.300000
1               3       2  0.285714
2               3       3  0.600000
4               4       2  0.555556

Upvotes: 2

Related Questions