Reputation: 1165
My Dataframe looks like this:
campaign_name campaign_id event_name clicks installs conversions
campaign_1 1234 registration 100 5 1
campaign_1 1234 hv_users_r 100 5 2
campaign_2 2345 registration 500 10 3
campaign_2 2345 hv_users_w 500 10 2
campaign_3 3456 registration 1000 50 10
campaign_4 3456 hv_users_r 1000 50 15
campaign_4 3456 hv_users_w 1000 50 25
I want to categorize all the "event names" into 2 new columns, where 1st new columns represents "registration", and the 2nd new column represents "hv_users", which will the sum of all rows having event-names of "hv_users_r" & "hv_users_w".
To keep this simple - "registration" column will have rows which only have event_name as "registration". All non "registration" event_names would go into the new column "hv_users".
This is my expected new Dataframe:
campaign_name campaign_id clicks installs registrations hv_users
campaign_1 1234 100 5 1 2
campaign_2 2345 500 10 3 2
campaign_3 3456 1000 50 10 40
Can someone please give me directions on how to go from the input DataFrame to the output DataFrame?
Upvotes: 1
Views: 142
Reputation: 38415
Try using pivot_table
df.loc[df['event_name'].str.contains('_'), 'event_name'] = df.loc[df['event_name'].str.contains('_'), 'event_name'].str.extract('(.*_.*)_.*', expand = False)
new_df = df.pivot_table(index=['campaign_name', 'campaign_id','clicks', 'installs'], columns='event_name', values = 'conversions',aggfunc='sum',fill_value=0).reset_index().rename_axis(None, axis=1)
campaign_name campaign_id clicks installs hv_users registration
0 campaign_1 1234 100 5 2 1
1 campaign_2 2345 500 10 2 3
2 campaign_3 3456 1000 50 0 10
3 campaign_4 3456 1000 50 40 0
Upvotes: 0
Reputation: 241
df['hv_users'] = df.conversions.where(df.event_name.str.match(r'hv_users_[r|w]'), 0)
df['registrations'] = df.conversions.where(df.event_name == 'registration', 0)
df.hv_users = df.groupby('campaign_id').hv_users.transform(sum)
df = df.groupby('campaign_id').head(1).drop('event_name', axis=1)
Upvotes: 1
Reputation: 57
pd.crosstab() and pd.pivot() should do the trick.
#df is your input dataframe
replacement = {'hv_users_w':'hv_users', 'hv_users_r':'hv_users','registration':'registration'}
df.event_name = df.event_name.map(replacement)
df1 = pd.crosstab(df.campaign_name, df.event_name)
df2 = pd.pivot_table(df, index = 'campaign_name')
output = pd.concat([df1,df2], axis = 1)
Upvotes: 0
Reputation: 323226
You can using split
+ join
, then groupby
+unstack
df.assign(event_name=df['event_name'].apply(lambda x:"_".join(x.split("_", 2)[:2]))).\
groupby(['ampaign_name','campaign_id','clicks','installs','event_name'])['conversions'].sum().\
unstack(fill_value=0).reset_index()
Out[302]:
event_name ampaign_name campaign_id clicks installs hv_users registration
0 campaign_1 1234 100 5 2 1
1 campaign_2 2345 500 10 2 3
2 campaign_3 3456 1000 50 0 10
3 campaign_4 3456 1000 50 40 0
Upvotes: 1