Reputation: 23
I have a dataframe documenting when a product was added and removed from basket. However, the set_name
column contains two sets of information for the color set and the shape set. See below:
eff_date prod_id set_name change_type
0 20150414 20770 MONO COLOR SET ADD
1 20150414 20770 REC SHAPE SET ADD
2 20150429 132 MONO COLOR SET ADD
3 20150429 132 REC SHAPE SET ADD
4 20150521 199 MONO COLOR SET DROP
5 20150521 199 REC SHAPE SET DROP
6 20150521 199 TET SHAPE SET ADD
7 20150521 199 MONO COLOR SET ADD
I would like to split out the two sets of information contained in set_name
into columns color_set
and shape_set
and drop set_name
. so the previous df should look like:
eff_date prod_id change_type color_set shape_set
0 20150414 20770 ADD MONO COLOR SET REC SHAPE SET
1 20150429 132 ADD MONO COLOR SET REC SHAPE SET
2 20150521 199 DROP MONO COLOR SET REC SHAPE SET
3 20150521 199 ADD MONO COLOR SET TET SHAPE SET
I attempted first splitting out the columns in a for loop and then aggregating with groupby:
for index, row in df.iterrows():
if 'COLOR' in df.loc[index,'set_name']:
df.loc[index,'color_set'] = df.loc[index,'set_name']
if 'SHAPE' in df.loc[index,'set_name']:
df.loc[index,'shape_set'] = df.loc[index,'set_name']
df = df.fillna('')
df.groupby(['eff_date','prod_id','change_type']).agg({'color_set':sum,'shape_set':sum})
However this left me with a dataframe of only two columns and multi-level index that i wasn't sure how to unstack.
color_set shape_set
eff_date prod_id change_type
20150414 20770 ADD MONO COLOR SET REC SHAPE SET
20150429 132 ADD MONO COLOR SET REC SHAPE SET
20150521 199 DROP MONO COLOR SET REC SHAPE SET
ADD MONO COLOR SET TET SHAPE SET
Any help on this is greatly appreciated!
Upvotes: 2
Views: 52
Reputation: 51185
Your code looks fine apart from having to reset your index, but we can simplify it quite a bit (in particular remove the need for iterrows
which can be painfully slow, using a pivot
with a small trick to get your column names.
This answer assumes that you only have these two options in your column, if you have more categories, simply use numpy.select
instead of numpy.where
and define your conditions / outputs that way.
df['key'] = np.where(df['set_name'].str.contains('COLOR'), 'color_set', 'shape_set')
df.pivot_table(
index=['eff_date', 'prod_id', 'change_type'],
columns='key',
values='set_name',
aggfunc='first'
).reset_index()
key eff_date prod_id change_type color_set shape_set
0 20150414 20770 ADD MONO COLOR SET REC SHAPE SET
1 20150429 132 ADD MONO COLOR SET REC SHAPE SET
2 20150521 199 ADD MONO COLOR SET TET SHAPE SET
3 20150521 199 DROP MONO COLOR SET REC SHAPE SET
Upvotes: 1