melodyduany
melodyduany

Reputation: 23

Aggregating rows in python pandas dataframe

I have a dataframe documenting when a product was added and removed from basket. However, the set_name column contains two sets of information for the color set and the shape set. See below:

   eff_date  prod_id   set_name         change_type           
0  20150414  20770     MONO COLOR SET   ADD             
1  20150414  20770     REC SHAPE SET    ADD         
2  20150429  132       MONO COLOR SET   ADD                
3  20150429  132       REC SHAPE SET    ADD        
4  20150521  199       MONO COLOR SET   DROP
5  20150521  199       REC SHAPE SET    DROP
6  20150521  199       TET SHAPE SET    ADD
7  20150521  199       MONO COLOR SET   ADD

I would like to split out the two sets of information contained in set_name into columns color_set and shape_set and drop set_name. so the previous df should look like:

   eff_date  prod_id   change_type  color_set       shape_set     
0  20150414  20770     ADD          MONO COLOR SET  REC SHAPE SET          
1  20150429  132       ADD          MONO COLOR SET  REC SHAPE SET
2  20150521  199       DROP         MONO COLOR SET  REC SHAPE SET
3  20150521  199       ADD          MONO COLOR SET  TET SHAPE SET

I attempted first splitting out the columns in a for loop and then aggregating with groupby:

for index, row in df.iterrows():
    if 'COLOR' in df.loc[index,'set_name']:
        df.loc[index,'color_set'] = df.loc[index,'set_name']
    if 'SHAPE' in df.loc[index,'set_name']:
        df.loc[index,'shape_set'] = df.loc[index,'set_name']
df = df.fillna('')
df.groupby(['eff_date','prod_id','change_type']).agg({'color_set':sum,'shape_set':sum})

However this left me with a dataframe of only two columns and multi-level index that i wasn't sure how to unstack.

                                color_set       shape_set
eff_date  prod_id  change_type 
20150414  20770    ADD          MONO COLOR SET  REC SHAPE SET
20150429  132      ADD          MONO COLOR SET  REC SHAPE SET
20150521  199      DROP         MONO COLOR SET  REC SHAPE SET
                   ADD          MONO COLOR SET  TET SHAPE SET

Any help on this is greatly appreciated!

Upvotes: 2

Views: 52

Answers (1)

user3483203
user3483203

Reputation: 51185

Your code looks fine apart from having to reset your index, but we can simplify it quite a bit (in particular remove the need for iterrows which can be painfully slow, using a pivot with a small trick to get your column names.

This answer assumes that you only have these two options in your column, if you have more categories, simply use numpy.select instead of numpy.where and define your conditions / outputs that way.


df['key'] = np.where(df['set_name'].str.contains('COLOR'), 'color_set', 'shape_set')

df.pivot_table(
  index=['eff_date', 'prod_id', 'change_type'],
  columns='key',
  values='set_name',
  aggfunc='first'
).reset_index()

key  eff_date  prod_id change_type       color_set      shape_set
0    20150414    20770         ADD  MONO COLOR SET  REC SHAPE SET
1    20150429      132         ADD  MONO COLOR SET  REC SHAPE SET
2    20150521      199         ADD  MONO COLOR SET  TET SHAPE SET
3    20150521      199        DROP  MONO COLOR SET  REC SHAPE SET

Upvotes: 1

Related Questions