Reputation: 180
I have a dataframe in which each line represents a product sale. Sales are linked to an order number (an order can contain multiple products), with a price and color for each. I need to group the rows by order number and get, for each order row, a column counting each product type.
df = pd.DataFrame({'Product': ['X','X','Y','X','Y','W','W','Z','W','X'],
                   'Order #': ['01','01','02','03','03','03','04','05','05','05'],
                   'Price': [100,100,650,50,700,3000,2500,10,2500,150],
                   'Color': ['RED','BLUE','RED','RED','BLUE','GREEN','RED','BLUE','BLUE','GREEN']})
A 'regular' group-by expression using count is not what I am looking for:
# Aggregate
ag_func = {'Product Quant.': pd.NamedAgg(column='Product', aggfunc='count'),
           'Total Price': pd.NamedAgg(column='Price', aggfunc='sum'),
           'Color Quant.': pd.NamedAgg(column='Color', aggfunc='count')}

# Test
test = df.groupby(pd.Grouper(key='Order #')).agg(**ag_func).reset_index()
I can solve this issue by using get_dummies on each categorical column (product / color) and then aggregating with sum. This is fine for smaller datasets, but in my real-world case there are many dozens of categories, and new datasets come in with different categories altogether...
This is the 'solution' I came up with:
# Dummy
df_dummy = pd.get_dummies(df, prefix='Type', prefix_sep=': ', columns=['Product', 'Color'])

ag_func2 = {'Product Quant.': pd.NamedAgg(column='Order #', aggfunc='count'),
            'W total': pd.NamedAgg(column='Type: W', aggfunc='sum'),
            'X total': pd.NamedAgg(column='Type: X', aggfunc='sum'),
            'Y total': pd.NamedAgg(column='Type: Y', aggfunc='sum'),
            'Z total': pd.NamedAgg(column='Type: Z', aggfunc='sum'),
            'Total Price': pd.NamedAgg(column='Price', aggfunc='sum'),
            'Color BLUE': pd.NamedAgg(column='Type: BLUE', aggfunc='sum'),
            'Color GREEN': pd.NamedAgg(column='Type: GREEN', aggfunc='sum'),
            'Color RED': pd.NamedAgg(column='Type: RED', aggfunc='sum')}

solution = df_dummy.groupby(pd.Grouper(key='Order #')).agg(**ag_func2).reset_index()
Note the two X products in the first row and the two BLUEs in the fifth row. This behaviour is what I need, but it is too convoluted for repeated use on multiple datasets. I've tried to use pivot_table but with no success.
Should I just define a function that goes through the categorical columns, dummies them, and then groups by a set column, using sum aggregation for the dummy variables?
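For reference, a rough sketch of what I mean (the helper's name and signature are just illustrative, not from any library):

```python
import pandas as pd

def dummy_group_sum(df, group_col, cat_cols, sum_cols):
    """Dummy-encode the categorical columns, then group by `group_col`,
    summing the dummy indicators and the given numeric columns."""
    dummies = pd.get_dummies(df[cat_cols])
    combined = pd.concat([df[[group_col] + sum_cols], dummies], axis=1)
    return combined.groupby(group_col).sum().reset_index()

df = pd.DataFrame({'Product': ['X','X','Y','X','Y','W','W','Z','W','X'],
                   'Order #': ['01','01','02','03','03','03','04','05','05','05'],
                   'Price': [100,100,650,50,700,3000,2500,10,2500,150],
                   'Color': ['RED','BLUE','RED','RED','BLUE','GREEN','RED','BLUE','BLUE','GREEN']})

# One row per order, with Product_* / Color_* count columns and the summed price
result = dummy_group_sum(df, 'Order #', ['Product', 'Color'], ['Price'])
```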
Thanks
Upvotes: 2
Views: 828
Reputation: 29635
IIUC, your problem is having to type a pd.NamedAgg for every dummy column created. You can instead perform the operations separately: first create the groupby object, then concat the different operations on the different columns.
gr = df.groupby('Order #')
res = pd.concat([
    # equivalent to counting the orders
    gr.size().to_frame(name='Product Quant.'),
    # equivalent to dummying then summing the dummy product columns
    gr['Product'].value_counts().unstack(fill_value=0).add_suffix(' Total'),
    # sum the prices to get the total
    gr['Price'].sum().to_frame(name='Total Price'),
    # equivalent to summing the dummy color columns
    gr['Color'].value_counts().unstack(fill_value=0).add_prefix('Color ')
], axis=1)
print(res)
Product Quant. W Total X Total Y Total Z Total Total Price \
Order #
01 2 0 2 0 0 200
02 1 0 0 1 0 650
03 3 1 1 1 0 3750
04 1 1 0 0 0 2500
05 3 1 1 0 1 2660
Color BLUE Color GREEN Color RED
Order #
01 1 0 1
02 0 0 1
03 1 1 1
04 0 0 1
05 2 1 0
So basically, in this case groupby.value_counts().unstack() is equivalent to get_dummies().groupby().sum().
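You can check this equivalence directly on the sample data:

```python
import pandas as pd

df = pd.DataFrame({'Product': ['X','X','Y','X','Y','W','W','Z','W','X'],
                   'Order #': ['01','01','02','03','03','03','04','05','05','05'],
                   'Price': [100,100,650,50,700,3000,2500,10,2500,150],
                   'Color': ['RED','BLUE','RED','RED','BLUE','GREEN','RED','BLUE','BLUE','GREEN']})

# groupby, then value_counts, then unstack
a = df.groupby('Order #')['Product'].value_counts().unstack(fill_value=0)

# get_dummies, then groupby (on the aligned 'Order #' series), then sum
b = pd.get_dummies(df['Product']).groupby(df['Order #']).sum()
```

Both give one row per order and one W/X/Y/Z column of counts, with identical values.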
And for further use, instead of
df.groupby('Order #')['Product'].value_counts().unstack(fill_value=0)
you can do a pivot_table
to get the same result:
df.pivot_table(index='Order #', columns='Product', aggfunc='size', fill_value=0)
Upvotes: 2