Reputation: 180
I have a dataframe in which each line represents a product sale. Sales are linked to an order number (an order can contain multiple products), with a price and color for each. I need to group the rows by order number and get, for each order row, a column counting each product type.
df = pd.DataFrame({'Product': ['X','X','Y','X','Y','W','W','Z','W','X'],
                   'Order #': ['01','01','02','03','03','03','04','05','05','05'],
                   'Price': [100,100,650,50,700,3000,2500,10,2500,150],
                   'Color': ['RED','BLUE','RED','RED','BLUE','GREEN','RED','BLUE','BLUE','GREEN']})
A 'regular' group-by expression using count is not what I am looking for:
# Aggregate
ag_func = {'Product Quant.': pd.NamedAgg(column='Product', aggfunc='count'),
           'Total Price': pd.NamedAgg(column='Price', aggfunc='sum'),
           'Color Quant.': pd.NamedAgg(column='Color', aggfunc='count')}

# Test
test = df.groupby(pd.Grouper(key='Order #')).agg(**ag_func).reset_index()
I can solve this issue by using get_dummies on each categorical column (product / color) and then aggregating with sum. This is fine for smaller datasets, but in my real-world case there are many dozens of categories, and new datasets come in with different categories altogether...
This is the 'solution' I came up with:
# Dummy
df_dummy = pd.get_dummies(df, prefix='Type', prefix_sep=': ', columns=['Product', 'Color'])

ag_func2 = {'Product Quant.': pd.NamedAgg(column='Order #', aggfunc='count'),
            'W total': pd.NamedAgg(column='Type: W', aggfunc='sum'),
            'X total': pd.NamedAgg(column='Type: X', aggfunc='sum'),
            'Y total': pd.NamedAgg(column='Type: Y', aggfunc='sum'),
            'Z total': pd.NamedAgg(column='Type: Z', aggfunc='sum'),
            'Total Price': pd.NamedAgg(column='Price', aggfunc='sum'),
            'Color BLUE': pd.NamedAgg(column='Type: BLUE', aggfunc='sum'),
            'Color GREEN': pd.NamedAgg(column='Type: GREEN', aggfunc='sum'),
            'Color RED': pd.NamedAgg(column='Type: RED', aggfunc='sum')}

solution = df_dummy.groupby(pd.Grouper(key='Order #')).agg(**ag_func2).reset_index()
Note the two X products in the first row and the two BLUEs in the fifth row. This behaviour is what I need, but it is too convoluted for repeated use on multiple datasets. I've tried to use pivot_table but with no success.
Should I just define a function that goes through the categorical columns, dummies them, and then groups by a set column, using sum aggregation for the dummy variables?
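For reference, a rough sketch of what I mean (the helper's name and signature are just illustrative, not from any library):

```python
import pandas as pd

def dummy_group_sum(df, group_col, cat_cols, sum_cols):
    """Dummy-encode the categorical columns, then group by `group_col`,
    summing the dummy indicators and the given numeric columns."""
    dummies = pd.get_dummies(df[cat_cols])
    combined = pd.concat([df[[group_col] + sum_cols], dummies], axis=1)
    return combined.groupby(group_col).sum().reset_index()

df = pd.DataFrame({'Product': ['X','X','Y','X','Y','W','W','Z','W','X'],
                   'Order #': ['01','01','02','03','03','03','04','05','05','05'],
                   'Price': [100,100,650,50,700,3000,2500,10,2500,150],
                   'Color': ['RED','BLUE','RED','RED','BLUE','GREEN','RED','BLUE','BLUE','GREEN']})

# One row per order, with Product_* / Color_* count columns and the summed price
result = dummy_group_sum(df, 'Order #', ['Product', 'Color'], ['Price'])
```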
Thanks
Upvotes: 2
Views: 828
Reputation: 29635
IIUC, your problem is having to type a pd.NamedAgg for every dummy column created. You can instead perform the operations separately: first create the groupby object, then concat the different operations on the different columns.
gr = df.groupby('Order #')
res = pd.concat([
    # equivalent to counting the orders
    gr.size().to_frame(name='Product Quant.'),
    # equivalent to dummying then summing the dummy product columns
    gr['Product'].value_counts().unstack(fill_value=0).add_suffix(' Total'),
    # sum the prices to get the total
    gr['Price'].sum().to_frame(name='Total Price'),
    # equivalent to summing the dummy color columns
    gr['Color'].value_counts().unstack(fill_value=0).add_prefix('Color ')
], axis=1)
print(res)
Product Quant. W Total X Total Y Total Z Total Total Price \
Order #
01 2 0 2 0 0 200
02 1 0 0 1 0 650
03 3 1 1 1 0 3750
04 1 1 0 0 0 2500
05 3 1 1 0 1 2660
Color BLUE Color GREEN Color RED
Order #
01 1 0 1
02 0 0 1
03 1 1 1
04 0 0 1
05 2 1 0
So basically, in this case groupby.value_counts().unstack() is equivalent to get_dummies().groupby().sum().
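You can check this equivalence directly on the sample data:

```python
import pandas as pd

df = pd.DataFrame({'Product': ['X','X','Y','X','Y','W','W','Z','W','X'],
                   'Order #': ['01','01','02','03','03','03','04','05','05','05'],
                   'Price': [100,100,650,50,700,3000,2500,10,2500,150],
                   'Color': ['RED','BLUE','RED','RED','BLUE','GREEN','RED','BLUE','BLUE','GREEN']})

# groupby, then value_counts, then unstack
a = df.groupby('Order #')['Product'].value_counts().unstack(fill_value=0)

# get_dummies, then groupby (on the aligned 'Order #' series), then sum
b = pd.get_dummies(df['Product']).groupby(df['Order #']).sum()
```

Both give one row per order and one W/X/Y/Z column of counts, with identical values.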
And for further use, instead of
df.groupby('Order #')['Product'].value_counts().unstack(fill_value=0)
you can do a pivot_table
to get the same result:
df.pivot_table(index='Order #', columns='Product', aggfunc='size', fill_value=0)
Upvotes: 2