ners23
ners23

Reputation: 1

groupby and sum to calculate into new column and organized hierarchy

I'm trying to groupby my dataframe into a tree format so that it sorts down in a hierarchical way. DC being the first column that funnels down into Retailer, Store Count, Product descriptions, case volume and velocity in that order. Summing the retailer column into a new column "StoreCt" that is positioned after "Retailer"

The problem I'm running into is the store counts are being duplicated.

Here is the dataframe I have

Retailer DC Product Cs Volume Velocity
joe ABC bars Cs Cost Velocity
joe DFC drinks1 Cs Cost Velocity
joe DFC drinks2 Cs Cost Velocity
randy ABC bars Cs Cost Velocity
peter DFC drinks2 Cs Cost Velocity
john XYZ drinks Cs Cost Velocity
joe XYZ snacks Cs Cost Velocity
joe DFC bars2 Cs Cost Velocity

This is the result I want. values in the cs, volume, and velocity columns need to be unchanged

DC Retailer StoreCt Product Cs Volume Velocity
ABC joe 1 bars Cs Cost Velocity
randy 1 bars Cs Cost Velocity
DFC joe 3 drinks1 Cs Cost Velocity
drinks2 Cs Cost Velocity
bars2 Cs Cost Velocity
peter 1 drinks2 Cs Cost Velocity
XYZ joe 1 snacks Cs Cost Velocity
john 1 drinks Cs Cost Velocity

this is my code to get the store count, but i can't figure out how to add it into the dataframe without duplicating the values

store_count = df.groupby("Retailer").size().to_frame("StoreCt")
store_count

Upvotes: 0

Views: 43

Answers (1)

Corralien
Corralien

Reputation: 120419

Use transform to broadcast the result to all rows:

df['StoreCt'] = df.groupby(['DC', 'Retailer']).transform('size')
print(df)

# Output:
  Retailer   DC  Product  Cs Volume  Velocity  StoreCt
0      joe  ABC     bars  Cs   Cost  Velocity        1
1      joe  DFC  drinks1  Cs   Cost  Velocity        3
2      joe  DFC  drinks2  Cs   Cost  Velocity        3
3    randy  ABC     bars  Cs   Cost  Velocity        1
4    peter  DFC  drinks2  Cs   Cost  Velocity        1
5     john  XYZ   drinks  Cs   Cost  Velocity        1
6      joe  XYZ   snacks  Cs   Cost  Velocity        1
7      joe  DFC    bars2  Cs   Cost  Velocity        3

To get the output, you can reorder the columns:

cols = ['DC', 'Retailer', 'StoreCt', 'Product', 'Cs', 'Volume', 'Velocity']
df = df[cols].sort_values(['DC', 'Retailer'], ignore_index=True)
print(df)

# Output
    DC Retailer  StoreCt  Product  Cs Volume  Velocity
0  ABC      joe        1     bars  Cs   Cost  Velocity
1  ABC    randy        1     bars  Cs   Cost  Velocity
2  DFC      joe        3  drinks1  Cs   Cost  Velocity
3  DFC      joe        3  drinks2  Cs   Cost  Velocity
4  DFC      joe        3    bars2  Cs   Cost  Velocity
5  DFC    peter        1  drinks2  Cs   Cost  Velocity
6  XYZ      joe        1   snacks  Cs   Cost  Velocity
7  XYZ     john        1   drinks  Cs   Cost  Velocity

Upvotes: 1

Related Questions