Reputation: 1998
I have a df such as
ID | Status | Color
555 Cancelled Green
434 Processed Red
212 Cancelled Blue
121 Cancelled Green
242 Cancelled Blue
352 Processed Green
343 Processed Blue
I am using code such as below:
cc = df.groupby(by='Color').ID.count()
df.groupby(by=['Color', 'Status']).apply(lambda x: len(x)/cc.loc[x.Color.iloc[0]])
This gives me output such as
Color Status
Blue Cancelled 0.666667
Processed 0.333333
Green Cancelled 0.666667
Processed 0.333333
Red Processed 1.000000
dtype: float64
Which gives me percentage of the status of each color.
There is also a field called dollar_value where each row contains dollar amounts, if I want to add two fields to my output of 1. Total_Dollars for that color and status and 2. the dollar_per_order for that color ( meaning if the Total_Dollars is 1000 and there are 200 rows for that color and status it would be 1000/200 or 5. Can I easily add both of these calculations to my output that I already have? Or would I need to create a function?
Desired Output:
Color Status Total |Dollar_Per_Order
Blue Cancelled 0.666667 1000 20
Processed 0.333333 200 5
Green Cancelled 0.666667 2000 20
Processed 0.333333 1000 5
Red Processed 1.000000 300 10
dtype: float64
Thanks!
Upvotes: 1
Views: 59
Reputation: 30991
To compute all 3 columns, define a function to be applied to each group as:
def fn(grp):
total = grp.dollar_value.sum()
rowNo = len(grp.index)
return pd.Series([ rowNo/cc[grp.name[0]], total, total/rowNo ],
index=[ 'Percentage', 'Total_Dollars', 'Dollar_per_order'])
Then apply it:
df.groupby(by=['Color', 'Status']).apply(fn)
Note that I used len(grp.index)
instead of len(grp)
.
The reason is that it runs a little quicker.
I also read the Color of the current group other way than you.
Upvotes: 1