Matt
Matt

Reputation: 161

Conditional sums based on grouped columns

Basically, I would like to create a column ERROR RATE that takes the sum of the YES's in ERROR and divides it by the NO's in ERROR, but to have that ERROR RATE filtered per LOCATION per DESCRIPTION.

df = pd.DataFrame({'ERROR' : ['YES','YES','NO','NO','YES'],
           'LOCATION' : ['London', 'New York', 'Edinburgh', 'Paris', 'Berlin'],
           'DESCRIPTION' : ['Good', 'Bad', 'Good', 'Bad', 'Bad']}) 

I've tried writing code along the lines:

df.groupby(['DESCRIPTION','LOCATION'])['ERROR'].agg()

But I'm unsure if a) This is even the correct approach and b) What kind of function I need to write for the agg()

Thanks in advance

Upvotes: 1

Views: 40

Answers (1)

cs95
cs95

Reputation: 402493

Maybe you're looking for groupby + value_counts + unstack + div?

i = df.groupby(['DESCRIPTION' ,'LOCATION']).ERROR.value_counts().unstack(-1)
j = (i.YES / i.NO).to_frame('ERROR_RATE')

j

                       ERROR_RATE
DESCRIPTION LOCATION             
Bad         Berlin            ...
            New York          ...
            Paris             ...
Good        Edinburgh         ...
            London            ...

If there are no counts for either YES or NO for a particular group, then the resulting columns will have NaN for those entries in the final result.


If you're using agg, you need a 3-way groupby.

i = df.groupby(['DESCRIPTION','LOCATION','ERROR']).ERROR.count().unstack()
j = (i.YES / i.NO).to_frame('ERROR_RATE')

j
                       ERROR_RATE
DESCRIPTION LOCATION             
Bad         Berlin            ...
            New York          ...
            Paris             ...
Good        Edinburgh         ...
            London            ...

count is the same as agg('count').

Upvotes: 2

Related Questions