python
python

Reputation: 13

pandas groupby and update the sum of the number of times the values in one column is greater than the other column

I have a dataset in the following format

df = pd.DataFrame([[1, 'Label1', 0, 8, 2], [1, 'Label3', 0, 20, 5], [2, 'Label5', 1, 20, 2], [2, 'Label4', 1, 11, 0], 
              [5, 'Label2', 0, 0, -4],[1, 'Label2', 1, 8, 2], [2, 'Label5', 0, 20, 5], [3, 'Label2', 1, 20, 2], [4, 'Label4', 0, 1, 0], 
              [5, 'Label3', 0, 1, -4],[1, 'Label3', 1, 8, 2], [2, 'Label4', 0, 20, 5], [3, 'Label1', 1, 20, 2], [4, 'Label3', 0, 1, 0], 
              [5, 'Label4', 0, 1, -4],[1, 'Label4', 1, 8, 2], [2, 'Label3', 0, 20, 5], [3, 'Label3', 1, 20, 2], [4, 'Label5', 0, 1, 0], 
              [5, 'Label5', 0, 1, -4]],
              columns=['ID', 'Label', 'Status', 'Coeff', 'result'])

cm = {'TP': 0,'FP': 0}

For each ID in df, I would like to find the number of times the column Coeff is greater than Result when Status column is 1. If this count is greater than 3 then TP should be incremented by 1 and if it less than 3, then FP should be incremented by 1.

Example: When ID is 1111 and Status 1, if the Coeff column is greater than Result column twice for that particular ID, then FP must be incremented by 1.

I tried to add a new column called count for each ID and assigned a value as 1 everytime the column Coeff was greater than Result.

for ID in df.groupby('ID'): 
  df.loc[(df['Coeff'] > df['Result']), 'count'] = 1
  df_new = list(df[['ID','count']].groupby(df['ID']))

Then I thought of finding whether count has the number 1 in it. If it does, then increment TP. Otherwise, increment FP.

But I couldn't achieve it.

How do I get the required result?

Upvotes: 1

Views: 202

Answers (1)

cs95
cs95

Reputation: 402844

A simple grouping operation on a masked comparison should do:

v = df.Coeff.gt(df.result).where(df.Status.astype(bool)).groupby(df.ID).sum()

Or (to retain dtype=int, thanks piR!),

v = df.Coeff.gt(df.result).where(df.Status.astype(bool), 0).groupby(df.ID).sum()

v   # second expression result

ID
1    3
2    2
3    3
4    0
5    0
dtype: int64

Now,

cm['TP'] = v.gt(3).sum()
cm['FP'] = v.lt(3).sum()

Details
df.Coeff.gt(df.result) returns a mask. Now, hide all those values for which df.Status is not 1. This is done using (df.Coeff > df.result).where(df.Status.astype(bool)). Finally, take this masked result, and group on ID, followed by a sum to get your result.

Upvotes: 1

Related Questions