zach
zach

Reputation: 33

Count how many rows within the same group have a larger value in a given column for each row in Pandas DataFrame

I have a pandas dataframe with a group field and variable of interest. For each row in the dataframe I want to count how many rows within the same group have a larger value for the variable of interest.

Below is an example of what I'm trying to achieve:

import pandas as pd
df = pd.DataFrame(data = [['a',1],['a',2],['a',2],['a',3],['b',4],['b',2],['b',6]],
                  columns = ['groups','value'])
df
  groups value
0   a      1
1   a      2
2   a      2
3   a      3
4   b      4
5   b      2
6   b      6

Here is the output I'm hoping to receive:

  groups value what_i_want
0   a      1        3
1   a      2        1
2   a      2        1
3   a      3        0
4   b      4        1
5   b      2        2
6   b      6        0

I know I could get to this answer by looping through each row of the dataframe, however I also know iterating through the rows of dataframe is a last resort and my full dataset is much bigger and that would take a long time to run. I'm assuming there is some way to do this using groupby or apply, but I can't figure it out.

Thanks!

Upvotes: 3

Views: 260

Answers (2)

BENY
BENY

Reputation: 323326

IIUC rank

(-df.value).groupby(df['groups']).rank(method='min')-1
Out[466]: 
0    3.0
1    1.0
2    1.0
3    0.0
4    1.0
5    2.0
6    0.0
Name: value, dtype: float64

#df['what i want']=(-df.value).groupby(df['groups']).rank(method='min')-1

Upvotes: 1

Code Different
Code Different

Reputation: 93181

Using numpy broadcasting for each group:

def summarize(group):
    v = group['value'].values
    other = v[:, None]
    count = (other > v).sum(axis=0)

    return pd.DataFrame({'what_i_want': count})

df.groupby('groups').apply(summarize)

Explanation

Let consider group a. We first extract the elements in values to a numpy array called v:

v = [1, 2, 2, 3] # ndarray of shape (4,)

We want to compare this array against itself orthogonally and count how many elements are greater than the current element. The [:, None] syntax is to raise v an extra dimension...

other = [[1], [2], [2], [3]] # ndarray of shape (4,1)

...so that the other > v operation is broadcastable and the comparison matrix looks like this:

other > v
            v:  [ 1   2   2   3 ]
other:  [
         [1]      F   F   F   F
         [2]      T   F   F   F
         [2]      T   F   F   F
         [3]      T   T   T   F
        ]
----------------------------------
sum(axis=0)       3   1   1   0

Upvotes: 1

Related Questions