Reputation: 33
I have a pandas dataframe with a group field and variable of interest. For each row in the dataframe I want to count how many rows within the same group have a larger value for the variable of interest.
Below is an example of what I'm trying to achieve:
import pandas as pd
df = pd.DataFrame(data = [['a',1],['a',2],['a',2],['a',3],['b',4],['b',2],['b',6]],
columns = ['groups','value'])
df
groups value
0 a 1
1 a 2
2 a 2
3 a 3
4 b 4
5 b 2
6 b 6
Here is the output I'm hoping to receive:
groups value what_i_want
0 a 1 3
1 a 2 1
2 a 2 1
3 a 3 0
4 b 4 1
5 b 2 2
6 b 6 0
I know I could get to this answer by looping through each row of the dataframe, however I also know iterating through the rows of dataframe is a last resort and my full dataset is much bigger and that would take a long time to run. I'm assuming there is some way to do this using groupby or apply, but I can't figure it out.
Thanks!
Upvotes: 3
Views: 260
Reputation: 323326
IIUC rank
(-df.value).groupby(df['groups']).rank(method='min')-1
Out[466]:
0 3.0
1 1.0
2 1.0
3 0.0
4 1.0
5 2.0
6 0.0
Name: value, dtype: float64
#df['what i want']=(-df.value).groupby(df['groups']).rank(method='min')-1
Upvotes: 1
Reputation: 93181
Using numpy broadcasting for each group:
def summarize(group):
v = group['value'].values
other = v[:, None]
count = (other > v).sum(axis=0)
return pd.DataFrame({'what_i_want': count})
df.groupby('groups').apply(summarize)
Let consider group a
. We first extract the elements in values
to a numpy array called v
:
v = [1, 2, 2, 3] # ndarray of shape (4,)
We want to compare this array against itself orthogonally and count how many elements are greater than the current element. The [:, None]
syntax is to raise v
an extra dimension...
other = [[1], [2], [2], [3]] # ndarray of shape (4,1)
...so that the other > v
operation is broadcastable and the comparison matrix looks like this:
other > v
v: [ 1 2 2 3 ]
other: [
[1] F F F F
[2] T F F F
[2] T F F F
[3] T T T F
]
----------------------------------
sum(axis=0) 3 1 1 0
Upvotes: 1