Shubham R
Shubham R

Reputation: 7644

Filter rows after groupby pandas

I have a table in pandas:

import pandas as pd

df = pd.DataFrame({
    'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
    'pidx':[10,10,300,10,30,40,20,10,30,45,20],
    'pidy':[20,20,400,20,15,20,12,43,54,112,23],
    'count':[10,20,30,40,80,10,20,50,30,10,70],
    'score':[10,10,10,22,22,3,4,5,9,0,1]
})

LeafID  count       pidx     pidy   score
0   1       10           10        20     10
1   1       20           10        20     10
2   2       30          300       400     10
3   1       40           10        20     22
4   3       80           30        15     22
5   3       10           40        20      3
6   1       20           20        12      4
7   6       50           10        43      5
8   3       30           20        54      9
9   5       10           45       112      0
10  1       70           20        23      1

I want to do a groupby and then filter the rows where occurrence of pidx is greater than 2.

That is, filter rows where pidx is 10 and 20.

I tried using df.groupby('pidx').count() but it didn't helped me. Also for those rows I have to do 0.4*count+0.6*score.

Desired output is:

LeafID    count       pidx     pidy    final_score
   1       10           10        20
   1       20           10        20
   1       40           10        20
   6       50           10        43
   1       20           20        12
   3       30           20        54
   1       70           20        23

Upvotes: 7

Views: 16746

Answers (4)

piRSquared
piRSquared

Reputation: 294506

pandas

df[df.groupby('pidx').pidx.transform('count') > 2]


   LeafID  count  pidx  pidy  score
0       1     10    10    20     10
1       1     20    10    20     10
3       1     40    10    20     22
7       6     50    10    43      5

Upvotes: 3

V Shreyas
V Shreyas

Reputation: 459

First of all, your output shows you don't want to do a groupby. Read up on what groupby does. What you need is:

df2 = df[df['pidx']<=20]
df2.sort_index(by = 'pidx')

This will give you your exact result. Read up on pandas indexing and functions. In fact go and read the whole introduction on pandas. It will not take much time.

Row operations are also simple using indexing:

df2['final_score']= 0.4*df2['count'] + 0.6*df2['score']

Upvotes: 0

jezrael
jezrael

Reputation: 863511

You can use value_counts with boolean indexing and isin:

df = pd.DataFrame({
    'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
    'pidx':[10,10,300,10,30,40,20,10,30,45,20],
    'pidy':[20,20,400,20,15,20,12,43,54,112,23],
    'count':[10,20,30,40,80,10,20,50,30,10,70],
    'score':[10,10,10,22,22,3,4,5,9,0,1]
})
print (df)
    LeafID  count  pidx  pidy  score
0        1     10    10    20     10
1        1     20    10    20     10
2        2     30   300   400     10
3        1     40    10    20     22
4        3     80    30    15     22
5        3     10    40    20      3
6        1     20    20    12      4
7        6     50    10    43      5
8        3     30    30    54      9
9        5     10    45   112      0
10       1     70    20    23      1

s = df.pidx.value_counts()
idx = s[s>2].index
print (df[df.pidx.isin(idx)])
   LeafID  count  pidx  pidy  score
0       1     10    10    20     10
1       1     20    10    20     10
3       1     40    10    20     22
7       6     50    10    43      5

Timings:

np.random.seed(123)
N = 1000000


L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'LeafId':np.random.randint(1000, size=N),
                   'pidx': np.random.randint(10000, size=N),
                   'pidy': np.random.choice(L2, N),
                   'count':np.random.randint(1000, size=N)})
print (df)


print (df.groupby('pidx').filter(lambda x: len(x) > 120))

def jez(df):
    s = df.pidx.value_counts()
    return df[df.pidx.isin(s[s>120].index)]

print (jez(df))

In [55]: %timeit (df.groupby('pidx').filter(lambda x: len(x) > 120))
1 loop, best of 3: 1.17 s per loop

In [56]: %timeit (jez(df))
10 loops, best of 3: 141 ms per loop

In [62]: %timeit (df[df.groupby('pidx').pidx.transform('size') > 120])
10 loops, best of 3: 102 ms per loop

In [63]: %timeit (df[df.groupby('pidx').pidx.transform(len) > 120])
1 loop, best of 3: 685 ms per loop

In [64]: %timeit (df[df.groupby('pidx').pidx.transform('count') > 120])
10 loops, best of 3: 104 ms per loop

For final_score you can use:

df['final_score'] = df['count'].mul(.4).add(df.score.mul(.6))

Upvotes: 7

Ted Petrou
Ted Petrou

Reputation: 62037

This is a straightforward application of filter after doing a groupby. In the data you provided, a value of 20 for pidx only occurred twice so it was filtered out.

df.groupby('pidx').filter(lambda x: len(x) > 2)

   LeafID  count  pidx  pidy
0       1     10    10    20
1       1     20    10    20
3       1     40    10    20
7       6     50    10    43

Upvotes: 11

Related Questions