Reputation: 363
I have a large data frame (40M rows) and I want to filter out rows based on one column if the value meets a condition in a groupby object.
For example, here is some random data. The 'letter' column would actually have thousands of unique values:
x y z letter
0 47 86 30 e
1 58 9 28 b
2 96 59 42 a
3 79 6 45 e
4 77 80 37 d
5 66 91 35 d
6 96 31 52 d
7 56 8 26 e
8 78 96 14 a
9 22 60 13 e
10 75 82 9 d
11 5 54 29 c
12 83 31 40 e
13 37 70 2 c
14 53 67 66 a
15 76 33 78 d
16 64 67 81 b
17 23 94 1 d
18 10 1 31 e
19 52 11 3 d
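For reference, the example frame above can be rebuilt like this, so the snippets that follow are copy-pasteable (values transcribed from the printout):

```python
import pandas as pd

# Rebuild the sample data shown above
df = pd.DataFrame({
    'x': [47, 58, 96, 79, 77, 66, 96, 56, 78, 22,
          75, 5, 83, 37, 53, 76, 64, 23, 10, 52],
    'y': [86, 9, 59, 6, 80, 91, 31, 8, 96, 60,
          82, 54, 31, 70, 67, 33, 67, 94, 1, 11],
    'z': [30, 28, 42, 45, 37, 35, 52, 26, 14, 13,
          9, 29, 40, 2, 66, 78, 81, 1, 31, 3],
    'letter': list('ebaedddeaedcecadbded'),
})
```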
Apply a groupby on the 'letter' column, and get the sum of column x for each letter:
df.groupby('letter').x.sum()
>>> a 227
b 122
c 42
d 465
e 297
Then, I sort to see the letters with the highest sum, and manually identify a threshold. In this example the threshold might be 200.
df.groupby('letter').x.sum().reset_index().sort_values('x', ascending=False)
>>> letter x
3 d 465
4 e 297
0 a 227
1 b 122
2 c 42
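As an aside, Series.nlargest gives the same at-a-glance view of the biggest groups without the reset_index/sort_values round trip (a minor convenience, not required for the question):

```python
import pandas as pd

# df as in the question (only the columns used here)
df = pd.DataFrame({
    'x': [47, 58, 96, 79, 77, 66, 96, 56, 78, 22,
          75, 5, 83, 37, 53, 76, 64, 23, 10, 52],
    'letter': list('ebaedddeaedcecadbded'),
})

top = df.groupby('letter').x.sum().nlargest(3)
# letter
# d    465
# e    297
# a    227
```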
Here's where I am stuck. In the original dataframe, I want to keep letters if the groupby sum of column 'x' > 200, and drop the other rows. So in this example, it would keep all the rows with d, e or a.
I was trying something like this but it doesn't work:
df.groupby('letter').x.sum().filter(lambda x: len(x) > 200)
And even if I filter the groupby object, how do I use it to filter the original dataframe?
Upvotes: 13
Views: 8958
Reputation: 214987
You can use groupby with transform to broadcast the per-letter sum of x back onto each row; comparing that against the threshold gives a boolean Series you can use to subset the original dataframe:
df1 = df[df.x.groupby(df.letter).transform('sum') > 200]
df1.letter.unique()
# array(['e', 'a', 'd'], dtype=object)
Or another option using groupby.filter:
df2 = df.groupby('letter').filter(lambda g: g.x.sum() > 200)
df2.letter.unique()
# array(['e', 'a', 'd'], dtype=object)
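A third option, which may be handy if you have already computed the per-letter sums (as in the question), is to select the qualifying letters from that Series and filter with isin (a sketch using the question's sample data):

```python
import pandas as pd

# df as in the question (only the columns used here)
df = pd.DataFrame({
    'x': [47, 58, 96, 79, 77, 66, 96, 56, 78, 22,
          75, 5, 83, 37, 53, 76, 64, 23, 10, 52],
    'letter': list('ebaedddeaedcecadbded'),
})

sums = df.groupby('letter').x.sum()   # per-letter totals
keep = sums[sums > 200].index         # letters passing the threshold
df3 = df[df.letter.isin(keep)]
df3.letter.unique()
# array(['e', 'a', 'd'], dtype=object)
```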
Upvotes: 12