Drop rows by number of occurrences of a value

Hi, I want to delete the rows whose value occurs fewer than a given number of times. For example:

import pandas as pd
df = pd.DataFrame({'a': [1,2,3,2], 'b':[4,5,6,7], 'c':[0,1,3,2]})
df

Here I want to delete every row whose value in column 'a' occurs fewer than two times.
Wanted output:

   a  b  c
1  2  5  1
3  2  7  2

What I know: condition = df['a'].value_counts() < 2 tells me which values occur fewer than two times, and it gives me something like:

2    False
3    True
1    True
Name: a, dtype: bool

But I don't know how to get from here to deleting the rows.
Thanks in advance!

Upvotes: 3

Answers (3)

Reputation: 164773

res = df[df.groupby('a')['b'].transform('size') >= 2]

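transform('size') computes the size of each group of 'a' and broadcasts it back onto the original index of df, so every row carries the number of times its value of 'a' occurs.

Alternatively, map each value of 'a' to its count from value_counts: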

s = df['a'].value_counts()
res = df[df['a'].map(s) >= 2]

print(res)

   a  b  c
1  2  5  1
3  2  7  2
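
As a rough check that the two approaches above build the same per-row counts, here is a minimal sketch on the question's data. The names sizes_transform and sizes_map are illustrative only and do not appear in the answer.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 2], 'b': [4, 5, 6, 7], 'c': [0, 1, 3, 2]})

# Group size broadcast back onto each row via groupby/transform
sizes_transform = df.groupby('a')['b'].transform('size')

# Same counts obtained by looking up each row's value in value_counts
sizes_map = df['a'].map(df['a'].value_counts())

print(sizes_transform.tolist())  # [1, 2, 1, 2]
print(sizes_map.tolist())        # [1, 2, 1, 2]

# Either Series can be compared with the threshold and used as a boolean mask
res = df[sizes_map >= 2]
print(res)
```

Both Series share df's index, which is why they can be used directly for boolean indexing.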

Upvotes: 2

Reputation: 2939

You could try something like this: get the length of each group, transform it back onto the original index, and index the df by the result.

df[df.groupby("a").transform(len)["b"] >= 2]


    a   b   c
1   2   5   1
3   2   7   2

Breaking it into individual steps you get:

df.groupby("a").transform(len)["b"]

0    1
1    2
2    1
3    2
Name: b, dtype: int64

These are the group sizes transformed back onto your original index. Comparing them with the threshold gives:

df.groupby("a").transform(len)["b"] >=2

0    False
1     True
2    False
3     True
Name: b, dtype: bool

This boolean Series is then used to index the original dataframe, keeping only the rows where it is True.
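
For comparison, the same selection can probably also be written with GroupBy.filter, which is a different technique from the transform approach in this answer: it keeps or drops whole groups according to a predicate. A minimal sketch, with kept as an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 2], 'b': [4, 5, 6, 7], 'c': [0, 1, 3, 2]})

# Keep only the groups of 'a' that contain at least two rows;
# the surviving rows keep their original index labels and order
kept = df.groupby('a').filter(lambda g: len(g) >= 2)
print(kept)
```

Calling a Python lambda once per group may be slower than transform on frames with many groups, so this is mainly a readability trade-off.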

Upvotes: 2

Reputation: 4526

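You can use df.where and then dropna. The condition has to be built per row by mapping each value of 'a' to its count: value_counts() on its own is indexed by the distinct values of 'a', not by row, so passing it to where directly only lines up when those values happen to match the row labels. Note that where fills the rejected rows with NaN, so the result is upcast to float.

df.where(df['a'].map(df['a'].value_counts()) >= 2).dropna()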

     a   b   c
1   2.0 5.0 1.0
3   2.0 7.0 2.0
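
A minimal sketch of the row-aligned condition behind a where/dropna approach, assuming the question's data. The names counts, per_row, and mask are illustrative. It also contrasts the result with plain boolean indexing, which keeps the integer dtypes.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 2], 'b': [4, 5, 6, 7], 'c': [0, 1, 3, 2]})

counts = df['a'].value_counts()   # indexed by the distinct values of 'a'
per_row = df['a'].map(counts)     # indexed like df: one count per row
mask = per_row >= 2

# where() replaces rejected rows with NaN (upcasting to float); dropna() removes them
via_where = df.where(mask).dropna()

# Boolean indexing selects the same rows without the float upcast
via_mask = df[mask]

print(via_where)
print(via_mask)
```

One caveat: dropna() would also discard rows that already contained NaN before filtering, so boolean indexing is the safer choice when the data may have missing values.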

Upvotes: 2