MaxB
MaxB

Reputation: 458

Remove all groups with more than N observations

If a value occurs more than two times in a column I want to drop every row that it occurs in.

The input df would look like:

Name   Num
  X     1
  X     2
  Y     3
  Y     4
  X     5

The output df would look like:

Name   Num
  Y     3
  Y     4

I know it is possible to remove duplicates, but that only works if I want to remove the first or last duplicate that is found, not the nth duplicate.

df = df.drop_duplicates(subset = ['Name'], drop='third')

This code is completely wrong but it helps explain what I was trying to do.

Upvotes: 0

Views: 164

Answers (2)

jezrael
jezrael

Reputation: 862571

Use GroupBy.cumcount for counter and filter all values less like 2:

df1 = df[df.groupby('Name').cumcount() < 3]
print (df1)
  Name  Num
0    X    1
1    X    2
2    Y    3
3    Y    4

Detail:

print (df.groupby('Name').cumcount())
0    0
1    1
2    0
3    1
4    2
dtype: int64

EDIT

Filter by GroupBy.transform and GroupBy.size:

df1 = df[df.groupby('Name')['Num'].transform('size') < 3]
print (df1)
  Name  Num
2    Y    3
3    Y    4

Upvotes: 4

BENY
BENY

Reputation: 323226

Using head

df.groupby('Name').head(2)
Out[375]: 
  Name  Num
0    X    1
1    X    2
2    Y    3
3    Y    4

s=df.groupby('Name').size()<=2
df.loc[df.Name.isin(s[s].index)]
Out[380]: 
  Name  Num
2    Y    3
3    Y    4

Upvotes: 4

Related Questions