Reputation: 2103
Let's say I am trying to find how many duplicates I have for a pair of values in a table. The columns are "A" and "B". I can do:
select A, B, count(*) as counter from table group by A, B
In fact, I could also do:
select A, B from (select A, B, count(*) as counter from table group by A, B) where counter >= 2
to only keep pairs that appear at least n times (n = 2 here).
How can I do the same in pandas?
I can do:
df.groupby(["A", "B"]).count()
but that gives me every group; I only want to limit the result to groups where the count is >= 2.
For example, if I have:
A B C
0 x a 1
1 x a 1
2 x b 2
3 y b 3
4 y a 1
I want to identify the first two rows, because groupby() gives a count of 2 for them (the pair (x, a) is repeated). I would like to do the same for any value of n, not just 2.
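For reference, a minimal sketch that builds this example frame (values taken from the table above):
import pandas as pd

# example data from the question
df = pd.DataFrame({"A": ["x", "x", "x", "y", "y"],
                   "B": ["a", "a", "b", "b", "a"],
                   "C": [1, 1, 2, 3, 1]})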
Upvotes: 0
Views: 40
Reputation: 323326
It seems you can just filter after groupby:
df.groupby(["A", "B"])['A'].count().loc[lambda x : x >= 2]
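Applied to the example data in the question, this should return a Series of counts indexed by the duplicated (A, B) pairs, roughly:
A  B
x  a    2
Name: A, dtype: int64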
Update: duplicated, with keep=False so that every duplicated row is kept:
df[df.duplicated(['A','B'],keep=False)]
Out[1178]:
A B C
0 x a 1
1 x a 1
transform, for a different n:
n=2
df[df.groupby(['A','B'])['A'].transform('count')==n]
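A self-contained sketch of this approach on the question's data (the frame is rebuilt here only so the snippet runs on its own):
import pandas as pd

df = pd.DataFrame({"A": ["x", "x", "x", "y", "y"],
                   "B": ["a", "a", "b", "b", "a"],
                   "C": [1, 1, 2, 3, 1]})

n = 2
# keep rows whose (A, B) pair occurs exactly n times; use >= n for "at least n"
print(df[df.groupby(['A', 'B'])['A'].transform('count') == n])
#    A  B  C
# 0  x  a  1
# 1  x  a  1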
Upvotes: 1