Reputation: 833
Suppose I have a DF
DF1:
IDField | Frame | Order
--------|-------|------
20 | 10_01 | 0
20 | 10_01 | 0
20 | 10_01 | 1
20 | 10_02 | 2
5 | 02_01 | 0
5 | 02_01 | 1
I want to find how many times IDField 20 in Frame 10_01 has duplicates in Order. In this example, Order is duplicated once. I'm not concerned with anything that is not a duplicate. I'd like my final output to look like this:
DF2:
IDField | Frame | Order | Duplicates
--------|-------|-------|-----------
20 | 10_01 | 0 | 1
I've tried grouping by and counting like so:
df2 = df1.groupby(['IDField', 'Frame', 'Order']).size().reset_index(name='Duplicates')
While this does give me counts, it gives me a count for every combination, not just the duplicated ones. Is there an elegant way to do this in one line, or do I have to create a DataFrame and process it afterwards?
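A minimal, runnable reconstruction of the setup (DF1 rebuilt by hand from the table above) showing what my attempt produces:
import pandas as pd

# DF1 rebuilt from the table above
df1 = pd.DataFrame({
    'IDField': [20, 20, 20, 20, 5, 5],
    'Frame':   ['10_01', '10_01', '10_01', '10_02', '02_01', '02_01'],
    'Order':   [0, 0, 1, 2, 0, 1],
})

# My attempt: counts every (IDField, Frame, Order) combination,
# including those that only appear once
df2 = df1.groupby(['IDField', 'Frame', 'Order']).size().reset_index(name='Duplicates')
print(df2)  # 5 rows, one per unique combination, not just the duplicated one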
Upvotes: 1
Views: 3025
Reputation: 402553
You'll need an initial filtering step with duplicated, then you can group on the columns and compute the size:
c = ['IDField', 'Frame', 'Order']
df[df.duplicated(c)].groupby(c).size().reset_index(name='Duplicates')
   IDField  Frame  Order  Duplicates
0       20  10_01      0           1
If duplicates should instead be determined by IDField and Order only (ignoring Frame), the same pattern works with a shorter key list:
c = ['IDField', 'Order']
df[df.duplicated(c)].groupby(c).size().reset_index(name='Duplicates')
   IDField  Order  Duplicates
0       20      0           1
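For completeness, here is a self-contained sketch that rebuilds DF1 from the question's table and runs the filter-then-group step end to end:
import pandas as pd

# DF1 from the question, rebuilt by hand
df = pd.DataFrame({
    'IDField': [20, 20, 20, 20, 5, 5],
    'Frame':   ['10_01', '10_01', '10_01', '10_02', '02_01', '02_01'],
    'Order':   [0, 0, 1, 2, 0, 1],
})

# Keep only rows that repeat an earlier (IDField, Frame, Order) combination,
# then count how many extra rows each combination has
c = ['IDField', 'Frame', 'Order']
df2 = df[df.duplicated(c)].groupby(c).size().reset_index(name='Duplicates')
print(df2)
#    IDField  Frame  Order  Duplicates
# 0       20  10_01      0           1
Note that duplicated defaults to keep='first', so each group's size is the number of repeats beyond the first occurrence, which is exactly the Duplicates column in the desired DF2.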
Upvotes: 2