AGH_TORN

Reputation: 833

Count number of duplicates within each group

Suppose I have a DF

DF1:
IDField | Frame | Order
--------|-------|------
   20   | 10_01 |   0
   20   | 10_01 |   0
   20   | 10_01 |   1
   20   | 10_02 |   2
   5    | 02_01 |   0
   5    | 02_01 |   1

I want to find how many times IDField 20 in Frame 10_01 has duplicates in Order. In this example, Order is duplicated once. I'm not concerned with anything that is not a duplicate. I'd like my final output to look like this:

DF2:
IDField | Frame | Order | Duplicates
--------|-------|-------|-----------
   20   | 10_01 |   0   |     1
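
For reproducibility, here's a small snippet that builds DF1 (a sketch; I'm assuming Frame is stored as a string):

import pandas as pd

df1 = pd.DataFrame({
    'IDField': [20, 20, 20, 20, 5, 5],
    'Frame': ['10_01', '10_01', '10_01', '10_02', '02_01', '02_01'],
    'Order': [0, 0, 1, 2, 0, 1],
})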

I've tried grouping by and counting like so:

df2 = df1.groupby(['IDField', 'Frame', 'Order']).size().reset_index(name='Duplicates')

While this does give me a count, it gives one for every group, not just the groups that contain duplicates. Is there an elegant way to do this in one line, or do I have to create a dataframe and process it afterwards?

Upvotes: 1

Views: 3025

Answers (1)

cs95

Reputation: 402553

You'll need an initial filtering step with duplicated, then you can group on the columns and compute the size:

c = ['IDField', 'Frame', 'Order']
df1[df1.duplicated(c)].groupby(c).size().reset_index(name='Duplicates')

   IDField  Frame  Order  Duplicates
0       20  10_01      0           1
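
Note that duplicated defaults to keep='first', so only the second and later occurrences within each group are marked; that's why the (20, 10_01, 0) group counts 1 rather than 2.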

c = ['IDField', 'Order']
df1[df1.duplicated(c)].groupby(c).size().reset_index(name='Duplicates')

   IDField  Order  Duplicates
0       20      0           1
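
If you'd rather skip the separate filtering step, an equivalent single expression (a sketch, not from the original answer, using the same df1) is to take each group's size minus one and keep only the positive counts:

c = ['IDField', 'Frame', 'Order']
(df1.groupby(c).size().sub(1)       # rows beyond the first occurrence in each group
    .loc[lambda s: s > 0]           # keep only groups that actually have duplicates
    .reset_index(name='Duplicates'))

This produces the same result as the filtered version, since size minus one is exactly the number of rows duplicated() would have marked in each group.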

Upvotes: 2
