Reputation: 696
I would like to get the number of duplicates for article_id
for each merchant_id
, where the zip_code
is not identical. Please see example below:
Table
merchant_id article_id zip_code
1 4555 1000
1 4555 1003
1 4555 1002
1 3029 1000
2 7539 1005
2 7539 1005
2 7539 1002
2 1232 1006
3 5555 1000
3 5555 1001
3 5555 1002
3 5555 1003
Output Table
merchant_id count_duplicate
1 3
2 2
3 4
So far I was able to return all duplicate rows - see code below:
df[df.duplicated('product_id',keep=False)==True
Upvotes: 0
Views: 160
Reputation: 42916
We can use groupby
with nunique
and then filter (query
) only the groups > 1
. nunique
stands for "number of unique values".
dfn = (
df.groupby(['merchant_id', 'article_id'])['zip_code'].nunique()
.reset_index(name='count_duplicate')
.query('count_duplicate > 1')
)
merchant_id article_id count_duplicate
1 1 4555 3
3 2 7539 2
4 3 5555 4
Upvotes: 4