Find duplicates in pandas with condition (Python)

Question

I would like to get the number of duplicates for article_id for each merchant_id, where the zip_code is not identical. Please see example below:

Table

merchant_id     article_id   zip_code 
1               4555         1000
1               4555         1003
1               4555         1002
1               3029         1000
2               7539         1005
2               7539         1005
2               7539         1002
2               1232         1006
3               5555         1000
3               5555         1001
3               5555         1002
3               5555         1003

Output Table

merchant_id     count_duplicate
1                3
2                2
3                4

So far I was able to return all duplicate rows - see code below:

df[df.duplicated('product_id',keep=False)==True

Erfan · Accepted Answer

We can use groupby with nunique and then filter (query) only the groups > 1. nunique stands for "number of unique values".

dfn = (
    df.groupby(['merchant_id', 'article_id'])['zip_code'].nunique()
    .reset_index(name='count_duplicate')
    .query('count_duplicate > 1')
)

   merchant_id  article_id  count_duplicate
1            1        4555                3
3            2        7539                2
4            3        5555                4

Find duplicates in pandas with condition (Python)

Answers (1)

Related Questions