Jarad
Jarad

Reputation: 18953

Python Pandas Identify Duplicate IDs That Have Conflicting Labels

I've been trying to figure this out for like 2 days. In my example below, do you see how the Ad state is BOTH paused and enabled for AD ID 13756121050 in the top two rows? Well, in my case, each Ad ID can only have either paused or enabled throughout the dataframe.

I'm trying to figure out how to make new DataFrame containing the conflicting rows where the SAME Ad ID has both paused and enabled Ad state.

The result would be the top two rows you see below because they share the same Ad ID and the have conflicting Ad states (paused, and enabled which is impossible for my purposes).

import pandas as pd
d = {'Ad state': {1092: 'enabled', 216: 'enabled', 1124: 'enabled', 213: 'paused', 1094: 'enabled', 1128: 'enabled', 1129: 'paused', 1131: 'paused', 236: 'enabled', 238: 'enabled'}, 'Based on': {1092: 'CTR', 216: 'Conv. rate', 1124: 'Conv. rate', 213: 'CTR', 1094: 'Conv. rate', 1128: 'CTR', 1129: 'CTR', 1131: 'Conv. rate', 236: 'CTR', 238: 'Conv. rate'}, 'Ad ID': {1092: 13756190410, 216: 13756121050, 1124: 13756111570, 213: 13756121050, 1094: 13756190410, 1128: 13756111570, 1129: 13756111690, 1131: 13756111690, 236: 13756120810, 238: 13756120810}}
df = pd.DataFrame(d, columns=['Ad ID', 'Ad state', 'Based on'])
#df prints this out
            Ad ID Ad state    Based on
213   13756121050   paused  Conv. rate
216   13756121050  enabled         CTR
236   13756120810  enabled         CTR
238   13756120810  enabled  Conv. rate
1092  13756190410  enabled         CTR
1094  13756190410  enabled  Conv. rate
1124  13756111570  enabled  Conv. rate
1128  13756111570  enabled         CTR
1129  13756111690   paused         CTR
1131  13756111690   paused  Conv. rate

pd.pivot_table(df, index='Ad ID', columns='Ad state', aggfunc='count', fill_value='')

#See how 13756121050 has BOTH enabled and paused status?
            Based on       
Ad state     enabled paused
Ad ID                      
13756111570        2       
13756111690               2
13756120810        2       
13756121050        1      1
13756190410        2       

Upvotes: 0

Views: 162

Answers (2)

Alexander
Alexander

Reputation: 109666

If you want to have a field in your original database to indicated whether or not the data is consistent, you can use transform.

df['consistent'] = df.groupby(['Ad_ID'])['Ad_state'].transform(lambda x: x.nunique() <= 1)

>>> df
            Ad_ID Ad_state    Based_on consistent
213   13756121050   paused  Conv. rate      False
216   13756121050  enabled         CTR      False
236   13756120810  enabled         CTR       True
238   13756120810  enabled  Conv. rate       True
1092  13756190410  enabled         CTR       True
1094  13756190410  enabled  Conv. rate       True
1124  13756111570  enabled  Conv. rate       True
1128  13756111570  enabled         CTR       True
1129  13756111690   paused         CTR       True
1131  13756111690   paused  Conv. rate       True

Upvotes: 1

Jianxun Li
Jianxun Li

Reputation: 24752

You can use .groupby with .filter, where the filter rule applies to each subgroup.

df.groupby('Ad ID').filter(lambda g: len(g['Ad state'].unique()) > 1)

           Ad ID Ad state    Based on
213  13756121050   paused         CTR
216  13756121050  enabled  Conv. rate

Upvotes: 4

Related Questions