Python Pandas Identify Duplicate IDs That Have Conflicting Labels

Question

I've been trying to figure this out for like 2 days. In my example below, do you see how the Ad state is BOTH paused and enabled for AD ID 13756121050 in the top two rows? Well, in my case, each Ad ID can only have either paused or enabled throughout the dataframe.

I'm trying to figure out how to make new DataFrame containing the conflicting rows where the SAME Ad ID has both paused and enabled Ad state.

The result would be the top two rows you see below because they share the same Ad ID and the have conflicting Ad states (paused, and enabled which is impossible for my purposes).

import pandas as pd
d = {'Ad state': {1092: 'enabled', 216: 'enabled', 1124: 'enabled', 213: 'paused', 1094: 'enabled', 1128: 'enabled', 1129: 'paused', 1131: 'paused', 236: 'enabled', 238: 'enabled'}, 'Based on': {1092: 'CTR', 216: 'Conv. rate', 1124: 'Conv. rate', 213: 'CTR', 1094: 'Conv. rate', 1128: 'CTR', 1129: 'CTR', 1131: 'Conv. rate', 236: 'CTR', 238: 'Conv. rate'}, 'Ad ID': {1092: 13756190410, 216: 13756121050, 1124: 13756111570, 213: 13756121050, 1094: 13756190410, 1128: 13756111570, 1129: 13756111690, 1131: 13756111690, 236: 13756120810, 238: 13756120810}}
df = pd.DataFrame(d, columns=['Ad ID', 'Ad state', 'Based on'])
#df prints this out
            Ad ID Ad state    Based on
213   13756121050   paused  Conv. rate
216   13756121050  enabled         CTR
236   13756120810  enabled         CTR
238   13756120810  enabled  Conv. rate
1092  13756190410  enabled         CTR
1094  13756190410  enabled  Conv. rate
1124  13756111570  enabled  Conv. rate
1128  13756111570  enabled         CTR
1129  13756111690   paused         CTR
1131  13756111690   paused  Conv. rate

pd.pivot_table(df, index='Ad ID', columns='Ad state', aggfunc='count', fill_value='')

#See how 13756121050 has BOTH enabled and paused status?
            Based on       
Ad state     enabled paused
Ad ID                      
13756111570        2       
13756111690               2
13756120810        2       
13756121050        1      1
13756190410        2

Jianxun Li · Accepted Answer

You can use .groupby with .filter, where the filter rule applies to each subgroup.

df.groupby('Ad ID').filter(lambda g: len(g['Ad state'].unique()) > 1)

           Ad ID Ad state    Based on
213  13756121050   paused         CTR
216  13756121050  enabled  Conv. rate

Python Pandas Identify Duplicate IDs That Have Conflicting Labels

Answers (2)

Related Questions