Reputation: 2439
I have some duplicates in my data that I need to correct.
This is a sample of a dataframe:
import pandas as pd

test = pd.DataFrame({'event_id': ['1', '1', '2', '3', '5', '6', '9', '3', '9', '10'],
                     'user_id': [0, 0, 0, 1, 1, 3, 3, 4, 4, 4],
                     'index': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
I need to select all the rows that have equal values in event_id but differing values in user_id. I tried this (based on a similar question, but one with no accepted answer):
test.groupby('event_id').filter(lambda g: len(g) > 1).drop_duplicates(subset=['event_id', 'user_id'], keep="first")
out:

  event_id  user_id  index
0        1        0     10
3        3        1     40
6        9        3     70
7        3        4     80
8        9        4     90
But I do not need the first row, where the duplicated event_id (1) belongs to the same user_id (0).
The second part of the question is: what is the best way to correct the duplicate records? How could I add a suffix (_new) to event_id, but only in these rows:
  event_id  user_id  index
3    3_new        1     40
6    9_new        3     70
7        3        4     80
8        9        4     90
Upvotes: 1
Views: 2449
Reputation: 323226
Hmm, let me try to fix your code: keep only the groups that occur more than once and in which every row has a distinct user_id.
test.groupby('event_id').filter(
    lambda x: (len(x['event_id']) == x['user_id'].nunique()) & (len(x['event_id']) > 1))
Out[85]:

  event_id  user_id  index
3        3        1     40
6        9        3     70
7        3        4     80
8        9        4     90
To correct the duplicate rows, you can create a new sub-key instead; personally, I would not recommend modifying your original columns.

test['subkey'] = test.groupby('event_id').cumcount()
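On the sample data this numbers each occurrence of an event_id (0 for the first, 1 for the second, and so on), so the pair (event_id, subkey) is unique and duplicates can be told apart without rewriting event_id. A quick check against the test frame from the question:

# rows with subkey > 0 are the repeated occurrences of an event_id
print(test[test['subkey'] > 0])
#   event_id  user_id  index  subkey
# 1        1        0     20       1
# 7        3        4     80       1
# 8        9        4     90       1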
Upvotes: 2
Reputation: 153460
Try:
test[test.duplicated(['event_id'], keep=False) &
     ~test.duplicated(['event_id', 'user_id'], keep=False)]
Output:
  event_id  user_id  index
3        3        1     40
6        9        3     70
7        3        4     80
8        9        4     90
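The first mask flags every row whose event_id occurs more than once; the second removes the rows where the whole (event_id, user_id) pair repeats, leaving only the genuine conflicts. For the second part of the question, a minimal sketch that reuses the same masks (my own suggestion, not something this answer shows) would tag only the first occurrence of each conflicting event_id:

# rows whose event_id repeats with differing user_ids (same masks as above)
conflict = (test.duplicated(['event_id'], keep=False) &
            ~test.duplicated(['event_id', 'user_id'], keep=False))

# of those, keep only the first occurrence of each event_id
to_rename = conflict & ~test.duplicated(['event_id'], keep='first')

# event_id is a string column in the sample, so appending a suffix works
test.loc[to_rename, 'event_id'] += '_new'

This turns rows 3 and 6 into 3_new and 9_new while leaving rows 7 and 8 unchanged, matching the expected output in the question.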
Upvotes: 1