Reputation: 2439
I have some duplicates in my data that I need to correct.
This is a sample of a dataframe:
import pandas as pd

test = pd.DataFrame({'event_id': ['1', '1', '2', '3', '5', '6', '9', '3', '9', '10'],
                     'user_id': [0, 0, 0, 1, 1, 3, 3, 4, 4, 4],
                     'index': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
I need to select all the rows that have equal values in event_id but differing values in user_id. I tried this (based on a similar question, but one with no accepted answer):
test.groupby('event_id').filter(lambda g: len(g) > 1).drop_duplicates(subset=['event_id', 'user_id'], keep="first")
out:

  event_id  user_id  index
0        1        0     10
3        3        1     40
6        9        3     70
7        3        4     80
8        9        4     90
But I do not need the first row, where the duplicated event_id (1) belongs to the same user_id (0).
The second part of the question is: what is the best way to correct the duplicate records? How could I add a suffix (_new) to event_id, but only in these rows:
  event_id  user_id  index
3    3_new        1     40
6    9_new        3     70
7        3        4     80
8        9        4     90
Upvotes: 1
Views: 2449
Reputation: 323226
Hmm, let me try to fix your code: keep only the groups that occur more than once and in which every row has a distinct user_id.
test.groupby('event_id').filter(
    lambda x: (len(x['event_id']) == x['user_id'].nunique()) & (len(x['event_id']) > 1))
Out[85]:

  event_id  user_id  index
3        3        1     40
6        9        3     70
7        3        4     80
8        9        4     90
To correct the duplicate rows, you can create a new sub-key instead; personally, I would not recommend modifying your original columns.

test['subkey'] = test.groupby('event_id').cumcount()
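On the sample data this numbers each occurrence of an event_id (0 for the first, 1 for the second, and so on), so the pair (event_id, subkey) is unique and duplicates can be told apart without rewriting event_id. A quick check against the test frame from the question:

# rows with subkey > 0 are the repeated occurrences of an event_id
print(test[test['subkey'] > 0])
#   event_id  user_id  index  subkey
# 1        1        0     20       1
# 7        3        4     80       1
# 8        9        4     90       1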
Upvotes: 2
Reputation: 153460
Try:
test[test.duplicated(['event_id'], keep=False) &
     ~test.duplicated(['event_id', 'user_id'], keep=False)]
Output:
  event_id  user_id  index
3        3        1     40
6        9        3     70
7        3        4     80
8        9        4     90
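The first mask flags every row whose event_id occurs more than once; the second removes the rows where the whole (event_id, user_id) pair repeats, leaving only the genuine conflicts. For the second part of the question, a minimal sketch that reuses the same masks (my own suggestion, not something this answer shows) would tag only the first occurrence of each conflicting event_id:

# rows whose event_id repeats with differing user_ids (same masks as above)
conflict = (test.duplicated(['event_id'], keep=False) &
            ~test.duplicated(['event_id', 'user_id'], keep=False))

# of those, keep only the first occurrence of each event_id
to_rename = conflict & ~test.duplicated(['event_id'], keep='first')

# event_id is a string column in the sample, so appending a suffix works
test.loc[to_rename, 'event_id'] += '_new'

This turns rows 3 and 6 into 3_new and 9_new while leaving rows 7 and 8 unchanged, matching the expected output in the question.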
Upvotes: 1