Emm
Emm

Reputation: 2507

Adding unique identifiers to duplicate values in pandas dataframe

I would like to create unique identifiers for values that are duplicates. Values that are duplicates are only 0's. The idea is to convert each zero to zero plus its position (0+1 for first row, 0+2 for second row etc). However the problem is the column also has other non duplicate values.

I have written this line of code to try and convert the zero values as stated but I am getting this error message

TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('

Here is my code

seller_customer['customer_id'] = np.where(seller_customer['customer_id']==0, seller_customer['customer_id'] + seller_customer.groupby(['customer_id']).cumcount().replace('0',''))

Here is a sample of my data

{0: '7e468d618e16c6e1373fb2c4a522c969',
 1: '1c14a115bead8a332738c5d7675cca8c',
 2: '434dee65d973593dbb8461ba38202798',
 3: '4bbeac9d9a22f0628ba712b90862df28',
 4: '578d5098cbbe40771e1229fea98ccafd',
 5:  0,
 6:  0,
 7:  0}

Upvotes: 0

Views: 1099

Answers (2)

arinarmo
arinarmo

Reputation: 375

You can do something like this:

    from pandas.util import hash_pandas_object
    import numpy as np
    df.x = np.where(df.x == 0, hash_pandas_object(df.x), df.x)
    df

Output:

                                      x
    0  7e468d618e16c6e1373fb2c4a522c969
    1  1c14a115bead8a332738c5d7675cca8c
    2  434dee65d973593dbb8461ba38202798
    3  4bbeac9d9a22f0628ba712b90862df28
    4  578d5098cbbe40771e1229fea98ccafd
    5                593769213749726025
    6              14559158595676751865
    7               4575103004772269825

They won't be sequential like the index but they will be unique (almost certainly, unless you encounter a hash collision)

Upvotes: 0

perl
perl

Reputation: 9941

If I understand correctly, you can just assign range values to those ids that are 0:

df.loc[df['id']==0, 'id'] = np.arange((df['id']==0).sum()) + 1

print(df)

Output:

                                 id
0  7e468d618e16c6e1373fb2c4a522c969
1  1c14a115bead8a332738c5d7675cca8c
2  434dee65d973593dbb8461ba38202798
3  4bbeac9d9a22f0628ba712b90862df28
4  578d5098cbbe40771e1229fea98ccafd
5                                 1
6                                 2
7                                 3

Or a shorter but slightly slower:

df.loc[df['id']==0, 'id'] = (df['id']==0).cumsum()

Upvotes: 2

Related Questions