Reputation: 2507
I would like to create unique identifiers for values that are duplicates. The only duplicate values are 0s. The idea is to convert each zero to zero plus its position (0+1 for the first zero row, 0+2 for the second, etc.). However, the problem is that the column also contains other, non-duplicate values.
I have written this line of code to try to convert the zero values as described, but I am getting this error message:
TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('
Here is my code
seller_customer['customer_id'] = np.where(seller_customer['customer_id']==0, seller_customer['customer_id'] + seller_customer.groupby(['customer_id']).cumcount().replace('0',''))
Here is a sample of my data
{0: '7e468d618e16c6e1373fb2c4a522c969',
1: '1c14a115bead8a332738c5d7675cca8c',
2: '434dee65d973593dbb8461ba38202798',
3: '4bbeac9d9a22f0628ba712b90862df28',
4: '578d5098cbbe40771e1229fea98ccafd',
5: 0,
6: 0,
7: 0}
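For context, the TypeError can be reproduced without `np.where` at all: the column mixes hex-string ids with integer zeros, so adding the integer `cumcount` to it attempts `str + int`, which NumPy/pandas cannot dispatch. (A separate issue is that `np.where` requires three arguments, `condition, x, y`, while the line above passes only two.) A minimal sketch, assuming a frame shaped like the sample (the frame construction here is hypothetical):

```python
import pandas as pd

# Hypothetical reconstruction of the sample: an object column that
# mixes hex-string ids with integer zeros.
seller_customer = pd.DataFrame({
    "customer_id": [
        "7e468d618e16c6e1373fb2c4a522c969",
        "1c14a115bead8a332738c5d7675cca8c",
        0, 0, 0,
    ]
})

# str + int is undefined, so adding the integer cumcount to the mixed
# column raises a TypeError like the one in the question.
err = None
try:
    seller_customer["customer_id"] + seller_customer.groupby("customer_id").cumcount()
except TypeError as e:
    err = e
```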
Upvotes: 0
Views: 1099
Reputation: 375
You can do something like this:
from pandas.util import hash_pandas_object
import numpy as np
df.x = np.where(df.x == 0, hash_pandas_object(df.x), df.x)
df
Output:
x
0 7e468d618e16c6e1373fb2c4a522c969
1 1c14a115bead8a332738c5d7675cca8c
2 434dee65d973593dbb8461ba38202798
3 4bbeac9d9a22f0628ba712b90862df28
4 578d5098cbbe40771e1229fea98ccafd
5 593769213749726025
6 14559158595676751865
7 4575103004772269825
They won't be sequential like the index, but they will be unique (almost certainly, barring a hash collision).
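As a self-contained sketch of this approach (the frame construction is hypothetical, mirroring the question's sample): `hash_pandas_object` hashes each row together with its index label by default, so equal values at different positions still receive distinct hashes.

```python
import numpy as np
import pandas as pd
from pandas.util import hash_pandas_object

# Hypothetical reconstruction of the sample column.
df = pd.DataFrame({"x": [
    "7e468d618e16c6e1373fb2c4a522c969",
    "1c14a115bead8a332738c5d7675cca8c",
    0, 0, 0,
]})

# Replace only the zeros with their per-row hashes; other values
# pass through unchanged.
df["x"] = np.where(df["x"] == 0, hash_pandas_object(df["x"]), df["x"])
```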
Upvotes: 0
Reputation: 9941
If I understand correctly, you can just assign range values to those ids that are 0:
df.loc[df['id']==0, 'id'] = np.arange((df['id']==0).sum()) + 1
print(df)
Output:
id
0 7e468d618e16c6e1373fb2c4a522c969
1 1c14a115bead8a332738c5d7675cca8c
2 434dee65d973593dbb8461ba38202798
3 4bbeac9d9a22f0628ba712b90862df28
4 578d5098cbbe40771e1229fea98ccafd
5 1
6 2
7 3
Or, a shorter but slightly slower alternative:
df.loc[df['id']==0, 'id'] = (df['id']==0).cumsum()
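Both variants can be checked end to end on a small frame shaped like the question's data (the frame here is hypothetical); the `cumsum` version works because `(df['id']==0).cumsum()` counts zeros cumulatively, and `.loc` aligns it back onto only the masked rows.

```python
import numpy as np
import pandas as pd

# Variant 1: assign a 1-based range to the zero rows.
df = pd.DataFrame({"id": ["a1", "b2", 0, 0, 0]})
mask = df["id"] == 0
df.loc[mask, "id"] = np.arange(mask.sum()) + 1  # zeros become 1, 2, 3

# Variant 2: cumulative count of zeros, aligned onto the zero rows.
df2 = pd.DataFrame({"id": ["a1", "b2", 0, 0, 0]})
df2.loc[df2["id"] == 0, "id"] = (df2["id"] == 0).cumsum()
```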
Upvotes: 2