Reputation: 557
i am creating a function to set an UUID column base on the values of other columns. What i want is to handle duplicates when indexind dataframes into Elasticsearch. The UUID should be always the same based on the value of several columns.
I am having problems with the output, the same UUID is generated for each row.
Dataframe
cols = ['col1', 'col2']
data = {'col1': ['Mike','Robert','Sandy'],
'col2': ['100','200','300']}
col1 col2
0 Mike 100
1 Robert 200
2 Sandy 300
Function
def create_uuid_on_n_col (df):
# concat column string values
concat_col_str_id = df.apply(lambda x: uuid.uuid5(uuid.NAMESPACE_DNS,'_'.join(map(str, x))), axis=1)
return concat_col_str_id[0]
Output
df['id'] = create_uuid_2_col(df[['col1','col2']])
col1 col2 id
0 Mike 100 a17ad043-486f-5eeb-8138-8fa2b10659fd
1 Robert 200 a17ad043-486f-5eeb-8138-8fa2b10659fd
2 Sandy 300 a17ad043-486f-5eeb-8138-8fa2b10659fd
Upvotes: 1
Views: 1512
Reputation: 14216
There's no need to define another helper function. We can also vectorize the joining of the columns as shown below.
from functools import partial
p = partial(uuid.uuid5, uuid.NAMESPACE_DNS)
df.assign(id=(df.col1 + '_' + df.col2).apply(p))
col1 col2 id
0 Mike 100 a17ad043-486f-5eeb-8138-8fa2b10659fd
1 Robert 200 e520efd5-157a-57ee-84fb-41b9872af407
2 Sandy 300 11208b7c-b99b-5085-ad98-495004e6b043
If you don't want to import partial then define a function.
def custom_uuid(data):
val = uuid.uuid5(uuid.NAMESPACE_DNS, data)
return val
df.assign(id=(df.col1 + '_' + df.col2).apply(custom_uuid))
Using your original function as shown below.
def create_uuid_on_n_col(df):
temp = df.agg('_'.join, axis=1)
return df.assign(id=temp.apply(custom_uuid))
create_uuid_on_n_col(df[['col1','col2']])
col1 col2 id
0 Mike 100 a17ad043-486f-5eeb-8138-8fa2b10659fd
1 Robert 200 e520efd5-157a-57ee-84fb-41b9872af407
2 Sandy 300 11208b7c-b99b-5085-ad98-495004e6b043
Upvotes: 2