Vince
Vince

Reputation: 557

Unique UUID base on n columns of Pandas dataframe (to handle duplicates on ElasticSearch)

i am creating a function to set an UUID column base on the values of other columns. What i want is to handle duplicates when indexind dataframes into Elasticsearch. The UUID should be always the same based on the value of several columns.

I am having problems with the output, the same UUID is generated for each row.

Dataframe

cols = ['col1', 'col2']
data = {'col1': ['Mike','Robert','Sandy'],
        'col2': ['100','200','300']}


        col1 col2  
    0    Mike  100 
    1  Robert  200 
    2   Sandy  300

Function

def create_uuid_on_n_col (df):
    # concat column string values 
    concat_col_str_id = df.apply(lambda x: uuid.uuid5(uuid.NAMESPACE_DNS,'_'.join(map(str, x))), axis=1)
    return concat_col_str_id[0]

Output

df['id'] = create_uuid_2_col(df[['col1','col2']])

     col1 col2                                    id
0    Mike  100  a17ad043-486f-5eeb-8138-8fa2b10659fd
1  Robert  200  a17ad043-486f-5eeb-8138-8fa2b10659fd
2   Sandy  300  a17ad043-486f-5eeb-8138-8fa2b10659fd

Upvotes: 1

Views: 1512

Answers (1)

gold_cy
gold_cy

Reputation: 14216

There's no need to define another helper function. We can also vectorize the joining of the columns as shown below.

from functools import partial

p = partial(uuid.uuid5, uuid.NAMESPACE_DNS)

df.assign(id=(df.col1 + '_' + df.col2).apply(p))

     col1  col2                                    id
0    Mike   100  a17ad043-486f-5eeb-8138-8fa2b10659fd
1  Robert   200  e520efd5-157a-57ee-84fb-41b9872af407
2   Sandy   300  11208b7c-b99b-5085-ad98-495004e6b043

If you don't want to import partial then define a function.

def custom_uuid(data):
    val = uuid.uuid5(uuid.NAMESPACE_DNS, data)
    return val

df.assign(id=(df.col1 + '_' + df.col2).apply(custom_uuid))

Using your original function as shown below.

def create_uuid_on_n_col(df):
    temp = df.agg('_'.join, axis=1)
    return df.assign(id=temp.apply(custom_uuid))

create_uuid_on_n_col(df[['col1','col2']])

     col1 col2                                    id
0    Mike  100  a17ad043-486f-5eeb-8138-8fa2b10659fd
1  Robert  200  e520efd5-157a-57ee-84fb-41b9872af407
2   Sandy  300  11208b7c-b99b-5085-ad98-495004e6b043

Upvotes: 2

Related Questions