How to use UUID5 or UUID3 in PySpark while adding a column?

Question

I want to add a column that holds a unique identifier for each value in another column. The identifier can look random, but it must be the same on every run.

For example, if my df contains the names Harry and Misha, then on every run the value for Harry should be xxyyzz and the value for Misha should be gfddhh. It should not produce a new random value each time.

I used the code below, but it's not working:

df = df.withColumn("Unique_Name",uuid.uuid5(uuid.NAMESPACE_DNS,'Name'))

It gives an error. I also tried converting the result to str, but the error is the same.

Any help would be much appreciated.

werner · Accepted Answer

You can use a UDF:

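from pyspark.sql import functions as F
import uuid

# Runs once per row; F.udf returns a string column by default.
@F.udf
def create_uuid(name):
    # uuid5 hashes namespace + name, so the same name always gives the same UUID.
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, name))

# Pass the column name; Spark feeds each row's value to the UDF.
df.withColumn("Unique_Name", create_uuid("Name")).show(truncate=False)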

Output:

+-----+------------------------------------+
|name |Unique_Name                         |
+-----+------------------------------------+
|Harry|195d4fd9-f3e0-50fb-8e70-fa31078bdbf9|
|Misha|1f473375-c514-5495-8ef3-a079548ffcfd|
+-----+------------------------------------+
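The attempt in the question fails because uuid.uuid5(uuid.NAMESPACE_DNS,'Name') runs once in plain Python, hashes the literal string 'Name', and returns a uuid.UUID object, while withColumn expects a Column. The UDF instead applies the hash to each row's value. As a rough sketch of the per-row logic outside Spark, here is a plain Python helper; name_to_uuid and its version parameter are hypothetical names introduced for illustration, not part of the answer above. It also covers UUID3 and null input, on the assumption that nulls should stay null.

```python
import uuid


def name_to_uuid(name, version=5, namespace=uuid.NAMESPACE_DNS):
    """Return a deterministic UUID string for `name`, or None for null input.

    Hypothetical helper for illustration only.
    """
    if name is None:
        # Spark hands nulls to Python UDFs as None; uuid5/uuid3 would raise on it.
        return None
    if version == 5:
        return str(uuid.uuid5(namespace, name))  # SHA-1 based
    if version == 3:
        return str(uuid.uuid3(namespace, name))  # MD5 based
    raise ValueError("version must be 3 or 5")


# The same input gives the same output on every call and every run.
print(name_to_uuid("Harry") == name_to_uuid("Harry"))  # True
print(name_to_uuid("Harry") == name_to_uuid("Misha"))  # False
print(name_to_uuid("Harry", version=3) == name_to_uuid("Harry"))  # False
print(name_to_uuid(None))  # None
```

To use it in Spark, it should be possible to wrap the helper the same way as in the answer, for example F.udf(name_to_uuid, "string"), and call the result on the Name column.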
