Reputation: 90
I want to add a column that holds a unique value for each value in an existing column. The generated value must be fixed, i.e. identical on every run.
For example, I have a df as:
So on every run the value for Harry should be xxyyzz and the value for Misha should be gfddhh. It should not produce a new random value each time.
I used the code below, but it's not working:
df = df.withColumn("Unique_Name",uuid.uuid5(uuid.NAMESPACE_DNS,'Name'))
It gives an error. I also tried converting the result to str, but the error is the same.
Any help would be much appreciated.
Upvotes: 1
Views: 1105
Reputation: 14845
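You can use a UDF. withColumn expects a Column expression, but uuid.uuid5(uuid.NAMESPACE_DNS,'Name') returns a plain Python UUID object computed once from the literal string 'Name', not from the column values. Wrapping the call in a UDF applies it to each row: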
from pyspark.sql import functions as F
import uuid

# uuid5 is name-based, so the same name always maps to the same UUID
@F.udf
def create_uuid(name):
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, name))

df.withColumn("Unique_Name", create_uuid('Name')).show(truncate=False)
Output:
+-----+------------------------------------+
|name |Unique_Name                         |
+-----+------------------------------------+
|Harry|195d4fd9-f3e0-50fb-8e70-fa31078bdbf9|
|Misha|1f473375-c514-5495-8ef3-a079548ffcfd|
+-----+------------------------------------+
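The values stay the same on every run because uuid5 is name-based: it hashes the namespace together with the name, so nothing random is involved. A minimal plain-Python sketch of that check, outside Spark; create_uuid here is an undecorated copy of the function above, and first_run / second_run are just illustrative names:

```python
import uuid


def create_uuid(name: str) -> str:
    """Return a name-based (version 5) UUID string for the given name."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, name))


names = ("Harry", "Misha")

# Compute the IDs twice to mimic two separate runs.
first_run = [create_uuid(n) for n in names]
second_run = [create_uuid(n) for n in names]

print(first_run == second_run)        # same names give the same UUIDs
print(first_run[0] != first_run[1])   # different names give different UUIDs
```

Note that the UUID depends only on the name, so two rows with the same name get the same Unique_Name.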
Upvotes: 1