How to add a new column with random chars to pyspark dataframe

Question

I am trying to add a new column containing a random 8-character string to every row of a Spark DataFrame.

Function to generate the 8-character string:

import random
import string

def id(size=8, chars=string.ascii_lowercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

My Spark DataFrame:

from pyspark.sql.functions import lit

columns = ["Seqno","Name"]
data = [("1", "john jones"),
    ("2", "tracey smith"),
    ("3", "amy sanders")]

df = spark.createDataFrame(data=data,schema=columns)

df = df.withColumn("randomid", lit(id()))
df.show(truncate=False)

But with the above code, the same random id is repeated on every row. Any pointers on how to get a different value for each row?

+-----+------------+--------------------------------+
|Seqno|Name        |randomid                        |
+-----+------------+--------------------------------+
|1    |john jones  |uz6iugmraripznyzizt1ymvbs8gi2qv8|
|2    |tracey smith|uz6iugmraripznyzizt1ymvbs8gi2qv8|
|3    |amy sanders |uz6iugmraripznyzizt1ymvbs8gi2qv8|
+-----+------------+--------------------------------+

过过招 · Accepted Answer

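lit(id()) calls id() once on the driver and embeds that single result as a constant, which is why every row shows the same value. You can use Spark's built-in uuid function instead, which is evaluated for each row, and then remove the - characters from it. Note that the result is a 32-character hexadecimal string rather than 8 characters.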

from pyspark.sql import functions as F

df = df.withColumn("randomid", F.expr('replace(uuid(), "-", "")'))
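
If the id has to be exactly 8 characters, there are two ways this could be approached. The sketch below is a rough illustration under stated assumptions, not a definitive implementation: the names random_id, with_random_id and with_uuid_prefix are hypothetical and not part of PySpark, and the pyspark imports are placed inside the functions only so the plain-Python helper can be tried without Spark installed. The first variant wraps the question's generator in a UDF marked non-deterministic, so it runs per row; the second truncates the uuid() result, which yields only hexadecimal characters. Neither variant guarantees uniqueness, since 8 random characters can collide on a large table.

```python
import random
import string


def random_id(size=8, chars=string.ascii_lowercase + string.digits):
    """Return a random string of `size` characters drawn from `chars`."""
    return "".join(random.choice(chars) for _ in range(size))


def with_random_id(df, column="randomid", size=8):
    """Add `column` to `df`, calling random_id once per row via a UDF."""
    # Imported lazily (an assumption of this sketch) so random_id works without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # asNondeterministic() tells the optimizer the result may differ between calls.
    random_id_udf = F.udf(lambda: random_id(size), StringType()).asNondeterministic()
    return df.withColumn(column, random_id_udf())


def with_uuid_prefix(df, column="randomid", size=8):
    """Add `column` holding the first `size` hex characters of a per-row uuid()."""
    from pyspark.sql import functions as F

    return df.withColumn(
        column, F.expr(f"substring(replace(uuid(), '-', ''), 1, {int(size)})")
    )
```

With the DataFrame from the question, either df = with_random_id(df) or df = with_uuid_prefix(df) followed by df.show(truncate=False) should show a different randomid on each row. The built-in expression avoids the Python UDF overhead, while the UDF keeps the original alphabet of lowercase letters and digits.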
