Learn4556

Reputation: 19

How to add a new column with random characters to a PySpark DataFrame

I am trying to add a new column containing a random 8-character string to every row of a Spark DataFrame.

Function to generate the 8-character string:

import random
import string
def id(size=8, chars=string.ascii_lowercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

My Spark DataFrame:

from pyspark.sql.functions import lit

columns = ["Seqno","Name"]
data = [("1", "john jones"),
    ("2", "tracey smith"),
    ("3", "amy sanders")]

df = spark.createDataFrame(data=data,schema=columns)

df = df.withColumn("randomid", lit(id()))
df.show(truncate=False)

But with the above code, the same random id is repeated in every row. Any pointers on how to make it unique for each row?

+-----+------------+--------------------------------+
|Seqno|Name        |randomid                        |
+-----+------------+--------------------------------+
|1    |john jones  |uz6iugmraripznyzizt1ymvbs8gi2qv8|
|2    |tracey smith|uz6iugmraripznyzizt1ymvbs8gi2qv8|
|3    |amy sanders |uz6iugmraripznyzizt1ymvbs8gi2qv8|
+-----+------------+--------------------------------+

Upvotes: 0

Views: 2356

Answers (2)

ARCrow

Reputation: 1858

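You can use the shuffle function: shuffle an array of the allowed characters, take the first 8 elements, and join them. Because this draws without replacement, no character repeats within a single string: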

import string
import pyspark.sql.functions as f
source_characters = string.ascii_letters + string.digits

df = spark.createDataFrame([
    ("1", "john jones"),
    ("2", "tracey smith"),
    ("3", "amy sanders")
], ['Seqno', 'Name'])

df = (df
      .withColumn('source_characters', f.split(f.lit(source_characters), ''))
      .withColumn('random_string', f.concat_ws('', f.slice(f.shuffle(f.col('source_characters')), 1, 8)))
      .drop('source_characters')
)

df.show()

The output looks like:

+-----+------------+-------------+
|Seqno|        Name|random_string|
+-----+------------+-------------+
|    1|  john jones|     f8yWABgY|
|    2|tracey smith|     Xp6idNb7|
|    3| amy sanders|     zU8aSN4C|
+-----+------------+-------------+
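
If characters should be allowed to repeat, as in the question's own function, a Python UDF is one possible alternative. For context, lit(id()) in the question calls the Python function once on the driver and embeds that single string as a literal, which is why every row shows the same value. The sketch below is only an illustration under some assumptions: random_id, ALPHABET and add_random_id are hypothetical names (the question's function is renamed to avoid shadowing the built-in id), and a working PySpark installation is assumed for the DataFrame part.

```python
import random
import string

# Assumed alphabet, matching the one used in the question.
ALPHABET = string.ascii_lowercase + string.digits


def random_id(size=8, chars=ALPHABET):
    # Each character is drawn independently, so repeats within one id are possible.
    return "".join(random.choice(chars) for _ in range(size))


def add_random_id(df, column="randomid"):
    # Hypothetical helper. pyspark is imported here so that random_id above
    # can be tried without a Spark installation.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # The UDF runs once per row on the executors. asNondeterministic() tells
    # the optimizer not to assume repeated evaluations return the same value.
    random_id_udf = udf(lambda: random_id(), StringType()).asNondeterministic()
    return df.withColumn(column, random_id_udf())
```

Usage would be df = add_random_id(df). Random strings like these are very unlikely to collide for small DataFrames, but they are not guaranteed to be unique, and a Python UDF is generally slower than built-in functions such as shuffle.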

Upvotes: 1

过过招

Reputation: 4189

You can use the uuid function to generate a string and then remove the hyphens from it. Note that the result is 32 hexadecimal characters long, not 8.

from pyspark.sql import functions as F
df = df.withColumn("randomid", F.expr('replace(uuid(), "-", "")'))
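
If exactly 8 characters are needed, one option is to keep only the first 8 characters of the hyphen-free UUID with the Spark SQL substring function. This is a minimal sketch rather than a definitive solution: short_hex_id and add_short_uuid are hypothetical helper names, and the pure-Python function is included only to show what the SQL expression produces per row.

```python
import uuid


def short_hex_id(size=8):
    # Plain-Python analogue of the SQL expression below:
    # uuid4().hex is a random UUID without hyphens (32 hex characters).
    return uuid.uuid4().hex[:size]


def add_short_uuid(df, column="randomid", size=8):
    # Hypothetical helper. pyspark is imported here so that short_hex_id
    # above can be tried without a Spark installation.
    from pyspark.sql import functions as F

    # uuid() is evaluated per row; replace() strips the hyphens and
    # substring() keeps the first `size` characters.
    expression = f"substring(replace(uuid(), '-', ''), 1, {int(size)})"
    return df.withColumn(column, F.expr(expression))
```

The trade-off is that the alphabet shrinks to the 16 hexadecimal characters, so an 8-character id has far fewer possible values than a full UUID and duplicates become possible on large DataFrames.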

Upvotes: 4
