Reputation: 19
I am trying to add a new column containing a random 8-character string to every row of a Spark DataFrame.
Function to generate the 8-character string:
import random
import string
def id(size=8, chars=string.ascii_lowercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))
My Spark DataFrame:
from pyspark.sql.functions import lit
columns = ["Seqno", "Name"]
data = [("1", "john jones"),
        ("2", "tracey smith"),
        ("3", "amy sanders")]
df = spark.createDataFrame(data=data, schema=columns)
df = df.withColumn("randomid", lit(id()))
df.show(truncate=False)
With the code above, the same random id is repeated on every row. How can I get a unique value for each row?
+-----+------------+--------------------------------+
|Seqno|Name        |randomid                        |
+-----+------------+--------------------------------+
|1    |john jones  |uz6iugmraripznyzizt1ymvbs8gi2qv8|
|2    |tracey smith|uz6iugmraripznyzizt1ymvbs8gi2qv8|
|3    |amy sanders |uz6iugmraripznyzizt1ymvbs8gi2qv8|
+-----+------------+--------------------------------+
Upvotes: 0
Views: 2356
Reputation: 1858
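You can use the shuffle transformation: split the source characters into an array, shuffle it for each row, and join the first 8 elements: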
import string
import pyspark.sql.functions as f
source_characters = string.ascii_letters + string.digits
df = spark.createDataFrame([
    ("1", "john jones"),
    ("2", "tracey smith"),
    ("3", "amy sanders")
], ['Seqno', 'Name'])
df = (df
    .withColumn('source_characters', f.split(f.lit(source_characters), ''))
    .withColumn('random_string', f.concat_ws('', f.slice(f.shuffle(f.col('source_characters')), 1, 8)))
    .drop('source_characters')
)
df.show()
The output looks like:
+-----+------------+-------------+
|Seqno| Name|random_string|
+-----+------------+-------------+
| 1| john jones| f8yWABgY|
| 2|tracey smith| Xp6idNb7|
| 3| amy sanders| zU8aSN4C|
+-----+------------+-------------+
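Note that shuffle picks characters without replacement, so no character repeats within one string. If repeats should be allowed, as in the generator from the question, one alternative is to wrap that generator in a UDF. The original `lit(id())` calls the Python function once on the driver and embeds that single value as a literal, which is why every row got the same string; a UDF is evaluated per row instead. This is only a sketch: `random_id` and `add_random_id` are names made up for illustration, and the PySpark import is done lazily inside the helper so the generator can be tried without a Spark session.

```python
import random
import string


def random_id(size=8, chars=string.ascii_lowercase + string.digits):
    """Return a random string of `size` characters drawn (with repeats) from `chars`."""
    return "".join(random.choice(chars) for _ in range(size))


def add_random_id(df, column="randomid"):
    """Hypothetical helper: add a per-row random string column to a Spark DataFrame."""
    # Imported here so this module can be loaded without PySpark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # asNondeterministic() tells the optimizer that repeated calls give
    # different results, so the call is not treated as a reusable constant.
    random_id_udf = udf(random_id, StringType()).asNondeterministic()
    return df.withColumn(column, random_id_udf())
```

With the DataFrame from the question this would be used as `df = add_random_id(df)`. Random strings are not guaranteed to be unique, and a Python UDF is usually slower than built-in column functions, so this fits best when the exact character set matters.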
Upvotes: 1
Reputation: 4189
You can use the uuid function to generate a string, and then remove the - characters from it.
import pyspark.sql.functions as F
df = df.withColumn("randomid", F.expr('replace(uuid(), "-", "")'))
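The expression above yields a 32-character hexadecimal string per row. If exactly 8 characters are needed, as the question asks, one option is to wrap it in `substring`. The sketch below is an assumption-laden illustration rather than part of the original answer: `SHORT_UUID_EXPR`, `add_short_uuid`, and `short_uuid_preview` are hypothetical names, and the preview function only mimics in plain Python what the SQL expression produces for a single row.

```python
import uuid

# Spark SQL expression: strip the dashes from uuid(), then keep the first 8 characters.
SHORT_UUID_EXPR = "substring(replace(uuid(), '-', ''), 1, 8)"


def add_short_uuid(df, column="randomid"):
    """Hypothetical helper: add an 8-character id column derived from uuid()."""
    # Imported lazily so the module loads even where PySpark is not installed.
    import pyspark.sql.functions as F

    return df.withColumn(column, F.expr(SHORT_UUID_EXPR))


def short_uuid_preview():
    """Plain-Python equivalent of the value the expression yields for one row."""
    return uuid.uuid4().hex[:8]
```

The result contains only the characters 0-9 and a-f, and 8 hex characters carry just 32 bits, so collisions become likely on large DataFrames. Keep the full 32 characters if uniqueness matters more than length.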
Upvotes: 4