Reputation: 165
I have a Spark DataFrame which looks something like this.
df.show()
| Id |
|----|
| 1 |
| 2 |
| 3 |
Now I want to add a few columns with random integers assigned to them. I'm using the following UDF to do that (I know we don't need a UDF for this). This is my code.
import random
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

random_udf = udf(lambda: random.randint(0, 1000), IntegerType())
df = df.withColumn("test_int", random_udf())
df.show()
| Id | test_int |
|----|----------|
| 1 | 51 |
| 2 | 111 |
| 3 | 552 |
Now if I add another column and display it, the values in the 'test_int' column change.
df = df.withColumn("test_int1", random_udf())
df.show()
| Id | test_int | test_int1 |
|----|----------|-----------|
| 1 | 429 | 429 |
| 2 | 307 | 307 |
| 3 | 69 | 69 |
I realized that maybe Spark is evaluating the DataFrame again at the second display statement, so I added a persist statement to my code. Now my code looks like this.
df = df.withColumn("test_int", random_udf()).persist()
df.rdd.count() ## To kick off the evaluation
df.show()
| Id | test_int |
|----|----------|
| 1 | 459 |
| 2 | 552 |
| 3 | 89 |
df = df.withColumn("test_int1", random_udf())
df.show()
| Id | test_int | test_int1 |
|----|----------|-----------|
| 1 | 459 | 459 |
| 2 | 552 | 552 |
| 3 | 89 | 89 |
No matter what I do, both columns seem to have the same value. I'm looking for an explanation for this behavior. I'm working in an Azure Databricks notebook (PySpark 2.4.4).
Upvotes: 0
Views: 2233
Reputation: 1962
Two points here:
1. You need to understand that computers don't really do random numbers. What's happening here is that a seed gets set for your random_udf() - once this seed is set, the "random" result will be repeated again and again, because you're asking it to do the same thing. In data science this is very important, as it makes your work deterministic and your experiments repeatable. See numpy.random.seed (https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.seed.html) and random.seed for more info.
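To see the effect of a seed in plain Python (a minimal sketch, nothing Spark-specific; the seed value 42 is arbitrary):

import random

random.seed(42)              # fix the seed
a = random.randint(0, 1000)
random.seed(42)              # re-set the same seed
b = random.randint(0, 1000)
print(a == b)                # True: the same seed reproduces the same "random" value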
2. You should not really be using a udf for something like this. There is a perfectly good (and parallelised) pyspark.sql.functions.rand for this, which allows you to set a seed. See here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.rand
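For example, something along these lines should give you two reproducible columns without a UDF (a sketch; the seeds 42 and 7 are arbitrary, and the floor/cast is there to mirror your randint(0, 1000) range):

from pyspark.sql.functions import rand, floor

# rand(seed) yields a uniform double in [0, 1); scale and floor it
# to get integers in 0..1000, matching randint(0, 1000)
df = df.withColumn("test_int", floor(rand(seed=42) * 1001).cast("int"))
df = df.withColumn("test_int1", floor(rand(seed=7) * 1001).cast("int"))

Since each column has its own fixed seed, the two columns differ from each other but should stay the same across repeated df.show() calls (given the same partitioning).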
Upvotes: 4