Kas1

Reputation: 165

Understanding lazy evaluation behavior in PySpark

I have a Spark DataFrame which looks something like this:

df.show()

| Id |
|----|
| 1  |
| 2  |
| 3  |

Now I want to add a few columns with random integers assigned to them. I'm using the following UDF to do that (I know we don't need a UDF for this). This is my code:

import random
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

random_udf = udf(lambda: random.randint(0, 1000), IntegerType())

df = df.withColumn("test_int", random_udf())
df.show()
| Id | test_int |
|----|----------|
| 1  | 51       |
| 2  | 111      |
| 3  | 552      |

Now if I add another column and display it, the values in the 'test_int' column change:

df = df.withColumn("test_int1", random_udf())
df.show()
| Id | test_int | test_int1 |
|----|----------|-----------|
| 1  | 429      | 429       |
| 2  | 307      | 307       |
| 3  | 69       | 69        |

I realized that maybe Spark was re-evaluating the dataframe at the second show() call, so I added a persist() to my code. Now my code looks like this:

df = df.withColumn("test_int", random_udf()).persist()
df.rdd.count()  ## To kick off the evaluation
df.show()
| Id | test_int |
|----|----------|
| 1  | 459      |
| 2  | 552      |
| 3  | 89       |

df = df.withColumn("test_int1", random_udf())
df.show()
| Id | test_int | test_int1 |
|----|----------|-----------|
| 1  | 459      | 459       |
| 2  | 552      | 552       |
| 3  | 89       | 89        |

No matter what I do, both columns seem to have the same values. I'm looking for an explanation of this behavior. I'm working in an Azure Databricks notebook (PySpark 2.4.4).

Upvotes: 0

Views: 2233

Answers (1)

Napoleon Borntoparty

Reputation: 1962

Two points here:

  1. You need to understand that computers don't really produce random numbers. What's happening here is that a seed gets set for your random_udf(); once that seed is set, the "random" sequence is repeated again and again, because you're asking it to do the same thing. In data science this is very important, as it makes your work deterministic and your experiments repeatable. See numpy.random.seed (https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.seed.html) and random.seed for more info.

  2. You should not really be using a UDF for something like this. There is a perfectly good (and parallelised) pyspark.sql.functions.rand for this, which lets you set a seed, as shown in the sketch below. See here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.rand
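A minimal sketch of that approach (assuming a SparkSession named spark is already available, as in a Databricks notebook; the seeds 42 and 43 are just illustrative):

from pyspark.sql import functions as F

df = spark.createDataFrame([(1,), (2,), (3,)], ["Id"])

# rand(seed) draws uniform values in [0.0, 1.0); scaling and casting
# yields integers in 0..999. Distinct seeds give independent columns,
# and fixing the seeds makes the output reproducible across runs.
df = (df
      .withColumn("test_int", (F.rand(seed=42) * 1000).cast("int"))
      .withColumn("test_int1", (F.rand(seed=43) * 1000).cast("int")))
df.show()

Because rand is a native Column expression, it is evaluated inside the JVM, without the Python serialization overhead that a Python UDF incurs.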

Upvotes: 4
