Kas1

Reputation: 165

Understanding lazy evaluation behavior in PySpark

I have a Spark DataFrame which looks something like this:

df.show()

| Id |
|----|
| 1  |
| 2  |
| 3  |

Now I want to add a few columns with random integers assigned to them. I'm using the following UDF to do that (I know we don't need a UDF for this). This is my code:

import random
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

random_udf = udf(lambda: random.randint(0, 1000), IntegerType())

df = df.withColumn("test_int", random_udf())
df.show()
| Id | test_int |
|----|----------|
| 1  | 51       |
| 2  | 111      |
| 3  | 552      |

Now if I add another column and display it, the values in the 'test_int' column change:

df = df.withColumn("test_int1", random_udf())
df.show()
| Id | test_int | test_int1 |
|----|----------|-----------|
| 1  | 429      | 429       |
| 2  | 307      | 307       |
| 3  | 69       | 69        |

I realized that maybe Spark was re-evaluating the dataframe at the second show() call, so I added a persist() to my code. Now my code looks like this:

df = df.withColumn("test_int", random_udf()).persist()
df.rdd.count()  ## To kick off the evaluation
df.show()
| Id | test_int |
|----|----------|
| 1  | 459      |
| 2  | 552      |
| 3  | 89       |

df = df.withColumn("test_int1", random_udf())
df.show()
| Id | test_int | test_int1 |
|----|----------|-----------|
| 1  | 459      | 459       |
| 2  | 552      | 552       |
| 3  | 89       | 89        |

No matter what I do, both columns seem to have the same values. I'm looking for an explanation of this behavior. I'm working in an Azure Databricks notebook (PySpark 2.4.4).

Upvotes: 0

Views: 2233

Answers (1)

Napoleon Borntoparty

Reputation: 1962

Two points here:

  1. You need to understand that computers don't really produce random numbers. What's happening here is that a seed gets set for your random_udf(); once that seed is set, the "random" sequence is repeated again and again, because you're asking it to do the same thing. In data science this is very important, as it makes your work deterministic and your experiments repeatable. See numpy.random.seed (https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.seed.html) and random.seed for more info.

  2. You should not really be using a UDF for something like this. There is a perfectly good (and parallelised) pyspark.sql.functions.rand for this, which lets you set a seed, as shown in the sketch below. See here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.rand
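A minimal sketch of that approach (assuming a SparkSession named spark is already available, as in a Databricks notebook; the seeds 42 and 43 are just illustrative):

from pyspark.sql import functions as F

df = spark.createDataFrame([(1,), (2,), (3,)], ["Id"])

# rand(seed) draws uniform values in [0.0, 1.0); scaling and casting
# yields integers in 0..999. Distinct seeds give independent columns,
# and fixing the seeds makes the output reproducible across runs.
df = (df
      .withColumn("test_int", (F.rand(seed=42) * 1000).cast("int"))
      .withColumn("test_int1", (F.rand(seed=43) * 1000).cast("int")))
df.show()

Because rand is a native Column expression, it is evaluated inside the JVM, without the Python serialization overhead that a Python UDF incurs.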

Upvotes: 4
