Dongqing

Reputation: 686

How to calculate the max result size of Spark Driver

Recently I got an error saying that "spark.driver.maxResultSize" was exceeded. I am using pyspark on YARN in client mode. The code below generates random fake data for testing.

from pyspark.sql import functions as f

# Build one large DataFrame by unioning 2000 DataFrames of 10,000 rows each
new_df = None
for i in range(2000):
    df = spark.range(0, 10000)
    # Random "duration" per user_id, drawn from a scaled absolute normal distribution
    temp = df.select(f.col("id").alias("user_id"),
                     f.round(1000 * f.abs(f.randn(seed=27))).alias("duration"))
    if new_df is None:
        new_df = temp
    else:
        new_df = new_df.union(temp)

I tried increasing the max result size to 15 GB to make it work, but I am not sure why it required so much memory. Is there a guide on how to calculate the size of the result set?

Upvotes: 2

Views: 2868

Answers (1)

Ged

Reputation: 18053

My impression is that this code is all being executed on the driver, not on the worker(s): e.g. the for loop and the df statements. That is different from, say, reading from Hive or JDBC via DataFrameReader.

The docs state:

spark.driver.maxResultSize (default: 1g) - Limit of total size of serialized results of all partitions for each Spark action (e.g. collect) in bytes. Should be at least 1M, or 0 for unlimited. Jobs will be aborted if the total size is above this limit. Having a high limit may cause out-of-memory errors in driver (depends on spark.driver.memory and memory overhead of objects in JVM). Setting a proper limit can protect the driver from out-of-memory errors.
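If you do decide to raise the limit, as far as I know it has to be in place when the SparkSession/SparkContext is created (e.g. via --conf on spark-submit or at session creation), not changed mid-session. The snippet below is only an illustrative sketch; the 4g value and the app name are arbitrary assumptions, not recommendations:

from pyspark.sql import SparkSession

# Illustrative only: raise spark.driver.maxResultSize at session creation.
# The 4g value is an arbitrary example, not a recommendation.
spark = (SparkSession.builder
         .appName("fake-data-generation")
         .config("spark.driver.maxResultSize", "4g")
         .getOrCreate())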

You might want to look at these for guidance: How to use spark to generate huge amount of random integers? and how to make rdd tuple list in spark? They show how to distribute the generation work across the cluster. You can also increase spark.driver.maxResultSize if you wish to collect to the driver, which I would not recommend.
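As a rough sketch of the distributed approach (my own example, not code from those answers): generate all the rows with a single spark.range call so the data is produced on the executors, instead of unioning 2000 separate plans built up on the driver:

from pyspark.sql import functions as f

# Assumed equivalent of the original loop: 2000 copies of user_ids 0..9999,
# generated as one distributed range instead of 2000 unions.
n_copies, n_users = 2000, 10000
new_df = (spark.range(0, n_copies * n_users)
          .select((f.col("id") % n_users).alias("user_id"),
                  f.round(1000 * f.abs(f.randn(seed=27))).alias("duration")))

This keeps the query plan small rather than building a deeply nested union on the driver.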

Upvotes: 1
