Dongqing

Reputation: 686

How to calculate the max result size of Spark Driver

Recently I got an error saying that "spark.driver.maxResultSize" was exceeded. I am using pyspark on YARN in client mode. The code below generates random fake data for testing.

from pyspark.sql import functions as f

# Build one large DataFrame by unioning 2000 DataFrames of 10,000 rows each
new_df = None
for i in range(2000):
    df = spark.range(0, 10000)
    # Random "duration" per user_id, drawn from a scaled absolute normal distribution
    temp = df.select(f.col("id").alias("user_id"),
                     f.round(1000 * f.abs(f.randn(seed=27))).alias("duration"))
    if new_df is None:
        new_df = temp
    else:
        new_df = new_df.union(temp)

I tried increasing the max result size to 15 GB to make it work, but I am not sure why it required so much memory. Is there a guide on how to calculate the size of the result set?

Upvotes: 2

Views: 2868

Answers (1)

Ged

Reputation: 18053

My impression is that this code is all being executed on the driver, not on the worker(s): e.g. the for loop and the df statements. That is different from, say, reading from Hive or JDBC via DataFrameReader.

The docs state:

spark.driver.maxResultSize (default: 1g) - Limit of total size of serialized results of all partitions for each Spark action (e.g. collect) in bytes. Should be at least 1M, or 0 for unlimited. Jobs will be aborted if the total size is above this limit. Having a high limit may cause out-of-memory errors in driver (depends on spark.driver.memory and memory overhead of objects in JVM). Setting a proper limit can protect the driver from out-of-memory errors.
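If you do decide to raise the limit, as far as I know it has to be in place when the SparkSession/SparkContext is created (e.g. via --conf on spark-submit or at session creation), not changed mid-session. The snippet below is only an illustrative sketch; the 4g value and the app name are arbitrary assumptions, not recommendations:

from pyspark.sql import SparkSession

# Illustrative only: raise spark.driver.maxResultSize at session creation.
# The 4g value is an arbitrary example, not a recommendation.
spark = (SparkSession.builder
         .appName("fake-data-generation")
         .config("spark.driver.maxResultSize", "4g")
         .getOrCreate())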

You might want to look at these for guidance: How to use spark to generate huge amount of random integers? and how to make rdd tuple list in spark? They show how to distribute the generation work across the cluster. You can also increase spark.driver.maxResultSize if you wish to collect to the driver, which I would not recommend.
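As a rough sketch of the distributed approach (my own example, not code from those answers): generate all the rows with a single spark.range call so the data is produced on the executors, instead of unioning 2000 separate plans built up on the driver:

from pyspark.sql import functions as f

# Assumed equivalent of the original loop: 2000 copies of user_ids 0..9999,
# generated as one distributed range instead of 2000 unions.
n_copies, n_users = 2000, 10000
new_df = (spark.range(0, n_copies * n_users)
          .select((f.col("id") % n_users).alias("user_id"),
                  f.round(1000 * f.abs(f.randn(seed=27))).alias("duration")))

This keeps the query plan small rather than building a deeply nested union on the driver.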

Upvotes: 1
