Reputation: 31
I have come across some abnormal behaviour.
I have a query (inside a loop) with inner joins over 5 tables: one is around 200 MB and all the others are under 10 MB (all are persisted at the start of each loop iteration and unpersisted at the end).
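For context, here is a minimal sketch of that pattern (the paths, table names and join key are made up for illustration; only the join-inside-a-loop and persist/unpersist structure is from my actual job):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("looped-joins").getOrCreate()

for (batch <- 1 to 10) {
  // one ~200 MB table plus four small (<10 MB) tables
  val big    = spark.read.parquet(s"/data/big/batch=$batch")
  val smalls = Seq("dim1", "dim2", "dim3", "dim4").map(n => spark.read.parquet(s"/data/$n"))

  // persisted at the start of the loop iteration
  (big +: smalls).foreach(_.persist(StorageLevel.MEMORY_AND_DISK))

  // inner joins over the 5 tables; "key" is a hypothetical join column
  val joined = smalls.foldLeft(big)((acc, d) => acc.join(d, Seq("key")))
  joined.write.mode("overwrite").parquet(s"/out/batch=$batch")

  // unpersisted at the end of the loop iteration
  (big +: smalls).foreach(_.unpersist())
}
```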
Whenever spark.sql.autoBroadcastJoinThreshold is enabled (I tried the default, 5 MB, 1 MB and 100 KB), running the same query multiple times keeps increasing driver memory usage until it eventually fails with an out-of-memory error (WARN TaskMemoryManager: Failed to allocate a page (16777216 bytes), try again.).
But if I run the same thing with spark.sql.autoBroadcastJoinThreshold=-1, it works without any issues.
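For reference, a sketch of how I toggle the setting at runtime (assuming spark is the active SparkSession; the same values can also be passed with --conf at submit time):

```scala
// Disable automatic broadcast joins entirely (the setting that works for me):
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Or keep broadcasting but cap it explicitly, e.g. at 5 MB (value is in bytes):
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (5L * 1024 * 1024).toString)
```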
My Spark (2.0.0) configuration is:
driver memory: 10g, executor memory: 20g, cores: 3, nodes: 5
(I suspect I'm allocating more resources than needed, but it doesn't work even if I reduce executor memory to 4g; it gets through the same number of iterations regardless of the memory configuration.)
PS: I am not creating any broadcast variables manually, and I am new to Spark.
Upvotes: 3
Views: 4317
Reputation: 2825
Looking at the stack trace, it appears the dataset being broadcast is around 16 MB, so you might want to set the broadcast threshold higher than 16 MB to see if that works.
The other option, which you have already mentioned, is to disable broadcasting entirely, but you would want to check the performance of your SQL to see whether there is any adverse impact.
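For example (a sketch only: joinedDf below stands in for the DataFrame your query produces, spark is your active SparkSession, and the threshold value is in bytes):

```scala
// Raise the threshold above the ~16 MB broadcast seen in the warning, e.g. to 32 MB:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (32L * 1024 * 1024).toString)

// Then compare the physical plans: with the higher threshold the small tables
// should appear under BroadcastHashJoin, while with -1 the join falls back to
// SortMergeJoin, which avoids the driver-side broadcast but is often slower.
joinedDf.explain()
```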
Upvotes: 0