Reputation: 1237
In doing Spark performance tuning, I've found (unsurprisingly) that doing broadcast joins eliminates shuffles and improves performance. I've been experimenting with broadcasting on larger joins, and I've been able to successfully use far larger broadcast joins that I expected -- e.g. broadcasting a 2GB compressed (and much larger uncompressed) dataset, running on a 60-node cluster with 30GB memory/node.
However, I have concerns about putting this into production, as the size of our data fluctuates, and I'm wondering what will happen if the broadcast becomes "too large". I'm imagining two scenarios:
A) Data is too big to fit in memory, so some of it gets written to disk, and performance degrades slightly. This would be okay. Or,
B) Data is too big to fit in memory, so it throws an OutOfMemoryError and crashes the whole application. Not so okay.
So my question is: What happens when a Spark broadcast join is too large?
Upvotes: 4
Views: 2012
Reputation: 36
Broadcast variables are plain local objects and excluding distribution and serialization they the behave as any other object you use. If they don't fit into memory you'll get OOM. Other than memory paging there is no magic that can prevent that.
So broadcasting is not applicable for objects that may not fit into memory (and leave a lot of free memory for standard Spark operations).
Upvotes: 2