Reputation: 21599
I have some really big read-only data that I want all the executors on the same node to use. Is that possible in Spark? I know you can broadcast variables, but can you broadcast really big arrays? Under the hood, does it share the data between executors on the same node? How can it share data between the JVMs of the executors running on the same node?
Upvotes: 16
Views: 8861
Reputation: 1748
I assume you are asking how executors can share mutable state. If you only need to share immutable data, then you can just refer to @Stanislav's answer.
If you need mutable state between executors, there are quite a few approaches:
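As a rough illustration of one commonly used option (this is my own sketch, not necessarily one of the approaches the answer goes on to list): a Spark accumulator gives tasks on any executor a shared, write-only counter whose aggregated value is read back on the driver.

```scala
import org.apache.spark.sql.SparkSession

object AccumulatorExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("AccumulatorExample").getOrCreate()
    val sc = spark.sparkContext

    // A long accumulator: tasks running on any executor can add to it,
    // and the driver reads the aggregated result afterwards.
    val badRecords = sc.longAccumulator("badRecords")

    val lines = sc.parallelize(Seq("1", "2", "oops", "4"))

    val parsed = lines.flatMap { s =>
      try Some(s.toInt)
      catch { case _: NumberFormatException => badRecords.add(1); None }
    }

    println(s"sum = ${parsed.sum()}, bad records = ${badRecords.value}")
    spark.stop()
  }
}
```

Note that accumulators only cover the "write from tasks, read on the driver" pattern; fully shared read/write state generally needs an external store.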
Upvotes: 0
Reputation: 1724
Yes, you can use broadcast variables, considering your data is read-only (immutable). The broadcast variable must satisfy the following properties.
So the only condition here is that your data has to fit in memory on one node. That means it should not be anything super large, such as a massive table that exceeds the memory limits.
Each executor receives a copy of the broadcast variable, and all the tasks in that particular executor read/use that data. It is like sending large, read-only data to all the worker nodes in the cluster, i.e., it is shipped to each worker only once instead of with each task, and the executors (their tasks) read the data.
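A minimal sketch of that pattern (the lookup map and names here are made up for illustration; the point is that `sc.broadcast` ships one copy per executor and every task reads it via `.value`):

```scala
import org.apache.spark.sql.SparkSession

object BroadcastExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("BroadcastExample").getOrCreate()
    val sc = spark.sparkContext

    // A large, read-only lookup table built on the driver.
    // (Tiny here for illustration; in practice it can be big,
    // as long as it fits in each executor's memory.)
    val countryNames: Map[String, String] =
      Map("US" -> "United States", "DE" -> "Germany", "IN" -> "India")

    // Broadcast it once; Spark ships one copy to each executor
    // instead of serializing it with every task.
    val countryNamesBc = sc.broadcast(countryNames)

    val codes = sc.parallelize(Seq("US", "IN", "DE", "US"))

    // All tasks running in the same executor read the same local copy.
    val resolved = codes.map(code => countryNamesBc.value.getOrElse(code, "unknown"))

    resolved.collect().foreach(println)
    spark.stop()
  }
}
```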
Upvotes: 9