user045213

Reputation: 41

apache spark executors and data locality

The Spark documentation says:

Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads.

If I understand this right, in static allocation the executors are acquired by the Spark application when the SparkContext is created, on all nodes in the cluster (in cluster mode). I have a couple of questions:

  1. If executors are acquired on all nodes and stay allocated to this application for the duration of the whole application, isn't there a chance that a lot of nodes remain idle?

  2. What is the advantage of acquiring resources when the SparkContext is created rather than in the DAGScheduler? The application could be arbitrarily long, and it is just holding the resources.

  3. So when the DAGScheduler tries to get the preferred locations and the executors on those nodes are busy running tasks, would it relinquish the executors on other nodes?

I have checked a related question, Does Spark on yarn deal with Data locality while launching executors, but I'm not sure there is a conclusive answer.

Upvotes: 2

Views: 638

Answers (1)

Avishek Bhattacharya

Reputation: 6964

  1. If executors are acquired on all nodes and stay allocated to this application for the duration of the whole application, isn't there a chance that a lot of nodes remain idle?

Yes, there is a chance; if you have data skew, this will happen. The challenge is to tune the number of executors and executor cores so that you get maximum utilization. Spark also provides dynamic resource allocation, which removes idle executors and gives their resources back to the cluster.
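For illustration, dynamic allocation might be enabled like this (a minimal sketch; the timeout and executor counts are placeholder values, not recommendations):

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: release idle executors back to the cluster.
    // Timeout and executor counts below are illustrative, not tuned.
    val spark = SparkSession.builder()
      .appName("dynamic-allocation-sketch")
      .config("spark.dynamicAllocation.enabled", "true")
      // Executors idle longer than this are removed (default 60s).
      .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
      .config("spark.dynamicAllocation.minExecutors", "2")
      .config("spark.dynamicAllocation.maxExecutors", "20")
      // On YARN, the external shuffle service keeps shuffle files
      // available after an executor is removed.
      .config("spark.shuffle.service.enabled", "true")
      .getOrCreate()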

  2. What is the advantage of acquiring resources when the SparkContext is created rather than in the DAGScheduler? The application could be arbitrarily long, and it is just holding the resources.

Spark tries to keep data in memory across transformations, in contrast to the MapReduce model, which writes to disk after every map operation. Spark can keep data in memory only if it can ensure that subsequent code runs on the same machines, and this is the reason for allocating resources beforehand.
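As an example, the following sketch (the HDFS path is an assumption) shows why long-lived executors matter: cached partitions live in executor memory, so the second action can be served without touching disk:

    import org.apache.spark.sql.SparkSession

    // Minimal sketch; the input path is a made-up example.
    // cache() pins partitions in executor memory, so the second
    // action reuses them instead of re-reading from disk. That only
    // works because the executors holding those partitions stay
    // alive for the whole application.
    val spark = SparkSession.builder().appName("cache-sketch").getOrCreate()
    val logs = spark.read.textFile("hdfs:///data/app/logs.txt").cache()

    val errorCount = logs.filter(_.contains("ERROR")).count() // first action materializes the cache
    val warnCount  = logs.filter(_.contains("WARN")).count()  // served from cached partitions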

  3. So when the DAGScheduler tries to get the preferred locations and the executors on those nodes are busy running tasks, would it relinquish the executors on other nodes?

Spark can't start a task on an executor unless the executor is free. The Spark application master negotiates with YARN to get the preferred locations; it may or may not get them. If it doesn't, it will start the task on a different executor.
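This fallback is tunable through spark.locality.wait, which controls how long the scheduler waits for a free slot at a task's preferred location before launching it on a less local executor. A minimal sketch (the value is illustrative; the default is 3s):

    import org.apache.spark.sql.SparkSession

    // Minimal sketch; 10s is an illustrative value, not a recommendation.
    // Larger values trade scheduling latency for better data locality;
    // setting it to 0 disables the wait entirely.
    val spark = SparkSession.builder()
      .appName("locality-wait-sketch")
      .config("spark.locality.wait", "10s")
      .getOrCreate()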

Upvotes: 2
