Reputation: 29655
The Google MapReduce paper said that workers were scheduled on the same node where the data resided, or at least in the same rack when possible. I haven't read through the entire Hadoop documentation, but I assume it moves the computation to the data where possible, rather than the data to the computation.
(When I first learned about Hadoop, all data from HDFS to the workers had to go through a TCP connection, even when the worker was on the same node as the data. Is this still the case?)
In any event, with Apache Spark, do workers get scheduled on the same nodes as the data, or does the RDD concept make it harder to do that?
Upvotes: 3
Views: 278
Reputation: 330163
Generally speaking, it depends. Spark recognizes multiple levels of locality (including PROCESS_LOCAL, NODE_LOCAL, and RACK_LOCAL) and tries to schedule tasks to achieve the best possible locality level. See Data Locality in Tuning Spark.
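The RDD abstraction doesn't get in the way here: every RDD can report preferred locations for each of its partitions (an RDD read from HDFS, for instance, derives them from the block locations), and the scheduler uses those hints. As a minimal sketch, SparkContext.makeRDD even lets you attach location preferences by hand; the hostnames below are placeholders, and this assumes a running cluster with those workers:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("locality-prefs"))

// makeRDD accepts (value, preferred hostnames) pairs; the scheduler
// tries to run each partition's task on one of the listed hosts and
// degrades through the locality levels if it can't.
val rdd = sc.makeRDD(Seq(
  (1, Seq("worker-1.example.com")),  // partition 0 prefers worker-1
  (2, Seq("worker-2.example.com"))   // partition 1 prefers worker-2
))
```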
Exact behavior can be controlled using the spark.locality.* properties, including the amount of time the scheduler waits for free resources at a given locality level before falling back to a node with lower locality. See Scheduling in Spark Configuration.
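For example, here is a minimal sketch of setting those wait times programmatically; the keys come from Spark Configuration, and the values are just the documented defaults, not recommendations:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// How long the scheduler waits for a slot at each locality level
// before falling back to the next, less local one.
val conf = new SparkConf()
  .setAppName("locality-tuning")             // hypothetical app name
  .set("spark.locality.wait", "3s")          // shared default for the levels below
  .set("spark.locality.wait.process", "3s")  // wait for PROCESS_LOCAL
  .set("spark.locality.wait.node", "3s")     // wait for NODE_LOCAL
  .set("spark.locality.wait.rack", "3s")     // wait for RACK_LOCAL

val sc = new SparkContext(conf)
```

Setting spark.locality.wait to 0 effectively disables the wait, so tasks are scheduled on whatever resources free up first regardless of locality.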
Upvotes: 5