Reputation: 3632
When we write

RDD.map(x => x + 1)

this corresponds to a task that the master will send to all the workers to execute on their partitions. But I'm interested in the details of this magic. Let's say we submit a jar that contains all these functions using spark-submit. Once this jar is submitted to the master, how does the master understand and extract all these transformations and send them to all the workers? Does it use the reflection mechanism of Java?
For the sake of example, can you write a simple map and use, for example, Akka under the hood to do the same magic?
Upvotes: 1
Views: 627
Reputation: 44992
The assembled uber-JAR is not submitted to the master per se, but rather to the spark-submit script. This script makes sure that the JAR is available to the master and all worker nodes, and that all the classpaths are set correctly. Only then does it launch the application and start the master node. Reflection is of no use here, because all the needed classes (including the compiled anonymous inner class that implements the closure in .map(x => x + 1)) are available in the JAR itself.
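For reference, a typical invocation looks something like this (the class name, JAR name, and master URL are hypothetical placeholders):

    spark-submit \
      --class com.example.MyApp \
      --master spark://master-host:7077 \
      my-app-assembly.jar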
When it's time to apply the closure in map, the master can use ordinary serialization to send the values on which the closure depends to the workers. The workers will then load the code of the closure from the JAR, supplement it with the necessary parameters from the deserialized closure, and then apply the closure to the RDDs.
You can of course implement RDDs with a map using Akka (that's what Spark does), but that's not exactly simple, at least not simple enough to fit into a single SO answer.
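That said, since the question asks for a taste of it, here is a deliberately tiny, hypothetical sketch of the idea using classic Akka actors in a single JVM. Everything in it (the Worker actor, the MapTask and Result messages) is invented for illustration; there is no scheduling, fault tolerance, shuffling, or remote serialization, which is where the real complexity lives. In a cluster, the workers would be remote actors and the closure would go over the wire exactly as in the serialization sketch above.

    import akka.actor.{Actor, ActorSystem, Props}
    import akka.pattern.ask
    import akka.util.Timeout
    import scala.concurrent.duration._
    import scala.concurrent.{Await, Future}

    // A "task": one partition of the data plus the closure to apply to it.
    case class MapTask(partition: Seq[Int], f: Int => Int)
    case class Result(partition: Seq[Int])

    // A worker applies the received closure to its partition and replies with the result.
    class Worker extends Actor {
      def receive: Receive = {
        case MapTask(partition, f) => sender() ! Result(partition.map(f))
      }
    }

    object MiniMap extends App {
      implicit val timeout: Timeout = Timeout(5.seconds)
      val system = ActorSystem("mini-spark")
      import system.dispatcher

      // Two workers, two partitions: the "RDD" is just a partitioned Seq.
      val workers    = Vector.fill(2)(system.actorOf(Props[Worker]()))
      val partitions = Vector(Seq(1, 2), Seq(3, 4))

      // "Driver" side: send one MapTask per partition, then gather the results.
      val futures = partitions.zipWithIndex.map { case (p, i) =>
        (workers(i) ? MapTask(p, x => x + 1)).mapTo[Result]
      }
      val mapped = Await.result(Future.sequence(futures), 10.seconds)
      println(mapped.flatMap(_.partition))      // Vector(2, 3, 4, 5)
      system.terminate()
    }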
The interactive Spark REPL is again a completely different story from the spark-submit script, because it has to compile new code while the application is running.
Upvotes: 1