Reputation: 3632
When we write

RDD.map(x => x + 1)

this corresponds to a task that the master will send to all the workers to execute on their partitions. But I'm interested in the details of this magic. Let's say we submit a jar that contains all these functions using spark-submit. Once this jar is submitted to the master, how does the master understand and extract all these transformations and send them to all the workers? Does it use the reflection mechanism of Java?
For the sake of example, can you write a simple map and use, for example, Akka under the hood to do the same magic?
Upvotes: 1
Views: 627
Reputation: 44992
The assembled uber-JAR is not submitted to the master per se, but rather to the spark-submit script. This script makes sure that the JAR is available to the master and all worker nodes, and that all the classpaths are set correctly. Only then does it launch the application and start the master node. Reflection is of no use here, because all the needed classes (including the compiled anonymous inner class that implements the closure in .map(x => x + 1)) are available in the JAR itself.
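For reference, a typical invocation looks something like this (the class name, JAR name, and master URL are hypothetical placeholders):

    spark-submit \
      --class com.example.MyApp \
      --master spark://master-host:7077 \
      my-app-assembly.jar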
When it's time to apply the closure in map, the master can use ordinary serialization to send the values on which the closure depends to the workers. The workers will then load the code of the closure from the JAR, supplement it with the necessary parameters from the deserialized closure, and then apply the closure to the RDDs.
You can of course implement RDDs with a map using Akka (that's what Spark does), but that's not exactly simple, at least not simple enough to fit into a single SO answer.
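That said, since the question asks for a taste of it, here is a deliberately tiny, hypothetical sketch of the idea using classic Akka actors in a single JVM. Everything in it (the Worker actor, the MapTask and Result messages) is invented for illustration; there is no scheduling, fault tolerance, shuffling, or remote serialization, which is where the real complexity lives. In a cluster, the workers would be remote actors and the closure would go over the wire exactly as in the serialization sketch above.

    import akka.actor.{Actor, ActorSystem, Props}
    import akka.pattern.ask
    import akka.util.Timeout
    import scala.concurrent.duration._
    import scala.concurrent.{Await, Future}

    // A "task": one partition of the data plus the closure to apply to it.
    case class MapTask(partition: Seq[Int], f: Int => Int)
    case class Result(partition: Seq[Int])

    // A worker applies the received closure to its partition and replies with the result.
    class Worker extends Actor {
      def receive: Receive = {
        case MapTask(partition, f) => sender() ! Result(partition.map(f))
      }
    }

    object MiniMap extends App {
      implicit val timeout: Timeout = Timeout(5.seconds)
      val system = ActorSystem("mini-spark")
      import system.dispatcher

      // Two workers, two partitions: the "RDD" is just a partitioned Seq.
      val workers    = Vector.fill(2)(system.actorOf(Props[Worker]()))
      val partitions = Vector(Seq(1, 2), Seq(3, 4))

      // "Driver" side: send one MapTask per partition, then gather the results.
      val futures = partitions.zipWithIndex.map { case (p, i) =>
        (workers(i) ? MapTask(p, x => x + 1)).mapTo[Result]
      }
      val mapped = Await.result(Future.sequence(futures), 10.seconds)
      println(mapped.flatMap(_.partition))      // Vector(2, 3, 4, 5)
      system.terminate()
    }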
The interactive Spark REPL is again a completely different story from the spark-submit script, because it has to compile new code while the application is running.
Upvotes: 1