JDev

Reputation: 1822

Apache Spark: when and what creates the driver?

I am trying to understand the sequence of events related to the creation of the driver program during spark-submit, in cluster and client mode.

Spark-Submit

Let's say I am on my machine and I do a spark-submit with the YARN resource manager and deploy mode set to cluster.

Now, when is the driver created? Is it before the execution of the main program, or when the SparkSession is being created?

My understanding:

  1. The spark-submit bash script contacts the resource manager and asks for a container in which to run the main program.
  2. Once the container is initiated, the spark-submit script runs the main program in that container on the cluster.
  3. Once the main program is running, the SparkContext contacts the resource manager to create containers for the executors (a minimal sketch of what I submit follows this list).
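
For concreteness, here is roughly what I am submitting (the file name, app name, and the job itself are made up for illustration):

    # Submitted from my machine with (hypothetical file name):
    #   spark-submit --master yarn --deploy-mode cluster my_app.py
    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        # Is the driver already running before this line, or does it only
        # come into existence when the session is created here?
        spark = (SparkSession.builder
                 .appName("driver-demo")
                 .getOrCreate())

        # A trivial action so the executors actually do something.
        print(spark.sparkContext.parallelize(range(100)).count())
        spark.stop()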

Now, if this is a correct understanding, then what happens when we simply run a Python script on a local machine in cluster mode?

Upvotes: 1

Views: 1498

Answers (2)

pltc

Reputation: 6082

Spark has two deploy modes: client and cluster.

  • client mode is the mode where the computer you submitted the Spark job from is the driver. That could be your local computer or, more commonly, a so-called "edge node". In this mode the driver shares its resources with plenty of other software, which is often neither optimal nor reliable (think of the case where you submit a job while running something super heavy on the same machine at the same time).

  • cluster mode is the mode where YARN picks one of the cluster's available nodes and makes it the driver. It will try to pick the best one, so you don't have to worry about the driver's resources anymore.

what happens when we simply run a python script on a local machine with cluster mode?

You probably have some sense of the answer by now: if you simply run a Python script on a local machine, that is client mode, and the Spark job will use the local computer's resources as part of the Spark computation. With cluster mode, on the other hand, another computer runs the driver, not your local machine.
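
If you want to check which mode a given run actually ended up in, here is one sketch, assuming spark-submit has recorded the spark.submit.deployMode property in the configuration (client is the usual default; the app name is made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mode-check").getOrCreate()

    # spark-submit records the deploy mode in the Spark configuration;
    # fall back to "client", the default, if it was never set.
    conf = spark.sparkContext.getConf()
    mode = conf.get("spark.submit.deployMode", "client")
    host = conf.get("spark.driver.host", "unknown")
    print("deploy mode: %s, driver host: %s" % (mode, host))

    spark.stop()

In client mode the driver host will be the machine you submitted from; in cluster mode it will be whichever node YARN picked.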

Upvotes: 0

Ged

Reputation: 18033

See https://blog.knoldus.com/understanding-the-working-of-spark-driver-and-executor/ (I can't explain it any better than this) and also https://spark.apache.org/docs/latest/submitting-applications.html.

This answers more than your question. An excellent read.

Let’s say a user submits a job using “spark-submit”.

  • “spark-submit” will in turn launch the Driver, which will execute the main() method of our code.
  • The Driver contacts the cluster manager and requests resources to launch the Executors.
  • The cluster manager launches the Executors on behalf of the Driver.
  • Once the Executors are launched, they establish a direct connection with the Driver.
  • The Driver determines the total number of Tasks by checking the lineage.
  • The Driver creates the Logical and Physical Plans.
  • Once the Physical Plan is generated, Spark allocates the Tasks to the Executors.
  • Tasks run on the Executors, and each Task, upon completion, returns its result to the Driver.
  • When all Tasks are completed, the main() method running in the Driver exits, i.e. the main() method invokes sparkContext.stop().
  • Finally, Spark releases all the resources from the Cluster Manager (a minimal sketch of this lifecycle follows the list).
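
Here is a minimal PySpark sketch of that lifecycle (the app name and job are made up; the comments map the lines to the steps above):

    from pyspark.sql import SparkSession

    # Launched by spark-submit; the process running this code is the Driver.
    # Creating the session requests Executors from the cluster manager.
    spark = SparkSession.builder.appName("lifecycle-demo").getOrCreate()
    sc = spark.sparkContext

    # Transformations only extend the lineage; no Tasks run yet.
    rdd = sc.parallelize(range(1000)).map(lambda x: x * 2)

    # The action triggers planning, schedules Tasks on the Executors,
    # and returns the result to the Driver.
    print(rdd.sum())

    # main() ends by stopping the context, which releases the resources
    # back to the cluster manager.
    spark.stop()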

Upvotes: 3
