Reputation: 4502
I have a Spark cluster that consists of a master and two worker nodes.
When I execute the following code to extract data from a database, the actual execution is carried out by the master, not by one of the workers.
sparkSession.read
.format("jdbc")
.option("url", jdbcURL)
.option("user", user)
.option("query", query)
.option("driver", driverClass)
.option("fetchsize", fetchsize)
.option("numPartitions", numPartitions)
.option("queryTimeout", queryTimeout)
.options(options)
.load()
Is this expected behavior?
Is there some way to disable it?
Upvotes: 3
Views: 909
Reputation: 26478
A Spark application has two kinds of processes, the driver and the executors, and two kinds of operations, transformations and actions. According to this doc:
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
...
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
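For example, here is a minimal sketch of that distinction with a JDBC read like the one in the question (the connection URL, table name, and column name are assumed placeholders, not taken from the question):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("lazy-jdbc-example").getOrCreate()

// Assumed connection details, for illustration only
val jdbcURL = "jdbc:postgresql://db-host:5432/mydb"

// load() only resolves the schema (a small query issued from the driver);
// no table rows are read at this point
val df = spark.read
  .format("jdbc")
  .option("url", jdbcURL)
  .option("dbtable", "my_table")                // assumed table name
  .load()

// Transformation: lazy, just recorded in the plan, nothing executes yet
val filtered = df.filter(col("amount") > 100)   // assumed column name

// Action: this triggers the actual JDBC reads on the executors and
// returns only the count to the driver
val rowCount = filtered.count()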
So in a Spark application, some operations are executed in the executors and some in the driver. On Dataproc, executors always run in YARN containers on worker nodes, but the driver can run on either the master node or a worker node. The default is "client mode", in which the driver runs on the master node outside of YARN. You can instead use
gcloud dataproc jobs submit spark ... --properties spark.submit.deployMode=cluster
to enable "cluster mode", which runs the driver in a YARN container on a worker node. See this doc for more details.
Upvotes: 2