Puneet Singh

Reputation: 312

Are Spark executors multi-threaded?

How does a Spark Executor execute the code? Does it have multiple threads running? If yes, will it open multiple JDBC connections to read/write data from/to RDBMS?

Upvotes: 11

Views: 13695

Answers (3)

Sachin Thapa

Reputation: 3709

You can easily test this by running Spark locally.

import org.apache.spark.{SparkConf, SparkContext}

// Run Spark locally with two worker threads.
val conf = new SparkConf()
             .setMaster("local[2]")
             .setAppName("JDBCTest")
val sc = new SparkContext(conf)

In the above snippet, local[2] means two worker threads. Now, if you open a JDBC connection while processing RDDs, Spark will do this for each task.
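You can observe this yourself with a small sketch (reusing the sc above): each task reports the name of the thread it runs on, and with local[2] you should see at most two distinct worker threads.

val threadNames = sc.parallelize(1 to 8, numSlices = 4)
  .mapPartitions { iter =>
    // Each task reports the executor thread it runs on.
    Iterator(Thread.currentThread().getName)
  }
  .collect()
  .distinct

threadNames.foreach(println)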

Transformations and actions run in parallel in Spark. By design, Spark is more efficient at in-memory processing, so in the first place we should avoid writing code that opens a JDBC connection in every task; instead, you can load the table into memory for processing, see the snippet below.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Load the table into memory once, instead of opening a
// connection in each task.
Dataset<Row> jdbcDF = spark.read().format("jdbc")
        .option("url", mySQLConnectionURL)
        .option("driver", MYSQL_DRIVER)
        .option("dbtable", sql)
        .option("user", userId)
        .option("password", dbpassword)
        .load();

Cheers!

Upvotes: 0

Jacek Laskowski

Reputation: 74619

How does a Spark Executor execute the code?

The beauty of open source, the Apache Spark project included, is that you can read the code and find the answer yourself. That's not to say this is the best or the only way to find the answer, but my answer might not be as clear as the code itself (the opposite can also be true :))

With that said, see the code of Executor yourself.

Does it have multiple threads running?

Yes. See this line where Executor creates a new TaskRunner, which is a Java Runnable (i.e. something that runs on a separate thread). That Runnable is then executed on the executor's thread pool.

Quoting Java's Executors.newCachedThreadPool that Spark uses for the thread pool:

Creates a thread pool that creates new threads as needed, but will reuse previously constructed threads when they are available, and uses the provided ThreadFactory to create new threads when needed.
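For a feel of the mechanics, here is a simplified sketch of that pattern, not Spark's actual code (which wraps the pool with a named daemon ThreadFactory): Runnables submitted to a cached thread pool run concurrently, each on a pooled thread.

import java.util.concurrent.{Executors, TimeUnit}

// Tasks are Runnables submitted to a cached thread pool.
val threadPool = Executors.newCachedThreadPool()

// Each "task" runs on its own pooled thread, concurrently.
(1 to 4).foreach { taskId =>
  threadPool.execute(new Runnable {
    override def run(): Unit =
      println(s"task $taskId on ${Thread.currentThread().getName}")
  })
}

threadPool.shutdown()
threadPool.awaitTermination(10, TimeUnit.SECONDS)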

If yes, will it open multiple JDBC connections to read/write data from RDBMS?

I'm sure you know the answer already. Yes, it will open multiple connections, and that's why you should use the foreachPartition operation to "apply a function f to each partition of this Dataset" (the same applies to RDDs), together with some kind of connection pool.
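A minimal sketch of that pattern, assuming a Dataset[Row] named df; the URL, user, password and INSERT statement are all hypothetical placeholders:

import java.sql.DriverManager
import org.apache.spark.sql.Row

df.foreachPartition { (rows: Iterator[Row]) =>
  // One connection per partition (i.e. per task), not per row; a real
  // application would borrow from a connection pool here instead.
  val conn = DriverManager.getConnection(
    "jdbc:mysql://localhost:3306/mydb", "user", "password")
  try {
    val stmt = conn.prepareStatement("INSERT INTO target_table VALUES (?)")
    rows.foreach { row =>
      stmt.setString(1, row.getString(0))
      stmt.addBatch()
    }
    stmt.executeBatch()
  } finally {
    conn.close()
  }
}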

Upvotes: 10

Raphael Roth

Reputation: 27373

Yes, if you set spark.executor.cores to more than 1, your executor will run multiple tasks in parallel threads, and yes, I would then expect multiple JDBC connections to be opened.
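As an illustration (a sketch with a hypothetical app name; spark.executor.cores is usually passed to spark-submit, but it can also be set on the session builder):

import org.apache.spark.sql.SparkSession

// With 4 cores per executor, each executor JVM can run up to
// 4 tasks concurrently, each on its own thread.
val spark = SparkSession.builder()
  .appName("MultiThreadedExecutors")
  .config("spark.executor.cores", "4")
  .getOrCreate()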

Upvotes: 1
