Reputation: 312
How does a Spark Executor execute the code? Does it have multiple threads running? If yes, will it open multiple JDBC connections to read/write data from/to an RDBMS?
Upvotes: 11
Views: 13695
Reputation: 3709
You can easily test this by running Spark locally.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("JDBCTest")
val sc = new SparkContext(conf)
In the above snippet, local[2] means two threads. Now, if you open a JDBC connection while processing the RDDs, Spark will do this for each task.
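To make this concrete, here is a sketch of the problematic pattern, reusing the sc from the snippet above (the URL, credentials and data are made up for illustration):

import java.sql.DriverManager

// The closure below runs on the executor threads, so a connection is
// opened for every element it processes; with local[2], two tasks run
// concurrently, so two connections can be open at the same time.
val rdd = sc.parallelize(1 to 100)
rdd.foreach { id =>
  val conn = DriverManager.getConnection("jdbc:mysql://host:3306/test", "user", "pass")
  try {
    // ... read or write `id` through conn ...
  } finally {
    conn.close()
  }
}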
Transformations and actions run in parallel in Spark, and by design Spark is most efficient at running in-memory tasks, so in the first place we should avoid writing code that opens a JDBC connection for each RDD; instead, you can load the data into memory for processing, as in the snippet below.
// Load the data into a DataFrame once instead of connecting per record
Dataset<Row> jdbcDF = spark.read().format("jdbc")
        .option("url", mySQLConnectionURL).option("driver", MYSQL_DRIVER)
        .option("dbtable", sql).option("user", userId)
        .option("password", dbpassword).load();
Cheers!
Upvotes: 0
Reputation: 74619
How does a Spark Executor execute the code?
The beauty of open source, the Apache Spark project included, is that you can see the code and find the answer yourself. That's not to say this is the best or only way to find the answer, but my answer might not be as clear as the code itself (and the opposite can also be true :))
With that said, see the code of Executor yourself.
Does it have multiple threads running?
Yes. See this line where Executor creates a new TaskRunner, which is a Java Runnable (executed on a separate thread). That Runnable is going to be executed on the thread pool.
Quoting Java's Executors.newCachedThreadPool, which Spark uses for the thread pool:
Creates a thread pool that creates new threads as needed, but will reuse previously constructed threads when they are available, and uses the provided ThreadFactory to create new threads when needed.
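Spark's Executor is of course more involved, but the underlying JDK pattern is just this; a minimal sketch (not Spark's actual code):

import java.util.concurrent.Executors

// A cached thread pool runs each submitted Runnable on its own
// (possibly reused) thread, just as Executor does with its TaskRunners.
val pool = Executors.newCachedThreadPool()
(1 to 3).foreach { taskId =>
  pool.submit(new Runnable {
    // stand-in for Spark's TaskRunner
    override def run(): Unit =
      println(s"task $taskId runs on ${Thread.currentThread().getName}")
  })
}
pool.shutdown()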
If yes, will it open multiple JDBC connections to read/write data from/to RDBMS?
I'm sure you know the answer already. Yes, it will open multiple connections, and that's why you should be using the foreachPartition operation to "apply a function f to each partition of this Dataset" (the same applies to RDDs), together with some kind of connection pool.
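A minimal sketch of that shape (assuming df is a Dataset[Row] you have already loaded; the URL and credentials are placeholders, and a real application would take connections from a pool such as HikariCP rather than DriverManager):

import java.sql.DriverManager
import org.apache.spark.sql.Row

// One connection per partition instead of one per record; the explicit
// function type selects the Scala overload of foreachPartition.
val writePartition: Iterator[Row] => Unit = { rows =>
  val conn = DriverManager.getConnection("jdbc:mysql://host:3306/test", "user", "pass")
  try {
    rows.foreach { row =>
      // ... write `row` through conn ...
    }
  } finally {
    conn.close()
  }
}
df.foreachPartition(writePartition)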
Upvotes: 10
Reputation: 27373
Yes, if you set spark.executor.cores to more than 1, then your executor will have multiple parallel threads, and yes, I guess multiple JDBC connections will then be opened.
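For example, a hypothetical configuration with 4 cores per executor means up to 4 tasks (threads) running concurrently inside each executor JVM:

import org.apache.spark.SparkConf

// hypothetical values; 4 cores per executor = up to 4 concurrent tasks
val conf = new SparkConf()
  .setAppName("JDBCTest")
  .set("spark.executor.cores", "4")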
Upvotes: 1