Reputation: 153
What is really executed and where, when using JDBC drivers to connect to e.g. Oracle?

I have started a Spark master as
spark-class.cmd org.apache.spark.deploy.master.Master
and a worker like so
spark-class.cmd org.apache.spark.deploy.worker.Worker spark://myip:7077
and spark shell as
spark-shell --master spark://myip:7077
In spark-defaults.conf I have
spark.driver.extraClassPath = C:/jdbcDrivers/ojdbc8.jar
spark.executor.extraClassPath = C:/jdbcDrivers/ojdbc8.jar
and in spark-env.sh I have
SPARK_CLASSPATH=C:/jdbcDrivers/ojdbc8.jar
I can now run queries against Oracle in the spark-shell:
val jdbcDF = spark.read.format("jdbc").option("url","jdbc:oracle:thin:@...
This works fine without separately adding the JDBC driver jar in the Scala shell.
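For completeness, the full read looks roughly like this (the connection details, table name, and credentials below are placeholders, not my real values):

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service") // placeholder host/port/service
  .option("dbtable", "SOME_SCHEMA.SOME_TABLE")              // placeholder table
  .option("user", "scott")                                  // placeholder credentials
  .option("password", "tiger")
  .option("driver", "oracle.jdbc.OracleDriver")
  .load()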
When I start the master and worker in the same way, but create a Scala project in Eclipse and connect to the master as follows:
val sparkSession = SparkSession.builder
  .master("spark://myip:7077")
  .appName("SparkTestApp")
  .config("spark.jars", "C:\\pathToJdbc\\ojdbc8.jar")
  .getOrCreate()
then it fails if I don't explicitly add the JDBC jar in the Scala code. How is the execution different? Why do I need to specify the JDBC jar in the code? What is the point of connecting to the master if the job doesn't rely on the master and workers I started? And if I use multiple workers with JDBC, will they share a single connection, or will they read in parallel over several connections?
Upvotes: 1
Views: 1516
Reputation: 74669
You are certainly using more pieces than this sample needs, and that is what got you confused.
The two commands,

spark-class.cmd org.apache.spark.deploy.master.Master
spark-class.cmd org.apache.spark.deploy.worker.Worker spark://myip:7077

started a Spark Standalone cluster with one master and one worker. See Spark Standalone Mode.
In addition to running on the Mesos or YARN cluster managers, Spark also provides a simple standalone deploy mode. You can launch a standalone cluster either manually, by starting a master and workers by hand, or use our provided launch scripts. It is also possible to run these daemons on a single machine for testing.
You chose to start the Spark Standalone cluster manually (as described in Starting a Cluster Manually).
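(For comparison, on a non-Windows machine the same cluster could be brought up with the provided launch scripts; the worker script is called start-slave.sh in older releases and start-worker.sh in newer ones:)

./sbin/start-master.sh
./sbin/start-worker.sh spark://myip:7077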
I doubt that spark-defaults.conf is used by the cluster at all. The file is there to configure the Spark applications that you spark-submit to a cluster (as described in Dynamically Loading Spark Properties):

bin/spark-submit will also read configuration options from conf/spark-defaults.conf, in which each line consists of a key and a value separated by whitespace.
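Written in that whitespace-separated form, the entries from the question would look like this:

spark.driver.extraClassPath    C:/jdbcDrivers/ojdbc8.jar
spark.executor.extraClassPath  C:/jdbcDrivers/ojdbc8.jar

Keep in mind that spark.executor.extraClassPath only helps if the jar already exists at that path on every worker machine; Spark does not copy it there for you.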
With that said, I think we can safely put Spark Standalone aside. It does not add much to the discussion (and does confuse a bit).
In order to use a JDBC driver in your Spark application, you should spark-submit it with the --driver-class-path command-line option (or the spark.driver.extraClassPath property, as described in Runtime Environment):
spark.driver.extraClassPath Extra classpath entries to prepend to the classpath of the driver.
Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.
I strongly recommend using spark-submit --driver-class-path.
$ ./bin/spark-submit --help
...
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
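For example, a submit of the application from the question might look like this (the application jar and main class names are made up; the driver path is the one from the question):

spark-submit --master spark://myip:7077 --driver-class-path C:\jdbcDrivers\ojdbc8.jar --jars C:\jdbcDrivers\ojdbc8.jar --class SparkTestApp C:\pathToApp\sparktestapp.jar

Here --driver-class-path puts the Oracle driver on the driver's classpath, while --jars ships the same jar to the executors so they can open their own JDBC connections.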
You can read my notes on how to use a JDBC driver with PostgreSQL in Working with Datasets from JDBC Data Sources (and PostgreSQL).
PROTIP Use SPARK_PRINT_LAUNCH_COMMAND=1 to check out the command line of spark-submit.
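For example (Unix shell syntax shown; on Windows, set the variable first with set SPARK_PRINT_LAUNCH_COMMAND=1):

SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell --master spark://myip:7077

The launcher then prints the full java command line it is about to run, including the effective classpath, which makes it easy to verify that the JDBC jar is really there.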
All of the above applies to spark-shell too (as it uses spark-submit under the covers).
Upvotes: 1