user979899

Reputation: 153

How to use JDBC to read datasets from Oracle?

What is really executed, and where, when using JDBC drivers to connect to e.g. Oracle? I have started a Spark master as

spark-class.cmd org.apache.spark.deploy.master.Master 

and a worker like so

spark-class.cmd org.apache.spark.deploy.worker.Worker spark://myip:7077  

and spark shell as

spark-shell --master spark://myip:7077  

in spark-defaults.conf I have

spark.driver.extraClassPath = C:/jdbcDrivers/ojdbc8.jar
spark.executor.extraClassPath = C:/jdbcDrivers/ojdbc8.jar

and in spark-env.sh I have

SPARK_CLASSPATH=C:/jdbcDrivers/ojdbc8.jar

I can now run queries against Oracle in the spark-shell:

val jdbcDF = spark.read.format("jdbc").option("url","jdbc:oracle:thin:@...

This works fine in the Scala shell without separately adding the JDBC driver jar.
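For completeness, the full read looks roughly like this; the connection URL, table, and credentials below are placeholders, not my real ones:

    val jdbcDF = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1") // placeholder host/service
      .option("dbtable", "SCOTT.EMP")                            // placeholder table
      .option("user", "scott")                                   // placeholder credentials
      .option("password", "tiger")
      .option("driver", "oracle.jdbc.OracleDriver")
      .load()

    jdbcDF.show()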

When I start the master and worker in the same way, but create a Scala project in Eclipse and connect to the master as follows:

    import org.apache.spark.sql.SparkSession

    val sparkSession = SparkSession.builder
      .master("spark://myip:7077")
      .appName("SparkTestApp")
      .config("spark.jars", "C:\\pathToJdbc\\ojdbc8.jar")
      .getOrCreate()

then it fails unless I explicitly add the JDBC jar in the Scala code. How is the execution different? Why do I need to specify the JDBC jar in the code? What is the purpose of connecting to the master if it does not rely on the master and workers I started? If I use multiple workers with JDBC, will they share a single connection, or will they read in parallel over several connections?

Upvotes: 1

Views: 1516

Answers (1)

Jacek Laskowski

Reputation: 74669

You are certainly using more moving parts than this sample requires, and that is what got you confused.

The two lines, spark-class.cmd org.apache.spark.deploy.master.Master and spark-class.cmd org.apache.spark.deploy.worker.Worker spark://myip:7077, started a Spark Standalone cluster with one master and one worker. See Spark Standalone Mode.

In addition to running on the Mesos or YARN cluster managers, Spark also provides a simple standalone deploy mode. You can launch a standalone cluster either manually, by starting a master and workers by hand, or use our provided launch scripts. It is also possible to run these daemons on a single machine for testing.

You chose to start the Spark Standalone cluster manually (as described in Starting a Cluster Manually).

I doubt that spark-defaults.conf is used by the cluster at all. The file configures the Spark applications that you spark-submit to a cluster (as described in Dynamically Loading Spark Properties):

bin/spark-submit will also read configuration options from conf/spark-defaults.conf, in which each line consists of a key and a value separated by whitespace.
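For example, the two entries from your question would look like this in that whitespace-separated format:

    spark.driver.extraClassPath     C:/jdbcDrivers/ojdbc8.jar
    spark.executor.extraClassPath   C:/jdbcDrivers/ojdbc8.jar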

With that said, I think we can safely put Spark Standalone aside. It does not add much to the discussion (and is a bit confusing here).

"Installing" JDBC Driver for Spark Application

In order to use a JDBC driver in your Spark application, you should spark-submit with the --driver-class-path command-line option (or set the spark.driver.extraClassPath property as described in Runtime Environment):

spark.driver.extraClassPath Extra classpath entries to prepend to the classpath of the driver.

Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-class-path command line option or in your default properties file.

I strongly recommend using spark-submit --driver-class-path.

$ ./bin/spark-submit --help
...
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.
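A sketch of such an invocation follows; the main class and application jar are placeholders, while the master URL and driver path come from your question (on Windows, use spark-submit.cmd and ^ for line continuations):

    spark-submit \
      --master spark://myip:7077 \
      --driver-class-path C:/jdbcDrivers/ojdbc8.jar \
      --class com.example.SparkTestApp \
      C:/pathToApp/spark-test-app.jar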

You can read my notes on how to use a JDBC driver with PostgreSQL in Working with Datasets from JDBC Data Sources (and PostgreSQL).

PROTIP Use SPARK_PRINT_LAUNCH_COMMAND=1 to check out the command line of spark-submit.

All of the above applies to spark-shell too (as it uses spark-submit under the covers).
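In your case that would mean starting the shell along these lines (master URL and driver path as in your question):

    spark-shell --master spark://myip:7077 --driver-class-path C:/jdbcDrivers/ojdbc8.jar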

Upvotes: 1
