Reputation: 1530
I've just installed a new Spark 2.4 (from a CSD) on my CDH cluster (28 nodes) and am trying to install a JDBC driver in order to read data from a database from within a Jupyter notebook. I downloaded the driver and copied it into the /jars folder on one node; however, it seems that I have to do the same on each and every host (!). Otherwise I get the following error from one of the workers:
java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver
Is there any easy way (without writing bash scripts) to distribute the jar files across the whole cluster? I wish Spark could distribute them itself (or maybe it does and I just don't know how).
Upvotes: 0
Views: 407
Reputation: 2767
Spark has a built-in JDBC format reader you can use.
First, launch a Scala shell to confirm that your MS SQL Server driver is on the classpath:
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
If the driver class isn't found, make sure you place the jar on an edge node and include it on your classpath where you initialize your session (shown here with the MS SQL Server driver jar, matching the error above):
bin/spark-shell --driver-class-path mssql-jdbc-7.2.2.jre8.jar --jars mssql-jdbc-7.2.2.jre8.jar
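Since you're on Jupyter rather than spark-shell, the equivalent is to point the session config at the driver jar, and Spark will ship it to the executors itself instead of you copying it to every node. A minimal sketch, assuming the jar sits at /path/to/mssql-jdbc-7.2.2.jre8.jar (the path and version are placeholders to adapt):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("jdbc-example")
         # spark.jars distributes the listed jars to the driver and every
         # executor, so nothing needs to be copied into /jars by hand
         .config("spark.jars", "/path/to/mssql-jdbc-7.2.2.jre8.jar")
         # alternatively, let Spark resolve the driver from Maven Central:
         # .config("spark.jars.packages", "com.microsoft.sqlserver:mssql-jdbc:7.2.2.jre8")
         .getOrCreate())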
Then connect to your MS SQL Server via Spark JDBC:
# option1: generic format reader
jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://dbserver:1433;databaseName=mydb") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()
# option2: jdbc() shortcut
jdbcDF2 = spark.read \
    .jdbc("jdbc:sqlserver://dbserver:1433;databaseName=mydb", "schema.tablename",
          properties={"user": "username", "password": "password",
                      "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"})
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
At the end of the day, all you really need is the driver jar on an edge node (the client where you launch Spark), appended to your classpath. Then make the connection and partition your DataFrame read to scale performance, since a plain JDBC read from an RDBMS is single-threaded and therefore lands in a single partition; see the sketch below.
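Here is a hedged sketch of such a partitioned JDBC read; partitionColumn, the bounds, and the partition count are made-up values you would adapt to a numeric column in your own table:
# read in parallel by splitting on a numeric column (names/bounds are hypothetical)
jdbcDF3 = (spark.read
           .format("jdbc")
           .option("url", "jdbc:sqlserver://dbserver:1433;databaseName=mydb")
           .option("dbtable", "schema.tablename")
           .option("user", "username")
           .option("password", "password")
           .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
           # Spark issues numPartitions concurrent queries, each covering a
           # slice of [lowerBound, upperBound] on partitionColumn
           .option("partitionColumn", "id")
           .option("lowerBound", "1")
           .option("upperBound", "1000000")
           .option("numPartitions", "8")
           .load())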
Upvotes: 2