Emanuel Castelli

Reputation: 53

Pyspark: java.lang.ClassNotFoundException: Failed to find data source: com.microsoft.sqlserver.jdbc.spark (SQL Data Pool)

I'm trying to load streaming data from Kafka into SQL Server Big Data Clusters Data Pools. I'm using Spark 2.4.5 (Bitnami Spark 2.4.5 image).

To load data into regular tables, I use this statement and it works fine:

logs_df.write.format('jdbc').mode('append') \
    .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver') \
    .option('url', 'jdbc:sqlserver://XXX.XXX.XXX.XXXX:31433;databaseName=sales;') \
    .option('user', user).option('password', password) \
    .option('dbtable', 'SYSLOG_TEST_TABLE').save()

But the same statement for loading data into a SQL Data Pool gives me this error:

py4j.protocol.Py4JJavaError: An error occurred while calling o93.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 3, localhost, executor driver): java.sql.BatchUpdateException: External Data Pool Table DML statement cannot be used inside a user transaction.

I found that the way to load data into a SQL Data Pool is to use the 'com.microsoft.sqlserver.jdbc.spark' format, like this:

logs_df.write.format('com.microsoft.sqlserver.jdbc.spark').mode('append') \
    .option('url', url).option('dbtable', datapool_table) \
    .option('user', user).option('password', password) \
    .option('dataPoolDataSource', datasource_name).save()

But it's giving me this error:

py4j.protocol.Py4JJavaError: An error occurred while calling o93.save.
: java.lang.ClassNotFoundException: Failed to find data source: com.microsoft.sqlserver.jdbc.spark. Please find packages at http://spark.apache.org/third-party-projects.html

I'm running the script with spark-submit like this:

docker exec spark245_spark_1 /opt/bitnami/spark/bin/spark-submit \
    --driver-class-path /opt/bitnami/spark/jars/mssql-jdbc-8.2.2.jre8.jar \
    --jars /opt/bitnami/spark/jars/mssql-jdbc-8.2.2.jre8.jar \
    --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 \
    /storage/scripts/some_script.py

Is there any other package I should include or some special import I'm missing?

Thanks in advance

Edited: I've tried the same thing in Scala with the same results.

Upvotes: 3

Views: 15797

Answers (2)

Pblade

Reputation: 160

You need to build the connector jar from the repository using SBT first, then include it in your Spark cluster.

I know a lot of people will have trouble building this jar file (myself included, as of a few hours ago), so I will walk you through building it, step by step:

  1. Go to https://www.scala-sbt.org/download.html to download SBT, then install it.

  2. Go to https://github.com/microsoft/sql-spark-connector and download the zip file.

  3. Open the folder of the repository you have just downloaded, right-click in the blank space and click "Open PowerShell window here". https://i.sstatic.net/Fq7NX.png

  4. In the shell window, type "sbt" and press Enter. It may require you to download the Java Development Kit; if so, go to https://www.oracle.com/java/technologies/javase-downloads.html to download and install it. You may need to close and reopen the shell window after installing.

If things go right, you may see this screen: https://i.sstatic.net/fMxVr.png

  5. After the above step has done its job, type "package". The shell will show something like this, and the build may take a while to finish. https://i.sstatic.net/hr2hw.png

  6. After the build is done, go to the "target" folder, then the "scala-2.11" folder to get the jar file. https://i.sstatic.net/Aziqy.png

  7. Once you have the jar file, include it in your Spark cluster, for example by passing it to spark-submit as sketched below.
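
As a rough sketch of that last step, adapted to the spark-submit call from the question (the connector jar name and path below are illustrative; use whatever sbt actually produced in target/scala-2.11):

# spark-mssql-connector.jar is a placeholder name for the jar built in target/scala-2.11
docker exec spark245_spark_1 /opt/bitnami/spark/bin/spark-submit \
    --driver-class-path /opt/bitnami/spark/jars/mssql-jdbc-8.2.2.jre8.jar \
    --jars /opt/bitnami/spark/jars/spark-mssql-connector.jar,/opt/bitnami/spark/jars/mssql-jdbc-8.2.2.jre8.jar \
    --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 \
    /storage/scripts/some_script.py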

OR, if you don't want to do the troublesome procedures above....

UPDATE MAY 26, 2021: The connector is now available on Maven, so you can pull it from there instead of building it yourself.

https://mvnrepository.com/artifact/com.microsoft.azure/spark-mssql-connector
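
For example (assuming Spark 2.4 with Scala 2.11; 1.0.2 is just an illustrative version, check the Maven page above for the one matching your Spark build), the connector can be pulled in at submit time with --packages:

# the connector version is illustrative; pick the one from the Maven page that matches your Spark/Scala version
docker exec spark245_spark_1 /opt/bitnami/spark/bin/spark-submit \
    --driver-class-path /opt/bitnami/spark/jars/mssql-jdbc-8.2.2.jre8.jar \
    --packages com.microsoft.azure:spark-mssql-connector:1.0.2,org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 \
    /storage/scripts/some_script.py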

If you need more information, just comment. I will try my best to help.

Upvotes: 2

J.Cheuk

Reputation: 93

According to the documentation: "To include the connector in your projects, download this repository and build the jar using SBT."

So you need to build the connector JAR file using the build.sbt in the repository, then put the JAR file into your Spark installation's jars folder: your_path\spark\jars

To do this, download SBT here: https://www.scala-sbt.org/download.html. Open SBT in the directory where you saved the build.sbt, then run sbt package. A target folder will be created in the same directory, and the JAR file is located in target\scala-2.11.
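
A minimal sketch of those steps on the command line (paths are illustrative; on Windows the copy target would be your_path\spark\jars):

# run from the directory containing build.sbt (the downloaded sql-spark-connector repository)
sbt package
# copy the built jar into Spark's jars folder (adjust the destination to your installation)
cp target/scala-2.11/*.jar /opt/bitnami/spark/jars/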

Upvotes: 0
