wilson_smyth

Reputation: 1496

azure synapse: connecting to serverless sql pool from databricks - Failed to find data source: com.databricks.spark.sqldw

I'm using Synapse in Azure. I have data in the serverless SQL pool. I want to import that data into a dataframe in Databricks.

I am getting the following error:

Py4JJavaError: An error occurred while calling o568.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.sqldw. Please find packages at http://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:656)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:195)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:168)
    at sun.reflect.GeneratedMethodAccessor102.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.sqldw.DefaultSource
...
...
...

The PySpark code I am using is:

spark.conf.set(
  "fs.azure.account.key.adlsAcct.blob.core.windows.net",
  "GVk3234fds2JX/fahOcjig3gNy198yasdhfkjasdyf87HWmDVlx1wLRmu7asdfaP3g==")



sc._jsc.hadoopConfiguration().set(
  "fs.azure.account.key.adlsAcct.blob.core.windows.net",
  "GVk3234fds2JX/fahOcjig3gNy198yasdhfkjasdyf87HWmDVlx1wLRmu7asdfaP3g==")


  
df = spark.read \
.format("com.databricks.spark.sqldw") \
.option("url","jdbc:sqlserver://synapse-myworkspace-ondemand.sql.azuresynapse.net:1433;database=myDB;user=myUser;password=userPass123;encrypt=false;trustServerCertificate=true;hostNameInCertificate=*.sql.azuresynapse.net;loginTimeout=30;") \
.option("tempdir", "wasbs://[email protected]/Lakehouse/tempDir") \
.option("forwardSparkAzureStorageCredentials","true") \
.option("dbtble","tbl_sampledata") \
.load()

I can confirm:

To my eye, the error looks like Databricks cannot find the format com.databricks.spark.sqldw, but that could be a red herring.

Appreciate any advice and expertise.

Upvotes: 1

Views: 2856

Answers (1)

wBob

Reputation: 14379

One of the advantages of working with Azure Synapse Analytics is integration: the various components (storage, database, pipelines, notebooks, etc.) tend to work together more easily than standalone components such as a Databricks notebook, where you have to write code like yours, including the hadoopConfiguration setup.

One simple way to get data from a dedicated SQL pool into a Synapse notebook is the synapsesql method. A simple example:

%%spark
// Get the table with synapsesql method and expose as temp view
val df = spark.read.synapsesql("dedi_pool.dbo.someTable")

df.createOrReplaceTempView("someTable")

Unfortunately this method is only implemented in Scala at the moment (as far as I am aware), but you can save the dataframe as a temp view, which exposes it to Spark SQL and Python:

%%sql
SELECT * FROM someTable;

And here's the Python to retrieve the temp view:

%%pyspark
## Get the table with synapsesql method and expose as temp view
df = spark.sql("select * from someTable")

df.show()

And here are my results:

[screenshot: results]

Check the main documentation for this technique here.

For serverless SQL pools, I was initially frustrated by this not being built in, but if you think about it, you would be using duplicate services: the serverless engine to query the underlying files, and the Apache Spark pool to query that placeholder / external table, which is effectively querying the same underlying files. So you might as well reference the files directly using spark.read and whatever the file format is, e.g. .csv. Example taken from the docs:

%%pyspark
df = spark.read.load('abfss://[email protected]/NYCTripSmall.parquet', format='parquet')
display(df.limit(10))
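If the underlying files are CSV rather than parquet, the same direct-read approach applies. A minimal sketch, where the storage account, container and path are hypothetical placeholders, and it is assumed the files carry a header row:

%%pyspark
# Read CSV files straight from the lake instead of going through the serverless SQL pool.
# The storage account, container and path below are placeholders - substitute your own.
df_csv = spark.read.load(
    'abfss://[email protected]/Lakehouse/tbl_sampledata/*.csv',
    format='csv',
    header=True,       # assumes the files have a header row
    inferSchema=True   # convenient for exploration; supply an explicit schema in production
)
display(df_csv.limit(10))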

This is what I was thinking: [diagram: parquet, serverless SQL and Apache Spark pools]

It is, however, technically possible, and I covered the technique here.
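For completeness, one common building block for querying a serverless SQL pool from Spark (whether a Databricks notebook or a Synapse Spark pool) is Spark's generic jdbc data source rather than com.databricks.spark.sqldw, which targets dedicated pools and needs a tempdir staging location. A minimal sketch, reusing the server, database, credentials and table name from the question as placeholders and assuming SQL authentication and an existing external table or view:

# Query the serverless SQL pool over plain JDBC - no tempdir or staging storage account required.
jdbc_url = (
    "jdbc:sqlserver://synapse-myworkspace-ondemand.sql.azuresynapse.net:1433;"
    "database=myDB;encrypt=true;trustServerCertificate=false;loginTimeout=30;"
)

df = (spark.read
      .format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "dbo.tbl_sampledata")  # external table or view in the serverless database (schema assumed)
      .option("user", "myUser")                 # placeholder credentials from the question
      .option("password", "userPass123")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .load())

df.show()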

Upvotes: 2
