Compare data in a table in Databricks and MS SQL Server

Question

I have to compare the table in databricks with the same table in SQL Server and populate only the missing records into databricks. Can someone help me how to connect to SQL Server with databricks , how and where to write the query that will populate the missing data.

Thanks!

Alex Ott · Accepted Answer

You can connect to SQL server just using standard JDBC interface supported by Spark - databricks runtimes should contain the drivers for MS SQL out of box. When data read you can do anti join between SQL server data and your data in Databricks. Something like this (in Python):

jdbc_url = f"jdbc:sqlserver://{hostname}:{port};database={database}"
sql_data = spark.read.jdbc(jdbc_url, "your_table", connectionProperties)
your_data = spark.read.format("delta").load("path")
# get difference between datasets
diff = sql_data.join(your_data, , "leftanti")
# work with diff

For reading from SQL Server, follow instructions on how to optimize read performance, but this may depend on your actual schema.

Compare data in a table in Databricks and MS SQL Server

Answers (1)

Related Questions