Reputation: 31
I have to compare the table in databricks with the same table in SQL Server and populate only the missing records into databricks. Can someone help me how to connect to SQL Server with databricks , how and where to write the query that will populate the missing data.
Thanks!
Upvotes: 1
Views: 1024
Reputation: 87249
You can connect to SQL server just using standard JDBC interface supported by Spark - databricks runtimes should contain the drivers for MS SQL out of box. When data read you can do anti join between SQL server data and your data in Databricks. Something like this (in Python):
jdbc_url = f"jdbc:sqlserver://{hostname}:{port};database={database}"
sql_data = spark.read.jdbc(jdbc_url, "your_table", connectionProperties)
your_data = spark.read.format("delta").load("path")
# get difference between datasets
diff = sql_data.join(your_data, <join-condition>, "leftanti")
# work with diff
For reading from SQL Server, follow instructions on how to optimize read performance, but this may depend on your actual schema.
Upvotes: 1