Reputation: 4454
I have an Azure Databricks cluster that processes various tables and then as a final step I push these table into an Azure SQL Server to be used by some other processes. I have a cell in databricks that looks something like this:
def generate_connection():
jdbcUsername = dbutils.secrets.get(scope = "Azure-Key-Vault-Scope", key = "AzureSqlUserName")
jdbcPassword = dbutils.secrets.get(scope = "Azure-Key-Vault-Scope", key = "AzureSqlPassword")
connectionProperties = {
"user" : jdbcUsername,
"password" : jdbcPassword,
"driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
return connectionProperties
def generate_url():
jdbcHostname = dbutils.secrets.get(scope = "Azure-Key-Vault-Scope", key = "AzureSqlHostName")
jdbcDatabase = dbutils.secrets.get(scope = "Azure-Key-Vault-Scope", key = "AzureSqlDatabase")
jdbcPort = 1433
return "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
def persist_table(table, sql_table, mode):
jdbcUrl = generate_url();
connectionProperties = generate_connection()
table.write.jdbc(jdbcUrl, sql_table, properties=connectionProperties, mode=mode)
persist_table(spark.table("Sales.OpenOrders"), "Sales.OpenOrders", "overwrite")
persist_table(spark.table("Sales.Orders"), "Sales.Orders", "overwrite")
This works as expected. The problem that I have is that the Orders table is very large and only a small fraction of the rows could possible change each day, so what I want to do is change the overwrite mode to the append mode and change the data frame from being the entire table to just the rows that could have changed. All of this I know how to do easily enough, but what I want to do is run a simple SQL statement against the Azure SQL database to remove the rows that are already going to be there, so that they possibly changed rows will be inserted back.
I want to run a SQL statement against the Azure SQL database like
Delete From Sales.Orders Where CreateDate >= '01/01/2019'
Upvotes: 1
Views: 8885
Reputation: 311
I want to share my findings.
1) pyodbc - I asked a Microsoft Technical Support and got the answer is as follows:
####========================================================
### cell 1: install pyodbc
####========================================================
%sh
curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
apt-get update
ACCEPT_EULA=Y apt-get install msodbcsql17
apt-get -y install unixodbc-dev
sudo apt-get install python3-pip -y
pip3 install --upgrade pyodbc
####========================================================
### cell 2: connect
####========================================================
import pyodbc
conn = pyodbc.connect( 'DRIVER={ODBC Driver 17 for SQL Server};'
'SERVER=xxxxx.database.windows.net;'
'DATABASE=xxxxx;UID=xxxxx;'
'PWD=xxxxx')
####========================================================
### cell 3: create table
####========================================================
conn.execute('CREATE TABLE dbo.Bob (Bob1 VARCHAR(30), Bob2 VARCHAR(30))')
####========================================================
### cell 4: insert into table
####========================================================
conn.execute('INSERT INTO dbo.Bob (Bob1, Bob2) VALUES (?, ?)', ('A', 'B'))
Notes: a) Runtime: 6.2 (Scala 2.11, Spark 2.4.4) b) This Runtime version supports only Python 3.
2) Spark connector for Azure SQL Database and SQL Server - While googling a solution for installing pyodbc, I found this one. I like this one better and am going to try it out.
Upvotes: 1
Reputation: 2473
You need to use the pyodbc library. You can connect and use sql statements.
import pyodbc
conn = pyodbc.connect( 'DRIVER={ODBC Driver 17 for SQL Server};'
'SERVER=mydatabe.database.azure.net;'
'DATABASE=AdventureWorks;UID=jonnyFast;'
'PWD=MyPassword')
# Example doing a simple execute
conn.execute('INSERT INTO Bob (Bob1, Bob2) VALUES (?, ?)', ('A', 'B'))
Unfortunately to get it working on databricks is a bit of a pain. I wrote a blog post a while back which should help. https://datathirst.net/blog/2018/10/12/executing-sql-server-stored-procedures-on-databricks-pyspark
Upvotes: 0