Reputation: 390
I am trying to convert a SQL stored procedure to a Databricks notebook. The stored procedure below contains two statements to implement. Here table1 and table2 are Delta Lake tables on a Databricks cluster.
I want to use a Python variable in place of max_date in the SQL query. How can I do that?
%sql
DELETE FROM table1 WHERE Date = max_date;
INSERT INTO table1
SELECT * FROM table2 WHERE Date = max_date;
Upvotes: 5
Views: 29949
Reputation: 1377
Assuming you either calculate max_date or use a widget to pass in the value, like so:
max_date = dbutils.widgets.get("max_date")
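(If the widget does not exist yet, a minimal sketch of creating it in an earlier cell; the default value shown here is a hypothetical placeholder:)
# Create the text widget that the get() call above reads
dbutils.widgets.text("max_date", "2022-03-31")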
You can use spark.sql() with your SQL query in a Python string, like this:
# spark.sql() executes a single statement per call, so run the DELETE and INSERT separately
spark.sql(f"DELETE FROM table1 WHERE Date = '{max_date}'")
df = spark.sql(f"""
INSERT INTO table1
SELECT * FROM table2 WHERE Date = '{max_date}'
""")
display(df)
f-string formatting is an easy and readable way to craft the SQL string as desired before passing it to the built-in spark.sql() executor. spark.sql() returns the results of the query as a DataFrame if you need them.
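If you are on Spark 3.4+ (or a recent Databricks Runtime), spark.sql() also accepts named parameters via its args argument, which avoids splicing values into the string at all; a minimal sketch, reusing the tables from the question:
# :max_date is a named parameter marker, bound via the args dict (Spark 3.4+)
spark.sql(
    "DELETE FROM table1 WHERE Date = :max_date",
    args={"max_date": max_date},
)
df = spark.sql(
    "INSERT INTO table1 SELECT * FROM table2 WHERE Date = :max_date",
    args={"max_date": max_date},
)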
Upvotes: 7
Reputation: 390
If you are using PySpark in Databricks, another way to use a Python variable in a Spark SQL query is shown below:
max_date = '2022-03-31'
df = spark.sql(f"""SELECT * FROM table2 WHERE Date = '{max_date}' """)
Here the f at the beginning of the query marks a formatted string literal (f-string), which lets you interpolate the Python variable into the Spark SQL statement.
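If max_date should come from the data rather than be hardcoded, a minimal sketch of calculating it first (assuming table2 has a Date column, as in the question):
# Compute the latest Date in table2, then interpolate it into the query
max_date = spark.sql("SELECT MAX(Date) AS max_date FROM table2").collect()[0]["max_date"]
df = spark.sql(f"""SELECT * FROM table2 WHERE Date = '{max_date}' """)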
Upvotes: 2
Reputation: 21
For a parameterized query, start from a base query with named placeholders:
base_query = """
SELECT col1, sum(col3) AS `Total`
FROM table
WHERE col1 = '{specific_ValforCol1}'
AND `date_col` BETWEEN '{datemin}' AND '{datemax}'
GROUP BY col1
"""
Add a function to build a custom query from the base query:
def build_query(variables: dict, sql_query: str = base_query) -> str:
    return sql_query.format(**variables)
Create a dictionary of values and call the function (no f-string is needed, since build_query already returns a string):
vals = {"specific_ValforCol1": "7b5d", "datemin": "2022-04-19", "datemax": "2022-10-03"}
df = spark.sql(build_query(variables=vals, sql_query=base_query)).toPandas()
This worked for me.
Upvotes: 1
Reputation: 1459
If you are going to run it cell by cell, then you can use Databricks widgets, like this:
First cell
import datetime

x = str(datetime.date.today())
dbutils.widgets.text("max_date", x)
Second cell
%sql
select getArgument("max_date") AS max_date
will give you
max_date
2022-06-04
but as mentioned here, it does not work when Run All is used; the ideal way would be to create separate language-specific notebooks and pass variables using %run.
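A minimal sketch of that %run approach (the notebook path ./set_vars is hypothetical): put the variable in its own Python notebook, then include it from the caller, which pulls its variables into scope.
# Contents of the ./set_vars notebook (hypothetical path)
import datetime
max_date = str(datetime.date.today())
Then, in a cell of the calling notebook:
%run ./set_vars
After that, max_date is available to Python cells in the calling notebook.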
Another way is to use Spark conf, like below.
First set a value for the conf (note: the conf name should have a .(dot) in it):
max_date2 = str(datetime.date.today())
spark.conf.set("abc.max_dt2", max_date2)
Next, try selecting the value like below:
%sql
select "${abc.max_dt2}" as max_date
It should give the same value as above.
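The conf can also be read back on the Python side if you need the value there:
# Returns the string value set via spark.conf.set above
spark.conf.get("abc.max_dt2")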
Upvotes: 11