Reputation: 390
I am trying to convert a SQL stored procedure to a Databricks notebook. The stored procedure below contains two statements to implement. Here table1 and table2 are Delta Lake tables on a Databricks cluster.
I want to use a Python variable in place of max_date in the SQL query. How can I do that?
%sql
DELETE FROM table1 WHERE Date = max_date;
INSERT INTO table1
SELECT * FROM table2 WHERE Date = max_date;
Upvotes: 5
Views: 29949
Reputation: 1377
Assuming you either calculate max_date or use a widget to pass in the value, like so:
max_date = dbutils.widgets.get("max_date")
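(If the widget does not exist yet, a minimal sketch of creating it in an earlier cell; the default value shown here is a hypothetical placeholder:)
# Create the text widget that the get() call above reads
dbutils.widgets.text("max_date", "2022-03-31")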
You can use spark.sql() with your SQL query in a Python string, like this:
# spark.sql() executes a single statement per call, so run the DELETE and INSERT separately
spark.sql(f"DELETE FROM table1 WHERE Date = '{max_date}'")
df = spark.sql(f"""
INSERT INTO table1
SELECT * FROM table2 WHERE Date = '{max_date}'
""")
display(df)
f-string formatting is an easy and readable way to craft the SQL string as desired before passing it to the built-in spark.sql() executor. spark.sql() returns the results of the query as a DataFrame if you need them.
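If you are on Spark 3.4+ (or a recent Databricks Runtime), spark.sql() also accepts named parameters via its args argument, which avoids splicing values into the string at all; a minimal sketch, reusing the tables from the question:
# :max_date is a named parameter marker, bound via the args dict (Spark 3.4+)
spark.sql(
    "DELETE FROM table1 WHERE Date = :max_date",
    args={"max_date": max_date},
)
df = spark.sql(
    "INSERT INTO table1 SELECT * FROM table2 WHERE Date = :max_date",
    args={"max_date": max_date},
)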
Upvotes: 7
Reputation: 390
If you are using PySpark in Databricks, another way to use a Python variable in a Spark SQL query is shown below:
max_date = '2022-03-31'
df = spark.sql(f"""SELECT * FROM table2 WHERE Date = '{max_date}' """)
Here the f at the beginning of the query marks a formatted string literal (f-string), which lets you interpolate the Python variable into the Spark SQL statement.
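If max_date should come from the data rather than be hardcoded, a minimal sketch of calculating it first (assuming table2 has a Date column, as in the question):
# Compute the latest Date in table2, then interpolate it into the query
max_date = spark.sql("SELECT MAX(Date) AS max_date FROM table2").collect()[0]["max_date"]
df = spark.sql(f"""SELECT * FROM table2 WHERE Date = '{max_date}' """)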
Upvotes: 2
Reputation: 21
For a parameterized query, start from a base query with named placeholders:
base_query = """
SELECT col1, sum(col3) AS `Total`
FROM table
WHERE col1 = '{specific_ValforCol1}'
AND `date_col` BETWEEN '{datemin}' AND '{datemax}'
GROUP BY col1
"""
Add a function to build a custom query from the base query:
def build_query(variables: dict, sql_query: str = base_query) -> str:
    return sql_query.format(**variables)
Create a dictionary of values and call the function (no f-string is needed, since build_query already returns a string):
vals = {"specific_ValforCol1": "7b5d", "datemin": "2022-04-19", "datemax": "2022-10-03"}
df = spark.sql(build_query(variables=vals, sql_query=base_query)).toPandas()
This worked for me.
Upvotes: 1
Reputation: 1459
If you are going to run it cell by cell, then you can use Databricks widgets, like this:
First cell
import datetime

x = str(datetime.date.today())
dbutils.widgets.text("max_date", x)
Second cell
%sql
select getArgument("max_date") AS max_date
will give you
max_date
2022-06-04
but as mentioned here, it does not work when Run All is used; the ideal way would be to create separate language-specific notebooks and pass variables using %run.
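A minimal sketch of that %run approach (the notebook path ./set_vars is hypothetical): put the variable in its own Python notebook, then include it from the caller, which pulls its variables into scope.
# Contents of the ./set_vars notebook (hypothetical path)
import datetime
max_date = str(datetime.date.today())
Then, in a cell of the calling notebook:
%run ./set_vars
After that, max_date is available to Python cells in the calling notebook.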
Another way is to use Spark conf, like below.
First set a value for the conf (note: the conf name should have a .(dot) in it):
max_date2 = str(datetime.date.today())
spark.conf.set("abc.max_dt2", max_date2)
Next, try selecting the value like below:
%sql
select "${abc.max_dt2}" as max_date
It should give the same value as above.
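The conf can also be read back on the Python side if you need the value there:
# Returns the string value set via spark.conf.set above
spark.conf.get("abc.max_dt2")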
Upvotes: 11