Reputation: 713
I feel like I must be missing something obvious here, but I can't seem to dynamically set a variable value in Spark SQL.
Let's say I have two tables, tableSrc
and tableBuilder
, and I'm creating tableDest
.
I've been trying variants on
SET myVar FLOAT = NULL
SELECT
myVar = avg(myCol)
FROM tableSrc;
CREATE TABLE tableDest(
refKey INT,
derivedValue FLOAT
);
INSERT INTO tableDest
SELECT
refKey,
neededValue * myVar AS `derivedValue`
FROM tableBuilder
Doing this in T-SQL is trivial, in a surprising win for Microsoft (DECLARE
...SELECT
). Spark, however, throws
Error in SQL statement: ParseException: mismatched input 'SELECT' expecting <EOF>(line 53, pos 0)
but I can't seem to assign a derived value to a variable for reuse. I tried a few variants, but the closest I got was assigning a variable to a string of a select statement.
Please note that this is being adapted from a fully functional script in T-SQL, and so I'd just as soon not split out the dozen or so SQL variables to compute all those variables with Python spark queries just to insert {var1}
, {var2}
, etc in a multi hundred line f-string. I know how to do this, but it will be messy, difficult, harder to read, slower to migrate, and worse to maintain and would like to avoid this if at all possible.
September 2024 Update:
Databricks Runtime 14.1 and higher now properly supports variables.
-- DBR 14.1+
DECLARE VARIABLE dataSourceStr STRING = "foobar";
SELECT * FROM hive_metastore.mySchema.myTable WHERE dataSource = dataSourceStr;
-- Returns where dataSource column is 'foobar'
Upvotes: 16
Views: 75755
Reputation: 4569
One possible way to go is using CTE SQL syntax. This is extremly useful in general, and can solve your original question like this:
-- Create your table
CREATE TABLE IF NOT EXISTS tableDest(
refKey INT,
derivedValue FLOAT
);
-- CTE expression with insert statement
WITH calc AS ( SELECT
avg(myCol) myVar
FROM tableSrc
)
INSERT INTO tableDest
SELECT
refKey,
neededValue * calc.myVar AS `derivedValue`
FROM tableBuilder, calc
Databricks seems to have good support for CTE expressions.
Upvotes: 0
Reputation: 11
declare and set syntax is now supported, but only on databricks runtime 14.1 and later
Upvotes: 1
Reputation: 1109
Databricks now has widgets for SQL also https://docs.databricks.com/notebooks/widgets.html#widgets-in-sql
CREATE WIDGET TEXT p_file_date DEFAULT "2021-03-21";
Select * from results where results.file_date = getArgument("p_file_date")
Upvotes: 2
Reputation: 781
Databricks just released SQL user defined functions, which can deal with the similar problem with no performance penalty, for your example it would look like:
CREATE TEMP FUNCTION myVar()
RETURNS FLOAT
LANGUAGE SQL
RETURN
SELECT
avg(myCol)
FROM tableSrc;
And then for use:
SELECT
refKey,
neededValue * myVar() AS `derivedValue`
FROM tableBuilder
Upvotes: 10
Reputation: 51
I've circled around this issue for a long time. Finally, I've found a workaround using @Ronieri Marques solution plus some pyspark functions. I'll try to provide a full working code below:
first I create a sample table:
%sql
create table if not exists calendar
as
select '2021-01-01' as date
union
select '2021-01-02' as date
union
select '2021-01-03' as date
%sql
-- just to show the max and min dates
select max(date), min(date) from calendar
Combining sqlContext + toJSON it is possible to dynamically assign a value to the variable, in this case I use a query:
%python
result = sqlContext.sql("select max(date), min(date) from calendar").toJSON()
spark.conf.set('date.end' , result.first()[14:24])
spark.conf.set('date.start' , result.first()[39:49])
Finally it will be possible to use the variables inside a SQL query:
%sql
select * from calendar where date > '${date.start}' and date < '${date.end}'
Note that the substring result.first()[14:24] and result.first()[39:49] are necessary because the value of result.first() is {"max(date)":"2021-01-03","min(date)":"2021-01-01"} so we need to "tailor" the final result picking up only the values we need.
Probably the code can be polished but right now it is the only working solution I've managed to implement.
I hope this solution could be useful for someone.
Upvotes: 2
Reputation: 459
The SET command used is for spark.conf get/set, not a variable for SQL queries
For SQL queries you should use widgets:
https://docs.databricks.com/notebooks/widgets.html
But, there is a way of using spark.conf parameters on SQL:
%python spark.conf.set('personal.foo','bar')
Then you can use:
$sql select * from table where column = '${personal.foo}';
The trick part is that you have to use a "dot" (or other special character) on the name of the spark.conf, or SQL cells will expect you to provide value to the $variable on run time (It looks like a bug to me, i believe rounding with {} should be enough)
Upvotes: 27
Reputation: 462
You are missing a semi-colon at the end of the variable assignment.
SET myVar FLOAT = NULL;
...
Hope it helps :)
Upvotes: 1