Mel

Reputation: 750

How to use SparkR::read.jdbc() or sparklyr::spark_read_jdbc() to get results of SQL query rather than whole table?

I usually use RODBC locally to query my databases. However, our company has recently moved to Azure Databricks, which does not inherently support RODBC or other ODBC connections, but does support JDBC connections, which I have not previously used.

I have read the documentation for SparkR::read.jdbc() and sparklyr::spark_read_jdbc(), but these seem to pull an entire table from the database rather than just the results of a query. That does not suit my use case: I never need to pull whole tables, and instead run queries that join multiple tables together but return only a very small subset of the data in each table.

I cannot find a method for using the JDBC connector to:

(A) run a query referring to multiple tables on the same database

and

(B) store the results as an R dataframe or something that can very easily be converted to an R dataframe (such as a SparkR or sparklyr dataframe).

If possible, the solution would also require me to specify the connection credentials only once per script/notebook, rather than every time I connect to the database to run a query and store the results as a dataframe.

e.g. is there a JDBC equivalent of the following:

my_server="myserver.database.windows.net"
my_db="mydatabase"
my_username="database_user"
my_pwd="abc123Ineedabetterpassword"


myconnection <- RODBC::odbcDriverConnect(paste0("DRIVER={SQL Server};
                                 server=",my_server,";
                                 database=",my_db,";
                                 uid=",my_username,";
                                 pwd=",my_pwd))

df <- RODBC::sqlQuery(myconnection,
"SELECT a.var1, b.var2, SUM(c.var3) AS Total_Things, AVG(d.var4) AS Mean_Stuff
FROM table_A AS a
JOIN table_B AS b ON a.id = b.a_id
JOIN table_C AS c ON a.id = c.a_id
JOIN table_D AS d ON c.id = d.c_id
WHERE a.filter_var IN (1, 2, 3, 4)
AND d.filter_var LIKE '%potatoes%'
GROUP BY a.var1, b.var2
")

df2 <- RODBC::sqlQuery(myconnection,
"SELECT x.var1, y.var2, z.var3
FROM table_x AS x
LEFT JOIN table_y AS y ON x.id = y.x_id
LEFT JOIN table_z AS z ON x.id = z.x_id
WHERE z.category LIKE '%vegetable%'
AND y.category IN ('A', 'B', 'C')
")

How would I achieve the same results as the above (two R dataframes, df and df2) using the JDBC connectors from SparkR or sparklyr built into Databricks?

I know that I can use the Spark connector and some Scala code (https://learn.microsoft.com/en-us/azure/sql-database/sql-database-spark-connector) to store the query results as a Spark dataframe, convert this to a global temp table, read the global temp table into a SparkR dataframe, and collapse that to an R dataframe. However, this code is very difficult to read, requires me to switch the language to Scala (which I do not know well) for one of the cells in my notebook, and takes a long time because of the number of steps. Because my R scripts often start with several SQL queries, often against multiple different databases, this method gets very time-consuming and makes my scripts almost unreadable. Surely there is a more straightforward way?
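For reference, the tail end of that multi-step route looks something like the sketch below, assuming a preceding Scala cell has already run the query through the Spark connector and registered the results as a global temp view (the view name my_results is hypothetical):

# Global temp views created by the Scala cell live in the reserved
# global_temp database, so SparkR can read them back and collapse
# them to an R dataframe
df <- SparkR::collect(
  SparkR::sql("SELECT * FROM global_temp.my_results")
)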

(We are using Databricks primarily for automation via Logic Apps and Azure Data Factory, and occasionally for increased RAM, rather than for parallel processing; our data (once extracted) are generally not large enough to require parallelisation, and some of the models we use (e.g. lme4::lmer()) do not benefit from it.)

Upvotes: 2

Views: 2097

Answers (1)

Mel

Reputation: 750

I worked this out eventually and want to post the answer here in case anyone else is having issues.

You can use SparkR::read.jdbc() with a query, but you must surround the query in parentheses and alias the result as something, otherwise you will get an ambiguous syntax error. The default port of 1433 works fine for me, but if you have a different kind of SQL database you might need to change this in the URL. You can then call SparkR::collect() on the SparkDataFrame containing the query results to convert it to an R dataframe:

e.g.

myconnection <- "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydatabase;user=database_user;password=abc123Ineedabetterpassword"

library(magrittr)  # for the %>% pipe

df <- SparkR::read.jdbc(myconnection, "(
SELECT a.var1, b.var2, SUM(c.var3) AS Total_Things, AVG(d.var4) AS Mean_Stuff
FROM table_A AS a
JOIN table_B AS b ON a.id = b.a_id
JOIN table_C AS c ON a.id = c.a_id
JOIN table_D AS d ON c.id = d.c_id
WHERE a.filter_var IN (1, 2, 3, 4)
AND d.filter_var LIKE '%potatoes%'
GROUP BY a.var1, b.var2) AS result") %>%
  SparkR::collect()
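The connection string only needs to be defined once per script/notebook, so the second query can simply reuse myconnection. Below is that query in SparkR, followed by a hedged sketch of the sparklyr::spark_read_jdbc() equivalent; the sparklyr version is untested here and assumes sparklyr is installed and the Microsoft SQL Server JDBC driver is available on the cluster.

# Second query, reusing the same connection string (SparkR):
df2 <- SparkR::read.jdbc(myconnection, "(
SELECT x.var1, y.var2, z.var3
FROM table_x AS x
LEFT JOIN table_y AS y ON x.id = y.x_id
LEFT JOIN table_z AS z ON x.id = z.x_id
WHERE z.category LIKE '%vegetable%'
AND y.category IN ('A', 'B', 'C')) AS result2") %>%
  SparkR::collect()

# Sketch of the sparklyr equivalent: the aliased subquery goes in the
# dbtable option, and dplyr::collect() pulls the results into an R dataframe
library(sparklyr)

sc <- spark_connect(method = "databricks")

df2_tbl <- spark_read_jdbc(
  sc,
  name = "result2",  # name for the temporary Spark view
  options = list(
    url      = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydatabase",
    user     = "database_user",
    password = "abc123Ineedabetterpassword",
    driver   = "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    # Same rule as with SparkR: wrap the query in parentheses and alias it
    dbtable  = "(SELECT x.var1, y.var2, z.var3
                 FROM table_x AS x
                 LEFT JOIN table_y AS y ON x.id = y.x_id
                 LEFT JOIN table_z AS z ON x.id = z.x_id
                 WHERE z.category LIKE '%vegetable%'
                 AND y.category IN ('A', 'B', 'C')) AS result2"
  )
)

df2 <- dplyr::collect(df2_tbl)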

Upvotes: 3
