Mel

Reputation: 750

How to use SparkR::read.jdbc() or sparklyr::spark_read_jdbc() to get results of SQL query rather than whole table?

I usually use RODBC locally to query my databases. However, our company has recently moved to Azure Databricks, which does not inherently support RODBC or other ODBC connections, but does support JDBC connections, which I have not previously used.

I have read the documentation for SparkR::read.jdbc() and sparklyr::spark_read_jdbc(), but these seem to pull an entire table from the database rather than just the results of a query. That does not suit my use case: I never need to pull whole tables, and instead run queries that join multiple tables together but return only a very small subset of the data in each table.

I cannot find a method for using the JDBC connector to:

(A) run a query referring to multiple tables on the same database

and

(B) store the results as an R dataframe or something that can very easily be converted to an R dataframe (such as a SparkR or sparklyr dataframe).

If possible, the solution would also require me to specify the connection credentials only once per script/notebook, rather than every time I connect to the database to run a query and store the results as a dataframe.

e.g. is there a JDBC equivalent of the following:

my_server="myserver.database.windows.net"
my_db="mydatabase"
my_username="database_user"
my_pwd="abc123Ineedabetterpassword"


myconnection <- RODBC::odbcDriverConnect(paste0("DRIVER={SQL Server};
                                 server=",my_server,";
                                 database=",my_db,";
                                 uid=",my_username,";
                                 pwd=",my_pwd))

df <- RODBC::sqlQuery(myconnection,
"SELECT a.var1, b.var2, SUM(c.var3) AS Total_Things, AVG(d.var4) AS Mean_Stuff
FROM table_A AS a
JOIN table_B AS b ON a.id = b.a_id
JOIN table_C AS c ON a.id = c.a_id
JOIN table_D AS d ON c.id = d.c_id
WHERE a.filter_var IN (1, 2, 3, 4)
AND d.filter_var LIKE '%potatoes%'
GROUP BY a.var1, b.var2
")

df2 <- RODBC::sqlQuery(myconnection,
"SELECT x.var1, y.var2, z.var3
FROM table_x AS x
LEFT JOIN table_y AS y ON x.id = y.x_id
LEFT JOIN table_z AS z ON x.id = z.x_id
WHERE z.category LIKE '%vegetable%'
AND y.category IN ('A', 'B', 'C')
")

How would I achieve the same results as the above (two R dataframes, df and df2) using the JDBC connectors from SparkR or sparklyr built into Databricks?

I know that I can use the Spark connector and some Scala code (https://learn.microsoft.com/en-us/azure/sql-database/sql-database-spark-connector) to store the query results as a Spark dataframe, convert this to a global temp table, read the global temp table into a SparkR dataframe, and collapse that to an R dataframe. However, this code is very difficult to read, requires me to switch the language to Scala (which I do not know well) for one of the cells in my notebook, and takes a long time because of the number of steps. Because my R scripts often start with several SQL queries, often against multiple different databases, this method gets very time-consuming and makes my scripts almost unreadable. Surely there is a more straightforward way?
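For reference, the tail end of that multi-step route looks something like the sketch below, assuming a preceding Scala cell has already run the query through the Spark connector and registered the results as a global temp view (the view name my_results is hypothetical):

# Global temp views created by the Scala cell live in the reserved
# global_temp database, so SparkR can read them back and collapse
# them to an R dataframe
df <- SparkR::collect(
  SparkR::sql("SELECT * FROM global_temp.my_results")
)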

(We are using Databricks primarily for automation via Logic Apps and Azure Data Factory, and occasionally for increased RAM, rather than for parallel processing; our data (once extracted) are generally not large enough to require parallelisation, and some of the models we use (e.g. lme4::lmer()) do not benefit from it.)

Upvotes: 2

Views: 2097

Answers (1)

Mel

Reputation: 750

I worked this out eventually and want to post the answer here in case anyone else is having issues.

You can use SparkR::read.jdbc() with a query, but you must surround the query in parentheses and alias the result as something, otherwise you will get an ambiguous syntax error. The default port of 1433 works fine for me, but if you have a different kind of SQL database you might need to change this in the URL. You can then call SparkR::collect() on the SparkDataFrame containing the query results to convert it to an R dataframe:

e.g.

myconnection <- "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydatabase;user=database_user;password=abc123Ineedabetterpassword"

library(magrittr)  # for the %>% pipe

df <- SparkR::read.jdbc(myconnection, "(
SELECT a.var1, b.var2, SUM(c.var3) AS Total_Things, AVG(d.var4) AS Mean_Stuff
FROM table_A AS a
JOIN table_B AS b ON a.id = b.a_id
JOIN table_C AS c ON a.id = c.a_id
JOIN table_D AS d ON c.id = d.c_id
WHERE a.filter_var IN (1, 2, 3, 4)
AND d.filter_var LIKE '%potatoes%'
GROUP BY a.var1, b.var2) AS result") %>%
  SparkR::collect()
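The connection string only needs to be defined once per script/notebook, so the second query can simply reuse myconnection. Below is that query in SparkR, followed by a hedged sketch of the sparklyr::spark_read_jdbc() equivalent; the sparklyr version is untested here and assumes sparklyr is installed and the Microsoft SQL Server JDBC driver is available on the cluster.

# Second query, reusing the same connection string (SparkR):
df2 <- SparkR::read.jdbc(myconnection, "(
SELECT x.var1, y.var2, z.var3
FROM table_x AS x
LEFT JOIN table_y AS y ON x.id = y.x_id
LEFT JOIN table_z AS z ON x.id = z.x_id
WHERE z.category LIKE '%vegetable%'
AND y.category IN ('A', 'B', 'C')) AS result2") %>%
  SparkR::collect()

# Sketch of the sparklyr equivalent: the aliased subquery goes in the
# dbtable option, and dplyr::collect() pulls the results into an R dataframe
library(sparklyr)

sc <- spark_connect(method = "databricks")

df2_tbl <- spark_read_jdbc(
  sc,
  name = "result2",  # name for the temporary Spark view
  options = list(
    url      = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydatabase",
    user     = "database_user",
    password = "abc123Ineedabetterpassword",
    driver   = "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    # Same rule as with SparkR: wrap the query in parentheses and alias it
    dbtable  = "(SELECT x.var1, y.var2, z.var3
                 FROM table_x AS x
                 LEFT JOIN table_y AS y ON x.id = y.x_id
                 LEFT JOIN table_z AS z ON x.id = z.x_id
                 WHERE z.category LIKE '%vegetable%'
                 AND y.category IN ('A', 'B', 'C')) AS result2"
  )
)

df2 <- dplyr::collect(df2_tbl)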

Upvotes: 3
