jfeigenbaum
jfeigenbaum

Reputation: 423

How to use custom SQL function in dbplyr?

I would like to calculate the Jaro-Winkler string distance in a database. If I bring the data into R (with collect) I can easily use the stringdist function from the stringdist package.

But my data is very large and I'd like to filter on Jaro-Winkler distances before pulling the data into R.

There is SQL code for Jaro-Winkler (https://androidaddicted.wordpress.com/2010/06/01/jaro-winkler-sql-code/ and a version for T-SQL) but I guess I'm not sure how best to get that SQL code to work with dbplyr. I'm happy to try and map the stringdist function to the Jaro-Winkler sql code but I don't know where to start on that. But even something simpler like executing the SQL code directly from R on the remote data would be great.

I had hoped that SQL translation in the dbplyr documentation might help, but I don't think so.

Upvotes: 8

Views: 2891

Answers (3)

stevec
stevec

Reputation: 52917

To use raw SQL in chained dplyr functions

You can use sql() which runs whatever raw SQL you provide.

Example

Here the lubridate equivalent doesn't work on a database backend.

So instead I place custom SQL code sql("EXTRACT(WEEK FROM meeting_date)") inside sql(), like so:

your_dbplyr_object %>%
  mutate(week = sql("EXTRACT(WEEK FROM meeting_date)"))

To display raw SQL generated by dplyr / dbplyr

Sometimes you want to see the SQL generated via dbplyr. To do so, instead of piping to collect(), pipe to dbplyr::sql_render().

From docs:

sql_build() creates a select_query S3 object, that is rendered to a SQL string by sql_render()

To execute a raw SQL query, use odbc::dbGetQuery()

library(odbc)

# assume 'con' is your db connection object
dbGetQuery(con, "SELECT * FROM users;")
# will execute SQL and print results to console

Upvotes: 3

dsz
dsz

Reputation: 5212

Alternatively, find a way to define the function in SQL as a user-defined function, you can then simply use the name of that function as if it were an R function (in a dbplyr query). When R can't find the function locally, it simply passes it to the SQL back-end and assumes it will be a function that's available in SQL-land.

This is a great way to decouple the logic. Down side is that the dbplyr expression is now dependant on the db-backend; you can't run the same code on a local data set. One way around that is to create a UDF that mimics an existing R function. The dplyr will use the local R and dbplyr will use the SQL UDF.

Upvotes: 1

Simon.S.A.
Simon.S.A.

Reputation: 6941

You can build your own SQL functions in R. They just have to produce a string that is a valid SQL query. I don't know the Jaro-Winkler distance, but I can provide an example for you to build from:

union_all = function(table_a,table_b, list_of_columns){
  # extract database connection
  connection = table_a$src$con

  sql_query = build_sql(con = connection,
                      sql_render(table_a),
                      "\nUNION ALL\n",
                      sql_render(table_b)
  )

  return(tbl(connection, sql(sql_query)))
}

unioned_table = union_all(table_1, table_2, c("who", "where", "when"))

Two key commands here are:

  • sql_render, which takes a dbplyr table and returns the SQL code that produces it
  • build_sql, which assembles a query from strings.

You have choices for your execution command:

  • tbl(connection, sql(sql_query)) will return the resulting table
  • dbExecute(db_connection, as.character(sql_query)) will execute a query without returning the result (useful for for dropping tables, creating indexes, etc.)

Upvotes: 7

Related Questions