jfeigenbaum
jfeigenbaum

Reputation: 413

How to use custom SQL function in dbplyr?

I would like to calculate the Jaro-Winkler string distance in a database. If I bring the data into R (with collect) I can easily use the stringdist function from the stringdist package.

But my data is very large and I'd like to filter on Jaro-Winkler distances before pulling the data into R.

There is SQL code for Jaro-Winkler (https://androidaddicted.wordpress.com/2010/06/01/jaro-winkler-sql-code/ and a version for T-SQL) but I guess I'm not sure how best to get that SQL code to work with dbplyr. I'm happy to try and map the stringdist function to the Jaro-Winkler sql code but I don't know where to start on that. But even something simpler like executing the SQL code directly from R on the remote data would be great.

I had hoped that SQL translation in the dbplyr documentation might help, but I don't think so.

Upvotes: 7

Views: 2847

Answers (3)

stevec
stevec

Reputation: 52268

You can use sql() which runs whatever raw SQL you provide.

Example

Here the lubridate equivalent doesn't work on a database backend.

So instead I place custom SQL code sql("EXTRACT(WEEK FROM meeting_date)") inside sql(), like so:

your_dbplyr_object %>%
  mutate(week = sql("EXTRACT(WEEK FROM meeting_date)"))

One other thing, this is the top search result when searching for how to render the raw sql using dbplyr. In case anyone arrived here from that search, the answer is to pipe to dbplyr::sql_render(). From the docs:

sql_build() creates a select_query S3 object, that is rendered to a SQL string by sql_render()

Upvotes: 3

dsz
dsz

Reputation: 5202

Alternatively, find a way to define the function in SQL as a user-defined function, you can then simply use the name of that function as if it were an R function (in a dbplyr query). When R can't find the function locally, it simply passes it to the SQL back-end and assumes it will be a function that's available in SQL-land.

This is a great way to decouple the logic. Down side is that the dbplyr expression is now dependant on the db-backend; you can't run the same code on a local data set. One way around that is to create a UDF that mimics an existing R function. The dplyr will use the local R and dbplyr will use the SQL UDF.

Upvotes: 1

Simon.S.A.
Simon.S.A.

Reputation: 6931

You can build your own SQL functions in R. They just have to produce a string that is a valid SQL query. I don't know the Jaro-Winkler distance, but I can provide an example for you to build from:

union_all = function(table_a,table_b, list_of_columns){
  # extract database connection
  connection = table_a$src$con

  sql_query = build_sql(con = connection,
                      sql_render(table_a),
                      "\nUNION ALL\n",
                      sql_render(table_b)
  )

  return(tbl(connection, sql(sql_query)))
}

unioned_table = union_all(table_1, table_2, c("who", "where", "when"))

Two key commands here are:

  • sql_render, which takes a dbplyr table and returns the SQL code that produces it
  • build_sql, which assembles a query from strings.

You have choices for your execution command:

  • tbl(connection, sql(sql_query)) will return the resulting table
  • dbExecute(db_connection, as.character(sql_query)) will execute a query without returning the result (useful for for dropping tables, creating indexes, etc.)

Upvotes: 7

Related Questions