Does dbplyr currently assume a redshift connection is actually a postgres connection?

Question

Having happily played around with dbplyr on redshift to do basic things, I'm finding it difficult to perform more complex analysis, and I am wondering if this is due to the current tool set of dbplyr, or my mistake.

A typlical mwe:

library(dplyr)
library(dbplyr)
library(forcats)

redshift <- dbConnect(driver, url) #  << mytable

myTable %>% colnames() # This returns the correct colnames, great, working connection!

myTable %>% mutate_all(as_factor) # This gives an error

The error in this case is that dbplyr has sent sql to the database such as this:

SELECT AS_FACTOR("col_1") AS "col_1", AS_FACTOR("col_2") AS "col_2"

And so the error returned is:

Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ",  : 
  Unable to retrieve JDBC result set for SELECT AS_FACTOR("col_1") AS "col_1", AS_FACTOR("col_2") AS "col_2"....

However, as far as I can work out, Redshift has no AS_FACTOR() function. In this case I would have expected the behavior to pull the data into my local session, and processed it as a factor within R, however it seems to have not registered that a) Redshift doesn't have that function, b) Redshift doesn't support that data type (I believe, though willing to be corrected), c) that the intent was to pull data in and create the factor.

This seems to be supported by the file here, which bumps Redshift into the postgres environment, even though Redshift is far more limited in scope than postgres.

My specific questions are:

Is my process of deduction correct as to the error?
What is the workaround for this, is the process to collect my results before this stage, and then work local, or have I missed something?

edgararuiz · Accepted Answer

the challenge is to map the R variable types to proper database types. Today, dbplyr translates as.numeric(), as.double(), as.integer(), and as.character(). Any function that dbplyr doesn't recognize is passed to the database verbatim, that's why the result was as_factor(). I may be wrong, but Redshift does not have factorized variables, so maybe using as.character() is a better option. That the one I use coercing into categorical data when working with databases. I encourage you to use one of the current translations, and avoid using collect(), specially on a very large data set.

Does dbplyr currently assume a redshift connection is actually a postgres connection?

Answers (1)

Related Questions