Reputation: 1480
Having happily played around with dbplyr on redshift to do basic things, I'm finding it difficult to perform more complex analysis, and I am wondering if this is due to the current tool set of dbplyr, or my mistake.
A typlical mwe:
library(dplyr)
library(dbplyr)
library(forcats)
redshift <- dbConnect(driver, url) # <<<obviously put in specific details here
tbl(redshift, "table") -> mytable
myTable %>% colnames() # This returns the correct colnames, great, working connection!
myTable %>% mutate_all(as_factor) # This gives an error
The error in this case is that dbplyr has sent sql to the database such as this:
SELECT AS_FACTOR("col_1") AS "col_1", AS_FACTOR("col_2") AS "col_2"
And so the error returned is:
Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
Unable to retrieve JDBC result set for SELECT AS_FACTOR("col_1") AS "col_1", AS_FACTOR("col_2") AS "col_2"....
However, as far as I can work out, Redshift has no AS_FACTOR()
function. In this case I would have expected the behavior to pull the data into my local session, and processed it as a factor within R, however it seems to have not registered that a) Redshift doesn't have that function, b) Redshift doesn't support that data type (I believe, though willing to be corrected), c) that the intent was to pull data in and create the factor.
This seems to be supported by the file here, which bumps Redshift into the postgres environment, even though Redshift is far more limited in scope than postgres.
My specific questions are:
Upvotes: 3
Views: 420
Reputation: 675
the challenge is to map the R variable types to proper database types. Today, dbplyr
translates as.numeric()
, as.double()
, as.integer()
, and as.character()
. Any function that dbplyr
doesn't recognize is passed to the database verbatim, that's why the result was as_factor()
. I may be wrong, but Redshift does not have factorized variables, so maybe using as.character()
is a better option. That the one I use coercing into categorical data when working with databases. I encourage you to use one of the current translations, and avoid using collect()
, specially on a very large data set.
Upvotes: 2