Reputation: 641
A reproducible example (adapted from @forestfanjoe's answer):
library(dplyr)
library(sparklyr)
sc <- spark_connect(master = "local")
df <- data.frame(id = 1:100, PaymentHistory = runif(n = 100, min = -1, max = 2))
df <- copy_to(sc, df, "payment")
> head(df)
# Source: spark<?> [?? x 2]
id PaymentHistory
* <int> <dbl>
1 1 -0.138
2 2 -0.249
3 3 -0.805
4 4 1.30
5 5 1.54
6 6 0.936
fix_PaymentHistory <- function(df){df %>% dplyr::mutate(PaymentHistory = dplyr::if_else(PaymentHistory < 0, 0, dplyr::if_else(PaymentHistory > 1,1, PaymentHistory)))}
df %>% fix_PaymentHistory
The error is:
Error in dplyr::if_else(PaymentHistory < 0, 0, dplyr::if_else(PaymentHistory > :
object 'PaymentHistory' not found
I'm using the scope operator because I'm afraid that the name in dplyr
will clash with some of the user-defined code. Note that PaymentHistory
is a column variable in df
.
The same error is not present when running the following code:
fix_PaymentHistory <- function(df){
df %>% mutate(PaymentHistory = if_else(PaymentHistory < 0, 0,if_else(PaymentHistory > 1,1, PaymentHistory)))
}
> df %>% fix_PaymentHistory
# Source: spark<?> [?? x 2]
id PaymentHistory
* <int> <dbl>
1 1 0
2 2 0
3 3 0
4 4 1
5 5 1
6 6 0.936
7 7 0
8 8 0.716
9 9 0
10 10 0.0831
# ... with more rows
Upvotes: 2
Views: 515
Reputation: 330163
TL;DR Because your code doesn't use dplyr::if_else
at all.
sparklyr
, when used as in the example, treats Spark as yet another database and issues queries using dbplyr
SQL translation layer.
In this context if_else
is no treated as a function, but an identifier which is converted to SQL primitives:
dbplyr::translate_sql(if_else(PaymentHistory < 0, 0,if_else(PaymentHistory > 1,1, PaymentHistory)))
# <SQL> CASE WHEN ("PaymentHistory" < 0.0) THEN (0.0) WHEN NOT("PaymentHistory" < 0.0) THEN (CASE WHEN ("PaymentHistory" > 1.0) THEN (1.0) WHEN NOT("PaymentHistory" > 1.0) THEN ("PaymentHistory") END) END
However if you pass a fully qualified named, it will circumvent this mechanism, try to evaluate the function, and ultimately fail, because the database columns are not in the scope.
I'm afraid that the name in dplyr will clash with some of the user-defined code.
As you see, there is no need for dplyr to be in scope here at all - functions called in sparklyr
pipelines are either translated to corresponding SQL constructs, or if there is no specific translation rule in place, passed as-is and resolved by Spark SQL engine (this path is used to invoke Spark functions).
Of course this mechanism is not specific to sparklyr
and you're likely to see the same behavior using other table backed by a database:
library(magrittr)
db <- dplyr::src_sqlite(":memory:", TRUE)
dplyr::copy_to(db, mtcars)
db %>% dplyr::tbl("mtcars") %>% dplyr::mutate(dplyr::if_else(mpg < 20, 1, 0))
Error in dplyr::if_else(mpg < 20, 1, 0) : object 'mpg' not found
Upvotes: 5