Reputation: 379
I am trying to use the spark_apply() function from library(sparklyr). I am using spark_apply() because the sparklyr package does not support subsetting inside dplyr verbs. I am a bit lost about where I need to include the function(e) within the following dplyr syntax.
Here is the original syntax I am trying to adapt with an anonymous function (I'm not 100% sure that's the right term):
```r
match_cat3 <- match_cat2 %>%
  group_by(VarE, VarF) %>%
  mutate(Var_G = if (any(Var_C == 1))
    ((VarG - VarG[Var_C == 1]) / (VarG + VarG[Var_C == 1]) / 2)
  else NA)
```
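To make the grouped calculation concrete, here is a small base-R sketch of what that mutate() computes, using made-up data with the question's column names (Var_C == 1 marks the reference row in each group; the values are invented for illustration):

```r
# Toy data with the question's column names; the values are invented.
df <- data.frame(
  VarE  = c("a", "a", "a", "b", "b"),
  Var_C = c(0, 1, 0, 0, 0),
  VarG  = c(10, 20, 30, 5, 15)
)

# Per group: compare each VarG to the VarG of the row where Var_C == 1,
# or return NA when the group has no such reference row.
norm_group <- function(g, flag) {
  if (any(flag == 1)) {
    ref <- g[flag == 1][1]
    (g - ref) / (g + ref) / 2
  } else {
    rep(NA_real_, length(g))
  }
}

df$Var_G <- unsplit(
  Map(norm_group, split(df$VarG, df$VarE), split(df$Var_C, df$VarE)),
  df$VarE
)
df$Var_G  # group "a" is scaled against its reference VarG of 20; group "b" is all NA
```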
Here is my attempt at using spark_apply() with the mutate() call from above. I would love some help with how to use function(e) and where the e goes within the syntax. I don't have any experience passing a function to another function like this.
```r
match_cat3 <- spark_apply(
  function(e)
    match_cat2 %>%
    group_by(e$VarE, e$VarF) %>%
    mutate(e$Var_G = if (any(e$Var_C == 1))
      ((e$VarG - e$VarG[e$Var_C == 1]) / (e$Var_G + e$Var_G[e$Var_C == 1]) / 2)
    else NA, e)
)
```
This gives me an out of bounds error.
I was basing the syntax off of the following block from the spark_apply() documentation.
```r
trees_tbl %>%
  spark_apply(
    function(e) data.frame(2.54 * e$Girth, e),
    names = c("Girth(cm)", colnames(trees))
  )
```
Thanks!
Upvotes: 0
Views: 289
Reputation: 152
You seem to be having trouble writing a sparklyr::spark_apply() call. A template that might be more useful for you starts with your Spark DataFrame:
```r
# data_sf is a Spark DataFrame that will be sent to all workers for R
data_sf <- sparklyr::copy_to(sc, iris, overwrite = TRUE)

data2_sf <- sparklyr::spark_apply(
  x = data_sf,
  f = function(x) {
    # data_sf is converted to an R object and passed to this x parameter
    # (Spark doesn't like `Petal.Length`, so it automatically renames columns)
    x$Petal_Length <- x$Petal_Length + 10
    return(x)
  })
```
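Because f receives an ordinary R data frame on each worker, you can sanity-check it locally before involving Spark at all. A quick sketch (the gsub() call only mimics Spark's dot-to-underscore column renaming for the local test; it is not something sparklyr asks you to do yourself):

```r
# The worker function from the template above, unchanged.
f <- function(x) {
  x$Petal_Length <- x$Petal_Length + 10
  return(x)
}

# Locally imitate the column renaming Spark performs on iris.
local_iris <- iris
names(local_iris) <- gsub(".", "_", names(local_iris), fixed = TRUE)

out <- f(local_iris)
head(out$Petal_Length)  # the original Petal.Length values shifted by 10
```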
In your case:

- You're missing the x argument, the first in sparklyr::spark_apply()
- You're trying to pass your data (match_cat2) through the e argument of your anonymous function, but improperly putting it inside the definition of the function as well
- You're using dplyr (and magrittr) with wrong syntax: inside the function you can refer to variables like group_by(VarE), not group_by(e$VarE)
- Functions are defined as function(data, context) {}, where you can provide arbitrary code within the {} (see Chapter 11.7 Functions)
- You may want the ifelse() function here, but I'm not sure what your intent is

##### Rewritten, maybe helpful?
```r
match_cat3 <- spark_apply(
  x = match_cat2,                    # the Spark DataFrame you give to spark_apply()
  function(e) {                      # the opening bracket
    e %>%                            # the function's argument, NOT `match_cat2 %>%`
      group_by(VarE, VarF) %>%       # remove `e$`
      mutate(Var_G = something_good) # not sure of your intent here
  })
```
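On the if() vs ifelse() point: if() evaluates one scalar condition, while ifelse() is vectorized over its first argument, which is often what a column calculation needs. A minimal base-R illustration on a toy vector (not the question's data; whether the vectorized form fits depends on the intent asked about above):

```r
x <- c(-2, 0, 3)

# if() needs a single TRUE/FALSE, so any() collapses the vector first:
if (any(x > 0)) "at least one positive" else "none"

# ifelse() returns one result per element:
ifelse(x > 0, "pos", "non-pos")
```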
Upvotes: 1