Kreitz Gigs

Reputation: 379

Anonymous function in R using sparklyr spark_apply

I am trying to use the spark_apply() function from library(sparklyr). I am using spark_apply() because the sparklyr package does not support the kind of subsetting my code needs. I am a bit lost about where the function(e) needs to go within the following dplyr syntax.

Here is the original syntax I am trying to adapt with an anonymous function (I'm not 100% sure that's the right term):

match_cat3 <- match_cat2 %>%
  group_by(VarE, VarF) %>%
  mutate(Var_G = if (any(Var_C == 1))
    ((VarG - VarG[Var_C == 1]) / (Var_G + Var_G[Var_C == 1]) / 2)
    else NA)

Here is my attempt at using the spark_apply() function with the mutate() call from above. I would love some help with how to use function(e) and where the e goes within the syntax; I don't have any experience using a function inside another function like this.

match_cat3 <- spark_apply(
  function(e)
    match_cat2 %>%
    group_by(e$VarE, e$VarF) %>%
    mutate(e$Var_G = if (any(e$Var_C == 1))
      ((e$VarG - e$VarG[e$Var_C == 1]) / (e$Var_G + e$Var_G[e$Var_C == 1]) / 2)
      else NA, e)
)

This gives me an out of bounds error.

I was basing the syntax on the following block from the spark_apply() documentation.

trees_tbl %>%
  spark_apply(
    function(e) data.frame(2.54 * e$Girth, e),
    names = c("Girth(cm)", colnames(trees)))

Thanks!

Upvotes: 0

Views: 289

Answers (1)

josephD

Reputation: 152

You seem to be having trouble structuring the call to sparklyr::spark_apply(). A template that might be more useful for you starts with your Spark DataFrame:

##### data_sf is a Spark DataFrame that will be sent to all the workers running R
data_sf <- sparklyr::copy_to(sc, iris, overwrite = TRUE)

data2_sf <- sparklyr::spark_apply(
  x = data_sf,
  f = function(x) {  ##### data_sf will be the argument passed to this x parameter
    ##### inside the function, x is a plain R data frame (Spark doesn't like
    ##### `Petal.Length`, so the column names were automatically changed)
    x$Petal_Length <- x$Petal_Length + 10
    return(x)
  })
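One thing this template takes for granted is an existing Spark connection sc. As a minimal sketch (the local master and the collect() check are just illustrations, not part of your setup), you could try it out like this:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local") ##### assumed: a local connection just for testing

##### after running the template above, pull a few rows back into R to inspect
data2_sf %>% head() %>% collect()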

In your case:

  • you're missing the x argument, the first argument to sparklyr::spark_apply()
  • you're bringing in external data (match_cat2) through the e argument of your anonymous function, but you're also, incorrectly, referring to it inside the function's body
  • you're missing curly braces around your multi-line expression, so you aren't actually defining a function body
  • you're using dplyr (and magrittr) with the wrong syntax: inside the function you refer to columns directly, e.g. group_by(VarE), not group_by(e$VarE)
  • you're doing some conditional work in your if ... else (you could also use the ifelse() function here), but I'm not sure what your intent is

Functions passed to spark_apply() are defined as function(data, context) {}, where you can provide arbitrary code within the {} (see Chapter 11.7, Functions).
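Here is a minimal sketch of that (data, context) form, reusing the iris example from above; the offset value and the ctx name are only illustrative:

data3_sf <- sparklyr::spark_apply(
  x = data_sf,
  f = function(df, ctx) {  ##### ctx receives whatever you pass to `context`
    df$Petal_Length <- df$Petal_Length + ctx$offset
    df
  },
  context = list(offset = 10))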
##### Rewritten, maybe helpful?
match_cat3 <- spark_apply(
  x = match_cat2,              ##### the Spark DataFrame you give to spark_apply()
  f = function(e) {            ##### the opening brace
    e %>%                      ##### the function's argument, NOT `match_cat2 %>%`
      group_by(VarE, VarF) %>% ##### remove `e$`
      mutate(Var_G = something_good) ##### not sure of your intent here
  })
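If the intent is the same calculation as in your question, an untested sketch of the something_good part could look like the block below. It assumes VarG and Var_G refer to the same column, and that dplyr can be loaded on the workers (the packages argument of spark_apply() controls shipping your local packages):

match_cat3 <- spark_apply(
  x = match_cat2,
  f = function(e) {
    library(dplyr) ##### dplyr has to be loadable on the workers
    e %>%
      group_by(VarE, VarF) %>%
      mutate(Var_G = if (any(Var_C == 1))
        ((Var_G - Var_G[Var_C == 1]) / (Var_G + Var_G[Var_C == 1]) / 2)
        else NA_real_) %>% ##### NA_real_ keeps the column numeric
      ungroup()
  })

One caveat: group_by() inside the function only sees the rows handed to that particular call, so if your (VarE, VarF) groups can be split across Spark partitions you probably also want the group_by argument of spark_apply() itself.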

Upvotes: 1
