Dean

Reputation: 168

Tidymodels + Spark

I'm trying to fit a simple logistic regression model with tidymodels using the Spark engine. My code works fine when I use set_engine("glm"), but fails when I set the engine to "spark". Any advice would be much appreciated!

library(tidyverse)
library(sparklyr)
library(tidymodels)
train.df <- titanic::titanic_train

train.df <- train.df %>% 
  mutate(Survived = factor(ifelse(Survived == 1, 'Y', 'N')),
         Sex = factor(Sex),
         Pclass = factor(Pclass))

skimr::skim(train.df)
# Just working with a local Spark instance for now.

sc <- spark_connect(master = 'local', version = '3.1')

train.spark.df <- copy_to(sc, train.df)
logistic.regression.recipe <- 
  recipe(Survived ~ PassengerId + Sex + Age + Pclass, data = train.spark.df) %>%
  update_role(PassengerId, new_role = 'ID') %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_impute_linear(all_predictors())

logistic.regression.recipe
summary(logistic.regression.recipe)
logistic.regression.model <- 
  logistic_reg() %>% 
  set_mode("classification") %>% 
  set_engine("spark")

logistic.regression.model
logistic.regression.workflow <- 
  workflow() %>% 
  add_recipe(logistic.regression.recipe) %>% 
  add_model(logistic.regression.model)

logistic.regression.workflow
logistic.regression.final.model <- 
  logistic.regression.workflow %>% 
  fit(data = train.spark.df)

logistic.regression.final.model
Error: `data` must be a data.frame or a matrix, not a tbl_spark.

Thanks for reading!

Upvotes: 2

Views: 772

Answers (1)

Julia Silge

Reputation: 11603

So the support for Spark in tidymodels is not yet even across all the parts of a modeling analysis. The support for modeling in parsnip is good, but we don't have fully featured support for feature engineering in recipes, or for putting those building blocks together in workflows. So, for example, you can fit just the logistic regression model:

library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(sparklyr)
#> 
#> Attaching package: 'sparklyr'
#> The following object is masked from 'package:purrr':
#> 
#>     invoke
#> The following object is masked from 'package:stats':
#> 
#>     filter

sc <- spark_connect(master = "local")
train_sp <- copy_to(sc, titanic::titanic_train, overwrite = TRUE)


log_spec <- logistic_reg() %>% set_engine("spark")

log_spec %>%
  fit(Survived ~ Sex + Fare + Pclass, data = train_sp)
#> parsnip model object
#> 
#> Fit time:  5.1s 
#> Formula: Survived ~ Sex + Fare + Pclass
#> 
#> Coefficients:
#>  (Intercept)     Sex_male         Fare       Pclass 
#>  3.143731639 -2.630648858  0.001450218 -0.917173436

Created on 2021-07-09 by the reprex package (v2.0.0)
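Once the model is fitted this way, prediction should also go through parsnip. A hedged sketch (not run here; it assumes the spark engine's predict() method accepts a tbl_spark as new_data and returns a Spark tbl you can collect()):

```r
# Fit on the Spark DataFrame as above, then predict back onto it.
log_fit <- log_spec %>%
  fit(Survived ~ Sex + Fare + Pclass, data = train_sp)

# Predictions stay in Spark until you collect() them into R memory.
preds <- predict(log_fit, new_data = train_sp)
preds %>% head() %>% collect()
```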

But you can't use recipes and workflows out of the box. You might consider trying something like spark_apply(), but that may be a challenge at the current stage of maturity in tidymodels' integration with Spark.
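To make the spark_apply() idea concrete, here is a rough, untested sketch: prep the recipe from the question against the local data frame, then run bake() on each partition of the Spark DataFrame. The recipe name comes from the question; in practice you would likely need spark_apply()'s context/packages arguments to ship the prepped recipe and the recipes package to the workers, so treat this as an outline rather than working code:

```r
# Prep the recipe locally (prep() needs an in-memory data frame).
prepped <- recipes::prep(logistic.regression.recipe, training = train.df)

# Hypothetical: apply the baked preprocessing partition-by-partition in Spark.
baked_sp <- spark_apply(
  train_sp,
  function(df) recipes::bake(prepped, new_data = df)
)
```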

Upvotes: 3
