Dean

Reputation: 168

Tidymodels + Spark

I'm trying to fit a simple logistic regression model with tidymodels using the Spark engine. My code works fine when I use set_engine("glm"), but fails when I set the engine to "spark". Any advice would be much appreciated!

library(tidyverse)
library(sparklyr)
library(tidymodels)
train.df <- titanic::titanic_train

train.df <- train.df %>% 
  mutate(Survived = factor(ifelse(Survived == 1, 'Y', 'N')),
         Sex = factor(Sex),
         Pclass = factor(Pclass))

skimr::skim(train.df)
# Just working with a local Spark instance for now.

sc <- spark_connect(master = 'local', version = '3.1')

train.spark.df <- copy_to(sc, train.df)
logistic.regression.recipe <- 
  recipe(Survived ~ PassengerId + Sex + Age + Pclass, data = train.spark.df) %>%
  update_role(PassengerId, new_role = 'ID') %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_impute_linear(all_predictors())

logistic.regression.recipe
summary(logistic.regression.recipe)
logistic.regression.model <- 
  logistic_reg() %>% 
  set_mode("classification") %>% 
  set_engine("spark")

logistic.regression.model
logistic.regression.workflow <- 
  workflow() %>% 
  add_recipe(logistic.regression.recipe) %>% 
  add_model(logistic.regression.model)

logistic.regression.workflow
logistic.regression.final.model <- 
  logistic.regression.workflow %>% 
  fit(data = train.spark.df)

logistic.regression.final.model
Error: `data` must be a data.frame or a matrix, not a tbl_spark.

Thanks for reading!

Upvotes: 2

Views: 772

Answers (1)

Julia Silge

Reputation: 11603

So the support for Spark in tidymodels is not yet even across all the parts of a modeling analysis. The support for modeling in parsnip is good, but we don't have fully featured support for feature engineering in recipes, or for putting those building blocks together in workflows. So, for example, you can fit just the logistic regression model:

library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(sparklyr)
#> 
#> Attaching package: 'sparklyr'
#> The following object is masked from 'package:purrr':
#> 
#>     invoke
#> The following object is masked from 'package:stats':
#> 
#>     filter

sc <- spark_connect(master = "local")
train_sp <- copy_to(sc, titanic::titanic_train, overwrite = TRUE)


log_spec <- logistic_reg() %>% set_engine("spark")

log_spec %>%
  fit(Survived ~ Sex + Fare + Pclass, data = train_sp)
#> parsnip model object
#> 
#> Fit time:  5.1s 
#> Formula: Survived ~ Sex + Fare + Pclass
#> 
#> Coefficients:
#>  (Intercept)     Sex_male         Fare       Pclass 
#>  3.143731639 -2.630648858  0.001450218 -0.917173436

Created on 2021-07-09 by the reprex package (v2.0.0)
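Once the model is fitted this way, prediction should also go through parsnip. A hedged sketch (not run here; it assumes the spark engine's predict() method accepts a tbl_spark as new_data and returns a Spark tbl you can collect()):

```r
# Fit on the Spark DataFrame as above, then predict back onto it.
log_fit <- log_spec %>%
  fit(Survived ~ Sex + Fare + Pclass, data = train_sp)

# Predictions stay in Spark until you collect() them into R memory.
preds <- predict(log_fit, new_data = train_sp)
preds %>% head() %>% collect()
```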

But you can't use recipes and workflows out of the box. You might consider trying something like spark_apply(), but that may be a challenge at the current stage of maturity in tidymodels' integration with Spark.
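To make the spark_apply() idea concrete, here is a rough, untested sketch: prep the recipe from the question against the local data frame, then run bake() on each partition of the Spark DataFrame. The recipe name comes from the question; in practice you would likely need spark_apply()'s context/packages arguments to ship the prepped recipe and the recipes package to the workers, so treat this as an outline rather than working code:

```r
# Prep the recipe locally (prep() needs an in-memory data frame).
prepped <- recipes::prep(logistic.regression.recipe, training = train.df)

# Hypothetical: apply the baked preprocessing partition-by-partition in Spark.
baked_sp <- spark_apply(
  train_sp,
  function(df) recipes::bake(prepped, new_data = df)
)
```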

Upvotes: 3
