Reputation: 168
I'm trying to develop a simple logistic regression model using tidymodels with the Spark engine. My code works fine when I specify set_engine("glm"), but it fails when I set the engine to "spark". Any advice would be much appreciated!
library(tidyverse)
library(sparklyr)
library(tidymodels)
train.df <- titanic::titanic_train
train.df <- train.df %>%
  mutate(Survived = factor(ifelse(Survived == 1, 'Y', 'N')),
         Sex = factor(Sex),
         Pclass = factor(Pclass))
skimr::skim(train.df)
# Just working with Spark locally.
sc <- spark_connect(master = 'local', version = '3.1')
train.spark.df <- copy_to(sc, train.df)
logistic.regression.recipe <-
  recipe(Survived ~ PassengerId + Sex + Age + Pclass, data = train.spark.df) %>%
  update_role(PassengerId, new_role = 'ID') %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_impute_linear(all_predictors())
logistic.regression.recipe
summary(logistic.regression.recipe)
logistic.regression.model <-
  logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("spark")
logistic.regression.model
logistic.regression.workflow <-
  workflow() %>%
  add_recipe(logistic.regression.recipe) %>%
  add_model(logistic.regression.model)
logistic.regression.workflow
logistic.regression.final.model <-
  logistic.regression.workflow %>%
  fit(data = train.spark.df)
logistic.regression.final.model
Error: `data` must be a data.frame or a matrix, not a tbl_spark.
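For reference, the glm variant that works is roughly the following sketch (the same recipe and workflow, just built from and fit on the local data frame with the "glm" engine):
# Sketch of the working glm variant: same preprocessing and workflow,
# but everything stays in a local data frame.
glm.recipe <-
  recipe(Survived ~ PassengerId + Sex + Age + Pclass, data = train.df) %>%
  update_role(PassengerId, new_role = 'ID') %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_impute_linear(all_predictors())

glm.model <-
  logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm")

workflow() %>%
  add_recipe(glm.recipe) %>%
  add_model(glm.model) %>%
  fit(data = train.df)  # a local data frame, so the workflow can prep and fit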
Thanks for reading!
Upvotes: 2
Views: 772
Reputation: 11603
The support for Spark in tidymodels is uneven across the different parts of a modeling analysis. The support for model fitting in parsnip is good, but we don't have fully featured support for feature engineering in recipes or for putting those building blocks together in workflows. So, for example, you can fit just the logistic regression model:
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(sparklyr)
#>
#> Attaching package: 'sparklyr'
#> The following object is masked from 'package:purrr':
#> 
#>     invoke
#> The following object is masked from 'package:stats':
#> 
#>     filter
sc <- spark_connect(master = "local")
train_sp <- copy_to(sc, titanic::titanic_train, overwrite = TRUE)
log_spec <- logistic_reg() %>% set_engine("spark")
log_spec %>%
  fit(Survived ~ Sex + Fare + Pclass, data = train_sp)
#> parsnip model object
#>
#> Fit time: 5.1s
#> Formula: Survived ~ Sex + Fare + Pclass
#>
#> Coefficients:
#>  (Intercept)     Sex_male         Fare       Pclass 
#>  3.143731639 -2.630648858  0.001450218 -0.917173436
Created on 2021-07-09 by the reprex package (v2.0.0)
But you can't use recipes and workflows out of the box. You might consider trying something like spark_apply(), but that may be a challenge at the current stage of maturity in tidymodels' integration with Spark.
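If you want to experiment with that route, here is a very rough sketch of the idea (untested, and the object names are just illustrative): fit a workflow locally on collected data, then use spark_apply() to run predict() on each partition of the Spark DataFrame.
# Rough sketch only: fit locally, then score each Spark partition with predict().
library(tidymodels)
library(sparklyr)

local_train <- dplyr::collect(train_sp) %>%
  dplyr::mutate(Survived = factor(Survived), Sex = factor(Sex))

wf_fit <- workflow() %>%
  add_formula(Survived ~ Sex + Fare + Pclass) %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  fit(data = local_train)

scored <- spark_apply(
  train_sp,
  function(df, ctx) {
    # each partition arrives on the worker as an ordinary data frame
    preds <- predict(ctx$wf, new_data = df)
    cbind(df, pred_class = as.character(preds$.pred_class))
  },
  context = list(wf = wf_fit),  # serialized and shipped to the workers
  packages = TRUE               # workers need parsnip/workflows available
)
This keeps the feature engineering and model fitting on the R side, so it mostly helps with scoring at scale rather than with training on data that doesn't fit in memory.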
Upvotes: 3