Reputation: 889
try to use crossfold resampling and fit a random forest from the ranger package. The fit without resampling works but once I try a resample fit it fails with error below.
Consider following df
df<-structure(list(a = c(1379405931, 732812609, 18614430, 1961678341,
2362202769, 55687714, 72044715, 236503454, 61988734, 2524712675,
98081131, 1366513385, 48203585, 697397991, 28132854), b = structure(c(1L,
6L, 2L, 5L, 7L, 8L, 8L, 1L, 3L, 4L, 3L, 5L, 7L, 2L, 2L), .Label = c("CA",
"IA", "IL", "LA", "MA", "MN", "TX", "WI"), class = "factor"),
c = structure(c(2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L,
2L, 2L, 2L, 1L), .Label = c("R", "U"), class = "factor"),
d = structure(c(3L, 3L, 1L, 3L, 3L, 1L, 1L, 3L, 1L, 3L, 1L,
3L, 2L, 3L, 1L), .Label = c("CAH", "LTCH", "STH"), class = "factor"),
e = structure(c(3L, 2L, 3L, 3L, 1L, 3L, 3L, 3L, 2L, 4L, 2L,
2L, 3L, 3L, 3L), .Label = c("cancer", "general long term",
"psychiatric", "rehabilitation"), class = "factor")), row.names = c(NA,
-15L), class = c("tbl_df", "tbl", "data.frame"))
Following simple fit works without issues
library(tidymodels)
library(ranger)
rf_spec <- rand_forest(mode = 'regression') %>%
set_engine('ranger')
rf_spec %>%
fit(a ~. , data = df)
But as soon as I want to run the cross validation via
rf_folds <- vfold_cv(df, strata = c)
fit_resamples(a ~ . ,
rf_spec,
rf_folds)
Following error
model: Error in parse.formula(formula, data, env = parent.frame()): Error: Illegal column names in formula interface. Fix column names or use alternative interface in ranger.
Upvotes: 0
Views: 1033
Reputation: 14316
Julia bet me to it so she gets the karma. I had the same answer (I do what she does but slower):
This is kind of a bug and we've been working to the best way to make it not error. It's complicated. Let me explain.
ranger
is one of a few R packages whose formula method does not create dummy variables (sensibly since it does not need them).
The infrastructure in tune
uses the workflows
package to process the formula then hand the resulting data over to ranger
. By default, workflows does create dummy variables and, since some of your factor levels are not valid R column names (e.g. "general long term"
), ranger()
kicks an error.
(I know that you didn't use a workflow, but that is what happens under the hood).
We are working on the best thing to do here since most users don't know that many tree-based model packages do not produce dummy variables. To make it more complex, parsnip
doesn't use workflows (yet) and did not have given you an error.
Solution for now
Use a simple recipe instead of a formula:
library(tidymodels)
#> ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────── tidymodels 0.1.0 ──
#> ✓ broom 0.5.4 ✓ recipes 0.1.10
#> ✓ dials 0.0.4 ✓ rsample 0.0.5
#> ✓ dplyr 0.8.5 ✓ tibble 2.1.3
#> ✓ ggplot2 3.3.0 ✓ tune 0.0.1
#> ✓ infer 0.5.1 ✓ workflows 0.1.0
#> ✓ parsnip 0.0.5 ✓ yardstick 0.0.5
#> ✓ purrr 0.3.3
#> ── Conflicts ────────────────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#> x ggplot2::margin() masks dials::margin()
#> x recipes::step() masks stats::step()
df<-structure(list(a = c(1379405931, 732812609, 18614430, 1961678341,
2362202769, 55687714, 72044715, 236503454, 61988734, 2524712675,
98081131, 1366513385, 48203585, 697397991, 28132854), b = structure(c(1L,
6L, 2L, 5L, 7L, 8L, 8L, 1L, 3L, 4L, 3L, 5L, 7L, 2L, 2L), .Label = c("CA",
"IA", "IL", "LA", "MA", "MN", "TX", "WI"), class = "factor"),
c = structure(c(2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L,
2L, 2L, 2L, 1L), .Label = c("R", "U"), class = "factor"),
d = structure(c(3L, 3L, 1L, 3L, 3L, 1L, 1L, 3L, 1L, 3L, 1L,
3L, 2L, 3L, 1L), .Label = c("CAH", "LTCH", "STH"), class = "factor"),
e = structure(c(3L, 2L, 3L, 3L, 1L, 3L, 3L, 3L, 2L, 4L, 2L,
2L, 3L, 3L, 3L), .Label = c("cancer", "general long term",
"psychiatric", "rehabilitation"), class = "factor")), row.names = c(NA,
-15L), class = c("tbl_df", "tbl", "data.frame"))
library(tidymodels)
library(ranger)
rf_spec <- rand_forest(mode = 'regression') %>%
set_engine('ranger')
rf_folds <- vfold_cv(df, strata = c)
fit_resamples(recipe(a ~ ., data = df), rf_spec, rf_folds)
#> # 10-fold cross-validation using stratification
#> # A tibble: 9 x 4
#> splits id .metrics .notes
#> * <list> <chr> <list> <list>
#> 1 <split [13/2]> Fold1 <tibble [2 × 3]> <tibble [0 × 1]>
#> 2 <split [13/2]> Fold2 <tibble [2 × 3]> <tibble [0 × 1]>
#> 3 <split [13/2]> Fold3 <tibble [2 × 3]> <tibble [0 × 1]>
#> 4 <split [13/2]> Fold4 <tibble [2 × 3]> <tibble [0 × 1]>
#> 5 <split [13/2]> Fold5 <tibble [2 × 3]> <tibble [0 × 1]>
#> 6 <split [13/2]> Fold6 <tibble [2 × 3]> <tibble [0 × 1]>
#> 7 <split [14/1]> Fold7 <tibble [2 × 3]> <tibble [0 × 1]>
#> 8 <split [14/1]> Fold8 <tibble [2 × 3]> <tibble [0 × 1]>
#> 9 <split [14/1]> Fold9 <tibble [2 × 3]> <tibble [0 × 1]>
# FYI `tune` 0.0.2 will require a different argument order:
# rf_spec %>% fit_resamples(recipe(a ~ ., data = df), rf_folds)
Created on 2020-03-20 by the reprex package (v0.3.0)
Upvotes: 3
Reputation: 11623
The commenter above is correct that the source of the issue here is the spaces in the factor column. The functions for resampling and the functions for just plain old fitting currently handle that differently, and we are actively looking into how to solve this problem for users. Thank you for your patience!
In the meantime, I would recommend setting up a simple workflow()
plus a recipe()
, which together will handle all the necessary dummy variable munging for you.
library(tidymodels)
rf_spec <- rand_forest(mode = "regression") %>%
set_engine("ranger")
rf_wf <- workflow() %>%
add_model(rf_spec) %>%
add_recipe(recipe(a ~ ., data = df))
fit(rf_wf, data = df)
#> ══ Workflow [trained] ═══════════════════════════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: rand_forest()
#>
#> ── Preprocessor ─────────────────────────────────────────────────────────────────────────────────────────────────
#> 0 Recipe Steps
#>
#> ── Model ────────────────────────────────────────────────────────────────────────────────────────────────────────
#> Ranger result
#>
#> Call:
#> ranger::ranger(formula = formula, data = data, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
#>
#> Type: Regression
#> Number of trees: 500
#> Sample size: 15
#> Number of independent variables: 4
#> Mtry: 2
#> Target node size: 5
#> Variable importance mode: none
#> Splitrule: variance
#> OOB prediction error (MSE): 4.7042e+17
#> R squared (OOB): 0.4341146
rf_folds <- vfold_cv(df, strata = c)
fit_resamples(rf_wf,
rf_folds)
#> # 10-fold cross-validation using stratification
#> # A tibble: 9 x 4
#> splits id .metrics .notes
#> <list> <chr> <list> <list>
#> 1 <split [13/2]> Fold1 <tibble [2 × 3]> <tibble [0 × 1]>
#> 2 <split [13/2]> Fold2 <tibble [2 × 3]> <tibble [0 × 1]>
#> 3 <split [13/2]> Fold3 <tibble [2 × 3]> <tibble [0 × 1]>
#> 4 <split [13/2]> Fold4 <tibble [2 × 3]> <tibble [0 × 1]>
#> 5 <split [13/2]> Fold5 <tibble [2 × 3]> <tibble [0 × 1]>
#> 6 <split [13/2]> Fold6 <tibble [2 × 3]> <tibble [0 × 1]>
#> 7 <split [14/1]> Fold7 <tibble [2 × 3]> <tibble [0 × 1]>
#> 8 <split [14/1]> Fold8 <tibble [2 × 3]> <tibble [0 × 1]>
#> 9 <split [14/1]> Fold9 <tibble [2 × 3]> <tibble [0 × 1]>
Created on 2020-03-20 by the reprex package (v0.3.0)
Upvotes: 5