Reputation: 10964
I have a dataset that contains a number of (factor) variables with the prefix "cat_".
library(tidyverse)
library(modelr)
library(lazyeval)
library(purrr)
# create the dataset
df_foo = wakefield::r_data_frame(
n = 100,
wakefield::r_series(wakefield::r_sample, j = 5, name = "cat"),
Y = wakefield::normal()
)
I want to be able to compute the pairwise, k-fold cross-validated regression R2 of each of these factor variables with the response variable using the tidy framework.
It is easy to compute this across folds for a few variables as below.
df_foo %>%
mutate_at(.funs = funs(as.factor), .cols = vars(starts_with("cat"))) %>%
crossv_kfold(k = 10, id = "id") %>%
mutate_(
.dots = setNames(
list(
interp(
quote(
purrr::map_dbl(train, .f = function(train_data) {
summary(stats::lm(Y ~ cat_1, data = train_data))$r.squared
}))),
interp(
quote(
purrr::map_dbl(train, .f = function(train_data) {
summary(stats::lm(Y ~ cat_2, data = train_data))$r.squared
})))
),
nm = c("cat_1", "cat_2")
)
)
How to generalize this to an arbitrary number of variables?
Why do I explicitly have to use the namespace accessors for the functions purrr::map_dbl
and stats::lm
(the logic above will not work if I remove the namespace accessors)?
The following code gets the R2 for each of the variables, but it cannot be flattened out to a number of columns equal to the number of variables in the dataset.
make_r2_variable = function(var_name, train_data) {
summary(stats::lm(Y ~ var_name, data = train_data))$r.squared
}
make_r2 = function(train_data) {
summarise_at(
.tbl = data.frame(train_data),
.cols = vars(starts_with("cat_")),
.funs = funs(make_r2_variable(., train_data = train_data))
)
}
df_foo = df_foo %>%
mutate_at(.funs = funs(as.factor), .cols = vars(starts_with("cat"))) %>%
crossv_kfold(k = 10, id = "id") %>%
mutate(
R2 = map(.x = train, .f = make_r2)
)
Upvotes: 1
Views: 76
Reputation: 10964
The solution that I think is as compact as it can get is this:
make_r2_variable = function(var_name, train_data) {
summary(stats::lm(Y ~ var_name, data = train_data))$r.squared
}
make_r2 = function(train_data) {
summarise_at(
.tbl = data.frame(train_data),
.cols = vars(starts_with("cat_")),
.funs = funs(make_r2_variable(., train_data = train_data))
)
}
df_foo = df_foo %>%
mutate_at(.funs = funs(as.factor), .cols = vars(starts_with("cat"))) %>%
crossv_kfold(k = 10, id = "id") %>%
mutate(
R2 = map(.x = train, .f = make_r2)
) %>%
unnest(R2)
This is basically the solution that I had in the edit plus unnest
. This basically mutates the S3: resample
column using map
and within that it uses mutate_at
to cycle over the columns that match the criterion. Since that returns a list/1D data.frame
, a call to unnest
is required.
Upvotes: 1