tchakravarty
tchakravarty

Reputation: 10964

modelr: cross-validated model fitting for all variables in a dataset that match criterion

I have a dataset that contains a number of (factor) variables with the prefix "cat_".

library(tidyverse)
library(modelr)
library(lazyeval)
library(purrr)

# create the dataset
df_foo = wakefield::r_data_frame(
  n = 100,
  wakefield::r_series(wakefield::r_sample, j = 5, name = "cat"),
  Y = wakefield::normal()
)

I want to be able to compute the pairwise, k-fold cross-validated regression R2 of each of these factor variables with the response variable using the tidy framework.

It is easy to compute this across folds for a few variables as below.

df_foo %>% 
  mutate_at(.funs = funs(as.factor), .cols = vars(starts_with("cat"))) %>% 
  crossv_kfold(k = 10, id = "id") %>% 
  mutate_(
    .dots = setNames(
      list(
        interp(
          quote(
            purrr::map_dbl(train, .f = function(train_data) {
          summary(stats::lm(Y ~ cat_1, data = train_data))$r.squared
        }))),
        interp(
          quote(
            purrr::map_dbl(train, .f = function(train_data) {
          summary(stats::lm(Y ~ cat_2, data = train_data))$r.squared
        })))
      ),
      nm = c("cat_1", "cat_2")
    )
  )

Questions:


Edit:

The following code gets the R2 for each of the variables, but it cannot be flattened out to a number of columns equal to the number of variables in the dataset.

make_r2_variable = function(var_name, train_data) {
  summary(stats::lm(Y ~ var_name, data = train_data))$r.squared
}

make_r2 = function(train_data) {
  summarise_at(
    .tbl =  data.frame(train_data),
    .cols = vars(starts_with("cat_")),
    .funs = funs(make_r2_variable(., train_data = train_data))
  )

}

df_foo = df_foo %>% 
  mutate_at(.funs = funs(as.factor), .cols = vars(starts_with("cat"))) %>% 
  crossv_kfold(k = 10, id = "id") %>% 
  mutate(
    R2 = map(.x = train, .f = make_r2)
  ) 

Upvotes: 1

Views: 76

Answers (1)

tchakravarty
tchakravarty

Reputation: 10964

The solution that I think is as compact as it can get is this:

make_r2_variable = function(var_name, train_data) {
  summary(stats::lm(Y ~ var_name, data = train_data))$r.squared
}

make_r2 = function(train_data) {
  summarise_at(
    .tbl =  data.frame(train_data),
    .cols = vars(starts_with("cat_")),
    .funs = funs(make_r2_variable(., train_data = train_data))
  )

}

df_foo = df_foo %>% 
  mutate_at(.funs = funs(as.factor), .cols = vars(starts_with("cat"))) %>% 
  crossv_kfold(k = 10, id = "id") %>% 
  mutate(
    R2 = map(.x = train, .f = make_r2)
  ) %>% 
  unnest(R2)

This is basically the solution that I had in the edit plus unnest. This basically mutates the S3: resample column using map and within that it uses mutate_at to cycle over the columns that match the criterion. Since that returns a list/1D data.frame, a call to unnest is required.

Upvotes: 1

Related Questions