TY Lim
TY Lim

Reputation: 623

Automatically exclude single-level factor variables from regression

I have a custom function which, among other things, creates a regression formula based on specified string inputs and runs a regression (brm, but should work similarly for basic lm):

model_predict <- function(.data, dep_var, model ...) {
    form <- as.formula(str_glue("{dep_var} ~ {model}"))
    form_vars <- all.vars(form)

    ... # some other stuff

    fit <- brm(form, .data, ...)

    ... # some other stuff
}

I use this to fit a large number of pre-specified models as part of a larger workflow.

Sometimes, some of the variables in model are factor variables, and sometimes those factors have only one level in the data. This results in the contrasts can be applied only to factors with 2 or more levels error when trying to fit the model.

Because of the larger workflow and because it's not always clear a priori if the data and model for any given iteration is going to encounter this problem, I'd rather not manually remove the factor vars from the model specified when those vars only have one level in the relevant data subset.

Is there is some setting on lm or brm that can be used to tell the model fitting process itself to ignore single-level factor vars? that would be the easiest solution, but I'm not sure it exists.

Alternatively, I'd like an automated solution that identifies when the single factor level situation arises, and drops the relevant vars from the formula when it does (maybe giving a warning message too), for instance:

# main formula
form
> outcome ~ predictor_1 + predictor_2 * interactor

# Desired outputs...
# if predictor_1 has only one level in data
> outcome ~ predictor_2 * interactor

# if predictor_2 has only one level in data
> outcome ~ predictor_1

# if interactor has only one level in data
> outcome ~ predictor_1 + predictor_2

I've tried tryCatch as suggested here but while that suppresses the contrasts... error, it returns NULL instead of a fitted model ignoring the offending vars, which is what I need.

Additionally, sometimes those variables are in the formula with + and sometimes with * as interaction effects, which makes dynamically building the formula difficult.

Upvotes: 2

Views: 574

Answers (2)

Baraliuh
Baraliuh

Reputation: 2141

Something like this should work:

# For pipe operator
library(magrittr)

# Define function for updating the model
update_formula <- function(data, dep_var, model) {
  # Extract model variables
  model_vars <- stringr::str_split_1(model, '\\+|\\*|:') %>% 
    stringr::str_trim()
  
  model_terms <- stringr::str_split_1(model, '\\+') %>% 
    stringr::str_trim()
  
  # Get the number of levels for all factor variables
  lev_leng <- .data %>% 
    dplyr::select(where(is.factor) & any_of(model_vars)) %>% 
    purrr::map_int(
      ~ length(levels(droplevels(.x)))
    )
  
  # Check if length is one
  invalid <- names(which(lev_leng == 1))
  
  # If any is invalid, it will have length > 0
  if (length(invalid) != 0) {
    for (i in invalid) {
      if (any(stringr::str_detect(model_terms, paste0('\\*[:space:]?', i)))) {
        # Remove interactor from formula
        model_terms <- stringr::str_remove(
          model_terms,
          paste0('[\\*,:]?[:space:]?',i,'[:space:]?[\\*,:]?')
        )
      } else{
        # Remove entire term from formula
        model_terms <- model_terms[stringr::str_detect(model_terms, i, T)]
      }
    }
    model <- stringr::str_flatten(model_terms, '+')
  }
  # Define the new formula
  as.formula(glue::glue('{dep_var} ~ {model}'))
}


# Dep var defition
dep_var <- 'y'
# Model example
model <- "predictor_1 + predictor_2 * interactor"

# Example predictor_1
.data <- tibble::tibble(
  y = runif(10),
  predictor_1 = factor(1),
  predictor_2 = factor(letters[1:10]),
  interactor  = factor(letters[1:10])
)

# Update model
form <- update_formula(.data, dep_var, model)
form
#> y ~ predictor_2 * interactor
#> <environment: 0x0000015b95006a58>

# Fit the model
fit <- lm(form, .data)

# Example predictor_2
.data <- tibble::tibble(
  y = runif(10),
  predictor_1 = runif(10),
  predictor_2 = factor(letters[rep(1, 10)]),
  interactor  = factor(letters[1:10])
)

form <- update_formula(.data, dep_var, model)
form
#> y ~ predictor_1
#> <environment: 0x0000015b95cdd218>

# Example interactor
.data <- tibble::tibble(
  y = runif(10),
  predictor_1 = runif(10),
  predictor_2 = factor(letters[1:10]),
  interactor  = factor(letters[rep(1, 10)])
)

form <- update_formula(.data, dep_var, model)
form
#> y ~ predictor_1 + predictor_2
#> <environment: 0x0000015b96072878>

# Example predictor_1 & interactor
.data <- tibble::tibble(
  y = runif(10),
  predictor_1 = factor(10),
  predictor_2 = factor(letters[1:10]),
  interactor  = factor(letters[rep(1, 10)])
)

form <- update_formula(.data, dep_var, model)
form
#> y ~ predictor_2
#> <environment: 0x0000015b9669d448>

Created on 2023-06-13 with reprex v2.0.2

Upvotes: 2

I_O
I_O

Reputation: 6921

Edit

An approach to reformulate the formula based on valid columns (assuming the response variable stays the same):

  • example data:
d <- data.frame(Y = rnorm(2),
                x1 = gl(1, 1),
                x2 = gl(1, 2),
                x3 = LETTERS[1:2],
                x4 = rnorm(2)
                )
  • helper function to detect unwanted variables in dataframe d
get_vars_to_omit <- \(d){
  names(d) |>
    sapply(FUN = \(name){xs <- d[[name]]
      is.numeric(xs) | (length(levels(as.factor(xs))) > 1)
    }) |>
    Filter(f = \(xs) !xs) |>  names()
}
> get_vars_to_omit(d)
[1] "x1" "x2"
  • helper function to update an existing model formula by removing terms containing unwanted variables:
update_formula <- function(old_formula, vars_to_omit){
  var_names <- all.vars(old_formula)
  old_terms <- terms(old_formula)
  terms_to_drop <- attr(old_terms, 'term.labels') |>
    grep(pattern = paste(vars_to_omit, collapse = '|'))
  new_terms <- drop.terms(old_terms, terms_to_drop)
  attr(new_terms, 'term.labels') |> 
    reformulate(response = var_names[1])
}
  • example formula containing unwanted variables:
old_formula <- formula(Y ~ x1 + x2 + x3 + x4)
  • update formula:
new_formula <- update_formula(old_formula, get_vars_to_omit(d))
## terms with unwanted variables have been removed:
## > new_formula
Y ~ x3 + x4
  • result:
## > lm(new_formula, data = d)

Call:
lm(formula = new_formula, data = d)

Coefficients:
(Intercept)          x3B           x4  
      0.901       -1.222           NA  

(edit ends here)


Could you add some a knock-out step at the start? Like:

d |> select_if(~ is.numeric(.x) | (length(levels(as.factor(.x))) > 1))

Above is a "tidy" variant which can also expressed with lapply or Map of base R. For a more automated feature selection, {caret} or {tidymodels} might be worth a try.

Upvotes: 0

Related Questions