shadowtalker
shadowtalker

Reputation: 13833

Programmatically building formulas without string manipulation

For the sake of example, consider a basic regression model in R:

form1 <- Petal.Length ~ Sepal.Length + Sepal.Width
fit1 <- lm(form1, iris)

(My apologies to any botanists who post here.)

In order to add quadratic and interaction terms, I know of three approaches:

1) the old-fashioned way

Entering terms one at a time:

form2 <- . ~ Sepal.Length*Sepal.Width + I(Sepal.Length^2) + I(Sepal.Width^2)
fit2 <- update(fit1, form2)

This doesn't scale beyond small formulas and you can't program with it.

2) the ugly way

String manipulation:

vars <- attr(terms(form1), "term.labels")
squared_terms <- sprintf("I(%s^2)", vars)
inter_terms <- combn(vars, 2, paste, collapse = "*")
form2 <- reformulate(c(inter_terms, squared_terms), ".")

This scales, but it's not really programmable because the functions themselves need to be hard-coded.

3) the "back entrance"

Manipulate the data directly

library(lazyeval)
library(dplyr)

square <- function (v) interp(~ I(v1^2), v1 = as.name(v))
inter <- function(v) interp(~ v1*v2, v1 = as.name(v[1]), v2 = as.name(v[2]))

vars <- attr(terms(form1), "term.labels")
squared_terms <- lapply(vars, square) %>%
  set_names(paste0(vars, " ^2"))
inter_terms <- combn(vars, 2, inter, simplify = FALSE) %>%
  set_names(combn(vars, 2, paste, collapse = " x "))

fit2 <- model.frame(fit1) %>%
  mutate_(.dots = squared_terms) %>%
  mutate_(.dots = inter_terms) %>%
  lm(Petal.Length ~ ., data = .)

This is fairly scalable, and programmable up to variable naming. But it's also kind of crazy because it defeats the purpose of using a formula in the first place.

what I wish I could do

I wish I could do something like this:

library(lazyeval)
library(dplyr)

square <- function (v) interp(~ I(v1^2), v1 = as.name(v))
inter <- function(v) interp(~ v1*v2, v1 = as.name(v[1]), v2 = as.name(v[2]))

squared_terms <- apply.formula(form1, squared_terms)
inter_terms <- combn.formula(form1, 2, inter)

fit2 <- form1 %>%
  append.formula(squared_terms) %>%
  append.formula(inter_terms) %>%
  update(fit1, .)

Abuse of dplyr aside, there are two killer features here:

  1. the ability to programmatically generate arbitrary formula terms out of basic R objects
  2. the ability to add terms to a formula that will behave like terms that were typed in by hand

Feature 1 is obtainable with Method 3, and Feature 2 is obtainable with Method 2. Is there a Method 4 -- the middle way -- that obtains both at the same time?

Upvotes: 5

Views: 643

Answers (1)

MrFlick
MrFlick

Reputation: 206197

OK, there are a lot of moving pieces here, but here are some helper functions that to very specific things

extract_rhs_symbols <- function(x) {
    as.list(attr(delete.response(terms(x)), "variables"))[-1]
}
symbols_to_formula <- function(x) {
    as.call(list(quote(`~`), x))    
}
sum_symbols <- function(...) {
    Reduce(function(a,b) bquote(.(a)+.(b)), do.call(`c`, list(...), quote=T))
}
square_terms <- function(x) {
    symbols_to_formula(sum_symbols(sapply(extract_rhs_symbols(x), function(x) bquote(I(.(x)^2)))))
}
interact_rhs<-function(x) {
    x[[length(as.list(x))]] <- bquote((.(x[[length(as.list(x))]]))^2)
    x
}
add_rhs_dot <- function(x) {
   x[[length(as.list(x))]] <- sum_symbols(quote(.), x[[length(as.list(x))]])    
   x
}
add_terms<-function(f, x) {
    update(f, add_rhs_dot(x))
}

all of these basically manipulate formulas as calls.

So if you have a formula like

my.formula <- Petal.Length ~ Sepal.Length + Sepal.Width + Other

You can add squared terms with

add_terms(my.formula, square_terms(my.formula))

you can get all the right-hand interactions with

interact_rhs(my.formula)

or do both with

add_terms(interact_rhs(my.formula), square_terms(my.formula))

which gives

Petal.Length ~ Sepal.Length + Sepal.Width + Other + I(Sepal.Length^2) + 
    I(Sepal.Width^2) + I(Other^2) + Sepal.Length:Sepal.Width + 
    Sepal.Length:Other + Sepal.Width:Other

I haven't throughly tested this so there are likely to be cases where this breaks down, but it should work in most simple cases.

Upvotes: 11

Related Questions