Russ Lenth

Reputation: 6800

Extracting variables from a formula when there are subscripts

There are several posts related to obtaining a list of variables in a regression formula in R - the basic answer being to use all.vars. For example,

> all.vars(log(resp) ~ treat + factor(dose))
[1] "resp"  "treat" "dose"

This is nice because it strips out all of the functions and operators (and removes duplicates, not shown). However, it breaks down when the formula contains $ operators or subscripts, as in

> form = log(cows$weight) ~ factor(bulls[[3]]) * herd$breed
> all.vars(form)
[1] "cows"   "weight" "bulls"  "herd"   "breed"

Here, the data frame names cows, bulls, and herd are identified as variables, and the names of the actual variables are decoupled or lost. Instead, what I really want is this result:

> mystery.fcn(form)
[1] "cows$weight" "bulls[[3]]"  "herd$breed"
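For context, the splitting happens because $ and [[ are ordinary function calls in R's parse tree, so all.vars descends into them just as it does into log or *. A quick check with base R's all.names, using the same form as above (the variable names nms and term are only for illustration):

```r
form <- log(cows$weight) ~ factor(bulls[[3]]) * herd$breed

# all.names() keeps function and operator names, so `$` and `[[` show up
# as calls alongside the symbols they are applied to
nms <- all.names(form)
print(nms)

# the left-hand side is log(cows$weight); its argument is itself a call to `$`
term <- form[[2]][[2]]      # cows$weight
is.call(term)               # TRUE
as.character(term[[1]])     # "$"
```

Nothing is evaluated here, so cows, bulls, and herd do not need to exist.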

What is the most elegant way to do this? I have one proposal that I'll post as an answer, but maybe someone has a more elegant solution and will earn more votes!

Upvotes: 2

Views: 57

Answers (2)

eipi10

Reputation: 93871

This isn't sufficient for a general use case, but just for fun I thought I'd take a crack at it:

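mystery.fcn = function(string) {
  # gsub() coerces a formula to a character vector: "~", then each side
  string = gsub(":", " ", string)  # treat interaction colons as separators
  # strip everything up to a "(" that follows a name, any remaining
  # parentheses, and the operators * ~ + -; then split on spaces
  string = unlist(strsplit(gsub("\\b.*\\b\\(|\\(|\\)|[*~+-]", "", string), split=" "))
  string = string[nchar(string) > 0]  # discard empty pieces
  return(string)
}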

form = log(cows$weight) ~ factor(bulls[[3]]) * herd$breed
mystery.fcn(form)
[1] "cows$weight" "bulls[[3]]"  "herd$breed" 

form1 = ~x[[y]]
mystery.fcn(form1)
[1] "x[[y]]"

form2 = z$three ~ z$one + z$two - z$x_y
mystery.fcn(form2)
[1] "z$three" "z$one"   "z$two"   "z$x_y"  

form3 = z$three ~ z$one:z$two
mystery.fcn(form3)
[1] "z$three" "z$one"   "z$two"
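For a more general case, one possible alternative to regular expressions is to walk the parse tree and treat any call to $, [[, or [ as a leaf that is deparsed whole. The sketch below is not the approach used above; the function name vars_keep_subscripts and its keep argument are made up for illustration, and it assumes calls without empty arguments.

```r
# Hypothetical helper: collect variables from a formula or expression,
# keeping subscripted terms such as cows$weight intact.
vars_keep_subscripts <- function(expr, keep = c("$", "[[", "[")) {
  walk <- function(e) {
    if (is.call(e)) {
      head <- e[[1]]
      # a call to one of the `keep` operators is returned as a single string
      if (is.name(head) && as.character(head) %in% keep) {
        return(paste(deparse(e), collapse = ""))
      }
      # otherwise skip the function name and recurse into the arguments
      return(unlist(lapply(as.list(e)[-1], walk)))
    }
    if (is.name(e)) return(as.character(e))  # a plain variable
    character(0)                             # constants contribute nothing
  }
  unique(as.character(walk(expr)))
}

form <- log(cows$weight) ~ factor(bulls[[3]]) * herd$breed
vars_keep_subscripts(form)
# [1] "cows$weight" "bulls[[3]]"  "herd$breed"

vars_keep_subscripts(y ~ x + log(z$w))
# [1] "y"   "x"   "z$w"
```

Because it works on the parsed expression rather than on text, the result does not depend on spacing or on which operators appear between terms.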

Upvotes: 2

Russ Lenth

Reputation: 6800

One approach that works, though a bit tedious, is to convert the formula to character strings, replace the operators ($, [[, and ]]) with patterns that are legal in variable names, paste the pieces back into a formula, apply all.vars, and un-mangle the results:

All.vars = function(expr, retain = c("\\$", "\\[\\[", "\\]\\]"), ...) {
    # replace operators with unlikely patterns _Av1_, _Av2_, ...
    repl = paste("_Av", seq_along(retain), "_", sep = "")
    for (i in seq_along(retain))
        expr = gsub(retain[i], repl[i], expr)
    # piece things back together in the right order, and call all.vars
    subs = switch(length(expr), 1, c(1,2), c(2,1,3))
    vars = all.vars(as.formula(paste(expr[subs], collapse = "")), ...)
    # reverse the mangling of names
    retain = gsub("\\\\", "", retain)  # un-escape the patterns
    for (i in seq_along(retain))
        vars = gsub(repl[i], retain[i], vars)
    vars
}
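The reordering in switch(length(expr), 1, c(1,2), c(2,1,3)) relies on how gsub coerces a formula to character: the operator comes first, followed by each side. A small illustration of that coercion (the names parts, one_sided, and rebuilt are only for this example):

```r
form <- log(cows$weight) ~ factor(bulls[[3]]) * herd$breed

# a two-sided formula becomes c("~", lhs, rhs)
parts <- as.character(form)
parts
# [1] "~"  "log(cows$weight)"  "factor(bulls[[3]]) * herd$breed"

# a one-sided formula becomes c("~", rhs), so the order c(1, 2) is already right
one_sided <- as.character(~ x[[y]])
one_sided
# [1] "~"      "x[[y]]"

# putting the two-sided pieces back in reading order: lhs, "~", rhs
rebuilt <- paste(parts[c(2, 1, 3)], collapse = "")
rebuilt
# [1] "log(cows$weight)~factor(bulls[[3]]) * herd$breed"
```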

Use the retain argument to specify the patterns that we wish to retain rather than treat as operators. The defaults are $, [[, and ]] (all duly escaped). Here are some results:

> form = log(cows$weight) ~ factor(bulls[[3]]) * herd$breed
> All.vars(form)
[1] "cows$weight" "bulls[[3]]"  "herd$breed" 

Change retain to also include ( and ):

> All.vars(form, retain = c("\\$", "\\(", "\\)", "\\[\\[", "\\]\\]"))
[1] "log(cows$weight)"   "factor(bulls[[3]])" "herd$breed"

The ... arguments are passed on to all.vars, which is really the same as all.names but with different defaults, so we can also obtain the functions and operators not listed in retain:

> All.vars(form, functions = TRUE)
[1] "~"           "log"         "cows$weight" "*"          
[5] "factor"      "bulls[[3]]"  "herd$breed" 
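To make the relationship between all.vars and all.names concrete, here is a small check using only base R; the formula and the names v, same_vars, and same_names are just for illustration:

```r
form <- log(resp) ~ treat + factor(dose) + treat

# all.vars() defaults to functions = FALSE, unique = TRUE
v <- all.vars(form)
v
# [1] "resp"  "treat" "dose"

# all.names() gives the same answer once its defaults are overridden
same_vars <- identical(v, all.names(form, functions = FALSE, unique = TRUE))

# and all.vars(functions = TRUE) matches all.names() with duplicates removed
same_names <- identical(all.vars(form, functions = TRUE),
                        all.names(form, unique = TRUE))
```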

Upvotes: 2
