Reputation: 6800
There are several posts related to obtaining a list of variables in a regression formula in R, the basic answer being to use all.vars. For example,
> all.vars(log(resp) ~ treat + factor(dose))
[1] "resp" "treat" "dose"
This is nice because it strips out all of the functions and operators (as well as repeats, not shown). However, this is problematic when the formula contains $ operators or subscripts, such as in
> form = log(cows$weight) ~ factor(bulls[[3]]) * herd$breed
> all.vars(form)
[1] "cows" "weight" "bulls" "herd" "breed"
Here, the data frame names cows, bulls, and herd are identified as variables, and the names of the actual variables are decoupled or lost. Instead, what I really want is this result:
> mystery.fcn(form)
[1] "cows$weight" "bulls[[3]]" "herd$breed"
What is the most elegant way to do this? I have one proposal that I'll post as an answer, but maybe someone has a more elegant solution and will earn more votes!
Upvotes: 2
Views: 57
Reputation: 93871
This isn't sufficient for a general use case, but just for fun I thought I'd take a crack at it:
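mystery.fcn = function(string) {
  # gsub() coerces a formula argument to a character vector: c("~", lhs, rhs)
  string = gsub(":", " ", string)  # split interaction terms into separate tokens
  # drop function-name prefixes, parentheses, and formula operators, then split on spaces
  string = unlist(strsplit(gsub("\\b.*\\b\\(|\\(|\\)|[*~+-]", "", string), split=" "))
  string = string[nchar(string) > 0]  # discard empty tokens
  return(string)
}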
form = log(cows$weight) ~ factor(bulls[[3]]) * herd$breed
mystery.fcn(form)
[1] "cows$weight" "bulls[[3]]" "herd$breed"
form1 = ~x[[y]]
mystery.fcn(form1)
[1] "x[[y]]"
form2 = z$three ~ z$one + z$two - z$x_y
mystery.fcn(form2)
[1] "z$three" "z$one" "z$two" "z$x_y"
form3 = z$three ~ z$one:z$two
mystery.fcn(form3)
[1] "z$three" "z$one" "z$two"
Upvotes: 2
Reputation: 6800
One approach that works, though a bit tedious, is to replace the operators $, etc. with legal characters for variable names, turn the string back into a formula, apply all.vars, and un-mangle the results:
All.vars = function(expr, retain = c("\\$", "\\[\\[", "\\]\\]"), ...) {
  # replace operators with unlikely patterns _Av1_, _Av2_, ...
  repl = paste("_Av", seq_along(retain), "_", sep = "")
  for (i in seq_along(retain))
    expr = gsub(retain[i], repl[i], expr)
  # piece things back together in the right order, and call all.vars
  subs = switch(length(expr), 1, c(1,2), c(2,1,3))
  vars = all.vars(as.formula(paste(expr[subs], collapse = "")), ...)
  # reverse the mangling of names
  retain = gsub("\\\\", "", retain) # un-escape the patterns
  for (i in seq_along(retain))
    vars = gsub(repl[i], retain[i], vars)
  vars
}
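The reordering step works because a formula coerces to a character vector with the operator first. Here is a minimal sketch of just that step, using the example formula from the question; the names parts and rebuilt are illustrative only and not part of the function above.

```r
# The formula is never evaluated, so cows, bulls and herd need not exist.
form = log(cows$weight) ~ factor(bulls[[3]]) * herd$breed

# Coercion puts the operator first: c("~", lhs, rhs).
# gsub() applies the same coercion to a non-character argument.
parts = as.character(form)
print(parts)

# Reordering to lhs, "~", rhs gives text that as.formula() can parse again.
rebuilt = paste(parts[c(2, 1, 3)], collapse = "")
print(rebuilt)

# A one-sided formula coerces to length 2, so c(1, 2) keeps "~" in front.
print(as.character(~ x[[y]]))
```

This is why the switch on length(expr) uses c(2,1,3) for a two-sided formula and c(1,2) for a one-sided one.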
Use the retain argument to specify the patterns that we wish to retain rather than treat as operators. The defaults are $, [[, and ]] (all duly escaped). Here are some results:
> form = log(cows$weight) ~ factor(bulls[[3]]) * herd$breed
> All.vars(form)
[1] "cows$weight" "bulls[[3]]" "herd$breed"
Change retain to also include ( and ):
> All.vars(form, retain = c("\\$", "\\(", "\\)", "\\[\\[", "\\]\\]"))
[1] "log(cows$weight)" "factor(bulls[[3]])" "herd$breed"
The dots are passed to all.vars, which is really the same as all.names but with different defaults. So we can also obtain the functions and operators not in retain:
> All.vars(form, functions = TRUE)
[1] "~" "log" "cows$weight" "*"
[5] "factor" "bulls[[3]]" "herd$breed"
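To make the relationship between all.vars and all.names concrete, here is a small check on a toy formula (the formula f is made up for illustration). It assumes the documented defaults: all.vars uses functions = FALSE and unique = TRUE, while all.names uses functions = TRUE and unique = FALSE.

```r
# Toy formula in which x appears twice
f = y ~ log(x) + x

# all.vars drops functions and operators and removes repeats
v = all.vars(f)
print(v)

# all.names called with all.vars' defaults should give the same answer
n = all.names(f, functions = FALSE, unique = TRUE)
print(identical(v, n))

# With its own defaults, all.names keeps functions, operators and repeats
print(all.names(f))
```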
Upvotes: 2