mDe

Reputation: 107

Character extraction in R?

How would I extract 'mpg' from the following formula in R? I understand that it would be useful to convert the formula into character first and then use some kind of regex. But I don't know which one.

mpg ~ x1 + x2

Upvotes: 1

Views: 87

Answers (4)

d.b

Reputation: 32548

Here's an approach that uses regex

x = mpg ~ x1 + x2
gsub(" ","",gsub("~.*", "", deparse(x)))
#[1] "mpg"
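To make the one-liner easier to follow, here is the same call split into its steps. This is only an unpacking of the code above; the intermediate names txt and lhs are introduced here for illustration.

```r
x <- mpg ~ x1 + x2

# deparse() turns the formula into its source text
txt <- deparse(x)
txt
#[1] "mpg ~ x1 + x2"

# inner gsub(): drop the tilde and everything after it
lhs <- gsub("~.*", "", txt)
lhs
#[1] "mpg "

# outer gsub(): strip the remaining spaces
gsub(" ", "", lhs)
#[1] "mpg"
```

Note that deparse() can split a very long formula into several strings, so for long formulas it may be safer to collapse the result first, for example with paste(deparse(x), collapse = " ").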

Upvotes: 3

Benjamin

Reputation: 17369

All of the given answers will work for your specific use case. If you want to use them more generally, though, there are some caveats to be aware of. To discuss these, we'll define a few formulae:

fm <- mpg ~ x1 + x2
fm_one <- ~ x1 + x2
fm_multi <- mpg + y1 ~ x1 + x2

all.vars returns a character vector of all of the variables in the formula. It is the fastest of the options given so far. However, it does not distinguish between variables on the left-hand and right-hand sides of the formula. Whether this is acceptable depends on your use case.

all.vars(fm)[1]         # "mpg"
all.vars(fm_one)[1]     # "x1" (this is a right hand side variable)
all.vars(fm_multi)[1]   # "mpg"  (missing other left hand side variables)

The terms approach (as.character(attr(terms(fm), "variables"))) generates a similar vector, but the variable names start in the second position (the list call takes up the first element). Like all.vars, it returns a right-hand side variable for a one-sided formula. For a left side with several terms, it returns the whole left-hand expression as a single string rather than the individual variables.

as.character(attr(terms(fm), "variables"))[2]        # "mpg"
as.character(attr(terms(fm_one), "variables"))[2]    # "x1"
as.character(attr(terms(fm_multi), "variables"))[2]  # "mpg + y1"

Using as.character produces a character vector of length 3 or 2, depending on whether there is a left-hand side. It can return the entire left side as a single string, but not a character vector of the individual response variables. Position alone also does not tell you which side you have: the second element is the left side of a two-sided formula but the right side of a one-sided formula.

as.character(fm)        # "~" "mpg" "x1 + x2"
as.character(fm_one)    # "~" "x1 + x2"
as.character(fm_multi)  # "~" "mpg + y1" "x1 + x2"

The deparse method is somewhat slower than all.vars (but still measured in nanoseconds), and has the primary advantage of distinguishing left hand side from right hand side.

gsub(" ","",gsub("~.*", "", deparse(fm)))        # "mpg"
gsub(" ","",gsub("~.*", "", deparse(fm_one)))    # ""
gsub(" ","",gsub("~.*", "", deparse(fm_multi)))  # "mpg+y1"

Depending on your actual needs, you may not need to protect against one-sided or multivariate formulae. If you are working in a system where it is known that all of your formulae will be univariate and two sided, all.vars is probably your best bet. If you can't be sure of that, I'd recommend using the deparse method. That will at least ensure that you always get response variables when you are looking for response variables.
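If you need the individual response variables and also want to handle one-sided formulae, one possible sketch is to check the length of the formula and apply all.vars to the left side only. The helper name lhs_vars is hypothetical (it is not part of base R or of any answer here), and the sketch assumes the input is a formula object.

```r
# Hypothetical helper: returns the names of the response variables,
# or character(0) when the formula has no left-hand side.
lhs_vars <- function(f) {
  stopifnot(inherits(f, "formula"))
  # A two-sided formula is a call of length 3: `~`, left side, right side.
  # A one-sided formula has length 2, so there is nothing to extract.
  if (length(f) < 3) return(character(0))
  all.vars(f[[2]])
}

fm <- mpg ~ x1 + x2
fm_one <- ~ x1 + x2
fm_multi <- mpg + y1 ~ x1 + x2

lhs_vars(fm)        # "mpg"
lhs_vars(fm_one)    # character(0)
lhs_vars(fm_multi)  # "mpg" "y1"
```

Unlike the deparse method, this returns each response variable as its own element, which may or may not be what you want if the left side is an expression such as log(mpg).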

Upvotes: 3

akrun

Reputation: 887028

We can use all.vars

all.vars(form)[1]
#[1] "mpg"

Or with terms

as.character(attr(terms(form), "variables")[[2]])
#[1] "mpg"

Or another option is

paste(form)[[2]]
#[1] "mpg"

where

form <- mpg ~ x1 + x2

Upvotes: 5

Marco Sandri

Reputation: 24252

Given the formula:

frm <- as.formula(mpg ~ x1 + x2)

it is possible to extract the term on the left side simply using:

as.character(frm[[2]])
#[1] "mpg"
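This works because a two-sided formula is stored as a call with three elements. A short sketch of that structure follows; frm2 is an extra example introduced here to show what happens when the left side is not a plain variable name.

```r
frm <- mpg ~ x1 + x2

length(frm)   # 3: the `~` operator, the left side, the right side
frm[[1]]      # `~`
frm[[2]]      # mpg
frm[[3]]      # x1 + x2

# With a compound left side, as.character() splits the call into pieces,
# while deparse() keeps it as one string.
frm2 <- log(mpg) ~ x1
as.character(frm2[[2]])   # "log" "mpg"
deparse(frm2[[2]])        # "log(mpg)"
```

For a one-sided formula such as ~ x1 + x2 the length is 2, and the second element is the right side, so it is worth checking length(frm) == 3 before indexing.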

Upvotes: 2
