jderzol

Reputation: 61

R glm() formula syntax with | and >

I came across two R glm() formulas whose meaning I don't understand.

Suppose we have 3 variables, x1, x2, y. What does it mean when the formula includes >, e.g., glm((y>0) ~ x1 + x2)? What does it mean when | is used, e.g., glm(y ~ x1|x2)?

For the second one, the explanation I found is x1 given x2, but I am not sure how to interpret this when x1 and x2 are both column vectors rather than random variables.

Upvotes: 2

Views: 1278

Answers (1)

Gavin Simpson

Reputation: 174898

> has its usual meaning: is y greater than 0 or not? y > 0 evaluates to a logical vector that is TRUE for observations greater than 0 and FALSE otherwise. Used as a response, this is treated as a vector of 1s and 0s, respectively. I presume you left out the part where you specified family = binomial or similar to account for the 0/1 nature of the data?
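As a minimal sketch of the point above, the following compares a fit that uses (y > 0) directly in the formula with one where the 0/1 response is built by hand. The simulated data, the seed, and the object names (fit_logical, fit_01, y01) are assumptions made up for illustration; they are not from the question.

```r
# Hypothetical data: two predictors and a continuous response
set.seed(42)                           # arbitrary seed, for reproducibility only
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 0.5 * x1 - 0.3 * x2 + rnorm(n)   # made-up continuous response

# Logical response created inside the formula
fit_logical <- glm((y > 0) ~ x1 + x2, family = binomial)

# Same model with the 0/1 response created explicitly beforehand
y01    <- as.integer(y > 0)
fit_01 <- glm(y01 ~ x1 + x2, family = binomial)

# The two fits should give the same coefficient estimates
same_fit <- isTRUE(all.equal(coef(fit_logical), coef(fit_01)))
print(same_fit)
```

Writing (y > 0) in the formula is just a shortcut that avoids creating a separate 0/1 column first.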

| doesn't have any special meaning in the formula accepted by glm() (and other base R functions). It takes the same meaning as documented in ?'|', which is the element-wise logical OR operator. Hence x1 | x2 is just the function call '|'(x1, x2). The result is TRUE wherever x1 or x2 is TRUE, with x1 and x2 coerced to logical as required. If x1 and x2 are both numeric, x1 | x2 is FALSE only where both are exactly equal to 0. This is just a feature of R's formulas and non-standard evaluation: a formula can contain function calls, such as log(x) or sqrt(y), which get evaluated when the fitting function collects the data needed for fitting.
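To make the coercion rule concrete, here is a small sketch with two made-up numeric vectors (the names v1 and v2 are illustrative only). It shows that the infix form and the function-call form are the same thing, and that only a pair of exact zeros gives FALSE.

```r
# Hypothetical numeric vectors; non-zero values are coerced to TRUE
v1 <- c(0, 1, 0, -2)
v2 <- c(0, 0, 3,  0)

infix  <- v1 | v2          # usual operator form
prefix <- `|`(v1, v2)      # identical call written as a function

print(infix)               # FALSE TRUE TRUE TRUE
print(identical(infix, prefix))
```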

Here is an example that might explain what | is doing in a formula:

> set.seed(1)
> df <- data.frame(Y = rnorm(5), A = rnorm(5), B = rep(FALSE, 5),
+                  C = c(rep(TRUE, 4), FALSE))
> df
           Y          A     B     C
1 -0.6264538 -0.8204684 FALSE  TRUE
2  0.1836433  0.4874291 FALSE  TRUE
3 -0.8356286  0.7383247 FALSE  TRUE
4  1.5952808  0.5757814 FALSE  TRUE
5  0.3295078 -0.3053884 FALSE FALSE
> model.frame(Y ~ A + (B | C), data = df)
           Y          A B | C
1 -0.6264538 -0.8204684  TRUE
2  0.1836433  0.4874291  TRUE
3 -0.8356286  0.7383247  TRUE
4  1.5952808  0.5757814  TRUE
5  0.3295078 -0.3053884 FALSE

The third column here is formed from a call to '|'(B, C), which results in

> with(df, B | C)
[1]  TRUE  TRUE  TRUE  TRUE FALSE

Notice that you have to wrap the | clause in parentheses otherwise it gobbles up the other terms on the right-hand side of the ~:

> model.frame(Y ~ A + B | C, data = df)
           Y A + B | C
1 -0.6264538      TRUE
2  0.1836433      TRUE
3 -0.8356286      TRUE
4  1.5952808      TRUE
5  0.3295078      TRUE

## Note there is no `A` and all are `TRUE` now.

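The last element is now TRUE because + binds more tightly than |, so the right-hand side is evaluated as (A + B) | C. The last element of A + B (-0.3053884, since B is FALSE and counts as 0) is not exactly equal to 0 and hence is coerced to TRUE; we therefore have TRUE | FALSE, which results in TRUE.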
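The grouping can be checked directly by inspecting how R parses the expression. This is a small sketch that rebuilds the same df as above; the helper names (expr, top_op, lhs, unwrapped, explicit, intended) are introduced here for illustration only.

```r
# Same data as in the example above
set.seed(1)
df <- data.frame(Y = rnorm(5), A = rnorm(5), B = rep(FALSE, 5),
                 C = c(rep(TRUE, 4), FALSE))

# Inspect the parsed expression: the outermost call is `|`,
# and its first argument is the whole of `A + B`
expr   <- quote(A + B | C)
top_op <- as.character(expr[[1]])
lhs    <- deparse(expr[[2]])
print(top_op)
print(lhs)

# Without parentheses the expression is (A + B) | C ...
unwrapped <- with(df, A + B | C)
explicit  <- with(df, (A + B) | C)
print(identical(unwrapped, explicit))

# ... whereas the intended logical term is B | C on its own
intended <- with(df, B | C)
print(intended)
```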

| does have a special meaning in some other packages, for example in the lme4 package, where it is used in random-effects terms such as (1 | g) to separate the effect from its grouping factor.

Upvotes: 10
