Reputation: 61
I came across two R `glm` formulas whose meaning I don't understand.
Suppose we have three variables, `x1`, `x2`, and `y`. What does it mean when the formula includes `>`, e.g., `glm((y>0) ~ x1 + x2)`? What does it mean when `|` is used, e.g., `glm(y ~ x1|x2)`?
For the second one, the explanation I found is "`x1` given `x2`", but I am not sure how to interpret this when `x1` and `x2` are both column vectors rather than random variables.
Upvotes: 2
Views: 1278
Reputation: 174898
`>` has its usual meaning: is `y` greater than `0` or not? This evaluates to a logical vector with `TRUE` for observations greater than 0 and `FALSE` otherwise, which is then treated as a vector of `1`s and `0`s, respectively. I presume you left out the bit where you specified `family = binomial` or similar to account for the `0`/`1` nature of the data?
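As a minimal sketch of this point (the simulated data and the names `d`, `resp`, `fit`, and `fit01` are made up for illustration), the comparison yields a logical response, and with a binomial family the fit should match one where the response is coded explicitly as 0/1:

```r
# Simulated data; variable names and values are illustrative only.
set.seed(42)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- d$x1 + rnorm(50)

# (y > 0) is evaluated when the model frame is built, giving TRUE/FALSE.
resp <- model.response(model.frame((y > 0) ~ x1 + x2, data = d))
is.logical(resp)

# With family = binomial, TRUE is modelled as a success and FALSE as a failure.
fit <- glm((y > 0) ~ x1 + x2, data = d, family = binomial)

# The same model with the response coded explicitly as 0/1.
fit01 <- glm(as.integer(y > 0) ~ x1 + x2, data = d, family = binomial)
all.equal(unname(coef(fit)), unname(coef(fit01)))
```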
`|` doesn't have any special meaning in the formula accepted by `glm()` (and other base R functions). It takes the same meaning as documented in `?'|'`, namely the OR operator. Hence we might think of `x1 | x2` as being `or(x1, x2)`, which has the form of a standard function call. The result is `TRUE` if `x1` or `x2` is `TRUE`, with `x1` and `x2` coerced to logical as required. If `x1` and `x2` are both numeric, the only way `x1 | x2` will be `FALSE` is if both are exactly equal to `0`. This is just a feature of R's formulas and non-standard evaluation: a formula can contain function calls, such as `log(x)`, `sqrt(y)`, etc., which get evaluated when the fitting function collects the data needed for fitting.
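A small sketch of the coercion rule and of calls being evaluated inside a formula; the data frame `d` and its values are invented for illustration:

```r
# Toy data; names and values are illustrative only.
d <- data.frame(y  = c(1.2, 0.4, 2.5, 3.1),
                x1 = c(0, 0, 2, -1.5),
                x2 = c(0, 3, 0, 0.5))

# `|` coerces numeric inputs to logical: only exact zeros count as FALSE,
# so only the first row (0 | 0) gives FALSE.
or_direct <- with(d, x1 | x2)
or_direct

# Inside a formula, `|` is evaluated like any other call (here alongside log())
# when the model frame is assembled.
mf <- model.frame(log(y) ~ x1 | x2, data = d)
mf
```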
Here is an example that might explain what `|` is doing in a formula:
> set.seed(1)
> df <- data.frame(Y = rnorm(5), A = rnorm(5), B = rep(FALSE, 5),
+ C = c(rep(TRUE, 4), FALSE))
> df
Y A B C
1 -0.6264538 -0.8204684 FALSE TRUE
2 0.1836433 0.4874291 FALSE TRUE
3 -0.8356286 0.7383247 FALSE TRUE
4 1.5952808 0.5757814 FALSE TRUE
5 0.3295078 -0.3053884 FALSE FALSE
> model.frame(Y ~ A + (B | C), data = df)
Y A B | C
1 -0.6264538 -0.8204684 TRUE
2 0.1836433 0.4874291 TRUE
3 -0.8356286 0.7383247 TRUE
4 1.5952808 0.5757814 TRUE
5 0.3295078 -0.3053884 FALSE
The third column here is formed from a call to `'|'(B, C)`, which results in
> with(df, B | C)
[1] TRUE TRUE TRUE TRUE FALSE
Notice that you have to wrap the `|` clause in parentheses; otherwise it gobbles up the other terms on the right-hand side of the `~`:
> model.frame(Y ~ A + B | C, data = df)
Y A + B | C
1 -0.6264538 TRUE
2 0.1836433 TRUE
3 -0.8356286 TRUE
4 1.5952808 TRUE
5 0.3295078 TRUE
## Note there is no `A` and all are `TRUE` now.
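The last element is now `TRUE` because the right-hand side is parsed as `(A + B) | C`. `B` is `FALSE`, which counts as 0 in arithmetic, so `A + B` equals `A`, and the last element of `A` (`-0.3053884`) is not exactly equal to `0`; it therefore evaluates to `TRUE`. Hence we have `TRUE | FALSE`, which results in `TRUE`.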
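One way to check how R groups such a formula is to inspect its parse tree. The sketch below rebuilds the `df` from the example above; the names `f`, `rhs`, and `by_hand` are just for illustration:

```r
# Same data as in the example above.
set.seed(1)
df <- data.frame(Y = rnorm(5), A = rnorm(5), B = rep(FALSE, 5),
                 C = c(rep(TRUE, 4), FALSE))

f   <- Y ~ A + B | C
rhs <- f[[3]]            # right-hand side of the formula, as a call

rhs[[1]]                 # the top-level operator of the right-hand side
deparse(rhs[[2]])        # its left operand
deparse(rhs[[3]])        # its right operand

# Evaluating that grouping by hand reproduces the single model-frame column.
by_hand <- with(df, (A + B) | C)
by_hand
```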
`|` does have special meaning in other packages, for example in the lme4 package, where it is used in random-effects terms such as `(1 | group)` to separate the effect from its grouping factor.
Upvotes: 10