Spacedman

Reputation: 94277

What alternative ways are there to specify binomial successes/trials in a formula?

Suppose you are modelling binomial data where each response is a number of successes (y) from a number of trials (N) with some explanatory variables (a and b). There are a few functions that do this kind of thing, and they all seem to use different methods to specify y and N.

In glm, you do glm(cbind(y,N-y)~a+b, data = d) (matrix of success/fail on LHS)

In inla, you do inla(y~a+b, Ntrials=d$N, data=d) (specify number of trials separately)

In glmmBUGS, you do glmmBUGS(y+N~a+b,data=d) (specify success + trials as terms on LHS)

When programming new methods, I've always thought it best to follow what glm does, since that's where people would normally first encounter binomial response data. However, I can never remember if it's cbind(y,N-y) or cbind(y,N) - and I usually seem to have successes/number of trials in my data rather than successes/number of fails - YMMV.

Other approaches are possible of course. For example using a function on the RHS to mark whether the variable is number of trials or number of fails:

myblm( y ~ a + b + Ntrials(N), data=d)
myblm( y ~ a + b + Nfails(M), data=d)  # if your dataset has succ/fail variables

or defining an operator to just do a cbind, so you can do:

myblm( y %of% N ~ a + b, data=d)

thus attaching some meaning to the LHS making it explicit.
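The operator idea takes only a couple of lines; here is an illustrative sketch (the `%of%` name is of course made up), leaning on the fact that glm already accepts a two-column success/failure matrix on the LHS:

```r
# Hypothetical infix operator: wraps successes and trials into the
# two-column success/failure matrix that glm's binomial family accepts.
`%of%` <- function(y, N) cbind(succ = y, fail = N - y)

# The LHS of a formula is evaluated in the data and then the formula's
# environment, so the operator can be used directly in a glm call:
# glm(y %of% N ~ a + b, family = binomial(), data = d)
```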

Has anyone got any better ideas? What's the right way to do this?

Upvotes: 11

Views: 2789

Answers (3)

Benjamin Christoffersen

Reputation: 4841

You can also let y be a fraction, in which case you need to supply the weights. The weights are not in the formula argument, but it takes almost the same number of keystrokes as if they were. Here is an example:

> set.seed(73574836)
> x <- runif(10)
> n <- sample.int(10, 2)
> y <- sapply(mapply(rbinom, size = 1, n, (1 + exp(1 - x))^-1), function(x) 
+   sum(x == 1))
> df <- data.frame(y = y, frac = y / n, x = x, weights = n)
> df
   y  frac      x weights
1  2 1.000 0.9051       2
2  5 0.625 0.3999       8
3  1 0.500 0.4649       2
4  4 0.500 0.5558       8
5  0 0.000 0.8932       2
6  3 0.375 0.1825       8
7  1 0.500 0.1879       2
8  4 0.500 0.5041       8
9  0 0.000 0.5070       2
10 3 0.375 0.3379       8
> 
> # the following two fits are identical
> summary(glm(cbind(y, weights - y) ~ x, binomial(), df))

Call:
glm(formula = cbind(y, weights - y) ~ x, family = binomial(), 
    data = df)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.731  -0.374   0.114   0.204   1.596  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.416      0.722   -0.58     0.56
x              0.588      1.522    0.39     0.70

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9.5135  on 9  degrees of freedom
Residual deviance: 9.3639  on 8  degrees of freedom
AIC: 28.93

Number of Fisher Scoring iterations: 3

> summary(glm(frac ~ x, binomial(), df, weights = weights))

Call:
glm(formula = frac ~ x, family = binomial(), data = df, weights = weights)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.731  -0.374   0.114   0.204   1.596  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.416      0.722   -0.58     0.56
x              0.588      1.522    0.39     0.70

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9.5135  on 9  degrees of freedom
Residual deviance: 9.3639  on 8  degrees of freedom
AIC: 28.93

Number of Fisher Scoring iterations: 3

The reason the above works comes down to what glm actually does for binomial outcomes: it computes a fraction for each observation and a weight associated with that observation, regardless of how you specify the outcome. Here is a snippet from ?glm which gives a hint of what is going on in the estimation:

If a binomial glm model was specified by giving a two-column response, the weights returned by prior.weights are the total numbers of cases (factored by the supplied case weights) and the component y of the result is the proportion of successes.

Alternatively, you can make a wrapper for glm.fit or glm using model.frame. See the documentation of the ... argument in ?model.frame:

... for model.frame methods, a mix of further arguments such as data, na.action, subset to pass to the default method. Any additional arguments (such as offset and weights or other named arguments) which reach the default method are used to create further columns in the model frame, with parenthesised names such as "(offset)".
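As a sketch of that wrapper idea (myblm and its Ntrials argument are hypothetical names chosen to mimic inla's interface), one can pass the trial counts through model.frame and hand glm.fit the usual two-column response:

```r
# Hypothetical wrapper: inla-style interface where the number of
# trials is supplied as a separate argument, not in the formula.
myblm <- function(formula, data, Ntrials) {
  # extra named arguments become parenthesised columns, here "(Ntrials)"
  mf <- model.frame(formula, data = data, Ntrials = Ntrials)
  y <- model.response(mf)
  N <- mf[["(Ntrials)"]]
  X <- model.matrix(attr(mf, "terms"), mf)
  # glm.fit accepts the two-column success/failure response as well
  glm.fit(X, cbind(y, N - y), family = binomial())
}

# usage (sketch): myblm(y ~ a + b, data = d, Ntrials = d$N)
```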

Comment

I saw Ben Bolker's comment afterwards. The above is what he points out.

Upvotes: 1

rcorty

Reputation: 1200

I like this method from the glm documentation:

For binomial and quasibinomial families the response can also be specified as a factor (when the first level denotes failure and all others success)

This comports well with the way successes and failures often arise in my experience: one outcome is a catch-all (e.g. "didn't vote") and there are a variety of ways to achieve the other (e.g. "voted for A", "voted for B"). I hope it's clear from my phrasing that "success" and "failure" as defined by glm can be assigned arbitrarily, so that the first level corresponds to "failure" and all the other levels count as "success".
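A minimal sketch of that encoding (data and variable names are made up): put the catch-all level first, and glm treats it as the failure.

```r
# Illustrative data: "didn't vote" is the catch-all failure level,
# and every other level counts as a success.
d <- data.frame(
  outcome = factor(c("didn't vote", "voted for A", "voted for B",
                     "didn't vote", "voted for A")),
  age = c(30, 45, 50, 22, 61)
)

# make sure the failure level is the first level of the factor
d$outcome <- relevel(d$outcome, ref = "didn't vote")

fit <- glm(outcome ~ age, family = binomial(), data = d)
```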

Upvotes: 0

Aviad Klein

Reputation: 105

From R's help page on glm: "...or as a two-column matrix with the columns giving the numbers of successes and failures"

So it has to be cbind(Y, N-Y)

Upvotes: -2
