Maximilian
Maximilian

Reputation: 4229

Replicate mean by group with model

I would like to replicate the result of simple calculation of the mean value per group using a statistical model in R.

Here is data I'm dealing with:

      EXIST    DATE VAR1 VAR2
    1     0 10/2015    6    4
    2     0 01/2016    6    4
    3     0 01/2014    5    4
    4     1 05/2015    5    4
    5     0 11/2015    6    4

Where VAR1, VAR2 are integers 1:8 (groups/or levels), and EXIST = (0,1)

I simply calculate the mean per variable:

ag <- data.frame(as.matrix(aggregate(EXIST ~ VAR1 + VAR2, 
                                     data = dat, function(x) c(mean = mean(x)))))

Now I would like to replicate this with model (i.e. lm or glm, etc) and obtain the same results as with aggregate.

Hence I would have rate of the 16 values (VAR1 1:8 and VAR 1:8).

Here is my attempt:

tmp <- with(d,
        by(d, VAR1,
           function(x) lm(EXIST ~ VAR2 + VAR1 , data = dat)))

I was told that the model should be lm(EXIST ~ VAR1 * VAR2, data=dat)

So how I go about replicating the aggregate function that calculates the mean per each factor of variable? (why asking? Well, I would like to know how to do it with model).

Upvotes: 1

Views: 176

Answers (1)

Ben Bolker
Ben Bolker

Reputation: 226332

You need to work a little harder to get R to treat each group separately: the relevant formula is VAR1:VAR2-1 (include only the interaction :, not the main effects: VAR1*VAR2 corresponds to VAR1+VAR2+VAR1:VAR2; use -1 to remove the intercept from the model).

Sample data:

set.seed(101)
dd <- data.frame(EXIST=rbinom(1000,size=1,prob=0.3),
                 VAR1=factor(sample(1:8,size=1000,replace=TRUE)),
                 VAR2=factor(sample(1:8,size=1000,replace=TRUE)))

Note that you need your grouping variables to be factors.

Fit model and extract coefficients:

head(coef(lm(EXIST~VAR1:VAR2-1,data=dd)))
## VAR11:VAR21 VAR12:VAR21 VAR13:VAR21 VAR14:VAR21 VAR15:VAR21 VAR16:VAR21 
## 0.2666667   0.2307692   0.5714286   0.1176471   0.3846154   0.1333333 

Check the first group:

mean(subset(dd,VAR1=="1" & VAR2=="1")$EXIST)  ## 0.266667

Upvotes: 1

Related Questions