Reputation: 4229
I would like to replicate the result of simple calculation of the mean value per group using a statistical model in R
.
Here is data I'm dealing with:
EXIST DATE VAR1 VAR2
1 0 10/2015 6 4
2 0 01/2016 6 4
3 0 01/2014 5 4
4 1 05/2015 5 4
5 0 11/2015 6 4
Where VAR1
, VAR2
are integers 1:8 (groups/or levels), and EXIST
= (0,1)
I simply calculate the mean per variable:
ag <- data.frame(as.matrix(aggregate(EXIST ~ VAR1 + VAR2,
data = dat, function(x) c(mean = mean(x)))))
Now I would like to replicate this with model (i.e. lm
or glm
, etc) and obtain the same results as with aggregate
.
Hence I would have rate of the 16 values (VAR1 1:8 and VAR 1:8).
Here is my attempt:
tmp <- with(d,
by(d, VAR1,
function(x) lm(EXIST ~ VAR2 + VAR1 , data = dat)))
I was told that the model should be lm(EXIST ~ VAR1 * VAR2, data=dat)
So how I go about replicating the aggregate function
that calculates the mean per each factor of variable? (why asking? Well, I would like to know how to do it with model).
Upvotes: 1
Views: 176
Reputation: 226332
You need to work a little harder to get R to treat each group separately: the relevant formula is VAR1:VAR2-1
(include only the interaction :
, not the main effects: VAR1*VAR2
corresponds to VAR1+VAR2+VAR1:VAR2
; use -1
to remove the intercept from the model).
Sample data:
set.seed(101)
dd <- data.frame(EXIST=rbinom(1000,size=1,prob=0.3),
VAR1=factor(sample(1:8,size=1000,replace=TRUE)),
VAR2=factor(sample(1:8,size=1000,replace=TRUE)))
Note that you need your grouping variables to be factors.
Fit model and extract coefficients:
head(coef(lm(EXIST~VAR1:VAR2-1,data=dd)))
## VAR11:VAR21 VAR12:VAR21 VAR13:VAR21 VAR14:VAR21 VAR15:VAR21 VAR16:VAR21
## 0.2666667 0.2307692 0.5714286 0.1176471 0.3846154 0.1333333
Check the first group:
mean(subset(dd,VAR1=="1" & VAR2=="1")$EXIST) ## 0.266667
Upvotes: 1