Paulo Barros

Reputation: 157

What is the difference between "+" and "*" in an ANOVA model?

       GG      AMB GGXAMB     ATF6.M
1    COBB CONFORTO     CC  1.7391386
2    COBB CONFORTO     CC  0.8269537
3    COBB CONFORTO     CC  0.3464495
4    COBB CONFORTO     CC  1.3126458
5    COBB CONFORTO     CC  1.3938351
6    COBB CONFORTO     CC  1.0969472
7    COBB   STRESS     CS  3.1431619
8    COBB   STRESS     CS  0.9023480
9    COBB   STRESS     CS  2.5106332
10   COBB   STRESS     CS  1.2833235
11   COBB   STRESS     CS  0.4485298
12   COBB   STRESS     CS  0.3553028
13 PELOCO CONFORTO     PC  0.3481456
14 PELOCO CONFORTO     PC  2.5095779
15 PELOCO CONFORTO     PC  0.8871572
16 PELOCO CONFORTO     PC  2.3148108
17 PELOCO CONFORTO     PC 73.2463832
18 PELOCO CONFORTO     PC 16.0056771
19 PELOCO   STRESS     PS 15.4836898
20 PELOCO   STRESS     PS  1.2041695
21 PELOCO   STRESS     PS  1.8424005
22 PELOCO   STRESS     PS  0.9193776
23 PELOCO   STRESS     PS  0.9451780
24 PELOCO   STRESS     PS  0.9715508

Sorry if the question is too basic, but I haven't found an answer yet.

What is the statistical difference between these two models in an ANOVA in R?

  1. aov(ATF6.M ~ GG + AMB + GGXAMB, data)
  2. aov(ATF6.M ~ GG*AMB, data)

I noticed from the results that when you use "*", R computes the ANOVA for each independent variable and also for the interaction (e.g. GG:AMB). If you look at my table, the GGXAMB variable is exactly that interaction. But when I compare the values reported for GG:AMB in the ANOVA summary with those reported for GGXAMB under formula 1, they are close but not the same. Are my models right?

Upvotes: 2

Views: 69

Answers (1)

StupidWolf

Reputation: 46978

Using your data:

data = structure(list(GG = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("COBB", "PELOCO"), class = "factor"), AMB = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("CONFORTO", "STRESS"), class = "factor"), 
    GGXAMB = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
    2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L
    ), .Label = c("CC", "CS", "PC", "PS"), class = "factor"), 
    ATF6.M = c(1.7391386, 0.8269537, 0.3464495, 1.3126458, 1.3938351, 
    1.0969472, 3.1431619, 0.902348, 2.5106332, 1.2833235, 0.4485298, 
    0.3553028, 0.3481456, 2.5095779, 0.8871572, 2.3148108, 73.2463832, 
    16.0056771, 15.4836898, 1.2041695, 1.8424005, 0.9193776, 
    0.945178, 0.9715508)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24"
))

We do the anova:

f1 = aov(ATF6.M ~ GG + AMB + GGXAMB, data=data)
f2 = aov(ATF6.M ~ GG * AMB, data=data)

The variance that can be explained is essentially the same:

summary(f1)
            Df Sum Sq Mean Sq F value Pr(>F)
GG           1    428   427.7   1.990  0.174
AMB          1    216   216.1   1.005  0.328
GGXAMB       1    240   239.9   1.116  0.303
Residuals   20   4299   214.9               
summary(f2)
            Df Sum Sq Mean Sq F value Pr(>F)
GG           1    428   427.7   1.990  0.174
AMB          1    216   216.1   1.005  0.328
GG:AMB       1    240   239.9   1.116  0.303
Residuals   20   4299   214.9 
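
One way to check that both models explain the same variance is to compare their fitted values directly. The following is a minimal sketch, not part of the original analysis; it rebuilds the data frame with rep() instead of the dput() output above, which is assumed to give the same factor levels and values.

```r
# Rebuild the data from the question (24 rows, 6 per GG x AMB cell).
data <- data.frame(
  GG     = factor(rep(c("COBB", "PELOCO"), each = 12)),
  AMB    = factor(rep(rep(c("CONFORTO", "STRESS"), each = 6), times = 2)),
  GGXAMB = factor(rep(c("CC", "CS", "PC", "PS"), each = 6)),
  ATF6.M = c(1.7391386, 0.8269537, 0.3464495, 1.3126458, 1.3938351,
             1.0969472, 3.1431619, 0.902348, 2.5106332, 1.2833235,
             0.4485298, 0.3553028, 0.3481456, 2.5095779, 0.8871572,
             2.3148108, 73.2463832, 16.0056771, 15.4836898, 1.2041695,
             1.8424005, 0.9193776, 0.945178, 0.9715508)
)

f1 <- aov(ATF6.M ~ GG + AMB + GGXAMB, data = data)
f2 <- aov(ATF6.M ~ GG * AMB, data = data)

# Both models predict the cell mean for every observation,
# so the fitted values (and hence the residuals) coincide.
same_fit <- isTRUE(all.equal(fitted(f1), fitted(f2)))
print(same_fit)

# Residual degrees of freedom also agree.
print(c(df.residual(f1), df.residual(f2)))
```

If same_fit is TRUE, the two formulas describe the same fitted model and differ only in how the coefficients are parameterised.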

The coefficients are different:

f1$coefficients
(Intercept)    GGPELOCO   AMBSTRESS    GGXAMBCS    GGXAMBPC    GGXAMBPS 
   1.119328   14.765964  -12.324231   12.645452          NA          NA 
f2$coefficients
       (Intercept)           GGPELOCO          AMBSTRESS GGPELOCO:AMBSTRESS 
         1.1193283         14.7659637          0.3212216        -12.6454525 

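This is because in the first regression, combinations of the GGXAMB levels reproduce the other factors. For example, CC + CS gives you COBB in GG, and likewise CS + PS gives you STRESS in AMB. This makes two of the GGXAMB coefficients redundant, which causes problems in estimating the coefficients. The effect in this case is that GGXAMBPC and GGXAMBPS come out as NA, while AMBSTRESS and GGXAMBCS take values that no longer mean the same thing as AMBSTRESS and the interaction term in the second model.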
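
The redundancy can be seen by inspecting the design matrices themselves. This is a small sketch under the assumption that R's default treatment contrasts are in use (so the column names are GGPELOCO, AMBSTRESS, GGXAMBCS, and so on); the data frame is rebuilt with rep() for self-containment.

```r
# Only the factors are needed to build the design matrices.
data <- data.frame(
  GG     = factor(rep(c("COBB", "PELOCO"), each = 12)),
  AMB    = factor(rep(rep(c("CONFORTO", "STRESS"), each = 6), times = 2)),
  GGXAMB = factor(rep(c("CC", "CS", "PC", "PS"), each = 6))
)

X1 <- model.matrix(~ GG + AMB + GGXAMB, data)  # first formula
X2 <- model.matrix(~ GG * AMB, data)           # second formula

# Number of columns versus numerical rank of each design matrix.
rank1 <- qr(X1)$rank
rank2 <- qr(X2)$rank
print(c(columns = ncol(X1), rank = rank1))  # more columns than rank
print(c(columns = ncol(X2), rank = rank2))  # columns equal rank

# The GGXAMB dummy columns add up to the GG and AMB dummy columns.
gg_redundant  <- all(X1[, "GGXAMBPC"] + X1[, "GGXAMBPS"] == X1[, "GGPELOCO"])
amb_redundant <- all(X1[, "GGXAMBCS"] + X1[, "GGXAMBPS"] == X1[, "AMBSTRESS"])
print(c(gg_redundant, amb_redundant))
```

Calling alias(f1) on the fitted first model is another way to list the linearly dependent terms.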

You can read a bit about it in this discussion and maybe this; the relevant term is a full-rank design matrix.

To answer your question, you should use aov(ATF6.M ~ GG*AMB, data) or aov(ATF6.M ~ GG+AMB+GG:AMB, data). Both fit a linear model on a full-rank design matrix, so all the coefficients are estimable (as you can see above).
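
As a quick illustration that the two recommended formulas are the same model, and of what the interaction coefficient measures, here is a hedged sketch. The names f3, cell and diff_of_diffs are introduced only for this example, and default treatment contrasts are assumed.

```r
# Rebuild the data from the question (24 rows, 6 per GG x AMB cell).
data <- data.frame(
  GG     = factor(rep(c("COBB", "PELOCO"), each = 12)),
  AMB    = factor(rep(rep(c("CONFORTO", "STRESS"), each = 6), times = 2)),
  ATF6.M = c(1.7391386, 0.8269537, 0.3464495, 1.3126458, 1.3938351,
             1.0969472, 3.1431619, 0.902348, 2.5106332, 1.2833235,
             0.4485298, 0.3553028, 0.3481456, 2.5095779, 0.8871572,
             2.3148108, 73.2463832, 16.0056771, 15.4836898, 1.2041695,
             1.8424005, 0.9193776, 0.945178, 0.9715508)
)

f2 <- aov(ATF6.M ~ GG * AMB, data = data)
f3 <- aov(ATF6.M ~ GG + AMB + GG:AMB, data = data)

# "*" is shorthand for main effects plus interaction,
# so the coefficients of the two fits agree.
same_coef <- isTRUE(all.equal(coef(f2), coef(f3)))
print(same_coef)

# Cell means by GG (rows) and AMB (columns).
cell <- tapply(data$ATF6.M, list(data$GG, data$AMB), mean)

# With treatment contrasts, the interaction coefficient is the
# difference between the STRESS effect in PELOCO and in COBB.
diff_of_diffs <- (cell["PELOCO", "STRESS"] - cell["PELOCO", "CONFORTO"]) -
                 (cell["COBB", "STRESS"]   - cell["COBB", "CONFORTO"])
print(c(diff_of_diffs, unname(coef(f2)["GGPELOCO:AMBSTRESS"])))
```

Because every coefficient in this parameterisation maps onto a comparison of cell means, none of them is dropped as NA.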

Upvotes: 2
