Alice Hobbs
Alice Hobbs

Reputation: 1215

R: One Way Anova and pairwise post hoc tests (Turkey, Scheffe or other) for numerical columns

I have three columns in the dataframe dune (below - bottom of the page) describing the % cover of marram grass recorded for three different sand dune ecosystems:

(1) Restored; (2) Degraded; and (3) Natural;

I have performed two different One Way Anova tests (below) - test 1 and test 2 - to establish significant differences between ecosystems. Test 1 clearly shows significant differences between ecosystems; however, test 2 shows no significant differences. The box plot's (below) show stark differences in variance between ecosystems.

Afterwards, I melted the dataframe to produce a factorial column (i.e, headed Ecosystem.Type) which is also the response variable. The idea is to apply a glm model (test 3 - below)to test with a One Way Anova; however, this method was unsuccessful (please find the error message below).

Problem

I am confused whether my code to perform each One Way Anova test is correct and the correct procedure to perform post hoc tests (Turkey HSD, Scheffe or others) to distinguish pairs of ecosystems that are significantly different. If anyone can help, I would be deeply appreciative for your advice. Many thanks....

data(dune)

Test 1

dune.type.1<-aov(Natural~Restored+Degraded, data=dune)
summary.aov(dune.type.1, intercept=T)

               Df Sum Sq Mean Sq F value   Pr(>F)    
     (Intercept)  1  34694   34694 138.679 1.34e-09 ***
     Restored     1     94      94   0.375    0.548    
     Degraded     1    486     486   1.942    0.181    
     Residuals   17   4253     250                     
           ---
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Post-hoc test's

    posthoc<-TukeyHSD(dune.type.1, conf.level=0.95)

    Error in TukeyHSD.aov(dune.type.1, conf.level = 0.95) : 

    no factors in the fitted model

    In addition: Warning messages:
    1: In replications(paste("~", xx), data = mf) :
       non-factors ignored: Restored
    2: In replications(paste("~", xx), data = mf) :
       non-factors ignored: Degraded

Test 2

     dune1<-aov(Restored~Natural, data=dune)
     dune2<-aov(Restored~Degraded, data=dune)
     dune3<-aov(Degraded~Natural, data=dune)

     summary(dune1)

                 Df Sum Sq Mean Sq F value Pr(>F)
     Natural      1     86   85.58   0.356  0.558
     Residuals   18   4325  240.26               

    summary(dune2)

                 Df Sum Sq Mean Sq F value Pr(>F)
     Degraded     1    160   159.7   0.676  0.422
     Residuals   18   4250   236.1               

     summary(dune3)

                 Df Sum Sq Mean Sq F value Pr(>F)
     Natural      1  168.5  168.49   2.318  0.145
     Residuals   18 1308.5   72.69   

Test 3

melt.dune<-melt(dune, measure.vars=c("Degraded", "Restored", "Natural"))


colnames(melt.dune)=c("Ecosystem.Type", "Percentage.cover")
melt.dune$Percentage.cover<-as.numeric(melt.dune$Percentage.cover)

glm.dune<-glm(Ecosystem.Type~Percentage.cover, data=melt.dune)
summary(glm.dune)

Error

glm.dune<-glm(Ecosystem.Type~Percentage.cover, data=melt.dune)
Error in glm.fit(x = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  : 
NA/NaN/Inf in 'y'
In addition: Warning messages:
1: In Ops.factor(y, mu) : ‘-’ not meaningful for factors
2: In Ops.factor(eta, offset) : ‘-’ not meaningful for factors
3: In Ops.factor(y, mu) : ‘-’ not meaningful for factors

Melted Dataframe

structure(list(Ecosystem.Type = structure(c(1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Degraded", "Restored", 
"Natural"), class = "factor"), Percentage.cover = c(12, 17, 21, 
11, 22, 16, 7, 9, 14, 2, 3, 15, 23, 4, 19, 36, 26, 4, 15, 23, 
38, 46, 65, 35, 54, 29, 48, 13, 19, 33, 37, 55, 11, 53, 13, 24, 
28, 44, 42, 39, 18, 61, 31, 46, 51, 51, 41, 44, 55, 47, 73, 43, 
25, 42, 21, 13, 65, 30, 47, 29)), row.names = c(NA, -60L), .Names =         c("Ecosystem.Type", 
 "Percentage.cover"), class = "data.frame")

enter image description here

Data

 structure(list(Degraded = c(12L, 17L, 21L, 11L, 22L, 16L, 7L, 
 9L, 14L, 2L, 3L, 15L, 23L, 4L, 19L, 36L, 26L, 4L, 15L, 23L), 
 Restored = c(38L, 46L, 65L, 35L, 54L, 29L, 48L, 13L, 19L, 
 33L, 37L, 55L, 11L, 53L, 13L, 24L, 28L, 44L, 42L, 39L), Natural = c(18L, 
 61L, 31L, 46L, 51L, 51L, 41L, 44L, 55L, 47L, 73L, 43L, 25L, 
 42L, 21L, 13L, 65L, 30L, 47L, 29L)), .Names = c("Degraded", 
 "Restored", "Natural"), class = "data.frame", row.names = c(NA, 
 -20L))

Upvotes: 0

Views: 2646

Answers (1)

Akis
Akis

Reputation: 130

there are several things I would like to point to you.

First, the test 1 and test 2 produce similar results. The only difference is that you selected an intercept on test 1 and thus the outcome tells you that if you fit a linear model (I will come to that in a few minutes) intercept is required. Hence the significance you see is about whether the line you force to fit needs an intercept or not. Try using "intercept=T" to the other outcomes and I am pretty sure you will get similar results.

The second thing you should be careful is about the linear model you try to fit. The dune.type.1 model is a model where you actually see how correlated the different sand dune ecosystems are. In other words, you assume that there is a linear association between natural and restored and with every unit increase of the restored you have some increase on the natural. If I understood correctly what you want is to examine the mean differences and not their correlation. Thus you can do two things:

  1. The data is prepared to perform t.tests (a test that compares the mean between several categories). It is very easy to do in R and valid since all the variables are reasonably normal. However you will have multiple testing issues (you will perform probably 3 t-tests to get all the mean differences), and thus need to use a Bonferroni correction.

  2. But I think what you really want is the following:

First reform the data

       data <- data.frame(v = c(dune$Degraded, dune$Restored, dune$Natural), 
                   labels = c(rep("Degraded", times = 20), rep("Restored", times = 20), 
                              rep("Natural", times = 20)))

Then fit a linear model

    mod.1 <- lm(v ~ labels, data = data)
    summary(mod.1)
    lm(formula = v ~ labels, data = data)

Residuals:
Min      1Q  Median      3Q     Max 
-28.650 -10.725   0.875   8.050  31.350 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      14.950      3.066   4.875 9.07e-06 ***
labelsNatural    26.700      4.337   6.157 7.95e-08 ***
labelsRestored   21.350      4.337   4.923 7.64e-06 ***

where you can actually see that the mean of the baseline category (i.e the degraded) is significantly smaller with the mean of the natural category and etc. You can also check the model assumptions, to see if your model is a good fit

    qqnorm(residuals(mod.1))
    qqline(residuals(mod.1))

enter image description here They residuals are reasonably normal so the model is fine. You can also follow your anova approach and have:

    anova.model <- aov(v ~ labels, data = data))
    summary(anova.model)

             Df Sum Sq Mean Sq F value   Pr(>F)    
 labels       2   7982    3991   21.22 1.29e-07 ***
 Residuals   57  10720     188  

which indicates that there is at least one significant difference between the means of the sand dune ecosystems, and follow up with Tukey for the pointwise intervals:

    post <- TukeyHSD(aov(v ~ labels, data = data))
    plot(post, ylim = c(0, 4))

enter image description here

already adjusted for multiple testing :)

Upvotes: 1

Related Questions