Zissou
Zissou

Reputation: 13

Model building and stuck on error message in R

I am trying to analyze the data set below

   X wt   solution  process

   1 21        1       1
   2 36        2       1
   3 25        3       1
   4 18        4       1
   5 22        5       1
   6 26        1       2
   7 38        2       2
   8 27        3       2
   9 17        4       2
  10 26        5       2
  11 16        1       3
  12 25        2       3
  13 22        3       3
  14 18        4       3
  15 21        5       3
  16 28        1       4
  17 35        2       4
  18 27        3       4
  19 20        4       4
  20 24        5       4

The data is balanced, and I believe both effects are fixed. My code in R is below

str(wool)

##convert solution and process to factors

wool$solution<-as.factor(wool$solution)
wool$process<-as.factor(wool$process)


m1<-aov(wt~solution*process,data=wool)
plot(m1)

However, when I try to plot the model to check assumptions I get the following Error:

not plotting observations with leverage one:
  1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20Error in qqnorm.default(rs, main = main, ylab = ylab23, ylim = ylim, ...) : 
  y is empty or has only NAs

I am not sure how to correct this, when I plot the model without interactions

m1<-aov(wt~solution+process,data=wool)

Everything works just fine, however I need to analyze the interaction to see if it is significant. Also when I just keep the factors as numeric then it works, but these are definitely categorical factors, each refers to a type of process and a type of solution used to treat the observations.

Any help is appreciated.

Upvotes: 1

Views: 1122

Answers (1)

Len Greski
Len Greski

Reputation: 10855

The model with the interaction effect consumes 19 degrees of freedom. The degrees of freedom for the error term in an analysis of variance is N - k where N = total number of observations and k = the number of group effects.

Since N is 20 and the number of groups plus interactions, k, is also 20, the degrees of freedom across the model terms is 19. Therefore, the degrees of freedom for the error term is 0, and the leverage of each observation will be 1, meaning that each of the 19 parameters in the model plus the grand mean are completely dependent on the 20 observations in the data frame. This explains the error message returned the plot() function.

rawData <- "X wt   solution  process

   1 21        1       1
   2 36        2       1
   3 25        3       1
   4 18        4       1
   5 22        5       1
   6 26        1       2
   7 38        2       2
   8 27        3       2
   9 17        4       2
  10 26        5       2
  11 16        1       3
  12 25        2       3
  13 22        3       3
  14 18        4       3
  15 21        5       3
  16 28        1       4
  17 35        2       4
  18 27        3       4
  19 20        4       4
  20 24        5       4"

wool <- read.table(text=rawData,header=TRUE)

str(wool)

##convert solution and process to factors

wool$solution<-as.factor(wool$solution)
wool$process<-as.factor(wool$process)


m1<-aov(wt~solution*process,data=wool)
summary(m1)

...and the output:

> m1<-aov(wt~solution*process,data=wool)
> summary(m1)
                 Df Sum Sq Mean Sq
solution          4  500.8  125.20
process           3  136.8   45.60
solution:process 12   87.2    7.27

The problem becomes clear when we print the ANOVA table with anova(m1).

> anova(m1)
Analysis of Variance Table

Response: wt
                 Df Sum Sq Mean Sq F value Pr(>F)
solution          4  500.8 125.200               
process           3  136.8  45.600               
solution:process 12   87.2   7.267               
Residuals         0    0.0                       
Warning message:
In anova.lm(m1) :
  ANOVA F-tests on an essentially perfect fit are unreliable

The insufficient degrees of freedom problem is even more obvious when we use lm() to fit the model.

m2 <- lm(wt ~ solution*process,data=wool)
summary(m2)

...and the output:

> m2 <- lm(wt ~ solution*process,data=wool)
> summary(m2)

Call:
lm(formula = wt ~ solution * process, data = wool)

Residuals:
ALL 20 residuals are 0: no residual degrees of freedom!

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)              21         NA      NA       NA
solution2                15         NA      NA       NA
solution3                 4         NA      NA       NA
solution4                -3         NA      NA       NA
solution5                 1         NA      NA       NA
process2                  5         NA      NA       NA
process3                 -5         NA      NA       NA
process4                  7         NA      NA       NA
solution2:process2       -3         NA      NA       NA
solution3:process2       -3         NA      NA       NA
solution4:process2       -6         NA      NA       NA
solution5:process2       -1         NA      NA       NA
solution2:process3       -6         NA      NA       NA
solution3:process3        2         NA      NA       NA
solution4:process3        5         NA      NA       NA
solution5:process3        4         NA      NA       NA
solution2:process4       -8         NA      NA       NA
solution3:process4       -5         NA      NA       NA
solution4:process4       -5         NA      NA       NA
solution5:process4       -5         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:    NaN 
F-statistic:   NaN on 19 and 0 DF,  p-value: NA

Interaction effect with continuous variables

Regarding the OP question about the code working when the analysis is run with lm() using continuous variables, with continuous variables an interaction effect consumes a single degree of freedom, not (solutions - 1) * (processes - 1) or 12 degrees of freedom for the interaction between the two categorical variables.

Again, we can demonstrate this with lm().

wool$solution<-as.numeric(wool$solution)
wool$process<-as.numeric(wool$process)
m3 <- lm(wt ~ solution*process,data=wool)
summary(m3)
anova(m3)

...and the output:

> summary(m3)

Call:
lm(formula = wt ~ solution * process, data = wool)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.460  -3.757  -0.180   2.320  12.000 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)   
(Intercept)       28.9000     8.1454   3.548  0.00268 **
solution          -1.5000     2.4559  -0.611  0.54993   
process           -0.0100     2.9743  -0.003  0.99736   
solution:process   0.0300     0.8968   0.033  0.97373   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.341 on 16 degrees of freedom
Multiple R-squared:  0.1123,    Adjusted R-squared:  -0.05409 
F-statistic: 0.675 on 3 and 16 DF,  p-value: 0.5798

> anova(m3)
Analysis of Variance Table

Response: wt
                 Df Sum Sq Mean Sq F value Pr(>F)
solution          1  81.22  81.225  2.0200 0.1744
process           1   0.16   0.160  0.0040 0.9505
solution:process  1   0.05   0.045  0.0011 0.9737
Residuals        16 643.37  40.211               
> 

Upvotes: 1

Related Questions