Philip de Bruin

Reputation: 41

Can I visualise models by plotting them with the original data on the same graph in R?

I am new to R, and only have a basic understanding of statistics. I am learning how to use factorial experiments and how to fit models to their results, following Design and Analysis of Experiments (Montgomery 2013, ISBN: 9781118097939).

I used example 5.3 from the textbook. The data can be seen in the code below.

bottel_vul_data <- data.frame(A   = rep(c(10, 12, 14), each = 2),  # recycled to 24 rows
                              B   = rep(c(25, 30), each = 12),
                              C   = rep(c(200, 250), each = 6),    # recycled to 24 rows
                              vul = c(-3, -1, 0, 1,  5,  4,
                                      -1,  0, 2, 1,  7,  6,
                                      -1,  0, 2, 3,  7,  9,
                                       1,  1, 6, 5, 10, 11))
# bottel_vul_data

and completed the ANOVA

bottel_anova <- aov(vul ~ factor(A) * B * C, data = bottel_vul_data)
summary(bottel_anova)

which yielded the same results as the textbook

              Df Sum Sq Mean Sq F value   Pr(>F)    
factor(A)      2 252.75  126.38 178.412 1.19e-09 ***
B              1  45.37   45.37  64.059 3.74e-06 ***
C              1  22.04   22.04  31.118  0.00012 ***
factor(A):B    2   5.25    2.63   3.706  0.05581 .  
factor(A):C    2   0.58    0.29   0.412  0.67149    
B:C            1   1.04    1.04   1.471  0.24859    
factor(A):B:C  2   1.08    0.54   0.765  0.48687    
Residuals     12   8.50    0.71                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I then plotted the data to visualise it and compared four linear regression models:

  1. only considering factors A, B and C;
  2. considering factors A, B and C as well as the interactions A:B, A:C and B:C;
  3. considering all factors and interactions;
  4. and finally considering factors A, B, C and the interaction A:B.

The fourth model should be adequate because it includes all factors and interactions that show a statistically significant difference. Finally, I completed an ANOVA on the four models.

data <- bottel_vul_data[c('vul', 'A', 'B', 'C')]
plot(data)

par(mfrow = c(2,2))
boxplot(vul ~ A, data = data, main = '% carbonation')
boxplot(vul ~ B, data = data, main = 'operating pressure')
boxplot(vul ~ C, data = data, main = 'line speed')
boxplot(vul ~ A * B * C, data = data, main = 'A B C')

par(mfrow = c(2,2))
interaction.plot(data$A, data$B, data$vul)
interaction.plot(data$A, data$C, data$vul)
interaction.plot(data$B, data$C, data$vul)
par(mfrow = c(1,1))

model1 <- lm(vul ~ ., data = data)
summary(model1)

model2 <- lm(vul ~ .^2, data = data)
summary(model2)

model3 <- lm(vul ~ .^3, data = data)
summary(model3)

model4 <- lm(vul ~ A + B + C + A:B, data = data)
summary(model4)

anova(model1, model2, model3, model4)

The output of the ANOVA is shown below

Analysis of Variance Table

  Res.Df      RSS Df Sum of Sq       F  Pr(>F)
1     20 21.14583 NA        NA      NA      NA
2     17 14.47917  3  6.666667 2.46628 0.09959
3     16 14.41667  1  0.062500 0.06936 0.79563
4     19 16.08333 -3 -1.666667 0.61657 0.61424

This suggests that there is no statistically significant difference between Models 3 and 4; therefore, Model 4 should be the simplest model that describes the data adequately.
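One caveat (a sketch, using the model objects fitted above): anova() performs sequential F-tests, so the models should be supplied in order of increasing complexity, each nested in the next. Because model4 is simpler than model3, the last comparison has a negative Df. Reordering the call, or testing model4 directly against the full two-way model, gives proper tests:

```r
# Sequential comparison with the models ordered simplest -> most complex
# (model1 nested in model4, nested in model2, nested in model3),
# so every Df is non-negative:
anova(model1, model4, model2, model3)

# Direct F-test of model4 against the full two-way interaction model:
anova(model4, model2)
```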

I would like to know whether there is a way to visualise the fitted models by plotting them along with the real data on a graph in R. I would also like to know whether there is a way to check that the model is actually representative of the data. Finally, I want to fit second-order models as well, so that I can obtain response surfaces, but I have no idea how to do this in R. Is there a function like lm() that I can use?

Upvotes: 0

Views: 65

Answers (1)

Allan Cameron

Reputation: 174468

The problem is really one of demonstrating multiple dimensions of data in a single plot. If you show one of your variables (such as A) on the x axis, with vul on the y axis, then you have already used up your two dimensions. It is possible to represent extra dimensions using color, or point size, or alpha, or facets, though it can be hard to show regression results this way.

With this example, we can plot vul against A, and since there are only two values of the C variable, we can facet by C. To show the effect of B, we can draw several different regression lines in each panel at different values of B:

library(ggplot2)

new_df <- expand.grid(A = seq(min(data$A), max(data$A), length.out = 10),
                      C = c(min(data$C), max(data$C)),
                      B = seq(min(data$B), max(data$B), length.out = 20))

new_df$vul <- predict(model4, new_df)

ggplot(new_df, aes(A, vul, color = B, group = B)) +
  geom_line(alpha = 0.9) +
  geom_point(data = data, shape = 21, size = 3, aes(fill = B), color = 'black') +
  facet_wrap(paste('C =', C) ~ .) +
  scale_color_viridis_c() +
  scale_fill_viridis_c() +
  theme_bw(base_size = 16)

[Plot: fitted regression lines for model4, coloured by B and faceted by C, with the original data points overlaid]

Note that we have our original data on the plot, with each variable being identifiable by its panel, position and color.

We can see that with our model, vul increases with A, since all the lines are up-sloping. We can see vul increases with C because the lines in the second panel are just shifted upwards with higher values of C, and we can see that there is a positive interaction between A and B because the lines in each panel get steeper as B gets higher (i.e. they 'fan out').

Note that it would be possible to put B on the facets and C on the color scale to get an equally valid plot; it just depends on what you wish to illustrate. I think the above example demonstrates nicely that both A and C are positively associated with vul, and that there is an interaction between A and B. However, it doesn't show us the direction of B's association with vul, or any of the intercept values. We could create plots to illustrate these features of the model if they were the things we wanted to demonstrate.
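The question also asked how to check that the model represents the data and how to fit second-order models. A minimal sketch, assuming the data and model4 objects from the question (model_so is just an illustrative name): calling plot() on an lm object gives the standard residual diagnostics, and lm() itself accepts quadratic terms via I(). In this design only A has three levels, so only its quadratic effect is estimable; B and C each have just two levels.

```r
# Standard residual diagnostics for model4: residuals vs fitted,
# normal Q-Q, scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(model4)
par(mfrow = c(1, 1))

# Second-order model: a quadratic term is only estimable for A,
# since B and C each have only two levels
model_so <- lm(vul ~ A + I(A^2) + B + C + A:B, data = data)
summary(model_so)
```

For designed response-surface experiments, the rsm package provides rsm(), which works much like lm() but adds canonical analysis and contour/perspective plots of the fitted surface.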

Upvotes: 1
