Reputation: 41
I am new to R, and only have a basic understanding of statistics. I am learning how to use factorial experiments, and how to fit models to results from Design and Analysis of Experiments (Montgomery 2013, ISBN: 9781118097939).
I used example 5.3 from the textbook. The data can be seen in the code below.
bottel_vul_data <- data.frame(A = rep(c(10, 12, 14), each = 2),
B = rep(c(25, 30), each = 12),
C = rep(c(200, 250), each = 6),
vul = c(-3, -1, 0, 1, 5, 4,
-1, 0, 2, 1, 7, 6,
-1, 0, 2, 3, 7, 9,
1, 1, 6, 5, 10, 11))
# bottel_vul_data
and completed the ANOVA analysis
bottel_anova <- aov(vul ~ factor(A) * B * C, data = bottel_vul_data)
summary(bottel_anova)
which yielded the same results as the textbook
Df Sum Sq Mean Sq F value Pr(>F)
factor(A) 2 252.75 126.38 178.412 1.19e-09 ***
B 1 45.37 45.37 64.059 3.74e-06 ***
C 1 22.04 22.04 31.118 0.00012 ***
factor(A):B 2 5.25 2.63 3.706 0.05581 .
factor(A):C 2 0.58 0.29 0.412 0.67149
B:C 1 1.04 1.04 1.471 0.24859
factor(A):B:C 2 1.08 0.54 0.765 0.48687
Residuals 12 8.50 0.71
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I then plotted the data to visualise it and compared four linear regression models:
The fourth model is supposed to be adequate because it includes all factors and interactions that shows a statistical significant difference. Finally, I completed an ANOVA on the four models.
data <- bottel_vul_data[c('vul', 'A', 'B', 'C')]
plot(data)
par(mfrow = c(2,2))
boxplot(vul~A, data = data, main = '% karbonering')
boxplot(vul~B, data = data, main = 'bedryfsdruk')
boxplot(vul~C, data = data, main = 'lynsnelheid')
boxplot(vul~A*B*C, data = data, main = 'A B C')
par(mfrow = c(2,2))
interaction.plot(data$A,data$B, data$vul)
interaction.plot(data$A,data$C, data$vul)
interaction.plot(data$B,data$C, data$vul)
par(mfrow = c(1,1))
model1 = lm(vul~., data = data)
summary(model1)
model2 = lm(vul~.^2, data = data)
summary(model2)
model3 = lm(vul~.^3, data = data)
summary(model3)
model4 = lm(vul ~ A + B + C + A*B, data = data)
summary(model4)
anova(model1, model2, model3, model4)
The output of the ANOVA is shown below
A anova: 4 × 6
Res.Df RSS Df Sum of Sq F Pr(>F)
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 20 21.14583 NA NA NA NA
2 17 14.47917 3 6.666667 2.46628131 0.09958762
3 16 14.41667 1 0.062500 0.06936416 0.79562649
4 19 16.08333 -3 -1.666667 0.61657033 0.61423917
Which suggests that there is no statistical significant difference between Models 3 and 4. Therefore, Model 4 should be the simplest model that describes the data adequately.
I would like to know if there is a way to visualise the fitted models by plotting them along with the real data on a graph in R? I would also like to know if there is a way to check that the model is actually representative of the data? Finally, I want to fit second order models as well so that I can get surface responses, but I have no idea how to do this in R. Is there a function like lm() that I can use?
Upvotes: 0
Views: 65
Reputation: 174468
The problem is really one of demonstrating multiple dimensions of data in a single plot. If you show one of your variables (such as A
) on the x axis, with vul
on the y axis, then you have already used up your two dimensions. It is possible to represent extra dimensions using color, or point size, or alpha, or facets, though it can be hard to show regression results this way.
With this example, we can plot vul
against A
, and since there are only two values of the C
variable, we can facet by C
. To show the effect of B
, we can draw several different regression lines in each panel at different values of B
:
library(ggplot2)
new_df <- expand.grid(A = seq(min(data$A), max(data$A), length.out = 10),
C = c(min(data$C), max(data$C)),
B = seq(min(data$B), max(data$B), length.out = 20))
new_df$vul <- predict(model4, new_df)
ggplot(new_df, aes(A, vul, color = B, group = B)) +
geom_line(alpha = 0.9) +
geom_point(data = data, shape = 21, size = 3, aes(fill = B), color = 'black') +
facet_wrap(paste('C =', C) ~ .) +
scale_color_viridis_c() +
scale_fill_viridis_c() +
theme_bw(base_size = 16)
Note that we have our original data on the plot, with each variable being identifiable by its panel, position and color.
We can see that with our model, vul
increases with A
, since all the lines are up-sloping. We can see vul
increases with C
because the lines in the second panel are just shifted upwards with higher values of C
, and we can see that there is a positive interaction between A
and B
because the lines in each panel get steeper as B gets higher (i.e. they 'fan out').
Note it would be possible to have B
on the facets and C
on the color scale to get an equally valid plot, but it just depends on what you wish to illustrate with your plot. I think the above example demonstrates nicely that both A and C are positively associated with vul
, and that there is an interaction between A
and B
. However, it doesn't show us the direction of B's association with vul
, or any of the intercept values. We could create plots to illustrate these features of the model if they are the things we wanted to demonstrate.
Upvotes: 1