Cathrin
Cathrin

Reputation: 51

Calculating predicted values for interaction plot incl categorial factors

I am trying to predict the presence of a spieces sighted during different moon phases using glm() and family=binomial, incl a couple of other environmental variables such as seastate, wind direction and survey duration. The best model (lowest AIC) is:

> GLMoon <- glm(Gg~Phase*Duration+WindDir+Seastate, data=Pres, family=binomial)
> anova(GLMoon, test="Chisq")

Analysis of Deviance Table

Model: binomial, link: logit

Response: Gg

Terms added sequentially (first to last)


                  Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                               5783     7917.1              
Phase              7   16.306      5776     7900.8   0.02246 *  
Duration           1   58.762      5775     7842.0 1.779e-14 ***
WindDir            8   44.156      5767     7797.9 5.315e-07 ***
Seastate           3  103.786      5764     7694.1 < 2.2e-16 ***
Phase:Duration     7   18.191      5757     7675.9   0.01114 * 

Since the interaction is significant I would like to plot this interaction as follows (based on a previous question, but on abundance data, see: Manual interaction plot linear regression in R)

> #range of survey duration 15 - 45 min
> dur2 <- seq(from = 15, to = 45, length.out=100) 
> 
> # use predict to get predicted values and expand.grid() to generate a data frame of all combinations of your dur2 sequence 
> 
> prediction_data = expand.grid(
+   Duration = dur2,
+   Phase = unique(Pres$Phase),
+   WindDir = unique(Pres$WindDir),
+   Seastate = unique(Pres$Seastate)
+ )
> 
> # column names in prediction_data need to match those in the model formula
> prediction_data$pred = predict(GLMoon, newdata = prediction_data)
> prediction_data$predorg = exp(prediction_data$pred)
> 
> prediction_data
    Duration Phase    WindDir Seastate         pred   predorg
1   15.00000        5       S        1 -0.203011901 0.8162685
2   16.03448        5       S        1 -0.176720796 0.8380137
3   17.06897        5       S        1 -0.150429692 0.8603382
4   18.10345        5       S        1 -0.124138587 0.8832574
5   19.13793        5       S        1 -0.097847482 0.9067872
6   20.17241        5       S        1 -0.071556378 0.9309438
7   21.20690        5       S        1 -0.045265273 0.9557439
8   22.24138        5       S        1 -0.018974169 0.9812047
9   23.27586        5       S        1  0.007316936 1.0073438
10  24.31034        5       S        1  0.033608041 1.0341792
11  25.34483        5       S        1  0.059899145 1.0617295
12  26.37931        5       S        1  0.086190250 1.0900137
13  27.41379        5       S        1  0.112481355 1.1190514
...
 [ reached 'max' ...

This has worked with the abundance data (lm()), but now due to categorial factors included in the model I get multiple duration values that are the same for each factor basically, see:

screenshot predicted_data

I cannot average those factors when using the expand.grid() to calculate predicted values, so I am not sure how I can fix this or if this is even possible at all? The problem is that when plotting this the predicted value points get connected vertically due to multiple values of the same duration (but different seastate/wind direction).

> library(viridis)  
> plot.interact1 <- prediction_data %>%
+   ggplot(aes(x=Duration, y=predorg, group=Phase, colour=Phase)) +
+   geom_line(size=0.5) +
+   guides(color=guide_legend(title="Lunar Phase")) +
+   ylab("Predicted mean presence") +
+   xlab("Survey duration") +
+   labs(fill="Lunar Phases")
> 
> plot.interact1

Interaction plot

Upvotes: 0

Views: 50

Answers (1)

Cathrin
Cathrin

Reputation: 51

Worked it out -- needed to aggregate before back transforming the data like:

> prediction_data = expand.grid(
+   Duration = dur2,
+   Phase = unique(Pres$Phase),
+   WindDir = unique(Pres$WindDir),
+   Seastate = unique(Pres$Seastate)
+ )

> # column names in prediction_data need to match those in the model formula

> prediction_data$pred = predict(GLMoon, newdata = prediction_data)
> prediction_data
    Duration Phase    WindDir Seastate         pred
1   15.00000        5       S        1 -0.203011901
2   15.30303        5       S        1 -0.195310466
3   15.60606        5       S        1 -0.187609032
4   15.90909        5       S        1 -0.179907597
5   16.21212        5       S        1 -0.172206162
6   16.51515        5       S        1 -0.164504728
7   16.81818        5       S        1 -0.156803293
8   17.12121        5       S        1 -0.149101858
9   17.42424        5       S        1 -0.141400423
10  17.72727        5       S        1 -0.133698989
...
 [ reached 'max' / getOption("max.print") -- omitted 31800 rows ]

> # aggregate before back transform -- average over sea state and wind direction
> mean.pred <- aggregate(list(predmean=prediction_data$pred), by=list(Duration=prediction_data$Duration, Phase2.2=prediction_data$Phase2.2), mean, na.rm=TRUE)

> #back transform predicted means
> mean.pred$predmeanorg <- exp(mean.pred$predmean)
> mean.pred
    Duration Phase         predmean predmeanorg
1   15.00000        1 -0.5281409628   0.5897002
2   15.30303        1 -0.5136241636   0.5983232
3   15.60606        1 -0.4991073644   0.6070723
4   15.90909        1 -0.4845905652   0.6159493
5   16.21212        1 -0.4700737660   0.6249562
6   16.51515        1 -0.4555569668   0.6340947
7   16.81818        1 -0.4410401676   0.6433669
8   17.12121        1 -0.4265233684   0.6527746
9   17.42424        1 -0.4120065692   0.6623199
10  17.72727        1 -0.3974897700   0.6720048
...
 [ reached 'max' / getOption("max.print") -- omitted 550 rows ]

Upvotes: 0

Related Questions