Cathrin
Cathrin

Reputation: 51

Manual interaction plot linear regression in R

I am trying to predict the mean abundance of animals sighted during different moon phases (factor), using log-transformed abundance data (better fit) and some other variables. The best model (lowest AIC) turned out to include the interaction of phase and survey duration and the cloud cover (both continuous):

LMoon<-lm(ln~Phase*Duration+Clouds, data=abund)

summary(LMoon)

Call:
lm(formula = ln ~ Phase * Duration + Clouds, data = abund)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.75416 -0.46311  0.09522  0.46591  1.85978 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       0.382031   0.876865   0.436 0.664125    
Phase2            2.130065   1.226305   1.737 0.085851 .  
Phase3            1.971060   1.818542   1.084 0.281351    
Phase4            0.608043   1.140122   0.533 0.595146    
Phase5            4.786674   1.151850   4.156 7.44e-05 ***
Phase6            0.958706   1.046831   0.916 0.362238    
Phase7            0.254711   3.425214   0.074 0.940888    
Phase8            0.865995   1.043916   0.830 0.409005    
Duration          0.069153   0.035407   1.953 0.053952 .  
Clouds           -0.004259   0.002401  -1.774 0.079494 .  
Phase2:Duration  -0.087843   0.047818  -1.837 0.069545 .  
Phase3:Duration  -0.089908   0.069652  -1.291 0.200109    
Phase4:Duration  -0.005424   0.046675  -0.116 0.907749    
Phase5:Duration  -0.172016   0.049369  -3.484 0.000768 ***
Phase6:Duration  -0.035597   0.041435  -0.859 0.392583    
Phase7:Duration   0.024084   0.176773   0.136 0.891939    
Phase8:Duration  -0.033424   0.042064  -0.795 0.428963    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7721 on 89 degrees of freedom
Multiple R-squared:  0.3368,    Adjusted R-squared:  0.2176 
F-statistic: 2.825 on 16 and 89 DF,  p-value: 0.0009894

Now, due to this interaction, I need to make an interaction plot (CI are too wide when plotting the lsmeans). I've tried to use different functions mentioned out there but none of them worked. Apparently I need to calculate and plot manually which I did like this:

intercepts <- c(coef(LMoon)["(Intercept)"],
                coef(LMoon)["(Intercept)"] + coef(LMoon)["Phase2"],
                coef(LMoon)["(Intercept)"] + coef(LMoon)["Phase3"],
                coef(LMoon)["(Intercept)"] + coef(LMoon)["Phase4"],
                coef(LMoon)["(Intercept)"] + coef(LMoon)["Phase5"],
                coef(LMoon)["(Intercept)"] + coef(LMoon)["Phase6"],
                coef(LMoon)["(Intercept)"] + coef(LMoon)["Phase7"],
                coef(LMoon)["(Intercept)"] + coef(LMoon)["Phase8"])

lines.df <- data.frame(intercepts = intercepts,
                       slopes = c(coef(LMoon)["Duration"],
                                  coef(LMoon)["Duration"]+coef(LMoon)["Phase2:Duration"],
                                  coef(LMoon)["Duration"]+coef(LMoon)["Phase3:Duration"],
                                  coef(LMoon)["Duration"]+coef(LMoon)["Phase4:Duration"],
                                  coef(LMoon)["Duration"]+coef(LMoon)["Phase5:Duration"],
                                  coef(LMoon)["Duration"]+coef(LMoon)["Phase6:Duration"],
                                  coef(LMoon)["Duration"]+coef(LMoon)["Phase7:Duration"],
                                  coef(LMoon)["Duration"]+coef(LMoon)["Phase8:Duration"]),
                       Phase2 = levels(abund$Phase))

qplot(x = Duration, y = Sp2, color = Phase, data = abund) + 
  geom_abline(aes(intercept = intercepts, 
                  slope = slopes, 
                  color = Phase), data = lines.df)

The plot that I get is wrong as the y-values are on the original, true scale, but the lines are based on the lm which uses the log-transformed data.

interaction plot abundance, duration, lunar phases

To back-transform this someone told me that I won't get straight lines in the end, actually. Rather than using abline(), I should create a set of e.g. 100 new x values that cover the range of the duration data and use the coefficients to calculate your predicted y values. Then plot these using lines() and it should look like a smooth curve.

And this is where I get lost.

So I created a the set of new x-values for the range of the survey duration (min 15 max 45 min): dur2 <- seq(from = 15, to = 45, length.out=100)

Then once I got those values I am supposed to get the predicted y value for each x-value using the coefficients of my LM. After that, back-transforming the y-values to the original scale. And then using the x-values and the back-transformed y-values to add the lines to the plot.

How do I get the predicted values now exactly? I cannot use any pred type/function, I've tried it all. It just doesn't work with my model, so manual is the only way, but I have no clue how...

Hope anyone can help me with this, I've been trying for weeks and in despair by now, close to giving up.

Cheers!

PS Here the data:

> dput(subset(abund, Phase %in% c("Phase1", "Phase2")))

structure(list(Year = integer(0), Date = structure(integer(0), .Label = c("01/08/2009", 
"01/08/2016", "02/07/2019", "02/08/2009", "02/08/2012", "02/08/2016", 
"02/09/2007", "03/08/2007", "03/08/2009", "03/08/2014", "03/08/2015", 
"04/07/2019", "04/08/2009", "04/08/2013", "05/08/2009", "05/08/2014", 
"05/08/2015", "06/07/2008", "06/07/2019", "07/08/2009", "08/07/2010", 
"09/07/2010", "09/08/2015", "10/08/2009", "11/08/2009", "12/08/2009", 
"13/08/2009", "13/08/2014", "14/08/2009", "14/08/2012", "16/07/2006", 
"18/07/2009", "18/08/2015", "19/07/2011", "20/08/2009", "21/07/2011", 
"21/09/2009", "22/07/2011", "22/07/2016", "22/07/2017", "23/07/2007", 
"23/07/2016", "23/07/2017", "24/07/2017", "25/07/2007", "25/07/2010", 
"25/07/2017", "25/08/2016", "26/07/2010", "26/07/2011", "27/07/2006", 
"27/07/2011", "27/07/2012", "28/07/2016", "29/06/2019", "29/07/2005", 
"29/07/2009", "29/07/2010", "29/07/2016", "29/07/2019", "30/07/2005", 
"30/07/2007", "30/07/2016", "30/08/2005", "31/07/2005", "31/07/2009", 
"31/07/2014", "31/07/2016"), class = "factor"), NrSurvey = integer(0), 
    Duration = integer(0), Sp2 = integer(0), Phase = structure(integer(0), .Label = c("1", 
    "2", "3", "4", "5", "6", "7", "8"), class = "factor"), Clouds = integer(0), 
    Visibility = integer(0), ln = numeric(0)), row.names = integer(0), class = "data.frame")

Upvotes: 1

Views: 428

Answers (1)

Gregor Thomas
Gregor Thomas

Reputation: 146090

Use predict to get predicted values. Don't calculate manually.

Use expand.grid() to generate a data frame of all combinations of your dur2 sequence and the other predictors at the value(s) you want plot. Something like this:

prediction_data = expand.grid(
  Duration = dur2,
  Phase= unique(abund$Phase),
  Clouds = mean(abund$Clouds) # Hold Clouds constant at some value
)

# column names in prediction_data need to match those in the model formula
prediction_data$pred = predict(LMoon, newdata = prediction_data)
prediction_data$pred_orig = exp(prediction_data$pred)

# plot
ggplot(prediction_data, aes(x = Duration, y = pred_orig, color = Phase)) +
  geom_line() +
  geom_point(data = abund)

Something like that should work.

Another nice option is to use broom::augment to generate the predictions. This can easily give standard errors and residuals for each prediction point as well.

library(broom)
prediction_data = augment(LMoon, newdata = prediction_data)

Upvotes: 2

Related Questions