erica_sn

Reputation: 11

mgcv GAM plot and predict with tensor smooth by factor

I am running GAMs to understand spatio-temporal trends in size counts of a Californian mollusc. The response is count data modelled with three-way interactions between space and time (latitude, longitude, year) and between north/east currents and time (uo, vo, year), each partitioned by three size classes (small, med, large). Here's the GAM:

count_te_model.xy.vo.I = gam(count ~ size_bin +
                                te(latitude, longitude, year, d=c(2,1), by=size_bin) +
                                te(vo, uo, year, d=c(2,1), by=size_bin) +
                                offset(log(plots_sampled)),
                              data=LG_count_plot_mpa_F, family=nb(link="log"), method="REML")


summary(count_te_model.xy.vo.I)

Family: Negative Binomial(2.271) 
Link function: log 

Formula:
count ~ size_bin + te(latitude, longitude, year, d = c(2, 1), 
    by = size_bin) + te(vo, uo, year, d = c(2, 1), by = size_bin) + 
    offset(log(plots_sampled))

Parametric coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    2.60406    0.02163 120.411   <2e-16 ***
size_binmed    0.30197    0.03050   9.900   <2e-16 ***
size_binsmall  0.04658    0.03093   1.506    0.132    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
                                            edf Ref.df Chi.sq p-value    
te(latitude,longitude,year):size_binlarge 44.64  51.05  630.2  <2e-16 ***
te(latitude,longitude,year):size_binmed   55.82  65.78  563.4  <2e-16 ***
te(latitude,longitude,year):size_binsmall 53.13  60.41  724.4  <2e-16 ***
te(vo,uo,year):size_binlarge              30.58  40.02  105.3  <2e-16 ***
te(vo,uo,year):size_binmed                37.54  49.24  135.8  <2e-16 ***
te(vo,uo,year):size_binsmall              53.13  67.03  266.2  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.429   Deviance explained = 54.4%
-REML =  15736  Scale est. = 1         n = 2944

I now want to map the change in abundance over time for each size class, and was wondering if anyone knows how best to do this with three-way interactions by factor?

I have tried "plot.gam" as the following:

plot(count_te_model.xy.vo.I, all.terms=TRUE, too.far=0.05)

And it produces these plots:

[figures: partial-effect plots for large, medium, and small individuals]

I've also created an x-y grid on which to predict the GAM output and then map it. I'm using the predict function like so:

head(predict_count_coast_L)
# A tibble: 6 x 8
  longitude latitude  year size_bin plots_sampled    uo    vo model_fit
      <dbl>    <dbl> <dbl> <chr>            <dbl> <dbl> <dbl>     <dbl>
1     -124.     41.7  1995 large                5     1     1     0.162
2     -124.     41.7  1995 large                5     1     1     0.161
3     -124.     41.7  1995 large                5     1     1     0.160
4     -124.     41.7  1995 large                5     1     1     0.159
5     -124.     41.7  1995 large                5     1     1     0.158
6     -124.     41.7  1995 large                5     1     1     0.157

predict_count_coast_L$model_fit = predict(count_te_model.xy.vo.I,
                                          predict_count_coast_L,type = "link", 
                                          exclude = "te(vo, uo, year, d=c(2,1), by=size_bin)")

ggplot(aes(longitude, latitude, fill= model_fit),
       data=predict_count_coast_L)+
  geom_tile()+
  facet_wrap(~year,nrow=3)+
  scale_fill_viridis("count")+
  ggtitle("large individuals")+
  theme_bw(10)

This produces maps whose patterns look quite different from the plot.gam() output. (*Note that the years now increase from top to bottom!)

[figures: prediction maps for large, medium, and small individuals]

I'm also still trying to understand the output values from the "link" predict type... the documentation says it 'produces predictions on the scale of the additive predictors', but I am struggling to understand what this actually means. Are these log-link values?
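To make my confusion concrete, this is the kind of check I have in mind (just my assumption that the two scales are related by the log link):

# my assumption: with family = nb(link = "log"), type = "link" gives
# log-scale values, so exponentiating them should match type = "response"
link_fit <- predict(count_te_model.xy.vo.I, predict_count_coast_L, type = "link")
resp_fit <- predict(count_te_model.xy.vo.I, predict_count_coast_L, type = "response")
all.equal(exp(link_fit), resp_fit)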

I've also tried the above with type = "response" instead of "link", and it gives me patterns that differ even more (shown just for small individuals here):

[figure: prediction maps for small individuals, response scale]

If anyone has an idea why these give different outputs, and whether there is a preferred way of predicting/plotting GAMs like these, that would be much appreciated!!

Update #1

Trying to cross-validate models

I am comparing different types of hierarchical models, as described in this paper: [https://peerj.com/articles/6876/?td=tw], by conducting a cross-validation using even years as training and odd years as testing. I am not sure how to back-transform the link values onto the original scale of the size counts. All models use the family nb(link="log"). I have tried the linkinv function below, but I'm not sure whether this is correct or whether I can just use exp()... Any advice would be super helpful!

# split by year parity: even years for training, odd years for testing
LG_train <- subset(LG_count_plot_mpa_F, year %% 2 == 0)
LG_test  <- subset(LG_count_plot_mpa_F, year %% 2 == 1)

# link-scale predictions from each candidate model over the full data set
LG_predict_m = mutate(
  LG_count_plot_mpa_F,
  lg1_model = as.vector(predict(count_te_model.xy.vo.I, LG_count_plot_mpa_F, type = "link")),
  lg2_model = as.vector(predict(count_te_model.xy.vo.G, LG_count_plot_mpa_F, type = "link")),
  lg3_model = as.vector(predict(count_te_model.xy.vo.GI, LG_count_plot_mpa_F, type = "link")),
  data_type = factor(ifelse(year %% 2 == 0, "training", "testing"),
                     levels = c("training", "testing"))
)

# inverse link function of the fitted family (log link, so this is exp())
ilink <- family(count_te_model.xy.vo.I)$linkinv

# back-transform the link-scale predictions onto the count scale
LG_predict_m_2 = mutate(
  LG_count_plot_mpa_F,
  lg1_link = as.vector(ilink(LG_predict_m$lg1_model)),
  lg2_link = as.vector(ilink(LG_predict_m$lg2_model)),
  lg3_link = as.vector(ilink(LG_predict_m$lg3_model)),
  data_type = factor(ifelse(year %% 2 == 0, "training", "testing"),
                     levels = c("training", "testing"))
)

# reshape to long format: one row per model prediction
LG_predict = gather(LG_predict_m_2, key = model, value = count_est,
                    lg1_link:lg3_link) %>%
  mutate(count_est = as.numeric(count_est))

# out-of-sample R^2 on the held-out (odd) years
forecast_accuracy_m = LG_predict %>%
  group_by(model) %>%
  filter(data_type == "testing") %>%
  summarize(out_of_sample_r2 = round(cor(log(count_est), log(count))^2, 2))
print(forecast_accuracy_m)
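For reference, the alternative I was wondering about (assuming the back-transform really is just the inverse of the log link) would be to skip the manual step and ask predict() for the response scale directly:

# alternative (my assumption): predict directly on the count scale; for the
# log link this should equal ilink(lg1_model), i.e. exp(lg1_model)
lg1_resp <- as.vector(predict(count_te_model.xy.vo.I,
                              LG_count_plot_mpa_F, type = "response"))
all.equal(lg1_resp, as.vector(ilink(LG_predict_m$lg1_model)))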

Upvotes: 1

Views: 1504

Answers (1)

Gavin Simpson

Reputation: 174778

You are plotting two very different things: the plots produced via plot() show the partial effects of the selected smooths (the ones you showed), whereas you are predicting from the full model, so the predictions include the effects of all the variables/terms in the model.

You don't exclude smooths the way you are doing it; you should supply the names of the smooths you want to exclude exactly as they appear in the table produced by summary(). So you want:

exclude = c("te(vo,uo,year):size_binlarge",
            "te(vo,uo,year):size_binmed",
            "te(vo,uo,year):size_binlarge")

but even doing that won't get you what you want (assuming you are trying to replicate the output from plot()), as the other parametric terms and the model intercept will also be included in the values generated by predict(), which means your plots will include the group means as well.

I can see a few options:

  1. predict with type = "terms" and identify the columns of the resulting matrix associated with each of the three smooths (the factor-by smooths) you want to plot (see the sketch just after this list).

  2. you could grab the output from the plot() command and then use the data in that object to plot what you want with ggplot: pdat <- plot(....)

  3. use gratia::smooth_estimates() to evaluate the smooths over a grid of values and then use the object returned by that function with ggplot to do the plotting (a rough sketch follows the note below).
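For option 1, here's a minimal sketch (untested; it assumes your prediction grid predict_count_coast_L and that the term labels match what summary() shows):

# type = "terms" returns a matrix with one column per model term, labelled
# exactly as in summary(), each giving that term's contribution to the
# linear predictor (the log scale here)
pred_terms <- predict(count_te_model.xy.vo.I,
                      newdata = predict_count_coast_L,
                      type = "terms")
colnames(pred_terms)  # check the exact labels first

# keep only the contribution of the spatial smooth for large individuals
predict_count_coast_L$spatial_fit <-
  pred_terms[, "te(latitude,longitude,year):size_binlarge"]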

(I hope to have draw() working for 3- and 4-d smooths within {gratia} before the end of the year [2021].)
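And for option 3, a rough sketch (argument and column names in {gratia} have changed across versions, so check ?smooth_estimates on your install rather than trusting the names here):

library(gratia)

# evaluate the smooths over a grid of covariate values; the result is a
# tibble containing the smooth label (matching summary()), the covariate
# values, and the estimated partial effect, ready for plotting with ggplot()
sm <- smooth_estimates(count_te_model.xy.vo.I)
head(sm)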

Upvotes: 1
