Reputation: 11
I am running GAMs to understand spatio-temporal trends in size-class counts of a Californian mollusc. The response is count data, modelled with three-way interactions between space and time (latitude, longitude, year) and between north/east currents and time (uo, vo, year), each partitioned by three size classes (small, med, large). Here's the GAM:
count_te_model.xy.vo.I = gam(count ~ size_bin +
te(latitude, longitude, year, d=c(2,1), by=size_bin) +
te(vo, uo, year, d=c(2,1), by=size_bin) +
offset(log(plots_sampled)),
data=LG_count_plot_mpa_F, family=nb(link="log"), method="REML")
summary(count_te_model.xy.vo.I)
Family: Negative Binomial(2.271)
Link function: log
Formula:
count ~ size_bin + te(latitude, longitude, year, d = c(2, 1),
by = size_bin) + te(vo, uo, year, d = c(2, 1), by = size_bin) +
offset(log(plots_sampled))
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.60406 0.02163 120.411 <2e-16 ***
size_binmed 0.30197 0.03050 9.900 <2e-16 ***
size_binsmall 0.04658 0.03093 1.506 0.132
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df Chi.sq p-value
te(latitude,longitude,year):size_binlarge 44.64 51.05 630.2 <2e-16 ***
te(latitude,longitude,year):size_binmed 55.82 65.78 563.4 <2e-16 ***
te(latitude,longitude,year):size_binsmall 53.13 60.41 724.4 <2e-16 ***
te(vo,uo,year):size_binlarge 30.58 40.02 105.3 <2e-16 ***
te(vo,uo,year):size_binmed 37.54 49.24 135.8 <2e-16 ***
te(vo,uo,year):size_binsmall 53.13 67.03 266.2 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.429 Deviance explained = 54.4%
-REML = 15736 Scale est. = 1 n = 2944
I now want to map the change in abundance over time per size class, and was wondering if anyone knows the best way to do this with three-way interactions by factor?
I have tried plot.gam as follows:
plot(count_te_model.xy.vo.I, all.terms=TRUE, too.far=0.05)
And it produces these plots:
I've also created an xy grid on which to predict the GAM output and then map it. I am using the predict function like so:
head(predict_count_coast_L)
# A tibble: 6 x 8
longitude latitude year size_bin plots_sampled uo vo model_fit
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 -124. 41.7 1995 large 5 1 1 0.162
2 -124. 41.7 1995 large 5 1 1 0.161
3 -124. 41.7 1995 large 5 1 1 0.160
4 -124. 41.7 1995 large 5 1 1 0.159
5 -124. 41.7 1995 large 5 1 1 0.158
6 -124. 41.7 1995 large 5 1 1 0.157
predict_count_coast_L$model_fit = predict(count_te_model.xy.vo.I,
predict_count_coast_L,type = "link",
exclude = "te(vo, uo, year, d=c(2,1), by=size_bin)")
ggplot(aes(longitude, latitude, fill= model_fit),
data=predict_count_coast_L)+
geom_tile()+
facet_wrap(~year,nrow=3)+
scale_fill_viridis("count")+
ggtitle("large individuals")+
theme_bw(10)
This produces maps whose patterns look quite different from those in the plot.gam output. (*Do note that the years now increase from top to bottom!)
I'm also still trying to understand the output values with type = "link" in predict()... the documentation says it "produces predictions on the scale of the additive predictors", but I am struggling to understand what this actually means. Are these values on the log-link scale?
I've also tried the above with type = "response" instead of "link", and the patterns differ even more (shown here just for small individuals):
If anyone has an idea why these give different outputs, and if there is a preferred way of predicting/plotting GAMs such as these, that would be much appreciated!!
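For reference on the "link" question: with family nb(link = "log"), the additive predictor is on the log scale, and type = "response" is just the exponential of type = "link". A minimal sketch on simulated data (not the mollusc dataset) illustrating the relationship:

```r
library(mgcv)
set.seed(1)
dat <- gamSim(1, n = 200, dist = "poisson", scale = 0.25)  # simulated count data
m <- gam(y ~ s(x0) + s(x1), data = dat,
         family = nb(link = "log"), method = "REML")

eta <- predict(m, type = "link")      # linear predictor, log scale
mu  <- predict(m, type = "response")  # expected counts
all.equal(unname(exp(eta)), unname(mu))  # TRUE: response = exp(link)
```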
Update #1
Trying to cross-validate models
I am comparing different types of hierarchical models, as described in this paper: https://peerj.com/articles/6876/?td=tw, and want to compare them via cross-validation, using even years for training and odd years for testing. I am not sure how to back-transform the link values onto the original scale of the size counts. All models use the family nb(link="log"). I have tried the linkinv function below, but I am not sure whether this is correct, or whether I can just use exp()... Any advice would be super helpful!
LG_train <- subset(LG_count_plot_mpa_F, year%%2==0)
LG_test <- subset(LG_count_plot_mpa_F, year%%2==1)
LG_predict_m = mutate(
LG_count_plot_mpa_F,
lg1_model = as.vector(predict(count_te_model.xy.vo.I, LG_count_plot_mpa_F,type = "link")),
lg2_model = as.vector(predict(count_te_model.xy.vo.G, LG_count_plot_mpa_F,type = "link")),
lg3_model = as.vector(predict(count_te_model.xy.vo.GI,LG_count_plot_mpa_F,type = "link")),
data_type = factor(ifelse(year%%2==0, "training","testing"),
levels= c("training","testing"))
)
ilink <- family(count_te_model.xy.vo.I)$linkinv
LG_predict_m_2 = mutate(
LG_count_plot_mpa_F,
lg1_link = as.vector(ilink(LG_predict_m$lg1_model)),
lg2_link = as.vector(ilink(LG_predict_m$lg2_model)),
lg3_link = as.vector(ilink(LG_predict_m$lg3_model)),
data_type = factor(ifelse(year%%2==0, "training","testing"),
levels= c("training","testing"))
)
LG_predict = gather(LG_predict_m_2,key= model, value= count_est,
lg1_link:lg3_link )%>%
mutate(count_est = as.numeric(count_est))
forecast_accuracy_m = LG_predict %>%
group_by(model)%>%
filter(data_type=="testing")%>%
summarize(out_of_sample_r2 = round(cor(log(count_est),log(count))^2,2))
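On the exp() question above: for nb(link = "log") the inverse link stored in the family object is (numerically) just exp(), so ilink() and exp() give the same back-transformed counts. A quick check, assuming only that {mgcv} is loaded:

```r
library(mgcv)

fam <- nb(link = "log")      # same family used in all three models
ilink <- fam$linkinv         # inverse-link function stored in the family
x <- c(-1, 0, 2.5)           # arbitrary values on the link (log) scale
all.equal(ilink(x), exp(x))  # TRUE: the inverse of the log link is exp()
```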
print(forecast_accuracy_m)
Reputation: 174778
You are plotting two very different things: the plots produced by plot() show the partial effects of the selected smooths (the ones you showed), whereas predict() on the full model includes the effects of all variables/terms in the model.
Also, that is not how you exclude smooths; you need to pass the names of the smooths you want to exclude exactly as they appear in the table produced by summary(). So you want:
exclude = c("te(vo,uo,year):size_binlarge",
            "te(vo,uo,year):size_binmed",
            "te(vo,uo,year):size_binsmall")
but even doing that won't get you what you want (assuming you are trying to replicate the output from plot()), because the other parametric terms, and the model intercept, will also be included in the values generated by predict(), which means your plots will include the group means as well.
I can see a few options:

1. Call predict() with type = "terms" and identify the columns of the resulting matrix associated with each of the three smooths (the factor-by smooths) that you want to plot.
2. Grab the output from the plot() command, pdat <- plot(....), and then use the data in that object to plot what you want with ggplot.
3. Use gratia::smooth_estimates() to evaluate the smooths over a grid of values, then use the object returned by that function with ggplot to do the plotting.

(I hope to have draw() working for 3- and 4-d smooths in {gratia} before the end of the year [2021].)
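Options 1 and 3 can be sketched as follows, against the model and prediction grid from the post. This is a sketch, not tested on your data: check the exact smooth labels against your own summary() output, and note that the smooth_estimates() argument names may differ across {gratia} versions (newer releases use select = rather than smooth =).

```r
## Option 1: type = "terms" returns a matrix with one column per model
## term; pull out the spatial smooth for the size class of interest.
trm <- predict(count_te_model.xy.vo.I,
               newdata = predict_count_coast_L, type = "terms")
colnames(trm)  # inspect the exact term labels first
predict_count_coast_L$partial_fit <-
  trm[, "te(latitude,longitude,year):size_binlarge"]

## Option 3: evaluate the smooth directly with {gratia}
library(gratia)
sm <- smooth_estimates(count_te_model.xy.vo.I,
                       smooth = "te(latitude,longitude,year):size_binlarge",
                       data = predict_count_coast_L)
```

Either way you get the contribution of just that smooth, on the link (log) scale, without the intercept or the parametric size_bin effects.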