Reputation: 37
I try to fit a GAM model with a multidimensional categorical response variable combining three binomial dimensions (e.g., A-B
by a-b
by α-β
) lacking certain combinations and depending on a smooth of longitude and latitude by employing the package mgcv. Is there a distribution family or syntax that allows to model the effect of predictor variables on the individual dimensions of the response variable in a single model or do I have to model each dimension separately?
sample data
library(dplyr)
df <- data.frame(response.dim1 = sample(c("A", "B"), 1000, replace=TRUE),
response.dim2 = sample(c("a", "b"), 1000, replace=TRUE),
response.dim3 = sample(c("", "α", "β"), 1000, replace=TRUE),
longitude = runif(1000, min = 0, max = 100),
latitude = runif(1000, min = 50, max = 150))
df$response.comb <- paste(df$response.dim1, df$response.dim2, df$response.dim3, sep = "")
df <- df %>%
filter(response.comb %in% c("Aa", "Abα", "Abβ", "Bbα", "Bbβ"))
I want a working model along the lines of gam(response.dim1 * response.dim2+3 ~ s(longitude, latitude)
.
Or is there a way to compute the interactions of response dimension 1 and 2 with gam(response.comb ~ s(longitude, latitude)
?
Upvotes: 1
Views: 80
Reputation: 3943
Here is a fairly basic way to fit a binomial and multinomial generalized additive mixed model (GAMM) using the R package brms.
First, simulate some data with three response classes. Response class 2 always has the same probability, 0.1. Response class 1 has a probability depending on the sine of time, plus a random effect and a fixed effect. Response class 3 probability is 1 - the sum of p1 and p2.
We have to encode the responses as numerical columns, one for each category, with the number of outcomes for that category. I simulated it so that there are 10 trials for each combination of fixed and random effects at each time point.
set.seed(1)
library(brms)
# Simulate some data, assume sinusoidal pattern with time, a random effect due to block, and fixed effect of treatment
sim_data <- expand.grid(t = 1:10, block = factor(letters[1:5]), treatment = factor(c('control', 'treated')))
p1 <- plogis(sin(sim_data$t) + c(-2, -1.5, -1, -.5, 0)[as.numeric(sim_data$block)] + 0.5*as.numeric(sim_data$treatment))
probs <- cbind(p1, 0.1, 1 - p1 - 0.1)
categories <- t(apply(probs, 1, function(ps) rmultinom(1, 10, ps)))
categories <- cbind(categories, rowSums(categories))
dimnames(categories)[[2]] <- c('animal', 'vegetable', 'mineral', 'n_trials')
sim_data <- cbind(sim_data, categories)
Next, we fit a multinomial model. The syntax for the response part of the formula is cbind(<names of the response categories>) | trials(<name of the column with the number of trials>)
. The syntax for the fixed and random effects is similar to lme4 syntax, and the smooth term uses the same arguments as mgcv::s()
. Note: I specified a thin-plate spline, bs = 'tp'
with basis dimension k = 5
, but you should read about the options and carefully decide what is appropriate for you.
The family
argument tells brm()
that we are modeling a multinomially distributed response with a (generalized) logit link, and we can also specify which category is the reference category.
If your response is categorical but not ordinal, you may use family = categorical(link = 'logit')
instead of family = multinomial(link = 'logit')
. That will perform softmax regression.
Note also that I did not change any defaults for prior distributions, sampling iterations, or anything like that. You should definitely not blindly accept those defaults! A discussion of that is outside the scope of this answer but there is ample material online about it.
Here's the code:
multinom_fit <- brm(cbind(animal, vegetable, mineral) | trials(n_trials) ~ treatment + s(t, k = 5, bs = 'tp') + (1 | block),
family = multinomial(link = 'logit', refcat = 'animal'),
data = sim_data)
We can call conditional_effects(multinom_fit, categorical = TRUE)
to get a quick look at the model predictions for each fixed effect. There are other diagnostics you should run, again a discussion of them is outside the scope of this answer!
It looks like the model is doing a good job of capturing the fixed effect and the sinusoidal time effect on the probabilities of response class 1 (animal) and 3 (mineral) with a close to zero fixed effect time effect on the probability of response class 2 (vegetable), as we simulated!
We can use the same simulated data to demonstrate the syntax. I won't go through any output but we get similar inference from just modeling the probability of one of the three response classes. The syntax of the response is just <response column name> | trials(<trial column name>)
and we specify family = binomial(link = 'logit')
.
binom_fit <- brm(animal | trials(n_trials) ~ treatment + s(t, k = 5, bs = 'tp') + (1 | block),
family = binomial(link = 'logit'),
data = sim_data)
Upvotes: 1