David
David

Reputation: 37

GAM model with multidimensional categorical response variable (with mgcv)

I try to fit a GAM model with a multidimensional categorical response variable combining three binomial dimensions (e.g., A-B by a-b by α-β) lacking certain combinations and depending on a smooth of longitude and latitude by employing the package mgcv. Is there a distribution family or syntax that allows to model the effect of predictor variables on the individual dimensions of the response variable in a single model or do I have to model each dimension separately?

sample data

library(dplyr)

df <- data.frame(response.dim1 = sample(c("A", "B"), 1000, replace=TRUE),
                 response.dim2 = sample(c("a", "b"), 1000, replace=TRUE),
                 response.dim3 = sample(c("", "α", "β"), 1000, replace=TRUE),
                 longitude = runif(1000, min = 0, max = 100),
                 latitude = runif(1000, min = 50, max = 150))

df$response.comb <- paste(df$response.dim1, df$response.dim2, df$response.dim3, sep = "")

df <- df %>%
  filter(response.comb %in% c("Aa", "Abα", "Abβ", "Bbα", "Bbβ"))

I want a working model along the lines of gam(response.dim1 * response.dim2+3 ~ s(longitude, latitude).

Or is there a way to compute the interactions of response dimension 1 and 2 with gam(response.comb ~ s(longitude, latitude)?

Upvotes: 1

Views: 80

Answers (1)

qdread
qdread

Reputation: 3943

Here is a fairly basic way to fit a binomial and multinomial generalized additive mixed model (GAMM) using the R package brms.

Simulate data

First, simulate some data with three response classes. Response class 2 always has the same probability, 0.1. Response class 1 has a probability depending on the sine of time, plus a random effect and a fixed effect. Response class 3 probability is 1 - the sum of p1 and p2.

We have to encode the responses as numerical columns, one for each category, with the number of outcomes for that category. I simulated it so that there are 10 trials for each combination of fixed and random effects at each time point.

set.seed(1)

library(brms)

# Simulate some data, assume sinusoidal pattern with time, a random effect due to block, and fixed effect of treatment
sim_data <- expand.grid(t = 1:10, block = factor(letters[1:5]), treatment = factor(c('control', 'treated')))
p1 <- plogis(sin(sim_data$t) + c(-2, -1.5, -1, -.5, 0)[as.numeric(sim_data$block)] + 0.5*as.numeric(sim_data$treatment))
probs <- cbind(p1, 0.1, 1 - p1 - 0.1)

categories <- t(apply(probs, 1, function(ps) rmultinom(1, 10, ps)))
categories <- cbind(categories, rowSums(categories))
dimnames(categories)[[2]] <- c('animal', 'vegetable', 'mineral', 'n_trials')

sim_data <- cbind(sim_data, categories)

Multinomial GLMM

Next, we fit a multinomial model. The syntax for the response part of the formula is cbind(<names of the response categories>) | trials(<name of the column with the number of trials>). The syntax for the fixed and random effects is similar to lme4 syntax, and the smooth term uses the same arguments as mgcv::s(). Note: I specified a thin-plate spline, bs = 'tp' with basis dimension k = 5, but you should read about the options and carefully decide what is appropriate for you.

The family argument tells brm() that we are modeling a multinomially distributed response with a (generalized) logit link, and we can also specify which category is the reference category.

If your response is categorical but not ordinal, you may use family = categorical(link = 'logit') instead of family = multinomial(link = 'logit'). That will perform softmax regression.

Note also that I did not change any defaults for prior distributions, sampling iterations, or anything like that. You should definitely not blindly accept those defaults! A discussion of that is outside the scope of this answer but there is ample material online about it.

Here's the code:

multinom_fit <- brm(cbind(animal, vegetable, mineral) | trials(n_trials) ~ treatment + s(t, k = 5, bs = 'tp') + (1 | block),
                    family = multinomial(link = 'logit', refcat = 'animal'),
                    data = sim_data)

We can call conditional_effects(multinom_fit, categorical = TRUE) to get a quick look at the model predictions for each fixed effect. There are other diagnostics you should run, again a discussion of them is outside the scope of this answer!

It looks like the model is doing a good job of capturing the fixed effect and the sinusoidal time effect on the probabilities of response class 1 (animal) and 3 (mineral) with a close to zero fixed effect time effect on the probability of response class 2 (vegetable), as we simulated!

multinom conditional plot fixed effect

multinom conditional plot spline effect

Binomial GLMM

We can use the same simulated data to demonstrate the syntax. I won't go through any output but we get similar inference from just modeling the probability of one of the three response classes. The syntax of the response is just <response column name> | trials(<trial column name>) and we specify family = binomial(link = 'logit').

binom_fit <- brm(animal | trials(n_trials) ~ treatment + s(t, k = 5, bs = 'tp') + (1 | block),
                 family = binomial(link = 'logit'),
                 data = sim_data)

Upvotes: 1

Related Questions