dan
dan

Reputation: 6304

Fitting a multinomial glm for a very large dataset

I have multinomial compositional data for 100 categories from two groups, where each is represented by two ages:

set.seed(1)
df <- data.frame(group = c(rep("g1",200),rep("g2",200)),
                 age = c(rep("a1",100),rep("a2",100),rep("a1",100),rep("a2",100)),
                 category = rep(paste0("c",1:100),4),
                 n = c(rmultinom(1,7000,pgamma(shape=0.8,rate=0.1,q=seq(0.01,1,0.01))),
                       rmultinom(1,5000,pgamma(shape=0.8,rate=0.3,q=seq(0.01,1,0.01))),
                       rmultinom(1,1800,pgamma(shape=0.5,rate=0.1,q=seq(0.01,1,0.01))),
                       rmultinom(1,1200,pgamma(shape=0.9,rate=0.1,q=seq(0.01,1,0.01)))),
                 stringsAsFactors = F)

I want to fit a regression model to estimate the interaction effect of the category * group, while controlling for age.

So far, I'm trying to use a multicategorical glm (with a binomial(link = 'logit')), to a data.frame where I transform the df$n (total counts) to a binomial (binary) form:

binomial.df <- do.call(rbind,lapply(unique(df$group),function(g){
  do.call(rbind,lapply(unique(dplyr::filter(df,group == g)$age),function(a){
    do.call(rbind,lapply(unique(dplyr::filter(df,group == g)$category),function(t){
      sum.non.category <- sum(dplyr::filter(df,group == g & age == a & category != t)$n)
      sum.category <- sum(dplyr::filter(df,group == g & age == a & category == t)$n)
      data.frame(group = g,age = a,category = t,assigned.category = c(rep(0,sum.non.category),rep(1,sum.category)))
    })) 
  }))
}))

binomial.df$group <- factor(binomial.df$group, levels = c("g1","g2"))
binomial.df$age <- factor(binomial.df$age, levels = c("a1","a2"))
binomial.df$category <- factor(binomial.df$category, levels = paste0("c",1:100))

mm.fit <- glm(assigned.category ~ category * group + age,data = binomial.df, family = binomial(link = 'logit'))

Clearly for this size of data the glm call will run for days or even longer, so I'm looking for a more tractable way.

Any idea?

BTW, I tried using nnet's multinom first:

df$group <- factor(df$group, levels = c("g1","g2"))
df$age <- factor(df$age, levels = c("a1","a2"))
df$category <- factor(df$category, levels = paste0("c",1:100))
mm.fit <- nnet::multinom(n ~ category * group + age, data=df)

But I get:

Error in nnet.default(X, Y, w, mask = mask, size = 0, skip = TRUE, softmax = TRUE,  : 
  too many (22220) weights

Upvotes: 0

Views: 382

Answers (1)

Tom Wenseleers
Tom Wenseleers

Reputation: 7989

The nnet::multinom issue you can resolve by modifying your multinom call to include the argument MaxNWts=100000 :

mm.fit <- nnet::multinom(n ~ category * group + age, data=df, MaxNWts=100000)

To fit large multinomial models in R you could also look into the h20 package :

https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html

Upvotes: 0

Related Questions