Marco Pastor Mayo
Marco Pastor Mayo

Reputation: 853

Estimate Dirichlet parameters from multinomial regression

I am trying to use the predicted probabilities from a multinomial regression using multinom function from the nnet package in R to estimate the parameters for a Dirichlet distribution while taking into account the standard errors. Below is an example of what I'm trying to do.

I have two sets of categorical data as show below. As you can see, they have the same ratios but the second dataset has 10 times more observations.

df1 <- data.frame(x=c(rep("A", 100), rep("B", 200), rep("C", 300)))
df2 <- data.frame(x=c(rep("A", 1000), rep("B", 2000), rep("C", 3000)))

I conduct a simple multinomial regression that only uses the intercept.

library(nnet)
mn1 <- multinom(x~1, data = df1)
mn2 <- multinom(x~1, data = df2)

summary(mn1)
summary(mn2)

The predicted probabilities are calculated based on the coefficients. Since I'm only using the intercept, we could simply look at the distribution of the data to calculate the probabilities, of course. But the reason I want to use the multinomial regression is to use the standard errors from the models.

softmax <- function(x) {
  exp(x) / sum(exp(x))
}

pb1 <- softmax(c(0, coef(mn1)[,1]))
pb2 <- softmax(c(0, coef(mn2)[,1]))
pb1; pb2

Then, using the rdirichlet function from the dirmult package, I want to run simulations of the Dirichlet function that would derive from the multinomial regression. Here, I use the predicted probabilities as the alpha parameters for the Dirichlet distribution.

library(dirmult)
set.seed(9999)
dir1 <- as.data.frame(dirmult::rdirichlet(1000, alpha = c(pb1)))
dir2 <- as.data.frame(dirmult::rdirichlet(1000, alpha = c(pb2)))

colnames(dir1) <- c("A", "B", "C")
colnames(dir2) <- c("A", "B", "C")

dir1 <- melt(dir1)
dir2 <- melt(dir2)

As can be seen in the following figures, the Dirichlet distributions are basically the same.

library(ggplot2)
ggplot(dir1, aes(value, color = variable)) +
  geom_density()
ggplot(dir2, aes(value, color = variable)) +
  geom_density()

Dirichlet distribution from df1 Dirichlet distribution from df2

The problem is that this does not take into account the standard errors of the multinomial regression. In theory, more observations leads to lower standard errors, which should lead to larger alpha parameters. How can I incorporate the standard errors of the multinomial regression into the alpha parameters of the Dirichlet distribution? Is there a better way to estimate the parameters for a Dirichlet distribution based on simple categorical distributions?

The purpose of this is to create simulations of categorical distributions to estimate the confidence intervals of higher-order measures based on the categorical distributions. Let me know if there is another approach that should be used for this. I appreciate the help!

Upvotes: 1

Views: 178

Answers (1)

Severin Pappadeux
Severin Pappadeux

Reputation: 20120

While this is not a programming question, I'll try to give an answer.

If you recover Dirichlet alphas from some set of samples, and then use it, you'll find out that this is not a straightforward process.

Whatever values you sample later from Dirichlet would have a mean

E[Xi] = ai/a0, where a0 is a sum of alphas.

Easy to see that if you multiply all alphas by some arbitrary value/Multiplier, THERE IS NO CHANGE in means.

But! Variance of Var[Xi] would change as approximately

Varnew[Xi] ~ Var[Xi]/Multiplier.

This is where I think you should be going - find (using simple optimization technique) best fit for the multiplier which will match your original data.

Wiki reference: https://en.wikipedia.org/wiki/Dirichlet_distribution

Upvotes: 0

Related Questions