Reputation: 853
I am trying to use the predicted probabilities from a multinomial regression using multinom
function from the nnet
package in R to estimate the parameters for a Dirichlet distribution while taking into account the standard errors. Below is an example of what I'm trying to do.
I have two sets of categorical data as show below. As you can see, they have the same ratios but the second dataset has 10 times more observations.
df1 <- data.frame(x=c(rep("A", 100), rep("B", 200), rep("C", 300)))
df2 <- data.frame(x=c(rep("A", 1000), rep("B", 2000), rep("C", 3000)))
I conduct a simple multinomial regression that only uses the intercept.
library(nnet)
mn1 <- multinom(x~1, data = df1)
mn2 <- multinom(x~1, data = df2)
summary(mn1)
summary(mn2)
The predicted probabilities are calculated based on the coefficients. Since I'm only using the intercept, we could simply look at the distribution of the data to calculate the probabilities, of course. But the reason I want to use the multinomial regression is to use the standard errors from the models.
softmax <- function(x) {
exp(x) / sum(exp(x))
}
pb1 <- softmax(c(0, coef(mn1)[,1]))
pb2 <- softmax(c(0, coef(mn2)[,1]))
pb1; pb2
Then, using the rdirichlet
function from the dirmult
package, I want to run simulations of the Dirichlet function that would derive from the multinomial regression. Here, I use the predicted probabilities as the alpha parameters for the Dirichlet distribution.
library(dirmult)
set.seed(9999)
dir1 <- as.data.frame(dirmult::rdirichlet(1000, alpha = c(pb1)))
dir2 <- as.data.frame(dirmult::rdirichlet(1000, alpha = c(pb2)))
colnames(dir1) <- c("A", "B", "C")
colnames(dir2) <- c("A", "B", "C")
dir1 <- melt(dir1)
dir2 <- melt(dir2)
As can be seen in the following figures, the Dirichlet distributions are basically the same.
library(ggplot2)
ggplot(dir1, aes(value, color = variable)) +
geom_density()
ggplot(dir2, aes(value, color = variable)) +
geom_density()
The problem is that this does not take into account the standard errors of the multinomial regression. In theory, more observations leads to lower standard errors, which should lead to larger alpha parameters. How can I incorporate the standard errors of the multinomial regression into the alpha parameters of the Dirichlet distribution? Is there a better way to estimate the parameters for a Dirichlet distribution based on simple categorical distributions?
The purpose of this is to create simulations of categorical distributions to estimate the confidence intervals of higher-order measures based on the categorical distributions. Let me know if there is another approach that should be used for this. I appreciate the help!
Upvotes: 1
Views: 178
Reputation: 20120
While this is not a programming question, I'll try to give an answer.
If you recover Dirichlet alphas from some set of samples, and then use it, you'll find out that this is not a straightforward process.
Whatever values you sample later from Dirichlet would have a mean
E[Xi] = ai/a0, where a0 is a sum of alphas.
Easy to see that if you multiply all alphas by some arbitrary value/Multiplier, THERE IS NO CHANGE in means.
But! Variance of Var[Xi] would change as approximately
Varnew[Xi] ~ Var[Xi]/Multiplier.
This is where I think you should be going - find (using simple optimization technique) best fit for the multiplier which will match your original data.
Wiki reference: https://en.wikipedia.org/wiki/Dirichlet_distribution
Upvotes: 0