Reputation: 37
Let's say I have a dataframe with 4 variables. I want to see if I can generate a posterior of gamma mixtures over all the variables, with the goal of finding clusters for each observation. I'm guessing I will need some sort of multivariate gamma distribution? But how would I go about this?
Here is some pymc3 code as an example with one parameter, looking for a mixture of two gammas (I have chosen arbitrary parameters):
import numpy as np
import pymc3 as pm

with pm.Model() as m:
    p = pm.Dirichlet('p', a=np.ones(2))
    alpha = pm.Gamma('alpha', alpha=1, beta=1, shape=2)
    beta = pm.Gamma('beta', alpha=1, beta=1, shape=2)
    comp_dist = pm.Gamma.dist(alpha=alpha, beta=beta, shape=(2,))
    like = pm.Mixture('y', w=p, comp_dists=comp_dist, observed=data)
    trace = pm.sample(1000)
So my question is, how would I extend this basic example to multiple variables? I assume that I need to define relationships between the variables somehow to encode them in the model? I feel that I understand the basics of mixture modelling, but at the same time feel that I am missing something pretty fundamental.
Upvotes: 2
Views: 972
Reputation: 76700
Here's how the multidimensional case should work:
J = 4  # num dimensions
K = 2  # num clusters

with pm.Model() as m:
    p = pm.Dirichlet('p', a=np.ones(K))
    alpha = pm.Gamma('alpha', alpha=1, beta=1, shape=(J, K))
    beta = pm.Gamma('beta', alpha=1, beta=1, shape=(J, K))
    gamma = pm.Gamma.dist(alpha=alpha, beta=beta, shape=(J, K))
    like = pm.Mixture('y', w=p, comp_dists=gamma, observed=X, shape=J)
    trace = pm.sample(1000)
where X.shape should be (N, J).
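For concreteness, here is a minimal NumPy sketch of simulating data X with that (N, J) shape from a two-component gamma mixture; the parameter values and weights are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
N, J, K = 500, 4, 2  # observations, dimensions, clusters

# hypothetical per-dimension, per-cluster shape/rate parameters, shape (J, K)
alpha = np.array([[2.0, 9.0]] * J)
beta = np.array([[1.0, 3.0]] * J)

# latent cluster labels with assumed mixture weights
z = rng.choice(K, size=N, p=[0.6, 0.4])

# draw each observation's J dimensions from its cluster's gammas;
# NumPy parameterizes gamma by scale, so rate beta becomes scale 1/beta
X = rng.gamma(shape=alpha[:, z].T, scale=1.0 / beta[:, z].T)

print(X.shape)  # (500, 4)
```

Data of this form can be passed directly as the observed= argument above.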
The difficult part is going to be resolving identifiability issues, but I think that's beyond the scope of the question. Maybe have a look at how the GMM tutorial breaks symmetry using the pm.Potential function. I expect highly-correlated parameterizations of the likelihood function(s), like alpha and beta, would exacerbate the issue, so perhaps consider switching to the mu and sigma parameterization.
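As a quick check on that reparameterization: for a gamma distribution with shape alpha and rate beta, the mean is mu = alpha/beta and the variance is sigma^2 = alpha/beta^2, so the two parameterizations convert back and forth as in this small helper sketch (the function names are mine, not part of any library):

```python
def mu_sigma_to_alpha_beta(mu, sigma):
    """Convert a gamma's mean/std to its shape/rate parameters."""
    alpha = (mu / sigma) ** 2
    beta = mu / sigma ** 2
    return alpha, beta

def alpha_beta_to_mu_sigma(alpha, beta):
    """Convert shape/rate back to mean/std."""
    mu = alpha / beta
    sigma = alpha ** 0.5 / beta
    return mu, sigma

# example round trip: mu=3, sigma=1.5 maps to alpha=4, beta=4/3
a, b = mu_sigma_to_alpha_beta(3.0, 1.5)
print(a, b)  # 4.0 1.3333333333333333
```

pm.Gamma accepts mu/sigma directly as keyword arguments, so the conversion is mainly useful for interpreting or initializing the shape/rate draws above.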
Upvotes: 2