Camden Cheek
Camden Cheek

Reputation: 65

Get the distribution when a point is sampled from a mixture in PyMC3

I have a model with a pm.NormalMixture(), and when I sample from the normal mixture, I also want to know which of the mixed distributions that point is being sampled from.

import numpy as np
import pymc3 as pm

obs = np.concatenate([np.random.normal(5,1,100), 
                      np.random.normal(10,2,200)])
with pm.Model() as model:
    mu = pm.Normal('mu', 10, 10, shape=2)
    sd = pm.Normal('sd', 10, 10, shape=2)
    x = pm.NormalMixture('x', mu=mu, sd=sd, observed=obs)

I sample from that model, then use that trace to sample from the posterior predictive distribution, and what I want to know is for each x in the posterior predictive trace, which of the two normal distributions being sampled from it belongs to. Is that possible in PyMC3 without doing it manually?

Upvotes: 0

Views: 263

Answers (1)

Aorus
Aorus

Reputation: 138

This example demonstrates how posterior predictive checks (PPCs) work. The gist of a PPC is that you first draw random samples from the trace. The trace is essentially always multivariate, and in your model a single sample would be defined by the vector (mu[i,0], mu[i,1], sd[i,0], sd[i,1]). Then, for each trace sample, generate random numbers from the distribution specified for the likelihood with its parameter values equal to those from the trace samples. In your case, this would be NormalMixture(mu[i,:], sd[i,:]). In your model, x is the likelihood function, not an individual point of the trace.

Some practical notes:

  • You haven't specified a weighting variable, so I'm assuming by default it forces the normal distributions to be weighted equally (I haven't tested this).
  • The odds of a given point coming from one distribution or the other is just the ratio between the probability densities at that point.
  • Check out this for recommendations on how to choose priors. For example, your SD prior is placing a lot of weight on very large SDs, which would bias your results, especially for smaller datasets.

Good luck!

Upvotes: 1

Related Questions