Reputation: 1571
Question was moved to stats.stackexchange
A Gaussian Mixture Model is fitted with the Expectation-Maximization (EM) algorithm. This fairly simple algorithm consists of an initialization followed by two alternating steps.
Initialization (for k=2 Gaussians): Find (or guess) initial mu and sigma parameter values for both Gaussians.
E-step: Use the current Gaussian parameters to estimate, for each data point, the likelihood that it comes from Gaussian A (or B). Dividing each likelihood by the sum of both yields the probability (posterior) that a point comes from A rather than B. Prior information can be introduced here, if the share of the underlying classes is known.
M-step: All mus and sigmas are updated. For example, the new mu value for Gaussian A can be calculated as the weighted average of all data points, where the weights are the posteriors from the previous step. The sigmas can be derived similarly.
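For reference, the procedure above can be sketched in a few lines of Python (a minimal 1-D sketch using numpy and scipy; the synthetic data and all variable names are illustrative, not part of my actual setup):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians (illustrative values)
x = np.concatenate([rng.normal(0.0, 1.0, 700), rng.normal(5.0, 0.5, 300)])

# Initialization: guess mu and sigma for both Gaussians
mu = np.array([x.min(), x.max()])
sigma = np.array([x.std(), x.std()])
pi = np.array([0.5, 0.5])  # mixing weights, updated each iteration

for _ in range(100):
    # E-step: posterior probability that each point belongs to A or B
    lik = np.stack([p * norm.pdf(x, m, s)
                    for p, m, s in zip(pi, mu, sigma)])   # shape (2, N)
    post = lik / lik.sum(axis=0)                          # normalize per point

    # M-step: posterior-weighted updates of mu, sigma, and the weights
    nk = post.sum(axis=1)
    mu = (post * x).sum(axis=1) / nk
    sigma = np.sqrt((post * (x - mu[:, None]) ** 2).sum(axis=1) / nk)
    pi = nk / len(x)
```

After convergence, mu, sigma, and pi approximate the parameters of the two generating Gaussians.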
I wonder how I could extend this algorithm to ignore a known share of the data points. For example, I have external knowledge that my dataset consists of p(A)=70%, p(B)=15%, and another 15% of unknown species. I want to fit two Gaussians to describe the chestnut and sunflower distributions, and I want to allow the algorithm to ignore 15% of the data. This allows for smaller sigmas, since the two remaining Gaussians do not need to be stretched to cover data that comes from another, unknown distribution.
Note: Since the unknown share possibly comprises multiple species, I cannot introduce a third Gaussian to "absorb" it. Also, the unknown species could be distributed similarly to A, so simply removing outliers is not necessarily the best approach.
So far, I use the known share of A and B as priors in the E-Step but I couldn't find an elegant way to ignore the unknown data. Any ideas? Thank you!
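Concretely, my current E-step with fixed priors looks roughly like this (a sketch; the function and variable names are mine, not from any library):

```python
import numpy as np
from scipy.stats import norm

def e_step_fixed_priors(x, mu, sigma, priors=(0.70, 0.15)):
    """E-step using the known shares p(A)=0.70 and p(B)=0.15 as fixed priors.

    Renormalizing over A and B alone still forces every data point to be
    fully explained by one of the two Gaussians -- the 15% unknown share
    has nowhere to go, which is exactly the problem I am asking about.
    """
    lik = np.stack([p * norm.pdf(x, m, s)
                    for p, m, s in zip(priors, mu, sigma)])  # shape (2, N)
    return lik / lik.sum(axis=0)  # posteriors over A and B only
```

The posteriors still sum to 1 over the two components for every point, so the fixed priors rescale the responsibilities but do not let the model leave any points unassigned.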
Example with 70% species A, 15% species B and 15% unknown
Upvotes: 0
Views: 167