Reputation: 131
I am using GaussianMixture (GM) from sklearn in Python to identify the members of a star cluster. The GM is fitted with two components; all other parameters are left at their defaults. As seen in the figure, one star that is clearly not a cluster member appears as a member. The red dots gathered in the middle of the graph are probably my members, but the single red dot to the upper left of this group should not be a member, because it is not close enough to the middle group.
My Python code is:
import numpy as np
from numpy import array
import pandas as pd
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt
from matplotlib import style
import matplotlib.colors as mtcolor
style.use("seaborn-white")
clist = ["gray", "red"]
cmap = mtcolor.ListedColormap(clist)
# read the three columns and keep only stars inside the window of interest
eX = pd.read_csv("mysatrs.csv", usecols=['col1', 'col2', 'col3']).values
col0m = (eX[:,0] >= -5) & (eX[:,0] <= 5)
col1m = (eX[:,1] >= -5) & (eX[:,1] <= 5)
col2m = (eX[:,2] > 0)
X = eX[col0m & col1m & col2m]
plt.figure(figsize=(6,6))
hcgmm = GaussianMixture(n_components=2)  # two components: cluster + field
gmmfit = hcgmm.fit(X)
gmmprd = gmmfit.predict(X)              # hard component labels
hcprobs = gmmfit.predict_proba(X)       # per-star membership probabilities
hcmns = hcgmm.means_                    # fitted component means
plt.scatter(X[:,0], X[:,1], c=gmmprd, s=3, cmap=cmap)
plt.show()
Should another adjustment be made to the GM?
Upvotes: 1
Views: 1809
Reputation: 16213
TL;DR: the fit you get changes every time you run it, so try a few times and keep the best one (the one with the lowest hcgmm.bic(X)). Your data seem to have three dimensions, and I'd guess this third one is throwing things off (including a link to the data, or a plot of it, would help).
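As a minimal sketch (assuming X is the filtered array from the question), you can either pass n_init to GaussianMixture, which internally keeps the best of several random initialisations, or loop yourself and keep the lowest-BIC fit:
import numpy as np
from sklearn.mixture import GaussianMixture

# option 1: let sklearn retry internally, keeping the best of 10 initialisations
gmm = GaussianMixture(n_components=2, n_init=10).fit(X)

# option 2: refit by hand and keep the model with the lowest BIC
best_gmm, best_bic = None, np.inf
for _ in range(20):
    cand = GaussianMixture(n_components=2).fit(X)
    if cand.bic(X) < best_bic:
        best_gmm, best_bic = cand, cand.bic(X)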
If someone wants a longer example, here's an MWE. First we pull in packages and generate some data:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
background = np.random.randn(50,2) * 5   # broad field-star background
cluster = np.random.randn(50,2)          # tight cluster
plt.scatter(background[:,0], background[:,1])
plt.scatter(cluster[:,0], cluster[:,1])
This gives something similar to the OP's figure:
I can then fit a GMM as the OP does (after stacking the two samples into a single array X):
X = np.concatenate([background, cluster])
gmm = GaussianMixture(n_components=2)
fit = gmm.fit(X)
and plot the results with:
plt.scatter(X[:,0], X[:,1], c=fit.predict(X))
Mostly I get terrible fits, but after running a few times I get this:
which shows we can (at least sometimes) recover a reasonable partition. The BIC of this fit was 993.5, while I often get values of >1000 that look visually terrible.
If I run the fit 1000 times with the above random data, I can generate a CDF that looks like this:
(the x/y axes are the wrong way around: x is the probability and y is the BIC value) which says that it'll recover a good partition about 10% of the time. Trying other random draws shows this varies, but I've not got better than a ~40% success rate.
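For reference, a hedged sketch of how such a CDF can be produced (my reconstruction, not necessarily the exact code behind the plot above):
bics = np.sort([GaussianMixture(n_components=2).fit(X).bic(X) for _ in range(1000)])
probs = np.arange(1, len(bics) + 1) / len(bics)   # empirical CDF levels
plt.plot(probs, bics)   # x = probability, y = BIC, matching the plot's axes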
Given that you've only got a few points and components, you could try a much more computationally heavy approach. I'd expect that a Bayesian MCMC mixture model would do much better here.
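sklearn doesn't ship an MCMC mixture model, but as a cheap stand-in its variational BayesianGaussianMixture is worth a try (a sketch only, not a substitute for full MCMC):
from sklearn.mixture import BayesianGaussianMixture

# variational Bayesian GMM; n_init keeps the best of several initialisations
bgmm = BayesianGaussianMixture(n_components=2, n_init=10)
labels = bgmm.fit(X).predict(X)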
I've just remembered that the Rand index is an easy way to check the accuracy of the partitions. We can generate lots of test data, fit it, and get the BIC and Rand index by doing:
from sklearn.metrics import adjusted_rand_score

# randomly assign each point to the background (0) or the cluster (1)
true_labels = (np.random.random(100) < 0.5).astype(int)
ix_a, = np.nonzero(true_labels == 0)
ix_b, = np.nonzero(true_labels == 1)
gmm = GaussianMixture(n_components=2)

def test():
    X = np.empty((len(true_labels), 2), float)
    X[ix_a,:] = np.random.randn(len(ix_a), 2) * 5   # broad background
    X[ix_b,:] = np.random.randn(len(ix_b), 2)       # tight cluster
    fit = gmm.fit(X)
    ari = adjusted_rand_score(true_labels, fit.predict(X))
    return fit.bic(X), ari

fits = np.array([test() for _ in range(1000)])
and then plot the resulting distribution:
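A hedged sketch of one way to draw that distribution from the fits array (my reconstruction of the missing figure's code):
bics, aris = fits[:, 0], fits[:, 1]
plt.scatter(bics, aris, s=5)
plt.xlabel("BIC")
plt.ylabel("adjusted Rand index")
plt.show()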
which shows we get nothing useful 76% of the time. If we have 1000 data points (i.e. X has 1000 rows) then it recovers a reasonable partition most of the time. However, if I draw the background from a Uniform(-10, 10) distribution, e.g. with:
background = np.random.rand(500,2) * 20 - 10
cluster = np.random.randn(500,2)
it again fails horribly (ARI < 0.5 in ~99.5% of runs). Basically, the GM model seems to deal very badly with non-Gaussian data.
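A hedged sketch of how that failure rate can be reproduced by swapping the background draw in test() above (the name test_uniform is my own):
def test_uniform():
    X = np.empty((len(true_labels), 2), float)
    X[ix_a,:] = np.random.rand(len(ix_a), 2) * 20 - 10   # Uniform(-10, 10) background
    X[ix_b,:] = np.random.randn(len(ix_b), 2)            # Gaussian cluster
    fit = gmm.fit(X)
    return fit.bic(X), adjusted_rand_score(true_labels, fit.predict(X))

uniform_fits = np.array([test_uniform() for _ in range(1000)])
print((uniform_fits[:, 1] < 0.5).mean())   # fraction of runs with ARI < 0.5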
Upvotes: 2