Bluerose
Bluerose

Reputation: 131

What is the problem of Gaussianmixture in the sklearn package in Python?

I am using Gaussianmixture (GM) of sklearn in python to identify the members of a star cluster. GM adjusted with two components and others are default. As seen in figure, one star (with red dots) seen that is clearly not a cluster member, appears as a member. Red dots gathered in the middle graph are porbably my members. But the single red dot on upper-left of this should not be a member. Because it is not close enough to this middle group.

My cluster image

My python code is

import numpy as np
from numpy import array
import pandas as pd
from sklearn.mixture import GaussianMixture

import matplotlib.pyplot as plt
from matplotlib import style
import matplotlib.colors as mtcolor

style.use("seaborn-white")
clist = ["gray", "red"]
cmap = mtcolor.ListedColormap(clist)

eX = pd.read_csv("mysatrs.csv", usecols=['col1', 'col2', 'col3']).values

col0m = (eX[:,0] >= -5) & (eX[:,0] <= 5)
col1m = (eX[:,1] >= -5) & (eX[:,1] <= 5)
col2m = (eX[:,2] > 0)

X = eX[col0m & col1m & col2m]

plt.figure(figsize=(6,6))

hcgmm = GaussianMixture(n_components=2)
gmmfit = hcgmm.fit(X)
gmmprd = gmmfit.predict(X)
hcprobs = gmmfit.predict_proba(X)
hcmns = hcgmm.means_

plt.scatter(X[:,0], X[:,1], c=gmmprd, s=3, cmap=cmap)
plt.show()

Should another adjustment be do for GM?

Upvotes: 1

Views: 1809

Answers (1)

Sam Mason
Sam Mason

Reputation: 16213

TLDR: the fit you get changes every time you run it, try a few times and keep the best one (the lowest hcgmm.bic()). your data seems to have three dimensions and I guess this other one is throwing things off (including a link or plotting it would help).

if someone wants a longer example, here's a MWE. first we pull in packages and generate some data:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

background = np.random.randn(50,2) * 5
cluster = np.random.randn(50,2)

plt.scatter(background[:,0], background[:,1])
plt.scatter(cluster[:,0], cluster[:,1])

this gives something similar to the OP:

random draw

I can then fit a GMM as the OP does by doing:

gmm = GaussianMixture(n_components=2)
fit = gmm.fit(X)

and plot the results with:

plt.scatter(X[:,0], X[:,1], c=fit.predict(X))

but mostly I get terrible fits. But after running a few times, I get this:

fit with bic 993.5

which shows we can (at least sometimes) recover a reasonable partition. The BIC of this fit was 993.5 while I often get values of >1000 that visually look terrible.

if I run fit 1000 times with the above random data I can generate a CDF that looks like:

flipped CDF

(x/y axes are the wrong way around, x is probability and y is BIC value) which says that it'll recover a good partition about 10% of the time. Trying with other random draws says this varies, but I've not got better than ~40% success rate.

Given that you've only got a few points and components, you could try a much more computationally heavy approach. I'd expect that a Bayesian MCMC mixture model would do much better here.

I've just remembered that the Rand index is an easy way to check the accuracy of the partitions. We can generate lots of test data, fit it and get the BIC and Rand index by doing:

from sklearn.metrics import adjusted_rand_score

true_labels = (np.random.random(100) < 0.5).astype(int)
ix_a, = np.nonzero(true_labels == 0)
ix_b, = np.nonzero(true_labels == 1)

gmm = GaussianMixture(n_components=2)

def test():
    X = np.empty((len(true_labels), 2), float)
    X[ix_a,:] = np.random.randn(len(ix_a), 2) * 5
    X[ix_b,:] = np.random.randn(len(ix_b), 2)

    fit = gmm.fit(X)
    ari = adjusted_rand_score(true_labels, fit.predict(X))
    return fit.bic(X), ari

fits = np.array([test() for _ in range(1000)])

and then plot the resulting distribution:

plot of fits from above

which shows we get nothing useful 76% of the time. If we have 1000 data points (i.e. X has 1000 rows) then it recovers a reasonable partition most of the time. However if I draw the background from a "Uniform(-10, 10)" distribution, e.g. with:

background = np.random.rand(500,2) * 20 - 10
cluster = np.random.randn(500,2)

it again fails horribly (ARI<0.5 ~ 99.5%). Basically the GM model seems to deal very badly with non-Gaussian data.

Upvotes: 2

Related Questions