MorganM
MorganM

Reputation: 311

unexpected poor performance of GMM from sklearn

I'm trying to model some simulated data using the DPGMM classifier from scikitlearn, but I'm getting poor performance. Here is the example I'm using:

from sklearn import mixture
import numpy as np
import matplotlib.pyplot as plt
clf = mixture.DPGMM(n_components=5, init_params='wc')
s = 0.1
a = np.random.normal(loc=1, scale=s, size=(1000,))
b = np.random.normal(loc=2, scale=s, size=(1000,))
c = np.random.normal(loc=3, scale=s, size=(1000,))
d = np.random.normal(loc=4, scale=s, size=(1000,))
e = np.random.normal(loc=7, scale=s*2, size=(5000,))
noise = np.random.random(500)*8 
data = np.hstack([a,b,c,d,e,noise]).reshape((-1,1))
clf.means_ = np.array([1,2,3,4,7]).reshape((-1,1))
clf.fit(data)
labels = clf.predict(data)
plt.scatter(data.T, np.random.random(len(data)), c=labels, lw=0, alpha=0.2)
plt.show()

I would think that this would be exactly the kind of problem that gaussian mixture models would work for. I've tried playing around with alpha, using gmm instead of dpgmm, changing the number of starting components, etc. I can't seem to get a reliable and accurate classification. Is there something I'm just missing? Is there another model that would be more appropriate?

Upvotes: 0

Views: 2031

Answers (1)

Has QUIT--Anony-Mousse
Has QUIT--Anony-Mousse

Reputation: 77485

Because you didn't iterate long enough for it to converge.

Check the value of

clf.converged_

and try increasing n_iter to 1000.

Note that, however, the DPGMM still fails miserably IMHO on this data set, decreasing the number of clusters to just 2 eventually.

Upvotes: 0

Related Questions