David DeWert

Reputation: 91

Mixture of Gaussians using scikit learn mixture

I'd like to use sklearn.mixture.GMM to fit a mixture of Gaussians to some data, with results similar to the ones I get using R's "Mclust" package.

The data looks like this: [image: scatter plot of the data]

So here's how I cluster the data using R; it gives me 14 nicely separated clusters and is as easy as falling down stairs:

data <- read.table('~/gmtest/foo.csv',sep=",")
library(mclust)
D = Mclust(data,G=1:20)
summary(D)
plot(D, what="classification")

And here's what I do when I try it with Python:

from sklearn import mixture
import numpy as np
import os
from matplotlib import pyplot

os.chdir(os.path.expanduser("~/gmtest"))
data = np.loadtxt(open('foo.csv', "rb"), delimiter=",", skiprows=0)
gmm = mixture.GMM(n_components=14, n_iter=5000, covariance_type='full')
gmm.fit(data)

classes = gmm.predict(data)
pyplot.scatter(data[:, 0], data[:, 1], c=classes)
pyplot.show()

This assigns all points to the same cluster. I've also noticed that the AIC for the fit is lowest when I tell it to find exactly 1 cluster, and increases linearly with increasing numbers of clusters. What am I doing wrong? Are there additional parameters I need to consider?
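As a sanity check of that AIC behavior, here is a minimal sketch (using synthetic stand-in blobs, not the original foo.csv, and the current `GaussianMixture` API rather than the old `GMM` class) that scans `n_components` and compares AIC values. With well-separated clusters, the AIC should drop sharply past k=1 rather than rise:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(1)
# Three well-separated 2-D blobs as stand-in data
data = np.vstack([rng.normal(loc=m, scale=0.2, size=(80, 2))
                  for m in (0.0, 4.0, 8.0)])

# AIC for each candidate number of components
aics = {k: GaussianMixture(n_components=k, covariance_type='full',
                           random_state=0).fit(data).aic(data)
        for k in range(1, 7)}
best_k = min(aics, key=aics.get)
```

If the AIC instead keeps rising from k=1 on real data, that is a hint the fit is collapsing (as the accepted answer below attributes to the covariance floor), not that one cluster is genuinely best.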

Is there a difference in the models used by Mclust and by sklearn.mixture?

But more important: what is the best way in sklearn to cluster my data?

Upvotes: 2

Views: 3112

Answers (1)

David DeWert

Reputation: 91

The trick is to set GMM's min_covar. So in this case I get good results from:

mixture.GMM(n_components=14, n_iter=5000, covariance_type='full', min_covar=0.0000001)

The default value of min_covar is large enough that the per-component covariances get floored, the fit collapses, and all points end up assigned to one cluster.
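Note that `mixture.GMM` was later deprecated and removed from scikit-learn; its replacement is `GaussianMixture`, where the analogous regularization knob is `reg_covar`. A minimal sketch under that assumption, on synthetic stand-in blobs rather than the original foo.csv:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Two well-separated 2-D blobs stand in for the real data
data = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.1, size=(100, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.1, size=(100, 2)),
])

# reg_covar plays the role min_covar played in the old GMM class:
# a small floor added to the diagonal of each covariance matrix
gmm = GaussianMixture(n_components=2, covariance_type='full',
                      reg_covar=1e-7, max_iter=500, random_state=0)
labels = gmm.fit_predict(data)
```

The same principle applies: if the covariance floor is large relative to the scale of your data, components cannot shrink to fit tight clusters, so rescaling the data or lowering the floor are the two fixes.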

Upvotes: 1
