Reputation: 91
I'd like to use sklearn.mixture.GMM to fit a mixture of Gaussians to some data, with results similar to the ones I get using R's "Mclust" package.
The data looks like this:
So here's how I cluster the data using R, it gives me 14 nicely separated clusters and is easy as falling down stairs:
data <- read.table('~/gmtest/foo.csv',sep=",")
library(mclust)
D = Mclust(data,G=1:20)
summary(D)
plot(D, what="classification")
And here's what I say when I try it with python:
from sklearn import mixture
import numpy as np
import os
import pyplot
os.chdir(os.path.expanduser("~/gmtest"))
data = np.loadtxt(open('foo.csv',"rb"),delimiter=",",skiprows=0)
gmm = mixture.GMM( n_components=14,n_iter=5000, covariance_type='full')
gmm.fit(data)
classes = gmm.predict(data)
pyplot.scatter(data[:,0], data[:,1], c=classes)
pyplot.show()
Which assigns all points to the same cluster. I've also noticed that the AIC for the fit is lowest when I tell it to find excatly 1 cluster, and increases linearly with increasing numbers of clusters. What am I doing wrong? Are there additional parameters I need to consider?
Is there a difference in the models used by Mclust and by sklearn.mixture?
But more important: what is the best way in sklearn to cluster my data?
Upvotes: 2
Views: 3112
Reputation: 91
The trick is to set GMM's min_covar. So in this case I get good results from:
mixture.GMM( n_components=14,n_iter=5000, covariance_type='full',min_covar=0.0000001)
The large default value for min_covar assigns all points to one cluster.
Upvotes: 1