Reputation: 2653
I have a dataset with 70 columns and 4.4 million rows, and I want to cluster it. I applied TF-IDF first, then clustered with K-means, Bisecting k-means, and a Gaussian Mixture Model (GMM). While the other techniques return the specified number of clusters, GMM returns only one. For example, in the code below I ask for 20 clusters, but it returns only 1. Is this happening because I have many columns, or is it merely caused by the nature of the data?
from pyspark.ml.clustering import GaussianMixture

gmm = GaussianMixture(k=20, tol=0.000001, maxIter=10000, seed=1)
model = gmm.fit(rescaledData)
df1 = model.transform(rescaledData).select(['label', 'prediction'])
df1.groupBy('prediction').count().show()  # this returns 1 row
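One way to confirm that the mixture has collapsed onto a single component (a sketch against the pyspark.ml API, not something I have verified on this dataset) is to inspect the fitted model directly:

print(model.weights)               # mixing weights; a collapsed fit has one weight near 1.0
model.gaussiansDF.show()           # per-component mean and covariance
print(model.summary.clusterSizes)  # number of points assigned to each component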
Upvotes: 3
Views: 1064
Reputation: 21
In my opinion, the main reason behind the poor clustering performance of PySpark's GMM is that its implementation uses a diagonal covariance matrix, which does not take into account the covariance between different features in the dataset.
Check its implementation here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala
where they clearly mention using a diagonal covariance matrix because of the curse of dimensionality:
@note This algorithm is limited in its number of features since it requires storing a covariance matrix which has size quadratic in the number of features. Even when the number of features does not exceed this limit, this algorithm may perform poorly on high-dimensional data. This is due to high-dimensional data (a) making it difficult to cluster at all (based on statistical/theoretical arguments) and (b) numerical issues with Gaussian distributions.
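To illustrate the restriction this answer is pointing at, here is a minimal sketch using scikit-learn's GaussianMixture (an illustrative stand-in, not the Spark implementation), comparing a diagonal covariance fit with a full one:

from sklearn.mixture import GaussianMixture  # stand-in for illustration, not Spark's GMM
import numpy as np

rng = np.random.default_rng(1)
# Build data whose two features are strongly correlated
base = rng.normal(size=(1000, 1))
X = np.hstack([base, base + 0.1 * rng.normal(size=(1000, 1))])

# 'diag' restricts each component to per-feature variances only,
# so the cross-feature correlation above is invisible to the model
gmm_diag = GaussianMixture(n_components=2, covariance_type='diag', random_state=1).fit(X)
gmm_full = GaussianMixture(n_components=2, covariance_type='full', random_state=1).fit(X)

print(gmm_diag.covariances_.shape)  # (2, 2): one variance per feature per component
print(gmm_full.covariances_.shape)  # (2, 2, 2): full matrices, including off-diagonal terms

With 70 TF-IDF columns the same restriction applies at much higher dimension, which is consistent with the note quoted above.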
Upvotes: 2