Text Documents Clustering - Non Uniform Clusters

Question

I have been trying to cluster a set of text documents. I have a sparse TFIDF matrix with around 10k documents (subset of a large dataset), and I try to run the scikit-learn k-means algorithm with different sizes of clusters (10,50,100). Rest all the parameters are default values.

I get a very strange behavior that no matter how many clusters I specify or even if I change the number of iterations, there would be 1 cluster in the lot which would contain most of the documents in itself and there will be many clusters which would have just 1 document in them. This is highly non-uniform behavior

Does anyone know what kind of problem am I running into?

apurva.nandan · Accepted Answer

I noticed with the help of above answers and comments that there was a problem with outliers and noise in original space. For this, we should use a dimensionality reduction method which eliminates the unwanted noise in the data. I tried random projections first but it failed to work with text data, simply because the problem was still not solved. Then using Truncated Singular Value Decomposition, I was able to get perfect uniform clusters. Hence, the Truncated SVD is the way to go with textual data in my opinion.

Text Documents Clustering - Non Uniform Clusters

Answers (2)

Related Questions