Reputation: 9149
I'm trying to implement k-means for text clustering, specifically English sentences. So far I'm at the point where I have a term-frequency matrix for the documents (sentences). I'm a little confused about the actual implementation of k-means on text data. Here's my guess of how it should work.
1. Figure out the number of unique words across all sentences (a large number; call it n).
2. Create k n-dimensional vectors (the cluster centroids) and fill in the values of the k vectors with some random numbers (how do I decide what the bounds for these numbers are?).
3. Determine the Euclidean distance from each of the q sentences to the k random centroids, reposition the centroids, and so on. (If n is very large, like the vocabulary of English, wouldn't calculating the Euclidean distance for these vectors be very costly?)
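To make that concrete, here is a rough sketch of the steps above in plain NumPy. The toy sentences, k = 2, and the choice to bound the random centroids by the data's min/max are just my assumptions, not something I'm sure is right:

```python
import numpy as np
from collections import Counter

sentences = ["the cat sat on the mat",
             "the dog sat on the log",
             "clustering text is fun"]
k = 2

# Step 1: vocabulary of n unique words, and a q-by-n term-frequency matrix.
vocab = sorted({w for s in sentences for w in s.split()})
index = {w: j for j, w in enumerate(vocab)}
X = np.zeros((len(sentences), len(vocab)))
for i, s in enumerate(sentences):
    for w, c in Counter(s.split()).items():
        X[i, index[w]] = c

# Step 2: k random n-dimensional centroids, bounded by the observed values
# (picking k random sentences as initial centroids may be more sensible).
rng = np.random.default_rng(0)
centroids = rng.uniform(X.min(), X.max(), size=(k, X.shape[1]))

# Step 3: assign each sentence to its nearest centroid (Euclidean distance),
# move each centroid to the mean of its members, repeat until nothing moves.
for _ in range(100):
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j]
                              for j in range(k)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)  # cluster index for each sentence
```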
Thanks for any insight!
Upvotes: 0
Views: 1524
Reputation: 5441
Resurrecting a slightly old question here, but it's worth linking the two...
Generally, you'd use some kind of locality-sensitive hashing instead of relying on raw word-occurrence frequencies. But either way, manually assembling the feature matrix is a huge hassle.
This SO answer walks through how to create that feature matrix from a list of documents using scikit-learn, explaining each step. I think it will show you the sequence of steps required.
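For a rough idea of the scikit-learn route, here is a minimal sketch; the toy sentences, the default TfidfVectorizer settings, and n_clusters=2 are placeholders rather than recommendations:

```python
# Minimal sketch: let scikit-learn build the feature matrix, then cluster it.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

sentences = ["the cat sat on the mat",
             "the dog sat on the log",
             "clustering text is fun"]

vectorizer = TfidfVectorizer()            # tokenization + vocabulary handled for you
X = vectorizer.fit_transform(sentences)   # sparse q-by-n feature matrix

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)            # cluster label for each sentence
print(labels)
```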
Upvotes: 1
Reputation: 1269503
This is a bit long for a comment.
If you have a document-term matrix, find the principal components (of the covariance matrix) and determine the coefficients of the original data in the principal-component space. You can then do k-means clustering in that space.
With text data, you generally need to keep a fair number of components -- 20, 50, 100, or even more. Also, I would recommend Gaussian mixture models/expectation-maximization clustering instead of k-means, but that is another story.
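Here is a minimal sketch of that approach, assuming scikit-learn (the answer doesn't name a library). The toy documents, the tiny component and cluster counts are placeholders, and TruncatedSVD stands in for PCA because term matrices are usually sparse and left uncentered:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

docs = ["the cat sat on the mat",
        "a cat and a dog sat together",
        "dogs chase cats around the yard",
        "stock prices fell sharply today",
        "the market rallied after the news",
        "investors sold shares in the bank"]

# Document-term matrix, then project onto its leading components.
X = CountVectorizer().fit_transform(docs)
svd = TruncatedSVD(n_components=2, random_state=0)   # toy size; 20-100+ for real corpora
Z = svd.fit_transform(X)

# k-means in the reduced space...
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)

# ...or the suggested alternative: a Gaussian mixture fit by expectation-maximization.
gmm_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(Z)

print(km_labels)
print(gmm_labels)
```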
Upvotes: 1