Using K-Means for document clustering, should clustering be on Cosine Similarity or on term vectors?

Apologies if the answer to this is obvious, please be kind, this is my first time on here :-)

I would gratefully appreciate it if someone could give me a steer on the appropriate input data structure for k-means. I am working on a master's dissertation in which I am proposing a new TF-IDF term weighing approach specific to my domain. I want to use k-means to cluster the results and then apply a number of internal and external evaluation criteria to see if my new term weighting method has any merit.

My steps so far that are all working (implemented in PHP):

Step 1: Read in document collection
Step 2: Clean document collection, feature extraction, feature selection
Step 3: Term Frequency (TF)
Step 4: Inverse Document Frequency (IDF)
Step 5: TF * IDF
Step 6: Normalise TF-IDF to fixed length vectors

Step 7 and 8 is where I need some guidance:

Step 7: Vector Space Model – Cosine Similarity

The only examples I can find, are comparing an input query to each document and finding the similarity. Where there is no input query (this is not an information retrieval system) do I compare every single document in the corpus with every other document in the corpus (every pair of documents)? I cannot find any example of Cosine Similarity applied to a full document collection rather than a single example/query compared to the collection.

Step 8: K-Means

I am struggling here to understand if the input vector for k-means should contain a matrix of the cosine similarity score of every document in the collection against every other document (a matrix of cosine similarity). Or is k-means supposed to be applied over a term vector model? If it is the latter, every example I can find of k-means is quite basic and plots either singular terms. How do I handle the fact that there are multiple terms in my document collection etc?

Cosine Similarity and K-Means are implied as the solution to document clustering on so many examples so I am missing something very obvious.

Upvotes: 6

Answers (5)

Eyes Charming

Reputation: 83

Look at .. Simple Search: The Vector Space Model

Upvotes: 0

Mubin Shrestha

Reputation: 398

Use TF-IDF to calculate the cosine similarity. Use cosine similarity scores as the input data for your clustering algorithm.

Upvotes: 0

Claire McMahon

Reputation: 61

If it helps anybody else out, I have found that it is possible to k-means clustering a multi-dimensional term vector but if more than 3 dimenions are included (which will be the case for any document collection), you cannot visualise it. I believe this is what threw me here, all of the examples I saw of k-means included a graph visualisation, this led me to believe, incorrectly ,that perhaps the source data for k-means was expected to be two dimensional, such as 0 and the cosine similarity. Thank you kindly for the respondents for your help, much appreciated.

Upvotes: 0

carence

Reputation: 87

I second Anony-Mousse opinion that you should reconsider PHP and would like to suggest Python as there several useful libraries for these kind of problems:

Numpy: a great and efficient package for scientific computing.

SciPy: Actually has several routines for k-means clustering: see here

Theano: For more machine learning needs, especially deep learning.

Also there is this great tutorial about the k means algorithm. It also supplies pseudo code in Python. You can use this and maybe an implementation done by yourself to understand the algorithm better but ultimately I would make use of the library mentioned above as they are optimized for performance which is definitely something to keep in mind if you have a big collection of documents.

Upvotes: 0

Has QUIT--Anony-Mousse

Reputation: 77485

K-means cannot operate on a similarity matrix.

Because k-means computes point-to-mean distances, not pairwise distances.

You need an implementation of spherical k-means if you want to use Cosine distance: at every iteration, the centers should be L2 normalized.

If I'm not mistaken, it should be equivalent to run k-means with cosine similarity, and only normalize the center to unit length at the end. But regular spherical k-means may be faster, because you can exploit data normalization to simplify cosine distance to the dot product.

You may want to reconsider using PHP. It is one of the worst possible choices for this type of programming task. It's good for interactive web page, but it doesn't shine at data analysis at all.

Upvotes: 0

Using K-Means for document clustering, should clustering be on Cosine Similarity or on term vectors?

Answers (5)

Related Questions